# Gemma 4 21B-A4B-IT REAP – Q4_K_M GGUF
GGUF quantization of 0xSero/gemma-4-21b-a4b-it-REAP.
## What is REAP?
REAP (Router-weighted Expert Activation Pruning) removes the 20% least important MoE experts (25 of 128 → 103 remaining), ranked by calibration-based scoring. Unlike quantization, REAP removes entire experts while keeping the same active parameter count per token (~4B).
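The scoring idea can be sketched as follows. This is an illustrative toy, not the actual REAP implementation: the calibration statistics are random stand-ins, and the names (`saliency`, `gate`, `out_norm`) are hypothetical. The point is the shape of the procedure: score each expert by its router-weighted activation over a calibration set, then drop the bottom 20%.

```python
import random

random.seed(0)

NUM_EXPERTS = 128   # experts per MoE layer in this model
PRUNE_RATIO = 0.20  # REAP 0.20 -> drop the bottom 20%
TOKENS = 1000       # stand-in for a calibration set

# Stand-in calibration pass: accumulate, per expert, the router gate weight
# times the norm of that expert's output (random here; REAP measures these
# on real calibration data).
saliency = [0.0] * NUM_EXPERTS
for _ in range(TOKENS):
    for e in range(NUM_EXPERTS):
        gate = random.random()      # router weight assigned to expert e
        out_norm = random.random()  # L2 norm of expert e's output
        saliency[e] += gate * out_norm
saliency = [s / TOKENS for s in saliency]

# Rank experts by router-weighted activation; prune the least important.
num_pruned = int(NUM_EXPERTS * PRUNE_RATIO)  # 25 experts
ranked = sorted(range(NUM_EXPERTS), key=lambda e: saliency[e])
pruned, kept = set(ranked[:num_pruned]), ranked[num_pruned:]

print(len(pruned), len(kept))  # 25 103
```

With these constants the split reproduces the card's numbers: 25 experts pruned, 103 kept.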
## Model Details
| Property | Value |
|---|---|
| Base Model | google/gemma-4-26b-a4b-it |
| Pruning | REAP 0.20 (103/128 experts) |
| Quantization | Q4_K_M (5.32 BPW) |
| File Size | 12.87 GB |
| Total Parameters | 20.77B |
| Active Parameters | ~4B per token |
| Context Window | 262,144 tokens |
| Architecture | Gemma4 MoE (30 layers) |
## Performance
Tested on RTX 4070 Ti SUPER (16GB VRAM):
- Speed: 65-95 tokens/second
- VRAM Usage: ~14.8 GB (fits 16GB cards)
- Offload: all layers on GPU, with the KV cache kept in system RAM
## How to Run (llama.cpp)
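A minimal invocation sketch, assuming the GGUF filename below (adjust to the file you downloaded). `-ngl 99` offloads all layers to the GPU; `--no-kv-offload` keeps the KV cache in system RAM, matching the setup described under Performance.

```shell
# Serve the model with llama.cpp (filename assumed; adjust path as needed).
# -ngl 99          offload all layers to the GPU
# --no-kv-offload  keep the KV cache in system RAM
# -c 8192          context size; raise toward 262144 if RAM allows
llama-server -m gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf -ngl 99 --no-kv-offload -c 8192
```

`llama-cli` accepts the same model and offload flags for one-off interactive use.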
## Conversion
Converted from BF16 safetensors to F16 GGUF using llama.cpp's convert_hf_to_gguf.py, then quantized to Q4_K_M with llama-quantize.
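The two steps above look roughly like this; the local directory and output filenames are assumptions, not the exact paths used for this release.

```shell
# Step 1: convert the pruned BF16 checkpoint (downloaded locally) to F16 GGUF.
python convert_hf_to_gguf.py ./gemma-4-21b-a4b-it-REAP \
  --outtype f16 --outfile gemma-4-21b-a4b-it-REAP-f16.gguf

# Step 2: quantize the F16 GGUF to Q4_K_M.
./llama-quantize gemma-4-21b-a4b-it-REAP-f16.gguf \
  gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf Q4_K_M
```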