Gemma 4 21B-A4B-IT REAP – Q4_K_M GGUF

GGUF quantization of 0xSero/gemma-4-21b-a4b-it-REAP.

What is REAP?

Router-weighted Expert Activation Pruning removes the 20% least important MoE experts (25 of 128 pruned, 103 remaining) using calibration-based scoring. Unlike quantization, REAP removes entire experts while keeping the same active parameter count (~4B) per token.
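
The exact saliency formula comes from the REAP pruning code; as an illustration only, here is a minimal NumPy sketch that assumes each expert is scored by the mean, over calibration tokens, of its router weight times the norm of its output, and that the lowest-scoring 20% are dropped.

```python
import numpy as np

def reap_prune(router_weights, expert_output_norms, prune_frac=0.20):
    """Illustrative router-weighted expert scoring (assumed form, not the exact REAP code).

    router_weights:      (tokens, experts) router probabilities on calibration data
    expert_output_norms: (tokens, experts) norm of each expert's output per token
    Returns the sorted indices of the experts to keep.
    """
    saliency = (router_weights * expert_output_norms).mean(axis=0)  # one score per expert
    n_prune = int(prune_frac * saliency.shape[0])                   # 128 experts -> 25 pruned
    keep = np.argsort(saliency)[n_prune:]                           # drop the lowest scorers
    return np.sort(keep)

# Synthetic calibration statistics for a 128-expert layer -> 103 experts survive.
rng = np.random.default_rng(0)
kept = reap_prune(rng.random((4096, 128)), rng.random((4096, 128)))
print(len(kept))  # 103
```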

Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-26b-a4b-it |
| Pruning | REAP 0.20 (103/128 experts retained) |
| Quantization | Q4_K_M (5.32 BPW) |
| File Size | 12.87 GB |
| Total Parameters | 20.77B |
| Active Parameters | ~4B per token |
| Context Window | 262,144 tokens |
| Architecture | Gemma4 MoE (30 layers) |

Performance

Tested on RTX 4070 Ti SUPER (16GB VRAM):

  • Speed: 65-95 tokens/second
  • VRAM Usage: ~14.8 GB (fits 16GB cards)
  • Full GPU offload with (KV cache in system RAM)

How to Run (llama.cpp)
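
The sketch below shows one way to run the file with llama-cpp-python, the Python bindings for llama.cpp. The GGUF file name is an assumption based on this repo's name, and the settings mirror the Performance section above (full GPU offload, KV cache in system RAM); adjust them for your hardware.

```python
# Minimal sketch: load the Q4_K_M GGUF with llama-cpp-python and run one chat turn.
# The model_path is an assumption based on this repo's naming.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf",
    n_gpu_layers=-1,     # offload all layers to the GPU
    n_ctx=8192,          # working context; the model supports up to 262,144 tokens
    offload_kqv=False,   # keep the KV cache in system RAM, as in the tested setup
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what REAP expert pruning does."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```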

Conversion

Converted from BF16 safetensors to F16 GGUF using llama.cpp's convert_hf_to_gguf.py, then quantized to Q4_K_M with llama-quantize.
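
A hedged sketch of that pipeline, driven from Python: the llama.cpp paths and the local model directory are placeholders, and exact flags may vary by llama.cpp version.

```python
# Reproduce the BF16 safetensors -> F16 GGUF -> Q4_K_M pipeline described above.
# Paths are placeholders; point them at your llama.cpp checkout and a local
# snapshot of 0xSero/gemma-4-21b-a4b-it-REAP.
import subprocess

src_dir = "gemma-4-21b-a4b-it-REAP"            # local BF16 safetensors snapshot
f16_gguf = "gemma-4-21b-a4b-it-REAP-f16.gguf"
q4_gguf = "gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf"

# Step 1: convert the Hugging Face checkpoint to an F16 GGUF.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", src_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the F16 GGUF to Q4_K_M.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", f16_gguf, q4_gguf, "Q4_K_M"],
    check=True,
)
```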
