Gemma 4 21B-A4B-IT REAP – Q4_K_M GGUF

GGUF quantization of 0xSero/gemma-4-21b-a4b-it-REAP.

What is REAP?

Router-weighted Expert Activation Pruning removes the 20% least important MoE experts (25 of 128 pruned, 103 remaining) using calibration-based scoring. Unlike quantization, REAP removes entire experts while keeping the same active parameter count (~4B) per token.
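
The exact saliency formula comes from the REAP pruning code; as an illustration only, here is a minimal NumPy sketch that assumes each expert is scored by the mean, over calibration tokens, of its router weight times the norm of its output, and that the lowest-scoring 20% are dropped.

```python
import numpy as np

def reap_prune(router_weights, expert_output_norms, prune_frac=0.20):
    """Illustrative router-weighted expert scoring (assumed form, not the exact REAP code).

    router_weights:      (tokens, experts) router probabilities on calibration data
    expert_output_norms: (tokens, experts) norm of each expert's output per token
    Returns the sorted indices of the experts to keep.
    """
    saliency = (router_weights * expert_output_norms).mean(axis=0)  # one score per expert
    n_prune = int(prune_frac * saliency.shape[0])                   # 128 experts -> 25 pruned
    keep = np.argsort(saliency)[n_prune:]                           # drop the lowest scorers
    return np.sort(keep)

# Synthetic calibration statistics for a 128-expert layer -> 103 experts survive.
rng = np.random.default_rng(0)
kept = reap_prune(rng.random((4096, 128)), rng.random((4096, 128)))
print(len(kept))  # 103
```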

Model Details

| Property | Value |
|---|---|
| Base Model | google/gemma-4-26b-a4b-it |
| Pruning | REAP 0.20 (103/128 experts retained) |
| Quantization | Q4_K_M (5.32 BPW) |
| File Size | 12.87 GB |
| Total Parameters | 20.77B |
| Active Parameters | ~4B per token |
| Context Window | 262,144 tokens |
| Architecture | Gemma4 MoE (30 layers) |

Performance

Tested on RTX 4070 Ti SUPER (16GB VRAM):

  • Speed: 65-95 tokens/second
  • VRAM Usage: ~14.8 GB (fits 16GB cards)
  • Full GPU offload with (KV cache in system RAM)

How to Run (llama.cpp)
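
The sketch below shows one way to run the file with llama-cpp-python, the Python bindings for llama.cpp. The GGUF file name is an assumption based on this repo's name, and the settings mirror the Performance section above (full GPU offload, KV cache in system RAM); adjust them for your hardware.

```python
# Minimal sketch: load the Q4_K_M GGUF with llama-cpp-python and run one chat turn.
# The model_path is an assumption based on this repo's naming.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf",
    n_gpu_layers=-1,     # offload all layers to the GPU
    n_ctx=8192,          # working context; the model supports up to 262,144 tokens
    offload_kqv=False,   # keep the KV cache in system RAM, as in the tested setup
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Summarize what REAP expert pruning does."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```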

Conversion

Converted from BF16 safetensors to F16 GGUF using llama.cpp's convert_hf_to_gguf.py, then quantized to Q4_K_M with llama-quantize.
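
A hedged sketch of that pipeline, driven from Python: the llama.cpp paths and the local model directory are placeholders, and exact flags may vary by llama.cpp version.

```python
# Reproduce the BF16 safetensors -> F16 GGUF -> Q4_K_M pipeline described above.
# Paths are placeholders; point them at your llama.cpp checkout and a local
# snapshot of 0xSero/gemma-4-21b-a4b-it-REAP.
import subprocess

src_dir = "gemma-4-21b-a4b-it-REAP"            # local BF16 safetensors snapshot
f16_gguf = "gemma-4-21b-a4b-it-REAP-f16.gguf"
q4_gguf = "gemma-4-21b-a4b-it-REAP-Q4_K_M.gguf"

# Step 1: convert the Hugging Face checkpoint to an F16 GGUF.
subprocess.run(
    ["python", "llama.cpp/convert_hf_to_gguf.py", src_dir,
     "--outfile", f16_gguf, "--outtype", "f16"],
    check=True,
)

# Step 2: quantize the F16 GGUF to Q4_K_M.
subprocess.run(
    ["llama.cpp/build/bin/llama-quantize", f16_gguf, q4_gguf, "Q4_K_M"],
    check=True,
)
```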
