Performance evaluation of Gemma 3-27b-it with different quantization methods (4-bit vs 8-bit)

#102
by Ryan1007 - opened

Hi team, I'm planning to deploy Gemma 3-27b-it on a consumer GPU with limited VRAM. I've noticed some performance variations when using 4-bit quantization (bitsandbytes). Have you guys performed any benchmarks on how much the reasoning capability drops compared to the FP16 version? Any recommended quantization parameters for maintaining logical consistency?

Google org

Hi @Ryan1007
Google has not published an official benchmark table specifically comparing bitsandbytes quantization to the FP16/BF16 base model for Gemma 3 27b-it. However, you can refer to the community-led benchmarks available on Reddit. I have included links to these benchmarks below for your reference.
https://www.reddit.com/r/LocalLLaMA/comments/1k6nrl1/i_benchmarked_the_gemma_3_27b_qat_models/
https://www.reddit.com/r/LocalLLaMA/comments/1k3jal4/gemma_3_qat_versus_other_q4_quants/

To maintain logical consistency, you can start with NF4 quantization and double quantization enabled, while keeping the compute dtype in FP16. In practice that means setting bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, and bnb_4bit_compute_dtype=torch.float16. We’ve generally seen NF4 hold up better than plain linear 4-bit quantization, especially around outlier weights, and double quantization helps recover a bit more fidelity, which translates into more stable reasoning. Please let me know if this setup helps you.

Thanks

Hi @pannaga10,
Just to confirm, are you using FP16 or BF16? I thought GDM usually defaults to BF16.

Hi @Ryan1007 ,
Google has released the QAT version (https://huggingface.co/collections/google/gemma-3-qat).
I expect these QAT models to offer more stable performance after quantization, compared to directly quantizing the original 27b-it model.
I'm not entirely sure if my understanding is correct, but I'm currently using google/gemma-3-12b-it-qat-int4-unquantized together with bitsandbytes.
