Performance evaluation of Gemma 3-27b-it with different quantization methods (4-bit vs 8-bit)
Hi team, I'm planning to deploy Gemma 3-27b-it on a consumer GPU with limited VRAM. I've noticed some performance variations when using 4-bit quantization (bitsandbytes). Have you guys performed any benchmarks on how much the reasoning capability drops compared to the FP16 version? Any recommended quantization parameters for maintaining logical consistency?
Hi @Ryan1007
Google has not published an official benchmark table specifically comparing bitsandbytes quantization against the FP16/BF16 base model for Gemma 3 27b-it. However, you can refer to the community-led benchmarks available on Reddit. I have included links to these benchmarks below for your reference.
https://www.reddit.com/r/LocalLLaMA/comments/1k6nrl1/i_benchmarked_the_gemma_3_27b_qat_models/
https://www.reddit.com/r/LocalLLaMA/comments/1k3jal4/gemma_3_qat_versus_other_q4_quants/
To maintain logical consistency, you can start with NF4 quantization and enable double quantization, while keeping the compute dtype in FP16. In practice that means setting bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, and bnb_4bit_compute_dtype=torch.float16. We've generally seen NF4 hold up better than plain linear 4-bit quantization, especially around outlier weights, and double quantization helps recover a bit more fidelity, which translates into more stable reasoning. Please let me know if this setup helps you.
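A minimal sketch of that setup using transformers' BitsAndBytesConfig (device_map="auto" is illustrative; adjust to your VRAM budget):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# NF4 4-bit quantization with double quantization, FP16 compute dtype
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",           # NF4 handles outlier weights better than linear FP4
    bnb_4bit_use_double_quant=True,      # also quantize the quantization constants
    bnb_4bit_compute_dtype=torch.float16,
)

model_id = "google/gemma-3-27b-it"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPU memory
)
```

Note this downloads the full-precision weights and quantizes them at load time, so you still need the disk space and enough VRAM for the quantized model plus activations.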
Thanks
Hi @pannaga10,
Just to confirm, are you using FP16 or BF16? I thought GDM usually defaults to BF16.
Hi @Ryan1007,
Google has released QAT versions of Gemma 3 (https://huggingface.co/collections/google/gemma-3-qat).
Because these models were trained with quantization in the loop, I expect them to offer more stable performance after quantization than directly quantizing the original 27b-it model.
I'm not entirely sure if my understanding is correct, but I'm currently using google/gemma-3-12b-it-qat-int4-unquantized together with bitsandbytes.
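For reference, this is roughly how I load that checkpoint. The "-unquantized" QAT checkpoints ship the quantization-aware-trained weights at full precision, so you still pass a bitsandbytes config to quantize at load time. This is a sketch of my setup, not an official recipe; the quant_type choice is an assumption on my part:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# QAT weights were trained to survive 4-bit quantization; we re-quantize
# them here with bitsandbytes at load time.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",            # assumption: unclear whether NF4 or FP4 best matches the int4 QAT recipe
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-3-12b-it-qat-int4-unquantized",
    quantization_config=bnb_config,
    device_map="auto",
)
```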