Fast second-order Kronecker-factored quantization for LLMs
🧠 Abstract
Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent YAQA framework (Tseng et al., 2025b), which employs Kronecker-factored approximations of the Hessian via a power-iteration technique, we propose an alternative approach that replaces this step with a more efficient Kronecker decomposition method from Chekalina et al. (2025). This formulation preserves the benefits of second-order curvature-aware quantization while substantially reducing computational cost. We apply our method to LLaMA-2 7B, LLaMA-3 8B Instruct, and Qwen-3 8B Instruct and demonstrate that it achieves the same post-quantization model quality as YAQA with significantly faster computation: the Kronecker factors required for target quality are obtained from roughly 10× fewer tokens, yielding an approximately 10× speedup over the original work.
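For intuition, the core primitive is a Kronecker-factored approximation of a curvature matrix, i.e. finding factors A and B such that H ≈ A ⊗ B. The snippet below is a minimal sketch of the classical nearest-Kronecker-product construction (Van Loan rearrangement followed by a rank-1 SVD); it illustrates the general idea only, not the exact estimator used in YAQA or in Chekalina et al. (2025), and the function name and shapes are placeholders.
import torch

def nearest_kron_factors(H, m, n, p, q):
    # Find A (m x n) and B (p x q) minimizing ||H - A kron B||_F for H of shape (m*p, n*q).
    # Rearrange H so that each (p x q) block becomes one row of an (m*n) x (p*q) matrix.
    R = H.reshape(m, p, n, q).permute(0, 2, 1, 3).reshape(m * n, p * q)
    # The best rank-1 approximation of the rearranged matrix gives the optimal Kronecker factors.
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    A = (S[0].sqrt() * U[:, 0]).reshape(m, n)
    B = (S[0].sqrt() * Vh[0, :]).reshape(p, q)
    return A, B

# Quick self-check on a random matrix.
m = n = 4; p = q = 8
H = torch.randn(m * p, n * q)
A, B = nearest_kron_factors(H, m, n, p, q)
print(torch.linalg.norm(H - torch.kron(A, B)) / torch.linalg.norm(H))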
🧭 Checkpoints
| Model name | Architecture | Bits |
|---|---|---|
| FastKronQuant-LLaMA2-7B-4bit | LLaMA-2-7B | 4-bit |
| FastKronQuant-LLaMA3-8B-4bit | LLaMA-3-8B-Instruct | 4-bit |
| FastKronQuant-Qwen3-8B-4bit | Qwen-3-8B | 4-bit |
| FastKronQuant-LLaMA2-7B-2bit | LLaMA-2-7B | 2-bit |
| FastKronQuant-LLaMA3-8B-2bit | LLaMA-3-8B-Instruct | 2-bit |
| FastKronQuant-Qwen3-8B-2bit | Qwen-3-8B | 2-bit |
Each checkpoint is fully compatible with Hugging Face transformers and can be loaded like any standard model.
Features
⚡ Fast Kronecker decomposition: up to 10× faster factor estimation
🧮 Second-order quantization: preserves model accuracy
🪶 Works with popular architectures: LLaMA-2, LLaMA-3, Qwen-3
Compatible with 🤗 transformers out of the box
Usage Example (LLaMA-2 7B)
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "username/FastKronQuant-LLaMA2-7B"# replace with actual repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
repo_id,
torch_dtype="auto",
device_map="auto",
)
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
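Note: device_map="auto" requires the accelerate package (pip install accelerate). If you prefer not to install it, drop that argument and move the model to a device manually, e.g. model.to("cuda").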
🧪 Example: ARC-Easy evaluation
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
# Load ARC-Easy
ds = load_dataset("ai2_arc", "ARC-Easy")["test"]
# Load quantized model
repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # replace with the actual repo id
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
repo_id,
torch_dtype="auto",
device_map="auto"
)
pipe = pipeline("text-generation", model=model, tokenizer=tok)
# Generate answers for a few questions (a quick sanity check, not a full accuracy evaluation)
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
If you use these model checkpoints in your experiments, please cite:
@misc{chekalina2025gfwsvd,
title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
year={2025},
eprint={2505.17974},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.17974},
}