Fast second-order Kronecker-factored quantization for LLMs
🧠 Abstract
Quantization with second-order information has shown strong promise for preserving model quality under aggressive compression. Building on the recent YAQA framework (Tseng et al., 2025b), which employs Kronecker-factored approximations of the Hessian via a power-iteration technique, we propose an alternative approach that replaces this step with a more efficient Kronecker decomposition method from Chekalina et al. (2025). This formulation preserves the benefits of second-order curvature-aware quantization while substantially reducing computational cost. We apply our method to LLaMA-2 7B, LLaMA-3 8B Instruct, and Qwen-3 8B Instruct and demonstrate that it achieves the same post-quantization model quality as YAQA with significantly faster computation: the Kronecker factors required for target quality are obtained from roughly 10× fewer tokens, yielding an approximately 10× speedup over the original work.
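For intuition, the core primitive is a Kronecker-factored approximation of a curvature matrix, i.e. finding factors A and B such that H ≈ A ⊗ B. The snippet below is a minimal sketch of the classical nearest-Kronecker-product construction (Van Loan rearrangement followed by a rank-1 SVD); it illustrates the general idea only, not the exact estimator used in YAQA or in Chekalina et al. (2025), and the function name and shapes are placeholders.
import torch

def nearest_kron_factors(H, m, n, p, q):
    # Find A (m x n) and B (p x q) minimizing ||H - A kron B||_F for H of shape (m*p, n*q).
    # Rearrange H so that each (p x q) block becomes one row of an (m*n) x (p*q) matrix.
    R = H.reshape(m, p, n, q).permute(0, 2, 1, 3).reshape(m * n, p * q)
    # The best rank-1 approximation of the rearranged matrix gives the optimal Kronecker factors.
    U, S, Vh = torch.linalg.svd(R, full_matrices=False)
    A = (S[0].sqrt() * U[:, 0]).reshape(m, n)
    B = (S[0].sqrt() * Vh[0, :]).reshape(p, q)
    return A, B

# Quick self-check on a random matrix.
m = n = 4; p = q = 8
H = torch.randn(m * p, n * q)
A, B = nearest_kron_factors(H, m, n, p, q)
print(torch.linalg.norm(H - torch.kron(A, B)) / torch.linalg.norm(H))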
🧭 Checkpoints
| Model name | Architecture | Bits |
|---|---|---|
| FastKronQuant-LLaMA2-7B-4bit | LLaMA-2-7B | 4-bit |
| FastKronQuant-LLaMA3-8B-4bit | LLaMA-3-8B-Instruct | 4-bit |
| FastKronQuant-Qwen3-8B-4bit | Qwen-3-8B | 4-bit |
| FastKronQuant-LLaMA2-7B-2bit | LLaMA-2-7B | 2-bit |
| FastKronQuant-LLaMA3-8B-2bit | LLaMA-3-8B-Instruct | 2-bit |
| FastKronQuant-Qwen3-8B-2bit | Qwen-3-8B | 2-bit |
Each checkpoint is fully compatible with Hugging Face transformers and can be loaded like any standard model.
Features
⚡ Fast Kronecker decomposition: up to 10× faster factor estimation
🧮 Second-order quantization: preserves model accuracy
🪶 Works with popular architectures: LLaMA-2, LLaMA-3, Qwen-3
Compatible with 🤗 transformers out of the box
Usage Example (LLaMA-2 7B)
from transformers import AutoModelForCausalLM, AutoTokenizer
repo_id = "username/FastKronQuant-LLaMA2-7B"# replace with actual repo
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
repo_id,
torch_dtype="auto",
device_map="auto",
)
prompt = "What is the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
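Note: device_map="auto" requires the accelerate package (pip install accelerate). If you prefer not to install it, drop that argument and move the model to a device manually, e.g. model.to("cuda").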
🧪 Example: ARC-Easy evaluation
from datasets import load_dataset
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
# Load ARC-Easy
ds = load_dataset("ai2_arc", "ARC-Easy")["test"]
# Load quantized model
repo_id = "username/FastKronQuant-LLaMA2-7B-4bit"  # replace with the actual repo id
tok = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
repo_id,
torch_dtype="auto",
device_map="auto"
)
pipe = pipeline("text-generation", model=model, tokenizer=tok)
# Generate answers for a few questions (a quick sanity check, not a full accuracy evaluation)
for i in range(3):
    q = ds[i]["question"]
    a = pipe(q, max_new_tokens=32)[0]["generated_text"]
    print(f"Q: {q}\nA: {a}\n---")
If you use these model checkpoints in your experiments, please cite:
@misc{chekalina2025gfwsvd,
title={Generalized Fisher-Weighted SVD: Scalable Kronecker-Factored Fisher Approximation for Compressing Large Language Models},
author={Viktoriia Chekalina and Daniil Moskovskiy and Daria Cherniuk and Maxim Kurkin and Andrey Kuznetsov and Evgeny Frolov},
year={2025},
eprint={2505.17974},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2505.17974},
}