Comparing to the new Dynamic v2.0 Unsloth quants?

#2
by BernardH - opened

Thank you so much for your quants and the info you provide on how to use ik_llama.cpp!
It seems Unsloth just released new quants:
https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF-UD

Would you mind comparing your quants with the new (Dynamic v2.0) Unsloth quants?

Oh hey, interesting, I didn't know they had released those. Here is what I can say without downloading and testing:

  1. You can check the model card sidebar for the GGUF dump info to compare similar-bpw models, e.g. the UD-Q4_K_XL. My quants use full q8_0 for all attention tensors, while theirs use more heavily quantized attention tensors, which likely degrades quality somewhat but may be slightly faster depending on how you're running it. Mine also use repacked quants exclusive to ik_llama.cpp, so they run faster when offloading onto CPU.
  2. There may be some issues with mainline llama.cpp's MLA implementation for this specific quant, possibly related to what's mentioned in this discussion, but I haven't tested it myself.

I'd be curious if anyone ends up doing speed benchmarks, e.g. with llama-sweep-bench on the ik_llama.cpp fork. There are various examples floating around in issues/PRs/discussions.

I compared two models:

  1. ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4 (227GB)
  2. unsloth/DeepSeek-V3-0324-GGUF-UD/DeepSeek-V3-0324-GGUF-UD-Q2_K_XL (233GB)

On the same server:

  • RTX 4090D 48GB VRAM
  • Intel Xeon Gold 5218 (16 cores)
  • 6 channels DDR4-2666 * 64GB

Using this version on Ubuntu 24:
./build/bin/llama-server --version
version: 3744 (7b1a3eec)
built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu

unsloth:

 CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-sweep-bench \
    --model /mnt/models/ollama/DeepSeek-V3-0324-GGUF-UD-Q2_K_XL.gguf \
    --ctx-size 16384 \
    -ctk q8_0 -fa -mla 2 \
    -amb 512 \
    -b 4096 -ub 4096 \
    --temp 0.6 --top-p 0.95 \
    --n-gpu-layers 999 \
    --override-tensor "blk\.([1-9])\.ffn_.*=CUDA0" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16
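The two `--override-tensor` rules above route tensors by regex: FFN tensors in layers 1–9 go to CUDA0, and anything matching `exps` (the MoE expert tensors) stays on CPU. A quick sketch with hypothetical tensor names shows how the layer pattern matches; note that `blk.10` does not match, since the pattern requires a single digit 1–9 followed by a dot:

```shell
# Hypothetical tensor names, checked against the layer-offload pattern above.
for name in blk.1.ffn_gate.weight blk.9.ffn_up.weight blk.10.ffn_down.weight; do
  if echo "$name" | grep -Eq 'blk\.([1-9])\.ffn_.*'; then
    echo "$name -> CUDA0"
  else
    echo "$name -> CPU"
  fi
done
```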

llm_load_print_meta: model size       = 233.180 GiB (2.985 BPW)
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size = 207466.29 MiB
llm_load_tensors:        CPU buffer size =   497.11 MiB
llm_load_tensors:      CUDA0 buffer size = 37882.98 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 4096
llama_new_context_with_model: n_ubatch   = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 2
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 0
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   583.34 MiB
llama_new_context_with_model: KV self size  =  583.31 MiB, c^KV (q8_0):  583.31 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  4328.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   368.05 MiB
llama_new_context_with_model: graph nodes  = 5677
llama_new_context_with_model: graph splits = 155

main: n_kv_max = 16384, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 16, n_threads_batch = 16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   50.305 |    81.42 |  169.717 |     6.03 |

The ubergarm quant is a little bit slower (~4%):

CUDA_VISIBLE_DEVICES=0 ./build/bin/llama-sweep-bench \
    --model models--ubergarm--DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    --ctx-size 16384 \
    -ctk q8_0 -fa -mla 2 \
    -amb 512 \
    -fmoe \
    -b 4096 -ub 4096 \
    --temp 0.6 --top-p 0.95 \
    --n-gpu-layers 999 \
    --override-tensor "blk\.([1-8])\.ffn_.*=CUDA0" \
    --override-tensor exps=CPU \
    --parallel 1 \
    --threads 16

llm_load_print_meta: model size       = 226.003 GiB (2.889 BPW)
llm_load_tensors: offloaded 62/62 layers to GPU
llm_load_tensors:        CPU buffer size = 19114.89 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 46857.06 MiB
llm_load_tensors:        CPU buffer size = 43869.77 MiB
llm_load_tensors:        CPU buffer size =   938.98 MiB
llm_load_tensors:      CUDA0 buffer size = 39752.02 MiB
....................................................................................................
llama_new_context_with_model: n_ctx      = 16384
llama_new_context_with_model: n_batch    = 4096
llama_new_context_with_model: n_ubatch   = 4096
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn   = 2
llama_new_context_with_model: attn_max_b = 512
llama_new_context_with_model: fused_moe  = 1
llama_new_context_with_model: ser        = -1, 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init:      CUDA0 KV buffer size =   583.34 MiB
llama_new_context_with_model: KV self size  =  583.31 MiB, c^KV (q8_0):  583.31 MiB, kv^T: not used
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.49 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =  3852.02 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =   368.05 MiB
llama_new_context_with_model: graph nodes  = 5561
llama_new_context_with_model: graph splits = 106

main: n_kv_max = 16384, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 999, n_threads = 16, n_threads_batch = 16

|    PP |     TG |   N_KV |   T_PP s | S_PP t/s |   T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
|  4096 |   1024 |      0 |   50.002 |    81.92 |  176.493 |     5.80 |

Both runs use ~46 GB of VRAM.
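For reference, the ~4% figure comes from the token-generation (S_TG) columns of the two tables, 6.03 vs 5.80 t/s; prompt processing is essentially a wash (81.42 vs 81.92 t/s). A one-liner to check the math:

```shell
# Relative TG slowdown: (6.03 - 5.80) / 6.03, as a percentage.
awk 'BEGIN { printf "%.1f%%\n", (6.03 - 5.80) / 6.03 * 100 }'
# prints 3.8%
```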

Is it expected that IQ2_K_R4 will have lower perplexity than UD-Q2_K_XL?

Thanks for the information.

> ubergarm is a little bit slower (~4%)

Yes, some of the IQ2_K_R4 and IQ3_K_R4 quants can be a little slower than the older Q2_K / Q3_K quantized tensors. Keep an eye on ik's fork though, as he's recently been improving speed for various quant types.

> Is it expected that IQ2_K_R4 will have lower perplexity than UD-Q2_K_XL?

Yes. While I have not tested that exact UD-Q2_K_XL model myself, I have tested a few of their other quants, and at similar sizes my quants show lower perplexity thanks to the higher-quality (albeit possibly slightly slower, depending on the exact mix) quant types. You can see from this Qwen3-30B-A3B KLD comparison how the ik quants tend to give better quality. Also, The Great Quant Wars of 2025 has some more discussion on quality.

The best way to confirm is to test both of them yourself, which is fairly quick using the llama-perplexity commands given in the model card (click the arrow to open the Perplexity fold). Make sure to use the exact same wiki.test.raw file for a fair comparison; download and check it as described here:

wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz
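With the dataset in place, a run might look like this. This is just a sketch: the model path is hypothetical, and the exact MLA/offload flags for your hardware should come from the recipe in the model card's Perplexity fold.

```shell
# Sketch of a perplexity run (hypothetical model path; flags as used
# with the sweep-bench commands above, adjust to the model card's recipe).
./build/bin/llama-perplexity \
    --model /mnt/models/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
    -f wiki.test.raw \
    -ctk q8_0 -fa -mla 2 \
    --n-gpu-layers 999 \
    --override-tensor exps=CPU \
    --threads 16
```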

You can see my DeepSeek-V3-0324-IQ2_K_R4.gguf scored Final estimate: PPL = 3.5614 +/- 0.02001.

If you run them both using the exact provided methodology, please post your results! Thanks!
