Nice Work!

by ubergarm - opened 7 days ago

I learned you recently added DFlash support from this great post: https://www.reddit.com/r/LocalLLaMA/comments/1t9voxs/exllamav3_major_updates/

Nice to see some quant quality comparisons across various ecosystems; that is difficult to do!

Hope to kick the tires on this model soon!

Cheers!

ubergarm

7 days ago

 DFlash num_draft_tokens (ndt) Benchmark — Qwen3.6-27B on exllamav3

 Model: Qwen3.6-27B EXL3 (4.15bpw) + DFlash draft model on 3090Ti 24GB
 Benchmark: 10 sequential requests, ~300s window each, streamed output

 ### Summary Table

 ┌──────────────────────────────────────────────────┬───────┬────────┬────────┐
 │ Metric                                           │ ndt=6 │ ndt=10 │ ndt=15 │
 ├──────────────────────────────────────────────────┼───────┼────────┼────────┤
 │ Decode tokens/sec (per-user avg)                 │ 85.3  │ 96.2   │ 81.1   │
 └──────────────────────────────────────────────────┴───────┴────────┴────────┘

 ┌─────────────┬───────────────────────────────────────────────────────────────────────────┬──────────────────────────────┐
 │ Role        │ HuggingFace                                                               │ Quantization                 │
 ├─────────────┼───────────────────────────────────────────────────────────────────────────┼──────────────────────────────┤
 │ Main model  │ https://huggingface.co/UnstableLlama/Qwen3.6-27B-exl3-4.15bpw             │ EXL3, 4.15 bpw               │
 ├─────────────┼───────────────────────────────────────────────────────────────────────────┼──────────────────────────────┤
 │ Draft model │ https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3 (branch 4.00bpw) │ EXL3 DFlash tensors, 4.0 bpw │
 └─────────────┴───────────────────────────────────────────────────────────────────────────┴──────────────────────────────┘

I had to vibe-code a few changes to tabbyAPI to get it working with everything on the exllamav3 dev branch, and to add a config option for num_draft_tokens. Dropping it from the default of 15 down to 10 helped a lot on this short aiperf test (a coding question at concurrency=1): decode went from 81.1 to 96.2 tokens/sec.
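For anyone who wants to put a percentage on that, here is a rough sketch that compares the three settings. The decode numbers are copied from the summary table above; the `relative_gain` helper is just an illustrative name I made up, not part of tabbyAPI, exllamav3, or aiperf.

```python
# Decode tokens/sec (per-user avg), copied from the summary table above.
# Keys are the num_draft_tokens (ndt) settings that were benchmarked.
decode_tps = {6: 85.3, 10: 96.2, 15: 81.1}


def relative_gain(new: float, old: float) -> float:
    """Percent change of `new` relative to `old` (hypothetical helper)."""
    return (new - old) / old * 100.0


# Setting with the highest measured decode speed in this one short test.
best_ndt = max(decode_tps, key=decode_tps.get)

baseline = decode_tps[15]  # 15 was the default before the change
for ndt, tps in sorted(decode_tps.items()):
    gain = relative_gain(tps, baseline)
    print(f"ndt={ndt:>2}: {tps:5.1f} tok/s ({gain:+.1f}% vs ndt=15)")

print(f"best setting in this run: ndt={best_ndt}")
```

This is a single concurrency=1 run on one prompt type, so treat the ranking as a hint for where to start tuning, not as a general result.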

Decode speed on exllamav3 looks promising and seems faster, at least on this workload, than regular MTP (not DFlash) on ik_llama.cpp (and on the mainline draft PR which people are just using). I haven't done a good prefill benchmark yet, though, so this is not the full picture.

Thanks!
