Nice Work!
I learned from this great post that you recently added DFlash support: https://www.reddit.com/r/LocalLLaMA/comments/1t9voxs/exllamav3_major_updates/
Nice to see some quant quality comparisons across various ecosystems; that is difficult to do!
Hope to kick the tires on this model soon!
Cheers!
DFlash num_draft_tokens (ndt) Benchmark: Qwen3.6-27B on exllamav3
Model: Qwen3.6-27B EXL3 (4.15bpw) + DFlash draft model on 3090Ti 24GB
Benchmark: 10 sequential requests, ~300s window each, streamed output
### Summary Table
| Metric | ndt=6 | ndt=10 | ndt=15 |
|---|---|---|---|
| Decode tokens/sec (per-user avg) | 85.3 | 96.2 | 81.1 |

| Role | HuggingFace | Quantization |
|---|---|---|
| Main model | https://huggingface.co/UnstableLlama/Qwen3.6-27B-exl3-4.15bpw | EXL3, 4.15 bpw |
| Draft model | https://huggingface.co/turboderp/Qwen3.6-27B-DFlash-exl3 (branch 4.00bpw) | EXL3 DFlash tensors, 4.0 bpw |

I had to vibe code a few changes to tabbyAPI to get it working with everything on the exllamav3 dev branch, and to add a config option for num_draft_tokens. Dropping it from the default of 15 down to 10 helped a lot on this short aiperf test (a coding question at concurrency=1): decode speed went from 81.1 to 96.2 tokens/sec.
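
To put rough numbers on the ndt comparison, here is a quick Python sketch that just does the arithmetic on the decode figures from the summary table. It is not part of the benchmark run itself, and the names (`decode_tps`, `relative_change`, `DEFAULT_NDT`) are made up for illustration.

```python
# Decode throughput (tokens/sec, per-user average) copied from the summary table.
decode_tps = {6: 85.3, 10: 96.2, 15: 81.1}

# The default num_draft_tokens mentioned in the post, used as the baseline.
DEFAULT_NDT = 15


def relative_change(tps, baseline):
    """Return the percent change in decode speed versus the baseline ndt."""
    base = tps[baseline]
    return {ndt: (value / base - 1.0) * 100.0 for ndt, value in tps.items()}


changes = relative_change(decode_tps, DEFAULT_NDT)
best_ndt = max(decode_tps, key=decode_tps.get)

for ndt in sorted(decode_tps):
    print(
        f"ndt={ndt:>2}: {decode_tps[ndt]:5.1f} tok/s "
        f"({changes[ndt]:+.1f}% vs ndt={DEFAULT_NDT})"
    )
print(f"best setting in this test: ndt={best_ndt}")
```

This only covers one short workload at concurrency=1, so the best ndt may well differ for other prompts or hardware.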
Decode speed on exllamav3 looks promising and, at least on this workload, seems faster than regular MTP (not DFlash) on ik_llama.cpp (and on the mainline draft PR that people are using). I haven't done a good prefill benchmark yet, though, so this isn't the full picture.
Thanks!