Maybe it is not the Instruct-2507 model?

#2
by NeoHuggingF - opened

Hey there.

Thank you for the quantized model. Hopefully this new technique can improve performance while keeping the memory requirements lower.

I was testing it using Ollama and comparing it with the Intel AutoRound q2ks model:

  1. It is a bit slower, but I guess that is expected given the size difference (this one is a bit bigger);
  2. The template includes <think> tags;
  3. The inference parameters look like the ones for Qwen3 Thinking; for instance, temperature is set to 0.6, while instruct models usually have it set to 0.7;
  4. It does not correctly answer the classic "give me a country name that ends with LIA", while the Instruct-2507 model does.

Thanks again. Regards.

ByteShape org

Thanks for the detailed testing notes. This is genuinely helpful.

  1. Repro details (so we can investigate):
    Could you please share which exact models you tested, your backend software, and your hardware (CPU/GPU, RAM)?
    I'm interested in investigating this for future releases.

  2. About the <think> tags in the template:
    You’re right. The template you’re seeing matches what the upstream model originally shipped with, before Qwen updated it in this commit:
    https://huggingface.co/Qwen/Qwen3-30B-A3B-Instruct-2507/commit/0d7cf23991f47feeb3a57ecb4c9cee8ea4a17bfe

    For compatibility, we kept the same template that other popular GGUF conversions for this model use, since backends like Ollama and LM Studio can be sensitive to template changes.

  3. Temperature and other sampling parameters:
    As far as I know, these should be set at inference time. We do not intentionally bake sampling defaults into the GGUF.
    If you’re seeing temperature: 0.6 coming from our GGUFs, could you please point me to where it appears?

  4. The “ends with LIA” prompt:
    Nice find. I can reproduce odd answers even on the BF16 model with llama.cpp.

    BF16 (llama.cpp) examples I get:

    > give me a country name that ends with LIA
    Serbia
    
    > give me a country name that ends with LIA
    Bulgaria
    
    > give me a country name that ends with LIA
    Uganda
    

I even tried GPT-5.2-Instant. It usually replies Australia, but after a few regenerations it once said Bolivia. To be fair to the models, questions like this are hard to answer because of how tokenizers split text into subwords. :)
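That tokenizer point can be sketched in a few lines of Python. The subword splits below are hypothetical (the real Qwen tokenizer may cut differently), but they illustrate the mismatch: the "ends with LIA" rule lives at the character level, while the model sees token pieces, and wrong answers like Serbia or Bulgaria can share the same trailing "ia" piece as Australia:

```python
# Hypothetical subword splits; a real BPE tokenizer may cut differently.
splits = {
    "Australia": ["Austral", "ia"],   # correct answer
    "Mongolia":  ["Mong", "olia"],    # also correct
    "Serbia":    ["Serb", "ia"],      # wrong, but shares the "ia" tail piece
    "Bulgaria":  ["Bulgar", "ia"],    # wrong, but shares the "ia" tail piece
}

def suffix_ok(name: str) -> bool:
    # The character-level check the prompt actually asks for.
    return name.upper().endswith("LIA")

def suffix_visible_as_token(pieces: list[str], suffix: str = "LIA") -> bool:
    # What the model "sees": no single token piece spells out the suffix.
    return any(p.upper() == suffix for p in pieces)

for name, pieces in splits.items():
    print(f"{name:10} ends-with-LIA={suffix_ok(name)} "
          f"LIA-visible-as-token={suffix_visible_as_token(pieces)}")
```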

Hi there. Thank you for taking the time to read and reply.

I will try to answer the questions as best I can, but please feel free to ask for more info and/or tests.

  1. Hardware: Optimus laptop - i5-13420H - 32GB DDR5-5600 - NVIDIA 3050 6GB
    I used the CPU-optimized quantization, as the model does not fit in the available VRAM (from my tests, MoE models mostly run inference on the CPU and use the GPU for the KV cache, with good speed, more or less like llama.cpp's --cpu-moe toggle)

    ollama run hf.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF:Qwen3-30B-A3B-Instruct-2507-Q3_K_S-3.25bpw.gguf --verbose

  2. Hmm, as far as I know the template used in Ollama's own repo for this model does not have think tags, but it does seem that the quants from Unsloth and Intel AutoRound do. As I mostly use Ollama, I usually reuse their template if they provide the model as well. This has a weird effect in Ollama though: it recognizes this as a "thinking" model and starts the reply with the tag, for instance:

$ ollama run hf.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF:Qwen3-30B-A3B-Instruct-2507-Q3_K_S-3.25bpw.gguf "2 + 2 = ?"
Thinking...
2 + 2 = 4
  3. This is how Ollama reports the model; note the "thinking" capability it detects:
$ ollama show hf.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF:Qwen3-30B-A3B-Instruct-2507-Q3_K_S-3.25bpw.gguf

  Model
    architecture        qwen3moe
    parameters          30.5B
    context length      262144
    embedding length    2048
    quantization        unknown

  Capabilities
    completion
    tools
    thinking

  Parameters
    repeat_penalty    1
    stop              "<|im_start|>"
    stop              "<|im_end|>"
    temperature       0.6
    top_k             20
    top_p             0.95
  4. Here I do not know what to say, but in my tests, most of the time (let's say 4 out of 5 replies), both the quants from the Ollama repo (Q4_K_M) and Intel AutoRound (q2ks-mixed) answer it correctly, usually with Australia (qwen3-ia is the short name I gave to the Intel AutoRound q2ks-mixed quant):
$ ollama run qwen3-ia:30b-a3b-instruct-2507 --verbose "give a country whose name in English ends with \"LIA\", give me the name of its capital city as well"
A country whose name in English ends with "LIA" is **Australia**.
Its capital city is **Canberra**.

total duration:       1.149178322s
load duration:        70.568553ms
prompt eval count:    30 token(s)
prompt eval duration: 39.90642ms
prompt eval rate:     751.76 tokens/s
eval count:           28 token(s)
eval duration:        1.029773804s
eval rate:            27.19 tokens/s

Anyway, this is not very important, just what made me think maybe this was not the Instruct-2507 model.
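In the meantime, I can work around the baked-in defaults locally with an Ollama Modelfile. A sketch (sampling values are the usual Instruct-2507 recommendations, temperature 0.7 / top_p 0.8 / top_k 20; `qwen3-bs` is just an arbitrary local name I'd use):

```
# Sketch Modelfile; sampling values are the usual Instruct-2507 recommendations.
FROM hf.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF:Qwen3-30B-A3B-Instruct-2507-Q3_K_S-3.25bpw.gguf
PARAMETER temperature 0.7
PARAMETER top_p 0.8
PARAMETER top_k 20
```

followed by `ollama create qwen3-bs -f Modelfile`.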

Thanks again. Regards

ByteShape org

Thanks a lot for sharing all the details and command outputs, I really appreciate it. I’ll look into the template and Ollama’s “thinking” detection on our side and see what we can improve in future models. The weights are definitely based on the Instruct-2507 model. 🙂

ByteShape org

@NeoHuggingF , thanks again for your help and feedback.
I’ve added a template and a params file to the repo that Ollama should pick up when pulling the model. This should help Ollama apply the correct configuration and generate the appropriate Modelfile.
I’d appreciate it if you could pull the model again and let me know whether it works on your side.
Many thanks!
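For reference, the params file is a small JSON document that Ollama reads next to the GGUF when pulling from Hugging Face. It looks roughly like this (a sketch of the format with the Instruct-2507 sampling values, not a verbatim copy of the file in the repo):

```json
{
  "temperature": 0.7,
  "top_p": 0.8,
  "top_k": 20,
  "stop": ["<|im_start|>", "<|im_end|>"]
}
```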

The template matter is just an oversight made by someone at Qwen, then fixed AFTER the finetuners and quant authors had created their own repositories with the wrong template. Their own failure to notice that the template had been changed in the base model has now turned into some sort of story that this is all due to compatibility issues. No, it's still just an oversight being copied from one person to another over and over again.

@Ali93H Thank you for confirming this is indeed based on Instruct-2507 weights.

I have just pulled the model using Ollama and it updated the template and parameters, looks good now.

$ ollama show hf.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF:Qwen3-30B-A3B-Instruct-2507-Q3_K_S-3.25bpw.gguf
  Model
    architecture        qwen3moe
    parameters          30.5B
    context length      262144
    embedding length    2048
    quantization        unknown

  Capabilities
    completion
    tools

  Parameters
    top_p          0.8
    stop           "<|im_start|>"
    stop           "<|im_end|>"
    temperature    0.7
    top_k          20

$ ollama show --template hf.co/byteshape/Qwen3-30B-A3B-Instruct-2507-GGUF:Qwen3-30B-A3B-Instruct-2507-Q3_K_S-3.25bpw.gguf | grep -i think
<no matches>

Closing this. Thanks again. Regards

NeoHuggingF changed discussion status to closed
