Instructions for using yasserrmd/glm5.1-distill-onnx with libraries and local apps.

Libraries

How to use yasserrmd/glm5.1-distill-onnx with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="yasserrmd/glm5.1-distill-onnx")
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("yasserrmd/glm5.1-distill-onnx")
model = AutoModelForCausalLM.from_pretrained("yasserrmd/glm5.1-distill-onnx")
messages = [
    {"role": "user", "content": "Who are you?"},
]
inputs = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use yasserrmd/glm5.1-distill-onnx with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "yasserrmd/glm5.1-distill-onnx"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yasserrmd/glm5.1-distill-onnx",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

SGLang

How to use yasserrmd/glm5.1-distill-onnx with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "yasserrmd/glm5.1-distill-onnx" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yasserrmd/glm5.1-distill-onnx",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "yasserrmd/glm5.1-distill-onnx" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "yasserrmd/glm5.1-distill-onnx",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
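
The curl calls above can also be issued from Python. The following is a minimal sketch using only the standard library, assuming a server started as shown above is listening on localhost (port 8000 for vLLM, 30000 for SGLang). The helper names `build_chat_request` and `chat` are illustrative and not part of either project.

```python
import json
import urllib.request

MODEL_ID = "yasserrmd/glm5.1-distill-onnx"


def build_chat_request(prompt, port=8000, model=MODEL_ID):
    """Build the same POST request as the curl examples (nothing is sent yet)."""
    body = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"http://localhost:{port}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, port=8000, model=MODEL_ID):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, port, model)) as resp:
        reply = json.load(resp)
    # OpenAI-compatible servers put the generated text here
    return reply["choices"][0]["message"]["content"]
```

With a server running, `chat("What is the capital of France?", port=30000)` would query the SGLang endpoint; omit `port` to target the vLLM default.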

Docker Model Runner
How to use yasserrmd/glm5.1-distill-onnx with Docker Model Runner:
```
docker model run hf.co/yasserrmd/glm5.1-distill-onnx
```

glm5.1-distill-onnx

ONNX export of yasserrmd/glm5.1-distill, produced with the official Liquid AI ONNX exporter: Liquid4All/onnx-export.

The graph layout, file naming, and quantization scheme follow Liquid AI's own ONNX releases (e.g. LiquidAI/LFM2.5-1.2B-Instruct-ONNX), so the same inference snippets work here unchanged.

Files

onnx/
  model_fp16.onnx
  model_fp16.onnx_data
  model_fp16.onnx_data_1
  model_q4.onnx
  model_q4.onnx_data
  model_q8.onnx
  model_q8.onnx_data

For models > 2 GB, weights are split across *.onnx_data, *.onnx_data_1, ... All data files must sit next to their .onnx graph.
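
To make the shard rule concrete, here is a small sketch that, given a repository file listing, returns a graph together with every external-data file that belongs to it. The function name `files_for_graph` is hypothetical; the listing is the one shown under Files.

```python
def files_for_graph(repo_files, graph):
    """Return the graph plus every external-data shard that belongs to it."""
    if graph not in repo_files:
        raise FileNotFoundError(f"{graph} is not in the repository listing")
    # Shards are named <graph>_data, <graph>_data_1, ...
    shards = sorted(f for f in repo_files if f.startswith(graph + "_data"))
    return [graph] + shards


repo_files = [
    "onnx/model_fp16.onnx",
    "onnx/model_fp16.onnx_data",
    "onnx/model_fp16.onnx_data_1",
    "onnx/model_q4.onnx",
    "onnx/model_q4.onnx_data",
    "onnx/model_q8.onnx",
    "onnx/model_q8.onnx_data",
]

print(files_for_graph(repo_files, "onnx/model_fp16.onnx"))
# ['onnx/model_fp16.onnx', 'onnx/model_fp16.onnx_data', 'onnx/model_fp16.onnx_data_1']
```

All returned files need to end up in the same directory before the graph is loaded.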

Quickstart (Python, onnxruntime)

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

model_id = "yasserrmd/glm5.1-distill-onnx"
graph    = "onnx/model_q4.onnx"   # recommended

# Download the graph and all of its external-data shards
# (model_q4.onnx_data, model_q4.onnx_data_1, ...)
graph_path = hf_hub_download(model_id, graph)
for f in list_repo_files(model_id):
    if f.startswith(graph + "_data"):
        hf_hub_download(model_id, f)

# hf_hub_download keeps the repo layout, so the shards sit next to the graph
session   = ort.InferenceSession(graph_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

The full KV-cache generation loop is the same one published on the upstream LiquidAI ONNX model cards, e.g. LiquidAI/LFM2.5-1.2B-Instruct-ONNX.
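
As a rough illustration of the first step of such a loop, the sketch below builds empty cache tensors from a session's input metadata. It is not the upstream code. It assumes that every input other than `input_ids`, `attention_mask`, and `position_ids` is a cache tensor, that symbolic dimensions whose name contains "sequence" start at length 0, and that any other symbolic dimension is the batch size. The input names in the demo are hypothetical; inspect `session.get_inputs()` for the real ones.

```python
from types import SimpleNamespace

import numpy as np

# ONNX Runtime type strings mapped to NumPy dtypes
ORT_TO_NUMPY = {
    "tensor(float)": np.float32,
    "tensor(float16)": np.float16,
    "tensor(int64)": np.int64,
}
# Assumption: everything except these inputs is a cache tensor
NON_CACHE_INPUTS = {"input_ids", "attention_mask", "position_ids"}


def init_empty_cache(input_metas, batch_size=1):
    """Create zero-filled cache arrays from objects with .name, .shape, .type."""
    cache = {}
    for meta in input_metas:
        if meta.name in NON_CACHE_INPUTS:
            continue
        shape = []
        for dim in meta.shape:
            if isinstance(dim, int):
                shape.append(dim)          # fixed dimension
            elif isinstance(dim, str) and "sequence" in dim.lower():
                shape.append(0)            # no past tokens yet (assumed naming)
            else:
                shape.append(batch_size)   # assumed to be the batch dimension
        cache[meta.name] = np.zeros(shape, dtype=ORT_TO_NUMPY.get(meta.type, np.float32))
    return cache


# Hypothetical metadata standing in for session.get_inputs()
demo_inputs = [
    SimpleNamespace(name="input_ids", shape=["batch_size", "sequence_length"], type="tensor(int64)"),
    SimpleNamespace(name="past_key_values.0.key", shape=["batch_size", 8, "past_sequence_length", 64], type="tensor(float16)"),
    SimpleNamespace(name="past_conv.1", shape=["batch_size", 2048, 3], type="tensor(float)"),
]
print({name: arr.shape for name, arr in init_empty_cache(demo_inputs).items()})
# {'past_key_values.0.key': (1, 8, 0, 64), 'past_conv.1': (1, 2048, 3)}
```

With a real session, `init_empty_cache(session.get_inputs())` would produce the initial cache feed; each generation step then replaces these arrays with the cache outputs of the previous step.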

Quickstart (Transformers.js / WebGPU)

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId   = "yasserrmd/glm5.1-distill-onnx";
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model     = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype:  "q4",   // or "fp16"
});

const messages  = [{ role: "user", content: "Hello!" }];
const input     = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true, return_dict: true,
});

const streamer  = new TextStreamer(tokenizer, { skip_prompt: true });
const output    = await model.generate({ ...input, max_new_tokens: 256, do_sample: false, streamer });
console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

Notes

Recommended precision: q4 for CPU/GPU/WebGPU, fp16 for higher quality on GPU/WebGPU, q8 for server-only quality/size balance.
Provenance: exported with Liquid4All/onnx-export. See the upstream README for the exact CLI used.
Source model details, training data, hyperparameters, limitations: see yasserrmd/glm5.1-distill.

Model tree for yasserrmd/glm5.1-distill-onnx

Base model: LiquidAI/LFM2.5-1.2B-Base
Finetuned: yasserrmd/glm5.1-distill
Quantized (2): this model