glm5.1-distill-onnx
ONNX export of yasserrmd/glm5.1-distill,
produced with the official Liquid AI ONNX exporter:
Liquid4All/onnx-export.
The graph layout, file naming, and quantization scheme follow Liquid AI's own
ONNX releases (e.g.
LiquidAI/LFM2.5-1.2B-Instruct-ONNX),
so the same inference snippets work here unchanged.
Files
onnx/
  model_fp16.onnx
  model_fp16.onnx_data
  model_fp16.onnx_data_1
  model_q4.onnx
  model_q4.onnx_data
  model_q8.onnx
  model_q8.onnx_data
For models > 2 GB, weights are split across *.onnx_data, *.onnx_data_1,
... All data files must sit next to their .onnx graph.
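Because every shard has to be downloaded alongside its graph, it can be convenient to fetch one precision at a time with a glob pattern. The following is a minimal sketch, assuming the file layout listed above; the helper names `patterns_for`, `select_files`, and `download_precision` are illustrative and not part of this repository or of the exporter.

```python
from fnmatch import fnmatch


def patterns_for(precision: str) -> list[str]:
    """Glob patterns covering one graph and its external-data shards."""
    # "onnx/model_q4.onnx*" matches model_q4.onnx, model_q4.onnx_data,
    # model_q4.onnx_data_1, ... but none of the other precisions.
    return [f"onnx/model_{precision}.onnx*"]


def select_files(repo_files: list[str], precision: str) -> list[str]:
    """Return the repo files that belong to the given precision."""
    patterns = patterns_for(precision)
    return [f for f in repo_files if any(fnmatch(f, p) for p in patterns)]


def download_precision(repo_id: str, precision: str) -> str:
    """Fetch only the files for one precision; returns the local snapshot folder."""
    # Imported lazily so the pure helpers above work without huggingface_hub.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id, allow_patterns=patterns_for(precision))
```

Calling `download_precision("yasserrmd/glm5.1-distill-onnx", "q4")` would download only the q4 graph and its data file into one folder, keeping them side by side. Tokenizer files are not covered by these patterns and would still be loaded separately.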
Quickstart (Python, onnxruntime)
import numpy as np  # used by the generation loop referenced below
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

model_id = "yasserrmd/glm5.1-distill-onnx"
graph = "onnx/model_q4.onnx"  # recommended

# Download the graph and all of its external-data shards
# (model_q4.onnx_data, plus any numbered shards such as *_data_1)
graph_path = hf_hub_download(model_id, graph)
for f in list_repo_files(model_id):
    if f.startswith(graph + "_data"):
        hf_hub_download(model_id, f)

session = ort.InferenceSession(graph_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
The full KV-cache generation loop is the same one published on the upstream
LiquidAI ONNX model cards, e.g.
LiquidAI/LFM2.5-1.2B-Instruct-ONNX.
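For orientation, the general shape of a cached greedy decoding loop can be sketched as follows. This is not the upstream loop: the `step` callable is a hypothetical wrapper around `session.run(...)` that takes the current token ids plus a cache object and returns logits plus the updated cache. The actual input and output names of the exported graph are not shown here and should be taken from the upstream model card or from `session.get_inputs()`.

```python
import numpy as np


def greedy_generate(step, prompt_ids, eos_id, max_new_tokens=32):
    """Greedy decoding with a cache.

    step(input_ids, cache) -> (logits, cache), where input_ids has shape
    (1, n) and logits has shape (1, n, vocab). `step` is a hypothetical
    wrapper around the ONNX session; cache=None means "no cache yet".
    """
    # First call: feed the whole prompt so the cache is filled in one pass.
    input_ids = np.asarray([prompt_ids], dtype=np.int64)
    cache = None
    generated = []
    for _ in range(max_new_tokens):
        logits, cache = step(input_ids, cache)
        # Greedy choice: most likely token at the last position.
        next_id = int(np.argmax(logits[0, -1]))
        if next_id == eos_id:
            break
        generated.append(next_id)
        # Later calls: only the new token is fed, the cache carries the rest.
        input_ids = np.asarray([[next_id]], dtype=np.int64)
    return generated
```

The point of the cache is visible in the last assignment: after the first pass, each step processes a single token instead of the whole sequence. Decoding the result would then be a matter of `tokenizer.decode(generated)`.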
Quickstart (Transformers.js / WebGPU)
import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId = "yasserrmd/glm5.1-distill-onnx";
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype: "q4", // or "fp16"
});

const messages = [{ role: "user", content: "Hello!" }];
const input = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true,
  return_dict: true,
});

// Stream tokens as they are generated, skipping the prompt
const streamer = new TextStreamer(tokenizer, { skip_prompt: true });
const output = await model.generate({ ...input, max_new_tokens: 256, do_sample: false, streamer });
console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));
Notes
- Recommended precision: q4 for CPU/GPU/WebGPU, fp16 for higher quality on GPU/WebGPU, q8 as a quality/size balance for server-side use.
- Provenance: exported with Liquid4All/onnx-export. See the upstream README for the exact CLI used.
- Source model details, training data, hyperparameters, limitations: see yasserrmd/glm5.1-distill.
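The precision recommendation above can be turned into a small selection helper. The sketch below is one possible reading of that note, not a rule from the exporter: it picks fp16 when a GPU execution provider is available in onnxruntime and falls back to q4 otherwise, leaving q8 as a manual choice. The function names `pick_precision` and `graph_for` are illustrative.

```python
# Execution provider names as reported by onnxruntime for common GPU backends.
GPU_PROVIDERS = {
    "CUDAExecutionProvider",
    "ROCMExecutionProvider",
    "DmlExecutionProvider",
}


def pick_precision(available_providers: list[str]) -> str:
    """Return "fp16" when a GPU provider is present, else "q4" (assumed policy)."""
    return "fp16" if GPU_PROVIDERS & set(available_providers) else "q4"


def graph_for(precision: str) -> str:
    """Path of the graph file for a precision, following the Files layout."""
    return f"onnx/model_{precision}.onnx"
```

With onnxruntime installed, `graph_for(pick_precision(ort.get_available_providers()))` would yield the graph path to pass to the download step of the Python quickstart.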