glm5.1-distill-onnx

ONNX export of yasserrmd/glm5.1-distill, produced with the official Liquid AI ONNX exporter: Liquid4All/onnx-export.

The graph layout, file naming, and quantization scheme follow Liquid AI's own ONNX releases (e.g. LiquidAI/LFM2.5-1.2B-Instruct-ONNX), so the same inference snippets work here unchanged.

Files

onnx/
  model_fp16.onnx
  model_fp16.onnx_data
  model_fp16.onnx_data_1
  model_q4.onnx
  model_q4.onnx_data
  model_q8.onnx
  model_q8.onnx_data

For models > 2 GB, weights are split across *.onnx_data, *.onnx_data_1, ... All data files must sit next to their .onnx graph.
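
As a rough illustration of the naming rule above, the following sketch picks out the external-data files that belong to one graph from a flat list of repository file names. The helper name `external_data_files` is hypothetical and not part of any library; it assumes only the `<graph>_data`, `<graph>_data_1`, ... naming shown in the file listing.

```python
def external_data_files(graph, repo_files):
    """Return the external-data shards of `graph`, in shard order.

    Hypothetical helper: assumes shards are named `<graph>_data`,
    `<graph>_data_1`, `<graph>_data_2`, ... as in the listing above.
    """
    prefix = graph + "_data"

    def shard_index(name):
        suffix = name[len(prefix):]          # "" or "_1", "_2", ...
        return 0 if suffix == "" else int(suffix[1:])

    shards = [
        f for f in repo_files
        if f == prefix
        or (f.startswith(prefix + "_") and f[len(prefix) + 1:].isdigit())
    ]
    return sorted(shards, key=shard_index)


files = [
    "onnx/model_fp16.onnx",
    "onnx/model_fp16.onnx_data_1",
    "onnx/model_fp16.onnx_data",
    "onnx/model_q4.onnx",
    "onnx/model_q4.onnx_data",
]
print(external_data_files("onnx/model_fp16.onnx", files))
```

Matching on the exact prefix plus a numeric suffix keeps shards of one precision from being mistaken for another's.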

Quickstart (Python, onnxruntime)

import numpy as np
import onnxruntime as ort
from huggingface_hub import hf_hub_download, list_repo_files
from transformers import AutoTokenizer

model_id = "yasserrmd/glm5.1-distill-onnx"
graph    = "onnx/model_q4.onnx"   # recommended

# Download the graph and all of its external-data shards
# (model_q4.onnx_data, model_q4.onnx_data_1, ...) into the same cache folder
graph_path = hf_hub_download(model_id, graph)
for f in list_repo_files(model_id):
    if f.startswith(graph + "_data"):
        hf_hub_download(model_id, f)

session   = ort.InferenceSession(graph_path)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

The full KV-cache generation loop is the same one published on the upstream LiquidAI ONNX model cards, e.g. LiquidAI/LFM2.5-1.2B-Instruct-ONNX.
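
Before such a loop can start, the cache inputs need empty initial tensors. The sketch below shows one way their shapes could be derived from the graph's declared inputs. It assumes that every input other than `input_ids`, `attention_mask`, and `position_ids` is a cache tensor, that a symbolic dimension whose name contains "sequence" is the growing cache axis, and that any other symbolic dimension is the batch. These conventions are assumptions that this card does not confirm, so check `session.get_inputs()` for the real names. `empty_cache_shapes` is a hypothetical helper.

```python
from types import SimpleNamespace

# Assumed names of the non-cache inputs; verify against session.get_inputs().
NON_CACHE_INPUTS = {"input_ids", "attention_mask", "position_ids"}


def empty_cache_shapes(inputs, batch_size=1):
    """Map each assumed cache input to a concrete shape for the first step.

    `inputs` is any iterable of objects with `.name` and `.shape`, such as
    the list returned by onnxruntime's `InferenceSession.get_inputs()`.
    """
    shapes = {}
    for inp in inputs:
        if inp.name in NON_CACHE_INPUTS:
            continue
        shape = []
        for dim in inp.shape:
            if isinstance(dim, int):
                shape.append(dim)                 # fixed dimension
            elif isinstance(dim, str) and "sequence" in dim.lower():
                shape.append(0)                   # no cached tokens yet
            else:
                shape.append(batch_size)          # assumed batch dimension
        shapes[inp.name] = tuple(shape)
    return shapes


# Stand-in input descriptors; the names and sizes here are made up.
fake_inputs = [
    SimpleNamespace(name="input_ids", shape=["batch_size", "sequence_length"]),
    SimpleNamespace(name="past_key_values.0.key",
                    shape=["batch_size", 8, "past_sequence_length", 64]),
]
print(empty_cache_shapes(fake_inputs))
```

With a real session, zero-filled arrays of these shapes (in whatever dtype each input declares) would serve as the initial cache feeds.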

Quickstart (Transformers.js / WebGPU)

import { AutoModelForCausalLM, AutoTokenizer, TextStreamer } from "@huggingface/transformers";

const modelId   = "yasserrmd/glm5.1-distill-onnx";
const tokenizer = await AutoTokenizer.from_pretrained(modelId);
const model     = await AutoModelForCausalLM.from_pretrained(modelId, {
  device: "webgpu",
  dtype:  "q4",   // or "fp16"
});

const messages  = [{ role: "user", content: "Hello!" }];
const input     = tokenizer.apply_chat_template(messages, {
  add_generation_prompt: true, return_dict: true,
});

const streamer  = new TextStreamer(tokenizer, { skip_prompt: true });
const output    = await model.generate({ ...input, max_new_tokens: 256, do_sample: false, streamer });
console.log(tokenizer.decode(output[0], { skip_special_tokens: true }));

Notes

  • Recommended precision: q4 as the default on CPU, GPU, and WebGPU; fp16 for higher quality on GPU/WebGPU; q8 for server-side use, where it balances quality against size.
  • Provenance: exported with Liquid4All/onnx-export. See the upstream README for the exact CLI used.
  • Source model details, training data, hyperparameters, limitations: see yasserrmd/glm5.1-distill.
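
The precision names in the notes map onto the graph files listed under Files. A small sketch of that mapping follows, using a hypothetical `graph_for` helper and assuming the `onnx/model_<precision>.onnx` naming from the listing:

```python
AVAILABLE_PRECISIONS = ("q4", "q8", "fp16")   # taken from the file listing


def graph_for(precision="q4"):
    """Return the repo-relative graph path for a precision (hypothetical helper)."""
    if precision not in AVAILABLE_PRECISIONS:
        raise ValueError(
            f"unknown precision {precision!r}; expected one of {AVAILABLE_PRECISIONS}"
        )
    return f"onnx/model_{precision}.onnx"


print(graph_for())        # onnx/model_q4.onnx
print(graph_for("fp16"))  # onnx/model_fp16.onnx
```

The returned path can be used as the `graph` value in the Python quickstart.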