--- base_model: - zai-org/GLM-5 license: mit --- ## Model Description **GLM-5-NVFP4** is an NVFP4-quantized version of [zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5), a 744B-parameter Mixture-of-Experts language model with 40B active parameters, 256 experts per MoE layer (8 activated per token), and DeepSeek Sparse Attention (DSA). Quantized directly from the full BF16 checkpoint ([zai-org/GLM-5](https://huggingface.co/zai-org/GLM-5)), *not the FP8 release*, to NVFP4 (4-bit with blockwise FP8 scales per 16 elements) using [NVIDIA Model Optimizer](https://github.com/NVIDIA/Model-Optimizer). ### What's quantized Only the MoE expert MLP layers (gate, up, and down projections) are quantized to NVFP4. Attention layers are left in BF16. Since the expert weights constitute the vast majority of model parameters in an MoE architecture, this still yields significant memory savings. Calibration uses natural top-k routing rather than forcing all experts to activate, so each expert's quantization scales reflect the token distributions it actually sees during inference. To compensate, calibration was run on a much larger number of samples than typical to ensure broad expert coverage through natural routing alone. ### Calibration dataset Three calibration passes were run: 1. **Coding pass** — Agentic coding samples (tool calling, multi-turn code generation, function calling) with English and Chinese system prompts. 2. **Broad pass** — Large-scale diverse samples drawn from WildChat and LMSYS-Chat covering real user conversations across a wide range of topics and languages. 3. **Deep pass** — Long-context samples (>8K tokens) from coding and diverse sources to exercise deep-sequence expert activation patterns. Merged via element-wise max across all calibration runs. ### How to Run NVFP4 requires Blackwell GPUs (RTX 5090, RTX Pro 6000, B100, B200, etc.). Even quantized, this is a huge model — tested on 8x RTX Pro 6000 Blackwell (96 GB each, 768 GB total). If you experience NCCL hangs with P2P, make sure you have `iommu=pt` (and `amd_iommu=pt` on AMD platforms) in your kernel command line. #### SGLang ```bash export NCCL_IB_DISABLE=1 export NCCL_P2P_LEVEL=PHB export NCCL_ALLOC_P2P_NET_LL_BUFFERS=1 export NCCL_MIN_NCHANNELS=8 export OMP_NUM_THREADS=8 export SAFETENSORS_FAST_GPU=1 python3 -m sglang.launch_server \ --model lukealonso/GLM-5-NVFP4 \ --served-model-name glm-5 \ --reasoning-parser glm45 \ --tool-call-parser glm47 \ --trust-remote-code \ --tp 8 \ --mem-fraction-static 0.95 \ --max-running-requests 8 \ --kv-cache-dtype fp8_e4m3 \ --quantization modelopt_fp4 \ --attention-backend flashinfer \ --moe-runner-backend flashinfer_cutlass \ --disable-custom-all-reduce \ --enable-flashinfer-allreduce-fusion \ --host 0.0.0.0 \ --port 8000 ``` #### vLLM Please contribute vLLM instructions if you successfully manage to run this model.