Trying to run this on a 4090 with 192GB of RAM... not enough RAM???
It eats up to 185GB of RAM in vLLM... how can I stop it from doing that? Does it really need that much?? It gets there, then either says "not enough for cache" and crashes, or just crashes with "OOM".
I have the newest flashinfer (using that as the attention backend after triton crashed even sooner), the newest xformers 0.0.33.dev1091, and the newest vLLM 0.11.1 (cloned the repo and installed it that way)... I don't know what else to do.
The CPU is a 16-core AMD Ryzen.
Transformers worked, but when offloading it only used one of my CPU cores, no matter what I tried... I'd try quantization, but the tools that don't blindly quantize everything (and risk breaking the model) need a config file.
Use cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit for vLLM with 48GB of VRAM.
I run it on 2x3090 or 2x4090. It will not run in vLLM with only one card.
For CPU offloading of experts you will have to wait for GGUF, once llama.cpp has it working.
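For anyone wondering why one 24GB card is not enough, here is a rough back-of-the-envelope sketch. It assumes 48B total parameters (read off the model name) and counts weights only, ignoring quantization metadata, KV cache and activations. The `reserve_fraction` headroom is an arbitrary illustrative value, not something vLLM reports.

```python
# Rough weight-memory estimate. Assumptions: 48e9 total parameters
# (taken from the "48B" in the model name), weights only.
TOTAL_PARAMS = 48e9


def weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Size of the weights alone, in decimal gigabytes."""
    return num_params * bits_per_weight / 8 / 1e9


def fits(weights_gb: float, vram_gb: float, reserve_fraction: float = 0.1) -> bool:
    """True if the weights fit while leaving a slice of VRAM free.

    reserve_fraction is a hypothetical headroom for KV cache and
    activations; real requirements depend on context length.
    """
    return weights_gb <= vram_gb * (1 - reserve_fraction)


bf16_gb = weight_gb(TOTAL_PARAMS, 16)  # unquantized bf16 checkpoint
awq4_gb = weight_gb(TOTAL_PARAMS, 4)   # 4-bit weights, metadata ignored

for name, size in [("bf16", bf16_gb), ("4-bit", awq4_gb)]:
    print(f"{name}: ~{size:.0f} GB of weights, "
          f"fits 24GB card: {fits(size, 24)}, fits 48GB: {fits(size, 48)}")
```

Under these assumptions the 4-bit weights alone already fill a 24GB card, which matches the two-card advice above. The MoE design only activates about 3B parameters per token, but all experts still have to be resident somewhere.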
Use it with enforce_eager set to true (--enforce-eager on the command line).
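A minimal sketch of what that could look like with the vLLM Python API, combining the eager-mode tip with the AWQ checkpoint and two-GPU setup mentioned earlier in the thread. The `max_model_len` and `gpu_memory_utilization` values are guesses for illustration, not tested settings, and `build_engine_kwargs` / `run_demo` are hypothetical helper names.

```python
# Sketch only: the numeric values below are assumptions, tune for your hardware.
def build_engine_kwargs(num_gpus: int = 2, max_model_len: int = 16384) -> dict:
    """Keyword arguments for vllm.LLM, kept separate so they are easy to inspect."""
    return {
        "model": "cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit",
        "tensor_parallel_size": num_gpus,   # split the weights across both cards
        "enforce_eager": True,              # skip CUDA graph capture to save VRAM
        "max_model_len": max_model_len,     # assumed cap to bound KV cache size
        "gpu_memory_utilization": 0.90,     # assumed fraction of VRAM vLLM may use
        "trust_remote_code": True,          # the model ships custom modeling code
    }


def run_demo() -> None:
    """Needs vLLM installed and enough GPUs; not called automatically."""
    from vllm import LLM, SamplingParams

    llm = LLM(**build_engine_kwargs())
    messages = [{"role": "user", "content": "Who are you?"}]
    outputs = llm.chat(messages, SamplingParams(max_tokens=64))
    print(outputs[0].outputs[0].text)
```

Call `run_demo()` on the GPU machine. Eager mode trades some throughput for lower memory use, so it is worth dropping once the model loads comfortably.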
If you are offloading weights to CPU, I wouldn't use vLLM. I have 64GB of VRAM and 128GB of DDR5, and whenever I need to offload model weights or KV cache to CPU I always use llama.cpp.
On the other hand, if the model fits on my GPU, then vLLM is the best inference engine.
Hi, what maximum context length did you achieve with 48GB of VRAM on vLLM?
Honestly, I tend to download mradermacher's quantized models at i1-Q6_K if I can, then scale down as appropriate to somewhere between Q2 and Q4. That takes about 40GB; right now I'm trying it at i1-IQ3_XXS (16k ctx) and getting pretty decent results on an 8GB card.
Of course, I tend to do more RPing than coding, but I'm getting some really... interesting replies.