Instructions to use moonshotai/Kimi-Linear-48B-A3B-Instruct with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use moonshotai/Kimi-Linear-48B-A3B-Instruct with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="moonshotai/Kimi-Linear-48B-A3B-Instruct", trust_remote_code=True)
messages = [
    {"role": "user", "content": "Who are you?"},
]
pipe(messages)

# Load model directly
from transformers import AutoModelForCausalLM
model = AutoModelForCausalLM.from_pretrained("moonshotai/Kimi-Linear-48B-A3B-Instruct", trust_remote_code=True, dtype="auto")

Local Apps

vLLM

How to use moonshotai/Kimi-Linear-48B-A3B-Instruct with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "moonshotai/Kimi-Linear-48B-A3B-Instruct"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'
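
The same request can be sent from Python instead of curl. The following is a minimal sketch using only the standard library. The helper names `build_chat_request` and `chat` are made up for illustration, and the response parsing assumes the server returns the usual OpenAI-style `choices[0].message.content` shape.

```python
import json
import urllib.request


def build_chat_request(base_url, model, user_message):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(base_url, model, user_message, timeout=120):
    """Send the request and return the assistant's reply text.

    Assumes an OpenAI-style response body; requires a running server.
    """
    request = build_chat_request(base_url, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Example call (needs the vLLM server from above listening on port 8000):
# print(chat("http://localhost:8000",
#            "moonshotai/Kimi-Linear-48B-A3B-Instruct",
#            "What is the capital of France?"))
```

For the SGLang server shown further down, the same helpers should work with the base URL changed to `http://localhost:30000`.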

Use Docker

docker model run hf.co/moonshotai/Kimi-Linear-48B-A3B-Instruct

SGLang

How to use moonshotai/Kimi-Linear-48B-A3B-Instruct with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "moonshotai/Kimi-Linear-48B-A3B-Instruct" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "moonshotai/Kimi-Linear-48B-A3B-Instruct" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "moonshotai/Kimi-Linear-48B-A3B-Instruct",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use moonshotai/Kimi-Linear-48B-A3B-Instruct with Docker Model Runner:
```
docker model run hf.co/moonshotai/Kimi-Linear-48B-A3B-Instruct
```

trying to run this on a 4090 and 192GB RAM.. not enough RAM???

#10

by MikaSouthworth - opened Nov 5, 2025

Discussion

MikaSouthworth

Nov 5, 2025

edited Nov 5, 2025

It eats up to 185GB of RAM in vLLM. How can I stop it from doing that? Does it really need that much? It climbs to that point, then either says "not enough for cache" and crashes, or just crashes with "OOM".
I have the newest flashinfer (using it as the attention backend after triton crashed even sooner), the newest xformers (0.0.33.dev1091), and the newest vLLM, 11.1 (cloned the repo and installed it that way). I don't know what else to do.
I've got a 16-core AMD Ryzen.
Transformers worked, but when offloading it only used one of my CPU cores, no matter what I tried. I'd try quantization, but it needs a config file for the approaches that don't quantize everything indiscriminately and risk breaking the model.
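
A rough back-of-the-envelope calculation helps put the memory question in context. The sketch below estimates the size of the weights alone at a few precisions. It assumes 48 billion total parameters (read off the "48B" in the model name) and that every parameter is stored at the stated width. It ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_footprint_gb(num_params_billion, bits_per_param):
    """Approximate size of the weights alone, in GB (10^9 bytes)."""
    return num_params_billion * bits_per_param / 8


# 48 is an assumption taken from the "48B" in the model name.
for label, bits in [("bf16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: ~{weight_footprint_gb(48, bits):.0f} GB")
```

Under these assumptions the unquantized weights come to roughly 96 GB, far more than the 24 GB of VRAM on a single 4090, so most of the model has to sit in system RAM unless a quantized checkpoint is used.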

rekrek

Nov 6, 2025

Use cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit with vLLM on 48GB of VRAM.

I run it on 2x3090 or 2x4090. It will not run in vLLM with only one card.

For CPU offloading of experts, you will have to wait for GGUF, once llama.cpp has it working.
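
As a sketch of what a two-card launch might look like, the snippet below builds a `vllm serve` command that splits the AWQ checkpoint across two GPUs with tensor parallelism. The helper name `vllm_serve_command` is hypothetical, the `max_model_len` value of 32768 is an arbitrary placeholder rather than a tested limit, and passing `--trust-remote-code` is an assumption based on the Transformers snippet at the top of the page.

```python
import shlex


def vllm_serve_command(model, tensor_parallel_size=2, max_model_len=None,
                       trust_remote_code=True):
    """Return the argument list for a `vllm serve` launch across several GPUs."""
    command = ["vllm", "serve", model,
               "--tensor-parallel-size", str(tensor_parallel_size)]
    if max_model_len is not None:
        # Capping the context length limits how much memory the KV cache needs.
        command += ["--max-model-len", str(max_model_len)]
    if trust_remote_code:
        command.append("--trust-remote-code")
    return command


command = vllm_serve_command(
    "cyankiwi/Kimi-Linear-48B-A3B-Instruct-AWQ-4bit",
    tensor_parallel_size=2,   # one shard per card, e.g. 2x3090 or 2x4090
    max_model_len=32768,      # placeholder value, adjust to what fits
)
print(shlex.join(command))
# To actually launch the server:
# import subprocess; subprocess.run(command, check=True)
```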

harveensingh-sarvam

Nov 17, 2025

Use it with enforce eager set to true (`--enforce-eager` in vLLM).
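
For reference, eager mode can also be set through vLLM's Python API. It disables CUDA graph capture, which frees some GPU memory at the cost of slower generation. The sketch below only assembles the keyword arguments. The helper name `eager_engine_kwargs` and the `max_model_len` default of 16384 are illustrative assumptions, and nothing here guarantees the model will fit on a given card.

```python
def eager_engine_kwargs(model, max_model_len=16384):
    """Keyword arguments for vllm.LLM with eager mode enabled."""
    return {
        "model": model,
        "trust_remote_code": True,
        "enforce_eager": True,           # skip CUDA graph capture
        "max_model_len": max_model_len,  # placeholder, lower it if the cache does not fit
    }


kwargs = eager_engine_kwargs("moonshotai/Kimi-Linear-48B-A3B-Instruct")
print(kwargs)

# With vLLM installed and enough GPU memory available:
# from vllm import LLM
# llm = LLM(**kwargs)
```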

berkerdooo

Jan 17

If you are offloading weights to the CPU, I wouldn't use vLLM. I have 64GB of VRAM and 128GB of DDR5, and whenever I need to offload model weights or KV cache to the CPU I always use llama.cpp.

On the other hand, if the model fits on my GPU, then vLLM is the best inference engine.

AcoHuggingFace

Mar 3

Hi, what maximum context length did you achieve with 48GB of VRAM on vLLM?

yano2mch

Apr 13

Honestly, I tend to download mradermacher's quantized models at i1-Q6K if I can, then step down as appropriate to somewhere between Q2 and Q4. It takes about 40GB. Right now I'm trying it at i1-Q3_XXS (16k ctx) and getting pretty decent results on an 8GB card.

Of course, I tend to do more RPing than coding, but I'm getting some really.... interesting replies.
