Anurich committed on Feb 22

Commit 5ab1a1a (verified) · 1 Parent(s): ddffa29

Update README.md


Files changed (1)

README.md +139 -15


@@ -6,27 +6,77 @@ tags:
 - looped-transformer
 - value-residual
 - sentencepiece
 license: apache-2.0
 ---
-# Jeeves (75M)
-A compact language model using **Looped Transformer + Value Residual Learning**.
-## Usage
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
-tokenizer = AutoTokenizer.from_pretrained("REPO_ID", trust_remote_code=True)
-model = AutoModelForCausalLM.from_pretrained("REPO_ID", trust_remote_code=True)
 inputs = tokenizer("Hello, how are you?", return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
-**Note:** `trust_remote_code=True` is required.
 ## Architecture
@@ -35,17 +85,91 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 | Parameters | 74.9M |
 | Unique layers | 8 |
 | Effective depth | 15 |
-| Loop | block[4] x 8 |
-| Value residual | True |
 | Hidden dim | 768 |
-| FFN dim | 2048 |
-| Attention heads | 12 (Q) / 4 (KV) |
 | Vocab size | 32,000 |
 | Max seq length | 512 |
-| Training step | 1,100 |
-## Key Innovations
-- **Looped Transformer** ([arXiv 2311.12424](https://arxiv.org/abs/2311.12424))
-- **Value Residual Learning** ([arXiv 2410.17897](https://arxiv.org/abs/2410.17897))
-- **Input Injection** for loop stability

 - looped-transformer
 - value-residual
 - sentencepiece
+- tool-calling
+- conversational
 license: apache-2.0
+language:
+- en
+pipeline_tag: text-generation
 ---
+# Jeeves-Small-75M
+A compact 75M parameter language model built on **Looped Transformer** and **Value Residual Learning** architectures — with native support for **tool calling / function calling**.
+Jeeves is designed to punch above its weight class by reusing transformer layers iteratively (looping), which gives it an effective depth of 15 from only 8 unique layers.
+---
+## Quick Start
 ```python
 from transformers import AutoTokenizer, AutoModelForCausalLM
+tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
+model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
 inputs = tokenizer("Hello, how are you?", return_tensors="pt")
 outputs = model.generate(**inputs, max_new_tokens=50)
 print(tokenizer.decode(outputs[0], skip_special_tokens=True))
 ```
+> **Note:** `trust_remote_code=True` is required due to custom model architecture code.
+---
+## Tool Calling (Function Calling)
+Jeeves supports structured tool/function calling out of the box. Below is an example:
+```python
+tools = [
+    {
+        "name": "get_weather",
+        "description": "Get the current weather for a given location.",
+        "parameters": {
+            "type": "object",
+            "properties": {
+                "location": {"type": "string", "description": "City name"},
+                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+            },
+            "required": ["location"]
+        }
+    }
+]
+messages = [
+    {"role": "user", "content": "What's the weather like in London?"}
+]
+# Format prompt with tools using the chat template
+prompt = tokenizer.apply_chat_template(
+    messages,
+    tools=tools,
+    tokenize=False,
+    add_generation_prompt=True
+)
+inputs = tokenizer(prompt, return_tensors="pt")
+outputs = model.generate(**inputs, max_new_tokens=128)
+print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+```
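
The README example stops at decoding the generated text and does not show what a tool call looks like in the output. As a hedged sketch of the next step, the code below validates a tool call against the declared schema before anything is executed. It assumes the call can be extracted as a JSON object of the form `{"name": ..., "arguments": {...}}`; that shape, the helper name `validate_tool_call`, and the example string are all hypothetical and should be adapted to whatever the chat template actually emits.

```python
import json

# Tool schema copied from the README example.
TOOLS = [
    {
        "name": "get_weather",
        "description": "Get the current weather for a given location.",
        "parameters": {
            "type": "object",
            "properties": {
                "location": {"type": "string", "description": "City name"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["location"],
        },
    }
]


def validate_tool_call(raw_json, tools):
    """Parse a tool call and check it against the declared schema.

    ASSUMPTION: the call is a JSON object {"name": ..., "arguments": {...}}.
    The real output format of the model is not documented in the README.
    """
    call = json.loads(raw_json)
    spec = next((t for t in tools if t["name"] == call.get("name")), None)
    if spec is None:
        raise ValueError(f"unknown tool: {call.get('name')!r}")

    args = call.get("arguments", {})
    params = spec["parameters"]

    # Every required parameter must be present.
    missing = [k for k in params.get("required", []) if k not in args]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")

    # Reject undeclared parameters and values outside a declared enum.
    for key, value in args.items():
        prop = params["properties"].get(key)
        if prop is None:
            raise ValueError(f"unexpected argument: {key!r}")
        if "enum" in prop and value not in prop["enum"]:
            raise ValueError(f"{key!r} must be one of {prop['enum']}")
    return spec["name"], args


# Hand-written example call, not real model output.
example = '{"name": "get_weather", "arguments": {"location": "London", "unit": "celsius"}}'
print(validate_tool_call(example, TOOLS))
```

Validating before dispatch matters more than usual for a model this small, since malformed or incomplete calls are likely.
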
+---
 ## Architecture
 | Parameters | 74.9M |
 | Unique layers | 8 |
 | Effective depth | 15 |
+| Loop | block[4] × 8 |
+| Value residual | ✅ |
 | Hidden dim | 768 |
+| FFN dim | 2,048 |
+| Attention heads | 12 (Q) / 4 (KV) — GQA |
 | Vocab size | 32,000 |
 | Max seq length | 512 |
+| Training steps | 1,100 |
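
The depth and head figures in the table can be cross-checked with a little arithmetic. The sketch below assumes that "block[4] × 8" means the layer at index 4 runs 8 times while the other 7 unique layers run once each, and that each attention head has width `hidden_dim / q_heads`. Both are interpretations of the table rather than statements from the README, and the variable names are illustrative.

```python
# Cross-check of the architecture table, under one assumed reading of
# "block[4] x 8": layer index 4 is applied 8 times, every other layer once.
unique_layers = 8
loop_iterations = 8          # the "x 8" in the Loop row
hidden_dim = 768
q_heads, kv_heads = 12, 4    # grouped-query attention (GQA)

# 7 non-looped layers + 8 passes through the looped layer
effective_depth = (unique_layers - 1) + loop_iterations

# ASSUMPTION: heads split the hidden dimension evenly.
head_dim = hidden_dim // q_heads

# Number of query heads that share each key/value head under GQA.
queries_per_kv = q_heads // kv_heads

print(effective_depth, head_dim, queries_per_kv)  # prints: 15 64 3
```

Under this reading the table is self-consistent: 7 + 8 gives the stated effective depth of 15.
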
+### Key Innovations
+- **Looped Transformer** ([arXiv:2311.12424](https://arxiv.org/abs/2311.12424)) — A single transformer block is applied repeatedly in a loop, dramatically increasing effective depth while keeping parameter count small. This allows Jeeves to reason iteratively rather than in a single pass.
+- **Value Residual Learning** ([arXiv:2410.17897](https://arxiv.org/abs/2410.17897)) — Residual connections applied at the value projection level alleviate attention concentration in deep/looped networks, improving gradient flow and stability.
+- **Input Injection** — The original input is re-injected at each loop iteration to prevent representational drift across loops, a critical stabilization technique for looped architectures.
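
The looping and input-injection ideas above can be illustrated with a toy forward pass. This is a minimal sketch, not the model's actual code: it assumes "the original input" means the state entering the looped layer, that injection is a plain addition, and it leaves out value residual entirely. The names `looped_forward` and `make_layer` are hypothetical, and the layers are simple stand-in functions on lists of floats.

```python
from typing import Callable, List, Sequence

Vector = List[float]
Layer = Callable[[Vector], Vector]


def looped_forward(x: Vector, layers: Sequence[Layer],
                   loop_index: int, n_loops: int) -> Vector:
    """Apply layers in order, repeating layers[loop_index] n_loops times.

    ASSUMPTION: input injection adds the loop's entry state back in
    before every repeated pass after the first.
    """
    h = list(x)
    for i, layer in enumerate(layers):
        if i != loop_index:
            h = layer(h)
            continue
        x0 = list(h)  # state entering the loop, re-injected on later passes
        for step in range(n_loops):
            if step > 0:
                h = [a + b for a, b in zip(h, x0)]  # input injection
            h = layer(h)
    return h


# Toy layers that record each application so the depth can be counted.
calls: List[int] = []


def make_layer(idx: int) -> Layer:
    def layer(h: Vector) -> Vector:
        calls.append(idx)
        return [0.5 * v for v in h]
    return layer


layers = [make_layer(i) for i in range(8)]
out = looped_forward([1.0, -2.0], layers, loop_index=4, n_loops=8)
print(len(calls))  # 15 layer applications from 8 unique layers
```

Because the stand-in layer shrinks its input, the re-added entry state keeps the signal from vanishing across iterations, which is the intuition behind injection as a stabilizer.
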
+---
+## Benchmark Results
+Evaluated using [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
+| Benchmark | Accuracy | Correct | Total |
+|---|---|---|---|
+| HellaSwag | 30.9% | 3,100 | 10,042 |
+| ARC-Easy | 47.1% | 1,118 | 2,376 |
+| ARC-Challenge | 24.9% | 292 | 1,172 |
+| **ARC (Average)** | **36.0%** | — | — |
+| PIQA | 63.9% | 1,174 | 1,838 |
+| WinoGrande | 52.4% | 664 | 1,267 |
+| MMLU | 25.2% | 3,536 | 14,042 |
+| TruthfulQA | 24.8% | 203 | 817 |
+| GSM8K | 1.4% | 18 | 1,319 |
+| IFEval | 40.0% | 4 | 10 |
+### Notes on Results
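+- **PIQA (63.9%)** is the strongest result relative to its 50% chance baseline, indicating reasonable physical commonsense for the model's size. **WinoGrande (52.4%)** is also a two-choice task, so it sits only slightly above chance and shows little pronoun-resolution ability so far.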
+- **MMLU (25.2%)** is close to random (25% for 4-way MCQ), which is expected given the model's size and early training stage (1,100 steps). More training is needed for knowledge-heavy tasks.
+- **GSM8K (1.4%)** reflects a known limitation: multi-step mathematical reasoning is very demanding and typically requires much larger models or specialized fine-tuning.
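+- **IFEval (40.0%)** looks promising for a 75M model and may reflect the tool-calling and instruction-following training signal, but it is based on only 10 prompts (4 correct), so it should be read as a rough indication rather than a stable score.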
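
To make the table easier to audit, the sketch below recomputes each accuracy from the Correct and Total columns and attaches a simple binomial standard error. The standard-error formula treats every item as an independent trial, which is an assumption of this sketch and not necessarily how the evaluation harness reports uncertainty. The ARC average is taken as the unweighted mean of ARC-Easy and ARC-Challenge, which is the reading that reproduces the 36.0% in the table.

```python
import math

# (correct, total, reported accuracy in %) copied from the benchmark table.
RESULTS = {
    "HellaSwag": (3100, 10042, 30.9),
    "ARC-Easy": (1118, 2376, 47.1),
    "ARC-Challenge": (292, 1172, 24.9),
    "PIQA": (1174, 1838, 63.9),
    "WinoGrande": (664, 1267, 52.4),
    "MMLU": (3536, 14042, 25.2),
    "TruthfulQA": (203, 817, 24.8),
    "GSM8K": (18, 1319, 1.4),
    "IFEval": (4, 10, 40.0),
}


def accuracy_with_stderr(correct, total):
    """Return (accuracy, binomial standard error), both as fractions."""
    p = correct / total
    return p, math.sqrt(p * (1.0 - p) / total)


for name, (correct, total, reported) in RESULTS.items():
    p, se = accuracy_with_stderr(correct, total)
    print(f"{name:14s} {100 * p:5.1f}% +/- {100 * se:4.1f} (reported {reported}%)")

# Unweighted mean of the two ARC subsets.
arc_average = (RESULTS["ARC-Easy"][0] / RESULTS["ARC-Easy"][1]
               + RESULTS["ARC-Challenge"][0] / RESULTS["ARC-Challenge"][1]) / 2
print(f"ARC average    {100 * arc_average:5.1f}%")
```

With only 10 items, the IFEval standard error comes out at roughly 15 percentage points, while the larger benchmarks are within about 1 to 1.5 points.
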
+---
+## Limitations
+- **Short context (512 tokens):** Jeeves currently supports a maximum of 512 tokens. Long documents, multi-turn conversations, and complex tool chains may be truncated.
+- **Early training stage:** At 1,100 training steps, this is an early checkpoint. Knowledge-heavy and math benchmarks (MMLU, GSM8K) are expected to improve with more training.
+- **Not suitable for factual retrieval:** Like all small language models, Jeeves may hallucinate facts. It is best used with grounding via tool calls or RAG pipelines.
+- **English-centric:** Trained primarily on English data. Performance on other languages is not guaranteed.
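
One practical consequence of the 512-token limit is that the prompt and the generated tokens have to share the same window. The sketch below shows one way to budget for that by keeping only the most recent prompt tokens. The helper name `fit_prompt` is hypothetical, and the assumption that the prompt plus `max_new_tokens` must fit inside 512 positions is inferred from the "Max seq length" row rather than stated in the README.

```python
MAX_SEQ_LEN = 512  # "Max seq length" from the architecture table


def fit_prompt(token_ids, max_new_tokens, max_seq_len=MAX_SEQ_LEN):
    """Keep the most recent tokens so prompt + generation fits the window.

    ASSUMPTION: generated tokens count against the same 512-token limit.
    """
    budget = max_seq_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    # Negative slicing keeps the tail; shorter prompts pass through unchanged.
    return token_ids[-budget:]


# Usage with stand-in token ids (a real call would use tokenizer output).
ids = list(range(600))
kept = fit_prompt(ids, max_new_tokens=128)
print(len(kept), kept[0])  # prints: 384 216
```

Truncating from the left drops the oldest tokens first, which can remove tool definitions placed at the start of the prompt. For tool-calling use it may be safer to shorten the conversation history instead and leave the tool block intact.
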
+---
+## Intended Use
+Jeeves is designed for:
+- **On-device / edge inference** where a small footprint is critical
+- **Tool-augmented agents** that rely on function calling rather than parametric knowledge
+- **Research** into efficient architectures (looped transformers, value residual)
+- **Fine-tuning** on domain-specific tasks where a small, fast base model is preferred
+---
+## Citation
+If you use Jeeves in your work, please also cite the papers that inspired its architecture:
+```bibtex
+@article{looped_transformer_2023,
+  title={Looped Transformers are Better at Learning Learning Algorithms},
+  author={...},
+  journal={arXiv:2311.12424},
+  year={2023}
+}
+@article{value_residual_2024,
+  title={Value Residual Learning For Alleviating Attention Concentration In Transformers},
+  author={...},
+  journal={arXiv:2410.17897},
+  year={2024}
+}
+```
+---
+## License
+Apache 2.0 — see [LICENSE](LICENSE) for details.