Commit 5ab1a1a by Anurich · verified · 1 Parent(s): ddffa29

Update README.md

Files changed (1)
  1. README.md +139 -15
README.md CHANGED
@@ -6,27 +6,77 @@ tags:
  - looped-transformer
  - value-residual
  - sentencepiece
  license: apache-2.0
  ---

- # Jeeves (75M)

- A compact language model using **Looped Transformer + Value Residual Learning**.

- ## Usage

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

- tokenizer = AutoTokenizer.from_pretrained("REPO_ID", trust_remote_code=True)
- model = AutoModelForCausalLM.from_pretrained("REPO_ID", trust_remote_code=True)

  inputs = tokenizer("Hello, how are you?", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

- **Note:** `trust_remote_code=True` is required.

  ## Architecture

@@ -35,17 +85,91 @@ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  | Parameters | 74.9M |
  | Unique layers | 8 |
  | Effective depth | 15 |
- | Loop | block[4] x 8 |
- | Value residual | True |
  | Hidden dim | 768 |
- | FFN dim | 2048 |
- | Attention heads | 12 (Q) / 4 (KV) |
  | Vocab size | 32,000 |
  | Max seq length | 512 |
- | Training step | 1,100 |

- ## Key Innovations

- - **Looped Transformer** ([arXiv 2311.12424](https://arxiv.org/abs/2311.12424))
- - **Value Residual Learning** ([arXiv 2410.17897](https://arxiv.org/abs/2410.17897))
- - **Input Injection** for loop stability
 
  - looped-transformer
  - value-residual
  - sentencepiece
+ - tool-calling
+ - conversational
  license: apache-2.0
+ language:
+ - en
+ pipeline_tag: text-generation
  ---

+ # Jeeves-Small-75M

+ A compact 75M-parameter language model that combines a **Looped Transformer** with **Value Residual Learning** and natively supports **tool calling / function calling**.

+ Jeeves reuses one of its transformer blocks iteratively (looping), so its 8 unique layers reach an effective depth of 15 without adding parameters.
+
+ ---
+
+ ## Quick Start

  ```python
  from transformers import AutoTokenizer, AutoModelForCausalLM

+ tokenizer = AutoTokenizer.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)
+ model = AutoModelForCausalLM.from_pretrained("Anurich/Jeeves-Small-75M", trust_remote_code=True)

  inputs = tokenizer("Hello, how are you?", return_tensors="pt")
  outputs = model.generate(**inputs, max_new_tokens=50)
  print(tokenizer.decode(outputs[0], skip_special_tokens=True))
  ```

+ > **Note:** `trust_remote_code=True` is required because the model uses custom architecture code.
+
+ ---
+
+ ## Tool Calling (Function Calling)
+
+ Jeeves supports structured tool (function) calling through its chat template. For example:
+
+ ```python
+ # Reuses `tokenizer` and `model` from Quick Start.
+ tools = [
+     {
+         "name": "get_weather",
+         "description": "Get the current weather for a given location.",
+         "parameters": {
+             "type": "object",
+             "properties": {
+                 "location": {"type": "string", "description": "City name"},
+                 "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
+             },
+             "required": ["location"]
+         }
+     }
+ ]
+
+ messages = [
+     {"role": "user", "content": "What's the weather like in London?"}
+ ]
+
+ # Format prompt with tools using the chat template
+ prompt = tokenizer.apply_chat_template(
+     messages,
+     tools=tools,
+     tokenize=False,
+     add_generation_prompt=True
+ )
+
+ inputs = tokenizer(prompt, return_tensors="pt")
+ outputs = model.generate(**inputs, max_new_tokens=128)
+ print(tokenizer.decode(outputs[0], skip_special_tokens=True))
+ ```
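
After generation, the caller still has to detect a tool call in the decoded text and run the matching function. The sketch below assumes, hypothetically, that the completion contains a JSON object with `name` and `arguments` keys; the real output format is determined by the model's chat template and is not documented in this README, so verify it before relying on this. `get_weather`, `TOOL_REGISTRY`, `extract_tool_call`, and `dispatch` are illustrative names, and the sample completion is hand-written, not real model output.

```python
import json

def get_weather(location, unit="celsius"):
    # Hypothetical stand-in: a real tool would query a weather service.
    return {"location": location, "unit": unit, "temperature": 18}

# Maps tool names (as declared in `tools`) to Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def extract_tool_call(text):
    """Return the first JSON object in `text` that has a "name" key, else None.

    Assumption: the model emits {"name": ..., "arguments": {...}}.
    """
    decoder = json.JSONDecoder()
    for start, char in enumerate(text):
        if char != "{":
            continue
        try:
            candidate, _ = decoder.raw_decode(text, start)
        except json.JSONDecodeError:
            continue
        if isinstance(candidate, dict) and "name" in candidate:
            return candidate
    return None

def dispatch(text):
    """Run the tool named in `text`, or return None if no known call is found."""
    call = extract_tool_call(text)
    if call is None or call["name"] not in TOOL_REGISTRY:
        return None
    return TOOL_REGISTRY[call["name"]](**call.get("arguments", {}))

# Hand-written completion used only to exercise the parser.
sample = 'Sure. {"name": "get_weather", "arguments": {"location": "London"}}'
result = dispatch(sample)
```

In an agent loop, the returned value would typically be appended to `messages` and the model called again to produce the final answer.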
+
+ ---

  ## Architecture

 
  | Parameters | 74.9M |
  | Unique layers | 8 |
  | Effective depth | 15 |
+ | Loop | block[4] × 8 |
+ | Value residual | ✅ |
  | Hidden dim | 768 |
+ | FFN dim | 2,048 |
+ | Attention heads | 12 (Q) / 4 (KV), GQA |
  | Vocab size | 32,000 |
  | Max seq length | 512 |
+ | Training steps | 1,100 |
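
The head counts in the table imply some grouped-query attention (GQA) arithmetic that can be checked directly. This sketch assumes the common convention that the hidden dimension is split evenly across query heads and that query heads are grouped contiguously; the README states neither, so treat both as assumptions.

```python
hidden_dim = 768
num_q_heads = 12
num_kv_heads = 4

# Assumption: head_dim = hidden_dim / num_q_heads (the usual convention).
head_dim = hidden_dim // num_q_heads

# Each key/value head is shared by this many query heads.
queries_per_kv_head = num_q_heads // num_kv_heads

# Width of the K (or V) projection output when only the KV heads are materialized.
kv_proj_width = num_kv_heads * head_dim

# Assumed contiguous grouping: which KV head each query head attends with.
kv_head_for_query = [q // queries_per_kv_head for q in range(num_q_heads)]
```

Under these assumptions each KV head serves three query heads, and the K and V projections are one third the width they would have with 12 KV heads.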
+
+ ### Key Innovations
+
+ - **Looped Transformer** ([arXiv:2311.12424](https://arxiv.org/abs/2311.12424)): A single transformer block is applied repeatedly in a loop, raising effective depth (15 here, from 8 unique layers) while keeping the parameter count small. This lets Jeeves refine its representations iteratively rather than in a single pass.
+ - **Value Residual Learning** ([arXiv:2410.17897](https://arxiv.org/abs/2410.17897)): Residual connections at the value-projection level alleviate attention concentration in deep or looped networks, improving gradient flow and stability.
+ - **Input Injection**: The original input is re-injected at each loop iteration to limit representational drift across iterations, which helps stabilize looped architectures.
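
One way to read the table's `block[4] × 8` entry is that the layer at index 4 runs 8 times while the other 7 unique layers run once, giving 7 + 8 = 15 layer applications, which matches the stated effective depth. The sketch below builds that schedule and shows input injection in a toy scalar form. The reading of the loop entry, the additive injection rule, the zero initial loop state, and `toy_layer` are all assumptions for illustration, not the model's actual code.

```python
NUM_UNIQUE_LAYERS = 8
LOOPED_LAYER = 4   # assumed meaning of "block[4]"
LOOP_COUNT = 8

def layer_schedule():
    """Order in which layer indices are applied during one forward pass."""
    schedule = []
    for layer in range(NUM_UNIQUE_LAYERS):
        repeats = LOOP_COUNT if layer == LOOPED_LAYER else 1
        schedule.extend([layer] * repeats)
    return schedule

def toy_layer(index, h):
    # Hypothetical stand-in for a transformer block acting on a scalar state.
    return 0.5 * h + 0.1 * (index + 1)

def forward(x):
    """Run all layers once, except the looped layer, which runs LOOP_COUNT times."""
    h = x
    for layer in range(NUM_UNIQUE_LAYERS):
        if layer == LOOPED_LAYER:
            loop_input = h   # representation entering the loop
            state = 0.0      # assumed initial loop state
            for _ in range(LOOP_COUNT):
                # Assumed additive injection: the loop input is added back every iteration.
                state = toy_layer(layer, state + loop_input)
            h = state
        else:
            h = toy_layer(layer, h)
    return h

schedule = layer_schedule()
effective_depth = len(schedule)
output = forward(1.0)
```

Because `loop_input` is added at every iteration, the looped state stays anchored to the representation that entered the loop instead of depending only on its own previous value.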
+
+ ---
+
+ ## Benchmark Results
+
+ Evaluated using [EleutherAI lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness).
+
+ | Benchmark | Accuracy | Correct | Total |
+ |---|---|---|---|
+ | HellaSwag | 30.9% | 3,100 | 10,042 |
+ | ARC-Easy | 47.1% | 1,118 | 2,376 |
+ | ARC-Challenge | 24.9% | 292 | 1,172 |
+ | **ARC (Average)** | **36.0%** | n/a | n/a |
+ | PIQA | 63.9% | 1,174 | 1,838 |
+ | WinoGrande | 52.4% | 664 | 1,267 |
+ | MMLU | 25.2% | 3,536 | 14,042 |
+ | TruthfulQA | 24.8% | 203 | 817 |
+ | GSM8K | 1.4% | 18 | 1,319 |
+ | IFEval | 40.0% | 4 | 10 |
+
+ ### Notes on Results
+
+ - **PIQA (63.9%)** is the strongest result relative to its 50% random baseline (two-choice task), suggesting some physical commonsense for the model's size. **WinoGrande (52.4%)**, also a two-choice task, is only slightly above chance.
+ - **MMLU (25.2%)** is close to random (25% for 4-way multiple choice), which is expected given the model's size and early training stage (1,100 steps). More training is needed for knowledge-heavy tasks.
+ - **GSM8K (1.4%)** reflects a known limitation: multi-step mathematical reasoning is very demanding and typically requires much larger models or specialized fine-tuning.
+ - **IFEval (40.0%)** is based on only 10 prompts (4 correct), so it is a rough indication rather than a reliable estimate. It may reflect the tool-calling and instruction-following training signal.
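
The Accuracy column can be reproduced from the Correct and Total columns. Doing so also shows that the ARC average is the unweighted mean of the two ARC subsets rather than a pooled score over all ARC questions. The snippet uses only numbers from the table above; the variable and function names are illustrative.

```python
# (correct, total) pairs copied from the results table.
results = {
    "HellaSwag": (3100, 10042),
    "ARC-Easy": (1118, 2376),
    "ARC-Challenge": (292, 1172),
    "PIQA": (1174, 1838),
    "WinoGrande": (664, 1267),
    "MMLU": (3536, 14042),
    "TruthfulQA": (203, 817),
    "GSM8K": (18, 1319),
    "IFEval": (4, 10),
}

def accuracy_pct(correct, total):
    """Accuracy as a percentage, rounded to one decimal place."""
    return round(100 * correct / total, 1)

accuracy = {name: accuracy_pct(c, t) for name, (c, t) in results.items()}

# Unweighted mean of the two ARC subsets (the figure reported in the table).
arc_macro = round((accuracy["ARC-Easy"] + accuracy["ARC-Challenge"]) / 2, 1)

# Pooling every ARC question instead would weight ARC-Easy more heavily.
arc_pooled = accuracy_pct(1118 + 292, 2376 + 1172)
```

The gap between `arc_macro` and `arc_pooled` arises because ARC-Easy has roughly twice as many questions as ARC-Challenge.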
+
+ ---
+
+ ## Limitations
+
+ - **Short context (512 tokens):** Jeeves currently supports a maximum sequence length of 512 tokens. Long documents, multi-turn conversations, and complex tool chains may be truncated.
+ - **Early training stage:** At 1,100 training steps, this is an early checkpoint. Results on knowledge-heavy and math benchmarks (MMLU, GSM8K) are expected to improve with more training.
+ - **Not suitable for factual retrieval:** Like other small language models, Jeeves may hallucinate facts. It is best used with grounding via tool calls or RAG pipelines.
+ - **English-centric:** Trained primarily on English data. Performance on other languages is not guaranteed.
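
Assuming prompt and generated tokens share the 512-token window, as is typical for causal language models, it helps to budget the prompt before calling `generate`. The helper below is a minimal sketch that works on a plain list of token IDs; `fit_prompt` and its keep-the-most-recent-tokens policy are illustrative choices, not part of the Jeeves codebase.

```python
MAX_SEQ_LEN = 512  # from the Architecture table

def fit_prompt(token_ids, max_new_tokens, max_seq_len=MAX_SEQ_LEN):
    """Trim a prompt so that prompt + generation fits in the context window.

    Keeps the most recent tokens and drops the oldest ones first.
    """
    budget = max_seq_len - max_new_tokens
    if budget <= 0:
        raise ValueError("max_new_tokens leaves no room for the prompt")
    return token_ids[-budget:]

# A 600-token prompt with 128 new tokens requested keeps only the last 384 tokens.
prompt_ids = list(range(600))
trimmed = fit_prompt(prompt_ids, max_new_tokens=128)

# A prompt that already fits is returned unchanged.
short = fit_prompt(list(range(10)), max_new_tokens=128)
```

Dropping the oldest tokens can cut off tool definitions placed at the start of a templated prompt, so for tool calling it may be safer to shorten the conversation history before applying the chat template.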
+
+ ---
+
+ ## Intended Use
+
+ Jeeves is designed for:
+
+ - **On-device / edge inference** where a small footprint is critical
+ - **Tool-augmented agents** that rely on function calling rather than parametric knowledge
+ - **Research** into efficient architectures (looped transformers, value residual)
+ - **Fine-tuning** on domain-specific tasks where a small, fast base model is preferred
+
+ ---
+
+ ## Citation
+
+ If you use Jeeves in your work, please also cite the papers that inspired its architecture:
+
+ ```bibtex
+ @article{looped_transformer_2023,
+   title={Looped Transformers are Better at Learning Learning Algorithms},
+   author={...},
+   journal={arXiv:2311.12424},
+   year={2023}
+ }
+
+ @article{value_residual_2024,
+   title={Value Residual Learning For Alleviating Attention Concentration In Transformers},
+   author={...},
+   journal={arXiv:2410.17897},
+   year={2024}
+ }
+ ```
+
+ ---

+ ## License

+ Apache 2.0. See [LICENSE](LICENSE) for details.