---
license: apache-2.0
datasets:
- nvidia/Nemotron-CC-v2
- HuggingFaceTB/finemath
- bigcode/starcoderdata
language:
- en
---

# Model Details

This is a decoder-only model with approximately 2.15B parameters. The architecture largely follows the Llama design, with the following key hyperparameters:

- Hidden Size: 2048
- Attention Heads: 32
- Layers: 24
- Sequence Length: 2048

# Training Data

The training data is a diverse dataset combining high-quality English, code, and math sources, with a total training budget of 4 trillion tokens. The training mixture comprises the following datasets:

- English: A mixture of the Nemotron-CC high-actual and medium-high-actual subsets.
- Code: The StarCoder dataset.
- Math: The FineMath 4+ dataset.

The final data split is based on predefined proportions of English, code, and math, with the remaining token budget allocated to other languages.

![detailed_data_en](https://cdn-uploads.huggingface.co/production/uploads/618bf745f723a0c1e7f2ce6d/40lp--aeUIkXD8SEKoGXo.png)

# Tokenizer

The model uses the Gemma-3 tokenizer, a SentencePiece tokenizer with a 262K vocabulary. It supports over 140 languages, which contributes to the model's multilingual performance.

# Training Information

The model was trained with the Megatron-LM framework on the LUMI HPC supercomputer. The training used 64 AMD MI250X nodes, totaling approximately 165,000 GPU-hours.

# Intermediate Checkpoints

We have released intermediate checkpoints to provide access to the model's training progression. These checkpoints are available in separate branches, with a new checkpoint released every 5000 training steps. Each branch is named `checkpoint_` followed by the zero-padded seven-digit step number; for example, the checkpoint for step 50000 is named `checkpoint_0050000`. The available checkpoints range from `checkpoint_0005000` up to `checkpoint_0953675`. The final checkpoint, `checkpoint_0953675`, is located in the main branch.
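
Since each intermediate checkpoint is stored as a separate branch, it can be loaded with the standard `revision` argument in `transformers`. Below is a minimal sketch; the repository id is a placeholder for this model's actual repo.

```python
# Minimal sketch: loading a specific intermediate checkpoint with
# Hugging Face transformers by selecting the corresponding branch.
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "org/model-name"  # placeholder: substitute this model's repo id

# Each intermediate checkpoint lives in its own branch, so it can be
# selected via the `revision` argument.
tokenizer = AutoTokenizer.from_pretrained(repo_id, revision="checkpoint_0050000")
model = AutoModelForCausalLM.from_pretrained(repo_id, revision="checkpoint_0050000")

# Omitting `revision` loads from the main branch, i.e. the final
# checkpoint (`checkpoint_0953675`).
```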