See axolotl config

axolotl version: `0.14.0.dev0`

```yaml
base_model: Qwen/Qwen3-8B

# Dataset Configuration
datasets:
  - path: data/merged-shuffled-even.jsonl
    type: completion
dataset_prepared_path: last_run_prepared

# Sequence and packing settings
sequence_len: 32768
sample_packing: true
pretrain_multipack_attn: true
flash_attention: true

# Training Hyperparameters
micro_batch_size: 2
gradient_accumulation_steps: 1
num_epochs: 1
optimizer: adamw_torch
adam_beta2: 0.999
lr_scheduler: cosine
learning_rate: 5e-5
warmup_ratio: 0.1
weight_decay: 0.01

# Precision
bf16: auto

# Logging
logging_steps: 10
save_strategy: steps
save_steps: 500
save_total_limit: 4

# Weights & Biases
wandb_project: Machikado
wandb_name: Mazoku-8B-Run

dataset_processes: 64
dataloader_num_workers: 4
gradient_checkpointing: true

# Liger Kernel Plugins
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
```
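The config uses axolotl's `completion` dataset type, which by default reads a `text` field from each JSONL line. The actual schema of `data/merged-shuffled-even.jsonl` is not shown here, so the record below is an assumed minimal example of that shape:

```python
import json

# Hypothetical record for an axolotl `type: completion` dataset:
# each line of the .jsonl file is one standalone JSON object, and the
# loader reads its "text" field as raw pre-training text by default.
record = {"text": "Sample long-form Tagalish post na gagamitin for continued pre-training."}
line = json.dumps(record, ensure_ascii=False)

# Round-trips cleanly as one JSONL line.
parsed = json.loads(line)
```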
Mazoku-8B is a native-first continued pre-train (CPT) of Qwen3-8B. It is designed to capture the authentic "soul" of modern Filipino digital discourse, moving beyond the stiff, robotic translations typically found in mainstream LLMs.
## Model Description
Mazoku-8B was trained on approximately 42,000 lines of high-density Tagalog and Tagalish data, including forum threads, social media commentary, and long-form narratives. Training ran on AMD Instinct MI325X GPUs with a 32,768-token context window, using Liger kernels for memory-efficient loss computation over the large vocabulary.
Unlike assistant-aligned models, Mazoku-8B is a raw base model. It has been "Reddit-poisoned" by design to understand deep colloquialisms, local political slang, and the natural rhythm of how Filipinos actually speak and argue online.
## Intended Uses & Limitations

### Intended Uses
- Foundation for Filipino AI: A high-quality base for SFT (Supervised Fine-Tuning) aimed at creating natural-sounding Filipino assistants.
- Cultural & Linguistic Research: Analyzing local slang, code-switching patterns (Tagalish), and digital sentiment.
- Creative Content: Generating dialogue that sounds like a native speaker rather than a translation script.
### Limitations
- Non-Aligned: This model is not an assistant. It will not necessarily answer questions helpfully; it will often "complete" the prompt in the style of a forum user.
- Toxic Persona: Due to the nature of the training data (social media/forums), the model may exhibit aggressive, biased, or "kanto-style" tones. It is prone to political commentary and slang that may be inappropriate for professional environments without further alignment.
- Knowledge Cutoff: It inherits the knowledge cutoff of the Qwen3-8B base, but adds specific local context up to 2024-2025.
## Training Procedure

### Training Hyperparameters
The model was trained with a focus on stable language acquisition rather than aggressive weight overwriting:
- Learning Rate: 5e-05
- Sequence Length: 32,768 (with multipack attention)
- Batch Size: 2 (micro) / 16 (effective: 2 × 1 gradient-accumulation step × 8 GPUs)
- Precision: BF16
- Epochs: 1
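The effective batch size follows from the config values above. As a quick sanity check (assuming pure data parallelism across the 8 GPUs):

```python
# Effective batch size under data parallelism:
# micro_batch_size x gradient_accumulation_steps x num_gpus
micro_batch_size = 2
grad_accum_steps = 1
num_gpus = 8

effective_batch = micro_batch_size * grad_accum_steps * num_gpus  # 16 sequences

# With sample packing at sequence_len 32768, each optimizer step covers up to:
tokens_per_step = effective_batch * 32768  # 524288 tokens
```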
### Hardware
- GPU: 8x AMD Instinct MI325X (ROCm 7.1)
- Techniques: Liger Kernel (fused linear cross-entropy, RMSNorm, RoPE) for optimized VRAM and throughput.
### Results
- Final Loss: ~2.04
- Final Perplexity (PPL): ~7.7
- Convergence: The model converged stably; gradient-norm spikes subsided after dataset cleaning (slicing long-form documents into header-aware chunks).
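The cleaning script used for this run is not published; the sketch below only illustrates what "header-aware" chunking means (function and parameter names are hypothetical): split documents at markdown-style headers so no chunk breaks mid-section, then greedily pack sections up to a size budget.

```python
# Illustrative sketch: split at headers, then greedily pack sections.
def header_aware_chunks(text: str, max_chars: int = 2000) -> list[str]:
    # 1) Cut the document into sections, each starting at a "#" header.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:  # a new header closes the section
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # 2) Greedily pack whole sections into chunks under the budget
    #    (an oversized single section still becomes its own chunk).
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = ""
        buf = f"{buf}\n{sec}" if buf else sec
    if buf:
        chunks.append(buf)
    return chunks

doc = "# Intro\nkwento...\n# Part 2\n" + "mahabang talata " * 200
chunks = header_aware_chunks(doc, max_chars=500)
```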
## Evaluation Data
The model was evaluated qualitatively on its ability to handle "Tagalish" code-switching and local political context. It demonstrates a high degree of native-sounding fluency, outperforming its base model in capturing Filipino internet "rhythm."
## How to use
As this is a base model, it is highly sensitive to temperature and to the opening of the prompt.
- For authentic netizen-style text: temperature ≈ 0.7
- For cleaner, more logical completions: temperature ≈ 0.2
- Recommended: follow up with an SFT pass using ChatML to align this "soulful" base into a helpful assistant.
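For the recommended SFT pass, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers; this is the standard ChatML layout (also used by the Qwen family), though the exact SFT setup is up to the fine-tuner. A minimal formatting sketch:

```python
# ChatML renders each message as: <|im_start|>{role}\n{content}<|im_end|>
def to_chatml(messages: list[dict]) -> str:
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

sample = [
    {"role": "system", "content": "Ikaw ay isang helpful assistant."},
    {"role": "user", "content": "Kamusta?"},
]
# Open an assistant turn at the end so the model completes the reply.
prompt = to_chatml(sample) + "<|im_start|>assistant\n"
```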