See axolotl config

axolotl version: `0.14.0.dev0`

```yaml
base_model: Qwen/Qwen3-8B

# Dataset Configuration
datasets:
  - path: data/merged-shuffled-even.jsonl
    type: completion
dataset_prepared_path: last_run_prepared

# Sequence and packing settings
sequence_len: 32768
sample_packing: true
pretrain_multipack_attn: true
flash_attention: true

# Training Hyperparameters
micro_batch_size: 2
gradient_accumulation_steps: 1
num_epochs: 1
optimizer: adamw_torch
adam_beta2: 0.999
lr_scheduler: cosine
learning_rate: 5e-5
warmup_ratio: 0.1
weight_decay: 0.01

# Precision
bf16: auto

# Logging
logging_steps: 10
save_strategy: steps
save_steps: 500
save_total_limit: 4

# Weights & Biases
wandb_project: Machikado
wandb_name: Mazoku-8B-Run

dataset_processes: 64
dataloader_num_workers: 4
gradient_checkpointing: true

# Liger Kernel Plugins
plugins:
  - axolotl.integrations.liger.LigerPlugin
liger_rope: true
liger_rms_norm: true
liger_glu_activation: true
liger_layer_norm: true
liger_fused_linear_cross_entropy: true
```
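The config uses axolotl's `completion` dataset type, which by default reads a `text` field from each JSONL line. The actual schema of `data/merged-shuffled-even.jsonl` is not shown here, so the record below is an assumed minimal example of that shape:

```python
import json

# Hypothetical record for an axolotl `type: completion` dataset:
# each line of the .jsonl file is one standalone JSON object, and the
# loader reads its "text" field as raw pre-training text by default.
record = {"text": "Sample long-form Tagalish post na gagamitin for continued pre-training."}
line = json.dumps(record, ensure_ascii=False)

# Round-trips cleanly as one JSONL line.
parsed = json.loads(line)
```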
Mazoku-8B is a native-first continued pre-train (CPT) of Qwen3-8B. It is designed to capture the authentic "soul" of modern Filipino digital discourse, moving beyond the stiff, robotic translations typically found in mainstream LLMs.
## Model Description
Mazoku-8B was trained on approximately 42,000 lines of high-density Tagalog and Tagalish data, including forum threads, social media commentary, and long-form narratives. Training ran on AMD Instinct MI325X GPUs with a 32,768-token context window, using Liger kernels for memory-efficient loss computation over the large vocabulary.
Unlike assistant-aligned models, Mazoku-8B is a raw base model. It has been "Reddit-poisoned" by design to understand deep colloquialisms, local political slang, and the natural rhythm of how Filipinos actually speak and argue online.
## Intended Uses & Limitations

### Intended Uses
- Foundation for Filipino AI: A high-quality base for SFT (Supervised Fine-Tuning) aimed at creating natural-sounding Filipino assistants.
- Cultural & Linguistic Research: Analyzing local slang, code-switching patterns (Tagalish), and digital sentiment.
- Creative Content: Generating dialogue that sounds like a native speaker rather than a translation script.
### Limitations
- Non-Aligned: This model is not an assistant. It will not necessarily answer questions helpfully; it will often "complete" the prompt in the style of a forum user.
- Toxic Persona: Due to the nature of the training data (social media/forums), the model may exhibit aggressive, biased, or "kanto-style" tones. It is prone to political commentary and slang that may be inappropriate for professional environments without further alignment.
- Knowledge Cutoff: It inherits the knowledge cutoff of the Qwen3-8B base, but adds specific local context up to 2024-2025.
## Training Procedure

### Training Hyperparameters
The model was trained with a focus on stable language acquisition rather than aggressive weight overwriting:
- Learning Rate: 5e-05
- Sequence Length: 32,768 (with multipack attention)
- Batch Size: 2 (micro) / 16 (effective: 2 × 1 gradient-accumulation step × 8 GPUs)
- Precision: BF16
- Epochs: 1
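The effective batch size follows from the config values above. As a quick sanity check (assuming pure data parallelism across the 8 GPUs):

```python
# Effective batch size under data parallelism:
# micro_batch_size x gradient_accumulation_steps x num_gpus
micro_batch_size = 2
grad_accum_steps = 1
num_gpus = 8

effective_batch = micro_batch_size * grad_accum_steps * num_gpus  # 16 sequences

# With sample packing at sequence_len 32768, each optimizer step covers up to:
tokens_per_step = effective_batch * 32768  # 524288 tokens
```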
### Hardware
- GPU: 8x AMD Instinct MI325X (ROCm 7.1)
- Techniques: Liger Kernel (fused linear cross-entropy, RMSNorm, RoPE) for optimized VRAM and throughput.
### Results
- Final Loss: ~2.04
- Final Perplexity (PPL): ~7.7
- Convergence: The model converged stably; gradient-norm spikes subsided after dataset cleaning (slicing long-form documents into header-aware chunks).
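The cleaning script used for this run is not published; the sketch below only illustrates what "header-aware" chunking means (function and parameter names are hypothetical): split documents at markdown-style headers so no chunk breaks mid-section, then greedily pack sections up to a size budget.

```python
# Illustrative sketch: split at headers, then greedily pack sections.
def header_aware_chunks(text: str, max_chars: int = 2000) -> list[str]:
    # 1) Cut the document into sections, each starting at a "#" header.
    sections, current = [], []
    for line in text.splitlines():
        if line.startswith("#") and current:  # a new header closes the section
            sections.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current))

    # 2) Greedily pack whole sections into chunks under the budget
    #    (an oversized single section still becomes its own chunk).
    chunks, buf = [], ""
    for sec in sections:
        if buf and len(buf) + len(sec) + 1 > max_chars:
            chunks.append(buf)
            buf = ""
        buf = f"{buf}\n{sec}" if buf else sec
    if buf:
        chunks.append(buf)
    return chunks

doc = "# Intro\nkwento...\n# Part 2\n" + "mahabang talata " * 200
chunks = header_aware_chunks(doc, max_chars=500)
```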
## Evaluation Data
The model was evaluated qualitatively on its ability to handle "Tagalish" code-switching and local political context. It demonstrates a high degree of native-sounding fluency, outperforming its base model in capturing Filipino internet "rhythm."
## How to use
As this is a base model, it is highly sensitive to temperature and to the opening of the prompt.
- For authentic netizen-style text: temperature ≈ 0.7
- For cleaner, more logical completions: temperature ≈ 0.2
- Recommended: follow up with an SFT pass using ChatML to align this "soulful" base into a helpful assistant.
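For the recommended SFT pass, ChatML wraps each turn in `<|im_start|>`/`<|im_end|>` markers; this is the standard ChatML layout (also used by the Qwen family), though the exact SFT setup is up to the fine-tuner. A minimal formatting sketch:

```python
# ChatML renders each message as: <|im_start|>{role}\n{content}<|im_end|>
def to_chatml(messages: list[dict]) -> str:
    return "".join(
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    )

sample = [
    {"role": "system", "content": "Ikaw ay isang helpful assistant."},
    {"role": "user", "content": "Kamusta?"},
]
# Open an assistant turn at the end so the model completes the reply.
prompt = to_chatml(sample) + "<|im_start|>assistant\n"
```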