# MMS-1B Fine-tuned for Itelmen Language (Fold 1 - Best Model)

## Model Description
This model is a fine-tuned version of facebook/mms-1b-all for Itelmen language automatic speech recognition (ASR).
Itelmen is a critically endangered language spoken in the Kamchatka Peninsula, Russia, with approximately 100 speakers remaining (UNESCO classification: "Critically Endangered").
## Performance

This is the best-performing fold from 3-fold cross-validation:

- CER (Character Error Rate): 5.23%
- Training: Experiment 5, Fold 1/3
- Baseline (pilot experiment): 28.54% CER
- Improvement: 23.31 percentage points lower (81.7% relative reduction)
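The improvement figures follow directly from the two reported CER values:

```python
# Reproduce the improvement figures from the reported CERs.
baseline_cer = 28.54   # pilot experiment
fold1_cer = 5.23       # this model

absolute_drop = baseline_cer - fold1_cer            # percentage points
relative_gain = absolute_drop / baseline_cer * 100  # percent

print(f"{absolute_drop:.2f} pp absolute, {relative_gain:.1f}% relative")
# 23.31 pp absolute, 81.7% relative
```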
### Comparison

| Model | CER | Status |
|---|---|---|
| Pilot (Phase 6) | 28.54% | Previous version |
| Fold 1 (this model) | 5.23% | ✅ Best |
| Fold 0 | 9.40% | Completed |
| Fold 2 | TBD | In progress |
## Training Details

### Dataset
- Total samples: 360 audio samples
- Total duration: ~18 minutes
- Speakers: 9 native speakers
- Text encoding: IPA (International Phonetic Alphabet)
- Data split: Speaker-independent 3-fold cross-validation
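Speaker-independent folds mean no speaker's audio appears in both the training and validation splits of the same fold. A minimal sketch of such a split (the speaker names, round-robin assignment, and per-speaker clip counts are illustrative assumptions, not the project's actual splitting code):

```python
from collections import defaultdict

def speaker_independent_folds(samples, n_folds=3):
    """Group (speaker, clip) pairs so each speaker lands in exactly one fold."""
    by_speaker = defaultdict(list)
    for speaker, clip in samples:
        by_speaker[speaker].append(clip)
    folds = [[] for _ in range(n_folds)]
    # Assign speakers round-robin; real code might balance by total duration.
    for i, speaker in enumerate(sorted(by_speaker)):
        folds[i % n_folds].append(speaker)
    return folds

# 9 speakers x 40 clips = 360 samples, 3 held-out speakers per fold
samples = [(f"spk{i}", f"clip{i}_{j}") for i in range(9) for j in range(40)]
folds = speaker_independent_folds(samples)
print(folds)
```

Validating on held-out speakers gives a more honest estimate of how the model handles voices it has never heard, which matters with only 9 speakers in total.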
### Training Configuration
- Base model: facebook/mms-1b-all (1.4B parameters)
- Fine-tuning method: Full fine-tuning (all parameters)
- Training epochs: 57 (stopped early by patience-based early stopping)
- Best checkpoint: epoch 54
- Batch size: 16 (effective)
- Learning rate: 3e-5
- Optimizer: AdamW with weight decay
- Hardware: NVIDIA T600 (4GB VRAM)
- Training time: ~3 days per fold
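An effective batch size of 16 on a 4 GB card implies gradient accumulation. A hypothetical sketch of such a configuration with `transformers` `TrainingArguments` (all values not listed above are assumptions, and parameter names follow recent `transformers` versions):

```python
from transformers import TrainingArguments

# Hypothetical configuration sketch -- not the project's actual script.
args = TrainingArguments(
    output_dir="mms-1b-itelmen-fold1",
    per_device_train_batch_size=2,    # what a 4 GB GPU might fit (assumed)
    gradient_accumulation_steps=8,    # 2 * 8 = 16 effective batch size
    learning_rate=3e-5,
    weight_decay=0.01,                # AdamW weight decay (assumed value)
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="cer",      # CER selects the best checkpoint
    greater_is_better=False,          # lower CER is better
)
# Early stopping with patience 3 would pair with
# transformers.EarlyStoppingCallback(early_stopping_patience=3).
```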
### Data Augmentation
- Speed perturbation (0.9x, 1.1x) - training data only
- SpecAugment: Time masking
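Speed perturbation resamples the waveform in time, shortening or lengthening it (and shifting pitch). A dependency-free sketch using linear interpolation; the actual pipeline most likely used a library resampler, so this is purely illustrative:

```python
def speed_perturb(samples, factor):
    """Time-stretch a waveform by `factor` via linear interpolation.

    factor=1.1 plays ~10% faster (fewer output samples);
    factor=0.9 plays ~10% slower (more output samples).
    """
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

wave = [0.0, 1.0, 0.0, -1.0] * 100   # toy 400-sample waveform
fast = speed_perturb(wave, 1.1)      # 363 samples
slow = speed_perturb(wave, 0.9)      # 444 samples
```

Applying the 0.9x and 1.1x variants triples the effective amount of training audio, which matters when only ~18 minutes are available.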
## Usage

### Using `transformers`
```python
import torch
import torchaudio
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

# Load model and processor
model_name = "sut0/mms-1b-itelmen-fold1"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)
model.eval()

# Load audio and resample to the model's 16 kHz input rate if needed
audio, sr = torchaudio.load("path/to/audio.wav")
if sr != 16000:
    resampler = torchaudio.transforms.Resample(sr, 16000)
    audio = resampler(audio)

# Transcribe with greedy CTC decoding
inputs = processor(audio.squeeze(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)
```
### Using Hugging Face Space
Try the model live at: Itelmen ASR Demo
## Limitations and Biases

### Limitations
- Small dataset: Only 360 samples (~18 minutes) due to language endangerment
- Limited speakers: 9 speakers may not cover full phonetic variability
- Single fold: This is a single-fold model from 3-fold cross-validation
  - A 3-fold averaged model will be released after all folds complete
  - It may generalize less well than the final ensemble
- Domain: Trained on conversational/narrative speech
- IPA transcription: Uses International Phonetic Alphabet, not Cyrillic script
### Performance Notes
- CER 5.23% is calculated on this fold's validation set
- Cross-fold performance may vary (Fold 0: 9.40%, Fold 1: 5.23%)
- Real-world performance on unseen speakers may differ
## Training Procedure

### Phase 7-4: Full-scale Experiment
This model is part of a systematic hyperparameter search experiment:
- Experiment ID: 5
- Search strategy: Random search across 49 trials
- Current trial: 1/49
- Fold: 1/3 (completed)
### Preprocessing
- Audio resampling to 16kHz
- Text normalization (lowercase, IPA preservation)
- Audio augmentation (speed perturbation on training data)
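The text-normalization step (lowercasing while preserving IPA) could be sketched as follows; the exact character classes kept and the sample word are illustrative assumptions, not the project's actual normalizer:

```python
import unicodedata

def normalize_transcript(text):
    """Lowercase and strip punctuation, keeping IPA letters and diacritics."""
    text = unicodedata.normalize("NFC", text).lower()
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        # Keep letters (covers the IPA block), modifier letters such as
        # ʼ and ː (category Lm), combining marks, and spaces.
        if cat.startswith("L") or cat in ("Mn", "Sk") or ch == " ":
            kept.append(ch)
    return " ".join("".join(kept).split())

print(normalize_transcript("Kəzza ɬaχsχ!"))  # -> "kəzza ɬaχsχ"
```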
### Training
- Architecture: Wav2Vec2 + CTC (Connectionist Temporal Classification)
- Loss function: CTC loss
- Decoding: Greedy decoding
- Early stopping: Patience of 3 epochs
- Metric for best model: CER
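CER, the metric used for model selection above, is the Levenshtein (edit) distance between predicted and reference character sequences divided by the reference length. A minimal implementation (the example strings are illustrative, not real transcripts):

```python
def cer(reference, hypothesis):
    """Character error rate: edit distance / reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))  # distances for the empty reference prefix
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            cur[j] = min(prev[j] + 1,         # deletion
                         cur[j - 1] + 1,      # insertion
                         prev[j - 1] + cost)  # substitution
        prev = cur
    return prev[n] / max(m, 1)

print(cer("kəzza", "kəza"))  # one deletion over 5 chars -> 0.2
```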
## Intended Use

### Primary Use Cases
- Language documentation: Transcribing Itelmen language recordings
- Educational tools: Supporting language learning and preservation
- Research: Low-resource ASR methodology development
### Out-of-Scope Use
- Commercial speech recognition (insufficient robustness)
- High-stakes applications (medical, legal) without human verification
- Other languages (trained specifically for Itelmen)
## Ethical Considerations

### Language Preservation
This project aims to support the preservation and documentation of Itelmen, a critically endangered language. All training data was collected with appropriate permissions and cultural sensitivity.
### Data Privacy
- Training data: Publicly available or appropriately licensed recordings
- No personally identifiable information in model outputs
## Citation

If you use this model, please cite:

```bibtex
@misc{itelmen-asr-fold1-2025,
  title={MMS-1B Fine-tuned for Itelmen Language ASR (Fold 1)},
  author={sut0},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/sut0/mms-1b-itelmen-fold1}}
}
```
## Acknowledgments
- Base model: Meta AI's MMS project
- Language resources: Harvard Itelmen Language Project
- Community: Itelmen language speakers and cultural preservation efforts
## Model Card Authors

sut0

## Model Card Contact

For questions or feedback, please open an issue on the Space repository.
---

Project: Itelmen ASR System | Phase: 7-4 (Full-scale Experiment) | Created: December 2025 | Last Updated: December 23, 2025

Status: ✅ Fold 1 completed | ⏳ Fold 2 in progress | ⏳ Fold 3 pending