MMS-1B Fine-tuned for Itelmen Language (Fold 1 - Best Model)

Model Description

This model is a fine-tuned version of facebook/mms-1b-all for Itelmen language automatic speech recognition (ASR).

Itelmen is a critically endangered language spoken on the Kamchatka Peninsula, Russia, with approximately 100 speakers remaining (UNESCO classification: "Critically Endangered").

Performance

This is the best-performing fold from 3-fold cross-validation:

  • CER (Character Error Rate): 5.23% ✨
  • Training: Experiment 5, Fold 1/3
  • Baseline (pilot experiment): 28.54% CER
  • Improvement: 23.31 percentage points absolute (81.7% relative reduction)
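
For reference, CER is the character-level Levenshtein (edit) distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the computation (the exact evaluation script is not part of this card):

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    dp = list(range(len(h) + 1))  # edit distances against the empty reference prefix
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (r[i - 1] != h[j - 1]))      # substitution
            prev = cur
    return dp[-1] / max(len(r), 1)

print(cer("ŋaŋa", "ŋana"))  # one substitution over four characters -> 0.25
# (illustrative IPA-like strings, not actual Itelmen)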

Comparison

Model                  CER      Status
Pilot (Phase 6)        28.54%   Previous version
Fold 1 (this model)    5.23%    ✅ Best
Fold 0                 9.40%    Completed
Fold 2                 TBD      In progress

Training Details

Dataset

  • Total samples: 360 audio samples
  • Total duration: ~18 minutes
  • Speakers: 9 native speakers
  • Text encoding: IPA (International Phonetic Alphabet)
  • Data split: Speaker-independent 3-fold cross-validation
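
Speaker-independent folds mean no speaker appears in both the training and validation sides of a fold. The actual split code is not included in this card; a minimal sketch of how such folds can be built with scikit-learn's GroupKFold, assuming a speaker_ids list aligned with the sample list (both hypothetical here):

from sklearn.model_selection import GroupKFold

samples = [f"sample_{i:03d}.wav" for i in range(360)]  # hypothetical file names
speaker_ids = [f"spk_{i % 9}" for i in range(360)]     # hypothetical speaker labels

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(gkf.split(samples, groups=speaker_ids)):
    train_spk = {speaker_ids[i] for i in train_idx}
    val_spk = {speaker_ids[i] for i in val_idx}
    assert train_spk.isdisjoint(val_spk)  # speaker independence holds per fold
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")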

Training Configuration

  • Base model: facebook/mms-1b-all (~1B parameters)
  • Fine-tuning method: Full fine-tuning (all parameters)
  • Training epochs: 57 (early stopping triggered after no improvement beyond epoch 54)
  • Best checkpoint: epoch 54
  • Batch size: 16 (effective)
  • Learning rate: 3e-5
  • Optimizer: AdamW with weight decay
  • Hardware: NVIDIA T600 (4GB VRAM)
  • Training time: ~3 days per fold
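
The training script itself is not shipped with this card; the following is a sketch of a TrainingArguments setup matching the configuration above (recent transformers versions). The per-device batch size and gradient-accumulation split are assumptions: an effective batch of 16 on a 4 GB GPU plausibly means a small per-device batch with accumulation, and the weight-decay value is likewise assumed.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mms-1b-itelmen-fold1",
    per_device_train_batch_size=2,    # assumed split of the
    gradient_accumulation_steps=8,    # effective batch size of 16 (2 * 8)
    learning_rate=3e-5,
    weight_decay=0.01,                # AdamW is the Trainer default; 0.01 is assumed
    num_train_epochs=57,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="cer",      # see the compute_metrics sketch under Training
    greater_is_better=False,          # lower CER is better
)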

Data Augmentation

  • Speed perturbation (0.9x, 1.1x) - training data only
  • SpecAugment: Time masking
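
The exact augmentation parameters are not given in this card. Speed perturbation can be implemented by resampling and relabeling the sample rate, and Wav2Vec2 models apply SpecAugment-style time masking internally during training via config fields; a sketch under those assumptions:

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC

def speed_perturb(waveform: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    # Resample to sr/factor, then treat the result as sr-rate audio:
    # it plays back `factor` times faster (pitch shifts accordingly).
    return torchaudio.transforms.Resample(sr, int(sr / factor))(waveform)

audio = torch.randn(1, 16000)              # 1 s of dummy mono audio at 16 kHz
slow = speed_perturb(audio, 16000, 0.9)    # 0.9x copy for training
fast = speed_perturb(audio, 16000, 1.1)    # 1.1x copy for training

# Time masking is built into Wav2Vec2 and active only in training mode;
# the exact probability/length values used here are assumptions.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    mask_time_prob=0.05,
    mask_time_length=10,
)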

Usage

Using transformers

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model_name = "sut0/mms-1b-itelmen-fold1"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# Load audio and resample to the 16 kHz the model expects
audio, sr = torchaudio.load("path/to/audio.wav")
if audio.shape[0] > 1:  # downmix stereo to mono
    audio = audio.mean(dim=0, keepdim=True)
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Transcribe
inputs = processor(audio.squeeze(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)

Using Hugging Face Space

Try the model live at: Itelmen ASR Demo

Limitations and Biases

Limitations

  • Small dataset: Only 360 samples (~18 minutes) due to language endangerment
  • Limited speakers: 9 speakers may not cover full phonetic variability
  • Single fold: This is a single-fold model from 3-fold cross-validation
    • A 3-fold averaged model will be released after all folds complete
    • May have lower generalization than the final ensemble
  • Domain: Trained on conversational/narrative speech
  • IPA transcription: Uses International Phonetic Alphabet, not Cyrillic script

Performance Notes

  • CER 5.23% is calculated on this fold's validation set
  • Cross-fold performance may vary (Fold 0: 9.40%, Fold 1: 5.23%)
  • Real-world performance on unseen speakers may differ

Training Procedure

Phase 7-4: Full-scale Experiment

This model is part of a systematic hyperparameter search experiment:

  • Experiment ID: 5
  • Search strategy: Random search across 49 trials
  • Current trial: 1/49
  • Fold: 1/3 (completed)
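
The trial-sampling code is not part of this card; a minimal sketch of what the random search described above could look like, over a purely hypothetical hyperparameter space (the actual search space of Experiment 5 is not listed here):

import random

search_space = {                      # hypothetical grid
    "learning_rate": [1e-5, 3e-5, 5e-5, 1e-4],
    "effective_batch_size": [8, 16, 32],
    "weight_decay": [0.0, 0.01, 0.1],
}

rng = random.Random(5)                # seed chosen arbitrarily for the sketch
trials = [{k: rng.choice(v) for k, v in search_space.items()} for _ in range(49)]
print(trials[0])                      # configuration of trial 1/49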

Preprocessing

  1. Audio resampling to 16kHz
  2. Text normalization (lowercase, IPA preservation)
  3. Audio augmentation (speed perturbation on training data)
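
A minimal sketch of step 2, the text normalization, assuming it amounts to Unicode normalization, lowercasing, and whitespace cleanup (the exact rules are not given in this card):

import re
import unicodedata

def normalize(text: str) -> str:
    # NFC keeps composed IPA characters stable; lowercasing leaves most
    # IPA symbols untouched since they have no uppercase forms.
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Tʃa   ŋa "))  # 'tʃa ŋa' (illustrative IPA, not actual Itelmen)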

Training

  • Architecture: Wav2Vec2 + CTC (Connectionist Temporal Classification)
  • Loss function: CTC loss
  • Decoding: Greedy decoding
  • Early stopping: Patience of 3 epochs
  • Metric for best model: CER
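
Putting the last three bullets together: during evaluation the logits are greedy-decoded and scored with CER, which the Trainer then uses to select the best checkpoint. A sketch of a compute_metrics function in the style of the standard wav2vec2 fine-tuning recipe, reusing the cer helper from the Performance section and assuming a processor in scope (utterance-level averaging is an assumption; a corpus-level CER would weight by reference length):

import numpy as np

def compute_metrics(pred):
    pred_ids = np.argmax(pred.predictions, axis=-1)        # greedy CTC decoding
    # Replace the -100 label padding with the real pad token id before decoding
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    scores = [cer(ref, hyp) for ref, hyp in zip(label_str, pred_str)]
    return {"cer": sum(scores) / len(scores)}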

Intended Use

Primary Use Cases

  1. Language documentation: Transcribing Itelmen language recordings
  2. Educational tools: Supporting language learning and preservation
  3. Research: Low-resource ASR methodology development

Out-of-Scope Use

  • Commercial speech recognition (insufficient robustness)
  • High-stakes applications (medical, legal) without human verification
  • Other languages (trained specifically for Itelmen)

Ethical Considerations

Language Preservation

This project aims to support the preservation and documentation of Itelmen, a critically endangered language. All training data was collected with appropriate permissions and cultural sensitivity.

Data Privacy

  • Training data: Publicly available or appropriately licensed recordings
  • No personally identifiable information in model outputs

Citation

If you use this model, please cite:

@misc{itelmen-asr-fold1-2025,
  title={MMS-1B Fine-tuned for Itelmen Language ASR (Fold 1)},
  author={sut0},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/sut0/mms-1b-itelmen-fold1}}
}

Acknowledgments

  • Base model: Meta AI's MMS project
  • Language resources: Harvard Itelmen Language Project
  • Community: Itelmen language speakers and cultural preservation efforts

Model Card Authors

sut0

Model Card Contact

For questions or feedback, please open an issue on the Space repository.


Project: Itelmen ASR System · Phase: 7-4 (Full-scale Experiment) · Created: December 2025 · Last Updated: December 23, 2025
Status: ✅ Fold 1 completed | ⏳ Fold 2 in progress | ⏳ Fold 3 pending
