MMS-1B Fine-tuned for Itelmen Language (Fold 1 - Best Model)

Model Description

This model is a fine-tuned version of facebook/mms-1b-all for Itelmen language automatic speech recognition (ASR).

Itelmen is a critically endangered language spoken on the Kamchatka Peninsula, Russia, with approximately 100 speakers remaining (UNESCO classification: "Critically Endangered").

Performance

This is the best-performing fold from 3-fold cross-validation:

  • CER (Character Error Rate): 5.23% ✨
  • Training: Experiment 5, Fold 1/3
  • Baseline (pilot experiment): 28.54% CER
  • Improvement: 23.31 percentage points absolute (81.7% relative reduction)
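
For reference, CER is the character-level Levenshtein (edit) distance between the hypothesis and the reference transcript, divided by the reference length. A minimal sketch of the computation (the exact evaluation script is not part of this card):

def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance / reference length."""
    r, h = list(reference), list(hypothesis)
    dp = list(range(len(h) + 1))  # edit distances against the empty reference prefix
    for i in range(1, len(r) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(h) + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (r[i - 1] != h[j - 1]))      # substitution
            prev = cur
    return dp[-1] / max(len(r), 1)

print(cer("ŋaŋa", "ŋana"))  # one substitution over four characters -> 0.25
# (illustrative IPA-like strings, not actual Itelmen)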

Comparison

Model                  CER      Status
Pilot (Phase 6)        28.54%   Previous version
Fold 1 (this model)    5.23%    ✅ Best
Fold 0                 9.40%    Completed
Fold 2                 TBD      In progress

Training Details

Dataset

  • Total samples: 360 audio samples
  • Total duration: ~18 minutes
  • Speakers: 9 native speakers
  • Text encoding: IPA (International Phonetic Alphabet)
  • Data split: Speaker-independent 3-fold cross-validation
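
Speaker-independent folds mean no speaker appears in both the training and validation sides of a fold. The actual split code is not included in this card; a minimal sketch of how such folds can be built with scikit-learn's GroupKFold, assuming a speaker_ids list aligned with the sample list (both hypothetical here):

from sklearn.model_selection import GroupKFold

samples = [f"sample_{i:03d}.wav" for i in range(360)]  # hypothetical file names
speaker_ids = [f"spk_{i % 9}" for i in range(360)]     # hypothetical speaker labels

gkf = GroupKFold(n_splits=3)
for fold, (train_idx, val_idx) in enumerate(gkf.split(samples, groups=speaker_ids)):
    train_spk = {speaker_ids[i] for i in train_idx}
    val_spk = {speaker_ids[i] for i in val_idx}
    assert train_spk.isdisjoint(val_spk)  # speaker independence holds per fold
    print(f"fold {fold}: {len(train_idx)} train / {len(val_idx)} val samples")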

Training Configuration

  • Base model: facebook/mms-1b-all (~1B parameters)
  • Fine-tuning method: Full fine-tuning (all parameters)
  • Training epochs: 57 (early stopping triggered after no improvement beyond epoch 54)
  • Best checkpoint: epoch 54
  • Batch size: 16 (effective)
  • Learning rate: 3e-5
  • Optimizer: AdamW with weight decay
  • Hardware: NVIDIA T600 (4GB VRAM)
  • Training time: ~3 days per fold
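
The training script itself is not shipped with this card; the following is a sketch of a TrainingArguments setup matching the configuration above (recent transformers versions). The per-device batch size and gradient-accumulation split are assumptions: an effective batch of 16 on a 4 GB GPU plausibly means a small per-device batch with accumulation, and the weight-decay value is likewise assumed.

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="mms-1b-itelmen-fold1",
    per_device_train_batch_size=2,    # assumed split of the
    gradient_accumulation_steps=8,    # effective batch size of 16 (2 * 8)
    learning_rate=3e-5,
    weight_decay=0.01,                # AdamW is the Trainer default; 0.01 is assumed
    num_train_epochs=57,
    eval_strategy="epoch",
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="cer",      # see the compute_metrics sketch under Training
    greater_is_better=False,          # lower CER is better
)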

Data Augmentation

  • Speed perturbation (0.9x, 1.1x) - training data only
  • SpecAugment: Time masking
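
The exact augmentation parameters are not given in this card. Speed perturbation can be implemented by resampling and relabeling the sample rate, and Wav2Vec2 models apply SpecAugment-style time masking internally during training via config fields; a sketch under those assumptions:

import torch
import torchaudio
from transformers import Wav2Vec2ForCTC

def speed_perturb(waveform: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    # Resample to sr/factor, then treat the result as sr-rate audio:
    # it plays back `factor` times faster (pitch shifts accordingly).
    return torchaudio.transforms.Resample(sr, int(sr / factor))(waveform)

audio = torch.randn(1, 16000)              # 1 s of dummy mono audio at 16 kHz
slow = speed_perturb(audio, 16000, 0.9)    # 0.9x copy for training
fast = speed_perturb(audio, 16000, 1.1)    # 1.1x copy for training

# Time masking is built into Wav2Vec2 and active only in training mode;
# the exact probability/length values used here are assumptions.
model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/mms-1b-all",
    mask_time_prob=0.05,
    mask_time_length=10,
)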

Usage

Using transformers

from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
import torch
import torchaudio

# Load model and processor
model_name = "sut0/mms-1b-itelmen-fold1"
processor = Wav2Vec2Processor.from_pretrained(model_name)
model = Wav2Vec2ForCTC.from_pretrained(model_name)

# Load audio and resample to the 16 kHz the model expects
audio, sr = torchaudio.load("path/to/audio.wav")
if audio.shape[0] > 1:  # downmix stereo to mono
    audio = audio.mean(dim=0, keepdim=True)
if sr != 16000:
    audio = torchaudio.transforms.Resample(sr, 16000)(audio)

# Transcribe
inputs = processor(audio.squeeze(), sampling_rate=16000, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(inputs.input_values).logits

# Greedy CTC decoding
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.batch_decode(predicted_ids)[0]
print(transcription)

Using Hugging Face Space

Try the model live at: Itelmen ASR Demo

Limitations and Biases

Limitations

  • Small dataset: Only 360 samples (~18 minutes) due to language endangerment
  • Limited speakers: 9 speakers may not cover full phonetic variability
  • Single fold: This is a single-fold model from 3-fold cross-validation
    • A 3-fold averaged model will be released after all folds complete
    • May have lower generalization than the final ensemble
  • Domain: Trained on conversational/narrative speech
  • IPA transcription: Uses International Phonetic Alphabet, not Cyrillic script

Performance Notes

  • CER 5.23% is calculated on this fold's validation set
  • Cross-fold performance may vary (Fold 0: 9.40%, Fold 1: 5.23%)
  • Real-world performance on unseen speakers may differ

Training Procedure

Phase 7-4: Full-scale Experiment

This model is part of a systematic hyperparameter search experiment:

  • Experiment ID: 5
  • Search strategy: Random search across 49 trials
  • Current trial: 1/49
  • Fold: 1/3 (completed)
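
The trial-sampling code is not part of this card; a minimal sketch of what the random search described above could look like, over a purely hypothetical hyperparameter space (the actual search space of Experiment 5 is not listed here):

import random

search_space = {                      # hypothetical grid
    "learning_rate": [1e-5, 3e-5, 5e-5, 1e-4],
    "effective_batch_size": [8, 16, 32],
    "weight_decay": [0.0, 0.01, 0.1],
}

rng = random.Random(5)                # seed chosen arbitrarily for the sketch
trials = [{k: rng.choice(v) for k, v in search_space.items()} for _ in range(49)]
print(trials[0])                      # configuration of trial 1/49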

Preprocessing

  1. Audio resampling to 16kHz
  2. Text normalization (lowercase, IPA preservation)
  3. Audio augmentation (speed perturbation on training data)
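
A minimal sketch of step 2, the text normalization, assuming it amounts to Unicode normalization, lowercasing, and whitespace cleanup (the exact rules are not given in this card):

import re
import unicodedata

def normalize(text: str) -> str:
    # NFC keeps composed IPA characters stable; lowercasing leaves most
    # IPA symbols untouched since they have no uppercase forms.
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

print(normalize("  Tʃa   ŋa "))  # 'tʃa ŋa' (illustrative IPA, not actual Itelmen)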

Training

  • Architecture: Wav2Vec2 + CTC (Connectionist Temporal Classification)
  • Loss function: CTC loss
  • Decoding: Greedy decoding
  • Early stopping: Patience of 3 epochs
  • Metric for best model: CER
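
Putting the last three bullets together: during evaluation the logits are greedy-decoded and scored with CER, which the Trainer then uses to select the best checkpoint. A sketch of a compute_metrics function in the style of the standard wav2vec2 fine-tuning recipe, reusing the cer helper from the Performance section and assuming a processor in scope (utterance-level averaging is an assumption; a corpus-level CER would weight by reference length):

import numpy as np

def compute_metrics(pred):
    pred_ids = np.argmax(pred.predictions, axis=-1)        # greedy CTC decoding
    # Replace the -100 label padding with the real pad token id before decoding
    pred.label_ids[pred.label_ids == -100] = processor.tokenizer.pad_token_id
    pred_str = processor.batch_decode(pred_ids)
    label_str = processor.batch_decode(pred.label_ids, group_tokens=False)
    scores = [cer(ref, hyp) for ref, hyp in zip(label_str, pred_str)]
    return {"cer": sum(scores) / len(scores)}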

Intended Use

Primary Use Cases

  1. Language documentation: Transcribing Itelmen language recordings
  2. Educational tools: Supporting language learning and preservation
  3. Research: Low-resource ASR methodology development

Out-of-Scope Use

  • Commercial speech recognition (insufficient robustness)
  • High-stakes applications (medical, legal) without human verification
  • Other languages (trained specifically for Itelmen)

Ethical Considerations

Language Preservation

This project aims to support the preservation and documentation of Itelmen, a critically endangered language. All training data was collected with appropriate permissions and cultural sensitivity.

Data Privacy

  • Training data: Publicly available or appropriately licensed recordings
  • No personally identifiable information in model outputs

Citation

If you use this model, please cite:

@misc{itelmen-asr-fold1-2025,
  title={MMS-1B Fine-tuned for Itelmen Language ASR (Fold 1)},
  author={sut0},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/sut0/mms-1b-itelmen-fold1}}
}

Acknowledgments

  • Base model: Meta AI's MMS project
  • Language resources: Harvard Itelmen Language Project
  • Community: Itelmen language speakers and cultural preservation efforts

Model Card Authors

sut0

Model Card Contact

For questions or feedback, please open an issue on the Space repository.


Project: Itelmen ASR System · Phase: 7-4 (Full-scale Experiment) · Created: December 2025 · Last Updated: December 23, 2025
Status: ✅ Fold 1 completed | ⏳ Fold 2 in progress | ⏳ Fold 3 pending
