---
title: "Whisper CTC-DRO ASR model - set 2"
language: multilingual
tags:
  - asr
  - whisper
  - whisper-dro
  - seq2seq
license: apache-2.0
---

# Whisper CTC-DRO ASR model - set 2

This repository contains an automatic speech recognition (ASR) model fine-tuned from `openai/whisper-large-v3` using the principles of [CTC-DRO](https://arxiv.org/abs/2502.01777) applied to Whisper's seq2seq architecture.

The model was trained on balanced training data from set 2 (eng, fas, hrv, ita, slk, yue).

DRO hyperparameters: eta=5e-3, alpha=0.1, aggregation: mean

## Intended Use

This model is intended for multilingual ASR. Users can run inference with the Hugging Face Transformers library:

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("bartelds/whisper-dro-set2-dro")
processor = WhisperProcessor.from_pretrained("bartelds/whisper-dro-set2-dro")
model.eval()

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("input.wav", sr=16000)
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(input_features=inputs.input_features)

text = processor.tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print("Recognized text:", text)
```

## How to Use

1. Install dependencies: `pip install transformers torch librosa`
2. Load the model and processor using `from_pretrained()` as shown above.
3. The model supports multilingual transcription; see the training repository for evaluation details.

## Training

- **Base model:** `openai/whisper-large-v3`
- **Training code:** [whisper-dro](https://github.com/bartelds/whisper-dro)
- **Paper:** [CTC-DRO](https://arxiv.org/abs/2502.01777)
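To give intuition for the DRO hyperparameters listed above, here is a minimal, hypothetical sketch of a group-DRO-style exponentiated-gradient weight update over the six language groups, using an `eta` of 5e-3. This is an illustration only, not the training code: the actual CTC-DRO objective (including how `alpha` smooths the update) is defined in the linked paper and implemented in the training repository; the function name and the per-group loss values below are invented for the example.

```python
import torch

def update_group_weights(weights: torch.Tensor, group_losses: torch.Tensor,
                         eta: float = 5e-3) -> torch.Tensor:
    # Exponentiated-gradient step: groups with higher current loss
    # receive proportionally more weight in the next training step.
    new_w = weights * torch.exp(eta * group_losses)
    return new_w / new_w.sum()  # renormalize to a distribution

# Six language groups: eng, fas, hrv, ita, slk, yue (uniform start).
weights = torch.full((6,), 1.0 / 6.0)
# Hypothetical per-group losses for one step.
group_losses = torch.tensor([0.8, 1.2, 1.0, 0.9, 1.1, 1.5])

weights = update_group_weights(weights, group_losses)
robust_loss = (weights * group_losses).sum()  # weighted training objective
```

With a small `eta`, the weights move only slightly per step, so the worst-performing group (here `yue`) is upweighted gradually rather than dominating the objective immediately.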