---
title: "Whisper CTC-DRO ASR model - set 2"
language: multilingual
tags:
  - asr
  - whisper
  - whisper-dro
  - seq2seq
license: apache-2.0
---

# Whisper CTC-DRO ASR model - set 2

This repository contains an automatic speech recognition (ASR) model fine-tuned from `openai/whisper-large-v3` using the principles of [CTC-DRO](https://arxiv.org/abs/2502.01777) applied to Whisper's seq2seq architecture.

The model was trained on balanced training data from set 2 (eng, fas, hrv, ita, slk, yue).

DRO hyperparameters: eta=5e-3, alpha=0.1, aggregation: mean

## Intended Use

This model is intended for multilingual ASR. Users can run inference with the Hugging Face Transformers library:

```python
import torch
import librosa
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model = WhisperForConditionalGeneration.from_pretrained("bartelds/whisper-dro-set2-dro")
processor = WhisperProcessor.from_pretrained("bartelds/whisper-dro-set2-dro")
model.eval()

# Load audio at Whisper's expected 16 kHz sampling rate
audio, sr = librosa.load("input.wav", sr=16000)
inputs = processor.feature_extractor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated = model.generate(input_features=inputs.input_features)

text = processor.tokenizer.batch_decode(generated, skip_special_tokens=True)[0]
print("Recognized text:", text)
```

## How to Use

1. Install dependencies: `pip install transformers torch librosa`
2. Load the model and processor using `from_pretrained()` as shown above.
3. The model supports multilingual transcription; see the training repository for evaluation details.

## Training

- **Base model:** `openai/whisper-large-v3`
- **Training code:** [whisper-dro](https://github.com/bartelds/whisper-dro)
- **Paper:** [CTC-DRO](https://arxiv.org/abs/2502.01777)
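To give intuition for the DRO hyperparameters listed above, here is a minimal, hypothetical sketch of a group-DRO-style exponentiated-gradient weight update over the six language groups, using an `eta` of 5e-3. This is an illustration only, not the training code: the actual CTC-DRO objective (including how `alpha` smooths the update) is defined in the linked paper and implemented in the training repository; the function name and the per-group loss values below are invented for the example.

```python
import torch

def update_group_weights(weights: torch.Tensor, group_losses: torch.Tensor,
                         eta: float = 5e-3) -> torch.Tensor:
    # Exponentiated-gradient step: groups with higher current loss
    # receive proportionally more weight in the next training step.
    new_w = weights * torch.exp(eta * group_losses)
    return new_w / new_w.sum()  # renormalize to a distribution

# Six language groups: eng, fas, hrv, ita, slk, yue (uniform start).
weights = torch.full((6,), 1.0 / 6.0)
# Hypothetical per-group losses for one step.
group_losses = torch.tensor([0.8, 1.2, 1.0, 0.9, 1.1, 1.5])

weights = update_group_weights(weights, group_losses)
robust_loss = (weights * group_losses).sum()  # weighted training objective
```

With a small `eta`, the weights move only slightly per step, so the worst-performing group (here `yue`) is upweighted gradually rather than dominating the objective immediately.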