# Model Card for abilmansplus/whisper-turbo-kaz-rus-v1

This is a whisper-large-v3-turbo speech-to-text model
fine-tuned on the Kazakh Speech Corpus 2 (ISSAI) and Golos datasets.
It achieves strong performance on Kazakh and reasonable performance on Russian speech.
### Word Error Rates (WER) on test sets
| Model | ISSAI-KSC2 | CV-kaz | FLEURS-kaz | Golos-crowd | CV-rus | FLEURS-rus |
|---|---|---|---|---|---|---|
| abilmansplus/whisper-turbo-kaz-rus-v1 | 8.92% | 13.34% | 13.60% | 8.95% | 20.73% | 16.34% |
| openai/whisper-large-v3-turbo | 70.30% | 47.42% | 23.68% | 26.25% | 8.78% | 5.21% |
### Character Error Rates (CER) on test sets
| Model | ISSAI-KSC2 | CV-kaz | FLEURS-kaz | Golos-crowd | CV-rus | FLEURS-rus |
|---|---|---|---|---|---|---|
| abilmansplus/whisper-turbo-kaz-rus-v1 | 2.99% | 3.50% | 5.43% | 2.42% | 5.52% | 6.66% |
| openai/whisper-large-v3-turbo | 34.26% | 27.83% | 6.02% | 16.54% | 3.61% | 1.46% |
## Model Details

### Recommendations
The model is best suited for transcribing relatively clean Kazakh and Russian speech.
You may need to fine-tune it further on your own domain-specific datasets (e.g., phone calls).
The model outputs transcripts that do NOT include punctuation, capitalization, or timestamps.
## How to Get Started with the Model
For longer audio (35+ seconds), you can divide it into 30-second chunks, transcribe each chunk separately, and then merge the results.
For better-quality long-form transcription, consider first splitting the audio into voiced segments with a VAD solution, as in the sketch below.
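A minimal sketch of VAD-based segmentation using silero-vad (an illustrative choice; this model card does not prescribe a particular VAD tool, and `audio.wav` is a placeholder path):

```python
import torch

# Load the Silero VAD model and its helper utilities from torch.hub
vad_model, utils = torch.hub.load("snakers4/silero-vad", "silero_vad")
get_speech_timestamps, _, read_audio, _, _ = utils

SAMPLING_RATE = 16_000
wav = read_audio("audio.wav", sampling_rate=SAMPLING_RATE)  # placeholder path

# Detect voiced segments; 'start' and 'end' are sample indices
speech_timestamps = get_speech_timestamps(wav, vad_model, sampling_rate=SAMPLING_RATE)

# Each voiced segment can then be transcribed separately (e.g., with the
# Transcriber class below) and the results concatenated
segments = [wav[ts["start"]:ts["end"]].numpy() for ts in speech_timestamps]
```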
Example implementation of a transcriber that can handle both short and long audio files:
```python
import librosa
import numpy as np
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration


class Transcriber:
    def __init__(
        self,
        model_path="abilmansplus/whisper-turbo-kaz-rus-v1",
        processor_path="openai/whisper-large-v3-turbo",  # converts audio into mel-spectrogram features
        device="cuda",
        sampling_rate=16_000,
        num_beams=5,
        chunk_length_s=30,
        stride_length_s=1,
        half_precision=True,
    ):
        self.processor = WhisperProcessor.from_pretrained(
            processor_path,
            language=None,
            task="transcribe"
        )
        self.half_precision = half_precision
        self.model = WhisperForConditionalGeneration.from_pretrained(
            model_path,
            torch_dtype=torch.float16 if half_precision else torch.float32
        )
        self.model = self.model.to(device)
        self.sr = sampling_rate
        self.num_beams = num_beams
        self.chunk_length_s = chunk_length_s  # chunk length in seconds
        self.stride_length_s = stride_length_s  # overlap between chunks in seconds

    def transcribe(self, audio_path: str) -> str:
        speech_array, sampling_rate = librosa.load(audio_path, sr=self.sr)
        audio_length_s = len(speech_array) / self.sr
        # If audio is shorter than chunk_length_s, process it in one pass
        if audio_length_s <= self.chunk_length_s:
            full_transcription = self._transcribe_chunk(speech_array)
            return full_transcription
        # For longer audio, process in overlapping chunks
        chunk_length_samples = int(self.chunk_length_s * self.sr)
        stride_length_samples = int(self.stride_length_s * self.sr)
        # Calculate the number of chunks
        num_samples = len(speech_array)
        num_chunks = max(
            1,
            int(
                1 +
                np.ceil(
                    (num_samples - chunk_length_samples) /
                    (chunk_length_samples - stride_length_samples)
                )
            )
        )
        transcriptions = []
        for i in range(num_chunks):
            # Calculate chunk start and end
            start = max(0, i * (chunk_length_samples - stride_length_samples))
            end = min(num_samples, start + chunk_length_samples)
            # Get the audio chunk
            chunk = speech_array[start:end]
            # Transcribe the chunk
            chunk_transcription = self._transcribe_chunk(chunk)
            transcriptions.append(chunk_transcription)
        # Combine chunk transcriptions (simple concatenation for now)
        full_transcription = " ".join(transcriptions)
        return full_transcription

    def _transcribe_chunk(self, audio_chunk) -> str:
        # Convert raw audio into mel-spectrogram input features
        inputs = self.processor(
            audio_chunk,
            sampling_rate=self.sr,
            return_tensors="pt"
        ).input_features.to(self.model.device)
        if self.half_precision:
            inputs = inputs.half()
        # Get forced decoder IDs for language and task
        forced_decoder_ids = self.processor.get_decoder_prompt_ids(
            language=None,
            task="transcribe"
        )
        # The attention mask should be 1 for all positions in the input features
        attention_mask = torch.ones_like(inputs[:, :, 0])
        # Generate the transcription
        with torch.no_grad():
            generated_ids = self.model.generate(
                inputs,
                forced_decoder_ids=forced_decoder_ids,
                max_length=448,
                num_beams=self.num_beams,
                attention_mask=attention_mask,
            )
        # Decode the generated token IDs to text
        transcription = self.processor.batch_decode(
            generated_ids,
            skip_special_tokens=True
        )[0]
        return transcription
```
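With the class above, a single file can be transcribed as follows (the audio path is a placeholder):

```python
transcriber = Transcriber(device="cuda")  # on CPU, use device="cpu" and half_precision=False
text = transcriber.transcribe("audio.wav")  # placeholder path
print(text)
```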
## Training Details

### Training Data
Fine-tuning was performed on the following datasets:
- Kazakh Speech Corpus 2 (ISSAI) - train split
- Golos - "crowd" subset, train split
  - the "farfield" subset was omitted because, depending on the use case, we may not want to transcribe far-away voices (e.g., background chatter)
### Training Hyperparameters
```python
model_path = "abilmansplus/whisper-turbo-ksc2"  # model to fine-tune
processor_path = "openai/whisper-large-v3-turbo"  # the processor never changes; it converts audio into mel-spectrogram features
train_batch_size = 4
grad_accum_steps = 2
learning_rate = 1e-5
warmup_steps = 20000
weight_decay = 0.01
fp16 = True
bf16 = False
num_epochs = 3
dataloader_num_workers = 8
dataloader_prefetch_factor = 4
sample_rate = 16000  # for audio
max_audio_len = 30  # maximum audio length in seconds
gradient_checkpointing = True
freeze_encoder = False  # if True, only the decoder is fine-tuned, which saves a lot of time and GPU memory on less powerful machines

# LoRA params
use_lora = True  # LoRA technique to reduce the number of trainable parameters
lora_r = 64  # rank
lora_alpha = lora_r * 2
lora_dropout = 0.05
```
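These settings correspond to a standard PEFT LoRA setup; a minimal sketch of how they could be applied, assuming the commonly used attention projections as `target_modules` (the exact targets are not listed above):

```python
from peft import LoraConfig, get_peft_model
from transformers import WhisperForConditionalGeneration

base_model = WhisperForConditionalGeneration.from_pretrained("abilmansplus/whisper-turbo-ksc2")
lora_config = LoraConfig(
    r=64,                                 # lora_r
    lora_alpha=128,                       # lora_r * 2
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # assumption: typical Whisper attention projections
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # shows how few parameters LoRA actually trains
```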
## Evaluation

### Testing Data
Testing was done on the following datasets:
- Kazakh Speech Corpus 2 (ISSAI) - test split
- Golos - "crowd" subset, ~5,000 samples from the test split (the rest were held out for validation)
  - the "farfield" subset was omitted because, depending on the use case, we may not want to transcribe far-away voices (e.g., background chatter)
- Common Voice (CV) - Common Voice Scripted Speech 23.0
  - Kazakh subset (CV-kaz) - test split
  - Russian subset (CV-rus) - test split
- FLEURS - Kazakh and Russian subsets, test splits
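The reported metrics are standard WER and CER; a minimal sketch of computing them with the Hugging Face `evaluate` library (the example strings are illustrative, and the exact text normalization used for the numbers above is not specified here):

```python
import evaluate

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

references = ["бүгін ауа райы жақсы"]   # ground-truth transcripts (illustrative)
predictions = ["бүгін ауа райы жаксы"]  # model outputs (illustrative)

wer = wer_metric.compute(references=references, predictions=predictions)
cer = cer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}, CER: {cer:.2%}")
```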
### Hardware
Single GPU: Nvidia RTX 5060 Ti 16GB
### Framework versions
- PEFT 0.18.0