| ---
|
| license: apache-2.0
|
| datasets:
|
| - mozilla-foundation/common_voice_10_0
|
| base_model:
|
| - facebook/wav2vec2-xls-r-300m
|
| tags:
|
| - pytorch
|
| - phoneme-recognition
|
| pipeline_tag: automatic-speech-recognition
|
| arxiv: arxiv.org/abs/2306.04306
|
| metrics:
|
| - per
|
| - aer
|
| library_name: allophant
|
| language:
|
| - bn
|
| - ca
|
| - cs
|
| - cv
|
| - da
|
| - de
|
| - el
|
| - en
|
| - es
|
| - et
|
| - eu
|
| - fi
|
| - fr
|
| - ga
|
| - hi
|
| - hu
|
| - id
|
| - it
|
| - ka
|
| - ky
|
| - lt
|
| - mt
|
| - nl
|
| - pl
|
| - pt
|
| - ro
|
| - ru
|
| - sk
|
| - sl
|
| - sv
|
| - sw
|
| - ta
|
| - tr
|
| - uk
|
| ---
|
|
|
| Model Information
|
| =================
|
|
|
| Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.
|
|
|
| The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).
|
|
|
| | Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
|
| | ---------------- | ---------: | ---------: | -------: | -------: |
|
| | [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** |
|
| | [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
|
| | [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
|
| | **Baseline Shared** | 48.25% | - | 45.35% | - |
|
| | [Baseline](https://huggingface.co/kgnlp/allophant-baseline) | 57.01% | - | 46.95% | - |
|
|
|
| Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
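
The PER columns above report phoneme error rates. As a rough illustration, and assuming the usual definition (Levenshtein edit distance between the hypothesis and reference phoneme sequences, divided by the reference length), a PER for a single utterance could be computed as in the sketch below. The function names are hypothetical and not part of the `allophant` package, whose own evaluation code may differ in details such as how scores are aggregated across utterances.

```python
from typing import Sequence


def edit_distance(reference: Sequence[str], hypothesis: Sequence[str]) -> int:
    """Levenshtein distance between two phoneme sequences."""
    # Row for the empty reference prefix: j insertions produce hypothesis[:j]
    previous = list(range(len(hypothesis) + 1))
    for i, ref_phoneme in enumerate(reference, start=1):
        # Against an empty hypothesis, all i reference phonemes are deleted
        current = [i]
        for j, hyp_phoneme in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_phoneme != hyp_phoneme)
            deletion = previous[j] + 1      # reference phoneme missing from the hypothesis
            insertion = current[j - 1] + 1  # extra phoneme in the hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def phoneme_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Edit distance normalized by the reference length (assumed PER definition)."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    return edit_distance(reference, hypothesis) / len(reference)


# One substitution (s -> z) and one insertion (final s) over 4 reference phonemes
print(phoneme_error_rate(["k", "a", "s", "a"], ["k", "a", "z", "a", "s"]))
```

Since insertions count as errors, a PER computed this way can exceed 100% when the hypothesis is much longer than the reference.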
|
|
|
| Usage
|
| =====
|
|
|
| Install the [`allophant`](https://github.com/kgnlp/allophant) package:
|
|
|
| ```bash
|
| pip install allophant
|
| ```
|
|
|
| A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:
|
|
|
| ```python
|
| from allophant.estimator import Estimator
|
|
|
| device = "cpu"
|
| model, attribute_indexer = Estimator.restore("kgnlp/allophant-baseline-shared", device=device)
|
| supported_features = attribute_indexer.feature_names
|
| # The phonetic feature categories supported by the model, including "phonemes"
|
| print(supported_features)
|
| ```
|
|
|
| Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:
|
|
|
| ```python
|
| # 1. For a single language:
|
| inventory = attribute_indexer.phoneme_inventory("es")
|
| # 2. For multiple languages, e.g. in code-switching scenarios
|
| inventory = attribute_indexer.phoneme_inventory(["es", "it"])
|
| # 3. Any custom selection of phones for which features are available in the Allophoible database
|
| inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
|
| ```
|
|
|
| Audio files can then be loaded, resampled and transcribed using the given
|
| inventory by first computing the log probabilities for each classifier:
|
|
|
| ```python
|
| import torch
|
| import torchaudio
|
| from allophant.dataset_processing import Batch
|
|
|
| # Load an audio file and resample the first channel to the sample rate used by the model
|
| audio, sample_rate = torchaudio.load("utterance.wav")
|
| audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)
|
|
|
| # Construct a batch of 0-padded single channel audio, lengths and language IDs
|
| # Language ID can be 0 for inference
|
| batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
|
| model_outputs = model.predict(
|
|     batch.to(device),
|
|     attribute_indexer.composition_feature_matrix(inventory).to(device)
|
| )
|
| ```
|
|
|
| Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:
|
|
|
| ```python
|
| from allophant import predictions
|
|
|
| # Create a feature mapping for your inventory and CTC decoders for the desired feature set
|
| inventory_indexer = attribute_indexer.attributes.subset(inventory)
|
| ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)
|
|
|
| for feature_name, decoder in ctc_decoders.items():
|
|     decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
|
|     # Print the feature name and values for each utterance in the batch
|
|     for [hypothesis] in decoded:
|
|         # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
|
|         recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
|
|         print(feature_name, recognized)
|
| ```
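
The index offset mentioned in the NOTE comment can be illustrated with a minimal greedy CTC decoding sketch in plain Python. It assumes that index 0 is the `<BLANK>` token and that inventory entries start at index 1, as the comment implies. The `greedy_ctc_decode` function and the toy frame indices are hypothetical; they stand in for the decoders returned by `predictions.feature_decoders`, which may use a different search strategy.

```python
from itertools import groupby

BLANK = 0  # assumed index of the <BLANK> token


def greedy_ctc_decode(frame_indices, inventory):
    """Map per-frame best indices to inventory entries (greedy CTC decoding)."""
    # Collapse runs of identical indices into a single index
    collapsed = [index for index, _ in groupby(frame_indices)]
    # Drop blanks, which only separate repeated tokens
    tokens = [index for index in collapsed if index != BLANK]
    # Shift by one because the blank occupies index 0
    return [inventory[token - 1] for token in tokens]


inventory = ["a", "b", "e"]
# Hypothetical per-frame argmax indices for one utterance
frames = [0, 1, 1, 0, 2, 2, 2, 0, 0, 1, 3]
print(greedy_ctc_decode(frames, inventory))
```

Without the `- 1` shift, every recognized token would map to the next entry in the inventory, and the last index would fall out of range.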
|
|
|
| Citation
|
| ========
|
|
|
| ```bibtex
|
| @inproceedings{glocker2023allophant,
|
| title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
|
| author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
|
| year={2023},
|
| booktitle={{Proc. Interspeech 2023}},
|
| month={8}}
|
| ```
|
| [arXiv:2306.04306](https://arxiv.org/abs/2306.04306)
|
|
|