Copied from https://github.com/Wasser1462/FunASR-nano-onnx/
FunASR-Nano ONNX
ONNX export and inference implementation for the FunASR-Nano model.
Requirements
- Python >= 3.8
- PyTorch >= 2.0
- ONNX Runtime >= 1.15
- transformers
- funasr (for feature extraction)
- modelscope (for downloading models)
Install dependencies:
pip install -r requirements.txt
pip install modelscope
Quick Start
1. Download Models
Download pre-trained ONNX models from ModelScope to the models/ directory:
modelscope download --model zengshuishui/FunASR-nano-onnx --output_dir models
After downloading, the models/ directory will contain:
- encoder_adaptor.onnx and encoder_adaptor.onnx.data
- llm.onnx and llm.onnx.data
- encoder_adaptor.int8.onnx and llm.int8.onnx (INT8 quantized versions)
- embedding.onnx and embedding.int8.onnx
2. Run Inference
With ONNX Embedding Model (Recommended):
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--embedding-model models/embedding.onnx \
--wave examples/zh.mp3 \
--prompt "è¯éŸ³è½¬å†™ï¼š" \
--max-new-tokens 512 \
--device auto
Without ONNX Embedding Model (requires model.safetensors):
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--wave examples/zh.mp3 \
--prompt "è¯éŸ³è½¬å†™ï¼š" \
--max-new-tokens 512 \
--device auto
Parameters:
- --device: Inference device; options: cpu, cuda, or auto (default: auto; the LLM runs on CPU by default due to CUDA float16 issues)
- --embedding-model: Path to the ONNX embedding model (optional; if not provided, the PyTorch model from --llm-tokenizer is used)
- --seed: Random seed for reproducible results (default: 42)
- --temperature: Sampling temperature (default: 0.3)
- --top-p: Top-p (nucleus) sampling threshold (default: 0.8; see the sampling sketch below)
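For reference, the following is a minimal sketch of how temperature and top-p (nucleus) sampling combine to pick the next token, illustrating what --temperature, --top-p, and --seed control. It is not taken from inference.py, and the function name sample_top_p is made up for the example.

import numpy as np

def sample_top_p(logits, temperature=0.3, top_p=0.8, rng=None):
    """Pick one token id from a 1-D logits vector with temperature + nucleus sampling."""
    rng = rng or np.random.default_rng(42)           # --seed makes runs reproducible
    scaled = logits / max(temperature, 1e-6)         # --temperature sharpens or flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                       # most probable tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]   # smallest set covering --top-p mass
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))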
3. Export ONNX Models (Optional)
If you need to export ONNX models from the original model.pt:
Export Encoder+Adaptor
python scripts/export_encoder_adaptor_onnx.py \
--model-pt /path/to/model.pt \
--output-filename models/encoder_adaptor.onnx \
--opset-version 18
Export LLM
python scripts/export_llm_onnx.py \
--model-pt /path/to/model.pt \
--llm-config-path /path/to/Qwen3-0.6B \
--output-filename models/llm.onnx \
--opset-version 18
Export Embedding Layer (Optional, Recommended)
To avoid loading the full PyTorch model during inference, you can export the embedding layer to ONNX:
python scripts/export_embedding_onnx.py \
--llm-config-path /path/to/Qwen3-0.6B \
--output-filename models/embedding.onnx \
--opset-version 18 \
--verify
The --verify flag automatically verifies the exported model by comparing its outputs with the PyTorch model. This script will:
- Check that model.safetensors exists in the LLM config directory
- Export the embedding layer to ONNX format
- Create an INT8 quantized version
- Verify the exported model (if --verify is used)
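For illustration, the sketch below shows the kind of comparison --verify performs: the ONNX embedding output is checked against the PyTorch embedding layer. The token IDs are placeholders, the input name is read from the session rather than assumed, and this is not the script's actual code.

import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModelForCausalLM

llm_dir = "models/Qwen3-0.6B"                      # must contain model.safetensors for the reference
ids = np.array([[1, 2, 3, 4]], dtype=np.int64)     # arbitrary placeholder token IDs

sess = ort.InferenceSession("models/embedding.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {sess.get_inputs()[0].name: ids})[0]

ref = AutoModelForCausalLM.from_pretrained(llm_dir)
with torch.no_grad():
    torch_out = ref.get_input_embeddings()(torch.from_numpy(ids)).numpy()

print("max abs diff:", np.abs(onnx_out.astype(np.float32) - torch_out.astype(np.float32)).max())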
Note: The embedding ONNX model eliminates the need for model.safetensors during inference, reducing memory usage and startup time. The INT8 quantized version further reduces model size while maintaining accuracy.
Model Description
Encoder+Adaptor Model
- Input: Audio features (batch, time, 560)
- Output: LLM embeddings (batch, time, 1024)
- Supports dynamic sequence length
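A minimal usage sketch with ONNX Runtime, using random features in place of real funasr features and reading the input name from the session rather than assuming it:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("models/encoder_adaptor.onnx",
                            providers=["CPUExecutionProvider"])
# Real features come from funasr feature extraction; random data stands in here.
feats = np.random.randn(1, 100, 560).astype(np.float32)   # (batch, time, 560)
# If the exported graph expects additional inputs (e.g. feature lengths),
# sess.get_inputs() will list them.
audio_embeds = sess.run(None, {sess.get_inputs()[0].name: feats})[0]
print(audio_embeds.shape)   # expected (1, time, 1024)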
LLM Model
- Input:
  - inputs_embeds: (batch, sequence_length, 1024)
  - attention_mask: (batch, sequence_length)
- Output:
  - logits: (batch, sequence_length, vocab_size)
- Supports dynamic sequence length
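A minimal sketch of a single forward pass; the float32/int64 dtypes are assumptions (check sess.get_inputs() for the exported graph's exact types), and the greedy argmax stands in for the temperature/top-p sampling that inference.py uses:

import numpy as np
import onnxruntime as ort

llm = ort.InferenceSession("models/llm.onnx", providers=["CPUExecutionProvider"])
embeds = np.random.randn(1, 32, 1024).astype(np.float32)   # (batch, sequence_length, 1024)
mask = np.ones((1, 32), dtype=np.int64)                     # (batch, sequence_length)
logits = llm.run(None, {"inputs_embeds": embeds, "attention_mask": mask})[0]
next_token = int(logits[0, -1].argmax())                     # greedy choice at the last position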
Embedding Model (Optional)
- Input:
  - input_ids: (batch, sequence_length), token IDs (int64)
- Output:
  - embeddings: (batch, sequence_length, 1024), token embeddings
- Supports dynamic sequence length
- Purpose: Converts token IDs to embeddings, eliminating the need for the full PyTorch model during inference
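A rough sketch of where the embedding model fits: prompt token IDs are embedded with embedding.onnx and concatenated with audio embeddings from the Encoder+Adaptor. Random data stands in for the audio embeddings, and the exact prompt template and ordering used by inference.py are not reproduced here.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/Qwen3-0.6B")
emb = ort.InferenceSession("models/embedding.onnx", providers=["CPUExecutionProvider"])

prompt_ids = np.array([tok.encode("语音转写：")], dtype=np.int64)
prompt_embeds = emb.run(None, {emb.get_inputs()[0].name: prompt_ids})[0].astype(np.float32)

audio_embeds = np.random.randn(1, 50, 1024).astype(np.float32)  # stand-in for Encoder+Adaptor output
llm_inputs = np.concatenate([prompt_embeds, audio_embeds], axis=1)
print(llm_inputs.shape)   # (1, prompt_len + audio_len, 1024)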
GPU Acceleration
Make sure onnxruntime-gpu is installed:
pip install onnxruntime-gpu
Note: Due to CUDA provider issues with float16, the LLM model uses CPU by default. The Encoder+Adaptor model can use GPU if available.
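If you build ONNX Runtime sessions yourself, the per-model split described above could look like this sketch: CUDA (with CPU fallback) for the Encoder+Adaptor, CPU for the LLM.

import onnxruntime as ort

if "CUDAExecutionProvider" in ort.get_available_providers():
    encoder_providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
else:
    encoder_providers = ["CPUExecutionProvider"]

encoder_sess = ort.InferenceSession("models/encoder_adaptor.onnx", providers=encoder_providers)
llm_sess = ort.InferenceSession("models/llm.onnx", providers=["CPUExecutionProvider"])
print("Encoder providers:", encoder_sess.get_providers())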
Use GPU for Encoder+Adaptor (LLM uses CPU):
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--wave examples/zh.mp3 \
--device cuda
Use CPU for all models:
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--wave examples/zh.mp3 \
--device cpu
License
Please refer to the license of the original FunASR project.
Acknowledgments
- Based on the FunASR project.
- Code structure and ONNX export implementation inspired by sherpa-onnx.