Copied from https://github.com/Wasser1462/FunASR-nano-onnx/
FunASR-Nano ONNX
ONNX export and inference implementation for the FunASR-Nano model.
Requirements
- Python >= 3.8
- PyTorch >= 2.0
- ONNX Runtime >= 1.15
- transformers
- funasr (for feature extraction)
- modelscope (for downloading models)
Install dependencies:
pip install -r requirements.txt
pip install modelscope
Quick Start
1. Download Models
Download pre-trained ONNX models from ModelScope to the models/ directory:
modelscope download --model zengshuishui/FunASR-nano-onnx --output_dir models
After downloading, the models/ directory will contain:
- encoder_adaptor.onnx and encoder_adaptor.onnx.data
- llm.onnx and llm.onnx.data
- encoder_adaptor.int8.onnx and llm.int8.onnx (INT8 quantized versions)
- embedding.onnx and embedding.int8.onnx
2. Run Inference
With ONNX Embedding Model (Recommended):
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--embedding-model models/embedding.onnx \
--wave examples/zh.mp3 \
--prompt "è¯éŸ³è½¬å†™ï¼š" \
--max-new-tokens 512 \
--device auto
Without ONNX Embedding Model (requires model.safetensors):
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--wave examples/zh.mp3 \
--prompt "è¯éŸ³è½¬å†™ï¼š" \
--max-new-tokens 512 \
--device auto
Parameters:
- --device: Inference device; options: cpu, cuda, or auto (default: auto; the LLM runs on CPU by default due to CUDA float16 issues)
- --embedding-model: Path to the ONNX embedding model (optional; if not provided, the PyTorch model from --llm-tokenizer is used)
- --seed: Random seed for reproducible results (default: 42)
- --temperature: Sampling temperature (default: 0.3)
- --top-p: Top-p (nucleus) sampling threshold (default: 0.8; see the sampling sketch below)
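For reference, the following is a minimal sketch of how temperature and top-p (nucleus) sampling combine to pick the next token, illustrating what --temperature, --top-p, and --seed control. It is not taken from inference.py, and the function name sample_top_p is made up for the example.

import numpy as np

def sample_top_p(logits, temperature=0.3, top_p=0.8, rng=None):
    """Pick one token id from a 1-D logits vector with temperature + nucleus sampling."""
    rng = rng or np.random.default_rng(42)           # --seed makes runs reproducible
    scaled = logits / max(temperature, 1e-6)         # --temperature sharpens or flattens the distribution
    probs = np.exp(scaled - scaled.max())
    probs /= probs.sum()
    order = np.argsort(-probs)                       # most probable tokens first
    cumulative = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cumulative, top_p)) + 1]   # smallest set covering --top-p mass
    kept = probs[keep] / probs[keep].sum()
    return int(rng.choice(keep, p=kept))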
3. Export ONNX Models (Optional)
If you need to export ONNX models from the original model.pt:
Export Encoder+Adaptor
python scripts/export_encoder_adaptor_onnx.py \
--model-pt /path/to/model.pt \
--output-filename models/encoder_adaptor.onnx \
--opset-version 18
Export LLM
python scripts/export_llm_onnx.py \
--model-pt /path/to/model.pt \
--llm-config-path /path/to/Qwen3-0.6B \
--output-filename models/llm.onnx \
--opset-version 18
Export Embedding Layer (Optional, Recommended)
To avoid loading the full PyTorch model during inference, you can export the embedding layer to ONNX:
python scripts/export_embedding_onnx.py \
--llm-config-path /path/to/Qwen3-0.6B \
--output-filename models/embedding.onnx \
--opset-version 18 \
--verify
The --verify flag automatically verifies the exported model by comparing its outputs with the PyTorch model. This script will:
- Check that model.safetensors exists in the LLM config directory
- Export the embedding layer to ONNX format
- Create an INT8 quantized version
- Verify the exported model (if --verify is used)
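For illustration, the sketch below shows the kind of comparison --verify performs: the ONNX embedding output is checked against the PyTorch embedding layer. The token IDs are placeholders, the input name is read from the session rather than assumed, and this is not the script's actual code.

import numpy as np
import onnxruntime as ort
import torch
from transformers import AutoModelForCausalLM

llm_dir = "models/Qwen3-0.6B"                      # must contain model.safetensors for the reference
ids = np.array([[1, 2, 3, 4]], dtype=np.int64)     # arbitrary placeholder token IDs

sess = ort.InferenceSession("models/embedding.onnx", providers=["CPUExecutionProvider"])
onnx_out = sess.run(None, {sess.get_inputs()[0].name: ids})[0]

ref = AutoModelForCausalLM.from_pretrained(llm_dir)
with torch.no_grad():
    torch_out = ref.get_input_embeddings()(torch.from_numpy(ids)).numpy()

print("max abs diff:", np.abs(onnx_out.astype(np.float32) - torch_out.astype(np.float32)).max())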
Note: The embedding ONNX model eliminates the need for model.safetensors during inference, reducing memory usage and startup time. The INT8 quantized version further reduces model size while maintaining accuracy.
Model Description
Encoder+Adaptor Model
- Input: Audio features (batch, time, 560)
- Output: LLM embeddings (batch, time, 1024)
- Supports dynamic sequence length
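A minimal usage sketch with ONNX Runtime, using random features in place of real funasr features and reading the input name from the session rather than assuming it:

import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("models/encoder_adaptor.onnx",
                            providers=["CPUExecutionProvider"])
# Real features come from funasr feature extraction; random data stands in here.
feats = np.random.randn(1, 100, 560).astype(np.float32)   # (batch, time, 560)
# If the exported graph expects additional inputs (e.g. feature lengths),
# sess.get_inputs() will list them.
audio_embeds = sess.run(None, {sess.get_inputs()[0].name: feats})[0]
print(audio_embeds.shape)   # expected (1, time, 1024)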
LLM Model
- Input:
  - inputs_embeds: (batch, sequence_length, 1024)
  - attention_mask: (batch, sequence_length)
- Output:
  - logits: (batch, sequence_length, vocab_size)
- Supports dynamic sequence length
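A minimal sketch of a single forward pass; the float32/int64 dtypes are assumptions (check sess.get_inputs() for the exported graph's exact types), and the greedy argmax stands in for the temperature/top-p sampling that inference.py uses:

import numpy as np
import onnxruntime as ort

llm = ort.InferenceSession("models/llm.onnx", providers=["CPUExecutionProvider"])
embeds = np.random.randn(1, 32, 1024).astype(np.float32)   # (batch, sequence_length, 1024)
mask = np.ones((1, 32), dtype=np.int64)                     # (batch, sequence_length)
logits = llm.run(None, {"inputs_embeds": embeds, "attention_mask": mask})[0]
next_token = int(logits[0, -1].argmax())                     # greedy choice at the last position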
Embedding Model (Optional)
- Input:
  - input_ids: (batch, sequence_length), token IDs (int64)
- Output:
  - embeddings: (batch, sequence_length, 1024), token embeddings
- Supports dynamic sequence length
- Purpose: Converts token IDs to embeddings, eliminating the need for the full PyTorch model during inference
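A rough sketch of where the embedding model fits: prompt token IDs are embedded with embedding.onnx and concatenated with audio embeddings from the Encoder+Adaptor. Random data stands in for the audio embeddings, and the exact prompt template and ordering used by inference.py are not reproduced here.

import numpy as np
import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("models/Qwen3-0.6B")
emb = ort.InferenceSession("models/embedding.onnx", providers=["CPUExecutionProvider"])

prompt_ids = np.array([tok.encode("语音转写：")], dtype=np.int64)
prompt_embeds = emb.run(None, {emb.get_inputs()[0].name: prompt_ids})[0].astype(np.float32)

audio_embeds = np.random.randn(1, 50, 1024).astype(np.float32)  # stand-in for Encoder+Adaptor output
llm_inputs = np.concatenate([prompt_embeds, audio_embeds], axis=1)
print(llm_inputs.shape)   # (1, prompt_len + audio_len, 1024)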
GPU Acceleration
Make sure onnxruntime-gpu is installed:
pip install onnxruntime-gpu
Note: Due to CUDA provider issues with float16, the LLM model uses CPU by default. The Encoder+Adaptor model can use GPU if available.
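If you build ONNX Runtime sessions yourself, the per-model split described above could look like this sketch: CUDA (with CPU fallback) for the Encoder+Adaptor, CPU for the LLM.

import onnxruntime as ort

if "CUDAExecutionProvider" in ort.get_available_providers():
    encoder_providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
else:
    encoder_providers = ["CPUExecutionProvider"]

encoder_sess = ort.InferenceSession("models/encoder_adaptor.onnx", providers=encoder_providers)
llm_sess = ort.InferenceSession("models/llm.onnx", providers=["CPUExecutionProvider"])
print("Encoder providers:", encoder_sess.get_providers())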
Use GPU for Encoder+Adaptor (LLM uses CPU):
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--wave examples/zh.mp3 \
--device cuda
Use CPU for all models:
python inference.py \
--encoder-adaptor-model models/encoder_adaptor.onnx \
--llm-model models/llm.onnx \
--llm-tokenizer models/Qwen3-0.6B \
--wave examples/zh.mp3 \
--device cpu
License
Please refer to the license of the original FunASR project.
Acknowledgments
- Based on the FunASR project.
- Code structure and ONNX export implementation inspired by sherpa-onnx.