Speech Recognition (ASR)
Speech recognition (Automatic Speech Recognition, ASR) converts user speech into text. This project supports multiple speech recognition model implementations.
ASR-related configuration items are under `asr_config` in `conf.yaml`.
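The rough shape of this section looks like the sketch below. The key names are illustrative, shown only to orient you - follow the comments in your own `conf.yaml`:

```yaml
# conf.yaml (sketch; key names are illustrative)
asr_config:
  asr_model: 'sherpa_onnx_asr'      # hypothetical selector: which implementation below to use
  sherpa_onnx_asr:                  # per-implementation settings live in blocks like this
    provider: 'cpu'
  faster_whisper:
    model_path: 'large-v3-turbo'
```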
Here are the speech recognition options you can choose from:
sherpa_onnx_asr (Local & Project Default)
(Added in v0.5.0-alpha.1, PR: Add sherpa-onnx support #50)
sherpa-onnx is a feature-rich inference tool that can run various speech recognition (ASR) models.
Starting from version v1.0.0, this project uses sherpa-onnx to run the SenseVoiceSmall (int8 quantized) model as the default speech recognition solution. This is an out-of-the-box configuration - you don't need any additional setup. The system will automatically download and extract the model files to the project's `models` directory on first run.
Recommended Users
- All users (hence it's the default)
- Especially Mac users (due to limited options)
- Non-NVIDIA GPU users
- Chinese users
- Fast CPU inference
- Configuration difficulty: No configuration needed as it's the project default
The SenseVoiceSmall model may have average English performance.
CUDA Inference
`sherpa-onnx` supports both CPU and CUDA inference. While the default SenseVoiceSmall model performs well on CPU, if you have an NVIDIA GPU, you can enable CUDA inference for better performance by following these steps:
- First uninstall the CPU version dependencies:

```bash
uv remove sherpa-onnx onnxruntime
# Avoid introducing onnxruntime through dependencies
uv remove faster-whisper
```
Note that sherpa-onnx is installed via pre-built wheels in the example below, which means you need to install CUDA Toolkit 11.x + cuDNN 8.x for CUDA 11.x to link against the correct CUDA environment (and add `%SystemDrive%\Program Files\NVIDIA\CUDNN\v8.x\bin` to your `PATH`, where x is your cuDNN minor version number - e.g., for v8.9.7, use `v8.9`).
If you don't want to use the official NVIDIA installer / manually set `PATH`, consider using pixi to manage a local conda environment. This approach doesn't require you to install the dependencies via uv:

```bash
pixi remove --pypi onnxruntime sherpa-onnx
pixi add --pypi onnxruntime-gpu==1.17.1 pip
pixi run python -m pip install sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
```
- Install the CUDA versions of the `sherpa-onnx` and `onnxruntime-gpu` dependencies:

```bash
# sherpa-onnx pre-built wheels are compatible with onnxruntime-gpu==1.17.1
uv add onnxruntime-gpu==1.17.1 sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
```
- Modify the configuration file: in `conf.yaml`, find the `sherpa_onnx_asr` section and set `provider` to `cuda`, as in the sketch below.
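A minimal sketch of that change (leave the other `sherpa_onnx_asr` fields at their current values):

```yaml
sherpa_onnx_asr:
  provider: 'cuda'  # default is 'cpu'; requires the CUDA build of sherpa-onnx installed above
```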
Using Other sherpa-onnx Models
If you want to try other speech recognition models:
- Download the required model from sherpa-onnx ASR models
- Place the model files in the project's `models` directory
- Modify the relevant configuration of `sherpa_onnx_asr` according to the instructions in `conf.yaml` (see the sketch after this list)
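For illustration only - the exact keys depend on the model family you downloaded, and the field names below are assumptions; follow the comments in `conf.yaml`:

```yaml
sherpa_onnx_asr:
  model_type: 'paraformer'                          # hypothetical: match the downloaded model's family
  paraformer: './models/your-model-dir/model.onnx'  # hypothetical paths under the models directory
  tokens: './models/your-model-dir/tokens.txt'
  provider: 'cpu'                                   # or 'cuda' (see above)
```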
fun_asr (Local)
FunASR is a fundamental end-to-end speech recognition toolkit from ModelScope that supports various ASR models. Among them, Alibaba's FunAudioLLM SenseVoiceSmall model performs well in both performance and speed.
Although FunASR can run the SenseVoiceSmall model, we recommend the project default, `sherpa_onnx_asr`. The FunASR project has some stability issues and may hit exceptions on certain devices. However, FunASR makes better use of the GPU, so it may be faster for NVIDIA GPU users.
Recommended Users
- Users with NVIDIA GPUs who want to utilize GPU inference for the SenseVoiceSmall model
- Chinese users
- Fast CPU inference
- Configuration difficulty: Simple
SenseVoiceSmall may have average English performance.
Installation
In the project directory, run:
```bash
uv add funasr modelscope huggingface_hub onnxconverter_common torch torchaudio onnx
```
If you encounter a dependency resolution error like the following:

```
help: `llvmlite` (v0.36.0) was included because `open-llm-vtuber` (v1.0.0a1) depends on `funasr` (v1.2.2) which depends on `umap-learn` (v0.5.7) which depends on `pynndescent` (v0.5.13) which depends on `llvmlite`
```

you can try the following command instead:

```bash
uv pip install funasr modelscope huggingface_hub torch torchaudio onnx onnxconverter_common
```
Even if the model files are already local, FunASR still requires an internet connection at startup by default.
Solution: specify the model's local path directly in the configuration, as sketched below, so no internet connection is needed at runtime. However, you need to download the model files in advance. See FunASR Issue #1897 for details.
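A minimal sketch of that workaround, assuming the `fun_asr` section accepts a local path in place of a model id (key names are illustrative):

```yaml
fun_asr:
  model_name: './models/SenseVoiceSmall'  # hypothetical local path instead of a ModelScope model id
  device: 'cuda'                          # or 'cpu'
```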
faster_whisper (Local)
This is an optimized Whisper inference engine that can run the original Whisper models as well as distilled Whisper models, with faster inference than the original Whisper.
Faster Whisper does not support Mac GPU inference; on a Mac it can only run on the CPU, with average performance. It's recommended for devices equipped with NVIDIA GPUs for optimal performance.
Recommended Users
- Users with NVIDIA GPUs who want to utilize GPU inference for Whisper models
- Non-Chinese users. Whisper series models have good multilingual support
- CPU inference is relatively slow
- Configuration difficulty: Simple
Installation and Configuration
If you want to use GPU acceleration (NVIDIA GPU users only), you need to install the NVIDIA dependency libraries (cuBLAS and cuDNN). For detailed installation steps, please refer to Quick Start.
If you don't care much about running speed or have a powerful CPU, you can also set the `device` parameter of `faster_whisper` to `cpu` in the `conf.yaml` configuration file. This avoids the hassle of installing the NVIDIA dependency libraries.
```yaml
# Faster Whisper Configuration
faster_whisper:
  model_path: 'large-v3-turbo'    # Model path, model name, or HF hub model id
  download_root: 'models/whisper' # Model download root directory
  language: 'zh'                  # Language: en, zh, or others. Leave empty for auto-detection
  device: 'auto'                  # Device: cpu, cuda, or auto. faster-whisper doesn't support mps
  compute_type: 'int8'
```
Model Selection (`model_path`)
`model_path` can be a model name, a local path to the model (if you downloaded it in advance), or a model id on the Hugging Face Hub (which must be a model already converted to CTranslate2 format).
Available model names: `tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `distil-small.en`, `medium`, `medium.en`, `distil-medium.en`, `large-v1`, `large-v2`, `large-v3`, `large`, `distil-large-v2`, `distil-large-v3`, `large-v3-turbo`, `turbo`
The distil series models may only support English.
The selected model will be automatically downloaded from Hugging Face to the `models/whisper` folder in the project directory.
Test results on an RTX 4060 (thanks to Lena from the QQ group for providing the results in #187, #188)
Using a 22-second generated audio clip, tested with int8 on a 13th-gen i5 and an RTX 4060 8GB, with CUDA 12.8 and cuDNN 9.8:
- CPU: v3-turbo took 5.98 seconds, small took 1.56 seconds
- GPU: v3-turbo took 1.04 seconds, small took 0.48 seconds
Summary:
- Without a 4060, choose small: medium and v3-turbo are similar in size, so small is likely the best recognition quality you can get while keeping speed acceptable on 20/30-series cards.
- With a 4060, choose v3-turbo: if speed is not an issue, higher accuracy is naturally better.
- Accuracy reference: faster-whisper-small has 244M parameters; faster-whisper-v3-turbo has 809M parameters.
Test results on a MacBook Pro M1 Pro:
Don't even try, it's very slow. Using whisper_cpp with CoreML acceleration, or the SenseVoiceSmall model, would be much faster.
Hugging Face model id format
`"username/whisper-large-v3-ct2"`
Note that Faster Whisper requires models that have already been converted to CTranslate2 format.
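For example, a hypothetical CTranslate2-format repo id plugged into the configuration above (CTranslate2's `ct2-transformers-converter` tool can produce such a conversion):

```yaml
faster_whisper:
  model_path: 'username/whisper-large-v3-ct2'  # hypothetical Hugging Face repo id, already in CTranslate2 format
  download_root: 'models/whisper'
```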
whisper_cpp (Local)
- `whisper_cpp` can be accelerated through CoreML on macOS for faster inference speed
- When running on CPU or NVIDIA GPU, performance may not be as good as Faster Whisper
- Mac users: please refer to the instructions below to configure whisper.cpp with CoreML support; if you need to use CPU or an NVIDIA GPU, just run `pip install pywhispercpp` to install it
Recommended Users
- Mac users who want to utilize GPU inference for Whisper series models
- Chinese users
- CPU inference is relatively slow, GPU is needed
- Configuration difficulty: Setting up GPU acceleration might be a bit challenging
Installation
- NVIDIA GPU:

```bash
GGML_CUDA=1 uv pip install git+https://github.com/absadiki/pywhispercpp
```

- macOS (CoreML):

```bash
WHISPER_COREML=1 uv pip install git+https://github.com/absadiki/pywhispercpp
```

- Vulkan:

```bash
GGML_VULKAN=1 pip install git+https://github.com/absadiki/pywhispercpp
```
CoreML Configuration
- Method 1: Follow the whisper.cpp repository documentation to convert Whisper models to CoreML format
- Method 2: Download pre-converted CoreML models from the Hugging Face repository. Note: after downloading, you need to extract the model files, otherwise the program cannot load them and will crash.
- Configuration note: when configuring models in `conf.yaml`, you don't need to include the special prefix in the filename. For example, when the CoreML model filename is `ggml-base-encoder.mlmodelc`, you only need to fill in `base` in the `model_name` parameter of `whisper_cpp` (see the sketch after this list).
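So for `ggml-base-encoder.mlmodelc`, the relevant part of the configuration would look roughly like this (other keys omitted):

```yaml
whisper_cpp:
  model_name: 'base'  # for ggml-base-encoder.mlmodelc; no 'ggml-' prefix or '-encoder.mlmodelc' suffix
```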
whisper (Local)
OpenAI's original Whisper. Install it with `uv pip install -U openai-whisper`. Very slow inference speed.
Recommended Users
- Not recommended
groq_whisper_asr (Online, requires API key, but easy to register with a generous free quota)
Groq's Whisper endpoint is very accurate (supports multiple languages) and fast, with many free uses per day. It's pre-installed. Get an API key from Groq and add it to the `groq_whisper_asr` settings in `conf.yaml`, as sketched below. Users in mainland China and other unsupported regions need a proxy to use it (the Hong Kong region may not be supported).
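A minimal sketch, assuming the `groq_whisper_asr` section takes the key directly (field names are illustrative):

```yaml
groq_whisper_asr:
  api_key: 'your-groq-api-key'  # from the Groq console
  model: 'whisper-large-v3'     # hypothetical; check conf.yaml for supported values
```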
Recommended Users
- Users who accept using online speech recognition
- Multilingual users
- No local computation, very fast speed (depends on your network speed)
- Configuration difficulty: Simple
azure_asr (Online, requires API key)
- Azure Speech Recognition
- Configure the API key and region under the `azure_asr` option, as sketched below
`api_key.py` has been deprecated after v0.2.5. Please set API keys in `conf.yaml`.
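A minimal sketch of the `azure_asr` section (field names are illustrative):

```yaml
azure_asr:
  api_key: 'your-azure-speech-key'
  region: 'eastus'  # the region of your Azure Speech resource
```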
Recommended Users
- People who have Azure API keys (Azure accounts are not easy to register)
- Multilingual users
- No local computation, very fast speed (depends on your network speed)
- Configuration difficulty: Simple