A local, OpenAI-compatible speech recognition API service using the Whisper model. This service provides a straightforward way to transcribe audio files in various formats with high accuracy and is designed to be compatible with the OpenAI Whisper API.
- 🔊 High-quality speech recognition using Whisper model
- 🌐 OpenAI-compatible API endpoints
- 🚀 Hardware acceleration support (CUDA, MPS)
- ⚡ Flash Attention 2 for faster transcription on compatible GPUs
- 🎛️ Audio preprocessing for better transcription results
- 🔄 Multiple input formats (file upload, URL, base64, local files)
- 🚪 Easy deployment with Docker or conda environment
- Python 3.10+ (3.11 recommended)
- CUDA-compatible GPU (optional, for faster processing)
- FFmpeg and SoX for audio processing
- Clone the repository:

  ```shell
  git clone https://github.com/yourusername/whisper-api-service.git
  cd whisper-api-service
  ```
- Run the server script with the `--update` flag to create and set up the conda environment:

  ```shell
  chmod +x server.sh
  ./server.sh --update
  ```
This will:
- Create a conda environment named "transcribe" with Python 3.11
- Install all required dependencies
- Start the service
- Create and activate a conda environment:

  ```shell
  conda create -n transcribe python=3.11
  conda activate transcribe
  ```

- Install the required dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Start the service:

  ```shell
  python server.py
  ```
The service is configured through the `config.json` file:

```json
{
  "service_port": 5042,
  "model_path": "/mnt/cloud/llm/whisper/whisper-large-v3-russian",
  "language": "russian",
  "chunk_length_s": 30,
  "batch_size": 16,
  "max_new_tokens": 256,
  "return_timestamps": false,
  "norm_level": "-0.5",
  "compand_params": "0.3,1 -90,-90,-70,-70,-60,-20,0,0 -5 0 0.2"
}
```
| Parameter | Description |
|---|---|
| `service_port` | Port on which the service will run |
| `model_path` | Path to the Whisper model directory |
| `language` | Language for transcription (e.g., "russian", "english") |
| `chunk_length_s` | Length of audio chunks for processing (in seconds) |
| `batch_size` | Batch size for processing |
| `max_new_tokens` | Maximum number of new tokens in the model output |
| `return_timestamps` | Whether to return timestamps in the transcription |
| `audio_rate` | Audio sampling rate in Hz |
| `norm_level` | Normalization level for audio preprocessing |
| `compand_params` | Parameters for audio compression/expansion |
```shell
# Health check
curl http://localhost:5042/health

# Current configuration
curl http://localhost:5042/config
```
```shell
# Transcribe an uploaded file
curl -X POST http://localhost:5042/v1/audio/transcriptions \
  -F "file=@audio.mp3"
```
```shell
# Transcribe audio from a URL
curl -X POST http://localhost:5042/v1/audio/transcriptions/url \
  -H "Content-Type: application/json" \
  -d '{"url":"https://example.com/audio.mp3"}'
```
```shell
# Transcribe base64-encoded audio
curl -X POST http://localhost:5042/v1/audio/transcriptions/base64 \
  -H "Content-Type: application/json" \
  -d '{"file":"base64_encoded_audio_data"}'
```
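The request body for the base64 endpoint can be built with the Python standard library. This is a minimal sketch; the helper name `build_base64_payload` is illustrative and not part of the service:

```python
import base64
from pathlib import Path


def build_base64_payload(audio_path):
    """Encode an audio file for the /v1/audio/transcriptions/base64 endpoint.

    Returns a dict matching the JSON body shown in the curl example above.
    """
    raw = Path(audio_path).read_bytes()
    return {"file": base64.b64encode(raw).decode("ascii")}
```

POST the resulting dict as JSON with a `Content-Type: application/json` header, exactly as in the curl example.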
```shell
# Transcribe a file already on the server's filesystem
curl -X POST http://localhost:5042/local/transcriptions \
  -H "Content-Type: application/json" \
  -d '{"file_path":"/path/to/audio.mp3"}'
```
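The same call can be made from Python without third-party dependencies. A sketch using `urllib` (the helper name and the default base URL, which assumes the default `service_port` of 5042, are illustrative):

```python
import json
import urllib.request


def make_local_request(file_path, base_url="http://localhost:5042"):
    """Build a POST request for the /local/transcriptions endpoint.

    The request is only constructed here; pass it to urllib.request.urlopen
    to actually send it while the service is running.
    """
    body = json.dumps({"file_path": file_path}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/local/transcriptions",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Usage: `urllib.request.urlopen(make_local_request("/path/to/audio.mp3"))` returns the JSON transcription response when the service is up.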
The project consists of the following components:

- `server.py`: Entry point that initializes and starts the service
- `server.sh`: Bash script for launching the server with optional conda environment update
- `config.json`: Service configuration file
- `requirements.txt`: Project dependencies for conda/pip
- `app/`: Main application module
  - `__init__.py`: Contains the `WhisperServiceAPI` class for service initialization
  - `logger.py`: Logging configuration
  - `transcriber.py`: Contains the `WhisperTranscriber` class for speech recognition
  - `audio_processor.py`: Contains the `AudioProcessor` class for audio preprocessing
  - `audio_sources.py`: Contains the `AudioSource` abstract class and implementations
  - `routes.py`: Contains the API route definitions
You can use any Whisper model by changing the `model_path` in the configuration:

- Download a model from Hugging Face (e.g., `openai/whisper-large-v3`)
- Update the `model_path` in `config.json`
- Restart the service
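The `config.json` update in the second step can be scripted; a standard-library sketch (the helper name `set_model_path` is illustrative, not part of the service):

```python
import json
from pathlib import Path


def set_model_path(new_path, config_file="config.json"):
    """Point the service at a different Whisper model directory.

    Rewrites config.json in place; the service must be restarted
    afterwards for the change to take effect.
    """
    cfg_path = Path(config_file)
    cfg = json.loads(cfg_path.read_text(encoding="utf-8"))
    cfg["model_path"] = new_path
    cfg_path.write_text(json.dumps(cfg, indent=2), encoding="utf-8")
    return cfg
```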
For Russian language transcription, we recommend using the whisper-large-v3-russian model from Hugging Face. This model is fine-tuned specifically for Russian speech recognition and delivers high accuracy. For faster transcription with slightly lower accuracy, consider the whisper-large-v3-turbo-russian model, which is optimized for speed.
The service automatically selects the best available compute device:
- CUDA GPU (index 1 if available, otherwise index 0)
- Apple Silicon MPS (for Mac with M1/M2/M3 chips)
- CPU (fallback)
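The priority order above can be expressed as a small pure function. In the real service the availability flags would come from PyTorch (e.g., `torch.cuda.device_count()` and `torch.backends.mps.is_available()`); they are plain arguments here so the selection logic can be shown without a torch dependency:

```python
def select_device(cuda_count, mps_available):
    """Mirror the documented selection order: CUDA index 1 when a second
    GPU exists, otherwise CUDA index 0, then Apple MPS, then CPU."""
    if cuda_count >= 2:
        return "cuda:1"
    if cuda_count == 1:
        return "cuda:0"
    if mps_available:
        return "mps"
    return "cpu"
```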
For best performance on NVIDIA GPUs, Flash Attention 2 is used when available.
If you encounter audio processing errors:
- Ensure that FFmpeg and SoX are installed on your system
- Check that the audio file is not corrupted
- Try different audio preprocessing parameters in the configuration
For slow transcription:
- Use a GPU if available
- Adjust the `chunk_length_s` and `batch_size` parameters
- Consider using a smaller Whisper model
- OpenAI for the Whisper model
- Hugging Face for model distribution and the transformers library