The general purpose of this repository is to support real-time generation with open source TTS (text-to-speech) models across common device architectures using the GGML tensor library. Rapid STT (speech-to-text), embedding generation, and LLM generation are well supported on GGML (via whisper.cpp and llama.cpp respectively). As such, this repo seeks to complement those functionalities with a similarly optimized and portable TTS library.
In this endeavor, macOS with Metal support will be treated as the primary platform; functionality will initially be developed for macOS and later extended to other operating systems.
Warning! Currently TTS.cpp should be treated as a proof of concept and is subject to further development. Existing functionality has not been tested outside of a macOS environment.
Currently Parler TTS Mini v1.0 and Parler TTS Large v1.0 are the only supported TTS models.
Additional model support will initially be added based on open source model performance in the TTS model arena and the availability of those models' architectures and checkpoints.
Planned Functionality | OS X | Linux | Windows |
---|---|---|---|
Basic CPU Generation | ✓ | ✓ | ✗ |
Metal Acceleration | ✓ | _ | _ |
CUDA support | _ | ✗ | ✗ |
Quantization | ✓* | ✗ | ✗ |
Layer Offloading | ✗ | ✗ | ✗ |
Server Support | ✓ | ✗ | ✗ |
Vulkan Support | _ | ✗ | ✗ |
Kompute Support | _ | ✗ | ✗ |
Streaming Audio | ✗ | ✗ | ✗ |
* Currently only the generative model supports these.
WARNING! This library is currently only supported on OS X.
- Local GGUF format model file (see py-gguf for information on how to convert the Hugging Face model to GGUF).
- C++17 and C17
  - XCode Command Line Tools (via `xcode-select --install`) should suffice for OS X
- CMake (>=3.14)
- GGML pulled locally
  - this can be accomplished via `git clone -b support-for-tts [email protected]:mmwillet/ggml.git`
Assuming that the above requirements are met, the library and basic CLI example can be built by running the following commands in the repository's base directory:
cmake -B build
cmake --build build --config Release
The CLI executable will be in the `./build/cli` directory and the compiled library will be in the `./build/src` directory (currently it is named parler as that is the only supported model).
See the CLI example readme for more details on its general usage.
Given that the central goal of this library is to support real time speech generation on OS X, generation speed has only been rigorously tested in that environment with supported models (i.e. Parler Mini version 1.0).
With the introduction of Metal acceleration support for the DAC audio decoder model, text-to-speech generation is nearly possible in real time on a standard Apple M1 Max with ~3GB memory overhead. The best real-time factor for accelerated models is currently 1.112033. This means that for every second of generated audio, the accelerated models require approximately 1.112033 seconds of generation time (with Q5_0 quantization applied to the generative model). For the latest stats via the performance battery see the readme therein.