
# TTS.cpp

Roadmap / Modified GGML

## Purpose and Goals

The general purpose of this repository is to support real time generation with open source TTS (text to speech) models across common device architectures using the GGML tensor library. Rapid STT (speech to text), embedding generation, and LLM generation are already well supported by GGML (via whisper.cpp and llama.cpp respectively). As such, this repo seeks to complement those capabilities with a similarly optimized and portable TTS library.

In this endeavor, macOS with Metal support will be treated as the primary platform; functionality will initially be developed for macOS and later extended to other operating systems.

## Supported Functionality

Warning! Currently TTS.cpp should be treated as a proof of concept and is subject to further development. Existing functionality has not been tested outside of a macOS environment.

### Model Support

Currently, Parler TTS Mini v1.0 and Parler TTS Large v1.0 are the only supported TTS models.

Additional model support will initially be added based on open source model performance in the TTS model arena and the availability of those models' architectures and checkpoints.

### Functionality

| Planned Functionality | macOS | Linux | Windows |
|-----------------------|-------|-------|---------|
| Basic CPU Generation  | ✅    |       |         |
| Metal Acceleration    | ✅    | _     | _       |
| CUDA support          | _     |       |         |
| Quantization          | ✅*   |       |         |
| Layer Offloading      |       |       |         |
| Server Support        |       |       |         |
| Vulkan Support        | _     |       |         |
| Kompute Support       | _     |       |         |
| Streaming Audio       |       |       |         |

\* Currently only the generative model supports this.

## Installation

WARNING! This library is currently only supported on macOS.

Requirements:

* Local GGUF format model file (see py-gguf for information on how to convert the Hugging Face model to GGUF).
* C++17 and C17
  * Xcode Command Line Tools (via `xcode-select --install`) should suffice for macOS.
* CMake (>=3.14)
* GGML pulled locally
  * this can be accomplished via `git clone -b support-for-tts [email protected]:mmwillet/ggml.git` (see the setup sketch below).
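For orientation, a minimal end-to-end setup might look like the following sketch; the side-by-side directory layout is an assumption rather than a documented requirement, so adjust paths to wherever your build expects the GGML checkout:

```bash
# Sketch of a possible setup, assuming TTS.cpp and the modified GGML fork
# are cloned side by side (this layout is an assumption, not a requirement).
git clone [email protected]:mmwillet/TTS.cpp.git
git clone -b support-for-tts [email protected]:mmwillet/ggml.git
cd TTS.cpp
```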

Build:

Assuming that the above requirements are met, the library and basic CLI example can be built by running the following commands in the repository's base directory:

```bash
cmake -B build
cmake --build build --config Release
```

The CLI executable will be in the `./build/cli` directory and the compiled library will be in `./build/src` (currently it is named `parler`, as that is the only supported model).

## Usage

See the CLI example readme for more details on its general usage.
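For illustration, a typical invocation might look like the sketch below; the flag names shown here are assumptions for the sake of example, not the confirmed interface, so consult the CLI readme for the actual options:

```bash
# Hypothetical CLI invocation; the flags below are illustrative assumptions,
# not the confirmed interface -- see the CLI example readme for real options.
./build/cli/cli \
  --model-path /path/to/parler-tts-mini-v1.gguf \
  --prompt "Hello from TTS.cpp." \
  --save-path /tmp/output.wav
```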

## Performance

Given that the central goal of this library is to support real time speech generation on macOS, generation speed has only been rigorously tested in that environment with supported models (i.e. Parler TTS Mini v1.0).

With the introduction of Metal acceleration support for the DAC audio decoder model, text to speech generation is nearly possible in real time on a standard Apple M1 Max with ~3GB of memory overhead. The best real time factor for accelerated models is currently 1.112033: for every second of generated audio, the accelerated models require approximately 1.112033 seconds of generation time (with Q5_0 quantization applied to the generative model). For the latest stats, see the performance battery readme.
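To make the real time factor concrete, here is a small worked example; the numbers are illustrative placeholders, not measurements:

```bash
# Real time factor (RTF) = generation time / audio duration; RTF <= 1.0
# means generation keeps up with playback. Numbers below are illustrative.
generation_time=11.12   # seconds spent generating
audio_duration=10.00    # seconds of audio produced
echo "scale=4; $generation_time / $audio_duration" | bc   # prints 1.1120
```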