# ChatTTS
[**English**](./README.md) | [**中文简体**](./README_CN.md)

## To be Finished
ChatTTS is a text-to-speech model designed specifically for dialogue scenarios such as LLM assistants. It supports both English and Chinese. Our model is trained on 100,000+ hours of Chinese and English data. The open-source version on HuggingFace is a 40,000-hour pretrained model without SFT.

For formal inquiries about the model and roadmap, please contact us at [email protected]. You can join our QQ group, 808364215, for discussion. Opening GitHub issues is always welcome.
---
## Highlights
1. **Conversational TTS**: ChatTTS is optimized for dialogue-based tasks, enabling natural and expressive speech synthesis. It supports multiple speakers, facilitating interactive conversations.
2. **Fine-grained Control**: The model can predict and control fine-grained prosodic features, including laughter, pauses, and interjections.
3. **Better Prosody**: ChatTTS surpasses most open-source TTS models in terms of prosody. We provide pretrained models to support further research and development.

---
## Disclaimer

This repo is for academic purposes only. It is intended for educational and research use, and should not be used for any commercial or legal purposes. The authors do not guarantee the accuracy, completeness, or reliability of the information. The information and data used in this repo are for academic and research purposes only. The data were obtained from publicly available sources, and the authors do not claim any ownership or copyright over them.

ChatTTS is a powerful text-to-speech system. However, it is very important to use this technology responsibly and ethically. To limit misuse of ChatTTS, we added a small amount of high-frequency noise during the training of the 40,000-hour model and compressed the audio quality as much as possible using the MP3 format, to prevent malicious actors from using it for criminal purposes. At the same time, we have internally trained a detection model and plan to open-source it in the future.
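
The detection model itself has not been released, but as a rough, unofficial illustration of what the audible watermark implies, one could inspect how much of a clip's energy sits in a high band. The 10 kHz cutoff and the heuristic below are our assumptions, not part of ChatTTS:

```python
import numpy as np

def high_band_energy_ratio(wav, sr=24_000, cutoff_hz=10_000):
    """Fraction of spectral energy above cutoff_hz (a hypothetical heuristic, not the official detector)."""
    wav = np.asarray(wav, dtype=np.float32).squeeze()
    spectrum = np.abs(np.fft.rfft(wav)) ** 2
    freqs = np.fft.rfftfreq(len(wav), d=1.0 / sr)
    return float(spectrum[freqs >= cutoff_hz].sum() / (spectrum.sum() + 1e-12))
```

An unusually high ratio on otherwise clean speech might hint at the added noise, but this is only a sketch of the idea, not a reliable detector.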

---
## Usage

<h4>basic usage</h4>

```python
import torch
import ChatTTS
from IPython.display import Audio

# Load the pretrained ChatTTS models.
chat = ChatTTS.Chat()
chat.load_models()

texts = ["<PUT YOUR TEXT HERE>",]

# Synthesize one waveform per input text.
wavs = chat.infer(texts, use_decoder=True)

# Play the first result (24 kHz) in a notebook.
Audio(wavs[0], rate=24_000, autoplay=True)
```
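
Outside a notebook you may want to write the result to disk instead of playing it. A minimal sketch, assuming `wavs[0]` is a float waveform sampled at 24 kHz as above; the file name and the use of SciPy are our choices, not part of the ChatTTS API:

```python
import numpy as np
from scipy.io import wavfile

# Drop a possible leading channel axis and save as a 32-bit float WAV at 24 kHz.
# "output.wav" is an arbitrary name; SciPy is our choice here, not a ChatTTS dependency.
wav = np.asarray(wavs[0], dtype=np.float32).squeeze()
wavfile.write("output.wav", 24_000, wav)
```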

<h4>advanced usage</h4>

```python
###################################
# Sample a random speaker embedding from the Gaussian speaker statistics.
import torch
std, mean = torch.load('ChatTTS/asset/spk_stat.pt').chunk(2)
rand_spk = torch.randn(768) * std + mean

params_infer_code = {
    'spk_emb': rand_spk,  # use the sampled speaker
    'temperature': .3,    # custom sampling temperature
    'top_P': 0.7,         # top-P decoding
    'top_K': 20,          # top-K decoding
}

###################################
# Sentence-level manual control.

# Use oral_(0-9), laugh_(0-2), break_(0-7)
# to insert special control tokens into the text to synthesize.
params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_6]'
}

wav = chat.infer("<PUT YOUR TEXT HERE>", params_refine_text=params_refine_text, params_infer_code=params_infer_code)

###################################
# Word-level manual control.
text = 'What is [uv_break]your favorite english food?[laugh][lbreak]'
wav = chat.infer(text, skip_refine_text=True, params_infer_code=params_infer_code)
```
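
Because `spk_emb` controls the speaker identity, you can keep a voice you like across sessions. A small sketch under that assumption; the file name is hypothetical, not part of the repo:

```python
import torch

# Save the sampled speaker embedding once ('my_speaker.pt' is an arbitrary name)...
torch.save(rand_spk, 'my_speaker.pt')

# ...and reload it later to reuse the same voice.
rand_spk = torch.load('my_speaker.pt')
params_infer_code = {'spk_emb': rand_spk, 'temperature': .3, 'top_P': 0.7, 'top_K': 20}
```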

<details open>
<summary><h4>Example: self introduction</h4></summary>

```python
inputs_en = """
chat T T S is a text to speech model designed for dialogue applications.
[uv_break]it supports mixed language input [uv_break]and offers multi speaker
capabilities with precise control over prosodic elements [laugh]like like
[uv_break]laughter[laugh], [uv_break]pauses, [uv_break]and intonation.
[uv_break]it delivers natural and expressive speech,[uv_break]so please
[uv_break] use the project responsibly at your own risk.[uv_break]
""".replace('\n', '')  # English is still experimental.

params_refine_text = {
    'prompt': '[oral_2][laugh_0][break_4]'
}
# A Chinese input can be synthesized the same way with its own text.
audio_array_en = chat.infer(inputs_en, params_refine_text=params_refine_text)
```
[male speaker](https://github.com/2noise/ChatTTS/assets/130631963/e0f51251-db7f-4d39-a0e9-3e095bb65de1)

[female speaker](https://github.com/2noise/ChatTTS/assets/130631963/f5dcdd01-1091-47c5-8241-c4f6aaaa8bbd)
</details>

---
## Roadmap
- [x] Open-source the 40k-hour base model and spk_stats file
- [ ] Open-source the VQ encoder and LoRA training code
- [ ] Streaming audio generation without refining the text*
- [ ] Open-source the 40k-hour version with multi-emotion control
- [ ] ChatTTS.cpp, maybe? (PRs or a new repo are welcome.)

---
## FAQ

##### How much VRAM do I need? What about inference speed?
For a 30-second audio clip, at least 4 GB of GPU memory is required. On a 4090D GPU, the model can generate audio corresponding to approximately 7 semantic tokens per second. The real-time factor (RTF) is around 0.65.
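
If you want to check the figure on your own hardware, RTF is simply synthesis time divided by the duration of the generated audio. A minimal sketch under that definition; the timing harness is ours, not part of the repo:

```python
import time
import numpy as np

# Time one inference call (timing code is our own, not a ChatTTS utility).
start = time.time()
wavs = chat.infer(["<PUT YOUR TEXT HERE>"], use_decoder=True)
elapsed = time.time() - start

# Duration in seconds = number of samples / sample rate (24 kHz).
duration = len(np.asarray(wavs[0]).squeeze()) / 24_000
print(f"RTF = {elapsed / duration:.2f}")  # e.g. around 0.65 on a 4090D
```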

##### Model stability is not good enough, with issues such as multiple speakers or poor audio quality.

This is a problem that typically occurs with autoregressive models (e.g. Bark and VALL-E) and is generally difficult to avoid. You can try multiple samples and keep a suitable result, as in the sketch below.
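
A simple workaround sketch (our own suggestion, not an official feature): synthesize the same text several times and keep the candidate you judge best.

```python
# Generate three candidates for the same text; listen to each and pick the most stable one.
text = ["<PUT YOUR TEXT HERE>"]
candidates = [chat.infer(text, use_decoder=True)[0] for _ in range(3)]
best = candidates[0]  # replace with whichever candidate sounds best
```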

##### Besides laughter, can we control anything else? Can we control other emotions?

In the currently released model, the only token-level control units are [laugh], [uv_break], and [lbreak]. In future versions, we may open-source models with additional emotional control capabilities.

--- | ||
## Acknowledgements | ||
- [bark](https://github.com/suno-ai/bark), [TTSv2](https://github.com/coqui-ai/TTS) and [valle](https://arxiv.org/abs/2301.02111) demonstrate remarkable TTS results with autoregressive-style systems.
- [fish-speech](https://github.com/fishaudio/fish-speech) reveals the capability of GVQ as an audio tokenizer for LLM modeling.
- [vocos](https://github.com/gemelo-ai/vocos) is used as the pretrained vocoder.