Welcome to VoxShift, a university project at KIT that delves into the fascinating world of voice conversion. Our model strives to alter the vocal characteristics from one speaker to another while maintaining the clarity of the spoken message. The results from our experiments have been sometimes surprising, occasionally educational, and always a learning experience. We encourage you to check out the 'Demo' section, run the code, and engage in the discussions.
- 🚀 Quick Starting Guide
- 💾 Dataset
- 🏗️ Model Architecture
- 🏋️‍♂️ Training
- 📊 Results
- 🎧 Demo
- 📚 Sources
Our code has been developed, tested, and optimized for Google Colab. For a quick start, please access our Google Drive folder, where all scripts and prepared data are stored:
- Create a shortcut to the VC folder from our Google Drive in your Drive's root for path compatibility (Help from StackOverflow).
- Grant Colab permission to access your Drive files when prompted during mounting (tick the checkboxes); a minimal mounting sketch follows after this list.
- Begin with Training.ipynb and Evaluation.ipynb in our Google Drive folder for a quick overview.
- Consult the README.md in our Drive folder for detailed insights.
- For local runs, download the VCTK dataset yourself and utilize our preparation scripts in this GitHub repo. Adjust paths in the code as necessary for local environment compatibility.
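For reference, mounting Drive in a Colab notebook typically looks like the sketch below. The path assumes the VC shortcut sits directly in your Drive root; adjust it if you placed the shortcut elsewhere.

```python
# Minimal sketch of the Colab setup described above; the folder name "VC" follows
# the shortcut created in step 1, and the exact path inside Drive is an assumption.
from google.colab import drive

drive.mount("/content/drive")            # grant access to Drive when prompted
VC_ROOT = "/content/drive/MyDrive/VC"    # shortcut placed in the Drive root
```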
For our project's training phase, we utilized the Voice Cloning Toolkit (VCTK) dataset, courtesy of the Centre for Speech Technology Research at the University of Edinburgh. The dataset is well suited to voice conversion but equally valuable for other research areas such as speaker identification, text-to-speech synthesis, and accent recognition. Each speaker contributes a distinct set of newspaper sentences, selected with a greedy algorithm to maximize contextual and phonetic diversity, which enables broad coverage across voice conversion scenarios.
In developing VoxShift, we leveraged the versatility of the VCTK dataset, which gave us the opportunity to consider both parallel and non-parallel training approaches. Parallel training uses audio recordings with identical linguistic content spoken by different individuals; the model learns to map the source spectrogram to the target spectrogram, which requires temporal alignment to ensure accurate conversion. Non-parallel training takes a more flexible route: through autoencoding, the model reconstructs the Mel-spectrogram directly from embeddings, eliminating the need for parallel data. Because it does not rely on two recordings of the same content by different speakers, it is significantly more versatile and suited to a wider range of applications. Given the challenges of sourcing parallel training data, we opted for a non-parallel approach. Non-parallel training is at the forefront of voice conversion research precisely because it sidesteps the absence of parallel data, a common hurdle in the field. It aligns with the state of the art and offers an intriguing avenue for research into more dynamic and adaptive model architectures.
A typical model architecture in this domain may incorporate various embeddings to capture the nuances of speech, including:
- Linguistic Embeddings: These encode the textual or phonetic aspects of speech, capturing the content without being influenced by the speaker's unique vocal characteristics.
- Speaker Embeddings: These capture the unique vocal traits of the speaker, allowing the model to maintain or change speaker identity during the voice conversion process.
- Prosodic Embeddings: These are used to encapsulate the rhythm, stress, and intonation patterns of speech, contributing to the naturalness and expressiveness of the converted voice.
For VoxShift, we decided to streamline our model by not using prosodic embeddings. This reduces complexity and lets us focus on the core aspects of voice conversion: preserving linguistic content and capturing speaker identity. By simplifying the model, we aim to balance performance and computational efficiency, making VoxShift a robust yet accessible tool for exploring voice conversion technologies.
Continuing from our approach, we integrated two pretrained models to harness linguistic and speaker embeddings. HuBERT Soft, utilized for its linguistic embeddings, is pretrained on a diverse set of unlabeled audio, enabling the extraction of nuanced linguistic patterns crucial for voice conversion. For speaker characteristics, we leveraged WavLM, which excels in identifying vocal traits across languages and accents, due to its training on a wide-ranging speech corpus. Complementing these, our architecture incorporates HiFi-GAN, a vocoder trained on the LJSpeech dataset, chosen for its ability to produce high-fidelity speech from Mel spectrograms. The integration and final audio synthesis are achieved through a custom-implemented decoder, designed to merge linguistic and speaker embeddings effectively.
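To make the pipeline concrete, here is a minimal sketch of how the two pretrained front-ends can be loaded and queried. The torch.hub entry point follows the soft-vc repositories; the WavLM x-vector checkpoint (`microsoft/wavlm-base-plus-sv` via HuggingFace transformers) and the file name `source.wav` are assumptions, and the exact loading code in our Drive scripts may differ.

```python
import torch
import torchaudio
from transformers import AutoFeatureExtractor, WavLMForXVector

# Content (linguistic) embeddings: HuBERT-Soft via torch.hub (bshall/hubert).
hubert = torch.hub.load("bshall/hubert:main", "hubert_soft", trust_repo=True)

# Speaker embeddings: a WavLM-based x-vector model (assumed checkpoint).
extractor = AutoFeatureExtractor.from_pretrained("microsoft/wavlm-base-plus-sv")
wavlm_sv = WavLMForXVector.from_pretrained("microsoft/wavlm-base-plus-sv")

wav, sr = torchaudio.load("source.wav")               # (1, T) waveform
wav = torchaudio.functional.resample(wav, sr, 16000)  # both models expect 16 kHz audio

with torch.inference_mode():
    content = hubert.units(wav.unsqueeze(0))          # (1, frames, 256) soft content units
    inputs = extractor(wav.squeeze(0).numpy(), sampling_rate=16000, return_tensors="pt")
    speaker = wavlm_sv(**inputs).embeddings           # (1, 512) speaker embedding
```

The content units and speaker embedding are then passed to our decoder, whose output mel-spectrogram is vocoded by HiFi-GAN (not shown here).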
Building upon our model's foundation, the decoder is structured as a sequence-to-sequence model. At its core, the encoder segment harnesses 1D convolutional layers, specifically tasked with handling the linguistic embeddings derived from HuBERT Soft. This processed output is then concatenated with the speaker embeddings from WavLM, forming a representation of both linguistic content and speaker identity. This is fed into the decoder segment, which is designed with three LSTM layers. These layers are pivotal in managing sequential data, ensuring that the temporal dynamics of speech are captured and accurately reproduced in the conversion process. An additional enhancement to this architecture is the inclusion of a PreNet, which includes a series of linear layers and is used on the content embedding as well as on the ground truth mel-spectrogram during training. We incorporate the ground truth mel-spectrogram during training in order to condition the model on correct predictions, which can be considered a form of teacher forcing.
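The sketch below illustrates this structure under stated assumptions: layer sizes, kernel widths, and dropout rates are placeholders, and the autoregressive teacher-forcing path through the PreNet is omitted for brevity.

```python
import torch
import torch.nn as nn

class PreNet(nn.Module):
    """Two linear layers with ReLU and dropout (sizes are assumptions)."""
    def __init__(self, in_dim, hidden=256, p=0.5):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(), nn.Dropout(p),
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Dropout(p),
        )

    def forward(self, x):
        return self.net(x)

class Decoder(nn.Module):
    """Conv1d encoder over content units, concatenation with a broadcast speaker
    embedding, three LSTM layers, and a linear projection to an 80-bin mel."""
    def __init__(self, content_dim=256, spk_dim=512, hidden=512, n_mels=80):
        super().__init__()
        self.prenet = PreNet(content_dim, hidden=256)
        self.encoder = nn.Sequential(
            nn.Conv1d(256, hidden, kernel_size=5, padding=2), nn.ReLU(), nn.InstanceNorm1d(hidden),
            nn.Conv1d(hidden, hidden, kernel_size=5, padding=2), nn.ReLU(), nn.InstanceNorm1d(hidden),
        )
        self.lstm = nn.LSTM(hidden + spk_dim, hidden, num_layers=3, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, content, speaker):
        # content: (B, T, content_dim) HuBERT-Soft units; speaker: (B, spk_dim) WavLM embedding
        x = self.prenet(content)                               # (B, T, 256)
        x = self.encoder(x.transpose(1, 2)).transpose(1, 2)    # (B, T, hidden)
        spk = speaker.unsqueeze(1).expand(-1, x.size(1), -1)   # broadcast over time
        x, _ = self.lstm(torch.cat([x, spk], dim=-1))          # (B, T, hidden)
        return self.proj(x)                                    # (B, T, n_mels)
```

In a full training step, the ground-truth mel-spectrogram would additionally pass through a PreNet and be fed back as described above; that teacher-forcing path is left out of this sketch.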
In training our voice conversion model, we used an autoencoder approach: an encoder-decoder structure in which the encoder transforms the input into an intermediate representation and the decoder reconstructs the output from it. Training begins by extracting content and speaker embeddings from the input audio. These embeddings are fed into the decoder, which outputs a Mel-spectrogram that should closely resemble the target Mel-spectrogram. The fidelity of the generated Mel-spectrogram is measured by an L1 loss (mean absolute error), which guides the optimization of the model parameters. We optimized with Adam, a widely used optimizer that computes adaptive learning rates per parameter and typically converges faster than plain stochastic gradient descent. We trained with a batch size of 64, which balances the generalization benefits of larger batches against the stochasticity of smaller ones, and for 80 epochs, giving the model ample opportunity to learn without overfitting given the complexity of the task. A learning rate of 0.0004 was chosen as a starting point that is neither large enough to overshoot minima nor small enough to stall training. We also applied regularization: a weight decay of 0.00001 to penalize large weights, plus dropout and instance normalization to reduce co-adaptation of neurons and stabilize the distribution of layer inputs, helping the model generalize to unseen data.
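Put together, a single epoch of this procedure looks roughly like the following sketch. The `decoder` model and a `train_loader` yielding (content embeddings, speaker embeddings, target mels) batches are assumed to be defined elsewhere; the hyperparameters mirror the values quoted above.

```python
import torch
import torch.nn.functional as F

# Hyperparameters as described in the text; `decoder` and `train_loader` are assumptions.
optimizer = torch.optim.Adam(decoder.parameters(), lr=4e-4, weight_decay=1e-5)

for epoch in range(80):
    for content, speaker, mel_target in train_loader:   # batches of 64
        mel_pred = decoder(content, speaker)             # teacher forcing omitted in this sketch
        loss = F.l1_loss(mel_pred, mel_target)           # L1 / mean absolute error
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```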
In our training regimen, a crucial aspect was establishing a robust train-validation-test split that would enable us to accurately gauge the model's performance on both many-to-many and any-to-any voice conversion tasks. The accompanying graphic delineates the distribution strategy for our datasets. We divided the dataset to ensure that the validation and test sets included utterances from both seen and unseen speakers during the training phase. This approach allows us to evaluate the model's ability to convert voices of speakers it has learned from (seen) and speakers it has never encountered during training (unseen). While the subset for unseen speakers is smaller, the flexibility of our model allows for generating multiple combinations between speakers, effectively expanding the test set and providing a comprehensive assessment of the model's generalization capabilities.
In assessing the performance of our voice conversion model, we employed three key metrics:
- Word Error Rate (WER): This metric is derived from automatic speech recognition (ASR) and compares the ASR transcript of the converted audio against the ground-truth text of the source audio. WER is calculated as the number of insertions, deletions, and substitutions made by the ASR system, divided by the number of words in the ground truth. Lower is better: 0 indicates a perfect transcription, and values can exceed 1 if the ASR output contains many insertions.
- Speaker Similarity (Sim): We measured the speaker similarity by computing the cosine similarity between the speaker embeddings of the converted audio and the target audio. The cosine similarity is a value between -1 and 1, where 1 signifies perfect similarity, 0 indicates no similarity, and -1 represents perfect dissimilarity.
- Mean Opinion Score (MOS): For the subjective evaluation of audio quality, we used the Mean Opinion Score, which reflects human judgments of the converted audio's quality. Participants rate the quality on a scale from 1 to 5, with 5 being the highest quality. In addition to MOS, we also employed Speaker Mean Opinion Score (SMOS) to rate the similarity of speaker identity between the source and target audio.
These metrics provided us with a comprehensive understanding of our model's performance, evaluating both the objective accuracy of voice conversion and the subjective quality as perceived by human listeners.
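As a rough illustration of the two objective metrics, the sketch below computes WER with the third-party `jiwer` package and speaker similarity as a cosine similarity between embeddings. The transcript and embeddings are placeholders here; in practice they would come from an ASR system and the WavLM speaker encoder, and our actual evaluation scripts may differ.

```python
import torch
import torch.nn.functional as F
from jiwer import wer  # pip install jiwer

# Objective-metric sketch; the hypothesis string and embeddings are placeholders.
reference = "please call stella ask her to bring these things"
hypothesis = "please call stella asked her to bring these things"   # example ASR output
print("WER:", wer(reference, hypothesis))                            # 0 = perfect transcript

emb_converted = torch.randn(1, 512)   # stand-in for the converted audio's speaker embedding
emb_target = torch.randn(1, 512)      # stand-in for the target speaker's embedding
print("Sim:", F.cosine_similarity(emb_converted, emb_target).item())  # 1 = identical direction
```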
For our baseline model, the initial expectation was a potential overfit as the training progressed. Contrary to this, the validation loss continued to converge, suggesting that the model was learning generalizable patterns rather than memorizing the training data. However, to specifically account for the model's performance on unseen speakers, we altered our validation approach. By evaluating solely on unseen speakers, the validation loss exhibited an increase, indicating that while the model was adept at handling seen speakers, its performance on any-to-any voice conversion was less robust.
In our ablation study, we incrementally introduced components to the baseline model and observed their impact on performance. First, we incorporated dropout to mitigate overfitting, ensuring the model's generalizability. Next, we increased the hidden dimension of the LSTM layers, which enhanced the model's capacity for the more complex any-to-any voice conversion task. Finally, we added a PostNet module, which further refined the generated Mel-Spectrogram, resulting in improved audio quality. Each step was methodically assessed to measure its contribution to the overall efficacy of the model.
The ablation study yielded the following key insights: Introducing additional dropout to the model led to improved Word Error Rate (WER) for both many-to-many (m2m) and any-to-any (a2a) voice conversion tasks. Expanding the hidden dimension of the LSTM layers resulted in better Speaker Similarity (Sim) scores for both m2m and a2a scenarios. However, the integration of PostNet did not significantly enhance performance, as evidenced by the overall similar metrics before and after its inclusion.
Our investigation into the influence of gender on model performance revealed a notable trend: conversions where the target speaker was female generally resulted in higher speaker similarity scores and lower WER scores, suggesting a more accurate capture of vocal characteristics and linguistic content.
In our final set of experiments, we explored the impact of using more extensive audio inputs for creating speaker embeddings. The 'Base + single' configuration used the baseline model with embeddings from individual audio files. We then experimented with 'Base + window', where the embeddings were derived from concatenated audio sequences, including the target and its adjacent utterances, to form a more comprehensive audio context. This was further extended to 'BaseW + window', where the model was not only tested with windowed embeddings but also trained on them. Lastly, 'BaseA + agg' represented a model that used aggregated embeddings, averaging the embeddings from all utterances for each speaker in the VCTK dataset. The results indicated that using windowed and aggregated strategies for speaker embeddings led to improved Speaker Similarity scores for both many-to-many and any-to-any conversion tasks. However, this was accompanied by a deterioration in Word Error Rate (WER), suggesting that while speaker characteristics were captured more accurately, the intelligibility of the speech in the converted audio was somewhat compromised.
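The two embedding strategies can be summarized in a few lines. In the sketch below, `embed_fn` stands in for the WavLM speaker-embedding call and the utterances are 1-D waveform tensors; both names are illustrative rather than taken from our scripts.

```python
import torch

def windowed_embedding(utterances, idx, embed_fn, radius=1):
    """'window': embed the target utterance concatenated with its neighbouring utterances."""
    window = utterances[max(0, idx - radius): idx + radius + 1]
    return embed_fn(torch.cat(window, dim=-1))

def aggregated_embedding(utterances, embed_fn):
    """'agg': average the embeddings of all utterances from one speaker."""
    return torch.stack([embed_fn(u) for u in utterances]).mean(dim=0)
```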
We have prepared an interactive demo website that showcases a variety of audio conversions generated by our baseline model. The demo includes examples of both any-to-any and many-to-many voice conversions, sorted by speaker similarity score with the highest-scoring conversions at the top. We invite you to listen and judge for yourself whether you agree with the scores.
Please note, it is advisable to approach listening with caution, particularly when using headphones. Some of the converted audio files may contain loud or unpleasant artifacts. To prevent potential hearing damage, we recommend starting at a low volume and adjusting as necessary.
Click here to visit our demo website
Our project's foundation builds upon the work in the soft-vc and Tacotron 2 repositories, as well as the Medium article by Piero Esposito, which served as the initial codebase for our voice conversion model. We also explored other resources that implement similar voice conversion approaches, enriching our understanding of the domain and providing perspectives and techniques that complement our primary framework.
- soft-vc: A voice conversion model built on soft speech units, created by bshall. This repository contains tools and pre-trained models for voice conversion. Available at: https://github.com/bshall/soft-vc
- Tacotron 2 by NVIDIA: An implementation of the Tacotron 2 speech synthesis model. This repository includes a robust and flexible framework for text-to-speech systems. Available at: https://github.com/NVIDIA/tacotron2
- knn-vc: Implements a k-nearest neighbors approach to voice conversion, also developed by bshall. It showcases an alternative method using non-parallel data. Available at: https://github.com/bshall/knn-vc
- AutoVC: A repository by auspicious3000, featuring an Autoencoder for Voice Conversion. It demonstrates a method for disentangling speaker characteristics from linguistic content. Available at: https://github.com/auspicious3000/autovc
- fastVC: Developed by fmiotello, this repository provides an efficient voice conversion framework. It is designed for rapid voice conversion without compromising on quality. Available at: https://github.com/fmiotello/fastVC