
Commit

Add arxiv link for Wave-Tacotron paper.
ronw committed Nov 10, 2020
1 parent e05da0a commit 8ab4e66
Showing 2 changed files with 12 additions and 6 deletions.
11 changes: 11 additions & 0 deletions index.html
@@ -223,5 +223,16 @@ <h2>Publications</h2>
</ul>
</header>
</article>
+
+<article>
+<header>
+<span class="paper_date">(November 2020)</span>
+<span class="paper_title">Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis</span>
+<ul>
+<li><a href="https://arxiv.org/abs/2011.03568">paper</a></li>
+<li><a href="publications/wave-tacotron/index.html">audio samples</a></li>
+</ul>
+</header>
+</article>
</body>
</html>
7 changes: 1 addition & 6 deletions publications/wave-tacotron/index.html
@@ -88,12 +88,7 @@
<h1>Audio samples from "Wave-Tacotron: Spectrogram-free end-to-end text-to-speech synthesis"</h1>
</header>
</article>
-<!--
-<div><p><b>Paper:</b> <a href="">arXiv</a></p></div>
--->
-<!--
-<div><p><b>Paper:</b> <a href="">arXiv</a></p></div>
--->
+<div><p><b>Paper:</b> <a href="https://arxiv.org/abs/2011.03568">arXiv</a></p></div>
<div><p><b>Authors:</b> Ron J. Weiss, RJ Skerry-Ryan, Eric Battenberg, Soroosh Mariooryad, Diederik P. Kingma</p></div>
<div><p><b>Abstract:</b> We describe a sequence-to-sequence neural network which can directly generate speech waveforms from text inputs. The architecture extends the Tacotron model by incorporating a normalizing flow into the autoregressive decoder loop. Output waveforms are modeled as a sequence of non-overlapping fixed-length frames, each one containing hundreds of samples. The interdependencies of waveform samples within each frame are modeled using the normalizing flow, enabling parallel training and synthesis. Longer-term dependencies are handled autoregressively by conditioning each flow on preceding frames. This model can be optimized directly with maximum likelihood, without using intermediate, hand-designed features nor additional loss terms. Contemporary state-of-the-art text-to-speech (TTS) systems use a cascade of separately learned models: one (such as Tacotron) which generates intermediate features (such as spectrograms) from text, followed by a vocoder (such as WaveRNN) which generates waveform samples from the intermediate features. The proposed system, in contrast, does not use a fixed intermediate representation, and learns all parameters end-to-end. Experiments show that the proposed model generates speech with quality approaching a state-of-the-art neural TTS system, with significantly improved generation speed.</p></div>

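The abstract's core decoding idea — all samples within a fixed-length frame generated in parallel by inverting a normalizing flow, with longer-term structure handled by conditioning each frame's flow on the preceding frame — can be illustrated with a toy sketch. Everything below (the single affine flow step, the `conditioning` function, the tiny frame length) is a hypothetical simplification for illustration, not the paper's actual architecture:

```python
import numpy as np

FRAME_LEN = 4  # hundreds of samples in the real model; tiny here


def conditioning(prev_frame):
    """Toy conditioning network: derive affine flow parameters
    (shift, log-scale) from the preceding frame."""
    shift = np.tanh(prev_frame.mean()) * np.ones(FRAME_LEN)
    log_scale = -0.5 * np.ones(FRAME_LEN)  # fixed contraction, for simplicity
    return shift, log_scale


def sample_frame(prev_frame, rng):
    """Invert a one-step affine flow: map Gaussian noise to a waveform
    frame. All FRAME_LEN samples are computed in parallel."""
    shift, log_scale = conditioning(prev_frame)
    z = rng.standard_normal(FRAME_LEN)
    return z * np.exp(log_scale) + shift


def synthesize(num_frames, seed=0):
    """Generate frames autoregressively, starting from a zero 'go' frame,
    then concatenate them into one waveform."""
    rng = np.random.default_rng(seed)
    frames = [np.zeros(FRAME_LEN)]
    for _ in range(num_frames):
        frames.append(sample_frame(frames[-1], rng))
    return np.concatenate(frames[1:])


waveform = synthesize(8)
print(waveform.shape)  # (32,)
```

In the paper the flow is a deep stack of invertible layers conditioned on Tacotron-style decoder state rather than a single affine step, but the control flow is the same: sampling is parallel within a frame and sequential across frames, which is what allows hundreds of samples per autoregressive step.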
