<!DOCTYPE html>
<html lang="en-US"><head><meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<!-- Begin Jekyll SEO tag v2.7.1 -->
<title>Cross-speaker Emotion Transfer for Expressive Speech Synthesis through Information Perturbation</title>
<meta name="generator" content="Jekyll v3.9.0">
<meta property="og:title" content="Cross-speaker Emotion Transfer for Expressive Speech Synthesis through Information Perturbation">
<meta property="og:locale" content="en_US">
<link rel="canonical" href="https://leiyi420.github.io/CSEmoTransfer">
<meta property="og:url" content="https://leiyi420.github.io/CSEmoTransfer">
<meta name="twitter:card" content="summary">
<!-- End Jekyll SEO tag -->
<meta name="viewport" content="width=device-width, initial-scale=1">
<meta name="theme-color" content="#157878">
<link rel="stylesheet" href="style.css">
</head>
<body>
<section class="page-header">
<!-- <h1 class="project-name">Demo PAGE</h1> -->
<!-- <h2 class="project-tagline"></h2> -->
</section>
<section class="main-content">
<h1 id=""><center>Cross-speaker Emotion Transfer for Expressive Speech Synthesis through Information Perturbation</center></h1>
<center> Yi Lei, Shan Yang, Xinfa Zhu, Qicong Xie, Lei Xie, Dan Su </center>
<center> Northwestern Polytechnical University </center>
<center> Tencent AI Lab </center>
<h2>0. Contents</h2>
<ol>
<li><a href="#abstract">Abstract</a></li>
<li><a href="#transfer">Examples of information perturbation</a></li>
<li><a href="#prediction">Demos -- Cross-speaker emotion transfer</a></li>
<li><a href="#control">Demos -- Generated audio examples from different branches</a></li>
</ol>
<br><br>
<h2 id="abstract">1. Abstract<a name="abstract"></a></h2>
<p> Cross-speaker emotion transfer is an effective way to produce expressive speech for neutral target speakers without requiring emotional training data from those speakers. Because the emotion and timbre of the source speaker are heavily entangled, existing approaches often struggle to balance speaker similarity against emotional expressiveness.
In this paper, we propose to disentangle the timbre and the emotion of speech through information perturbation to conduct cross-speaker emotion transfer, which effectively learns the emotional expression of the source speaker while maintaining the timbre of the neutral target speakers. Specifically, we perturb the timbre and emotion information (e.g., formant and pitch) of the source speech separately to obtain and model emotion- and timbre-independent signals, based on which the proposed model can produce emotional speech in the timbre of the target speakers. Experimental results demonstrate that the proposed approach significantly outperforms the baseline models in terms of naturalness and similarity, indicating the effectiveness of information perturbation for cross-speaker emotion transfer.
</p>
<center><img src='fig/architecture.jpg'></center>
<br><br>
<h2> 2. Examples of information perturbation<a name="transfer"></a></h2>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Speaker</strong></th>
<th style="text-align: center"><strong>Original recording</strong></th>
<th style="text-align: center"><strong>Perturbation on speaker</strong></th>
<th style="text-align: center"><strong>Perturbation on emotion</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: center">Source speaker - Neutral</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db6_neutral_002346.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db6_neutral_002346.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db6_neutral_002346.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">Source speaker - Happy</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db6_happy_210026.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db6_happy_210026.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db6_happy_210026.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">Source speaker - Angry</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db6_angry_220125.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db6_angry_220125.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db6_angry_220125.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">Source speaker - Sad</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db6_sad_230133.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db6_sad_230133.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db6_sad_230133.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">Source speaker - Surprise</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db6_surprise_241231.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db6_surprise_241231.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db6_surprise_241231.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">Source speaker - Disgust</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db6_disgust_261928.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db6_disgust_261928.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db6_disgust_261928.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">M1</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db4_013463.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db4_013463.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db4_013463.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">F1</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/db2_002395.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/db2_002395.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/db2_002395.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: center">F2</td>
<td style="text-align: left"><audio src="samples/pert_samples/original/mydb1_002132.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_seg_formant_wavs/mydb1_002132.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/pert_samples/pert_pr_pm_wavs/mydb1_002132.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<br><br>
<h2>3. Demos -- Cross-speaker emotion transfer<a name="prediction"></a></h2>
<h3>Transfer the emotional expression of the source speaker to neutral target speakers that have no emotional training data.</h3>
<p><b>Target speaker: M1 </b></p>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Emotion</strong></th>
<th style="text-align: center"><strong>Target emotion example</strong></th>
<th style="text-align: center"><strong>Target speaker example</strong></th>
<th style="text-align: center"><strong>FS2-GST</strong></th>
<th style="text-align: center"><strong>FS2-VAE</strong></th>
<th style="text-align: center"><strong>FS2-BN</strong></th>
<th style="text-align: center"><strong>Proposed</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left" rowspan=2>Neutral</td>
<td style="text-align: left" colspan=6>Text: 我很明白你的意思。(English: I know exactly what you mean.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_neutral.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db4_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db4_005274_0100.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db4_005274_0100.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db4_005274_0100.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db4_005274_0100.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Happy</td>
<td style="text-align: left" colspan=6>Text: 太棒了,风浩这种做法果然是有效。(English: Great! Feng Hao's approach really does work.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_happy.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db4_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_happy_210863_0101.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_happy_210863_0101.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_happy_210863_0101.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_happy_210863_0101.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Angry</td>
<td style="text-align: left" colspan=6>Text: 很快穆先生就由恐惧滋生成了愤怒的情绪。(English: Very soon, Mr. Mu's fear turned into anger.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_angry.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db4_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_angry_220253_0102.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_angry_220253_0102.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_angry_220253_0102.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_angry_220253_0102.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Sad</td>
<td style="text-align: left" colspan=6>Text: 心痛到濒临疯狂,是谁告诉我要学会遗忘。(English: My heart aches to the brink of madness; who told me I must learn to forget?)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_sad.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db4_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_sad_231056_0103.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_sad_231056_0103.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_sad_231056_0103.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_sad_231056_0103.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Surprise</td>
<td style="text-align: left" colspan=6>Text: 我的手指头断了,竟然一点儿也没有感觉。(English: I broke my finger, but I didn't feel it at all.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_surprise.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db4_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_surprise_240927_0104.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_surprise_240927_0104.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_surprise_240927_0104.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_surprise_240927_0104.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Disgust</td>
<td style="text-align: left" colspan=6>Text: 心动不如行动,我不太擅长卖萌。(English: Action is better than thought. I'm not very good at being cute.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_disgust.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db4_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_disgust_261801_0106.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_disgust_261801_0106.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_disgust_261801_0106.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_disgust_261801_0106.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<p><b>Short summary:</b> When the neutral target speaker is male, FS2-GST and FS2-VAE fail to produce the correct timbre, although the generated speech is emotional. FS2-BN also sometimes fails to generate a male voice. The proposed model maintains the target timbre while generating emotional speech.</p>
<p><b>Target speaker: F1 </b></p>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Emotion</strong></th>
<th style="text-align: center"><strong>Target emotion example</strong></th>
<th style="text-align: center"><strong>Target speaker example</strong></th>
<th style="text-align: center"><strong>FS2-GST</strong></th>
<th style="text-align: center"><strong>FS2-VAE</strong></th>
<th style="text-align: center"><strong>FS2-BN</strong></th>
<th style="text-align: center"><strong>Proposed</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left" rowspan=2>Neutral</td>
<td style="text-align: left" colspan=6>Text: 他希望能找个办法既能让这家伙吃苦头又不连累自己。(English: He hoped to find a way to make this guy suffer without harming himself.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_neutral.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db2_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db2_016199_0000.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db2_016199_0000.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db2_016199_0000.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db2_016199_0000.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Happy</td>
<td style="text-align: left" colspan=6>Text: 戴老师他不是很高兴嘛?(English: Mr. Dai is very happy, isn't he?)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_happy.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db2_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_happy_211202_0001.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_happy_211202_0001.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_happy_211202_0001.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_happy_211202_0001.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Angry</td>
<td style="text-align: left" colspan=6>Text: 第一次打昏的时候,我是最愤怒的!(English: The first time I fainted, I was the angriest!)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_angry.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db2_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_angry_220349_0002.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_angry_220349_0002.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_angry_220349_0002.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_angry_220349_0002.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Sad</td>
<td style="text-align: left" colspan=6>Text: 求您给我一个改过自新的机会,以后我再也不去行骗了。(English: Please give me a chance to reform. I will never cheat again in the future.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_sad.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db2_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_sad_230090_0003.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_sad_230090_0003.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_sad_230090_0003.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_sad_230090_0003.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Surprise</td>
<td style="text-align: left" colspan=6>Text: 我的手指头断了,竟然一点儿也没有感觉。(English: I broke my finger, but I didn't feel it at all.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_surprise.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db2_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_surprise_240927_0004.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_surprise_240927_0004.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_surprise_240927_0004.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_surprise_240927_0004.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Disgust</td>
<td style="text-align: left" colspan=6>Text: 你是存在于世界的另一个我,如此憎恨却又无法摆脱。(English: You are another me in this world, whom I hate so much yet cannot get rid of.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_disgust.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db2_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_disgust_260027_0006.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_disgust_260027_0006.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_disgust_260027_0006.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_disgust_260027_0006.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<p><b>Target speaker: F2 </b></p>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Emotion</strong></th>
<th style="text-align: center"><strong>Target emotion example</strong></th>
<th style="text-align: center"><strong>Target speaker example</strong></th>
<th style="text-align: center"><strong>FS2-GST</strong></th>
<th style="text-align: center"><strong>FS2-VAE</strong></th>
<th style="text-align: center"><strong>FS2-BN</strong></th>
<th style="text-align: center"><strong>Proposed</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left" rowspan=2>Neutral</td>
<td style="text-align: left" colspan=6>Text: 预计今天下午到前半夜我市部分地区有雷阵雨。(English: Thunderstorms are expected in some areas of our city from this afternoon through the first half of the night.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_neutral.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db1_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_neutral_014907_0300.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_neutral_014907_0300.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_neutral_014907_0300.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_neutral_014907_0300.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Happy</td>
<td style="text-align: left" colspan=6>Text: 又有新发现了,真是令人惊喜万分。(English: It's amazing to find something new.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_happy.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db1_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_happy_210808_0301.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_happy_210808_0301.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_happy_210808_0301.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_happy_210808_0301.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Angry</td>
<td style="text-align: left" colspan=6>Text: 住口,不要喊我父亲,我没有你这样的女儿!(English: Shut up, don't call me father. I don't have a daughter like you!)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_angry.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db1_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_angry_220442_0302.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_angry_220442_0302.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_angry_220442_0302.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_angry_220442_0302.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Sad</td>
<td style="text-align: left" colspan=6>Text: 求您给我一个改过自新的机会,以后我再也不去行骗了。(English: Please give me a chance to reform. I will never cheat again in the future.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_sad.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db1_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_sad_230090_0303.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_sad_230090_0303.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_sad_230090_0303.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_sad_230090_0303.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Surprise</td>
<td style="text-align: left" colspan=6>Text: 啊,是的是的。真是耳闻不如目见。天使呀,您来有什么话对我说?(English: Ah, yes, yes. Seeing is better than hearing. Angel, what do you want to say to me?)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_surprise.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db1_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_surprise_241083_0304.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_surprise_241083_0304.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_surprise_241083_0304.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_surprise_241083_0304.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left" rowspan=2>Disgust</td>
<td style="text-align: left" colspan=6>Text: 翁倩玉在赈灾晚会上演唱。(English: Weng Qianyu sang at the disaster relief gala.)</td>
</tr>
<tr>
<td style="text-align: left"><audio src="samples/ref2/db6_disgust.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/ref1/db1_rec.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/gst/db6_disgust_261643_0306.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/vae/db6_disgust_261643_0306.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/cs_bn/db6_disgust_261643_0306.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/proposed/db6_disgust_261643_0306.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<p><b>Short summary:</b> The results indicate that the proposed method successfully transfers the source emotion to the target speaker while maintaining the target speaker's timbre.</p>
<br><br>
<h2>4. Demos -- Generated audio examples from different branches<a name="control"></a></h2>
<table>
<thead>
<tr>
<th style="text-align: center"><strong>Speaker</strong></th>
<th style="text-align: center"><strong>Emotion</strong></th>
<th style="text-align: center"><strong>Speaker-mel generator</strong></th>
<th style="text-align: center"><strong>Emotion-mel generator</strong></th>
<th style="text-align: center"><strong>Final output mel</strong></th>
</tr>
</thead>
<tbody>
<tr>
<td style="text-align: left">F1</td>
<td style="text-align: left">Neutral</td>
<td style="text-align: left"><audio src="samples/branch_gen/db2_003489_0000_spk.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db2_003489_0000_emo.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db2_003489_0000.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left">F2</td>
<td style="text-align: left">Happy</td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_happy_211439_0301_spk.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_happy_211439_0301_emo.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_happy_211439_0301.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left">F2</td>
<td style="text-align: left">Angry</td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_angry_220442_0302_spk.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_angry_220442_0302_emo.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_angry_220442_0302.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left">F1</td>
<td style="text-align: left">Sad</td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_sad_230090_0003_spk.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_sad_230090_0003_emo.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_sad_230090_0003.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left">Source speaker</td>
<td style="text-align: left">Surprise</td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_surprise_241334_0204_spk.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_surprise_241334_0204_emo.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_surprise_241334_0204.wav" controls="" preload=""></audio></td>
</tr>
<tr>
<td style="text-align: left">M1</td>
<td style="text-align: left">Disgust</td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_disgust_260027_0106_spk.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_disgust_260027_0106_emo.wav" controls="" preload=""></audio></td>
<td style="text-align: left"><audio src="samples/branch_gen/db6_disgust_260027_0106.wav" controls="" preload=""></audio></td>
</tr>
</tbody>
</table>
<p><b>Short summary:</b> Given different speaker and emotion embeddings during inference, the Speaker-mel generator provides emotionless speech with a specific timbre, while the output of the Emotion-mel generator contains the emotion variations with a random or averaged timbre. Based on the two outputs, the final generated speech has both the specified emotional expression and the specified timbre.</p>
<footer class="site-footer">
<span class="site-footer-credits">This page was generated by <a href="https://pages.github.com/">GitHub Pages</a>.</span>
</footer>
</section>
</body></html>