# MakeItTalk: Speaker-Aware Talking-Head Animation

This is the code repository implementing the paper:

> **MakeItTalk: Speaker-Aware Talking-Head Animation**
>
> [Yang Zhou](https://people.umass.edu/~yangzhou),
> [Xintong Han](http://users.umiacs.umd.edu/~xintong/),
> [Eli Shechtman](https://research.adobe.com/person/eli-shechtman),
> [Jose Echevarria](http://www.jiechevarria.com),
> [Evangelos Kalogerakis](https://people.cs.umass.edu/~kalo/),
> [Dingzeyu Li](https://dingzeyu.li)
>
> SIGGRAPH Asia 2020
>
> **Abstract** We present a method that generates expressive talking-head videos from a single facial image with audio as the only input. In contrast to previous attempts to learn direct mappings from audio to raw pixels for creating talking faces, our method first disentangles the content and speaker information in the input audio signal. The audio content robustly controls the motion of lips and nearby facial regions, while the speaker information determines the specifics of facial expressions and the rest of the talking-head dynamics. Another key component of our method is the prediction of facial landmarks reflecting the speaker-aware dynamics. Based on this intermediate representation, our method works with many portrait images in a single unified framework, including artistic paintings, sketches, 2D cartoon characters, Japanese mangas, and stylized caricatures.
> In addition, our method generalizes well for faces and characters that were not observed during training. We present extensive quantitative and qualitative evaluation of our method, in addition to user studies, demonstrating generated talking-heads of significantly higher quality compared to prior state-of-the-art methods.
>
> [[Project page]](https://people.umass.edu/~yangzhou/MakeItTalk/)
> [[Paper]](https://people.umass.edu/~yangzhou/MakeItTalk/MakeItTalk_SIGGRAPH_Asia_Final_round-5.pdf)
> [[Video]](https://www.youtube.com/watch?v=OU6Ctzhpc6s)
> [[Arxiv]](https://arxiv.org/abs/2004.12992)
> [[Colab Demo]](quick_demo.ipynb)
> [[Colab Demo TL;DR]](quick_demo_tdlr.ipynb)

Figure. Given an audio speech signal and a single portrait image as input (left), our model generates speaker-aware talking-head animations (right).
Neither the speech signal nor the input face image is observed during model training.
Our method creates both non-photorealistic cartoon animations (top) and natural human face videos (bottom).
## Updates

- [x] Pre-trained models
- [x] Google Colab quick demo for natural faces [[detail]](quick_demo.ipynb) [[TL;DR]](quick_demo_tdlr.ipynb)
- [ ] Training code for each module
- [ ] Customized puppet creating tool
## Requirements

- Python 3.6 environment
```
conda create -n makeittalk_env python=3.6
conda activate makeittalk_env
```
- ffmpeg (https://ffmpeg.org/download.html)
```
sudo apt-get install ffmpeg
```
- Python packages
```
pip install -r requirements.txt
```
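After installing, you can optionally sanity-check the environment with the sketch below. It assumes the requirements include packages such as `numpy`, `scipy`, and `torch`; `requirements.txt` is the authoritative list, so adjust the imports accordingly.

```python
# check_env.py -- minimal environment sanity check (a sketch, not part of the repo).
import importlib
import shutil

# ffmpeg must be reachable on PATH for the audio/video steps.
print("ffmpeg:", shutil.which("ffmpeg") or "NOT FOUND")

# Assumed package names -- verify against requirements.txt.
for name in ["numpy", "scipy", "torch"]:
    try:
        module = importlib.import_module(name)
        print(f"{name}: {getattr(module, '__version__', 'ok')}")
    except ImportError:
        print(f"{name}: NOT INSTALLED")
```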
## Pre-trained Models

Download the following pre-trained models to the `examples/ckpt` folder for testing your own animation.

| Model | Link to the model |
| :-------------: | :---------------: |
| Voice Conversion | [Link](https://drive.google.com/file/d/1ZiwPp_h62LtjU0DwpelLUoodKPR85K7x/view?usp=sharing) |
| Speech Content Module | [Link](https://drive.google.com/file/d/1r3bfEvTVl6pCNw5xwUhEglwDHjWtAqQp/view?usp=sharing) |
| Speaker-aware Module | [Link](https://drive.google.com/file/d/1rV0jkyDqPW-aDJcj7xSO6Zt1zSXqn1mu/view?usp=sharing) |
| Image2Image Translation Module | [Link](https://drive.google.com/file/d/1i2LJXKp-yWKIEEgJ7C6cE3_2NirfY_0a/view?usp=sharing) |
| Non-photorealistic Warping (.exe) | [Link](https://drive.google.com/file/d/1rlj0PAUMdX8TLuywsn6ds_G6L63nAu0P/view?usp=sharing) |
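If you prefer to script the downloads, the sketch below uses the third-party `gdown` package (`pip install gdown`; it is not in the repo requirements) with the file IDs taken from the links above. The local filenames are placeholders only; name the files whatever the demo scripts expect to find in `examples/ckpt`.

```python
# download_ckpts.py -- a sketch for fetching the checkpoints above into examples/ckpt.
import os
import gdown  # third-party helper for Google Drive downloads

CKPT_DIR = "examples/ckpt"
os.makedirs(CKPT_DIR, exist_ok=True)

# Placeholder filenames -> Drive file IDs taken from the table above.
MODELS = {
    "voice_conversion.pth": "1ZiwPp_h62LtjU0DwpelLUoodKPR85K7x",
    "speech_content.pth": "1r3bfEvTVl6pCNw5xwUhEglwDHjWtAqQp",
    "speaker_aware.pth": "1rV0jkyDqPW-aDJcj7xSO6Zt1zSXqn1mu",
    "image2image.pth": "1i2LJXKp-yWKIEEgJ7C6cE3_2NirfY_0a",
}

for filename, file_id in MODELS.items():
    gdown.download(f"https://drive.google.com/uc?id={file_id}",
                   os.path.join(CKPT_DIR, filename), quiet=False)
```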
## Animate Your Portraits!

- Download the pre-trained embedding [[here]](https://drive.google.com/file/d/18-0CYl5E6ungS3H4rRSHjfYvvm-WwjTI/view?usp=sharing) and save it to the `examples/dump` folder.

### _Natural Human Faces / Paintings_

- Crop your portrait image to `256x256` and put it in the `examples` folder in `.jpg` format.
Make sure the head is roughly centered (check the existing examples for reference); a cropping sketch follows at the end of this section.

- Put your test audio files (`.wav` format) in the `examples` folder as well.

- Animate!

```
python main_end2end.py --jpg <portrait_file>
```

- Use the additional args `--amp_lip_x <x> --amp_lip_y <y> --amp_pos <pos>`
to amplify lip motion (in the x/y-axis directions) and head motion displacements; the default values are `<x>=2., <y>=2., <pos>=.5`.
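As referenced above, here is a minimal cropping sketch using Pillow; the input and output filenames are illustrative and not part of the repo.

```python
# crop_portrait.py -- center-crop a photo to a 256x256 .jpg (a sketch, assuming Pillow).
from PIL import Image

def center_crop_256(src_path: str, dst_path: str) -> None:
    img = Image.open(src_path).convert("RGB")
    side = min(img.size)                      # largest centered square
    left = (img.width - side) // 2
    top = (img.height - side) // 2
    img = img.crop((left, top, left + side, top + side))
    img.resize((256, 256), Image.LANCZOS).save(dst_path, quality=95)

center_crop_256("my_photo.png", "examples/my_portrait.jpg")
```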
### _Cartoon Faces_

- Put your test audio files (`.wav` format) in the `examples` folder as well.

- Animate one of the existing puppets (a batch-run sketch follows this list):

| Puppet Name | wilk | roy | sketch | color | cartoonM | danbooru1 |
| :---: | :---: | :---: | :---: | :---: | :---: | :---: |
| Image |  |  |  |  |  |  |

```
python main_end2end_cartoon.py --jpg <cartoon_puppet_name>
```

- Create your own puppets (ToDo...)
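To run the command above over every bundled puppet in one go, a small driver sketch (it assumes each puppet name in the table is accepted by `--jpg` as-is):

```python
# animate_all_puppets.py -- loop the documented CLI over the bundled puppets (a sketch).
import subprocess

PUPPETS = ["wilk", "roy", "sketch", "color", "cartoonM", "danbooru1"]

for puppet in PUPPETS:
    # Equivalent to running `python main_end2end_cartoon.py --jpg <name>` by hand.
    subprocess.run(["python", "main_end2end_cartoon.py", "--jpg", puppet], check=True)
```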
## Train

### Train Voice Conversion Module
Todo...

### Train Content Branch
- Create the dataset root directory `<root_dir>` (a folder-layout sketch follows the train script below).

- Dataset: Download the preprocessed dataset [[here]](https://drive.google.com/drive/folders/1EwuAy3j1b9Zc1MsidUfxG_pJGc_cV60O?usp=sharing) and put it under `<root_dir>/dump`.

- Train script: Run the script below. Models will be saved in `<root_dir>/ckpt/<train_instance_name>`.

  ```shell script
  python main_train_content.py --train --write --root_dir <root_dir> --name <train_instance_name>
  ```
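As noted in the first bullet, here is a minimal sketch for laying out `<root_dir>`, assuming only the `dump` and `ckpt` subfolders referenced above are needed:

```python
# prepare_root_dir.py -- create the folders referenced above (a sketch).
from pathlib import Path

root_dir = Path("train_root")  # stand-in for <root_dir>
for sub in ("dump", "ckpt"):
    (root_dir / sub).mkdir(parents=True, exist_ok=True)
print("Created:", ", ".join(str(root_dir / sub) for sub in ("dump", "ckpt")))
```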
### Train Speaker-Aware Branch
Todo...

### Train Image-to-Image Translation
Todo...

## [License](LICENSE.md)

## Acknowledgement
We would like to thank Timothy Langlois for the narration, and
[Kaizhi Qian](https://scholar.google.com/citations?user=uEpr4C4AAAAJ&hl=en)
for the help with the [voice conversion module](https://auspicious3000.github.io/icassp-2020-demo/). We
thank Daichi Ito for sharing the caricature image and Dave Werner
for Wilk, the gruff but ultimately lovable puppet.

This research is partially funded by NSF (EAGER-1942069)
and a gift from Adobe. Our experiments were performed in the
UMass GPU cluster obtained under the Collaborative Fund managed
by the MassTech Collaborative.