- [ongoing] Frame head parameters.
- [2023/05] BEAT_GENEA is allowed for pretraining in GENEA2023! Thanks for GENEA's organizers!
- [2023/03] Samples and readme for SMPL-X body and hands data.
- [2023/01] English data v0.2.1 are available. Fix the orientation issue. See updation for details.
- [2023/01] Provide checkpoints (#14, #16), scripts for rendering (#17), preprocessing (#18).
- [2022/12] Provide English data in Zip files (#10).
- [2022/10] Project page and rendered videos are available.
- [2022/08] All languages data v0.1.0 (in separated files) are available.
- [2022/03] CaMN training scripts from anonymous submission.
- 10-Scale Semantic Relevancy: BEAT provides a score and category-label for semantic relevancy between gestures and speech content: no gestures (0), beat gestures (1), low-middle-high quaility deictic gestures (2,3,4), iconic gestures (5,6,7), metaphoric gestures (8,9,10).
- 8-Class Emotional Gestures: For each speaker, data in speech section are recorded with eight emotions: neutral, happiness, anger, sadness, contempt, surprise, fear, and disgust. Data in conversation are labeled as neutral.
- 4-Modality Captured Data: With 16 cameras motion capture system and iphone arkit, BEAT recorded data in four modalities: 75 joints' motion, 52 dimensions blendshape weights, audio and text.
- 76-Hour and 30-Speaker: BEAT (English data) consists of 10 recorded four hours speakers and 20 recorded one hour speakers.
- 4-Language: BEAT contains four types of languages: English (60h), Chinese (12h), Spanish (2h) and Japanese (2h). For the latter three languages, speakers also record English data to provide paired data.
- 2-Scenario: BEAT provides speech (50%) and conversation (50%) recording.
Gesture Generation on BEAT-16h (speaker 2,4,6,8 in English data v0.2.1)
Method | Venue | Input Modalities | FID** | SRGR | BeatAlign | ckpt |
---|---|---|---|---|---|---|
Seq2Seq | ICRA'19 | text | 261.3 | 0.173 | 0.729 | - |
Speech2Gestures | CVPR'19 | audio | 256.7 | 0.092 | 0.751 | - |
Audio2Gestures | ICCV'21 | audio | 223.8 | 0.097 | 0.766 | - |
MultiContext | SIGGRAPH ASIA'20 | audio, text | 176.2 (177.2*) | 0.195 (0.227) | 0.776 (0.751) | link |
CaMN | ECCV'22 | audio, text, facial | 123.7 (122.8) | 0.239 (0.240) | 0.783 (0.782) | link |
*Checkpoints results trained from this repo. are denoted in parentheses. Results in paper are from codes: seq2seq, s2g, a2g, mutli, camn.
**Pretrained 300D AutoEncoder for FID calculation.
- name:
1_wayne_0_1_8
- where
1_wayne
is speaker id and name, 0
is the recording type:0
English Speech,1
English Conversation,2
Chinese Speech,3
Chinese Conversation,4
Spanish Speech,5
Spanish Conversation,6
Japanese Speech,7
Japanese Conversation.1_8
is thestart
andend
id for the current sequence, where range is1-118
for speech and1-12
for conversation.- for speech section:
0-64
neutral,65-72
happiness,73-80
anger,81-86
sadness,87-94
contempt,95-102
surprise,103-110
fear,111-118
disgust.
- where
- format:
120
FPS.bvh
for motion, usingZ-up
,Y-forward
in blender, right-hand system.60
FPS.json
for facial blendshape weights.16K
HZ.wav
for audio..TextGrid
for text.csv
for emotion label,0-7
: neutral, happiness, anger, sadness, contempt, surprise, fear, and disgust..txt
for semantic label, intypes
,start
,end
,duration
,score
, andkeywords
.
- missing sequences:
- speaker 9:
0_2_8
, speaker 21:0_1_8
.
- speaker 9:
Script is in /dataloaders/preprocessing.ipynb
, ratio: 2880:500:500
split_rule_english = {
# 4h speakers x 10
"1, 2, 3, 4, 6, 7, 8, 9, 11, 21":{
# 48+40+100=188mins each
"train": [
"0_9_9", "0_10_10", "0_11_11", "0_12_12", "0_13_13", "0_14_14", "0_15_15", "0_16_16", \
"0_17_17", "0_18_18", "0_19_19", "0_20_20", "0_21_21", "0_22_22", "0_23_23", "0_24_24", \
"0_25_25", "0_26_26", "0_27_27", "0_28_28", "0_29_29", "0_30_30", "0_31_31", "0_32_32", \
"0_33_33", "0_34_34", "0_35_35", "0_36_36", "0_37_37", "0_38_38", "0_39_39", "0_40_40", \
"0_41_41", "0_42_42", "0_43_43", "0_44_44", "0_45_45", "0_46_46", "0_47_47", "0_48_48", \
"0_49_49", "0_50_50", "0_51_51", "0_52_52", "0_53_53", "0_54_54", "0_55_55", "0_56_56", \
"0_66_66", "0_67_67", "0_68_68", "0_69_69", "0_70_70", "0_71_71", \
"0_74_74", "0_75_75", "0_76_76", "0_77_77", "0_78_78", "0_79_79", \
"0_82_82", "0_83_83", "0_84_84", "0_85_85", \
"0_88_88", "0_89_89", "0_90_90", "0_91_91", "0_92_92", "0_93_93", \
"0_96_96", "0_97_97", "0_98_98", "0_99_99", "0_100_100", "0_101_101", \
"0_104_104", "0_105_105", "0_106_106", "0_107_107", "0_108_108", "0_109_109", \
"0_112_112", "0_113_113", "0_114_114", "0_115_115", "0_116_116", "0_117_117", \
"1_2_2", "1_3_3", "1_4_4", "1_5_5", "1_6_6", "1_7_7", "1_8_8", "1_9_9", "1_10_10", "1_11_11",
],
# 8+7+10=25mins each
"val": [
"0_57_57", "0_58_58", "0_59_59", "0_60_60", "0_61_61", "0_62_62", "0_63_63", "0_64_64", \
"0_72_72", "0_80_80", "0_86_86", "0_94_94", "0_102_102", "0_110_110", "0_118_118", \
"1_12_12",
],
# 8+7+10=25mins each
"test": [
"0_1_1", "0_2_2", "0_3_3", "0_4_4", "0_5_5", "0_6_6", "0_7_7", "0_8_8", \
"0_65_65", "0_73_73", "0_81_81", "0_87_87", "0_95_95", "0_103_103", "0_111_111", \
"1_1_1",
],
},
# 1h speakers x 20
"5, 10, 12, 13, 14, 15, 16, 17, 18, 19, 20, 22, 23, 24, 25, 26, 27, 28, 29, 30":{
# 8+7+20=35mins each
"train": [
"0_9_9", "0_10_10", "0_11_11", "0_12_12", "0_13_13", "0_14_14", "0_15_15", "0_16_16", \
"0_66_66", "0_74_74", "0_82_82", "0_88_88", "0_96_96", "0_104_104", "0_112_112", "0_118_118", \
"1_2_2", "1_3_3",
"1_0_0", "1_4_4", # for speaker 29 only
],
# 4+3.5+5 = 12.5mins each
# 0_65_a and 0_65_b denote the frist and second half of sequence 0_65_65
"val": [
"0_5_5", "0_6_6", "0_7_7", "0_8_8", \
"0_65_b", "0_73_b", "0_81_b", "0_87_b", "0_95_b", "0_103_b", "0_111_b", \
"1_1_b",
],
# 4+3.5+5 = 12.5mins each
"test": [
"0_1_1", "0_2_2", "0_3_3", "0_4_4", \
"0_65_a", "0_73_a", "0_81_a", "0_87_a", "0_95_a", "0_103_a", "0_111_a", \
"1_1_a",
],
},
}
- Scripts for videos in rendered videos
- Annotation tools for semantic relevancy.
- Avatars in our paper is from HumanGenerator V3, we could share the avatars after confirming your liscense of HGv3 by email.
python == 3.7
- build folders like:
audio2pose ├── codes │ └── audio2pose ├── datasets │ ├── beat_raw_data │ ├── beat_annotations │ └── beat_cache └── outputs └── audio2pose ├── custom └── wandb
- download the scripts to
codes/audio2pose/
- run
pip install -r requirements.txt
in the path./codes/audio2pose/
- download full dataset to
datasets/beat
- bulid data cache and calculate mean and std by given
number of joints
,FPS
,speakers
using/dataloader/preprocessing.ipynb
cd ./dataloaders && python build_vocab.py
for language model- run
python train.py -c ./configs/ae_4english_15_141.yaml
for pretrained_ae for FID calculation, or download pretrained ckpt to/datasets/beat_cache/cache_name/weights/
- run
python train.py -c ./configs/camn_4english_15_141.yaml
for training or or download pretrained ckpt to/datasets/beat_cache/cache_name/weights/
. - run
python test.py -c ./configs/camn_4english_15_141.yaml
for inference. - load
./outputs/audio2pose/custom/exp_name/epoch_number/xxx.bvh
into blender to visualize the test results.
- refer
train and test CaMN
for bvh cache - remove modalities, e.g., remove facial expressions.
- set
facial_rep: None
andfacial_f: 0
in.yaml
- set
dataset: trinity
in.yaml
- set
BEAT is established for the following research project. Please consider cite our work if you use BEAT dataset.
@article{liu2022beat,
title = {BEAT: A Large-Scale Semantic and Emotional Multi-Modal Dataset for Conversational Gestures Synthesis},
author = {Haiyang Liu, Zihao Zhu, Naoya Iwamoto, Yichen Peng, Zhengqing Li, You Zhou, Elif Bozkurt, Bo Zheng},
journal = {European Conference on Computer Vision},
year = {2022}
}