
Mix and Localize: Localizing Sound Sources in Mixtures

Xixi Hu*, Ziyang Chen*, Andrew Owens
University of Michigan
CVPR 2022


This repository contains the official codebase for Mix and Localize: Localizing Sound Sources in Mixtures. [Project Page]

Cycle-consistent multi-source localization

MUSIC Dataset

  1. Download the MUSIC dataset here: MUSIC repo

  2. Postprocess the MUSIC dataset and extract the frames and audio clips. The dataset folder is structured as follows (a preprocessing sketch is given after the layout).

    data
    └── MUSIC
        ├── data-splits
        └── MUSIC_raw
            ├── duet
            └── solo
                └── [class_label]
                    └── [ytid]
                        ├── audio
                        │   └── audio_clips
                        │       ├── 00000.wav    # 1-second audio clips
                        │       ├── 00001.wav
                        │       └── ...
                        └── frames
                            ├── 00000.jpg        # fps = 4
                            └── ...

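This section does not ship a preprocessing script, so here is a minimal sketch of step 2, assuming each downloaded video sits under [class_label]/[ytid] and that ffmpeg is installed. The function name, the mono downmix, and the exact ffmpeg flags are illustrative assumptions, not the official pipeline.

    # Hypothetical preprocessing sketch (not the official script): extract
    # 4-fps frames and 1-second mono audio clips from one downloaded video,
    # matching the MUSIC_raw layout shown above.
    import subprocess
    from pathlib import Path

    def preprocess_music_video(video_path: Path, out_dir: Path, fps: int = 4):
        frames_dir = out_dir / "frames"
        clips_dir = out_dir / "audio" / "audio_clips"
        frames_dir.mkdir(parents=True, exist_ok=True)
        clips_dir.mkdir(parents=True, exist_ok=True)
        # Frames: 00000.jpg, 00001.jpg, ... sampled at `fps`.
        subprocess.run(
            ["ffmpeg", "-i", str(video_path), "-vf", f"fps={fps}",
             "-start_number", "0", str(frames_dir / "%05d.jpg")],
            check=True)
        # Audio: strip the video stream, downmix to mono, and cut into
        # 1 s segments 00000.wav, 00001.wav, ... (numbering starts at 0).
        subprocess.run(
            ["ffmpeg", "-i", str(video_path), "-vn", "-ac", "1",
             "-f", "segment", "-segment_time", "1",
             str(clips_dir / "%05d.wav")],
            check=True)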
Training on MUSIC dataset

python train.py --setting="music_multi_nodes" --exp="exp_music" --batch_size=128 --epoch=30 

You can also download the pretrained model for the MUSIC dataset here.

VoxCeleb Dataset

  1. Download the VoxCeleb2 dataset here: VoxCeleb repo

  2. Postprocess the VoxCeleb2 dataset and extract the frames and audio clips. The dataset folder is structured as follows (a preprocessing sketch is given after the layout).

    data
    └── VoxCeleb
        ├── data-splits
        └── VoxCeleb2
            └── [idxxxxx]
                └── [video_clip_name]          # 5 s clip
                    ├── audio
                    │   └── audio.wav
                    └── frames
                        ├── frame000001.jpg    # fps = 10
                        └── ...

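Analogously, a minimal sketch of step 2 for VoxCeleb2, assuming each 5 s clip is a single video file and ffmpeg is installed. The function name and flags are again illustrative assumptions; frame numbering starts at 1 to match frame000001.jpg above.

    # Hypothetical sketch (not the official script): extract 10-fps frames
    # and the full audio track from one 5 s VoxCeleb2 clip, matching the
    # layout shown above.
    import subprocess
    from pathlib import Path

    def preprocess_voxceleb_clip(clip_path: Path, out_dir: Path, fps: int = 10):
        frames_dir = out_dir / "frames"
        audio_dir = out_dir / "audio"
        frames_dir.mkdir(parents=True, exist_ok=True)
        audio_dir.mkdir(parents=True, exist_ok=True)
        # Frames: frame000001.jpg, frame000002.jpg, ... sampled at `fps`.
        subprocess.run(
            ["ffmpeg", "-i", str(clip_path), "-vf", f"fps={fps}",
             str(frames_dir / "frame%06d.jpg")],
            check=True)
        # Audio: one mono wav per clip (mono downmix is an assumption).
        subprocess.run(
            ["ffmpeg", "-i", str(clip_path), "-vn", "-ac", "1",
             str(audio_dir / "audio.wav")],
            check=True)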
Training on VoxCeleb dataset

python train.py --setting="voxceleb_multi_nodes" --exp="exp_voxceleb" --batch_size=128 --lr=1e-4 --epoch=30 

You can also download the pretrained model for the VoxCeleb2 dataset here.

VGGSound annotations

We filtered and annotated segmentation masks for 446 high-quality video frames in VGGSound-Instruments. The annotations can be found here.
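If you want to evaluate against these masks, the sketch below shows one way to load and binarize a single mask. The .png file format, the example file name, and the 0/255 pixel encoding are all assumptions, since the README does not specify the annotation format; adapt them to the released archive.

    # Hypothetical sketch: load one annotated segmentation mask and binarize it.
    # The file name, .png extension, and 0/255 encoding are assumptions.
    import numpy as np
    from PIL import Image

    mask = np.array(Image.open("annotations/example_frame_mask.png").convert("L"))
    binary_mask = mask > 127   # boolean array: True where the source is annotated
    print(binary_mask.shape, binary_mask.mean())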