--- Now Updating ---
This repository contains the implementation of "Audio-visual Action Recognition using Transformer Fusion Network". The code is based on the GitHub repository of the "Swin Transformer" paper (https://github.com/microsoft/Swin-Transformer).
Dataset preparation
-
UCF-sound (a subset of UCF-101)
- Download the UCF-101 dataset.
- The classes listed below are the ones whose videos include an audio track. Separate these classes from the rest of the UCF-101 dataset. (CliffDiving/Rafting/SoccerPenalty/BabyCrawling/LongJump/Hammering/HandstandWalking/CuttingInKitchen/StillRings/BoxingPunchingBag/PlayingDhol/Surfing/BrushingTeeth/Archery/IceDancing/MoppingFloor/PlayingFlute/BoxingSpeedBag/ParallelBars/UnevenBars/Typing/PlayingCello/TableTennisShot/BasketballDunk/ApplyLipstick/BalanceBeam/PlayingDaf/SumoWrestling/CricketShot/Knitting/FloorGymnastics/Shotput/WritingOnBoard/ShavingBeard/Haircut/BlowingCandles/PlayingSitar/HeadMassage/FrontCrawl/BodyWeightSquats/BandMarching/FrisbeeCatch/FieldHockeyPenalty/HandstandPushups/BlowDryHair/Bowling/WallPushups/CricketBowling/SkyDiving/HammerThrow)
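
The separation step could be scripted roughly as follows. This is only a sketch and not part of this repository: it assumes the clips keep the usual UCF-101 file naming (`v_<Class>_gXX_cXX.avi`), and the helper names `parse_class_list`, `ucf_class_of` and `copy_subset` are hypothetical. Matching is case-insensitive because file names and class names can differ in capitalization.

```python
import re
import shutil
from pathlib import Path

# Assumed UCF-101 file naming, e.g. "v_ApplyLipstick_g01_c01.avi".
_UCF_NAME = re.compile(r"^v_(?P<cls>[A-Za-z]+)_g\d+_c\d+$")


def parse_class_list(text):
    """Split a slash-separated class list (as written above) into a set."""
    return {name.strip() for name in text.split("/") if name.strip()}


def ucf_class_of(path):
    """Return the class name encoded in a UCF-101 file name, or None."""
    match = _UCF_NAME.match(Path(path).stem)
    return match.group("cls") if match else None


def copy_subset(src_root, dst_root, classes):
    """Copy every .avi clip whose class is in `classes` to dst_root/<class>/.

    Returns the number of copied clips.
    """
    wanted = {name.lower(): name for name in classes}
    copied = 0
    for clip in sorted(Path(src_root).rglob("*.avi")):
        cls = ucf_class_of(clip)
        canonical = wanted.get(cls.lower()) if cls else None
        if canonical is None:
            continue  # not a UCF-style name, or a class without audio
        target = Path(dst_root) / canonical
        target.mkdir(parents=True, exist_ok=True)
        shutil.copy2(clip, target / clip.name)
        copied += 1
    return copied
```

To use it, paste the full slash-separated list from above into `parse_class_list(...)` and call `copy_subset("UCF-101", "UCF-sound", classes)`; both directory names are placeholders.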
-
Kinetics-sound (a subset of Kinetics-400)
- Download the Kinetics-400 dataset.
- The classes listed below are the ones whose videos include an audio track. Separate these classes from the rest of the Kinetics-400 dataset. (playingtrumpet/stompinggrapes/shovelingsnow/playingclarinet/strummingguitar/blowingnose/playingxylophone/blowingoutcandles/rippingpaper/tapdancing/bowling/laughing/playingbassguitar/playingviolin/playingkeyboard/playingtrombone/tappingpen/dribblingbasketball/playingdrums/choppingwood/singing/playingbagpipes/mowinglawn/playingorgan/playingpiano/shufflingcards/playingguitar/playingaccordion/tickling/playingharmonica/tappingguitar/playingsaxophone)
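
The class names above are written without spaces, so they may not match the labels in your Kinetics-400 copy character for character. A minimal sketch of one way to match them, assuming your labels use spaces or underscores (for example `playing trumpet`); `normalize_label` and `select_labels` are hypothetical helpers, not part of this repository:

```python
def normalize_label(label):
    """Lower-case a label and keep only letters and digits."""
    return "".join(ch for ch in label.lower() if ch.isalnum())


def select_labels(available, wanted):
    """Return the labels in `available` that match a name in `wanted`.

    Both sides are normalized first, so "playing trumpet",
    "playing_trumpet" and "playingtrumpet" all compare equal.
    """
    wanted_keys = {normalize_label(name) for name in wanted}
    return sorted(
        label for label in available if normalize_label(label) in wanted_keys
    )
```

The returned labels keep their original spelling, so they can be used directly to pick the matching folders or annotation rows.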
-
Extract the frames from the video.
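
One possible way to do this is to call `ffmpeg` from Python. The sketch below is an assumption rather than this repository's own tooling: it requires `ffmpeg` on the `PATH`, and the output naming (`frame_00001.jpg`, ...) and the optional `fps` resampling are illustrative choices, so adjust them to whatever layout the data loader expects.

```python
import subprocess
from pathlib import Path


def frame_command(video, out_dir, fps=None):
    """Build an ffmpeg command that dumps the frames of `video` as JPEGs."""
    cmd = ["ffmpeg", "-i", str(video)]
    if fps is not None:
        # Resample to a fixed frame rate (assumption; omit to keep the source rate).
        cmd += ["-vf", f"fps={fps}"]
    # -q:v 2 asks for high JPEG quality; %05d numbers the frames from 00001.
    cmd += ["-q:v", "2", str(Path(out_dir) / "frame_%05d.jpg")]
    return cmd


def extract_frames(video, out_dir, fps=None):
    """Create `out_dir` and run ffmpeg; raises CalledProcessError on failure."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(frame_command(video, out_dir, fps), check=True)
```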
-
Extract the audio from the video as a WAV file.
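
The audio track can be written out with `ffmpeg` as well. The following is a sketch under assumptions: the 16 kHz mono 16-bit PCM format is a placeholder, not a setting confirmed by this repository, and `wav_command`, `extract_wav` and `wav_info` are hypothetical names. `wav_info` uses the standard-library `wave` module to check the result.

```python
import subprocess
import wave
from pathlib import Path


def wav_command(video, wav_path, sample_rate=16000):
    """Build an ffmpeg command that writes the audio track as 16-bit PCM WAV."""
    return [
        "ffmpeg", "-y",            # overwrite an existing output file
        "-i", str(video),
        "-vn",                     # drop the video stream
        "-acodec", "pcm_s16le",    # 16-bit little-endian PCM
        "-ar", str(sample_rate),   # assumed sample rate
        "-ac", "1",                # assumed mono
        str(wav_path),
    ]


def extract_wav(video, wav_path, sample_rate=16000):
    """Run ffmpeg; raises CalledProcessError, e.g. if the clip has no audio."""
    Path(wav_path).parent.mkdir(parents=True, exist_ok=True)
    subprocess.run(wav_command(video, wav_path, sample_rate), check=True)


def wav_info(wav_path):
    """Return (sample_rate, channels, duration_seconds) of a WAV file."""
    with wave.open(str(wav_path), "rb") as wav:
        rate = wav.getframerate()
        return rate, wav.getnchannels(), wav.getnframes() / rate
```

Checking `wav_info` after extraction is a cheap way to spot clips whose audio came out empty or much shorter than the video.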
Model variants
- IVA: all inputs, i.e. a single frame, T frames, and audio
- IV: a single frame and T frames, without audio
- IA: a single frame and its corresponding audio
- VA: T frames and their corresponding audio
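
Each letter in a variant name stands for one input: I for the single frame, V for the T frames, A for the audio. As an illustration only (the dictionary keys `image`, `video` and `audio` and both helper names are hypothetical, and this is not how the repository itself selects inputs), the mapping could be expressed as:

```python
# Letter -> input name; the names on the right are illustrative assumptions.
MODALITY_OF = {"I": "image", "V": "video", "A": "audio"}

VARIANTS = ("IVA", "IV", "IA", "VA")


def modalities(variant):
    """Return the input names used by a variant, e.g. "IA" -> ["image", "audio"]."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant!r}")
    return [MODALITY_OF[letter] for letter in variant]


def select_inputs(variant, sample):
    """Keep only the entries of `sample` that the variant consumes."""
    return {name: sample[name] for name in modalities(variant)}
```

For example, `select_inputs("VA", sample)` drops the single frame and keeps only the T frames and the audio.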