A curated list of resources on action recognition and related areas (e.g. object recognition, pose estimation), inspired by awesome-computer-vision.
- Deep Learning for Videos: A 2018 Guide to Action Recognition - Summary of landmark action recognition research papers up to 2018.
- Why Can't I Dance in the Mall? Learning to Mitigate Scene Bias in Action Recognition - J. Choi et al., NeurIPS2019. [project web] [code] [arXiv]
- Large-scale weakly-supervised pre-training for video action recognition - D. Ghadiyaram et al., arXiv2019.
- Video Classification with Channel-Separated Convolutional Networks - D. Tran et al., arXiv2019.
- DistInit: Learning Video Representations without a Single Labeled Video - R. Girdhar et al., arXiv2019.
- SCSampler: Sampling Salient Clips from Video for Efficient Action Recognition - B. Korbar et al., arXiv2019.
- Video Action Transformer Network - R. Girdhar et al., CVPR2019. [project web]
- Learning Correspondence from the Cycle-consistency of Time - X. Wang et al., CVPR2019. [code] [project web]
- Representation Flow for Action Recognition - AJ Piergiovanni and M. S. Ryoo, CVPR2019.
- Collaborative Spatiotemporal Feature Learning for Video Action Recognition - C. Li et al., CVPR2019.
- Learning Video Representations from Correspondence Proposals - X. Liu et al., CVPR2019.
- Timeception for Complex Action Recognition - N. Hussein et al., CVPR2019.
- The Visual Centrifuge: Model-Free Layered Video Representations - J.-B. Alayrac et al., CVPR2019.
- Long-Term Feature Banks for Detailed Video Understanding - C.-Y. Wu et al., CVPR2019. [code]
- Temporal Relational Reasoning in Videos - B. Zhou et al., ECCV2018. [code] [project web]
- Action Recognition Zoo - Code for popular action recognition models, written in PyTorch and verified on the Something-Something dataset.
- Videos as Space-Time Region Graphs - X. Wang and A. Gupta, ECCV2018.
- Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet? - K. Hara et al., CVPR2018. [code]
- A Closer Look at Spatiotemporal Convolutions for Action Recognition - D. Tran et al., CVPR2018. [code] [PyTorch]
- Attend and Interact: Higher-Order Object Interactions for Video Understanding - C.-Y. Ma et al., CVPR2018.
- Non-Local Neural Networks - X. Wang et al., CVPR2018. [code]
- Rethinking Spatiotemporal Feature Learning For Video Understanding - S. Xie et al., arXiv2017.
- ConvNet Architecture Search for Spatiotemporal Feature Learning - D. Tran et al, arXiv2017. Note: Aka Res3D. [code]: In the repository, C3D-v1.1 is the Res3D implementation.
- Learning Spatio-Temporal Representation with Pseudo-3D Residual Networks - Z. Qiu et al, ICCV2017. [code]
- Quo Vadis, Action Recognition? A New Model and the Kinetics Dataset - J. Carreira et al, CVPR2017. [code], [PyTorch code], [another PyTorch code]
- Learning Spatiotemporal Features with 3D Convolutional Networks - D. Tran et al, ICCV2015. [the official Caffe code] [project web] Note: Aka C3D. [Python Wrapper] Note: the official Caffe code does not support a Python wrapper. [TensorFlow], [TensorFlow + Keras], [Another TensorFlow Implementation], [Keras C3D Project web]: [Keras code], [Pretrained weights].
- Deep Temporal Linear Encoding Networks - A. Diba et al, CVPR2017.
- Temporal Convolutional Networks: A Unified Approach to Action Segmentation and Detection - C. Lea et al, CVPR 2017. [code]
- Long-term Temporal Convolutions - G. Varol et al, TPAMI2017. [project web] [code]
- Temporal Segment Networks: Towards Good Practices for Deep Action Recognition - L. Wang et al, arXiv 2016. [code]
- Convolutional Two-Stream Network Fusion for Video Action Recognition - C. Feichtenhofer et al, CVPR2016. [code]
- Two-Stream Convolutional Networks for Action Recognition in Videos - K. Simonyan and A. Zisserman, NIPS2014.
- [3D ResNet PyTorch]
- [PyTorch Video Research]
- [M-PACT: Michigan Platform for Activity Classification in Tensorflow]
- [Inflated models on PyTorch]
- [I3D models transferred from TensorFlow to PyTorch]
- [A Two Stream Baseline on Kinetics dataset]
- [MMAction]
- Neural Graph Matching Networks for Fewshot 3D Action Recognition - M. Guo et al., ECCV2018.
- Temporal 3D ConvNets using Temporal Transition Layer - A. Diba et al., CVPRW2018.
- Temporal 3D ConvNets: New Architecture and Transfer Learning for Video Classification - A. Diba et al., arXiv2017.
- Attentional Pooling for Action Recognition - R. Girdhar and D. Ramanan, NIPS2017. [code]
- Fully Context-Aware Video Prediction - Byeon et al, arXiv2017.
- Hidden Two-Stream Convolutional Networks for Action Recognition - Y. Zhu et al, arXiv2017. [code]
- Dynamic Image Networks for Action Recognition - H. Bilen et al, CVPR2016. [code] [project web]
- Long-term Recurrent Convolutional Networks for Visual Recognition and Description - J. Donahue et al, CVPR2015. [code] [project web]
- Describing Videos by Exploiting Temporal Structure - L. Yao et al, ICCV2015. [code] Note: from the same group as the RCN paper "Delving Deeper into Convolutional Networks for Learning Video Representations".
- Two-Stream SR-CNNs for Action Recognition in Videos - L. Wang et al, BMVC2016.
- Real-time Action Recognition with Enhanced Motion Vector CNNs - B. Zhang et al, CVPR2016. [code]
- Action Recognition with Trajectory-Pooled Deep-Convolutional Descriptors - L. Wang et al, CVPR2015. [code]
- Actional-Structural Graph Convolutional Networks for Skeleton-Based Action Recognition - M. Li et al., CVPR2019.
- An Attention Enhanced Graph Convolutional LSTM Network for Skeleton-Based Action Recognition - C. Si et al., CVPR2019.
- View Adaptive Neural Networks for High Performance Skeleton-Based Human Action Recognition - P. Zhang et al., TPAMI2019.
- Spatial Temporal Graph Convolutional Networks for Skeleton-Based Action Recognition - S. Yan et al., AAAI2018. [code]
- Deep Progressive Reinforcement Learning for Skeleton-Based Action Recognition - Y. Tang et al., CVPR2018.
- Co-occurrence Feature Learning from Skeleton Data for Action Recognition and Detection with Hierarchical Aggregation - C. Li et al., IJCAI2018.
- Part-based Graph Convolutional Network for Action Recognition - K. Thakkar et al., BMVC2018.
- Rethinking the Faster R-CNN Architecture for Temporal Action Localization - Yu-Wei Chao et al., CVPR2018
- Weakly Supervised Action Localization by Sparse Temporal Pooling Network - Phuc Nguyen et al., CVPR 2018
- Temporal Deformable Residual Networks for Action Segmentation in Videos - P. Lei and S. Todorovic, CVPR2018.
- End-to-End, Single-Stream Temporal Action Detection in Untrimmed Videos - Shyamal Buch et al., BMVC 2017 [code]
- Cascaded Boundary Regression for Temporal Action Detection - Jiyang Gao et al., BMVC 2017 [code]
- Temporal Tessellation: A Unified Approach for Video Analysis - Kaufman et al., ICCV2017. [code]
- Temporal Action Detection with Structured Segment Networks - Y. Zhao et al., ICCV2017. [code] [project web]
- Temporal Context Network for Activity Localization in Videos - X. Dai et al., ICCV2017.
- Detecting the Moment of Completion: Temporal Models for Localising Action Completion - F. Heidarivincheh et al., arXiv2017.
- CDC: Convolutional-De-Convolutional Networks for Precise Temporal Action Localization in Untrimmed Videos - Z. Shou et al, CVPR2017. [code]
- SST: Single-Stream Temporal Action Proposals - S. Buch et al, CVPR2017. [code]
- R-C3D: Region Convolutional 3D Network for Temporal Activity Detection - H. Xu et al, arXiv2017. [code] [project web] [PyTorch]
- DAPs: Deep Action Proposals for Action Understanding - V. Escorcia et al, ECCV2016. [code] [raw data]
- Online Action Detection using Joint Classification-Regression Recurrent Neural Networks - Y. Li et al, ECCV2016. Note: RGB-D Action Detection
- Temporal Action Localization in Untrimmed Videos via Multi-stage CNNs - Z. Shou et al, CVPR2016. [code] Note: Aka S-CNN.
- Fast Temporal Activity Proposals for Efficient Detection of Human Actions in Untrimmed Videos - F. Heilbron et al, CVPR2016. [code] Note: Depends on C3D, aka SparseProp.
- Actionness Estimation Using Hybrid Fully Convolutional Networks - L. Wang et al, CVPR2016. [code] Note: The code is not a complete version. It only contains a demo, not training. [project web]
- Learning Activity Progression in LSTMs for Activity Detection and Early Detection - S. Ma et al, CVPR2016.
- End-to-end Learning of Action Detection from Frame Glimpses in Videos - S. Yeung et al, CVPR2016. [code] [project web] Note: This method uses reinforcement learning
- Fast Action Proposals for Human Action Detection and Search - G. Yu and J. Yuan, CVPR2015. Note: code for FAP is NOT available online. Note: Aka FAP.
- Bag-of-fragments: Selecting and encoding video fragments for event detection and recounting - P. Mettes et al, ICMR2015.
- Action localization in videos through context walk - K. Soomro et al, ICCV2015.
- A Better Baseline for AVA - R. Girdhar et al., ActivityNet Workshop, CVPR2018.
- Real-Time End-to-End Action Detection with Two-Stream Networks - A. El-Nouby and G. Taylor, arXiv2018.
- Human Action Localization with Sparse Spatial Supervision - P. Weinzaepfel et al., arXiv2017.
- Unsupervised Action Discovery and Localization in Videos - K. Soomro and M. Shah, ICCV2017.
- Spatial-Aware Object Embeddings for Zero-Shot Localization and Classification of Actions - P. Mettes and C. G. M. Snoek, ICCV2017.
- Action Tubelet Detector for Spatio-Temporal Action Localization - V. Kalogeiton et al, ICCV2017. [code] [project web]
- Tube Convolutional Neural Network (T-CNN) for Action Detection in Videos - R. Hou et al, ICCV2017. [project web]
- Chained Multi-stream Networks Exploiting Pose, Motion, and Appearance for Action Classification and Detection - M. Zolfaghari et al, ICCV2017. [project web]
- TORNADO: A Spatio-Temporal Convolutional Regression Network for Video Action Proposal - H. Zhu et al., ICCV2017.
- Online Real time Multiple Spatiotemporal Action Localisation and Prediction - G. Singh et al, ICCV2017. [code]
- AMTnet: Action-Micro-Tube regression by end-to-end trainable deep architecture - S. Saha et al, ICCV2017.
- Am I Done? Predicting Action Progress in Videos - F. Becattini et al, BMVC2017.
- Generic Tubelet Proposals for Action Localization - J. He et al, arXiv2017.
- Incremental Tube Construction for Human Action Detection - H. S. Behl et al, arXiv2017.
- Multi-region two-stream R-CNN for action detection - X. Peng and C. Schmid, ECCV2016. [code]
- Spot On: Action Localization from Pointly-Supervised Proposals - P. Mettes et al, ECCV2016.
- Deep Learning for Detecting Multiple Space-Time Action Tubes in Videos - S. Saha et al, BMVC2016. [code] [project web]
- Learning to track for spatio-temporal action localization - P. Weinzaepfel et al, ICCV2015.
- Action detection by implicit intentional motion clustering - W. Chen and J. Corso, ICCV2015.
- Finding Action Tubes - G. Gkioxari and J. Malik, CVPR2015. [code] [project web]
- APT: Action localization proposals from dense trajectories - J. Gemert et al, BMVC2015. [code]
- Spatio-Temporal Object Detection Proposals - D. Oneata et al, ECCV2014. [code] [project web]
- Action localization with tubelets from motion - M. Jain et al, CVPR2014.
- Spatiotemporal deformable part models for action detection - Y. Tian et al, CVPR2013. [code]
- Actor and Observer: Joint Modeling of First and Third-Person Videos - G. Sigurdsson et al., CVPR2018. [code]
- What and How Well You Performed? A Multitask Learning Approach to Action Quality Assessment - P. Parmar and B. T. Morris, CVPR2019.
- PathTrack: Fast Trajectory Annotation with Path Supervision - S. Manen et al., ICCV2017.
- CortexNet: a Generic Network Family for Robust Visual Temporal Representations - A. Canziani and E. Culurciello, arXiv2017. [code] [project web]
- Slicing Convolutional Neural Network for Crowd Video Understanding - J. Shao et al, CVPR2016. [code]
- Two-Stream (RGB and Flow) pretrained model weights
- Video Dataset Overview from Antoine Miech
- HACS
- Moments in Time, paper
- AVA, paper, [INRIA web] for missing videos
- Kinetics, paper, download toolkit
- OOPS - A dataset of unintentional actions, [paper]
- COIN - a large-scale dataset for comprehensive instructional video analysis, paper
- YouTube-8M, technical report
- YouTube-BB, technical report
- DALY - Daily Action Localization in YouTube videos. Note: Weakly supervised action detection dataset. Annotations consist of the start and end time of each action, with one bounding box per action per video.
- 20BN-JESTER, 20BN-SOMETHING-SOMETHING
- ActivityNet Note: They provide a download script and evaluation code here.
- Charades
- Charades-Ego, paper - First person and third person video aligned dataset
- EPIC-Kitchens, paper - First person videos recorded in kitchens. Note: they provide download scripts and a Python library here.
- Sports-1M - Large scale action recognition dataset.
- THUMOS14 Note: It overlaps with UCF-101 dataset.
- THUMOS15 Note: It overlaps with UCF-101 dataset.
- HOLLYWOOD2: Spatio-Temporal annotations
- UCF-101, annotation provided by THUMOS-14, and corrupted annotation list, UCF-101 corrected annotations and different version annotations. There are also some pre-computed spatiotemporal action detection results.
- UCF-50.
- UCF-Sports, note: the train/test split link in the official website is broken. Instead, you can download it from here.
- HMDB
- J-HMDB
- LIRIS-HARL
- KTH
- MSR Action Note: It overlaps with KTH dataset.
- Sports Videos in the Wild
- NTU RGB+D
- Mixamo Mocap Dataset
- UWA3D Multiview Activity II Dataset
- Northwestern-UCLA Dataset
- SYSU 3D Human-Object Interaction Dataset
- MEVA (Multiview Extended Video with Activities) Dataset
- Efficiently scaling up crowdsourced video annotation - C. Vondrick et al, IJCV2013. [code]
- The Design and Implementation of ViPER - D. Mihalcik and D. Doermann, Technical report.
- VTT: Visual Object Tagging Tool. Modern app to annotate objects in videos and images. It facilitates the development of an end-to-end machine learning pipeline encompassing the annotation/export/import of assets. Moreover, it could run as a native app or via web.
- VIA: VGG Image Annotator. Simple and standalone manual annotation web-app for image, audio and video. It runs in the web browser and does not require any installation or setup.
- Deformable Convolutional Networks - J. Dai et al., ICCV2017. [official code]
- Detectron - Open Source Object Detection Framework from Facebook AI Research. Includes Mask R-CNN, FPN, etc. Caffe2 implementation.
- Mask R-CNN - K. He et al, [Detectron], [TensorFlow + Keras], [MXNet], [TensorFlow], [PyTorch] - State-of-the-art object detection/instance segmentation algorithm.
- Faster R-CNN - S. Ren et al, NIPS2015. [official MatCaffe code], [PyCaffe], [TensorFlow], [Another TF implementation] [Keras] - State-of-the-art object detector.
- YOLO - J. Redmon et al, CVPR2016. [official code], [TensorFlow] - Fast object detector.
- YOLO9000 - J. Redmon and A. Farhadi, CVPR2017. [official code] - State-of-the-art object detector which can detect 9000 objects in realtime.
- SSD - W. Liu et al, ECCV2016. [official PyCaffe code], [TensorFlow], [Keras] - State-of-the-art object detector with realtime processing speed.
- RetinaNet - Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He and Piotr Dollár (Facebook AI Research), ICCV2017. [Keras] - State-of-the-art object detector with realtime processing speed.
- [Detect to Track and Track to Detect] - C. Feichtenhofer et al., ICCV2017. [code], [project web]
- [Flow-Guided Feature Aggregation for Video Object Detection] - X. Zhu et al., ICCV2017. [code], aka FGFA
- AlphaPose - PyTorch based realtime and accurate pose estimation and tracking tool from SJTU.
- Detect-and-Track: Efficient Pose Estimation in Videos - R. Girdhar et al., arXiv2017.
- OpenPose Library - Caffe based realtime pose estimation library from CMU.
- Realtime Multi-Person 2D Pose Estimation using Part Affinity Fields - Z. Cao et al, CVPR2017. [code] depends on the [caffe RT pose] - Earlier version of OpenPose from CMU.
- DensePose [code] - Dense human pose estimation in the wild implemented in the Detectron framework.
- MultiPoseNet: Fast Multi-Person Pose Estimation using Pose Residual Network - M. Kocabas et al, ECCV2018. [code]
- ActEV (Activities in Extended Video) - Activity detection in security camera videos. Runs through 2021. Hosted by NIST.
License
To the extent possible under law, Jinwoo Choi has waived all copyright and related or neighboring rights to this work.
Please read the contribution guidelines. Then please feel free to send me pull requests or email ([email protected]) to add links.