Starred repositories
[ECCV2024] Official implementation of Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes
[ICCV 2023] Efficient Video Action Detection with Token Dropout and Context Refinement
[CVPR 2021] Actor-Context-Actor Relation Network for Spatio-temporal Action Localization
Spatio-Temporal Action Localization System
Hiera: A fast, powerful, and simple hierarchical vision transformer.
Code repository for the paper "On the Benefits of 3D Pose and Tracking for Human Action Recognition" (CVPR 2023)
We have implemented Track #1 for ICME 2024: Spatial Action Localization on the Chaotic World dataset. Our mAP on the validation set reaches 26.62%, and if we directly use the officially provided chaos_tes…
Context-based Dialogue Act Recognition using Recurrent Neural Networks
Switchboard Dialog Act Corpus with Penn Treebank links
SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models
The TalkMoves Dataset: K-12 mathematics lesson transcripts annotated for teacher and student discursive moves
Custom AVA dataset: a multi-person video dataset annotation method for spatio-temporal actions
Pre-Training with Whole Word Masking for Chinese BERT (Chinese BERT-wwm series of models)
Facebook AI Research Sequence-to-Sequence Toolkit written in Python.
Neural building blocks for speaker diarization: speech activity detection, speaker change detection, overlapped speech detection, speaker embedding
Official implementation of "Holmes-VAD: Towards Unbiased and Explainable Video Anomaly Detection via Multi-modal LLM"
PySlowFast: video understanding codebase from FAIR for reproducing state-of-the-art video models.
EDUVSUM is a multimodal neural architecture that utilizes state-of-the-art audio, visual and textual features to identify important temporal segments in educational videos.
[ACL 2024 🔥] Video-ChatGPT is a video conversation model capable of generating meaningful conversation about videos. It combines the capabilities of LLMs with a pretrained visual encoder adapted fo…
[ICCV 2021 Oral + TPAMI] Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Code release for "Learning Video Representations from Large Language Models"
Code for paper "VideoTree: Adaptive Tree-based Video Representation for LLM Reasoning on Long Videos"
[AAAI 2025] VTG-LLM: Integrating Timestamp Knowledge into Video LLMs for Enhanced Video Temporal Grounding
Code for CVPR 2023 paper "Procedure-Aware Pretraining for Instructional Video Understanding"
[CVPR 2024] EvalCrafter: Benchmarking and Evaluating Large Video Generation Models