Stars
A collection of AIGC-interview/CV-interview/LLMs-interview questions and answers, also covering new ideas, problems, resources, and projects from work and research
The official repository for paper "Tora: Trajectory-oriented Diffusion Transformer for Video Generation"
🔥 Sa2VA: Marrying SAM2 with LLaVA for Dense Grounded Understanding of Images and Videos
Generative models for conditional audio generation
Efficient face emotion recognition in photos and videos
The official repo for Both Ears Wide Open: Towards Language-Driven Spatial Audio Generation
A text-conditional diffusion probabilistic model capable of generating high-fidelity audio.
ACM MM 2024 FlashSpeech: Efficient Zero-Shot Speech Synthesis
This project aims to collect the latest "call for reviewers" links from various top CS/ML/AI conferences/journals
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Official implementation of OV-DINO: Unified Open-Vocabulary Detection with Language-Aware Selective Fusion
[CVPR 2022] Official code for "RegionCLIP: Region-based Language-Image Pretraining"
DeepEar: Sound Localization with Binaural Microphones
Official code for SEE-2-SOUND: Zero-Shot Spatial Environment-to-Spatial Sound
A python implementation of “Learning Deep Direct-Path Relative Transfer Function for Binaural Sound Source Localization” [TASLP 2021]
The Official PyTorch Implementation of FN-SSL & IPDnet for Sound Source Localization [INTERSPEECH 2023 & TASLP 2024]
The official repo for "Ref-AVS: Refer and Segment Objects in Audio-Visual Scenes", ECCV 2024
[ECCV 2024] The official code of paper "Open-Vocabulary SAM".
Open SoundStream-style VAE codecs for downstream neural audio synthesis
The simplest, fastest repository for training/finetuning medium-sized GPTs.
A minimal PyTorch re-implementation of the OpenAI GPT (Generative Pretrained Transformer) training
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.