Stars
A generative world for general-purpose robotics & embodied AI learning.
AAAI 2025: Codec Does Matter: Exploring the Semantic Shortcoming of Codec for Audio Language Model
Vocos: Closing the gap between time-domain and Fourier-based neural vocoders for high-quality audio synthesis
Open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
Official Implementation of Rectified Flow (ICLR 2023 Spotlight)
📖 This is a repository for organizing papers, code, and other resources related to unified multimodal models.
[AAAI 2024 Oral] M2CLIP: A Multimodal, Multi-Task Adapting Framework for Video Action Recognition
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized Sounds. An AI Foley master that adds vivid, synchronized sound effects to your silent videos 😝
VideoLLM-online: Online Video Large Language Model for Streaming Video (CVPR 2024)
Open-Sora: Democratizing Efficient Video Production for All
Community interface for generative AI
Official Implementation of EnCLAP (ICASSP 2024)
Implementation of Google's USM speech model in PyTorch
OpenTAD is an open-source temporal action detection (TAD) toolbox based on PyTorch.
The official code repo of "HTS-AT: A Hierarchical Token-Semantic Audio Transformer for Sound Classification and Detection"
State-of-the-art audio codec with 90x compression factor. Supports 44.1kHz, 24kHz, and 16kHz mono/stereo audio.
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
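MNBVC (Massive Never-ending BT Vast Chinese corpus): an ultra-large-scale Chinese corpus, benchmarked against the 40T of data used to train ChatGPT. The MNBVC dataset covers not only mainstream culture but also niche subcultures and even "Martian script" (火星文) data. It includes plain-text Chinese data in every form: news, essays, novels, books, magazines, papers, scripts, forum posts, wiki, classical poetry, lyrics, product descriptions, jokes, embarrassing anecdotes, chat logs, and more.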
Diff-Foley: Synchronized Video-to-Audio Synthesis with Latent Diffusion Models
InstantID: Zero-shot Identity-Preserving Generation in Seconds 🔥
🔊 Text-Prompted Generative Audio Model
Code and models for NExT-GPT: Any-to-Any Multimodal Large Language Model
Official code and models for the paper "Auffusion: Leveraging the Power of Diffusion and Large Language Models for Text-to-Audio Generation"
✨✨Latest Advances on Multimodal Large Language Models