Nanjing University; RTC Lab ByteDance
Nanjing, China
Stars
This is the PyTorch implementation of Universal Source Separation with Weakly Labelled Data.
[NeurIPS 2024] Classification Done Right for Vision-Language Pre-Training
This is the code and dataset repo for the Interspeech 2024 paper "Target conversation extraction: Source separation using turn-taking dynamics"
Foundational Models for State-of-the-Art Speech and Text Translation
Linguistic-Aware Patch Slimming Framework for Fine-grained Cross-Modal Alignment, CVPR, 2024
Official Implementation of DPLM (ICML'24) - Diffusion Language Models Are Versatile Protein Learners
This repo is for the SPL paper "Auto-Tuning Spectral Clustering for Speaker Diarization Using Normalized Maximum Eigengap"
A 6-million Audio-Caption Paired Dataset Built with an LLM- and ALM-based Automatic Pipeline
Zero-shot voice conversion & singing voice conversion, with real-time support
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audi…
Implementation of the proposed minGRU in PyTorch
Google Research
Awesome Deep Graph Clustering is a collection of SOTA, novel deep graph clustering methods (papers, codes, and datasets).
Learning audio concepts from natural language supervision
WavJourney: Compositional Audio Creation with LLMs
A collection of LLM papers, blogs, and projects, with a focus on OpenAI o1 🍓 and reasoning techniques.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Text-to-Music Generation with Rectified Flow Transformers
Computes the Mel-Cepstral Distance of two WAV files based on the paper "Mel-Cepstral Distance Measure for Objective Speech Quality Assessment" by Robert F. Kubichek.
Macaw-LLM: Multi-Modal Language Modeling with Image, Video, Audio, and Text Integration
This is the official repository for M2UGen