Stars
PyTorch implementation of FractalGen https://arxiv.org/abs/2502.17437
Wan: Open and Advanced Large-Scale Video Generative Models
This repository contains the code for the paper "voc2vec: A Foundation Model for Non-Verbal Vocalization", accepted at ICASSP 2025.
Genome modeling and design across all domains of life
The official implementation of TokenSynth (ICASSP 2025)
Unified automatic quality assessment for speech, music, and sound.
Fully open reproduction of DeepSeek-R1
Reverse Engineering of Supervised Semantic Speech Tokenizer (S3Tokenizer) proposed in CosyVoice
YuE: Open Full-song Music Generation Foundation Model, something similar to Suno.ai but open
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
Codec for paper: LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
LLaSA: Scaling Train-time and Inference-time Compute for LLaMA-based Speech Synthesis
Training Large Language Model to Reason in a Continuous Latent Space
Code for NeurIPS 2024 paper - The GAN is dead; long live the GAN! A Modern Baseline GAN - by Huang et al.
Code for the paper "FLowHigh: Towards efficient and high-quality audio super-resolution with single-step flow matching"
[INTERSPEECH 2024] EmoBox: Multilingual Multi-corpus Speech Emotion Recognition Toolkit and Benchmark
SANA: Efficient High-Resolution Image Synthesis with Linear Diffusion Transformer
Out-of-the-box (OOTB) GUI Agent for Windows and macOS
Awesome speech/audio LLMs, representation learning, and codec models
r9y9 / speech-trident
Forked from ga642381/speech-trident. Awesome speech/audio LLMs, representation learning, and codec models
Speaker change detection using SincNet and an LSTM/Transformer
SONAR, a new multilingual and multimodal fixed-size sentence embedding space, with a full suite of speech and text encoders and decoders.