Stars
Text-Prompted Generative Audio Model
An open-source multimodal large language model that can hear and talk while thinking, featuring real-time end-to-end speech input and streaming audio output for conversation.
WavJourney: Compositional Audio Creation with LLMs
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.
The official code for the SALMon benchmark (ICASSP 2025)
SOTA discrete acoustic codec models with 40 tokens per second for audio language modeling
This repo contains the official PyTorch implementation of vLMIG: Improving Visual Commonsense in Language Models via Multiple Image Generation
The official implementation of "A Language Modeling Approach to Diacritic-Free Hebrew TTS"
Code, Dataset, and Pretrained Models for Audio and Speech Large Language Model "Listen, Think, and Understand".
A curated list for awesome discrete diffusion models resources.
Official repository for NAST: Noise Aware Speech Tokenization for Speech Language Models (Interspeech 2024) https://arxiv.org/abs/2406.11037
slp-rl / TempoTokens
Forked from guyyariv/TempoTokens
This repo is a fork, containing the official PyTorch implementation of: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
This repo contains the official PyTorch implementation of: Diverse and Aligned Audio-to-Video Generation via Text-to-Video Model Adaptation
A Python toolbox for performing gradient-free optimization
A sequence-to-sequence voice conversion toolkit.
slp-rl / AudioToken
Forked from guyyariv/AudioToken
This repo is a fork of the official PyTorch implementation of "AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation" (Interspeech 2023)
A spoken version of the textual story cloze benchmark
This repository contains the official PyTorch implementation of the paper: "Learning Discrete Structured VAE using NES".
This repo contains the official PyTorch implementation of AudioToken: Adaptation of Text-Conditioned Diffusion Models for Audio-to-Image Generation
Audiocraft is a library for audio processing and generation with deep learning. It features the state-of-the-art EnCodec audio compressor / tokenizer, along with MusicGen, a simple and controllable…
Official repository for "Speaking Style Conversion With Discrete Self-Supervised Units" (EMNLP 2023). https://arxiv.org/abs/2212.09730
This repo contains the official PyTorch implementation of "Audio Super Resolution in the Spectral Domain" (ICASSP 2023)
This repo contains the official PyTorch implementation of "Analyzing Discrete Self Supervised Speech Representation For Spoken Language Modeling" (ICASSP 2023)