- Korea Advanced Institute of Science and Technology (KAIST)
- Daejeon, Korea
- https://choijeongsoo.github.io
Stars
Official PyTorch implementation of "Paralinguistics-Aware Speech-Empowered LLMs for Natural Conversation" (NeurIPS 2024)
Awesome Neural Codec Models, Text-to-Speech Synthesizers & Speech Language Models
✨✨Freeze-Omni: A Smart and Low Latency Speech-to-speech Dialogue Model with Frozen LLM
Movie Gen Bench - two media generation evaluation benchmarks released with Meta Movie Gen
Inference code for the paper "Spirit-LM: Interleaved Spoken and Written Language Model".
Towards Open-source GPT-4o with Vision, Speech and Duplex Capabilities.
This repository collects papers related to speech tokenizers.
Official code for "F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching"
EchoMimic: Lifelike Audio-Driven Portrait Animations through Editable Landmark Conditioning
An Open-Source LLM-empowered Foundation TTS System
Real-time Speech-Text Foundation Model Toolkit (wip)
Zero-shot voice conversion & singing voice conversion, with real-time support
An open-source multimodal large language model that can hear and talk while it thinks, featuring real-time end-to-end speech input and streaming audio output for conversation.
Multi-Scale Neural Audio Codec (SNAC) compresses audio into discrete codes at a low bitrate
Implementation of Acoustic BPE (Shen et al., 2024), extended for RVQ-based Neural Audio Codecs (a toy sketch of the pair-merging idea follows this list)
SoftVC VITS Singing Voice Conversion
Lip Reading for Low-resource Languages by Learning and Combining General Speech Knowledge and Language-specific Knowledge (ICCV 2023)
[CVPR 2023] Official code for the paper "Learning to Dub Movies via Hierarchical Prosody Models".
[ACL 2024] PyTorch code for the paper "StyleDubber: Towards Multi-Scale Style Learning for Movie Dubbing"
An open-source implementation of Microsoft's VALL-E X zero-shot TTS model. A demo is available at https://plachtaa.github.io/vallex/
Out of time: automated lip sync in the wild
Implementation of Autoregressive Diffusion in PyTorch
Official code for the CVPR 2024 paper: Separating the "Chirp" from the "Chat": Self-supervised Visual Grounding of Sound and Language
[Interspeech 2024] Whisper-Flamingo: Integrating Visual Features into Whisper for Audio-Visual Speech Recognition and Translation
PyTorch implementation of "Watch or Listen: Robust Audio-Visual Speech Recognition with Visual Corruption Modeling and Reliability Scoring" (CVPR 2023) and "Visual Context-driven Audio Feature Enhancement for Robust End-to-End Audio-Visual Speech Recognition" (Interspeech 2022)
[CVPR 2024] Seeing and Hearing: Open-domain Visual-Audio Generation with Diffusion Latent Aligners
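
The Acoustic BPE entry above names a technique its one-line description cannot unpack: byte-pair-style merges are learned over the discrete token IDs a neural audio codec emits, so frequent adjacent pairs collapse into single units and sequences shorten before language modeling. Below is a minimal, self-contained sketch of that core loop in plain Python; the function names and toy data are illustrative assumptions, not the API of the repository above, and a real RVQ codec would first flatten or interleave its parallel codebook streams into the single token stream assumed here.

```python
# Toy sketch of "Acoustic BPE": greedily learn merges over the discrete
# token IDs a neural audio codec emits. Generic illustration only; not
# the API of any repository listed above.
from collections import Counter

def most_frequent_pair(seqs):
    """Count adjacent token pairs across all sequences; return the top one."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0] if counts else None

def merge_pair(seq, pair, new_id):
    """Replace every non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_acoustic_bpe(seqs, base_vocab, num_merges):
    """Greedily learn up to `num_merges` merges over codec-token sequences."""
    merges, next_id = {}, base_vocab
    for _ in range(num_merges):
        top = most_frequent_pair(seqs)
        if top is None or top[1] < 2:  # stop when no pair repeats
            break
        pair = top[0]
        merges[pair] = next_id
        seqs = [merge_pair(s, pair, next_id) for s in seqs]
        next_id += 1
    return merges, seqs

# Example: pretend a codec quantized two clips with a 1024-entry codebook.
clips = [[7, 7, 12, 7, 7, 12, 99], [7, 7, 12, 3]]
merges, shorter = train_acoustic_bpe(clips, base_vocab=1024, num_merges=4)
print(merges)   # e.g. {(7, 7): 1024, (1024, 12): 1025}
print(shorter)  # sequences shrink as frequent pairs are merged
```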