Stars
ASLP-lab / LLaSE-G1
Forked from Kevin-naticl/LLaSE-G1
LLaSE-G1: Incentivizing Generalization Capability for LLaMA-based Speech Enhancement
SlamKit is an open-source toolkit for efficient training of SpeechLMs. It was used for "Slamming: Training a Speech Language Model on One GPU in a Day".
Research and Production Oriented Speaker Verification, Recognition and Diarization Toolkit
Production-tested AI infrastructure tools for efficient AGI development and community-driven innovation
OSUM: Open Speech Understanding Model, open-sourced by ASLP@NPU.
Unified automatic quality assessment for speech, music, and sound.
Open-source industrial-grade ASR models supporting Mandarin, Chinese dialects and English, achieving a new SOTA on public Mandarin ASR benchmarks, while also offering outstanding singing lyrics rec…
🧑🚀 A summary of the world's best LLM resources (data processing, model training, model deployment, o1 models, small language models, vision-language models).
Clean, minimal, accessible reproduction of DeepSeek R1-Zero
[Unofficial] PyTorch implementation of "Conformer: Convolution-augmented Transformer for Speech Recognition" (INTERSPEECH 2020)
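As a quick sketch of the architecture this repo implements, below is one self-contained Conformer block (macaron-style half-step feed-forwards around self-attention and a depthwise convolution module). Dimensions, kernel size, and module layout are illustrative choices, not this repository's exact code.

```python
import torch
import torch.nn as nn

class ConvModule(nn.Module):
    """Pointwise conv + GLU, depthwise conv, BatchNorm, SiLU, pointwise conv."""
    def __init__(self, dim, kernel_size=31):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)               # expand for GLU
        self.glu = nn.GLU(dim=1)
        self.dw = nn.Conv1d(dim, dim, kernel_size,
                            padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.act = nn.SiLU()
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                                   # x: (B, T, D)
        y = self.norm(x).transpose(1, 2)                    # -> (B, D, T)
        y = self.pw2(self.act(self.bn(self.dw(self.glu(self.pw1(y))))))
        return x + y.transpose(1, 2)                        # residual

class ConformerBlock(nn.Module):
    def __init__(self, dim=144, heads=4, ff_mult=4):
        super().__init__()
        ff = lambda: nn.Sequential(nn.LayerNorm(dim),
                                   nn.Linear(dim, ff_mult * dim), nn.SiLU(),
                                   nn.Linear(ff_mult * dim, dim))
        self.ff1, self.ff2 = ff(), ff()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                                   # x: (B, T, D)
        x = x + 0.5 * self.ff1(x)                           # half-step FFN
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]   # self-attention
        x = self.conv(x)                                    # conv module (own residual)
        x = x + 0.5 * self.ff2(x)                           # half-step FFN
        return self.out_norm(x)

x = torch.randn(2, 100, 144)                                # (batch, frames, dim)
print(ConformerBlock()(x).shape)                            # torch.Size([2, 100, 144])
```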
An unofficial implementation of the Personal VAD speaker-conditioned voice activity detection method. Bachelor's thesis project.
Awesome speech/audio LLMs, representation learning, and codec models
A high-throughput and memory-efficient inference and serving engine for LLMs
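For orientation, a minimal offline-inference sketch against vLLM's Python API; the model name is just a placeholder, and any Hugging Face causal LM it supports would work the same way.

```python
from vllm import LLM, SamplingParams

# Load a (placeholder) model and configure sampling behavior.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# Batched generation: one RequestOutput per input prompt.
outputs = llm.generate(["Speech enhancement is"], params)
for out in outputs:
    print(out.outputs[0].text)
```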
Official repository for Mamba-based Segmentation Model for Speaker Diarization
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
Real-time face swap and one-click video deepfake with only a single image
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
🤗 PEFT: State-of-the-art Parameter-Efficient Fine-Tuning.
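For context, a small sketch of attaching a LoRA adapter to a Hugging Face model with PEFT; the base checkpoint and `target_modules` below are placeholder choices that depend on the base model's layer names.

```python
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("facebook/opt-125m")  # placeholder model
config = LoraConfig(
    r=8,                                  # rank of the low-rank update
    lora_alpha=16,                        # scaling applied as alpha / r
    target_modules=["q_proj", "v_proj"],  # layer names vary per architecture
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()        # only the adapter weights train
```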
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"
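To make the idea concrete, here is the low-rank update itself in plain PyTorch rather than loralib's API: the pretrained weight W stays frozen and only a rank-r product, scaled by alpha/r, is learned on top of it.

```python
import torch

d_out, d_in, r, alpha = 64, 64, 4, 8
W = torch.randn(d_out, d_in)            # frozen pretrained weight
A = torch.randn(r, d_in) * 0.01         # trainable down-projection, small init
B = torch.zeros(d_out, r)               # trainable up-projection, zero init,
A.requires_grad_(); B.requires_grad_()  # so the update starts as a no-op

x = torch.randn(3, d_in)
y = x @ (W + (alpha / r) * (B @ A)).T   # forward through the adapted weight
print(y.shape)                          # torch.Size([3, 64])
```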
MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversation
Moshi is a speech-text foundation model and full-duplex spoken dialogue framework. It uses Mimi, a state-of-the-art streaming neural audio codec.