Stars
An AI-Powered Speech Processing Toolkit and Open Source SOTA Pretrained Models, Supporting Speech Enhancement, Separation, and Target Speaker Extraction, etc.
Code for L2 regularization of arbitrary Tikhonov matrices
A deep-learning-based method for sound field reconstruction
Python version of PEAQ (Perceptual Evaluation of Audio Quality)
AQUA-Tk = Audio QUality Assessment-Toolkit. (In development)
TrOMR: Transformer-based Polyphonic Optical Music Recognition
TG-Critic: A Timbre-Guided Model for Reference-Independent Singing Evaluation
State-of-the-art deep learning based audio codec supporting both mono 24 kHz audio and stereo 48 kHz audio.
Metrics for evaluating music and audio generative models – with a focus on long-form, full-band, and stereo generations.
A simple library for Fréchet Audio Distance (FAD) calculation
A lightweight library for Fréchet Audio Distance calculation.
Automatic Speech Recognition with Speaker Diarization based on OpenAI Whisper
Train no-reference speech quality estimators with multiple datasets via learned, per-dataset alignments.
Verbatim Automatic Speech Recognition with improved word-level timestamps and filler detection
WhisperX: Automatic Speech Recognition with Word-level Timestamps (& Diarization)
A Repository for Single- and Multi-modal Speaker Verification, Speaker Recognition and Speaker Diarization
The official repo of NBC & SpatialNet for multichannel speech separation, denoising, and dereverberation
LLaMA-Omni is a low-latency and high-quality end-to-end speech interaction model built upon Llama-3.1-8B-Instruct, aiming to achieve speech capabilities at the GPT-4o level.
This repository contains a comprehensive computer vision/machine learning football project that uses YOLO for object detection, K-means for pixel segmentation, optical flow for motion tracking, and …
Official repository - Fully managed, cross-platform (Windows, Mac, Linux) .NET library for capturing packets
faster_whisper GUI with PySide6
This repo contains required files for the INTERSPEECH 2022 Audio Deep Packet Loss Concealment (PLC) Challenge.
EmotiVoice 😊: a Multi-Voice and Prompt-Controlled TTS Engine
KAN-TTS is a speech-synthesis training framework. Please try the demos we have posted at https://modelscope.cn/models?page=1&tasks=text-to-speech