Stars
Benchmarking physical understanding in generative video models
Official repo for paper "MiraData: A Large-Scale Video Dataset with Long Durations and Structured Captions"
🔥🔥 Kokoro in Rust. https://huggingface.co/hexgrad/Kokoro-82M Insanely fast, real-time, high-quality TTS.
StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models
Code for NeurIPS 2024 paper - The GAN is dead; long live the GAN! A Modern Baseline GAN - by Huang et al.
Align Anything: Training All-modality Model with Feedback
Music repair method to convert lossy MP3 compressed music to lossless music.
Unofficial implementation of Titans, SOTA memory for transformers, in PyTorch
VisionReward: Fine-Grained Multi-Dimensional Human Preference Learning for Image and Video Generation
Official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*
HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
📚[WIP] FFPA: Yet another Faster Flash Prefill Attention with O(1)⚡️GPU SRAM complexity for headdim > 256, 1.8x~3x↑🎉faster vs SDPA EA.
Scalable RL solution for advanced reasoning of language models
[Preprint] "Understanding Bottlenecks of State Space Models through the Lens of Recency and Over-smoothing" by Peihao Wang, Ruisi Cai, Yuehao Wang, Jiajun Zhu, Pragya Srivastava, Zhangyang Wang, Pa…
Official repository of the paper "MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization".
Applied AI experiments and examples for PyTorch
Triton implementation of bi-directional (non-causal) linear attention
Medical o1, Towards medical complex reasoning with LLMs
The official repository for the paper "Optimal Flow Matching: Learning Straight Trajectories in Just One Step" (NeurIPS 2024)
Tarsier -- a family of large-scale video-language models designed to generate high-quality video descriptions, together with good general video understanding capability.
DRT-o1: Optimized Deep Reasoning Translation via Long Chain-of-Thought
Grapheme-to-Phoneme for Mixed Chinese (Mandarin or Cantonese) and English.
Torchaudio Forced Aligner for Mixed Chinese (Mandarin or Cantonese) and English.
High-performance components for building a trading platform, such as an ultra-fast matching engine and an order book processor