Stars
[EuroSys'25] Mist: Efficient Distributed Training of Large Language Models via Memory-Parallelism Co-Optimization
LLM fine-tuning with PEFT
Pretraining code for a large-scale depth-recurrent language model
[NeurIPS 2024] Fast Best-of-N Decoding via Speculative Rejection
Supplemental materials for The ASPLOS 2025 / EuroSys 2025 Contest on Intra-Operator Parallelism for Distributed Deep Learning
Public repository for "The Surprising Effectiveness of Test-Time Training for Abstract Reasoning"
ShadowKV: KV Cache in Shadows for High-Throughput Long-Context LLM Inference
Differentiable Combinatorial Scheduling at Scale (ICML'24). Mingju Liu, Yingjie Li, Jiaqi Yin, Zhiru Zhang, Cunxi Yu.
Archon provides a modular framework for combining different inference-time techniques and LMs with just a JSON config file.
Triton-based implementation of Sparse Mixture of Experts.
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
Survey: A collection of AWESOME papers and resources on the latest research in Mixture of Experts.
RDMA and SHARP plugins for the NCCL library
A modular, extensible LLM inference benchmarking framework that supports multiple benchmarking harnesses and paradigms.
xDiT: A Scalable Inference Engine for Diffusion Transformers (DiTs) with Massive Parallelism
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
A modular graph-based Retrieval-Augmented Generation (RAG) system
CUDA Python: Performance meets Productivity
A fast communication-overlapping library for tensor parallelism on GPUs.
📖A curated list of Awesome Diffusion Inference Papers with codes: Sampling, Caching, Multi-GPUs, etc. 🎉🎉
MSCCL++: A GPU-driven communication stack for scalable AI applications
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
Kubernetes controller for GitHub Actions self-hosted runners