University of Science and Technology of China (USTC)
Stars
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A self-learning tutorial for CUDA high-performance programming.
Making Long-Context LLM Inference 10x Faster and 10x Cheaper
Thin, unified, C++-flavored wrappers for the CUDA APIs
A throughput-oriented high-performance serving framework for LLMs
A high-throughput and memory-efficient inference and serving engine for LLMs (see the minimal usage sketch after this list)
Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA
A curated list for Efficient Large Language Models
No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images
Bridging Large Vision-Language Models and End-to-End Autonomous Driving
A curated list of awesome LLM for Autonomous Driving resources (continually updated)
Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding
Supplemental material for ECRTS24 paper: Autonomy Today: Many Delay-Prone Black Boxes
RT-Swap: Addressing GPU Memory Bottlenecks for Real-Time Multi-DNN Inference
InstantSplat: Sparse-view SfM-free Gaussian Splatting in Seconds
Source code for the paper: "Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs"
A Fine-Grained, Hardware-Level GPU Resource Isolation Solution for Multi-Tenant DNN Inference
Efficient tool-assisted LLM serving runtime.
Summary of some awesome work for optimizing LLM inference
FlashInfer: Kernel Library for LLM Serving
Hackable and optimized Transformers building blocks, supporting a composable construction.
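
As a taste of the serving engines listed above, here is a minimal sketch of offline text generation with vLLM's Python API. The model name facebook/opt-125m is only an illustrative choice, and the sampling values are arbitrary examples; exact defaults may differ across vLLM versions.

    from vllm import LLM, SamplingParams

    # Load a small model for illustration; any HF-compatible checkpoint works.
    llm = LLM(model="facebook/opt-125m")

    # Sampling settings: the temperature/top-p values here are arbitrary examples.
    params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

    # generate() batches prompts and returns one RequestOutput per prompt.
    for out in llm.generate(["Long-context inference is"], params):
        print(out.outputs[0].text)

The same prompt-in, completions-out pattern is what throughput-oriented frameworks like the ones above optimize under the hood, via batching, paged KV caches, and custom attention kernels.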