Stars
📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉
[NeurIPS'24 Spotlight, ICLR'25] To speed up long-context LLM inference, approximates attention with dynamic sparse computation, reducing pre-filling latency by up to 10x on an …
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deploym…
A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers)
[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…
a state-of-the-art open visual language model (multimodal pre-trained model)
[ICLR 2024] Efficient Streaming Language Models with Attention Sinks
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
A high-throughput and memory-efficient inference and serving engine for LLMs
A curated list for Efficient Large Language Models
Development repository for the Triton language and compiler
Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment
Code for the paper "Evaluating Large Language Models Trained on Code"
A Pythonic framework to simplify AI service building
Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.
Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.
A framework for few-shot evaluation of language models.
Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models
⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡
AITemplate is a Python framework which renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.
A Python-level JIT compiler designed to make unmodified PyTorch programs faster.