Stars
Faster PyTorch bitsandbytes 4-bit FP4 nn.Linear ops
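A minimal sketch of what these ops enable: swapping fp32 `nn.Linear` layers for bitsandbytes 4-bit FP4 layers, following the conversion pattern from the bitsandbytes docs (assumes a CUDA device; the 64x64 sizes are illustrative).

```python
# Minimal sketch: replace fp32 Linear layers with bitsandbytes 4-bit FP4
# layers. Assumes a CUDA GPU; layer sizes here are illustrative only.
import torch
import torch.nn as nn
import bitsandbytes as bnb

fp32_model = nn.Sequential(nn.Linear(64, 64), nn.Linear(64, 64))

fp4_model = nn.Sequential(
    bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16, quant_type="fp4"),
    bnb.nn.Linear4bit(64, 64, compute_dtype=torch.float16, quant_type="fp4"),
)
fp4_model.load_state_dict(fp32_model.state_dict())
fp4_model = fp4_model.to("cuda")  # weights are quantized on device transfer

x = torch.randn(1, 64, dtype=torch.float16, device="cuda")
with torch.no_grad():
    y = fp4_model(x)
```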
LLM knowledge sharing that anyone can understand; a must-read before LLM interviews in spring/autumn campus recruiting, so you can talk confidently with your interviewer
RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.
SGLang is a fast serving framework for large language models and vision language models.
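For flavor, a minimal sketch of SGLang's frontend language, assuming an SGLang server is already serving a model locally (the endpoint URL and question are illustrative):

```python
# Minimal sketch of SGLang's frontend DSL; assumes an SGLang server
# is already running at localhost:30000 (illustrative endpoint).
import sglang as sgl

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))
state = qa.run(question="What is 4-bit quantization?")
print(state["answer"])
```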
Code repo for the paper "SpinQuant: LLM quantization with learned rotations"
Official implementation of the ICLR 2024 paper AffineQuant
Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference for large language models.
Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
Applied AI experiments and examples for PyTorch
Development repository for the Triton language and compiler
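As a taste of the language, the canonical vector-add kernel, adapted from Triton's own tutorials (block size is illustrative):

```python
# The canonical Triton vector-add kernel, adapted from Triton's tutorials.
import torch
import triton
import triton.language as tl

@triton.jit
def add_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)                      # each program handles one block
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements                      # guard the ragged last block
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    tl.store(out_ptr + offsets, x + y, mask=mask)

x = torch.rand(98432, device="cuda")
y = torch.rand_like(x)
out = torch.empty_like(x)
grid = (triton.cdiv(x.numel(), 1024),)
add_kernel[grid](x, y, out, x.numel(), BLOCK_SIZE=1024)
assert torch.allclose(out, x + y)
```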
QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference
Compress your input to ChatGPT or other LLMs so they can process 2x more content, saving 40% of memory and GPU time.
Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities
Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization (https://arxiv.org/pdf/2401.06118.pdf) and PV-Tuning: Beyond Straight-Through Estimation for Ext…
A fast inference library for running LLMs locally on modern consumer-class GPUs
Large World Model -- modeling text and video with million-token context
FlashInfer: Kernel Library for LLM Serving
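A minimal sketch of a single-request decode-attention call, following the shapes in FlashInfer's quickstart (head counts, KV length, and head_dim are illustrative):

```python
# Minimal sketch of FlashInfer's single-request decode attention.
# Head counts, KV length, and head_dim below are illustrative.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 32, 128, 2048
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Attention for one new query token against the cached keys/values.
o = flashinfer.single_decode_with_kv_cache(q, k, v)  # [num_qo_heads, head_dim]
```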
[EMNLP'23, ACL'24] To speed up LLM inference and sharpen the model's focus on key information, compress the prompt and KV cache, achieving up to 20x compression with minimal performance loss.
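A minimal sketch of prompt compression with this library's PromptCompressor (the default compression model downloads on first use; the context string and token budget are illustrative):

```python
# Minimal sketch of LLMLingua-style prompt compression; the context,
# question, and token budget are illustrative.
from llmlingua import PromptCompressor

compressor = PromptCompressor()  # downloads the default compression model
result = compressor.compress_prompt(
    ["Long retrieved document text goes here ..."],
    question="What does the document say about quantization?",
    target_token=200,  # rough budget for the compressed prompt
)
print(result["compressed_prompt"])
```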
[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
High-speed Large Language Model Serving for Local Deployment
The official implementation of the EMNLP 2023 paper LLM-FP4
Sparsity-aware deep learning inference runtime for CPUs
ChatGLM3 series: open-source bilingual (Chinese-English) chat LLMs
[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding
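From memory of this repo's README (so treat the exact knobs as assumptions, they may differ by version), lookahead decoding is enabled by patching Hugging Face transformers before generation:

```python
# Sketch of enabling lookahead decoding on top of Hugging Face transformers;
# the configuration values below are from memory and may differ by version.
import os
os.environ["USE_LADE"] = "1"

import lade
lade.augment_all()  # monkey-patch transformers' decoding loop
lade.config_lade(LEVEL=5, WINDOW_SIZE=7, GUESS_SET_SIZE=7, DEBUG=0)

# ...then load a model with transformers and call model.generate() as usual;
# greedy decoding now runs with lookahead speedups.
```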