s1: Simple test-time scaling

Python 4,073 444 Updated Feb 8, 2025

Faster PyTorch bitsandbytes 4-bit fp4 nn.Linear ops

Python 25 Updated Mar 16, 2024

LLM knowledge sharing that anyone can understand; a must-read before spring/autumn recruitment LLM interviews, so you can speak confidently with interviewers

Jupyter Notebook 913 73 Updated Jan 20, 2025

RTP-LLM: Alibaba's high-performance LLM inference engine for diverse applications.

C++ 620 54 Updated Jan 21, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python 8,980 869 Updated Feb 8, 2025

LLM KV cache compression made easy

Python 381 25 Updated Feb 3, 2025
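The entry above is about compressing the KV cache. One common family of approaches is eviction: score each cached token (for example, by the attention it has received) and keep only the top-k entries. The sketch below is a toy NumPy illustration of that idea, assuming a simple keep-top-k policy; the function name and shapes are hypothetical, not the library's API.

```python
import numpy as np

def evict_kv(keys, values, attn_scores, keep):
    """Keep the `keep` cached tokens with the highest attention scores.

    keys/values: (seq_len, head_dim) arrays for one attention head.
    attn_scores: (seq_len,) importance score per cached token.
    Survivors stay in their original order so positional structure holds.
    """
    top = np.sort(np.argsort(attn_scores)[-keep:])
    return keys[top], values[top]

rng = np.random.default_rng(0)
seq_len, head_dim = 128, 64
keys = rng.standard_normal((seq_len, head_dim))
values = rng.standard_normal((seq_len, head_dim))
scores = rng.random(seq_len)

# Shrink the cache from 128 to 32 tokens (4x memory reduction).
k_small, v_small = evict_kv(keys, values, scores, keep=32)
```

Real systems combine eviction with quantization and per-layer budgets; this only shows the core selection step.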

Code repo for the paper "SpinQuant: LLM quantization with learned rotations"

Python 208 26 Updated Nov 11, 2024

Official implementation of the ICLR 2024 paper AffineQuant

Python 24 2 Updated Mar 30, 2024

Code for the NeurIPS 2024 paper QuaRot: end-to-end 4-bit inference of large language models.

Python 334 30 Updated Nov 26, 2024
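QuaRot (like SpinQuant above) relies on a computational invariance: for any orthogonal matrix Q, a linear layer satisfies W @ x == (W @ Q) @ (Q.T @ x), so weights and activations can be rotated before quantization to spread out outlier channels without changing the layer's output. A toy NumPy illustration of that identity, assuming a random orthogonal rotation (the papers use Hadamard or learned rotations with fused kernels):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
W = rng.standard_normal((d, d))   # layer weights
x = rng.standard_normal(d)        # input activations

# Random orthogonal Q via QR decomposition.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))

y_plain = W @ x                   # original layer output
y_rotated = (W @ Q) @ (Q.T @ x)   # rotated weights applied to rotated input
# y_rotated matches y_plain up to floating-point error, because Q @ Q.T = I.
```

The practical payoff is that W @ Q and Q.T @ x have flatter per-channel magnitude distributions than W and x, which makes low-bit quantization much less lossy.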

Official PyTorch implementation of FlatQuant: Flatness Matters for LLM Quantization

Python 97 8 Updated Jan 23, 2025

Jupyter Notebook 85 10 Updated Jul 23, 2024

[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

Cuda 238 24 Updated Nov 22, 2024

Applied AI experiments and examples for PyTorch

Python 218 22 Updated Feb 6, 2025

Development repository for the Triton language and compiler

C++ 14,317 1,778 Updated Feb 8, 2025

QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference

Python 116 5 Updated Mar 6, 2024

Compress your input to ChatGPT or other LLMs, to let them process 2x more content and save 40% memory and GPU time.

Python 352 17 Updated Feb 12, 2024
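The entry above compresses prompts by dropping low-information tokens before they reach the model. A minimal sketch of that budget-based idea, using inverse token frequency as a crude stand-in for the small language model that real tools use to score token importance (all names here are illustrative, not the repo's API):

```python
from collections import Counter

def compress_prompt(text, keep_ratio=0.5):
    """Keep the highest-information tokens up to a target length budget.

    Rarer tokens score as more informative; ties are broken by original
    position, and survivors keep their original order.
    """
    tokens = text.split()
    freq = Counter(t.lower() for t in tokens)
    ranked = sorted(range(len(tokens)),
                    key=lambda i: (freq[tokens[i].lower()], i))
    budget = max(1, int(len(tokens) * keep_ratio))
    keep = sorted(ranked[:budget])  # restore original token order
    return " ".join(tokens[i] for i in keep)

prompt = ("the the the model model quantization reduces memory "
          "use and the the inference cost")
short = compress_prompt(prompt, keep_ratio=0.5)
# Repeated filler words are dropped; the informative tokens survive.
```

A production compressor scores tokens with a small LM's perplexity rather than raw frequency, but the select-under-a-budget structure is the same.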

Large-scale Self-supervised Pre-training Across Tasks, Languages, and Modalities

Python 20,692 2,591 Updated Feb 6, 2025

Official PyTorch repository for Extreme Compression of Large Language Models via Additive Quantization (https://arxiv.org/pdf/2401.06118.pdf) and PV-Tuning: Beyond Straight-Through Estimation for Ext…

Python 1,211 183 Updated Dec 26, 2024

A fast inference library for running LLMs locally on modern consumer-class GPUs

Python 3,928 299 Updated Feb 8, 2025

Large World Model: modeling text and video with million-token context

Python 7,221 555 Updated Oct 19, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 1,959 197 Updated Feb 8, 2025

[EMNLP'23, ACL'24] Compresses the prompt and KV cache to speed up LLM inference and enhance LLMs' perception of key information, achieving up to 20x compression with minimal performance loss.

Python 4,850 271 Updated Jan 26, 2025

[MLSys'24] Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Cuda 292 25 Updated Jul 2, 2024
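Atom-style low-bit serving quantizes weights (and activations) to 4 bits with fine-grained, group-wise scales. The sketch below shows symmetric group-wise int4 quantization in NumPy; the function names and the group size are illustrative assumptions, not the repo's kernels, which do this fused on the GPU.

```python
import numpy as np

def quantize_int4_grouped(w, group_size=32):
    """Quantize a 1-D weight vector to int4 codes, one scale per group.

    Symmetric scheme: each group's scale maps its largest magnitude to 7,
    so codes land in the signed 4-bit range [-8, 7].
    """
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # guard all-zero groups
    codes = np.clip(np.round(groups / scales), -8, 7).astype(np.int8)
    return codes, scales

def dequantize_int4_grouped(codes, scales):
    return (codes.astype(np.float32) * scales).reshape(-1)

rng = np.random.default_rng(1)
w = rng.standard_normal(256).astype(np.float32)
codes, scales = quantize_int4_grouped(w)
recon = dequantize_int4_grouped(codes, scales)
max_err = float(np.abs(recon - w).max())
# Rounding error per weight is bounded by half a group's scale step.
```

Smaller groups give lower error at the cost of more scale metadata; that trade-off is the knob most 4-bit serving systems expose.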

High-speed Large Language Model Serving for Local Deployment

C++ 8,082 420 Updated Jan 28, 2025

The official implementation of the EMNLP 2023 paper LLM-FP4

Python 178 14 Updated Dec 15, 2023

Sparsity-aware deep learning inference runtime for CPUs

Python 3,094 181 Updated Jul 19, 2024

ChatGLM3 series: open-source bilingual chat LLMs

Python 13,616 1,592 Updated Jan 13, 2025

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,186 72 Updated Oct 14, 2024