
VPTQ: a flexible, extreme low-bit quantization algorithm

Python 543 36 Updated Dec 15, 2024

[NeurIPS'24 Spotlight] Speeds up long-context LLM inference by computing attention with approximate, dynamic sparsity, reducing pre-filling latency by up to 10x on an A100 whil…

Python 835 39 Updated Dec 16, 2024

EfficientQAT: Efficient Quantization-Aware Training for Large Language Models

Python 232 18 Updated Oct 8, 2024

Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).

Cuda 218 16 Updated Oct 28, 2024

Fast Hadamard transform in CUDA, with a PyTorch interface

C 119 17 Updated May 24, 2024
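To illustrate what this library accelerates, here is a minimal NumPy sketch of the fast Walsh-Hadamard transform — the O(n log n) butterfly recursion that such CUDA kernels implement. The function name `fwht` is illustrative, not this repo's API:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform (unnormalized); len(x) must be a power of 2."""
    x = np.asarray(x, dtype=float).copy()
    n = x.shape[0]
    h = 1
    while h < n:
        # Butterfly pass: combine blocks of size h pairwise.
        for i in range(0, n, 2 * h):
            a = x[i:i + h].copy()
            b = x[i + h:i + 2 * h].copy()
            x[i:i + h] = a + b
            x[i + h:i + 2 * h] = a - b
        h *= 2
    return x
```

Applying the transform twice recovers the input scaled by n, a handy correctness check (H·H = n·I for the unnormalized Hadamard matrix).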

Code for the NeurIPS'24 paper "QuaRot": end-to-end 4-bit inference for large language models.

Python 298 25 Updated Nov 26, 2024

Grok open release

Python 49,728 8,342 Updated Aug 30, 2024

Official PyTorch repository for "Extreme Compression of Large Language Models via Additive Quantization" (https://arxiv.org/pdf/2401.06118.pdf) and "PV-Tuning: Beyond Straight-Through Estimation for Ext…

Python 1,184 180 Updated Nov 28, 2024

Official implementation of Half-Quadratic Quantization (HQQ)

Python 716 72 Updated Nov 22, 2024

The Truth Is In There: Improving Reasoning in Language Models with Layer-Selective Rank Reduction

Python 375 28 Updated Jul 9, 2024

Official inference library for Mistral models

Jupyter Notebook 9,800 869 Updated Nov 12, 2024

Supercharge Your Model Training

Python 5,190 423 Updated Dec 16, 2024

Code for the paper "Towards the Law of Capacity Gap in Distilling Language Models"

Python 95 5 Updated Jul 9, 2024

The TinyLlama project is an open endeavor to pretrain a 1.1B Llama model on 3 trillion tokens.

Python 8,018 475 Updated May 3, 2024

A curated list for Efficient Large Language Models

Python 1,329 94 Updated Dec 9, 2024

Microsoft Automatic Mixed Precision Library

Python 528 42 Updated Sep 29, 2024

[ICLR2024 spotlight] OmniQuant is a simple and powerful quantization technique for LLMs.

Python 737 56 Updated Oct 8, 2024

Fast and memory-efficient exact attention

Python 14,645 1,374 Updated Dec 15, 2024
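For reference, this is the computation FlashAttention performs exactly; the library's contribution is evaluating it tile-by-tile in on-chip SRAM rather than materializing the full score matrix. A plain NumPy sketch of the reference semantics (illustrative only, not the library's API):

```python
import numpy as np

def attention(q, k, v):
    """Reference scaled dot-product attention; materializes the full score matrix."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for numerical stability
    p = np.exp(scores)
    p /= p.sum(axis=-1, keepdims=True)            # softmax over keys
    return p @ v                                  # convex combination of value rows
```

Because each output row is a convex combination of the value rows, every output entry stays within the range of the corresponding value column.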

Code for paper: "QuIP: 2-Bit Quantization of Large Language Models With Guarantees"

Python 354 32 Updated Feb 24, 2024

PyTorch extension for emulating FP8 data formats on standard FP32 Xeon/GPU hardware.

Python 101 10 Updated Dec 3, 2024

LMDeploy is a toolkit for compressing, deploying, and serving LLMs.

Python 4,846 438 Updated Dec 16, 2024

Implementation of "Attention Is Off By One" by Evan Miller

Python 186 10 Updated Aug 28, 2023

Code repo for the paper "LLM-QAT: Data-Free Quantization-Aware Training for Large Language Models"

Python 259 24 Updated Sep 3, 2024

Flexible simulator for mixed precision and format simulation of LLMs and vision transformers.

Python 47 4 Updated Jul 10, 2023

Awesome LLM compression research papers and tools.

1,251 82 Updated Dec 11, 2024

Code for the ICLR 2023 paper "GPTQ: Accurate Post-training Quantization of Generative Pretrained Transformers".

Python 1,964 156 Updated Mar 27, 2024

GPTQ inference Triton kernel

Jupyter Notebook 286 22 Updated May 18, 2023

4 bits quantization of LLaMA using GPTQ

Python 3,015 460 Updated Jul 13, 2024
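To make the idea concrete, here is a hedged NumPy sketch of plain round-to-nearest 4-bit quantization with per-group scales — the baseline that GPTQ improves on by correcting rounding error with second-order weight updates. The function names are illustrative, not this repo's API:

```python
import numpy as np

def quantize_rtn_4bit(w, group_size=128):
    """Symmetric round-to-nearest int4 quantization with one scale per group."""
    w = np.asarray(w, dtype=np.float32)
    groups = w.reshape(-1, group_size)
    scale = np.abs(groups).max(axis=1, keepdims=True) / 7.0  # map [-max, max] onto [-7, 7]
    scale = np.where(scale == 0, 1.0, scale)                 # guard all-zero groups
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize_4bit(q, scale):
    """Reconstruct approximate float weights from int4 codes and group scales."""
    return (q.astype(np.float32) * scale).reshape(-1)
```

With a symmetric scale of max|w|/7 per group, the per-weight reconstruction error is bounded by half a quantization step (scale/2).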