Stars

LLM Inference & Serving

32 repositories

Official inference framework for 1-bit LLMs

C++ · 12,767 stars · 897 forks · Updated Feb 18, 2025

Low-bit LLM inference on CPU with lookup table

C++ · 689 stars · 53 forks · Updated Jan 9, 2025

[ACL 2024] A novel QAT with Self-Distillation framework to enhance ultra low-bit LLMs.

Python · 100 stars · 15 forks · Updated May 16, 2024

BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.

Python · 528 stars · 39 forks · Updated Feb 14, 2025

SGLang is a fast serving framework for large language models and vision language models.

Python · 11,003 stars · 1,097 forks · Updated Feb 28, 2025

Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch

Python · 1,761 stars · 158 forks · Updated Jan 27, 2025
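For context, a minimal PyTorch sketch of the 1-bit weight quantization idea from the BitNet paper: weights are binarized with sign(), rescaled by their mean absolute value, and trained through a straight-through estimator. This illustrates the concept only and is not this repository's API; the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn


class BitLinearSketch(nn.Module):
    """Illustrative 1-bit-weight linear layer in the spirit of BitNet.

    Weights are binarized to {-1, +1} with sign() and rescaled by their mean
    absolute value; a straight-through estimator keeps gradients flowing to
    the latent full-precision weights during training.
    """

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight
        scale = w.abs().mean()              # per-tensor scale (beta)
        w_bin = torch.sign(w) * scale       # 1-bit weights, rescaled
        w_q = w + (w_bin - w).detach()      # straight-through estimator
        return nn.functional.linear(x, w_q)


if __name__ == "__main__":
    layer = BitLinearSketch(64, 32)
    y = layer(torch.randn(4, 64))
    print(y.shape)  # torch.Size([4, 32])
```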

FlashInfer: Kernel Library for LLM Serving

Cuda · 2,204 stars · 227 forks · Updated Feb 27, 2025

Run Large Language Models on RK3588 with GPU acceleration

93 stars · 4 forks · Updated Aug 16, 2023

High-speed Large Language Model Serving for Local Deployment

C++ · 8,126 stars · 424 forks · Updated Feb 19, 2025

A pure C++ cross-platform LLM acceleration library with Python bindings; chatglm-6B-class models can reach 10,000+ tokens/s on a single GPU; supports glm, llama, and moss base models, and runs smoothly on mobile devices

C++ · 3,396 stars · 348 forks · Updated Feb 27, 2025

Reverse engineering the RK3588 NPU

C · 72 stars · 4 forks · Updated May 30, 2024

Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".

Python · 770 stars · 100 forks · Updated Aug 20, 2024

Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.

Go · 130,053 stars · 10,632 forks · Updated Feb 28, 2025

Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.

Python · 403 stars · 47 forks · Updated Sep 11, 2024
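The roofline analysis such a tool performs reduces to comparing a kernel's arithmetic intensity (FLOPs per byte of memory traffic) against the hardware's ratio of peak compute to memory bandwidth. A back-of-the-envelope sketch, using assumed hardware numbers rather than this tool's output, shows why batch-1 LLM decoding is typically memory-bound:

```python
# Back-of-the-envelope roofline check for a single decode-step GEMV
# through one 4096x4096 fp16 weight matrix (batch size 1).
# Hardware numbers below are illustrative assumptions, not measurements.

hidden = 4096
flops = 2 * hidden * hidden          # one multiply + one add per weight
bytes_moved = hidden * hidden * 2    # fp16 weight traffic dominates
intensity = flops / bytes_moved      # ~1 FLOP/byte

peak_flops = 150e12                  # assumed peak fp16 compute (FLOP/s)
bandwidth = 1.0e12                   # assumed memory bandwidth (bytes/s)
ridge_point = peak_flops / bandwidth # ~150 FLOP/byte

print(f"arithmetic intensity: {intensity:.2f} FLOP/byte")
print(f"ridge point:          {ridge_point:.0f} FLOP/byte")
print("memory-bound" if intensity < ridge_point else "compute-bound")
```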

Inference Llama 2 in one file of pure C

C · 18,092 stars · 2,200 forks · Updated Aug 6, 2024

Fast Multimodal LLM on Mobile Devices

C++ · 712 stars · 83 forks · Updated Feb 24, 2025

Open deep learning compiler stack for CPUs, GPUs, and specialized accelerators

Python · 12,053 stars · 3,520 forks · Updated Feb 28, 2025

An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.

Python · 4,725 stars · 510 forks · Updated Jan 21, 2025
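A rough sketch of the GPTQ-style post-training quantization flow this package exposes, based on its documented basic usage; the model name, calibration text, and output directory are placeholders, and the exact argument names should be checked against the package's own examples.

```python
# Sketch of 4-bit post-training quantization with AutoGPTQ.
# Model name, calibration text, and output path are placeholders;
# verify against the package's examples before relying on this.
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM, BaseQuantizeConfig

model_id = "facebook/opt-125m"            # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)

# A handful of tokenized calibration samples drive the GPTQ solve.
examples = [tokenizer("Quantization calibration sample sentence.")]

quantize_config = BaseQuantizeConfig(bits=4, group_size=128)
model = AutoGPTQForCausalLM.from_pretrained(model_id, quantize_config)

model.quantize(examples)                  # run GPTQ over the calibration data
model.save_quantized("opt-125m-4bit-gptq")
```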

TinyChatEngine: On-Device LLM Inference Library

C++ · 816 stars · 82 forks · Updated Jul 4, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python · 39,685 stars · 5,943 forks · Updated Feb 28, 2025
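A minimal offline-generation sketch with vLLM's Python API; the model name and prompt are placeholders.

```python
# Minimal offline batch generation with vLLM.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m")  # placeholder; any supported HF model works
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

outputs = llm.generate(["The key idea behind paged KV-cache memory is"], sampling)
for out in outputs:
    print(out.outputs[0].text)
```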

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ · 9,560 stars · 1,122 forks · Updated Feb 28, 2025

Maid is a cross-platform Flutter app for interfacing with GGUF / llama.cpp models locally, and with Ollama and OpenAI models remotely.

Dart · 1,728 stars · 197 forks · Updated Feb 28, 2025

Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models"

Python · 11,382 stars · 711 forks · Updated Dec 17, 2024
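A small sketch of the loralib workflow: swap selected nn.Linear layers for lora.Linear (which adds the low-rank update W x + (alpha/r)·B A x), freeze everything but the low-rank factors, and save only the LoRA delta. The rank and layer sizes are illustrative.

```python
# Sketch of adapting a model with loralib; rank and sizes are illustrative.
import torch
import torch.nn as nn
import loralib as lora

model = nn.Sequential(
    lora.Linear(512, 512, r=8),  # adds trainable low-rank factors A, B
    nn.ReLU(),
    nn.Linear(512, 10),          # plain layer; stays frozen below
)

lora.mark_only_lora_as_trainable(model)  # freeze all non-LoRA parameters

# ... train as usual, then persist only the small LoRA delta ...
torch.save(lora.lora_state_dict(model), "lora_delta.pt")
```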

A mobile implementation of llama.cpp

Dart · 304 stars · 36 forks · Updated Feb 1, 2024

Phase 2 of the Chinese LLaMA-2 & Alpaca-2 large model project, plus 64K ultra-long-context models (Chinese LLaMA-2 & Alpaca-2 LLMs with 64K long context models)

Python · 7,156 stars · 577 forks · Updated Sep 23, 2024

Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)

Python · 42,354 stars · 5,180 forks · Updated Feb 28, 2025

Fast and memory-efficient exact attention

Python · 15,974 stars · 1,503 forks · Updated Feb 25, 2025
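A minimal call into the package's functional interface, assuming a CUDA device and fp16 tensors; the tensor shapes here are arbitrary.

```python
# Minimal FlashAttention call. Tensors are laid out as
# (batch, seqlen, num_heads, head_dim) and must be fp16/bf16 on a CUDA device.
import torch
from flash_attn import flash_attn_func

q = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(2, 1024, 8, 64, dtype=torch.float16, device="cuda")

out = flash_attn_func(q, k, v, causal=True)  # exact attention, fused kernel
print(out.shape)                             # torch.Size([2, 1024, 8, 64])
```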

LLM inference in C/C++

C++ · 75,505 stars · 10,913 forks · Updated Feb 27, 2025