Tsinghua University
- Beijing, China
- https://xxxxyu.github.io/academic
- https://xxxxyu.github.io/blog
LLM Inference & Serving
[ACL 2024] A novel quantization-aware training (QAT) framework with self-distillation to enhance ultra-low-bit LLMs.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
SGLang is a fast serving framework for large language models and vision language models.
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch.
FlashInfer: Kernel Library for LLM Serving
Run Large Language Models on RK3588 with GPU acceleration.
High-speed Large Language Model Serving for Local Deployment
A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models; runs smoothly on mobile devices.
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
Open deep learning compiler stack for CPUs, GPUs, and specialized accelerators.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
TinyChatEngine: On-Device LLM Inference Library
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list).
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Maid is a cross-platform Flutter app for interfacing with GGUF / llama.cpp models locally, and with Ollama and OpenAI models remotely.
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models" (see the sketch after this list).
Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Fast and memory-efficient exact attention
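Tied to the serving-engine entry above (vLLM), here is a minimal offline-inference sketch using its Python API; the model name, prompts, and sampling settings are illustrative assumptions, not recommendations.

# Minimal vLLM offline-inference sketch.
from vllm import LLM, SamplingParams

prompts = [
    "Explain KV-cache paging in one sentence:",
    "The main bottleneck in LLM decoding is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM() loads the model and manages GPU memory for the KV cache internally.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)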
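Similarly, tied to the loralib entry, a minimal sketch of swapping a dense layer for its LoRA counterpart; the layer sizes and rank are arbitrary assumptions for illustration.

# Minimal loralib sketch: one nn.Linear replaced by lora.Linear, which adds a
# trainable low-rank update B @ A on top of the frozen pretrained weight.
import torch
import torch.nn as nn
import loralib as lora

model = nn.Sequential(
    lora.Linear(128, 128, r=8),  # rank r=8 is an arbitrary choice here
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Freeze everything except the LoRA parameters (lora_A / lora_B).
lora.mark_only_lora_as_trainable(model)

x = torch.randn(4, 128)
print(model(x).shape)  # torch.Size([4, 10])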