Tsinghua University
- Beijing, China
- https://xxxxyu.github.io/academic
- https://xxxxyu.github.io/blog
LLM Inference & Serving
[ACL 2024] A novel quantization-aware training (QAT) framework with self-distillation to enhance ultra-low-bit LLMs.
BitBLAS is a library to support mixed-precision matrix multiplications, especially for quantized LLM deployment.
SGLang is a fast serving framework for large language models and vision language models.
Implementation of "BitNet: Scaling 1-bit Transformers for Large Language Models" in PyTorch.
FlashInfer: Kernel Library for LLM Serving
Run Large Language Models on RK3588 with GPU acceleration.
High-speed Large Language Model Serving for Local Deployment
A pure C++ cross-platform LLM acceleration library with Python bindings; ChatGLM-6B-class models reach 10,000+ tokens/s on a single GPU; supports GLM, LLaMA, and MOSS base models; runs smoothly on mobile devices.
Code for the ICML 2023 paper "SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot".
Get up and running with Llama 3.3, DeepSeek-R1, Phi-4, Gemma 2, and other large language models.
Analyze the inference of Large Language Models (LLMs): computation, storage, transmission, and the hardware roofline model, in a user-friendly interface.
Open deep learning compiler stack for CPUs, GPUs, and specialized accelerators.
An easy-to-use LLM quantization package with user-friendly APIs, based on the GPTQ algorithm.
TinyChatEngine: On-Device LLM Inference Library
A high-throughput and memory-efficient inference and serving engine for LLMs (a minimal usage sketch follows this list).
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
Maid is a cross-platform Flutter app for interfacing with GGUF / llama.cpp models locally, and with Ollama and OpenAI models remotely.
Code for loralib, an implementation of "LoRA: Low-Rank Adaptation of Large Language Models" (see the sketch after this list).
Chinese LLaMA-2 & Alpaca-2 LLMs (phase-2 project) with 64K long-context models.
Unified Efficient Fine-Tuning of 100+ LLMs & VLMs (ACL 2024)
Fast and memory-efficient exact attention
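Tied to the serving-engine entry above (vLLM), here is a minimal offline-inference sketch using its Python API; the model name, prompts, and sampling settings are illustrative assumptions, not recommendations.

# Minimal vLLM offline-inference sketch.
from vllm import LLM, SamplingParams

prompts = [
    "Explain KV-cache paging in one sentence:",
    "The main bottleneck in LLM decoding is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# LLM() loads the model and manages GPU memory for the KV cache internally.
llm = LLM(model="facebook/opt-125m")

# generate() batches the prompts and returns one RequestOutput per prompt.
for output in llm.generate(prompts, sampling_params):
    print(output.prompt, "->", output.outputs[0].text)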
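Similarly, tied to the loralib entry, a minimal sketch of swapping a dense layer for its LoRA counterpart; the layer sizes and rank are arbitrary assumptions for illustration.

# Minimal loralib sketch: one nn.Linear replaced by lora.Linear, which adds a
# trainable low-rank update B @ A on top of the frozen pretrained weight.
import torch
import torch.nn as nn
import loralib as lora

model = nn.Sequential(
    lora.Linear(128, 128, r=8),  # rank r=8 is an arbitrary choice here
    nn.ReLU(),
    nn.Linear(128, 10),
)

# Freeze everything except the LoRA parameters (lora_A / lora_B).
lora.mark_only_lora_as_trainable(model)

x = torch.randn(4, 128)
print(model(x).shape)  # torch.Size([4, 10])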