View chenbohua3's full-sized avatar

Organizations

@AlibabaPAI


📖A curated list of Awesome LLM/VLM Inference Papers with codes: WINT8/4, Flash-Attention, Paged-Attention, Parallelism, etc. 🎉🎉

3,572 247 Updated Mar 4, 2025

[NeurIPS'24 Spotlight, ICLR'25] Speeds up long-context LLM inference by computing attention approximately with dynamic sparsity, which reduces inference latency by up to 10x for pre-filling on an …

Python 924 46 Updated Feb 25, 2025

CUDA Templates for Linear Algebra Subroutines

C++ 6,950 1,137 Updated Feb 28, 2025

Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,736 165 Updated Feb 23, 2025

A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deploym…

Python 759 55 Updated Mar 3, 2025

A curated list of practical guide resources of LLMs (LLMs Tree, Examples, Papers)

9,750 755 Updated May 31, 2024

[MLSys'25] QServe: W4A8KV4 Quantization and System Co-design for Efficient LLM Serving; [MLSys'25] LServe: Efficient Long-sequence LLM Serving with Unified Sparse Attention

C++ 571 33 Updated Feb 21, 2025

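A repository for the Chinese-speaking community that analyzes the architecture and implementation of smart contract applications currently on the market.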

Solidity 1,634 345 Updated Dec 4, 2024

leaked prompts of GPTs

29,349 3,986 Updated Sep 27, 2024

[ICML 2024] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding

Python 1,202 73 Updated Oct 14, 2024

TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficie…

C++ 9,603 1,126 Updated Mar 4, 2025

A state-of-the-art open visual language model | multimodal pre-trained model

Python 6,389 427 Updated May 29, 2024

[ICLR 2024] Efficient Streaming Language Models with Attention Sinks

Python 6,814 378 Updated Jul 11, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,796 233 Updated Mar 3, 2025

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 40,198 6,021 Updated Mar 4, 2025

A curated list for Efficient Large Language Models

Python 1,482 116 Updated Feb 16, 2025

Development repository for the Triton language and compiler

MLIR 14,702 1,834 Updated Mar 4, 2025

Xwin-LM: Powerful, Stable, and Reproducible LLM Alignment

Python 1,036 42 Updated May 31, 2024

Code for the paper "Evaluating Large Language Models Trained on Code"

Python 2,606 373 Updated Jan 17, 2025

A Pythonic framework to simplify AI service building

Python 2,692 173 Updated Feb 25, 2025

Inference Llama 2 in one file of pure 🔥

Mojo 2,109 140 Updated May 21, 2024

Lepton Examples

Jupyter Notebook 141 18 Updated Dec 29, 2024

Implementation of the LLaMA language model based on nanoGPT. Supports flash attention, Int8 and GPTQ 4bit quantization, LoRA and LLaMA-Adapter fine-tuning, pre-training. Apache 2.0-licensed.

Python 6,036 520 Updated Sep 6, 2024

Evals is a framework for evaluating LLMs and LLM systems, and an open-source registry of benchmarks.

Python 15,603 2,669 Updated Dec 18, 2024

A framework for few-shot evaluation of language models.

Python 8,102 2,165 Updated Mar 4, 2025

Beyond the Imitation Game collaborative benchmark for measuring and extrapolating the capabilities of language models

Python 2,982 599 Updated Jul 19, 2024

⚡ Build your chatbot within minutes on your favorite device; offer SOTA compression techniques for LLMs; run LLMs efficiently on Intel Platforms⚡

Python 2,162 211 Updated Oct 8, 2024

AITemplate is a Python framework that renders neural networks into high-performance CUDA/HIP C++ code. Specialized for FP16 TensorCore (NVIDIA GPU) and MatrixCore (AMD GPU) inference.

Python 4,613 377 Updated Dec 4, 2024

A Python-level JIT compiler designed to make unmodified PyTorch programs faster.

Python 1,033 125 Updated Apr 17, 2024