Starred repositories
Curated collection of papers in machine learning systems
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
UPMEM LLM Framework allows profiling PyTorch layers and functions and simulating those layers/functions with a given hardware profile.
LLMServingSim: A HW/SW Co-Simulation Infrastructure for LLM Inference Serving at Scale
📚150+ Tensor/CUDA Cores Kernels, ⚡️flash-attention-mma, ⚡️hgemm with WMMA, MMA and CuTe (98%~100% TFLOPS of cuBLAS 🎉🎉).
An Overview of Efficiently Serving Large Language Models across Edge Devices
GLake: optimizing GPU memory management and IO transmission.
SGLang is a fast serving framework for large language models and vision language models.
Awesome-LLM-Eval: a curated list of tools, datasets/benchmarks, demos, leaderboards, papers, docs, and models, mainly for the evaluation of LLMs, aiming to explore the technical frontiers of generative AI.
[TMLR 2024] Efficient Large Language Models: A Survey
A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures
Awesome LLM Plaza: daily tracking of all sorts of awesome LLM topics, e.g., LLMs for coding, robotics, reasoning, multimodality, etc.
📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉
An unofficial implementation of "Mixture-of-Depths: Dynamically allocating compute in transformer-based language models"
A list of papers, docs, and code about efficient AIGC. This repo aims to provide information for efficient AIGC research, covering both language and vision; we are continuously improving the project. Wel…
ONNXim is a fast cycle-level simulator that can model multi-core NPUs for DNN inference
Ramulator 2.0 is a modern, modular, extensible, and fast cycle-accurate DRAM simulator. It provides support for agile implementation and evaluation of new memory system designs (e.g., new DRAM standards).
Analyze the inference of Large Language Models (LLMs). Analyze aspects like computation, storage, transmission, and hardware roofline model in a user-friendly interface.
Large Language Model (LLM) Systems Paper List
How to optimize common algorithms in CUDA.
FlashInfer: Kernel Library for LLM Serving
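Several of the starred tools above (notably the LLM inference analyzer) rely on the hardware roofline model to classify kernels as memory- or compute-bound. A minimal sketch of that model, using hypothetical hardware numbers not tied to any specific GPU or repository:

```python
def roofline_flops(arithmetic_intensity: float,
                   peak_flops: float,
                   peak_bandwidth: float) -> float:
    """Attainable FLOP/s for a kernel with the given FLOPs-per-byte ratio.

    The roofline model caps throughput at the lower of the compute roof
    (peak_flops) and the memory roof (intensity * peak_bandwidth).
    """
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12

# Low-intensity kernels (e.g., decode-phase GEMV in LLM serving, roughly a
# couple of FLOPs per weight byte) sit on the bandwidth-bound slope.
low = roofline_flops(2.0, PEAK_FLOPS, PEAK_BW)     # 2e12 FLOP/s, bandwidth-bound
# High-intensity kernels (e.g., large prefill GEMMs) hit the compute roof.
high = roofline_flops(500.0, PEAK_FLOPS, PEAK_BW)  # 1e14 FLOP/s, compute-bound
```

The ridge point (here 100 FLOPs/byte, i.e., peak_flops / peak_bandwidth) separates the two regimes; tools like the analyzers listed above compute where each layer of an LLM falls relative to it.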