zhukevkesky
University of Science and Technology of China (USTC)


Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.

C++ 2,184 120 Updated Dec 13, 2024

A self-learning tutorial for CUDA high-performance programming.

JavaScript 283 32 Updated Dec 12, 2024

Making Long-Context LLM Inference 10x Faster and 10x Cheaper

Python 286 32 Updated Dec 14, 2024

LLM KV cache compression made easy

Python 267 14 Updated Dec 12, 2024

Low-bit LLM inference on CPU with lookup table

C++ 617 48 Updated Dec 6, 2024

Thin, unified, C++-flavored wrappers for the CUDA APIs

C++ 804 80 Updated Dec 9, 2024

Materials for learning SGLang

136 9 Updated Dec 9, 2024

A throughput-oriented high-performance serving framework for LLMs

Cuda 654 26 Updated Sep 21, 2024

A high-throughput and memory-efficient inference and serving engine for LLMs

Python 31,894 4,847 Updated Dec 15, 2024

Mirage: Automatically Generating Fast GPU Kernels without Programming in Triton/CUDA

C++ 679 39 Updated Dec 11, 2024

A curated list for Efficient Large Language Models

Python 1,326 94 Updated Dec 9, 2024

No Pose, No Problem: Surprisingly Simple 3D Gaussian Splats from Sparse Unposed Images

Python 551 20 Updated Dec 7, 2024

Bridging Large Vision-Language Models and End-to-End Autonomous Driving

Python 219 6 Updated Dec 8, 2024

A curated list of awesome LLM for Autonomous Driving resources (continually updated)

1,065 53 Updated Sep 25, 2024

Breaking Throughput-Latency Trade-off for Long Sequences with Speculative Decoding

Python 100 4 Updated Dec 4, 2024
Cuda 6 1 Updated Sep 26, 2021

Compiler for Dynamic Neural Networks

Python 43 2 Updated Nov 13, 2023

Supplemental material for ECRTS24 paper: Autonomy Today: Many Delay-Prone Black Boxes

C++ 2 Updated May 27, 2024

RT-Swap: Addressing GPU Memory Bottlenecks for Real-Time Multi-DNN Inference

Python 2 Updated Dec 2, 2024

InstantSplat: Sparse-view SfM-free Gaussian Splatting in Seconds

Python 896 57 Updated Dec 14, 2024
C 6 Updated May 17, 2024

Source code for the paper: "Pantheon: Preemptible Multi-DNN Inference on Mobile Edge GPUs"

Python 7 1 Updated Apr 15, 2024

A Fine-Grained, Hardware-Level GPU Resource Isolation Solution for Multi-Tenant DNN Inference

C++ 2 Updated May 26, 2024
Jupyter Notebook 47 3 Updated Jun 13, 2024

Efficient tool-assisted LLM serving runtime.

Python 5 1 Updated Sep 11, 2024
Jupyter Notebook 15 3 Updated May 28, 2024

Summary of some awesome work for optimizing LLM inference

39 1 Updated Dec 13, 2024

FlashInfer: Kernel Library for LLM Serving

Cuda 1,549 153 Updated Dec 14, 2024

Hackable and optimized Transformers building blocks, supporting a composable construction.

Python 8,770 632 Updated Dec 11, 2024