Stars
LaTeX template for master's and doctoral dissertations at the University of Chinese Academy of Sciences (UCAS), written to the requirements of the Guiding Opinions on the Standards for Writing Graduate Dissertations at UCAS (校发学位字[2022]40号, Attachment 1)
LaTeX Thesis Template for the University of Chinese Academy of Sciences
Itheima (黑马程序员)'s latest hands-on Java project, Cangqiong Waimai (苍穹外卖): an enterprise-grade Spring Boot + SSM project well suited to beginners. Compared with Ruiji Waimai (瑞吉外卖), its business logic is more realistic and complete: the client side is a WeChat mini program, login uses WeChat login, and it adds statistical reports, new-order alerts, customer order reminders, and order management, closing the business loop. The technology stack is also richer and more practical, so it can be seen as an enhanced Ruiji Waimai.
An introductory tutorial on recommender systems; read online at https://datawhalechina.github.io/fun-rec/
Easy-to-Use RAG framework; third-place (Top 3) solution in the CCF AIOps International Challenge 2024
EfficientFormerV2 [ICCV 2023] & EfficientFormer [NeurIPS 2022]
Quantized Attention that achieves speedups of 2.1-3.1x and 2.7-5.1x compared to FlashAttention2 and xformers, respectively, without losing end-to-end metrics across various models.
Training-free, post-training attention with efficient sub-quadratic complexity, implemented with OpenAI Triton.
Official repository for LightSeq: Sequence Level Parallelism for Distributed Training of Long Context Transformers
Development repository for the Triton language and compiler
Inference optimization of the ViT model using TensorRT, NVIDIA's high-performance deep learning inference platform, which is designed to maximize the efficiency of deep learning models during inference.
Runner-up solution in the TensorRT 2022 competition: accelerating the MobileViT model with TensorRT
Flash Attention in ~100 lines of CUDA (forward pass only); a Python sketch of the underlying online-softmax tiling follows after this list
How to optimize common algorithms in CUDA.
A unified library of state-of-the-art model optimization techniques such as quantization, pruning, distillation, speculative decoding, etc. It compresses deep learning models for downstream deployment.
A great project for campus recruiting (fall and spring hiring) and internships! Build a high-performance deep learning inference library from scratch, supporting inference for models such as LLaMA 2, UNet, YOLOv5, and ResNet. Implement a high-performance deep learning inference library step by step.
Official code for the paper: Token Summarisation for Efficient Vision Transformers via Graph-based Token Propagation
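
The Flash Attention entry above points at a CUDA implementation of the forward pass. As a language-agnostic illustration (not code from that repo), here is a minimal NumPy sketch of the online-softmax tiling that FlashAttention-style kernels implement; the function name `attention_tiled` and the block size are illustrative assumptions.

```python
import numpy as np

def attention_tiled(Q, K, V, block=64):
    """Attention forward pass with online softmax over K/V blocks.

    This is the tiling trick at the heart of FlashAttention-style kernels:
    scores are computed one K/V block at a time, and a running row-wise max
    (m) and softmax denominator (l) are rescaled on the fly so the full
    N x N score matrix is never materialized.
    """
    N, d = Q.shape
    scale = 1.0 / np.sqrt(d)
    O = np.zeros((N, d))
    m = np.full(N, -np.inf)  # running row-wise max of the scores
    l = np.zeros(N)          # running softmax denominator
    for j in range(0, N, block):
        Kj, Vj = K[j:j + block], V[j:j + block]
        S = (Q @ Kj.T) * scale                 # scores against this K block
        m_new = np.maximum(m, S.max(axis=1))
        alpha = np.exp(m - m_new)              # rescales previously accumulated stats
        P = np.exp(S - m_new[:, None])
        l = l * alpha + P.sum(axis=1)
        O = O * alpha[:, None] + P @ Vj
        m = m_new
    return O / l[:, None]

# Sanity check against naive softmax attention.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = (Q @ K.T) / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(attention_tiled(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```

A real CUDA kernel keeps each Q block and the running statistics in on-chip SRAM and fuses the whole loop into one kernel launch; the rescaling arithmetic, however, is exactly the one shown here.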