View fwensen's full-sized avatar


A highly optimized inference acceleration engine for Llama and its variants.

C++ 308 27 Updated Dec 12, 2024

My learning notes and code for ML systems.

Python 178 6 Updated Dec 12, 2024

A blazing fast inference solution for text embeddings models

Rust 2,922 190 Updated Dec 12, 2024

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Python 302 36 Updated Dec 12, 2024

LLM101n: Let's build a Storyteller

30,475 1,668 Updated Aug 1, 2024

Use PEFT or Full-parameter to finetune 400+ LLMs (Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, ...) or 100+ MLLMs (Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, Inter…

Python 4,604 403 Updated Dec 12, 2024

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)

Python 35,973 4,437 Updated Dec 12, 2024

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

Python 1,822 220 Updated Dec 6, 2024

ripgrep recursively searches directories for a regex pattern while respecting your gitignore

Rust 49,183 2,019 Updated Sep 30, 2024

A self-paced course to learn Rust, one exercise at a time.

Rust 6,310 1,127 Updated Nov 19, 2024
Jupyter Notebook 149 28 Updated Dec 12, 2024

Reference implementations of MLPerf™ inference benchmarks

Python 1,253 538 Updated Dec 12, 2024

Reading list for research topics in multimodal machine learning

6,146 853 Updated Aug 20, 2024

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

Python 342 37 Updated Dec 12, 2024

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation: https://www.youtube.com/watch?v=vAmKB7iPkWw

Python 329 58 Updated Dec 6, 2024

A throughput-oriented high-performance serving framework for LLMs

Cuda 653 26 Updated Sep 21, 2024

Inference code for Llama models

Python 56,771 9,602 Updated Aug 18, 2024

A list of papers, docs, and code about model quantization. This repo aims to provide information for model quantization research, and we are continuously improving the project. PRs adding works are welcome (p…

1,915 208 Updated Nov 1, 2024

MIT HAN Lab 6.5940: efficient ML labs

Jupyter Notebook 2 Updated Feb 3, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,586 212 Updated Oct 16, 2024

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,281 150 Updated Jul 12, 2024

Six major CUDA parallel computing patterns: code and notes

Cuda 59 9 Updated Jul 30, 2020

CUDA Core Compute Libraries

C++ 1,333 169 Updated Dec 12, 2024
Python 587 52 Updated Jul 31, 2024

Material for gpu-mode lectures

Jupyter Notebook 3,169 325 Updated Dec 3, 2024

How to optimize some algorithms in CUDA.

Cuda 1,718 141 Updated Dec 12, 2024

Principles of search engines

1,513 125 Updated Apr 19, 2024

📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉

2,983 202 Updated Dec 9, 2024
Jupyter Notebook 52 9 Updated Jul 2, 2023