fwensen


A highly optimized inference acceleration engine for Llama and its variants.

C++ 331 28 Updated Dec 12, 2024

My learning notes/codes for ML SYS.

Python 182 6 Updated Dec 13, 2024

A blazing fast inference solution for text embeddings models

Rust 2,931 191 Updated Dec 12, 2024

A streamlined and customizable framework for efficient large model evaluation and performance benchmarking

Python 304 36 Updated Dec 13, 2024

LLM101n: Let's build a Storyteller

30,492 1,668 Updated Aug 1, 2024

Use PEFT or Full-parameter to finetune 400+ LLMs (Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, ...) or 100+ MLLMs (Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, Inter…

Python 4,618 403 Updated Dec 13, 2024

Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)

Python 36,034 4,442 Updated Dec 12, 2024

AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference. Documentation:

Python 1,823 220 Updated Dec 6, 2024

ripgrep recursively searches directories for a regex pattern while respecting your gitignore

Rust 49,203 2,019 Updated Sep 30, 2024

A self-paced course to learn Rust, one exercise at a time.

Rust 6,316 1,129 Updated Nov 19, 2024
Jupyter Notebook 152 30 Updated Dec 13, 2024

Reference implementations of MLPerf™ inference benchmarks

Python 1,254 539 Updated Dec 13, 2024

Reading list for research topics in multimodal machine learning

6,148 853 Updated Aug 20, 2024

[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".

Python 348 37 Updated Dec 13, 2024

Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation: https://www.youtube.com/watch?v=vAmKB7iPkWw

Python 331 58 Updated Dec 6, 2024

A throughput-oriented high-performance serving framework for LLMs

Cuda 652 26 Updated Sep 21, 2024

Inference code for Llama models

Python 56,790 9,605 Updated Aug 18, 2024

A list of papers, docs, codes about model quantization. This repo is aimed to provide the info for model quantization research, we are continuously improving the project. Welcome to PR the works (p…

1,918 208 Updated Nov 1, 2024

MIT HAN Lab 6.5940: efficient ML labs

Jupyter Notebook 2 Updated Feb 3, 2024

[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Python 2,588 212 Updated Dec 13, 2024

[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models

Python 1,284 151 Updated Jul 12, 2024

Code and notes on six major CUDA parallel computing patterns

Cuda 59 9 Updated Jul 30, 2020

CUDA Core Compute Libraries

C++ 1,336 169 Updated Dec 13, 2024
Python 587 52 Updated Jul 31, 2024

Material for gpu-mode lectures

Jupyter Notebook 3,170 325 Updated Dec 3, 2024

How to optimize some algorithms in CUDA.

Cuda 1,726 142 Updated Dec 12, 2024

Principles of search engines

1,515 125 Updated Apr 19, 2024

📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉

2,991 204 Updated Dec 9, 2024
Jupyter Notebook 52 9 Updated Jul 2, 2023