Stars
A highly optimized inference acceleration engine for Llama and its variants.
My learning notes and code for ML systems (MLSys).
A blazing fast inference solution for text embedding models
A streamlined and customizable framework for efficient large model evaluation and performance benchmarking
Use PEFT or full-parameter training to finetune 400+ LLMs (Qwen2.5, Llama3.2, GLM4, Internlm2.5, Yi1.5, Mistral, Baichuan2, DeepSeek, ...) or 100+ MLLMs (Qwen2-VL, Qwen2-Audio, Llama3.2-Vision, Llava, Inter…
Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
AutoAWQ implements the AWQ algorithm for 4-bit quantization with a 2x speedup during inference.
ripgrep recursively searches directories for a regex pattern while respecting your gitignore
A self-paced course to learn Rust, one exercise at a time.
Reference implementations of MLPerf™ inference benchmarks
Reading list for research topics in multimodal machine learning
[EMNLP 2024 Industry Track] This is the official PyTorch implementation of "LLMC: Benchmarking Large Language Model Quantization with a Versatile Compression Toolkit".
Coding a Multimodal (Vision) Language Model from scratch in PyTorch with full explanation: https://www.youtube.com/watch?v=vAmKB7iPkWw
A throughput-oriented high-performance serving framework for LLMs
A list of papers, docs, and code about model quantization. This repo aims to provide information for model quantization research; we are continuously improving the project. PRs of relevant works are welcome (p…
MIT HAN Lab 6.5940: efficient ML course labs
[MLSys 2024 Best Paper Award] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration (a minimal quantization sketch follows this list)
[ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models
Code and notes for the six major CUDA parallel computing patterns
How to optimize common algorithms in CUDA.
📖A curated list of Awesome LLM/VLM Inference Papers with codes, such as FlashAttention, PagedAttention, Parallelism, etc. 🎉🎉
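Several of the entries above (AutoAWQ, AWQ, SmoothQuant, LLMC) center on low-bit weight quantization. The sketch below shows only the basic group-wise 4-bit round-to-nearest step those methods build on, assuming NumPy; the function names and the group size of 128 are illustrative assumptions, not code from any of the repos, and the actual projects add activation-aware channel scaling (AWQ) or activation-to-weight scale migration (SmoothQuant) on top of this step.

```python
# Minimal sketch of group-wise symmetric 4-bit weight quantization.
# Illustrative only; not the AutoAWQ / llm-awq / SmoothQuant implementation.
import numpy as np


def quantize_4bit_groupwise(w: np.ndarray, group_size: int = 128):
    """Quantize a (rows, cols) weight matrix to int4, one scale per group of columns."""
    rows, cols = w.shape
    assert cols % group_size == 0, "cols must be divisible by group_size"
    w_groups = w.reshape(rows, cols // group_size, group_size)
    # One scale per (row, group): map the max magnitude onto the int4 limit (7).
    scales = np.abs(w_groups).max(axis=-1, keepdims=True) / 7.0
    scales = np.where(scales == 0, 1.0, scales)  # avoid division by zero
    q = np.clip(np.round(w_groups / scales), -8, 7).astype(np.int8)
    return q.reshape(rows, cols), scales


def dequantize_4bit_groupwise(q: np.ndarray, scales: np.ndarray, group_size: int = 128):
    """Reconstruct an approximate float weight matrix from int4 codes and scales."""
    rows, cols = q.shape
    q_groups = q.reshape(rows, cols // group_size, group_size).astype(np.float32)
    return (q_groups * scales).reshape(rows, cols)


if __name__ == "__main__":
    w = np.random.randn(16, 256).astype(np.float32)
    q, s = quantize_4bit_groupwise(w)
    w_hat = dequantize_4bit_groupwise(q, s)
    print("mean abs reconstruction error:", np.abs(w - w_hat).mean())
```

The group-wise scales keep the quantization error local: a single outlier column only inflates the scale of its own group rather than the whole row, which is why most of the listed toolkits default to group sizes around 64 to 128.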