HKUST(Guangzhou) - https://lzzmm.github.io
Stars
DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
Educational framework exploring ergonomic, lightweight multi-agent orchestration. Managed by OpenAI Solution team.
The project has grown far beyond its original idea: a curated collection of premium software across various categories.
Low latency JSON generation using LLMs ⚡️
A throughput-oriented high-performance serving framework for LLMs
Enhance Tesseract OCR output for scanned PDFs by applying Large Language Model (LLM) corrections.
Experimental projects related to TensorRT
User-friendly Desktop Client App for AI Models/LLMs (GPT, Claude, Gemini, Ollama...)
Flash Attention in ~100 lines of CUDA (forward pass only)
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently.
MiniCPM-V 2.6: A GPT-4V Level MLLM for Single Image, Multi Image and Video on Your Phone
A lightweight library for portable low-level GPU computation using WebGPU.
[NeurIPS'24 Spotlight] Speeds up long-context LLM inference with approximate and dynamic sparse attention computation, reducing pre-filling latency by up to 10x on an A100 while maintaining accuracy.
A low-latency & high-throughput serving engine for LLMs
Efficient GPU support for LLM inference with x-bit quantization (e.g., FP6, FP5).
This is an online course where you can learn and master the skill of low-level performance analysis and tuning.
A ChatGPT(GPT-3.5) & GPT-4 Workload Trace to Optimize LLM Serving Systems
Unified Efficient Fine-Tuning of 100+ LLMs (ACL 2024)
C++ Insights - See your source code with the eyes of a compiler
Pruner-Zero: Evolving Symbolic Pruning Metric from scratch for LLMs
A list of tutorials, papers, talks, and open-source projects for emerging compilers and architectures
llama3 implementation one matrix multiplication at a time
A large-scale simulation framework for LLM inference
Bayesian optimisation & Reinforcement Learning library developed by Huawei Noah's Ark Lab
A high-throughput and memory-efficient inference and serving engine for LLMs