Stars
A highly optimized inference acceleration engine for Llama and its variants.
My learning notes/codes for ML SYS.
TensorRT-LLM provides users with an easy-to-use Python API to define Large Language Models (LLMs) and build TensorRT engines that contain state-of-the-art optimizations to perform inference efficiently on NVIDIA GPUs.
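As a quick illustration of that Python API, here is a minimal sketch using TensorRT-LLM's high-level `LLM` entry point (available in recent releases); the model name and sampling settings are illustrative assumptions, not a definitive recipe.

```python
# Minimal sketch of TensorRT-LLM's high-level LLM API (recent releases).
# The model name and sampling settings are illustrative assumptions.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")  # builds or loads an engine
params = SamplingParams(max_tokens=64, temperature=0.8)

for output in llm.generate(["What is TensorRT-LLM?"], params):
    print(output.outputs[0].text)
```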
trypear / pearai-app
Forked from microsoft/vscode. PearAI: Open Source AI Code Editor (Fork of VSCode). The PearAI Submodule (https://github.com/trypear/pearai-submodule) is a fork of Continue.
Amphion (/æmˈfaɪən/) is a toolkit for Audio, Music, and Speech Generation. Its purpose is to support reproducible research and help junior researchers and engineers get started in the field of audio, music, and speech generation.
Composable building blocks to build Llama Apps
An open-source RAG-based tool for chatting with your documents.
A high-throughput and memory-efficient inference and serving engine for LLMs
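For reference, a minimal offline-inference sketch with vLLM's Python API; the model name here is an illustrative assumption.

```python
# Minimal vLLM offline inference; the model name is an illustrative assumption.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
sampling = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

# generate() batches prompts and returns one RequestOutput per prompt
for out in llm.generate(["The capital of France is"], sampling):
    print(out.outputs[0].text)
```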
A package for parsing PDFs and analyzing their content using LLMs.
Microsoft's GraphRAG + AutoGen + Ollama + Chainlit = Fully Local & Free Multi-Agent RAG Superbot
SearchGPT / Perplexity clone, but personalised for you.
🔍 An LLM-based Multi-agent Framework of Web Search Engine (like Perplexity.ai Pro and SearchGPT)
An Open-source Framework for Data-centric, Self-evolving Autonomous Language Agents
izhuhaoran / vllm
Forked from vllm-project/vllm. A high-throughput and memory-efficient inference and serving engine for LLMs.
Agentic components of the Llama Stack APIs
SGLang is a fast serving framework for large language models and vision language models.
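A hedged sketch of SGLang's frontend DSL talking to a locally launched server; the endpoint URL, port, and generation settings are assumptions.

```python
# Sketch of SGLang's frontend DSL. Assumes a server was started separately,
# e.g. `python -m sglang.launch_server --model-path <model> --port 30000`;
# the endpoint URL and generation settings are illustrative assumptions.
import sglang as sgl

sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def qa(s, question):
    s += sgl.user(question)
    s += sgl.assistant(sgl.gen("answer", max_tokens=64))

state = qa.run(question="What is SGLang?")
print(state["answer"])
```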
SearchGPT / Perplexity Pages clone, but personalised for you.
Fast Matrix Multiplications for Lookup Table-Quantized LLMs
FlashInfer: Kernel Library for LLM Serving
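A small sketch of calling one of FlashInfer's fused attention kernels for single-request decoding; the tensor sizes are illustrative assumptions, and the layout follows the library's documented NHD convention.

```python
# Sketch of FlashInfer's single-request decode attention; sizes are assumptions.
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim, kv_len = 32, 8, 128, 4096
q = torch.randn(num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(kv_len, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Fused decode attention with grouped-query heads; returns [num_qo_heads, head_dim]
out = flashinfer.single_decode_with_kv_cache(q, k, v)
```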
[ICML 2024] Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference
TLLM_QMM strips the quantized-kernel implementations out of Nvidia's TensorRT-LLM, removing the NVInfer dependency, and exposes them as an easy-to-use PyTorch module. We modified the dequantization and weight preprocessing…
MobileLLM: Optimizing Sub-billion Parameter Language Models for On-Device Use Cases. In ICML 2024.
Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM
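A hedged sketch of the library's one-shot quantization flow based on its documented usage; the import paths have shifted across versions, and the recipe, model id, calibration dataset, and output path below are illustrative assumptions.

```python
# Sketch of a one-shot W4A16 GPTQ quantization with llm-compressor; the model,
# dataset, and output directory are illustrative assumptions.
from llmcompressor.transformers import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

recipe = GPTQModifier(scheme="W4A16", targets="Linear", ignore=["lm_head"])

oneshot(
    model="meta-llama/Llama-3.1-8B-Instruct",
    dataset="open_platypus",              # calibration data
    recipe=recipe,
    output_dir="Llama-3.1-8B-W4A16",      # resulting checkpoint loads in vLLM
    max_seq_length=2048,
    num_calibration_samples=512,
)
```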
GPTModels - a multi-model, window-based LLM AI plugin for Neovim, with an emphasis on stability and clean code
Production-ready LLM model compression/quantization toolkit with accelerated inference support for both CPU and GPU via HF, vLLM, and SGLang.
QQQ is an innovative and hardware-optimized W4A8 quantization solution for LLMs.