Stars
DeepEP: an efficient expert-parallel communication library
The official implementation of the Tensor ProducT ATTenTion Transformer (T6)
Code release for AdapMoE, accepted at ICCAD 2024
Mooncake is the serving platform for Kimi, a leading LLM service provided by Moonshot AI.
An Open Large Reasoning Model for Real-World Solutions
Development repository for the Triton language and compiler
MiniCPM-o 2.6: A GPT-4o Level MLLM for Vision, Speech and Multimodal Live Streaming on Your Phone
[ICLR 2025] DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads
✨✨Latest Advances on Multimodal Large Language Models
A throughput-oriented high-performance serving framework for LLMs
✨✨VITA-1.5: Towards GPT-4o Level Real-Time Vision and Speech Interaction
A low-latency & high-throughput serving engine for LLMs
FlashInfer: Kernel Library for LLM Serving
High-performance Transformer implementation in C++.
Disaggregated serving system for Large Language Models (LLMs).
🚀 Reverse-engineered API for the KIMI AI long-context LLM (specialty: long-text analysis and summarization). Supports high-speed streaming output, agent conversations, web search, explorer mode, the K1 reasoning model, long-document interpretation, image parsing, and multi-turn dialogue; zero-configuration deployment, multi-token support, and automatic cleanup of conversation traces. For testing only; for commercial use, please visit the official open platform.
Efficient and easy multi-instance LLM serving
Standardized Serverless ML Inference Platform on Kubernetes
Multi-LoRA inference server that scales to 1000s of fine-tuned LLMs
AutoGPT is the vision of accessible AI for everyone, to use and to build on. Our mission is to provide the tools so that you can focus on what matters.
A list of ICs and IPs for AI, Machine Learning and Deep Learning.
Letta (formerly MemGPT) is a framework for creating LLM services with memory.
The official repository of "QuickLLaMA: Query-aware Inference Acceleration for Large Language Models"