An Overview of Efficiently Serving Large Language Models across Edge Devices [arXiv] (TBD)
Large language models (LLMs) have achieved impressive results across various tasks. However, their resource-intensive nature poses challenges for efficient deployment. In our overview, we explore serving LLMs on distributed edge devices, addressing scalability and latency concerns.
Efficiently serving LLMs over distributed heterogeneous devices is essential for a seamless, low-latency user experience. It enables scalability by pooling resources from multiple devices, improving load balancing and resource utilization, and it reduces network congestion by spreading the delivery load across devices. Given the diverse range of device types, efficient serving also ensures that each device receives a stream tailored to its capabilities. In short, serving LLMs efficiently over distributed heterogeneous devices improves user experience, scalability, and network performance while accommodating the growing variety of devices in use.
- [arXiv 2023.02] Full Stack Optimization of Transformer Inference: a Survey | UC Berkeley
- [arXiv 2023.12] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | Carnegie Mellon University
- [TMLR 2024] Efficient Large Language Models: A Survey | The Ohio State University
- [arXiv 2024.04] A Survey on Efficient Inference for Large Language Models | Infinigence-AI and Tsinghua University
Image source: Efficient Large Language Model (LLM) Inferencing on GPUs
Image source: Efficient Memory Management for Large Language Model Serving with PagedAttention
- [NeurIPS 2017] Attention is all you need | Google Brain
- [NeurIPS 2020] Language Models are Few-Shot Learners | OpenAI
- [arXiv 2020.01] Scaling Laws for Neural Language Models | Johns Hopkins University and OpenAI
- [arXiv 2022.01] Scaling Language Models: Methods, Analysis & Insights from Training Gopher | DeepMind
Image source: Large Language Models (in 2023)
- [Tech Blog] LLM Inference Performance Engineering: Best Practices | Mosaic AI Research
- [arXiv 2024.04] Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | University of Michigan
- Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
- Time Per Output Token (TPOT): The time to generate each output token for a user querying the system. This metric corresponds to how each user perceives the "speed" of the model. For example, a TPOT of 100 milliseconds/token is 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
- Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated from the previous two metrics: latency = TTFT + TPOT * (number of output tokens generated); a minimal computation sketch follows these definitions.
- Throughput: The number of output tokens per second an inference server can generate across all users and requests.
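To make the definitions above concrete, here is a minimal sketch of how the four metrics can be computed from per-request token timestamps. The timestamp layout and helper names are illustrative assumptions, not part of any particular serving framework.

```python
def request_metrics(request_arrival: float, token_times: list[float]) -> dict:
    """Per-request metrics from the arrival time and the wall-clock time
    at which each output token was emitted."""
    ttft = token_times[0] - request_arrival            # Time To First Token (s)
    decode_span = token_times[-1] - token_times[0]     # time spent after the first token
    # TPOT: average time per output token after the first one.
    tpot = decode_span / max(len(token_times) - 1, 1)
    # End-to-end latency, approximately TTFT + TPOT * (number of output tokens).
    latency = token_times[-1] - request_arrival
    return {"ttft_s": ttft, "tpot_s": tpot, "latency_s": latency,
            "num_tokens": len(token_times)}

def server_throughput(all_token_times: list[list[float]]) -> float:
    """Output tokens per second across all requests in a measurement window."""
    total_tokens = sum(len(t) for t in all_token_times)
    start = min(t[0] for t in all_token_times)
    end = max(t[-1] for t in all_token_times)
    return total_tokens / (end - start)

# Example: one request arriving at t=0.0 that emits 4 tokens.
print(request_metrics(0.0, [0.35, 0.45, 0.55, 0.65]))
# -> TTFT 0.35 s, TPOT 0.10 s/token, latency 0.65 s
```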
- [arXiv 2020.04] Longformer: The Long-Document Transformer | Allen Institute for Artificial Intelligence
- [ICML 2022] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | Microsoft
- [EMNLP 2023] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Google Research
- [ICML 2023] Fast Inference from Transformers via Speculative Decoding | Google Research
- [ACL 2023] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | University of Washington
- [arXiv 2023.05] SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | Carnegie Mellon University
- [arXiv 2024.02] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | University of California, San Diego
- [ACL 2024] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding | Zhejiang University
- [NeurIPS 2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Stanford University
- [ICLR 2024] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Princeton University
- [arXiv 2023.11] FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua University & Infinigence-AI
- [ICLR 2023] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | IST Austria
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | MIT
- [MLSys 2024] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MIT
- [PPoPP 2021] TurboTransformers: An Efficient GPU Serving System For Transformer Models | Tencent
- [OSDI 2022] Orca: A Distributed Serving System for Transformer-Based Generative Models | Seoul National University
- [SOSP 2023] Efficient Memory Management for Large Language Model Serving with PagedAttention | UC Berkeley
- [ICLR 2024] Efficient Streaming Language Models with Attention Sinks | MIT
- [arXiv 2024.01] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Peking University
- [arXiv 2024.02] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning | Carnegie Mellon University
- [arXiv 2024.03] ALTO: An Efficient Network Orchestrator for Compound AI Systems
- [arXiv 2024.03] AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | National University of Singapore
- [arXiv 2024.04] MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving | The Chinese University of Hong Kong
- [arXiv 2024.04] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models | City University of Hong Kong
- [arXiv 2024.05] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | Microsoft Research India
Parallelism | Tensor Parallelism (TP) | Sequence Parallelism (SP) |
---|---|---|
Illustration | Image source: LoongServe | Image source: LoongServe |
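As a rough intuition for the difference (a minimal NumPy sketch, not tied to any of the systems listed below): tensor parallelism shards a layer's weight matrix across devices, while sequence parallelism shards the input tokens across devices.

```python
import numpy as np

# Toy setup: a single linear layer y = x @ W applied to a batch of token embeddings.
n_devices = 2
x = np.random.randn(8, 16)    # 8 tokens, hidden size 16
W = np.random.randn(16, 32)   # weight matrix of the layer

# Tensor parallelism (TP): each device holds a column slice of W and computes a
# slice of the output; slices are concatenated (an all-gather in practice).
W_shards = np.split(W, n_devices, axis=1)
y_tp = np.concatenate([x @ W_shard for W_shard in W_shards], axis=1)

# Sequence parallelism (SP): each device holds a slice of the tokens and the full
# weights; partial outputs are concatenated along the sequence dimension.
x_shards = np.split(x, n_devices, axis=0)
y_sp = np.concatenate([x_shard @ W for x_shard in x_shards], axis=0)

# Both schemes reproduce the single-device result; they differ in what is
# replicated (weights vs. activations) and in the communication they require.
assert np.allclose(y_tp, x @ W) and np.allclose(y_sp, x @ W)
```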
- [NeurIPS 2019] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | Google
- [SC 2021] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | NVIDIA
- [OSDI 2022] Orca: A Distributed Serving System for Transformer-Based Generative Models | Seoul National University
- [ICML 2023] FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | Stanford University
- [arXiv 2023.05] Fast Distributed Inference Serving for Large Language Models | Peking University
- [arXiv 2023.12] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | Shanghai Jiao Tong University
- [arXiv 2024.01] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache | Alibaba Group
- [arXiv 2024.03] FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Tsinghua University
- [arXiv 2024.04] LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | Peking University
Target: Latency or Throughput
Optimization: Quantization, Flash Attention, PagedAttention, Speculation, Continuous batching
Parallelism: Tensor Parallelism (TP), Pipeline Parallelism (PP), Sequence Parallelism (SP), CPU-GPU offloading (offload)
Supported hardware: NVIDIA GPU, AMD GPU, Intel CPU
Serving System | Target | Optimization | Parallelism | Hardware |
---|---|---|---|---|
vLLM | Throughput | Quantization, Flash Attention, PagedAttention, Speculation, Continuous batching | TP | NVIDIA GPU, Intel CPU, AMD GPU |
TensorRT-LLM | Latency | Quantization, Flash Attention | TP, PP | NVIDIA GPU |
llama.cpp | Latency | Quantization, Flash Attention, Speculation, Continuous batching | TP, PP | NVIDIA GPU, AMD GPU, Intel GPU/CPU, Mac |
text-generation-inference | Latency and Throughput | Quantization, Flash Attention, PagedAttention, Continuous batching | TP | NVIDIA GPU, AMD GPU, Intel CPU |
Suggestion: If your tasks are latency-sensitive, use llama.cpp or TensorRT-LLM. If your tasks are latency-insensitive and you have sufficiently powerful GPUs, use vLLM for better throughput. If you just want to evaluate research ideas on serving optimizations, text-generation-inference is a good choice.
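For reference, the snippet below sketches the throughput-oriented offline path with vLLM's Python API (`LLM`, `SamplingParams`, `generate`). The model name is a placeholder, and engine arguments such as quantization or tensor-parallel degree depend on your hardware.

```python
# Minimal offline-batching sketch with vLLM; the model name is a placeholder.
from vllm import LLM, SamplingParams

prompts = [
    "Explain tensor parallelism in one sentence.",
    "Why does continuous batching improve throughput?",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# tensor_parallel_size splits the model across GPUs (the TP column in the table above).
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

Continuous batching and PagedAttention are applied by the engine automatically; the caller only submits prompts and sampling parameters.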
- [ICML 2023] FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | Stanford University
- [arXiv 2023.12] LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
- [arXiv 2023.08] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | Beijing University of Posts and Telecommunications
- [ICML 2023, Oral] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- [ASPLOS 2024] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing | KAIST
- [FMEC 2023] PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices | Purdue University
- [ASPLOS 2023] STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | University of Virginia
- [arXiv 2023.11] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices | Zhejiang University
- [ACL 2023 Demo] PETALS: Collaborative Inference and Fine-tuning of Large Models | HSE University and Yandex
- [arXiv 2024.02] APIServe: Efficient API Support for Large-Language Model Inferencing | University of California, San Diego
- [Tech Report] Task Scheduling for Decentralized LLM Serving in Heterogeneous Networks | UC Berkeley
- [arXiv 2024.01] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference | HKUST
- [arXiv 2024.03] DejaVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving | ETH Zurich
- [arXiv 2024.04] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity | UC Berkeley
- [MLSys 2024] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices | National University of Singapore
- [ICML 2024] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment | HKUST