An Overview of Efficiently Serving Large Language Models across Edge Devices [arXiv] (TBD)
Large language models (LLMs) have achieved impressive results across various tasks. However, their resource-intensive nature poses challenges for efficient deployment. In our overview, we explore serving LLMs on distributed edge devices, addressing scalability and latency concerns.
Efficiently serving LLMs over distributed heterogeneous devices is essential for a seamless, low-latency user experience. It enables scalability by pooling resources from multiple devices, improving load balancing and resource utilization, and it reduces network congestion by spreading the delivery load across devices. Given the diverse range of device types, efficient serving also ensures that each device receives a stream tailored to its capabilities. In short, serving LLMs efficiently over distributed heterogeneous devices improves user experience, scalability, and network performance while accommodating the growing variety of devices in use.
- [arXiv 2023.02] Full Stack Optimization of Transformer Inference: a Survey | UC Berkeley
- [arXiv 2023.12] Towards Efficient Generative Large Language Model Serving: A Survey from Algorithms to Systems | Carnegie Mellon University
- [TMLR 2024] Efficient Large Language Models: A Survey | The Ohio State University
- [arXiv 2024.04] A Survey on Efficient Inference for Large Language Models | Infinigence-AI and Tsinghua University
Image source: Efficient Large Language Model (LLM) Inferencing on GPUs
Image source: Efficient Memory Management for Large Language Model Serving with PagedAttention
- [NeurIPS 2017] Attention is all you need | Google Brain
- [NeurIPS 2020] Language Models are Few-Shot Learners | OpenAI
- [arXiv 2020.01] Scaling Laws for Neural Language Models | Johns Hopkins University and OpenAI
- [arXiv 2022.01] Scaling Language Models: Methods, Analysis & Insights from Training Gopher | DeepMind
Image source: Large Language Models (in 2023)
- [Tech Blog] LLM Inference Performance Engineering: Best Practices | Mosaic AI Research
- [arXiv 2024.04] Andes: Defining and Enhancing Quality-of-Experience in LLM-Based Text Streaming Services | University of Michigan
- Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
- Time Per Output Token (TPOT): The time to generate each output token for a user querying the system. This metric corresponds to how each user perceives the "speed" of the model. For example, a TPOT of 100 milliseconds/token is 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
- Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated from the previous two metrics: latency = TTFT + TPOT * (number of output tokens generated); a minimal computation sketch follows these definitions.
- Throughput: The number of output tokens per second an inference server can generate across all users and requests.
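To make the definitions above concrete, here is a minimal sketch of how the four metrics can be computed from per-request token timestamps. The timestamp layout and helper names are illustrative assumptions, not part of any particular serving framework.

```python
def request_metrics(request_arrival: float, token_times: list[float]) -> dict:
    """Per-request metrics from the arrival time and the wall-clock time
    at which each output token was emitted."""
    ttft = token_times[0] - request_arrival            # Time To First Token (s)
    decode_span = token_times[-1] - token_times[0]     # time spent after the first token
    # TPOT: average time per output token after the first one.
    tpot = decode_span / max(len(token_times) - 1, 1)
    # End-to-end latency, approximately TTFT + TPOT * (number of output tokens).
    latency = token_times[-1] - request_arrival
    return {"ttft_s": ttft, "tpot_s": tpot, "latency_s": latency,
            "num_tokens": len(token_times)}

def server_throughput(all_token_times: list[list[float]]) -> float:
    """Output tokens per second across all requests in a measurement window."""
    total_tokens = sum(len(t) for t in all_token_times)
    start = min(t[0] for t in all_token_times)
    end = max(t[-1] for t in all_token_times)
    return total_tokens / (end - start)

# Example: one request arriving at t=0.0 that emits 4 tokens.
print(request_metrics(0.0, [0.35, 0.45, 0.55, 0.65]))
# -> TTFT 0.35 s, TPOT 0.10 s/token, latency 0.65 s
```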
- [arXiv 2020.04] Longformer: The Long-Document Transformer | Allen Institute for Artificial Intelligence
- [ICML 2022] DeepSpeed-MoE: Advancing Mixture-of-Experts Inference and Training to Power Next-Generation AI Scale | Microsoft
- [EMNLP 2023] GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints | Google Research
- [ICML 2023] Fast Inference from Transformers via Speculative Decoding | Google Research
- [ACL 2023] Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes | University of Washington
- [arXiv 2023.05] SpecInfer: Accelerating Generative LLM Serving with Speculative Inference and Token Tree Verification | Carnegie Mellon University
- [arXiv 2024.02] Break the Sequential Dependency of LLM Inference Using Lookahead Decoding | University of California, San Diego
- [ACL 2024] Draft & Verify: Lossless Large Language Model Acceleration via Self-Speculative Decoding | Zhejiang University
- [NeurIPS 2022] FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness | Stanford University
- [ICLR 2024] FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning | Princeton University
- [arXiv 2023.11] FlashDecoding++: Faster Large Language Model Inference on GPUs | Tsinghua University & Infinigence-AI
- [ICLR 2023] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers | IST Austria
- [ICML 2023] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models | MIT
- [MLSys 2024] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration | MIT
- [PPoPP 2021] TurboTransformers: An Efficient GPU Serving System For Transformer Models | Tencent
- [OSDI 2022] Orca: A Distributed Serving System for Transformer-Based Generative Models | Seoul National University
- [SOSP 2023] Efficient Memory Management for Large Language Model Serving with PagedAttention | UC Berkeley
- [ICLR 2024] Efficient Streaming Language Models with Attention Sinks | MIT
- [arXiv 2024.01] DistServe: Disaggregating Prefill and Decoding for Goodput-optimized Large Language Model Serving | Peking University
- [arXiv 2024.02] FlexLLM: A System for Co-Serving Large Language Model Inference and Parameter-Efficient Finetuning | Carnegie Mellon University
- [arXiv 2024.03] ALTO: An Efficient Network Orchestrator for Compound AI Systems
- [arXiv 2024.03] AttentionStore: Cost-effective Attention Reuse across Multi-turn Conversations in Large Language Model Serving | National University of Singapore
- [arXiv 2024.04] MuxServe: Flexible Multiplexing for Efficient Multiple LLM Serving | The Chinese University of Hong Kong
- [arXiv 2024.04] BlockLLM: Multi-tenant Finer-grained Serving for Large Language Models | City University of Hong Kong
- [arXiv 2024.05] vAttention: Dynamic Memory Management for Serving LLMs without PagedAttention | Microsoft Research India
Parallelism | Tensor Parallelism (TP) | Sequence Parallelism (SP) |
---|---|---|
Illustration | Image source: LoongServe | Image source: LoongServe |
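As a rough intuition for the difference (a minimal NumPy sketch, not tied to any of the systems listed below): tensor parallelism shards a layer's weight matrix across devices, while sequence parallelism shards the input tokens across devices.

```python
import numpy as np

# Toy setup: a single linear layer y = x @ W applied to a batch of token embeddings.
n_devices = 2
x = np.random.randn(8, 16)    # 8 tokens, hidden size 16
W = np.random.randn(16, 32)   # weight matrix of the layer

# Tensor parallelism (TP): each device holds a column slice of W and computes a
# slice of the output; slices are concatenated (an all-gather in practice).
W_shards = np.split(W, n_devices, axis=1)
y_tp = np.concatenate([x @ W_shard for W_shard in W_shards], axis=1)

# Sequence parallelism (SP): each device holds a slice of the tokens and the full
# weights; partial outputs are concatenated along the sequence dimension.
x_shards = np.split(x, n_devices, axis=0)
y_sp = np.concatenate([x_shard @ W for x_shard in x_shards], axis=0)

# Both schemes reproduce the single-device result; they differ in what is
# replicated (weights vs. activations) and in the communication they require.
assert np.allclose(y_tp, x @ W) and np.allclose(y_sp, x @ W)
```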
- [NeurIPS 2019] GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism | Google
- [SC 2021] Efficient Large-Scale Language Model Training on GPU Clusters Using Megatron-LM | NVIDIA
- [OSDI 2022] Orca: A Distributed Serving System for Transformer-Based Generative Models | Seoul National University
- [ICML 2023] FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | Stanford University
- [arXiv 2023.05] Fast Distributed Inference Serving for Large Language Models | Peking University
- [arXiv 2023.12] PowerInfer: Fast Large Language Model Serving with a Consumer-grade GPU | Shanghai Jiao Tong University
- [arXiv 2024.01] Infinite-LLM: Efficient LLM Service for Long Context with DistAttention and Distributed KVCache | Alibaba Group
- [arXiv 2024.03] FastDecode: High-Throughput GPU-Efficient LLM Serving using Heterogeneous Pipelines | Tsinghua University
- [arXiv 2024.04] LoongServe: Efficiently Serving Long-context Large Language Models with Elastic Sequence Parallelism | Peking University
Target: Latency or Throughput
Optimization: Quantization, Flash Attention, PagedAttention, Speculation, Continuous batching
Parallelism: Tensor Parallelism (TP), Pipeline Parallelism (PP), Sequence Parallelism (SP), CPU-GPU offloading (offload)
Supported hardware: NVIDIA GPU, AMD GPU, Intel CPU
Serving System | Target | Optimization | Parallelism | Hardware |
---|---|---|---|---|
vLLM | Throughput | Quantization, Flash Attention, PagedAttention, Speculation, Continuous batching | TP | NVIDIA GPU, Intel CPU, AMD GPU |
TensorRT-LLM | Latency | Quantization, Flash Attention | TP, PP | NVIDIA GPU |
llama.cpp | Latency | Quantization, Flash Attention, Speculation, Continuous batching | TP, PP | NVIDIA GPU, AMD GPU, Intel GPU/CPU, Mac |
text-generation-inference | Latency and Throughput | Quantization, Flash Attention, PagedAttention, Continuous batching | TP | NVIDIA GPU, AMD GPU, Intel CPU |
Suggestion: If your tasks are latency-sensitive, use llama.cpp or TensorRT-LLM. If your tasks are latency-insensitive and you have sufficiently powerful GPUs, use vLLM for better throughput. If you just want to evaluate research ideas on serving optimizations, text-generation-inference is a good choice.
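For reference, the snippet below sketches the throughput-oriented offline path with vLLM's Python API (`LLM`, `SamplingParams`, `generate`). The model name is a placeholder, and engine arguments such as quantization or tensor-parallel degree depend on your hardware.

```python
# Minimal offline-batching sketch with vLLM; the model name is a placeholder.
from vllm import LLM, SamplingParams

prompts = [
    "Explain tensor parallelism in one sentence.",
    "Why does continuous batching improve throughput?",
]
sampling_params = SamplingParams(temperature=0.7, top_p=0.95, max_tokens=128)

# tensor_parallel_size splits the model across GPUs (the TP column in the table above).
llm = LLM(model="meta-llama/Llama-2-7b-hf", tensor_parallel_size=1)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text)
```

Continuous batching and PagedAttention are applied by the engine automatically; the caller only submits prompts and sampling parameters.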
- [ICML 2023] FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU | Stanford University
- [arXiv 2023.12] LLM in a flash: Efficient Large Language Model Inference with Limited Memory | Apple
- [arXiv 2023.08] EdgeMoE: Fast On-Device Inference of MoE-based Large Language Models | Beijing University of Posts and Telecommunications
- [ICML 2023, Oral] Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time
- [ASPLOS 2024] NeuPIMs: NPU-PIM Heterogeneous Acceleration for Batched LLM Inferencing | KAIST
- [FMEC 2023] PipeEdge: Pipeline Parallelism for Large-Scale Model Inference on Heterogeneous Edge Devices | Purdue University
- [ASPLOS 2023] STI: Turbocharge NLP Inference at the Edge via Elastic Pipelining | University of Virginia
- [arXiv 2023.11] Moirai: Towards Optimal Placement for Distributed Inference on Heterogeneous Devices | Zhejiang University
- [ACL 2023 Demo] PETALS: Collaborative Inference and Fine-tuning of Large Models | HSE University and Yandex
- [arXiv 2024.02] APIServe: Efficient API Support for Large-Language Model Inferencing | University of California, San Diego
- [Tech Report] Task Scheduling for Decentralized LLM Serving in Heterogeneous Networks | UC Berkeley
- [arXiv 2024.01] CaraServe: CPU-Assisted and Rank-Aware LoRA Serving for Generative LLM Inference | HKUST
- [arXiv 2024.03] DejaVu: KV-cache Streaming for Fast, Fault-tolerant Generative LLM Serving | ETH Zurich
- [arXiv 2024.04] Mélange: Cost Efficient Large Language Model Serving by Exploiting GPU Heterogeneity | UC Berkeley
- [MLSys 2024] HeteGen: Heterogeneous Parallel Inference for Large Language Models on Resource-Constrained Devices | National University of Singapore
- [ICML 2024] HexGen: Generative Inference of Large Language Model over Heterogeneous Environment | HKUST