HetServe-LLMs

An Overview of Efficiently Serving Large Language Models across Edge Devices [arXiv] (TBD)

What is This Overview About?

Large language models (LLMs) have achieved impressive results across various tasks. However, their resource-intensive nature poses challenges for efficient deployment. In our overview, we explore serving LLMs on distributed edge devices, addressing scalability and latency concerns.

Why is Efficiently Serving LLMs over Distributed Heterogeneous Devices Needed?

Efficiently serving LLMs over distributed heterogeneous devices is necessary to provide a seamless user experience with low latency. Pooling the resources of multiple devices enables scalability and improves load balancing and resource utilization, while distributing the inference load across devices helps avoid network congestion. Given the diverse range of device types available, efficient serving also ensures that each device receives a workload tailored to its capabilities. Overall, serving LLMs efficiently over distributed heterogeneous devices improves user experience, scalability, and network performance, and accommodates the growing variety of devices in use.

Table of Contents

LLM Serving Survey

LLM Serving

Image source: Efficient Large Language Model (LLM) Inferencing on GPUs

Image source: Efficient Memory Management for Large Language Model Serving with PagedAttention

Background

Image source: Large Language Models (in 2023)

Serving Metrics

Common Metrics

  • Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
  • Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds with how each user will perceive the "speed" of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
  • Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated from the previous two metrics: latency = TTFT + TPOT * (the number of tokens to be generated); see the worked example after this list.
  • Throughput: The number of output tokens per second an inference server can generate across all users and requests.
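
To show how these metrics combine, here is a back-of-the-envelope sketch (the TTFT, TPOT, token count, and concurrency numbers are illustrative assumptions, not measurements from any system):

```python
# Back-of-the-envelope latency/throughput arithmetic for the metrics above.
# All numbers are illustrative assumptions.

ttft_s = 0.5            # Time To First Token, in seconds
tpot_s = 0.1            # Time Per Output Token, in seconds (100 ms/token)
num_output_tokens = 200

# latency = TTFT + TPOT * (number of tokens to be generated)
latency_s = ttft_s + tpot_s * num_output_tokens
print(f"end-to-end latency: {latency_s:.1f} s")                  # 20.5 s

# Per-user generation speed implied by TPOT.
tokens_per_sec_per_user = 1.0 / tpot_s
print(f"per-user speed: {tokens_per_sec_per_user:.0f} tokens/s")  # 10 tokens/s

# Aggregate throughput grows with the number of concurrently served users,
# assuming batching does not degrade TPOT (an optimistic simplification).
concurrent_users = 16
aggregate_throughput = concurrent_users * tokens_per_sec_per_user
print(f"aggregate throughput: {aggregate_throughput:.0f} tokens/s")
```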

Latency-oriented Optimizations

Efficient Models

Efficient Operators/Kernels

Quantization

Throughput-oriented Optimizations

Resource Management

Parallelism

Illustrations comparing Tensor Parallelism (TP) and Sequence Parallelism (SP) (Image source: LoongServe)
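
To make the contrast concrete, below is a minimal NumPy sketch (not taken from any of the systems listed here) of column-wise tensor parallelism for a single linear layer: the weight matrix is split across simulated devices, each device computes a partial output independently, and the shards are concatenated to recover the full result.

```python
import numpy as np

# Minimal sketch of column-wise tensor parallelism (TP) for one linear layer.
# "Devices" are simulated in-process; sizes and names are illustrative.

num_devices = 2
d_in, d_out = 8, 16

rng = np.random.default_rng(0)
x = rng.normal(size=(4, d_in))        # a batch of 4 token embeddings
W = rng.normal(size=(d_in, d_out))    # full weight matrix of the layer

# Split the weight column-wise so each device holds d_out / num_devices columns.
shards = np.split(W, num_devices, axis=1)

# Each device computes its partial output independently; a column-parallel
# matmul needs no communication until the shards are concatenated.
partial_outputs = [x @ shard for shard in shards]
y_parallel = np.concatenate(partial_outputs, axis=1)

# Verify against the single-device result.
y_single = x @ W
assert np.allclose(y_parallel, y_single)
print("column-parallel output matches single-device output:", y_parallel.shape)
```

Sequence parallelism, by contrast, partitions the input along the sequence (token) dimension, which is the axis that systems such as LoongServe scale elastically for long prompts.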

Open-source LLM Serving Systems

Target: Latency or Throughput

Optimization: Quantization, Flash Attention, PagedAttention, Speculation, Continuous batching

Parallelism: Tensor Parallelism (TP), Pipeline Parallelism (PP), Sequence Parallelism (SP), CPU-GPU offloading (offload)

Supported hardware: NVIDIA GPU, AMD GPU, Intel CPU

| Serving System | Target | Optimization | Parallelism | Hardware |
|---|---|---|---|---|
| vLLM | Throughput | Quantization, Flash Attention, PagedAttention, Speculation, Continuous batching | TP | NVIDIA GPU, Intel CPU, AMD GPU |
| TensorRT-LLM | Latency | Quantization, Flash Attention | TP, PP | NVIDIA GPU |
| llama.cpp | Latency | Quantization, Flash Attention, Speculation, Continuous batching | TP, PP | NVIDIA GPU, AMD GPU, Intel GPU/CPU, Mac |
| text-generation-inference | Latency and Throughput | Quantization, Flash Attention, PagedAttention, Continuous batching | TP | NVIDIA GPU, AMD GPU, Intel CPU |

Suggestion: If your tasks are latency-sensitive, use llama.cpp or TensorRT-LLM. If your tasks are latency-insensitive and you have sufficiently powerful GPUs, use vLLM for better throughput. If you just want to evaluate research ideas with serving optimizations, text-generation-inference is a good choice.
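
For reference, the snippet below is a minimal offline-inference sketch using vLLM's Python API; the model name, prompts, and sampling parameters are illustrative assumptions, so check the vLLM documentation for the options supported by your version and hardware.

```python
from vllm import LLM, SamplingParams

# Illustrative model and sampling settings; adjust to your hardware and needs.
llm = LLM(model="facebook/opt-125m")
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=64)

prompts = [
    "Explain continuous batching in one sentence.",
    "What is PagedAttention?",
]

# vLLM batches the prompts internally (continuous batching + PagedAttention).
outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    print(output.prompt, "->", output.outputs[0].text.strip())
```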

Serving on Heterogeneous Devices

A Single Device

Scaling to Distributed Devices

Other Lists
