High-throughput serving engine for AI models
✅ Batching ✅ Streaming ✅ Auto-GPU, multi-GPU ✅ Multi-modal ✅ PyTorch/JAX/TF ✅ Full control ✅ Auth ✅ Built on FastAPI ✅ Custom specs (OpenAI)
Lightning AI • Get started • Examples • Features • Docs
LitServe is a high-throughput serving engine for deploying AI models at scale. LitServe generates an API endpoint for a model and handles batching, streaming, autoscaling across CPUs/GPUs, and more.
Why we wrote LitServe:
- Work with any model: LLMs, vision, time series, and more.
- We wanted a zero-abstraction, minimal, hackable code-base without bloat.
- Built for enterprise scale (not just demos).
- Easy enough for researchers, scalable and hackable for engineers.
- Work on any hardware (GPU/TPU) automatically.
- Let you focus on model performance, not the serving boilerplate.

Think of LitServe as PyTorch Lightning for model serving (if you're familiar with Lightning), but it supports every framework, including PyTorch, JAX, TensorFlow, and more.
Explore various examples that show different models deployed with LitServe:
| Example | Description |
|---|---|
| Hello world | Hello world model |
| Llama 3 (8B) | (LLM) Deploy Llama 3 |
| Any Hugging Face model | (Text) Deploy any Hugging Face model |
| Hugging Face BERT model | (Text) Deploy a model for tasks like text generation and more |
| OpenAI CLIP | (Multimodal) Deploy OpenAI CLIP for tasks like image understanding |
| OpenAI Whisper | (Audio) Deploy OpenAI Whisper for tasks like speech to text |
| Stable Diffusion 2 | (Vision) Deploy Stable Diffusion 2 for tasks like image generation |
| Text to speech (XTTS V2) | (Speech) Deploy a text-to-speech voice cloning API |
Install LitServe via pip:

```bash
pip install litserve
```
Advanced install options

Install the main branch:

```bash
pip install git+https://github.com/Lightning-AI/litserve.git@main
```

Install from source:

```bash
git clone https://github.com/Lightning-AI/LitServe
cd LitServe
pip install -e '.[all]'
```
LitServe has a minimal API that enables enterprise-scale deployments with full control:
- Implement the `LitAPI` class, which describes the inference process for the model(s).
- Enable specific optimizations (such as batching or streaming) in the `LitServer`.
Here's a hello world example:
```python
# server.py
import litserve as ls


# STEP 1: DEFINE YOUR MODEL API
class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        # Setup the model so it can be called in `predict`.
        self.model = lambda x: x**2

    def decode_request(self, request):
        # Convert the request payload to your model input.
        return request["input"]

    def predict(self, x):
        # Run the model on the input and return the output.
        return self.model(x)

    def encode_response(self, output):
        # Convert the model output to a response payload.
        return {"output": output}


# STEP 2: START THE SERVER
if __name__ == "__main__":
    api = SimpleLitAPI()
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)
```
Now run the server via the command line:

```bash
python server.py
```
LitServe automatically generates a client when it starts. Use this client to test the server:

```bash
python client.py
```
Or ping the server yourself directly:

```python
import requests

response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
print(response.json())
```
The server expects the client to send a `POST` request to the `/predict` URL with a JSON payload. How the payload is structured is up to the implementation of the `LitAPI` subclass.
LitServe supports multiple advanced, state-of-the-art features:
| Feature | Description |
|---|---|
| Accelerators | CPU, GPU, multi-GPU, MPS |
| Auto-GPU | Detects and auto-runs on all GPUs on a machine |
| Model types | LLMs, vision, time series, any model type |
| ML frameworks | PyTorch, JAX, TensorFlow, NumPy, etc. |
| Batching | ✅ |
| API authentication | ✅ |
| Multiple models in a single API | ✅ |
| Full request/response control | ✅ |
| Automatic schema validation | ✅ |
| Handle timeouts | ✅ |
| Handle disconnects | ✅ |
| Streaming | ✅ |
Note
Our goal is not to jump on every hype train, but to support features that scale under the most demanding enterprise deployments.
Explore each feature in detail:
Use accelerators automatically (GPUs, CPU, mps)
LitServe automatically detects GPUs on a machine and uses them when available:
```python
import litserve as ls
from litserve.examples import SimpleLitAPI

# Automatically selects the available accelerator
api = SimpleLitAPI()  # defined by you with ls.LitAPI

# when running on GPUs these are equivalent. It's best to let Lightning decide by not specifying it!
server = ls.LitServer(api)
server = ls.LitServer(api, accelerator="cuda")
server = ls.LitServer(api, accelerator="auto")
```
`LitServer` accepts an `accelerator` argument which defaults to `"auto"`. It can also be explicitly set to `"cpu"`, `"cuda"`, or `"mps"` if you wish to manually control the device placement.
The following example shows how to set the accelerator manually:
```python
import litserve as ls
from litserve.examples import SimpleLitAPI

# Run on CUDA-supported GPUs
server = ls.LitServer(SimpleLitAPI(), accelerator="cuda")

# Run on Apple's Metal-powered GPUs
server = ls.LitServer(SimpleLitAPI(), accelerator="mps")
```
Serve on multiple GPUs
`LitServer` has the ability to coordinate serving from multiple GPUs. It accepts a `devices` argument which defaults to `"auto"`. On multi-GPU machines, LitServe will run a copy of the model on each device detected on the machine. The `devices` argument can also be explicitly set to the desired number of devices to use on the machine.
```python
import litserve as ls
from litserve.examples import SimpleLitAPI

# Automatically selects the available accelerators
api = SimpleLitAPI()  # defined by you with ls.LitAPI

# when running on a 4-GPU machine these are equivalent.
# It's best to let Lightning decide by not specifying accelerator and devices!
server = ls.LitServer(api)
server = ls.LitServer(api, accelerator="cuda", devices=4)
server = ls.LitServer(api, accelerator="auto", devices="auto")
```
For example, running the API server on a 4-GPU machine, with a PyTorch model served on each GPU:
```python
import torch, torch.nn as nn
import litserve as ls


class Linear(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(1, 1)
        self.linear.weight.data.fill_(2.0)
        self.linear.bias.data.fill_(1.0)

    def forward(self, x):
        return self.linear(x)


class SimpleTorchAPI(ls.LitAPI):
    def setup(self, device):
        # move the model to the correct device
        # keep track of the device for moving data accordingly
        self.model = Linear().to(device)
        self.device = device

    def decode_request(self, request):
        # get the input and create a 1D tensor on the correct device
        content = request["input"]
        return torch.tensor([content], device=self.device)

    def predict(self, x):
        # the model expects a batch dimension, so create it
        return self.model(x[None, :])

    def encode_response(self, output):
        # float will take the output value directly onto CPU memory
        return {"output": float(output)}


if __name__ == "__main__":
    # accelerator="auto" (or "cuda"), devices="auto" (or 4) will lead to 4 workers serving
    # the model from "cuda:0", "cuda:1", "cuda:2", "cuda:3" respectively
    server = ls.LitServer(SimpleTorchAPI(), accelerator="auto", devices="auto")
    server.run(port=8000)
```
The `devices` argument can also be a list specifying which device IDs to run the model on:
```python
import litserve as ls
from litserve.examples import SimpleTorchAPI

server = ls.LitServer(SimpleTorchAPI(), accelerator="cuda", devices=[0, 3])
```
Lastly, you can run multiple copies of the same model on the same device if the model is small. The following will load two copies of the model on each of the 4 GPUs:
```python
import litserve as ls
from litserve.examples import SimpleTorchAPI

server = ls.LitServer(SimpleTorchAPI(), accelerator="cuda", devices=4, workers_per_device=2)
```
Timeouts and disconnections
The server will remove a queued request if the client requesting it disconnects.

You can configure a timeout (in seconds) after which clients will receive a `504` HTTP response (Gateway Timeout) indicating that their request has timed out.
For example, this is how you can configure the server with a timeout of 30 seconds per response.
```python
import litserve as ls
from litserve.examples import SimpleLitAPI

server = ls.LitServer(SimpleLitAPI(), timeout=30)
```
This is useful to avoid requests queuing up beyond the ability of the server to respond.
To disable the timeout for long-running tasks, set `timeout=False` or `timeout=-1`:
```python
import litserve as ls
from litserve.examples import SimpleLitAPI

server = ls.LitServer(SimpleLitAPI(), timeout=False)
```
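On the client side, a timed-out request shows up as that `504` status code. As a minimal sketch (using `requests`; the exact error body may vary), you can check for it and decide whether to retry:

```python
import requests

response = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
if response.status_code == 504:
    # The server gave up on this request after the configured timeout elapsed.
    print("Request timed out on the server; consider retrying with backoff.")
else:
    print(response.json())
```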
Use API key authentication
To secure the API behind an API key, define the `LIT_SERVER_API_KEY` environment variable when starting the server:

```bash
LIT_SERVER_API_KEY=supersecretkey python main.py
```

Clients are expected to authenticate by sending the same API key in the `X-API-Key` HTTP header.
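For example, a client could authenticate like this (a minimal sketch with `requests`, reusing the hello world `/predict` endpoint and the key from the command above):

```python
import requests

# The value must match the LIT_SERVER_API_KEY used to start the server.
response = requests.post(
    "http://127.0.0.1:8000/predict",
    json={"input": 4.0},
    headers={"X-API-Key": "supersecretkey"},
)
print(response.status_code, response.json())
```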
Dynamic batching
LitServe can combine individual requests into a batch to improve throughput.
To enable batching, you need to set the `max_batch_size` argument to match the batch size that your model can handle, and implement `LitAPI.predict` to process batched inputs:
```python
import numpy as np
import litserve as ls


class SimpleBatchedAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x: x ** 2

    def decode_request(self, request):
        return np.asarray(request["input"])

    def predict(self, x):
        result = self.model(x)
        return result

    def encode_response(self, output):
        return {"output": output}


if __name__ == "__main__":
    api = SimpleBatchedAPI()
    server = ls.LitServer(api, max_batch_size=4, batch_timeout=0.05)
    server.run(port=8000)
```
You can control the wait time to aggregate requests into a batch with the `batch_timeout` argument. In the above example, the server will wait up to 0.05 seconds to combine up to 4 requests together.
LitServe automatically stacks NumPy arrays and PyTorch tensors along the batch dimension before calling the `LitAPI.predict` method, and splits the output across requests afterward. You can customize this behavior by overriding the `LitAPI.batch` and `LitAPI.unbatch` methods to handle different data types:
```python
import litserve as ls
from litserve.examples import SimpleBatchedAPI
import numpy as np


class CustomBatchedAPI(SimpleBatchedAPI):
    def batch(self, inputs):
        return np.stack(inputs)

    def unbatch(self, output):
        return list(output)


if __name__ == "__main__":
    api = CustomBatchedAPI()
    server = ls.LitServer(api, max_batch_size=4, batch_timeout=0.05)
    server.run(port=8000)
```
Stream long responses
LitServe can stream outputs from the model in real-time, such as returning text one word at a time from a language model.
To enable streaming, you need to set `LitServer(..., stream=True)` and implement `LitAPI.predict` and `LitAPI.encode_response` as generators (Python functions that yield output).
For example, streaming long responses generated over time:
```python
import litserve as ls


class SimpleStreamAPI(ls.LitAPI):
    def setup(self, device) -> None:
        self.model = lambda x, y: x * y

    def decode_request(self, request):
        return request["input"]

    def predict(self, x):
        for i in range(10):
            yield self.model(x, i)

    def encode_response(self, output):
        for out in output:
            yield {"output": out}


if __name__ == "__main__":
    api = SimpleStreamAPI()
    server = ls.LitServer(api, stream=True)
    server.run(port=8000)
```
To consume a streaming response, your client should iterate through the response content as follows; otherwise, the response content will arrive concatenated as a single byte string.
```python
import requests

url = "http://127.0.0.1:8000/predict"
resp = requests.post(url, json={"input": 1.0}, headers=None, stream=True)
for line in resp.iter_content(5000):
    if line:
        print(line.decode("utf-8"))
```
Automatic schema validation
Define the request and response as Pydantic models to automatically validate the request:
```python
from pydantic import BaseModel
import litserve as ls


class PredictRequest(BaseModel):
    input: float


class PredictResponse(BaseModel):
    output: float


class SimpleLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = lambda x: x**2

    def decode_request(self, request: PredictRequest) -> float:
        return request.input

    def predict(self, x):
        return self.model(x)

    def encode_response(self, output: float) -> PredictResponse:
        return PredictResponse(output=output)


if __name__ == "__main__":
    api = SimpleLitAPI()
    server = ls.LitServer(api, accelerator="auto")
    server.run(port=8000)
```
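As a sketch of what this buys you (assuming the server above is running locally): a payload that matches `PredictRequest` is processed normally, while one that can't be validated is rejected before it reaches `predict`. Since LitServe is built on FastAPI, an invalid payload typically results in a `422` validation error:

```python
import requests

# Valid payload: matches the PredictRequest schema.
ok = requests.post("http://127.0.0.1:8000/predict", json={"input": 4.0})
print(ok.status_code, ok.json())  # e.g. 200 {'output': 16.0}

# Invalid payload: "input" can't be parsed as a float, so validation rejects it.
bad = requests.post("http://127.0.0.1:8000/predict", json={"input": "not a number"})
print(bad.status_code)  # typically 422 (Unprocessable Entity)
```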
OpenAI-compatible API
You can serve LLMs behind an OpenAI-compatible API endpoint using LitServe's `OpenAISpec` by providing the `spec` argument to `LitServer`:
```python
from transformers import pipeline
import litserve as ls


class GPT2LitAPI(ls.LitAPI):
    def setup(self, device):
        self.generator = pipeline('text-generation', model='gpt2', device=device)

    def predict(self, prompt):
        out = self.generator(prompt)
        yield out[0]["generated_text"]


if __name__ == '__main__':
    api = GPT2LitAPI()
    server = ls.LitServer(api, accelerator='auto', spec=ls.OpenAISpec())
    server.run(port=8000)
```
By default, LitServe will use the `decode_request` and `encode_response` provided by `OpenAISpec`, so you don't need to provide them in `LitAPI`. In this case, `predict` is expected to take an input with the following shape:

```python
[{"role": "system", "content": "You are a helpful assistant."},
 {"role": "user", "content": "Hello there"},
 {"role": "assistant", "content": "Hello, how can I help?"},
 {"role": "user", "content": "What is the capital of Australia?"}]
```
and produce an output with one of the following shapes:

```python
"Canberra"
{"content": "Canberra"}
[{"content": "Canberra"}]
{"role": "assistant", "content": "Canberra"}
[{"role": "assistant", "content": "Canberra"}]
```
The above server can be queried with any OpenAI-compatible client, or directly over HTTP:
```python
import requests

response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json={
    "model": "my-gpt2",
    "stream": False,  # You can stream a chunked response by setting this to True
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": "Hello!"
        }
    ]
})
```
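Because the endpoint follows the OpenAI chat-completions specification, the official `openai` Python package should also work by pointing its `base_url` at the server. A minimal sketch, assuming `openai>=1.0` is installed (the API key and model name here are placeholders):

```python
from openai import OpenAI

# Point the OpenAI client at the local LitServe endpoint instead of api.openai.com.
client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="lit")

completion = client.chat.completions.create(
    model="my-gpt2",  # placeholder; the example LitAPI above doesn't use the model name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(completion.choices[0].message.content)
```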
You can also customize the behavior of `decode_request` and `encode_response` by overriding them in `LitAPI`. In this case:

- `decode_request` takes a `litserve.specs.openai.ChatCompletionRequest` as input
- `encode_response` yields a `litserve.specs.openai.ChatMessage`
See the OpenAI Pydantic models for reference.
Here is an example of overriding `encode_response` in `LitAPI`:
```python
import litserve as ls
from litserve.specs.openai import ChatMessage


class GPT2LitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = None

    def predict(self, x):
        yield {"role": "assistant", "content": "This is a generated output"}

    def encode_response(self, output: dict) -> ChatMessage:
        yield ChatMessage(role="assistant", content="This is a custom encoded output")


if __name__ == "__main__":
    server = ls.LitServer(GPT2LitAPI(), spec=ls.OpenAISpec())
    server.run(port=8000)
```
LitServe's `OpenAISpec` can also handle images in the input. Here is an example:
```python
import litserve as ls
from litserve.specs.openai import ChatMessage


class OpenAISpecLitAPI(ls.LitAPI):
    def setup(self, device):
        self.model = None

    def predict(self, x):
        if isinstance(x[1]["content"], list):
            # handle mixed text and image input
            image_url = x[1]["content"][1]["image_url"]
            yield {"role": "assistant", "content": "\n\nThis image shows a wooden boardwalk extending through a lush green marshland."}
        else:
            yield {"role": "assistant", "content": "This is a generated output"}

    def encode_response(self, output: dict) -> ChatMessage:
        yield ChatMessage(role="assistant", content="This is a custom encoded output")


if __name__ == "__main__":
    server = ls.LitServer(OpenAISpecLitAPI(), spec=ls.OpenAISpec())
    server.run(port=8000)
```
When using `OpenAISpec`, `predict` is expected to take an input with one of the following shapes:
Text input example:

```python
[{"role": "system", "content": "You are a helpful assistant."},
 {"role": "user", "content": "Hello there"},
 {"role": "assistant", "content": "Hello, how can I help?"},
 {"role": "user", "content": "What is the capital of Australia?"}]
```

Mixed text and image input example:

```python
[{"role": "system", "content": "You are a helpful assistant."},
 {"role": "user", "content": [
     {"type": "text", "text": "What's in this image?"},
     {"type": "image_url",
      "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg"},
 ]},
 {"role": "assistant", "content": "This image shows a wooden boardwalk extending through a lush green marshland."},
 {"role": "user", "content": "What is the weather like in the image?"}]
```
The above server can be queried with any OpenAI-compatible client, or directly over HTTP:
```python
import requests

response = requests.post("http://127.0.0.1:8000/v1/chat/completions", json={
    "model": "lit",
    "stream": False,  # You can stream a chunked response by setting this to True
    "messages": [
        {
            "role": "system",
            "content": "You are a helpful assistant."
        },
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {
                    "type": "image_url",
                    "image_url": "https://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Gfp-wisconsin-madison-the-nature-boardwalk.jpg/2560px-Gfp-wisconsin-madison-the-nature-boardwalk.jpg",
                },
            ]
        }
    ]
})
```
LitServe is a community project accepting contributions. Let's make it the world's most advanced AI inference engine.
Run tests

Use `pytest` to run tests locally.

First, install test dependencies:

```bash
pip install -r _requirements/test.txt
```

Run the tests:

```bash
pytest tests
```
LitServe is released under the Apache 2.0 license. See the LICENSE file for details.