This is where all the stochastic parrots of llm-serve live. Each model is defined by a YAML configuration file in this directory.
To modify an existing model, simply edit the YAML file for that model. Each config file consists of three parts:

- `deployment_config`
- `model_conf`
- `scaling_config`
It is best to look at the existing model configs for examples of how these sections are used in practice.
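For orientation, a config file is laid out roughly like this (a sketch with placeholder values, not a real config - see the full example further below):

```yaml
# Skeleton only - field values are placeholders.
deployment_config:        # Ray Serve settings: autoscaling, actor options
  autoscaling_config: {}
  ray_actor_options: {}
model_conf:               # which model to load, how to initialize it, generation params
  model_id: organisation/model-name
  initialization: {}
  generation: {}
scaling_config:           # resources per worker (Ray AIR ScalingConfig)
  num_workers: 1
  num_gpus_per_worker: 1
```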
To give you a brief overview, the `deployment_config` section corresponds to the Ray Serve configuration and specifies how to auto-scale the model (via `autoscaling_config`) and which options you may need for your Ray Actors during deployments (via `ray_actor_options`).
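For example, if you want a fixed-size deployment instead of autoscaling, one approach (a sketch, not taken from an existing config) is to pin the replica counts to the same value and keep the custom-resource pin for the node type:

```yaml
deployment_config:
  autoscaling_config:
    # Pinning min/initial/max to the same value effectively disables autoscaling.
    min_replicas: 2
    initial_replicas: 2
    max_replicas: 2
  ray_actor_options:
    resources:
      # Custom resource that pins the replicas to a particular node type.
      instance_type_m5: 0.01
```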
The `model_conf` section specifies the Hugging Face model ID (`model_id`), how to initialize it (`initialization`), and what parameters to use when generating tokens with an LLM (`generation`). We use Hugging Face Transformers under the hood.
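Schematically, a `model_conf` section looks like this (values copied from the full example below and trimmed for brevity):

```yaml
model_conf:
  # Hugging Face model ID to download and serve.
  model_id: mosaicml/mpt-7b-instruct
  initialization:
    initializer:
      type: SingleDevice
      dtype: bfloat16
  generation:
    max_batch_size: 22
    generate_kwargs:
      temperature: 0.1
      max_new_tokens: 512
```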
llm-serve implements several types of initializers:

- `SingleDevice` - just load the model onto a single GPU.
- `DeviceMap` - use the `device_map` argument to load the model onto multiple GPUs on a single node.
- `llamaCpp` - use llama-cpp-python to load the model. llama.cpp is separate from Torch and Hugging Face Transformers and uses its own model format. The model files are still downloaded from Hugging Face Hub - specify `model_filename` to control which file in the repository will be loaded (see the sketch after this list).
- `vLLM` - use vLLM to load the model. vLLM is a fast and easy-to-use library for LLM inference and serving. The model files are still downloaded from Hugging Face Hub.
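As an illustration of the llama.cpp case, the `initializer` block might look roughly like the sketch below. The exact spelling of the `type` value and the placement of `model_filename` are assumptions here - check them against an existing llama.cpp-based config; the repository and filename are placeholders.

```yaml
model_conf:
  model_id: some-org/some-gguf-model     # placeholder repository
  initialization:
    initializer:
      # Assumed type value - verify the exact casing against an existing config.
      type: llamaCpp
      # Which file in the Hugging Face repository to download and load.
      model_filename: "model.q4_0.gguf"  # placeholder filename
```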
Finally, the `scaling_config` section specifies what resources should be used to serve the model - this corresponds to the Ray AIR `ScalingConfig`. Notably, we use `resources_per_worker` to set Ray custom resources that force the models onto specific node types - the corresponding resources are set in the node definitions.
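For instance, for a DeepSpeed-style tensor-parallel deployment across two GPUs (one worker per GPU, as the comments in the example below note), the section might look like this sketch:

```yaml
scaling_config:
  # One worker per GPU for tensor-parallel (DeepSpeed) inference.
  num_workers: 2
  num_gpus_per_worker: 1
  num_cpus_per_worker: 4
  resources_per_worker:
    # Custom resource matching the accelerator / instance type set on the nodes.
    instance_type_g5: 0.01
```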
If you need to learn more about a specific configuration option, or need to add a new one, please reach out to the team.
To add an entirely new model to the zoo, you will need to create a new YAML file. This file should follow the naming convention `<organisation-name>--<model-name>-<model-parameters>-<extra-info>.yaml`. For instance, the YAML example shown below is stored in a file called `mosaicml--mpt-7b-instruct.yaml`:
```yaml
deployment_config:
  # This corresponds to Ray Serve settings, as generated with
  # `serve build`.
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_ongoing_requests: 1.0
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 1.0
    downscale_delay_s: 300.0
    upscale_delay_s: 90.0
  ray_actor_options:
    resources:
      instance_type_m5: 0.01
model_conf:
  # Hugging Face model id
  model_id: mosaicml/mpt-7b-instruct
  initialization:
    # Optional runtime environment configuration.
    # Add dependent libraries here.
    runtime_env:
    # Optional configuration for loading the model from S3 instead of
    # Hugging Face Hub. You can use this to speed up downloads.
    s3_mirror_config:
      bucket_uri: s3://large-dl-models-mirror/models--mosaicml--mpt-7b-instruct/main-safetensors/
    # How to initialize the model.
    initializer:
      # Initializer type. Can be one of:
      # - SingleDevice - just load the model onto a single GPU
      # - DeviceMap - use the `device_map` argument to load the model onto multiple
      #   GPUs on a single node
      # - DeepSpeed - use DeepSpeed to load the model onto multiple GPUs on a single
      #   or multiple nodes and run the model in tensor parallel mode (`deepspeed.init_inference`)
      type: SingleDevice
      # dtype to use when loading the model
      dtype: bfloat16
      # kwargs to pass to `AutoModel.from_pretrained`
      from_pretrained_kwargs:
        trust_remote_code: true
        use_cache: true
      # Whether to use Hugging Face Optimum BetterTransformer to inject flash attention
      # (may not work with all models)
      use_bettertransformer: false
      # Whether to use Torch 2.0 `torch.compile` to compile the model
      torch_compile:
        backend: inductor
        mode: max-autotune
    # llm-serve pipeline class. This is separate from Hugging Face pipelines.
    # Leave as default for now.
    pipeline: default
  generation:
    # Max batch size to use when generating tokens
    max_batch_size: 22
    # Kwargs passed to `model.generate`
    generate_kwargs:
      do_sample: true
      max_new_tokens: 512
      min_new_tokens: 16
      top_p: 1.0
      top_k: 0
      temperature: 0.1
      repetition_penalty: 1.1
    # Prompt format to wrap queries in. Must be empty or contain `{instruction}`.
    prompt_format: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{instruction}\n### Response:\n"
    # Stopping sequences. The generation will stop when it encounters any of the
    # sequences, or the tokenizer EOS token.
    # These can be strings, integers (token ids) or lists of integers.
    stopping_sequences: ["### Response:", "### End"]
# Resources assigned to the model. This corresponds to Ray AIR ScalingConfig.
scaling_config:
  # DeepSpeed requires one worker per GPU - keep num_gpus_per_worker at 1 and
  # change num_workers.
  # For other initializers, you should set num_workers to 1 and instead change
  # num_gpus_per_worker.
  num_workers: 1
  num_gpus_per_worker: 1
  num_cpus_per_worker: 4
  resources_per_worker:
    # You can use custom resources to specify the instance type / accelerator type
    # to use for the model.
    instance_type_g5: 0.01
```
You will notice that many of the deviations between models are small. For instance, the chat variant stored in `mosaicml--mpt-7b-chat.yaml` has only four values that differ from the example above:
```yaml
...
model_conf:
  model_id: mosaicml/mpt-7b-chat
  initialization:
    s3_mirror_config:
      bucket_uri: s3://large-dl-models-mirror/models--mosaicml--mpt-7b-instruct/main-safetensors/
  generation:
    prompt_format: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{instruction}\n### Response:\n"
    stopping_sequences: ["### Response:", "### End"]
```