llm-serve model registry

This is where all the stochastic parrots of llm-serve live. Each model is defined by a YAML configuration file in this directory.

Modify an existing model

To modify an existing model, edit the YAML file for that model. Each config file consists of three sections:

  • deployment_config
  • model_config
  • scaling_config

The quickest way to see how these sections are used is to look at the configs of existing models in this directory.
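
As a rough illustration of the three-section layout, the sketch below checks that a parsed config contains all three top-level sections. The helper `missing_sections` is hypothetical and not part of llm-serve; the config is assumed to be a plain dict, such as the result of parsing one of the YAML files in this directory.

```python
# Hypothetical helper, not part of llm-serve: reports which of the three
# top-level sections are absent from a parsed config.
REQUIRED_SECTIONS = ("deployment_config", "model_config", "scaling_config")


def missing_sections(config: dict) -> list[str]:
    """Return the required top-level sections absent from a parsed config."""
    return [name for name in REQUIRED_SECTIONS if name not in config]


# A parsed config is assumed to be a plain dict (for example, the output of
# a YAML parser run on one of the files in this directory).
example = {
    "deployment_config": {"autoscaling_config": {"min_replicas": 1}},
    "model_config": {"model_id": "mosaicml/mpt-7b-instruct"},
}
print(missing_sections(example))  # ['scaling_config']
```

A check like this could run before deployment to catch a section that was deleted or misspelled while editing.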

In brief, the deployment_config section corresponds to the Ray Serve configuration. It specifies how to autoscale the model (via autoscaling_config) and which options to pass to the Ray actors of the deployment (via ray_actor_options).

The model_config section specifies the Hugging Face model ID (model_id), how to initialize it (initialization) and what parameters to use when generating tokens with an LLM (generation). We use Hugging Face Transformers under the hood. llm-serve implements several types of initializer:

  • SingleDevice - just load the model onto a single GPU.
  • DeviceMap - use the device_map argument to load the model onto multiple GPUs on a single node.
  • llamaCpp - use llama-cpp-python to load the model. llama.cpp is separate from Torch & Hugging Face Transformers and uses its own model format. The model files are still downloaded from Hugging Face Hub - specify model_filename to control which file in the repository will be loaded.
  • vLLM - use vLLM to load the model. vLLM is a fast and easy-to-use library for LLM inference and serving. The model files are still downloaded from Hugging Face Hub.

Finally, the scaling_config section specifies what resources should be used to serve the model. It corresponds to Ray AIR ScalingConfig. Notably, resources_per_worker sets Ray custom resources that force a model onto specific node types; the matching resources are declared in the node definitions.

If you need to learn more about a specific configuration option, or need to add a new one, please reach out to the team.

Adding a new model

To add an entirely new model to the zoo, you will need to create a new YAML file. This file should follow the naming convention <organisation-name>--<model-name>-<model-parameters>-<extra-info>.yaml.
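
A minimal sketch of the naming convention, assuming the organisation and model name are taken directly from the Hugging Face model ID with the "/" replaced by "--". The function `config_filename` is hypothetical and only illustrates the pattern; it does not add the optional `<extra-info>` suffix.

```python
def config_filename(model_id: str) -> str:
    """Map a Hugging Face model ID ("org/name") to a config file name.

    Assumption: the file name is the model ID with "/" replaced by "--"
    plus a ".yaml" extension, as in the mosaicml/mpt-7b-instruct example.
    """
    organisation, sep, model_name = model_id.partition("/")
    if not sep or not organisation or not model_name:
        raise ValueError(f"expected '<organisation>/<model>', got {model_id!r}")
    return f"{organisation}--{model_name}.yaml"


print(config_filename("mosaicml/mpt-7b-instruct"))  # mosaicml--mpt-7b-instruct.yaml
```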

For instance, the YAML example shown below is stored in a file called mosaicml--mpt-7b-instruct.yaml:

deployment_config:
  # This corresponds to Ray Serve settings, as generated with
  # `serve build`.
  autoscaling_config:
    min_replicas: 1
    initial_replicas: 1
    max_replicas: 8
    target_num_ongoing_requests_per_replica: 1.0
    metrics_interval_s: 10.0
    look_back_period_s: 30.0
    smoothing_factor: 1.0
    downscale_delay_s: 300.0
    upscale_delay_s: 90.0
  ray_actor_options:
    resources:
      instance_type_m5: 0.01

model_config:
  # Hugging Face model id
  model_id: mosaicml/mpt-7b-instruct
  initialization:
    # Optional runtime environment configuration,
    # e.g. to add dependent libraries.
    runtime_env:
    # Optional configuration for loading the model from S3 instead of
    # Hugging Face Hub. You can use this to speed up downloads.
    s3_mirror_config:
      bucket_uri: s3://large-dl-models-mirror/models--mosaicml--mpt-7b-instruct/main-safetensors/
    # How to initialize the model.
    initializer:
      # Initializer type. Can be one of:
      # - SingleDevice - just load the model onto a single GPU
      # - DeviceMap - use the `device_map` argument to load the model onto multiple
      #   GPUs on a single node
      # - DeepSpeed - use DeepSpeed to load the model onto multiple GPUs on a single
      #   or multiple nodes and run the model in tensor parallel mode (`deepspeed.init_inference`)
      type: SingleDevice
      # dtype to use when loading the model
      dtype: bfloat16
      # kwargs to pass to `AutoModel.from_pretrained`
      from_pretrained_kwargs:
        trust_remote_code: true
        use_cache: true
      # Whether to use Hugging Face Optimum BetterTransformer to inject flash attention
      # (may not work with all models)
      use_bettertransformer: false
      # Whether to use Torch 2.0 `torch.compile` to compile the model
      torch_compile:
        backend: inductor
        mode: max-autotune
    # llm-serve pipeline class. This is separate from Hugging Face pipelines.
    # Leave as default for now.
    pipeline: default
  generation:
    # Max batch size to use when generating tokens
    max_batch_size: 22
    # Kwargs passed to `model.generate`
    generate_kwargs:
      do_sample: true
      max_new_tokens: 512
      min_new_tokens: 16
      top_p: 1.0
      top_k: 0
      temperature: 0.1
      repetition_penalty: 1.1
    # Prompt format to wrap queries in. Must be empty or contain `{instruction}`.
    prompt_format: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{instruction}\n### Response:\n"
    # Stopping sequences. Generation stops when it encounters any of these sequences, or the tokenizer EOS token.
    # These can be strings, integers (token ids) or lists of integers.
    stopping_sequences: ["### Response:", "### End"]

# Resources assigned to the model. This corresponds to Ray AIR ScalingConfig.
scaling_config:
  # DeepSpeed requires one worker per GPU - keep num_gpus_per_worker at 1 and
  # change num_workers.
  # For other initializers, you should set num_workers to 1 and instead change
  # num_gpus_per_worker.
  num_workers: 1
  num_gpus_per_worker: 1
  num_cpus_per_worker: 4
  resources_per_worker:
    # You can use custom resources to specify the instance type / accelerator type
    # to use for the model.
    instance_type_g5: 0.01
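
To make the prompt_format and stopping_sequences settings in the config above more concrete, here is a minimal sketch of how they might be applied. Both functions are hypothetical and not llm-serve's implementation: the real pipeline presumably stops during generation and also accepts token ids, while this sketch only truncates finished text at string sequences.

```python
PROMPT_FORMAT = (
    "Below is an instruction that describes a task. Write a response that "
    "appropriately completes the request.\n"
    "### Instruction:\n{instruction}\n### Response:\n"
)
STOPPING_SEQUENCES = ["### Response:", "### End"]


def build_prompt(instruction: str, prompt_format: str = PROMPT_FORMAT) -> str:
    """Wrap a query in the prompt format; an empty format passes it through."""
    if not prompt_format:
        return instruction
    if "{instruction}" not in prompt_format:
        raise ValueError("prompt_format must be empty or contain {instruction}")
    # str.replace is used here so other braces in the format are left alone.
    return prompt_format.replace("{instruction}", instruction)


def truncate_at_stop(text: str, stopping_sequences=STOPPING_SEQUENCES) -> str:
    """Cut generated text at the earliest string stopping sequence, if any."""
    cut = len(text)
    for sequence in stopping_sequences:
        index = text.find(sequence)
        if index != -1:
            cut = min(cut, index)
    return text[:cut]


prompt = build_prompt("Summarise Ray Serve in one sentence.")
print(prompt.endswith("### Response:\n"))  # True
print(repr(truncate_at_stop("A serving library.\n### End of answer")))
# 'A serving library.\n'
```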

You will notice that the differences between model configs are often small. For instance, the chat variant stored in mosaicml--mpt-7b-chat.yaml differs from the example above in only four values:

...
model_config:
  model_id: mosaicml/mpt-7b-chat
  initialization:
    s3_mirror_config:
      bucket_uri: s3://large-dl-models-mirror/models--mosaicml--mpt-7b-instruct/main-safetensors/
  generation:
    prompt_format: "Below is an instruction that describes a task. Write a response that appropriately completes the request.\n### Instruction:\n{instruction}\n### Response:\n"
    stopping_sequences: ["### Response:", "### End"]
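
When a new config starts as a copy of an existing one, it can help to list exactly which values differ. The sketch below compares two parsed configs recursively. `diff_configs` is a hypothetical helper, not part of llm-serve, and the two dicts are shortened stand-ins for real config files.

```python
def diff_configs(base: dict, other: dict, prefix: str = "") -> dict:
    """Return {dotted.key: (base_value, other_value)} for differing leaves."""
    differences = {}
    for key in sorted(set(base) | set(other)):
        path = f"{prefix}{key}"
        left, right = base.get(key), other.get(key)
        if isinstance(left, dict) and isinstance(right, dict):
            # Recurse into nested sections such as model_config.generation.
            differences.update(diff_configs(left, right, prefix=f"{path}."))
        elif left != right:
            differences[path] = (left, right)
    return differences


# Shortened stand-ins for two parsed config files.
instruct = {"model_config": {"model_id": "mosaicml/mpt-7b-instruct",
                             "generation": {"max_batch_size": 22}}}
chat = {"model_config": {"model_id": "mosaicml/mpt-7b-chat",
                         "generation": {"max_batch_size": 22}}}
print(diff_configs(instruct, chat))
# {'model_config.model_id': ('mosaicml/mpt-7b-instruct', 'mosaicml/mpt-7b-chat')}
```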