MAX OpenAI API Helm chart

The MAX platform unifies the leading AI development frameworks (TensorFlow, PyTorch, ONNX) and hardware backends to simplify deployment for AI production teams and accelerate innovation for AI developers.

For more information about using this Helm chart, see the tutorial Deploy Llama 3 on GPU-powered Kubernetes clusters.

Homepage: https://www.modular.com/

Source Code

Usage

Installing the chart

To install this chart using Helm 3, run the following command:

helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version <insert-version> \
  --set huggingfaceRepoId=<insert-huggingface-model-id> \
  --set maxServe.maxLength=512 \
  --set maxServe.maxBatchSize=16 \
  --set envSecret.HF_TOKEN=<insert-huggingface-token> \
  --set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
  --wait

The command deploys MAX OpenAI API on the Kubernetes cluster in the default configuration. The Values section below lists the parameters that can be configured during installation.
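
Once the install completes, you can confirm the release is up by checking its status and pods. A minimal sketch, assuming the default release name and the chart labels used in the end-to-end example below:

helm status max-openai-api
kubectl get pods -l "app.kubernetes.io/name=max-openai-api-chart,app.kubernetes.io/instance=max-openai-api"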

Upgrading the chart

To upgrade the chart with the release name max-openai-api:

helm upgrade max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart
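
Note that helm upgrade without any value flags resets --set overrides from install time back to the chart defaults. A sketch of one way to keep them, using Helm's --reuse-values flag:

helm upgrade max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version <insert-version> \
  --reuse-values \
  --wait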

Uninstalling the chart

To uninstall/delete the max-openai-api deployment:

helm delete max-openai-api
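
To verify that the release's resources are gone, one option is to list objects by the instance label the chart applies:

kubectl get all -l "app.kubernetes.io/instance=max-openai-api"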

End-to-end example that provisions a Kubernetes cluster and installs MAX OpenAI API

To provision a Kubernetes cluster via eksctl and then install MAX OpenAI API, run the following commands:

# provision a k8s cluster (takes 10-15 minutes)
eksctl create cluster \
  --name max-openai-api-demo \
  --region us-east-1 \
  --node-type g5.4xlarge \
  --nodes 1

# create a k8s namespace
kubectl create namespace max-openai-api-demo

# deploy MAX OpenAI API via helm chart (takes 10 minutes)
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version <insert-version> \
  --namespace max-openai-api-demo \
  --set huggingfaceRepoId=modularai/Llama-3.1-8B-Instruct-GGUF \
  --set maxServe.maxLength=512 \
  --set maxServe.maxBatchSize=16 \
  --set envSecret.HF_TOKEN=<insert-huggingface-token> \
  --set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
  --timeout 10m0s \
  --wait
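
# optional: watch the pod start while the image pulls and the model weights download
# (the chart's default startup probe allows roughly 5 minutes, 60 failures x 5s period,
# before the pod is restarted)
kubectl get pods --namespace max-openai-api-demo -w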

# forward the remote k8s port to the local network to access the service locally
# the port-forward runs in the background (note the trailing &);
# use kill %1, or fg followed by Ctrl-C, to stop the port forwarding when done
POD_NAME=$(kubectl get pods --namespace max-openai-api-demo -l "app.kubernetes.io/name=max-openai-api-chart,app.kubernetes.io/instance=max-openai-api" -o jsonpath="{.items[0].metadata.name}")
CONTAINER_PORT=$(kubectl get pod --namespace max-openai-api-demo $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace max-openai-api-demo &
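
# optional: confirm the server reports healthy before sending requests
# (/v1/health is the same endpoint the chart's liveness and readiness probes use)
curl http://localhost:8000/v1/health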

# test the service
curl -N http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
        "stream": true,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"}
        ]
    }'
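
# the same endpoint also works without streaming; omitting "stream" returns
# a single JSON response instead of a server-sent event stream
curl http://localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
        "model": "modularai/Llama-3.1-8B-Instruct-GGUF",
        "messages": [
            {"role": "user", "content": "Summarize the 2020 World Series in one sentence."}
        ]
    }'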

# uninstall MAX OpenAI API
helm uninstall max-openai-api --namespace max-openai-api-demo

# Delete the namespace
kubectl delete namespace max-openai-api-demo

# delete the k8s cluster
eksctl delete cluster \
  --name max-openai-api-demo \
  --region us-east-1

Values

Key Type Default Description
affinity object {} Affinity to be added to all deployments
env object {} Environment variables that will be passed into pods
envFromSecret string "{{ template \"max.fullname\" . }}-env" The name of the secret used to populate env vars in deployed pods. This can be useful for secret keys, etc.
envFromSecrets list [] This can be a list of templated strings
envRaw list [] Environment variables in RAW format that will be passed into pods
envSecret object {} Environment variables to pass as secrets
fullnameOverride string nil Provide a name to override the full names of resources
image.pullPolicy string "IfNotPresent"
image.repository string "modular/max-openai-api"
image.tag string "latest"
imagePullSecrets list []
inferenceServer.affinity object {} Affinity to be added to inferenceServer deployment
inferenceServer.args list See values.yaml Arguments to pass to the node entrypoint. If defined, it overrides the default args value set by .Values.maxServe
inferenceServer.autoscaling.enabled bool false
inferenceServer.autoscaling.maxReplicas int 2
inferenceServer.autoscaling.minReplicas int 1
inferenceServer.autoscaling.targetCPUUtilizationPercentage int 80
inferenceServer.containerSecurityContext object {}
inferenceServer.deploymentAnnotations object {} Annotations to be added to inferenceServer deployment
inferenceServer.deploymentLabels object {} Labels to be added to inferenceServer deployment
inferenceServer.env object {}
inferenceServer.extraContainers list [] Launch additional containers into inferenceServer pod
inferenceServer.livenessProbe.failureThreshold int 3
inferenceServer.livenessProbe.httpGet.path string "/v1/health"
inferenceServer.livenessProbe.httpGet.port string "http"
inferenceServer.livenessProbe.initialDelaySeconds int 1
inferenceServer.livenessProbe.periodSeconds int 15
inferenceServer.livenessProbe.successThreshold int 1
inferenceServer.livenessProbe.timeoutSeconds int 1
inferenceServer.nodeSelector object {} NodeSelector to be added to inferenceServer deployment
inferenceServer.podAnnotations object {} Annotations to be added to inferenceServer pods
inferenceServer.podLabels object {} Labels to be added to inferenceServer pods
inferenceServer.podSecurityContext object {}
inferenceServer.readinessProbe.failureThreshold int 3
inferenceServer.readinessProbe.httpGet.path string "/v1/health"
inferenceServer.readinessProbe.httpGet.port string "http"
inferenceServer.readinessProbe.initialDelaySeconds int 1
inferenceServer.readinessProbe.periodSeconds int 15
inferenceServer.readinessProbe.successThreshold int 1
inferenceServer.readinessProbe.timeoutSeconds int 1
inferenceServer.replicaCount int 1
inferenceServer.resources object {} Resource settings for the inferenceServer pods - these settings overwrite existing values from the global resources object defined above.
inferenceServer.startupProbe.failureThreshold int 60
inferenceServer.startupProbe.httpGet.path string "/v1/health"
inferenceServer.startupProbe.httpGet.port string "http"
inferenceServer.startupProbe.initialDelaySeconds int 1
inferenceServer.startupProbe.periodSeconds int 5
inferenceServer.startupProbe.successThreshold int 1
inferenceServer.startupProbe.timeoutSeconds int 1
inferenceServer.strategy object {}
inferenceServer.tolerations list [] Tolerations to be added to inferenceServer deployment
inferenceServer.topologySpreadConstraints list [] TopologySpreadConstraints to be added to inferenceServer deployments
inferenceServer.volumeMounts list [] Volume mounts to be added to the inferenceServer pod
inferenceServer.volumes list [] Volumes to mount into inferenceServer pod
ingress.annotations object {}
ingress.enabled bool false
ingress.extraHostsRaw list []
ingress.hosts list []
ingress.ingressClassName string nil
ingress.path string "/"
ingress.pathType string "ImplementationSpecific"
ingress.tls list []
maxServe object {"cacheStrategy":"continuous","huggingfaceRepoId":"modularai/Llama-3.1-8B-Instruct-GGUF","maxBatchSize":"250","maxLength":"2048","maxNumSteps":"10"} MAX Serve arguments
nameOverride string nil Provide a name to override the name of the chart
nodeSelector object {} NodeSelector to be added to all deployments
resources object {}
runAsUser int 0 User ID directive. This user must have enough permissions to run the bootstrap script. Running containers as root is not recommended in production; change this to another UID, e.g. 1000, to be more secure
service.annotations object {}
service.loadBalancerIP string nil
service.ports[0].name string "http"
service.ports[0].port int 8000
service.ports[0].protocol string "TCP"
service.ports[0].targetPort int 8000
service.type string "ClusterIP"
serviceAccount.annotations object {}
serviceAccount.create bool false Create custom service account for MAX Serving. If create: true and serviceAccountName is not provided, max.fullname will be used.
serviceAccountName string nil Specify service account name to be used
tolerations list [] Tolerations to be added to all deployments
topologySpreadConstraints list [] TopologySpreadConstraints to be added to all deployments
volumeMounts list []
volumes list []
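
Using a values file

Instead of repeating --set flags, the same configuration can be kept in a values file and passed with -f. The sketch below simply mirrors the flags from the installation example above; note that the install command sets huggingfaceRepoId at the top level while the Values table nests it under maxServe, so adjust the placement to match your chart version:

# capture the install-time overrides in a values file
cat > max-values.yaml <<'EOF'
huggingfaceRepoId: <insert-huggingface-model-id>
maxServe:
  maxLength: "512"
  maxBatchSize: "16"
envSecret:
  HF_TOKEN: <insert-huggingface-token>
env:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
EOF

# install using the values file instead of --set flags
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
  --version <insert-version> \
  -f max-values.yaml \
  --wait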