The MAX platform unifies the leading AI development frameworks (TensorFlow, PyTorch, ONNX) and hardware backends to simplify deployment for AI production teams and accelerate innovation for AI developers.
For more information about using this Helm chart, see the tutorial Deploy Llama 3 on GPU-powered Kubernetes clusters.
Homepage: https://www.modular.com/
To install this chart using Helm 3, run the following command:
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version <insert-version> \
--set huggingfaceRepoId=<insert-huggingface-model-id> \
--set maxServe.maxLength=512 \
--set maxServe.maxBatchSize=16 \
--set envSecret.HF_TOKEN=<insert-huggingface-token> \
--set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
--wait
The command deploys MAX OpenAI API on the Kubernetes cluster in the default configuration. The Values reference section below lists the parameters that can be configured during installation.
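Alternatively, you can keep the same settings in a values file and pass it with --values instead of repeated --set flags. The following is a sketch that mirrors the flags from the command above (max-values.yaml is only an illustrative file name; see the Values reference below for the full parameter list):
cat > max-values.yaml <<'EOF'
huggingfaceRepoId: <insert-huggingface-model-id>
maxServe:
  maxLength: "512"
  maxBatchSize: "16"
envSecret:
  HF_TOKEN: <insert-huggingface-token>
env:
  HF_HUB_ENABLE_HF_TRANSFER: "1"
EOF
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version <insert-version> \
--values max-values.yaml \
--wait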
To upgrade the chart with the release name max-openai-api:
helm upgrade max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart
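For example, here is a minimal sketch of bumping a single parameter on an existing release; --reuse-values keeps the values set at install time, and maxBatchSize=32 is only an illustrative value:
helm upgrade max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version <insert-version> \
--reuse-values \
--set maxServe.maxBatchSize=32 \
--wait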
To uninstall/delete the max-openai-api deployment:
helm delete max-openai-api
To provision a k8s cluster via eksctl and then install MAX OpenAI API, run the following commands:
# provision a k8s cluster (takes 10-15 minutes)
eksctl create cluster \
--name max-openai-api-demo \
--region us-east-1 \
--node-type g5.4xlarge \
--nodes 1
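# (optional) verify the GPU node has joined the cluster and is Ready
kubectl get nodes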
# create a k8s namespace
kubectl create namespace max-openai-api-demo
# deploy MAX OpenAI API via helm chart (takes 10 minutes)
helm install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version <insert-version> \
--namespace max-openai-api-demo \
--set huggingfaceRepoId=modularai/Llama-3.1-8B-Instruct-GGUF \
--set maxServe.maxLength=512 \
--set maxServe.maxBatchSize=16 \
--set envSecret.HF_TOKEN=<insert-huggingface-token> \
--set env.HF_HUB_ENABLE_HF_TRANSFER=1 \
--timeout 10m0s \
--wait
# forward the remote k8s port to the local network to access the service locally
# the port-forward command blocks the current terminal
# use another terminal for the subsequent curl, and press Ctrl-C to stop the port forwarding
POD_NAME=$(kubectl get pods --namespace max-openai-api-demo -l "app.kubernetes.io/name=max-openai-api-chart,app.kubernetes.io/instance=max-openai-api" -o jsonpath="{.items[0].metadata.name}")
CONTAINER_PORT=$(kubectl get pod --namespace max-openai-api-demo $POD_NAME -o jsonpath="{.spec.containers[0].ports[0].containerPort}")
kubectl port-forward $POD_NAME 8000:$CONTAINER_PORT --namespace max-openai-api-demo &
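# (optional) confirm the server is up before sending requests
# /v1/health matches the chart's default probe path; adjust if you changed it
curl http://localhost:8000/v1/health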
# test the service
curl -N http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "modularai/Llama-3.1-8B-Instruct-GGUF",
"stream": true,
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Who won the world series in 2020?"}
]
}'
# uninstall MAX OpenAI API
helm uninstall max-openai-api --namespace max-openai-api-demo
# delete the namespace
kubectl delete namespace max-openai-api-demo
# delete the k8s cluster
eksctl delete cluster \
--name max-openai-api-demo \
--region us-east-1
Key | Type | Default | Description |
---|---|---|---|
affinity | object | {} | Affinity to be added to all deployments |
env | object | {} | Environment variables that will be passed into pods |
envFromSecret | string | "{{ template \"max.fullname\" . }}-env" | The name of the secret used to populate env vars in deployed pods. This can be useful for secret keys, etc. |
envFromSecrets | list | [] | This can be a list of templated strings |
envRaw | list | [] | Environment variables in raw format that will be passed into pods |
envSecret | object | {} | Environment variables to pass as secrets |
fullnameOverride | string | nil | Provide a name to override the full names of resources |
image.pullPolicy | string | "IfNotPresent" | |
image.repository | string | "modular/max-openai-api" | |
image.tag | string | "latest" | |
imagePullSecrets | list | [] | |
inferenceServer.affinity | object | {} | Affinity to be added to inferenceServer deployment |
inferenceServer.args | list | See values.yaml | Arguments to pass to the node entrypoint. If defined, this overwrites the default args value set by .Values.max-serve |
inferenceServer.autoscaling.enabled | bool | false | |
inferenceServer.autoscaling.maxReplicas | int | 2 | |
inferenceServer.autoscaling.minReplicas | int | 1 | |
inferenceServer.autoscaling.targetCPUUtilizationPercentage | int | 80 | |
inferenceServer.containerSecurityContext | object | {} | |
inferenceServer.deploymentAnnotations | object | {} | Annotations to be added to inferenceServer deployment |
inferenceServer.deploymentLabels | object | {} | Labels to be added to inferenceServer deployment |
inferenceServer.env | object | {} | |
inferenceServer.extraContainers | list | [] | Launch additional containers into the inferenceServer pod |
inferenceServer.livenessProbe.failureThreshold | int | 3 | |
inferenceServer.livenessProbe.httpGet.path | string | "/v1/health" | |
inferenceServer.livenessProbe.httpGet.port | string | "http" | |
inferenceServer.livenessProbe.initialDelaySeconds | int | 1 | |
inferenceServer.livenessProbe.periodSeconds | int | 15 | |
inferenceServer.livenessProbe.successThreshold | int | 1 | |
inferenceServer.livenessProbe.timeoutSeconds | int | 1 | |
inferenceServer.nodeSelector | object | {} | NodeSelector to be added to inferenceServer deployment |
inferenceServer.podAnnotations | object | {} | Annotations to be added to inferenceServer pods |
inferenceServer.podLabels | object | {} | Labels to be added to inferenceServer pods |
inferenceServer.podSecurityContext | object | {} | |
inferenceServer.readinessProbe.failureThreshold | int | 3 | |
inferenceServer.readinessProbe.httpGet.path | string | "/v1/health" | |
inferenceServer.readinessProbe.httpGet.port | string | "http" | |
inferenceServer.readinessProbe.initialDelaySeconds | int | 1 | |
inferenceServer.readinessProbe.periodSeconds | int | 15 | |
inferenceServer.readinessProbe.successThreshold | int | 1 | |
inferenceServer.readinessProbe.timeoutSeconds | int | 1 | |
inferenceServer.replicaCount | int | 1 | |
inferenceServer.resources | object | {} | Resource settings for the inferenceServer pods; these settings overwrite existing values from the global resources object defined above |
inferenceServer.startupProbe.failureThreshold | int | 60 | |
inferenceServer.startupProbe.httpGet.path | string | "/v1/health" | |
inferenceServer.startupProbe.httpGet.port | string | "http" | |
inferenceServer.startupProbe.initialDelaySeconds | int | 1 | |
inferenceServer.startupProbe.periodSeconds | int | 5 | |
inferenceServer.startupProbe.successThreshold | int | 1 | |
inferenceServer.startupProbe.timeoutSeconds | int | 1 | |
inferenceServer.strategy | object | {} | |
inferenceServer.tolerations | list | [] | Tolerations to be added to inferenceServer deployment |
inferenceServer.topologySpreadConstraints | list | [] | TopologySpreadConstraints to be added to inferenceServer deployments |
inferenceServer.volumeMounts | list | [] | Volume mounts to be added to the inferenceServer pod |
inferenceServer.volumes | list | [] | Volumes to mount into the inferenceServer pod |
ingress.annotations | object | {} | |
ingress.enabled | bool | false | |
ingress.extraHostsRaw | list | [] | |
ingress.hosts | list | [] | |
ingress.ingressClassName | string | nil | |
ingress.path | string | "/" | |
ingress.pathType | string | "ImplementationSpecific" | |
ingress.tls | list | [] | |
maxServe | object | {"cacheStrategy":"continuous","huggingfaceRepoId":"modularai/Llama-3.1-8B-Instruct-GGUF","maxBatchSize":"250","maxLength":"2048","maxNumSteps":"10"} | MAX Serve arguments |
nameOverride | string | nil | Provide a name to override the name of the chart |
nodeSelector | object | {} | NodeSelector to be added to all deployments |
resources | object | {} | |
runAsUser | int | 0 | User ID directive. This user must have enough permissions to run the bootstrap script. Running containers as root is not recommended in production; change this to another UID, e.g. 1000, to be more secure |
service.annotations | object | {} | |
service.loadBalancerIP | string | nil | |
service.ports[0].name | string | "http" | |
service.ports[0].port | int | 8000 | |
service.ports[0].protocol | string | "TCP" | |
service.ports[0].targetPort | int | 8000 | |
service.type | string | "ClusterIP" | |
serviceAccount.annotations | object | {} | |
serviceAccount.create | bool | false | Create a custom service account for MAX Serving. If create: true and serviceAccountName is not provided, max.fullname will be used. |
serviceAccountName | string | nil | Specify the service account name to be used |
tolerations | list | [] | Tolerations to be added to all deployments |
topologySpreadConstraints | list | [] | TopologySpreadConstraints to be added to all deployments |
volumeMounts | list | [] | |
volumes | list | [] | |
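Any of the parameters above can be overridden at install or upgrade time. The following sketch switches the service to a LoadBalancer (assuming your cluster can provision cloud load balancers) and sets explicit resource requests for the inference server; the resource figures are illustrative only, not recommendations:
helm upgrade --install max-openai-api oci://registry-1.docker.io/modular/max-openai-api-chart \
--version <insert-version> \
--set service.type=LoadBalancer \
--set inferenceServer.resources.requests.cpu=4 \
--set inferenceServer.resources.requests.memory=16Gi \
--wait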