User reported that terminating 70B models served with vLLM would stall for 20 minutes. The SkyPilot YAML used:
service:
  readiness_probe: /v1/models

resources:
  # Can change to use more GPUs via `--gpus A100:N`. N can be 1 to 8.
  accelerators: A100:2
  cpus: 22
  memory: 500
  # Note: big models need LOTS of disk space, especially if saved in float32,
  # so specify a large disk.
  disk_size: 400
  # Keep fixed.
  cloud: kubernetes
  ports: 8000
  image_id: docker:vllm/vllm-openai:latest

envs:
  # Specify the model to serve via `--env MODEL=<>`.
  MODEL: ""

setup: |
  conda deactivate
  python3 -c "import huggingface_hub; huggingface_hub.login('${HUGGINGFACE_TOKEN}')"

run: |
  conda deactivate
  python3 -u -m vllm.entrypoints.openai.api_server \
    --host 0.0.0.0 \
    --port 8000 \
    --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
    --trust-remote-code \
    --model $MODEL
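For reference, once the service is up, the readiness path above can be queried directly. A minimal sketch (the endpoint address is a placeholder, not taken from the report):

```python
import requests

# Placeholder: replace with the actual host/port of the launched service.
ENDPOINT = "http://127.0.0.1:8000"

# Same path as the readiness_probe above; vLLM's OpenAI-compatible server
# lists the model(s) it is serving at /v1/models.
resp = requests.get(f"{ENDPOINT}/v1/models", timeout=10)
resp.raise_for_status()
print(resp.json())
```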
From the kubelet:
error killing pod: [failed to "KillContainer" for "ray-node" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "57e4f054-de56-4a2e-ad68-bd1d786fb02a" with KillPodSandboxError: "rpc error: code = Unknown desc = failed to stop container \"b0da7a28782608dd40df232ac4cb8d75b11a5fe64eeb45745a6dfa6bceee87b7\": failed to kill container \"b0da7a28782608dd40df232ac4cb8d75b11a5fe64eeb45745a6dfa6bceee87b7\": context deadline exceeded: unknown"]
kubectl delete pod NAME --grace-period=0 --force fixes it. I've seen this issue before when running training on Kubernetes outside of SkyPilot, and IIRC it is related to erring processes leaving file descriptors open, which the kubelet keeps waiting on to be closed.
We should probably have this --grace-period=0 --force logic in our pod termination, around sky/provision/kubernetes/instance.py lines 616 to 617 (at 465d36c):
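For illustration, a minimal sketch of what the force delete could look like with the kubernetes Python client; the pod name and namespace are placeholders, and the actual termination code in instance.py may be structured differently:

```python
from kubernetes import client, config

config.load_kube_config()  # use load_incluster_config() when running in-cluster
core_api = client.CoreV1Api()

# Equivalent of `kubectl delete pod NAME --grace-period=0 --force`:
# a zero grace period tells the API server / kubelet not to wait for the
# container runtime to confirm the kill before removing the pod object.
core_api.delete_namespaced_pod(
    name="POD_NAME",       # placeholder
    namespace="default",   # placeholder
    grace_period_seconds=0,
    body=client.V1DeleteOptions(grace_period_seconds=0),
)
```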