Cluster autoscaling with nos #43

Open
ktzsh opened this issue Jul 18, 2023 · 1 comment

Comments

@ktzsh

ktzsh commented Jul 18, 2023

Hi, I'd like to use nos as an autoscaler as well, scaling GPU nodes in and out of the cluster while using MPS. Since nos already watches resource requests and availability, it should be possible to add nodes to the cluster based on the resources requested, leading to additional cost savings on top of higher GPU utilization.

Is this feature part of the roadmap? If not, could someone familiar with nos suggest the best way to implement this within nos?

@khageshsaini

Cluster Autoscaler has support for specifying custom resources in node templates. This feature wasn't working until it was recently fixed.

These custom resources can be used by Cluster Autoscaler when making scale-up decisions. This all works fine as long as the gpu-partitioner is able to schedule the pending pod on the new node. If it isn't, the autoscaler will delete the node and add a new one.
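For context, on AWS the Cluster Autoscaler learns about extended resources of a not-yet-created node from ASG tags of the form `k8s.io/cluster-autoscaler/node-template/resources/<resource-name>`. A minimal sketch, assuming nos exposes an MPS slice resource named `nvidia.com/gpu-10gb` (the resource name, quantity, and label here are illustrative assumptions, not taken from the nos docs):

```yaml
# ASG tags (AWS example) telling Cluster Autoscaler which extended
# resources a new node in this group will advertise once it joins.
# Resource name and quantity are illustrative assumptions.
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu-10gb: "2"
k8s.io/cluster-autoscaler/node-template/label/nos.nebuly.com/gpu-partitioning: "mps"
```

With tags like these, the autoscaler can consider the node group during scale-up even when it currently has zero nodes.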

The gpu-partitioner does not seem to validate that a node is present in the cluster while computing the desired partitioning state. As a result, it may partition a node that no longer exists in the cluster, and the pending pod remains unscheduled forever. The gpu-partitioner should clean up the device plugin ConfigMap when it detects that a node no longer exists in the cluster.
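The cleanup suggested above can be sketched as a pure function: given the per-node entries of the device plugin ConfigMap and the set of nodes currently in the cluster, drop the entries for nodes that have vanished. The per-node mapping used here is an illustrative assumption, not necessarily nos's actual ConfigMap layout:

```python
def prune_stale_node_configs(plugin_config: dict, live_nodes: set) -> dict:
    """Return a copy of the device plugin config with entries for
    nodes that no longer exist in the cluster removed.

    `plugin_config` is assumed to map node names to their partitioning
    config (an illustrative schema). The input mapping is not mutated,
    so the caller can diff old vs. new before patching the ConfigMap.
    """
    return {node: cfg for node, cfg in plugin_config.items() if node in live_nodes}
```

A controller could call this on every node-delete event (or periodically) and patch the ConfigMap only when the pruned result differs from the current one.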

Issue #41 also seems to cause problems while deploying and testing nos with the autoscaler.
