Hi, I would like to use nos as an autoscaler as well, scaling GPU nodes in and out of the cluster while using MPS. Since nos already watches resource requests and availability, it should be possible to add nodes to the cluster based on the resources requested, which would bring additional cost savings on top of higher GPU utilization.
Is this feature on the roadmap? If not, could someone familiar with nos point me to the best way to implement it within nos?
Cluster Autoscaler supports specifying custom resources in node templates. This feature was not working until recently, when it was fixed.
Cluster Autoscaler can use these custom resources when making scale-up decisions. This all works fine if gpu-partitioner is able to get the pending pod scheduled on the new node. If it can't, the autoscaler deletes the node and adds a new one.
gpu-partitioner does not seem to validate that a node is still present in the cluster when computing the desired partitioning state. As a result, it may partition a node that no longer exists in the cluster, and the pending pod then remains unscheduled forever. gpu-partitioner should clean up the device plugin ConfigMap when it detects that a node no longer exists in the cluster.
Issue #41 also seems to cause problems when deploying and testing nos with the autoscaler.
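To illustrate the ConfigMap cleanup suggested above, here is a minimal sketch of the pruning step only. It is not how nos implements anything: the function name `prune_stale_node_configs` is hypothetical, and it assumes (purely for illustration) that the device plugin ConfigMap data holds one entry per node, keyed by node name. In a real controller the list of live nodes would come from the Kubernetes API and the result would be written back to the ConfigMap.

```python
from typing import Dict, Iterable, List, Tuple


def prune_stale_node_configs(
    config_data: Dict[str, str],
    live_nodes: Iterable[str],
) -> Tuple[Dict[str, str], List[str]]:
    """Drop config entries whose node is no longer in the cluster.

    Assumption (hypothetical layout): config_data maps node name -> config.
    Returns the entries to keep and the sorted names of removed nodes.
    """
    live = set(live_nodes)
    # Keep only entries that belong to nodes still present in the cluster.
    kept = {node: cfg for node, cfg in config_data.items() if node in live}
    # Report what was dropped so the caller can log it or re-plan partitioning.
    removed = sorted(set(config_data) - live)
    return kept, removed


if __name__ == "__main__":
    data = {"gpu-node-a": "partitioning-a", "gpu-node-b": "partitioning-b"}
    kept, removed = prune_stale_node_configs(data, ["gpu-node-a"])
    print(kept, removed)
```

Running this kind of check before computing the desired partitioning state would keep the partitioner from planning against a node the autoscaler has already deleted.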