Currently, replicas scale based on QPS. It would also be useful to be able to scale based on response time. This would help keep response time under a maximum when the system is overloaded by long-running queries rather than by a high number of queries.
More generally, it would be useful if we could plug in our own scaling logic so we can tailor it to our needs (for example, targeting prompt tokens per second).
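To make the request concrete, here is a rough sketch of what a pluggable scaling policy could look like. Everything in it is hypothetical: `Metrics`, `ScalingPolicy`, `QpsPolicy`, `LatencyPolicy`, and `decide` are illustrative names, not part of the project's current API. The latency policy uses a simple proportional rule (current replicas times observed latency over target latency), which is an assumption about how response-time scaling might work, not a description of any existing behavior.

```python
import math
from dataclasses import dataclass
from typing import Protocol, Sequence


@dataclass
class Metrics:
    """Hypothetical snapshot of what the autoscaler observes."""
    qps: float
    p95_latency_s: float
    current_replicas: int


class ScalingPolicy(Protocol):
    """Hypothetical plug-in interface: map metrics to a replica count."""
    def desired_replicas(self, m: Metrics) -> int: ...


@dataclass
class QpsPolicy:
    target_qps_per_replica: float

    def desired_replicas(self, m: Metrics) -> int:
        # Enough replicas that each one serves at most the target QPS.
        return max(1, math.ceil(m.qps / self.target_qps_per_replica))


@dataclass
class LatencyPolicy:
    target_latency_s: float

    def desired_replicas(self, m: Metrics) -> int:
        # Proportional rule: grow the replica count by the factor by which
        # observed latency exceeds the target (and shrink it when below).
        ratio = m.p95_latency_s / self.target_latency_s
        return max(1, math.ceil(m.current_replicas * ratio))


def decide(policies: Sequence[ScalingPolicy], m: Metrics,
           min_replicas: int = 1, max_replicas: int = 10) -> int:
    # Take the most demanding policy, then clamp to the allowed range.
    want = max(p.desired_replicas(m) for p in policies)
    return min(max_replicas, max(min_replicas, want))


if __name__ == "__main__":
    m = Metrics(qps=10.0, p95_latency_s=3.0, current_replicas=2)
    print(decide([QpsPolicy(5.0), LatencyPolicy(1.5)], m))
```

With an interface along these lines, a custom metric such as prompt tokens per second would just be another class implementing `desired_replicas`.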