[Serve] Allow scaling of replicas based on response time #3686

Closed
JGSweets opened this issue Jun 24, 2024 · 3 comments

@JGSweets
Contributor

Currently, replicas scale based on QPS (queries per second). It would also be useful to scale based on response time. This would help keep response time under a set maximum when the system is overloaded by long-running queries rather than by a high number of queries.
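
To make the request concrete, here is a minimal sketch of what a latency-driven scaling decision could look like. It is not an existing Serve API: the function name, its parameters, and the choice of p95 latency as the signal are all assumptions for illustration. The rule is the simple proportional one (scale the replica count by the ratio of observed to target latency, then clamp), which is the same shape as the formula the Kubernetes Horizontal Pod Autoscaler uses for its metrics.

```python
import math


def target_replicas_for_latency(
    current_replicas: int,
    observed_p95_s: float,
    target_p95_s: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Hypothetical helper: pick a replica count from observed latency.

    All names and defaults here are illustrative assumptions, not part
    of any real autoscaler API.
    """
    if current_replicas < 1:
        raise ValueError("current_replicas must be at least 1")
    if target_p95_s <= 0:
        raise ValueError("target_p95_s must be positive")

    # Proportional rule: if latency is 2x the target, ask for 2x replicas.
    ratio = observed_p95_s / target_p95_s
    desired = math.ceil(current_replicas * ratio)

    # Clamp to the configured bounds so a latency spike cannot scale
    # the service without limit.
    return max(min_replicas, min(max_replicas, desired))
```

For example, with 2 replicas, an observed p95 of 3.0 s, and a target of 1.5 s, the sketch asks for 4 replicas. A real implementation would likely also need smoothing or a cooldown window so that a single slow request does not trigger scaling.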

@WesleyYue

More generally, it would be useful if we could plug in our own scaling logic so we can tailor it to our needs (for example, targeting prompt tokens per second).
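
One possible shape for such a plug-in point, sketched under assumptions: a policy is any callable that maps a metrics snapshot to a desired replica count, and the autoscaler only clamps the result. `ServiceMetrics`, `ScalingPolicy`, `tokens_per_second_policy`, and `decide_replicas` are hypothetical names invented for this example and do not correspond to an existing interface.

```python
import math
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class ServiceMetrics:
    """Hypothetical metrics snapshot handed to a user-supplied policy."""
    current_replicas: int
    qps: float
    p95_latency_s: float
    prompt_tokens_per_s: float


# A policy is any callable from metrics to a desired replica count.
ScalingPolicy = Callable[[ServiceMetrics], int]


def tokens_per_second_policy(target_tokens_per_replica: float) -> ScalingPolicy:
    """Build a policy that targets a prompt-token throughput per replica."""
    if target_tokens_per_replica <= 0:
        raise ValueError("target_tokens_per_replica must be positive")

    def policy(metrics: ServiceMetrics) -> int:
        # Enough replicas that each handles at most the target throughput.
        return math.ceil(metrics.prompt_tokens_per_s / target_tokens_per_replica)

    return policy


def decide_replicas(
    policy: ScalingPolicy,
    metrics: ServiceMetrics,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Run the user policy and clamp its answer to the allowed range."""
    return max(min_replicas, min(max_replicas, policy(metrics)))
```

With this shape, the QPS-based and response-time-based behaviors discussed above would just be two more policies, and users could supply their own without changes to the autoscaler itself.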

github-actions bot commented Nov 7, 2024

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the Stale label Nov 7, 2024

This issue was closed because it has been stalled for 10 days with no activity.

@github-actions github-actions bot closed this as not planned Nov 17, 2024