[tune] Increase the minimum number of allowed pending trials for faster auto-scaleup (ray-project#43455)

This PR bumps up the minimum number of allowed pending trials from 16 to 200. This increases the speed of autoscaling for a Tune job that starts with a small cluster.

---------

Signed-off-by: Justin Yu <[email protected]>
justinvyu authored Feb 28, 2024
1 parent 783da64 commit eda7d7e
Showing 2 changed files with 8 additions and 8 deletions.
2 changes: 1 addition & 1 deletion doc/source/tune/api/env.rst
@@ -45,7 +45,7 @@ These are the environment variables Ray Tune currently considers:
 * **TUNE_MAX_LEN_IDENTIFIER**: Maximum length of trial subdirectory names (those
   with the parameter values in them)
 * **TUNE_MAX_PENDING_TRIALS_PG**: Maximum number of pending trials when placement groups are used. Defaults
-  to ``auto``, which will be updated to ``max(16, cluster_cpus * 1.1)`` for random/grid search and ``1``
+  to ``auto``, which will be updated to ``max(200, cluster_cpus * 1.1)`` for random/grid search and ``1``
   for any other search algorithms.
 * **TUNE_NODE_SYNCING_MIN_ITER_THRESHOLD**: When syncing trial data between nodes, only sync if this many
   iterations were recorded for the trial or the minimum time threshold was met. This will prevent unnecessary
14 changes: 7 additions & 7 deletions python/ray/tune/execution/tune_controller.py
@@ -2172,14 +2172,14 @@ def _get_max_pending_trials(search_alg: SearchAlgorithm) -> int:
     if not isinstance(search_alg, BasicVariantGenerator):
         return 1

-    # Use a minimum of 16 to trigger fast autoscaling
-    # Scale up to at most the number of available cluster CPUs
+    # Allow up to at least 200 pending trials to trigger fast autoscaling
+    min_autoscaling_rate = 200
+
+    # Allow more pending trials for larger clusters (based on number of CPUs)
     cluster_cpus = ray.cluster_resources().get("CPU", 1.0)
-    max_pending_trials = min(
-        max(search_alg.total_samples, 16), max(16, int(cluster_cpus * 1.1))
-    )
+    max_pending_trials = max(min_autoscaling_rate, int(cluster_cpus * 1.1))

-    if max_pending_trials > 128:
+    if max_pending_trials > min_autoscaling_rate:
         logger.warning(
             f"The maximum number of pending trials has been "
             f"automatically set to the number of available "
@@ -2189,7 +2189,7 @@ def _get_max_pending_trials(search_alg: SearchAlgorithm) -> int:
             f"of trials, this could lead to scheduling overhead. "
             f"In this case, consider setting the "
             f"`TUNE_MAX_PENDING_TRIALS_PG` environment variable "
-            f"to the desired maximum number of concurrent trials."
+            f"to the desired maximum number of concurrent pending trials."
         )

     return max_pending_trials
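
To make the effect of the change concrete, here is a minimal sketch that restates the removed and added formulas as plain functions, with no Ray dependency. The function names `old_auto_limit`, `new_auto_limit`, and `resolve_limit` are hypothetical and not part of Ray. `resolve_limit` also assumes that a non-``auto`` value of `TUNE_MAX_PENDING_TRIALS_PG` is parsed as an integer and used directly; the diff does not show that code path.

```python
def old_auto_limit(total_samples: int, cluster_cpus: float) -> int:
    """Formula from the removed lines: floor of 16, capped by total samples."""
    return min(max(total_samples, 16), max(16, int(cluster_cpus * 1.1)))


def new_auto_limit(cluster_cpus: float, floor: int = 200) -> int:
    """Formula from the added lines: floor of 200, no cap by total samples."""
    return max(floor, int(cluster_cpus * 1.1))


def resolve_limit(env_value: str, is_random_or_grid_search: bool,
                  cluster_cpus: float) -> int:
    """Hypothetical helper mirroring the documented behavior of the setting.

    Assumption: any value other than "auto" is an integer override.
    """
    if env_value != "auto":
        return int(env_value)
    if not is_random_or_grid_search:
        # Documented default for any other search algorithm.
        return 1
    return new_auto_limit(cluster_cpus)


if __name__ == "__main__":
    # A Tune job with 1000 samples starting on a small 8-CPU cluster.
    print(old_auto_limit(total_samples=1000, cluster_cpus=8))  # 16
    print(new_auto_limit(cluster_cpus=8))                      # 200
```

On a small starting cluster the old formula allowed only 16 pending trials, while the new one allows 200, which is the faster scale-up the commit message describes. Because the new formula no longer takes `total_samples` into account, a job with fewer than 200 samples still gets a limit of at least 200.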
