[tune] Increase the minimum number of allowed pending trials for fast…

…er auto-scaleup (ray-project#43455) This PR bumps up the minimum number of allowed pending trials from 16 to 200. This increases the speed of autoscaling for a Tune job that starts with a small cluster. --------- Signed-off-by: Justin Yu <[email protected]>
chwjbn · Feb 28, 2024 · eda7d7e · eda7d7e
1 parent 783da64
commit eda7d7e
Show file tree

Hide file tree

Showing 2 changed files with 8 additions and 8 deletions.
diff --git a/doc/source/tune/api/env.rst b/doc/source/tune/api/env.rst
@@ -45,7 +45,7 @@ These are the environment variables Ray Tune currently considers:
 * **TUNE_MAX_LEN_IDENTIFIER**: Maximum length of trial subdirectory names (those
   with the parameter values in them)
 * **TUNE_MAX_PENDING_TRIALS_PG**: Maximum number of pending trials when placement groups are used. Defaults
-  to ``auto``, which will be updated to ``max(16, cluster_cpus * 1.1)`` for random/grid search and ``1``
+  to ``auto``, which will be updated to ``max(200, cluster_cpus * 1.1)`` for random/grid search and ``1``
   for any other search algorithms.
 * **TUNE_NODE_SYNCING_MIN_ITER_THRESHOLD**: When syncing trial data between nodes, only sync if this many
   iterations were recorded for the trial or the minimum time threshold was met. This will prevent unnecessary

diff --git a/python/ray/tune/execution/tune_controller.py b/python/ray/tune/execution/tune_controller.py
@@ -2172,14 +2172,14 @@ def _get_max_pending_trials(search_alg: SearchAlgorithm) -> int:
     if not isinstance(search_alg, BasicVariantGenerator):
         return 1
 
-    # Use a minimum of 16 to trigger fast autoscaling
-    # Scale up to at most the number of available cluster CPUs
+    # Allow up to at least 200 pending trials to trigger fast autoscaling
+    min_autoscaling_rate = 200
+
+    # Allow more pending trials for larger clusters (based on number of CPUs)
     cluster_cpus = ray.cluster_resources().get("CPU", 1.0)
-    max_pending_trials = min(
-        max(search_alg.total_samples, 16), max(16, int(cluster_cpus * 1.1))
-    )
+    max_pending_trials = max(min_autoscaling_rate, int(cluster_cpus * 1.1))
 
-    if max_pending_trials > 128:
+    if max_pending_trials > min_autoscaling_rate:
         logger.warning(
             f"The maximum number of pending trials has been "
             f"automatically set to the number of available "
@@ -2189,7 +2189,7 @@ def _get_max_pending_trials(search_alg: SearchAlgorithm) -> int:
             f"of trials, this could lead to scheduling overhead. "
             f"In this case, consider setting the "
             f"`TUNE_MAX_PENDING_TRIALS_PG` environment variable "
-            f"to the desired maximum number of concurrent trials."
+            f"to the desired maximum number of concurrent pending trials."
         )
 
     return max_pending_trials