[tune] clean up population based training prototype (ray-project#1478)
* patch up pbt

* Sat Jan 27 01:00:03 PST 2018

* Sat Jan 27 01:04:14 PST 2018

* Sat Jan 27 01:04:21 PST 2018

* Sat Jan 27 01:15:15 PST 2018

* Sat Jan 27 01:15:42 PST 2018

* Sat Jan 27 01:16:14 PST 2018

* Sat Jan 27 01:38:42 PST 2018

* Sat Jan 27 01:39:21 PST 2018

* add pbt

* Sat Jan 27 01:41:19 PST 2018

* Sat Jan 27 01:44:21 PST 2018

* Sat Jan 27 01:45:46 PST 2018

* Sat Jan 27 16:54:42 PST 2018

* Sat Jan 27 16:57:53 PST 2018

* clean up test

* Sat Jan 27 18:01:15 PST 2018

* Sat Jan 27 18:02:54 PST 2018

* Sat Jan 27 18:11:18 PST 2018

* Sat Jan 27 18:11:55 PST 2018

* Sat Jan 27 18:14:09 PST 2018

* review

* try out a ppo example

* some tweaks to ppo example

* add postprocess hook

* Sun Jan 28 15:00:40 PST 2018

* clean up custom explore fn

* Sun Jan 28 15:10:21 PST 2018

* Sun Jan 28 15:14:53 PST 2018

* Sun Jan 28 15:17:04 PST 2018

* Sun Jan 28 15:33:13 PST 2018

* Sun Jan 28 15:56:40 PST 2018

* Sun Jan 28 15:57:36 PST 2018

* Sun Jan 28 16:00:35 PST 2018

* Sun Jan 28 16:02:58 PST 2018

* Sun Jan 28 16:29:50 PST 2018

* Sun Jan 28 16:30:36 PST 2018

* Sun Jan 28 16:31:44 PST 2018

* improve tune doc

* concepts

* update humanoid

* Fri Feb  2 18:03:33 PST 2018

* fix example

* show error file
ericl authored Feb 3, 2018
1 parent a936468 commit b948405
Showing 22 changed files with 702 additions and 292 deletions.
11 changes: 8 additions & 3 deletions .travis.yml
@@ -128,13 +128,18 @@ script:
- python test/multi_node_test.py
- python test/recursion_test.py
- python test/monitor_test.py
- python test/trial_runner_test.py
- python test/trial_scheduler_test.py
- python test/tune_server_test.py
- python test/cython_test.py

# ray dataframe tests
- python -m pytest python/ray/dataframe/test/test_dataframe.py
- python -m pytest python/ray/dataframe/test/test_series.py

# ray tune tests
- python -m pytest python/ray/tune/test/trial_runner_test.py
- python -m pytest python/ray/tune/test/trial_scheduler_test.py
- python -m pytest python/ray/tune/test/tune_server_test.py

# ray rllib tests
- python -m pytest python/ray/rllib/test/test_catalog.py
- python -m pytest python/ray/rllib/test/test_filters.py
- python -m pytest python/ray/rllib/test/test_optimizers.py
Binary file added doc/source/pbt.png
3 changes: 3 additions & 0 deletions doc/source/rllib.rst
@@ -290,6 +290,9 @@ in the ``config`` section of the experiments.
ray.init()
run_experiments(experiment)
For an advanced example of using Population Based Training (PBT) with RLlib,
see the `PPO + PBT Walker2D training example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_ppo_example.py>`__.

Contributing to RLlib
---------------------

29 changes: 25 additions & 4 deletions doc/source/tune.rst
@@ -1,18 +1,30 @@
Ray Tune: Hyperparameter Optimization Framework
===============================================

This document describes Ray Tune, a hyperparameter tuning framework for long-running tasks such as RL and deep learning training. It has the following features:
This document describes Ray Tune, a hyperparameter tuning framework for long-running tasks such as RL and deep learning training. Ray Tune makes it easy to go from running one or more experiments on a single machine to running on a large cluster with efficient search algorithms.

- Early stopping algorithms such as `Median Stopping Rule <https://research.google.com/pubs/pub46180.html>`__ and `HyperBand <https://arxiv.org/abs/1603.06560>`__.
It has the following features:

- Scalable implementations of search algorithms such as `Population Based Training (PBT) <#population-based-training>`__, `Median Stopping Rule <https://research.google.com/pubs/pub46180.html>`__, and `HyperBand <https://arxiv.org/abs/1603.06560>`__.

- Integration with visualization tools such as `TensorBoard <https://www.tensorflow.org/get_started/summaries_and_tensorboard>`__, `rllab's VisKit <https://media.readthedocs.org/pdf/rllab/latest/rllab.pdf>`__, and a `parallel coordinates visualization <https://en.wikipedia.org/wiki/Parallel_coordinates>`__.

- Flexible trial variant generation, including grid search, random search, and conditional parameter distributions.

- Resource-aware scheduling, including support for concurrent runs of algorithms that may themselves be parallel and distributed.


You can find the code for Ray Tune `here on GitHub <https://github.com/ray-project/ray/tree/master/python/ray/tune>`__.

Concepts
--------

Ray Tune schedules a number of *trials* in a cluster. Each trial runs a user-defined Python function or class and is parameterized by a json *config* variation passed to the user code.

Ray Tune provides a ``run_experiments(spec)`` function that generates and runs the trials described by the experiment specification. The trials are scheduled and managed by a *trial scheduler* that implements the search algorithm (default is FIFO).

Ray Tune can be used anywhere Ray can, e.g. on your laptop with ``ray.init()`` embedded in a Python script, or in an `auto-scaling cluster <autoscaling.html>`__ for massive parallelism.
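
As a rough sketch of these concepts (the trainable name, config values, and stopping criterion below are illustrative placeholders, not taken from this commit), a function-based trial can be registered and launched like this::

    import random

    import ray
    from ray.tune import register_trainable, run_experiments

    def my_func(config, reporter):
        # Each call to reporter() produces one result for the trial scheduler.
        reporter(timesteps_total=1,
                 episode_reward_mean=config["alpha"] * random.random())

    register_trainable("my_func", my_func)

    ray.init()
    run_experiments({
        "my_experiment": {
            "run": "my_func",
            "stop": {"training_iteration": 10},
            # Each trial draws its own variation of this config.
            "config": {"alpha": lambda spec: random.uniform(0.1, 1.0)},
        }
    })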

Getting Started
---------------

@@ -133,7 +145,7 @@ To reduce costs, long-running trials can often be early stopped if their initial
An example of this can be found in `hyperband_example.py <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__. The progress of one such HyperBand run is shown below.

Note that some trial schedulers such as HyperBand require your Trainable to support checkpointing, which is described in the next section. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.
Note that some trial schedulers such as HyperBand and PBT require your Trainable to support checkpointing, which is described in the next section. Checkpointing enables the scheduler to multiplex many concurrent trials onto a limited size cluster.

::

@@ -172,10 +184,19 @@ Currently we support the following early stopping algorithms, or you can write y
.. autoclass:: ray.tune.median_stopping_rule.MedianStoppingRule
.. autoclass:: ray.tune.hyperband.HyperBandScheduler

Population Based Training
-------------------------

Ray Tune includes a distributed implementation of `Population Based Training (PBT) <https://deepmind.com/blog/population-based-training-neural-networks>`__. PBT also requires your Trainable to support checkpointing. You can run this `toy PBT example <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/pbt_example.py>`__ to get an idea of how PBT operates. When training in PBT mode, the set of trial variations is treated as the population, so a single trial may see many different hyperparameters over its lifetime, which is recorded in the ``result.json`` file. The following figure generated by the example shows PBT discovering new hyperparams over the course of a single experiment:

.. image:: pbt.png

.. autoclass:: ray.tune.pbt.PopulationBasedTraining
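
For reference, here is a condensed sketch of wiring the scheduler into ``run_experiments``, mirroring the ``pbt_example.py`` added in this commit (the hyperparameter names and values are just the toy ones from that example)::

    import random

    import ray
    from ray.tune import run_experiments
    from ray.tune.pbt import PopulationBasedTraining

    pbt = PopulationBasedTraining(
        time_attr="training_iteration",
        reward_attr="episode_reward_mean",
        perturbation_interval=10,
        hyperparam_mutations={
            # Resample from a continuous distribution...
            "factor_1": lambda config: random.uniform(0.0, 20.0),
            # ...or perturb by choosing from a fixed list of values.
            "factor_2": [1, 2],
        })

    ray.init()
    run_experiments({
        "pbt_test": {
            "run": "my_class",  # a registered, checkpointable Trainable
            "repeat": 10,
            "resources": {"cpu": 1, "gpu": 0},
            "config": {"factor_1": 4.0, "factor_2": 1.0},
        }
    }, scheduler=pbt)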

Trial Checkpointing
-------------------

To enable checkpoint / resume, you must subclass ``Trainable`` and implement its ``_train``, ``_save``, and ``_restore`` abstract methods `(example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__: Implementing this interface is required to support resource multiplexing in schedulers such as HyperBand.
To enable checkpoint / resume, you must subclass ``Trainable`` and implement its ``_train``, ``_save``, and ``_restore`` abstract methods `(example) <https://github.com/ray-project/ray/blob/master/python/ray/tune/examples/hyperband_example.py>`__: Implementing this interface is required to support resource multiplexing in schedulers such as HyperBand and PBT.

.. autoclass:: ray.tune.trainable.Trainable
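
A compressed sketch of that contract, adapted from the ``pbt_example.py`` added in this commit (the state saved here is deliberately trivial)::

    import json
    import os

    from ray.tune import Trainable, TrainingResult

    class MyTrainableClass(Trainable):
        def _setup(self):
            self.timestep = 0

        def _train(self):
            # One unit of work, followed by a progress report.
            self.timestep += 1
            return TrainingResult(
                episode_reward_mean=float(self.timestep), timesteps_this_iter=1)

        def _save(self, checkpoint_dir):
            # Persist enough state for the scheduler to pause/resume the trial.
            path = os.path.join(checkpoint_dir, "checkpoint")
            with open(path, "w") as f:
                f.write(json.dumps({"timestep": self.timestep}))
            return path

        def _restore(self, checkpoint_path):
            with open(checkpoint_path) as f:
                self.timestep = json.loads(f.read())["timestep"]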

6 changes: 3 additions & 3 deletions python/ray/rllib/ppo/ppo_evaluator.py
@@ -76,13 +76,13 @@ def __init__(self, registry, env_creator, config, logdir, is_remote):
# Value function predictions before the policy update.
self.prev_vf_preds = tf.placeholder(tf.float32, shape=(None,))

assert config["sgd_batchsize"] % len(devices) == 0, \
"Batch size must be evenly divisible by devices"
if is_remote:
self.batch_size = config["rollout_batchsize"]
self.per_device_batch_size = config["rollout_batchsize"]
else:
self.batch_size = config["sgd_batchsize"]
self.batch_size = int(
config["sgd_batchsize"] / len(devices)) * len(devices)
assert self.batch_size % len(devices) == 0
self.per_device_batch_size = int(self.batch_size / len(devices))

def build_loss(obs, vtargets, advs, acts, plog, pvf_preds):
Empty file modified python/ray/rllib/test/test_checkpoint_restore.py
100755 → 100644
Empty file.
8 changes: 7 additions & 1 deletion python/ray/tune/examples/hyperband_example.py
@@ -4,6 +4,7 @@
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import random
@@ -49,6 +50,10 @@ def _restore(self, checkpoint_path):
register_trainable("my_class", MyTrainableClass)

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--smoke-test", action="store_true", help="Finish quickly for testing")
args, _ = parser.parse_known_args()
ray.init()

# Hyperband early stopping, configured with `episode_reward_mean` as the
@@ -60,7 +65,8 @@ def _restore(self, checkpoint_path):
run_experiments({
"hyperband_test": {
"run": "my_class",
"repeat": 100,
"stop": {"training_iteration": 1 if args.smoke_test else 99999},
"repeat": 20,
"resources": {"cpu": 1, "gpu": 0},
"config": {
"width": lambda spec: 10 + int(90 * random.random()),
88 changes: 88 additions & 0 deletions python/ray/tune/examples/pbt_example.py
@@ -0,0 +1,88 @@
#!/usr/bin/env python

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import argparse
import json
import os
import random
import time

import ray
from ray.tune import Trainable, TrainingResult, register_trainable, \
run_experiments
from ray.tune.pbt import PopulationBasedTraining


class MyTrainableClass(Trainable):
"""Fake agent whose learning rate is determined by dummy factors."""

def _setup(self):
self.timestep = 0
self.current_value = 0.0

def _train(self):
time.sleep(0.1)

# Reward increase is parabolic as a function of factor_1, with a
# maximum around factor_1=10.0.
self.current_value += max(
0.0, random.gauss(5.0 - (self.config["factor_1"] - 10.0)**2, 2.0))

# Flat increase by factor_2
self.current_value += random.gauss(self.config["factor_2"], 1.0)

# Here we use `episode_reward_mean`, but you can also report other
# objectives such as loss or accuracy (see tune/result.py).
return TrainingResult(
episode_reward_mean=self.current_value, timesteps_this_iter=1)

def _save(self, checkpoint_dir):
path = os.path.join(checkpoint_dir, "checkpoint")
with open(path, "w") as f:
f.write(json.dumps(
{"timestep": self.timestep, "value": self.current_value}))
return path

def _restore(self, checkpoint_path):
with open(checkpoint_path) as f:
data = json.loads(f.read())
self.timestep = data["timestep"]
self.current_value = data["value"]


register_trainable("my_class", MyTrainableClass)

if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
"--smoke-test", action="store_true", help="Finish quickly for testing")
args, _ = parser.parse_known_args()
ray.init()

pbt = PopulationBasedTraining(
time_attr="training_iteration", reward_attr="episode_reward_mean",
perturbation_interval=10,
hyperparam_mutations={
# Allow for scaling-based perturbations, with a uniform backing
# distribution for resampling.
"factor_1": lambda config: random.uniform(0.0, 20.0),
# Only allows resampling from this list as a perturbation.
"factor_2": [1, 2],
})

# Try to find the best factor 1 and factor 2
run_experiments({
"pbt_test": {
"run": "my_class",
"stop": {"training_iteration": 2 if args.smoke_test else 99999},
"repeat": 10,
"resources": {"cpu": 1, "gpu": 0},
"config": {
"factor_1": 4.0,
"factor_2": 1.0,
},
}
}, scheduler=pbt, verbose=False)
71 changes: 71 additions & 0 deletions python/ray/tune/examples/pbt_ppo_example.py
@@ -0,0 +1,71 @@
#!/usr/bin/env python

"""Example of using PBT with RLlib.
Note that this requires a cluster with at least 8 GPUs in order for all trials
to run concurrently, otherwise PBT will round-robin train the trials, which
is less efficient (or you can set {"gpu": 0} to use CPUs for SGD instead).
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import random

import ray
from ray.tune import run_experiments
from ray.tune.pbt import PopulationBasedTraining

if __name__ == "__main__":

# Postprocess the perturbed config to ensure it's still valid
def explore(config):
# ensure we collect enough timesteps to do sgd
if config["timesteps_per_batch"] < config["sgd_batchsize"] * 2:
config["timesteps_per_batch"] = config["sgd_batchsize"] * 2
# ensure we run at least one sgd iter
if config["num_sgd_iter"] < 1:
config["num_sgd_iter"] = 1
return config

pbt = PopulationBasedTraining(
time_attr="time_total_s", reward_attr="episode_reward_mean",
perturbation_interval=120,
resample_probability=0.25,
# Specifies the resampling distributions of these hyperparams
hyperparam_mutations={
"lambda": lambda config: random.uniform(0.9, 1.0),
"clip_param": lambda config: random.uniform(0.01, 0.5),
"sgd_stepsize": lambda config: random.uniform(.00001, .001),
"num_sgd_iter": lambda config: random.randint(1, 30),
"sgd_batchsize": lambda config: random.randint(128, 16384),
"timesteps_per_batch":
lambda config: random.randint(2000, 160000),
},
custom_explore_fn=explore)

ray.init()
run_experiments({
"pbt_humanoid_test": {
"run": "PPO",
"env": "Humanoid-v1",
"repeat": 8,
"resources": {"cpu": 4, "gpu": 1},
"config": {
"kl_coeff": 1.0,
"num_workers": 8,
"devices": ["/gpu:0"],
"model": {"free_log_std": True},
# These params are tuned from their starting value
"lambda": 0.95,
"clip_param": 0.2,
# Start off with several random variations
"sgd_stepsize": lambda spec: random.uniform(.00001, .001),
"num_sgd_iter": lambda spec: random.choice([10, 20, 30]),
"sgd_batchsize": lambda spec: random.choice([128, 512, 2048]),
"timesteps_per_batch":
lambda spec: random.choice([10000, 20000, 40000])
},
},
}, scheduler=pbt)
4 changes: 2 additions & 2 deletions python/ray/tune/examples/tune_mnist_ray.py
@@ -205,7 +205,7 @@ def train(config={'activation': 'relu'}, reporter=None):
if __name__ == '__main__':
parser = argparse.ArgumentParser()
parser.add_argument(
'--fast', action='store_true', help='Finish quickly for testing')
'--smoke-test', action='store_true', help='Finish quickly for testing')
args, _ = parser.parse_known_args()

register_trainable('train_mnist', train)
@@ -220,7 +220,7 @@ def train(config={'activation': 'relu'}, reporter=None):
},
}

if args.fast:
if args.smoke_test:
mnist_spec['stop']['training_iteration'] = 2

ray.init()
2 changes: 1 addition & 1 deletion python/ray/tune/hyperband.py
@@ -207,7 +207,7 @@ def on_trial_error(self, trial_runner, trial):
"""Cleans up trial info from bracket if trial errored early."""
self.on_trial_remove(trial_runner, trial)

def choose_trial_to_run(self, trial_runner, *args):
def choose_trial_to_run(self, trial_runner):
"""Fair scheduling within iteration by completion percentage.
List of trials not used since all trials are tracked as state
1 change: 0 additions & 1 deletion python/ray/tune/logger.py
@@ -63,7 +63,6 @@ def _init(self):
print("TF not installed - cannot log with {}...".format(cls))
continue
self._loggers.append(cls(self.config, self.logdir, self.uri))
print("Unified logger created with logdir '{}'".format(self.logdir))

def on_result(self, result):
for logger in self._loggers:
2 changes: 1 addition & 1 deletion python/ray/tune/median_stopping_rule.py
@@ -31,7 +31,7 @@ class MedianStoppingRule(FIFOScheduler):
"""

def __init__(
self, time_attr='time_total_s', reward_attr='episode_reward_mean',
self, time_attr="time_total_s", reward_attr="episode_reward_mean",
grace_period=60.0, min_samples_required=3, hard_stop=True):
FIFOScheduler.__init__(self)
self._stopped_trials = set()