Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to algorithms dir …
Browse files Browse the repository at this point in the history
…and rename policy and trainer classes. (ray-project#25346)" (ray-project#25420)

This reverts commit e4ceae1.

Reverts ray-project#25346

linux://python/ray/tests:test_client_library_integration never failed before this PR.

In the CI of the reverted PR, the same test also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128), so the failure is very likely caused by that PR.

The test output of the failure also appears related (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b).
fishbone authored Jun 3, 2022
1 parent 6589a4f commit fd0f967
Showing 110 changed files with 508 additions and 649 deletions.
6 changes: 3 additions & 3 deletions .buildkite/pipeline.ml.yml
@@ -129,7 +129,7 @@
# Test all tests in the `agents` (soon to be "trainers") dir:
- bazel test --config=ci $(./ci/run/bazel_export_options)
--build_tests_only
- --test_tag_filters=algorithms_dir_generic,-multi_gpu
+ --test_tag_filters=trainers_dir_generic,-multi_gpu
--test_env=RAY_USE_MULTIPROCESSING_CPU_COUNT=1
rllib/...

@@ -141,7 +141,7 @@
# Test all tests in the `agents` (soon to be "trainers") dir:
- bazel test --config=ci $(./ci/run/bazel_export_options)
--build_tests_only
- --test_tag_filters=algorithms_dir,-algorithms_dir_generic,-multi_gpu
+ --test_tag_filters=trainers_dir,-trainers_dir_generic,-multi_gpu
--test_env=RAY_USE_MULTIPROCESSING_CPU_COUNT=1
rllib/...

@@ -154,7 +154,7 @@
# "learning_tests|quick_train|examples|tests_dir".
- bazel test --config=ci $(./ci/run/bazel_export_options)
--build_tests_only
- --test_tag_filters=-learning_tests,-quick_train,-memory_leak_tests,-examples,-tests_dir,-algorithms_dir,-documentation,-multi_gpu
+ --test_tag_filters=-learning_tests,-quick_train,-memory_leak_tests,-examples,-tests_dir,-trainers_dir,-documentation,-multi_gpu
--test_env=RAY_USE_MULTIPROCESSING_CPU_COUNT=1
rllib/...

4 changes: 2 additions & 2 deletions README.rst
@@ -179,7 +179,7 @@ It offers high scalability and unified APIs for a
.. code-block:: python
import gym
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer
# Define your problem using python and openAI's gym API:
@@ -229,7 +229,7 @@ It offers high scalability and unified APIs for a
# Create an RLlib Trainer instance.
- trainer = PPO(
+ trainer = PPOTrainer(
config={
# Env class to use (here: our gym.Env sub-class from above).
"env": SimpleCorridor,
4 changes: 2 additions & 2 deletions doc/source/ray-overview/doc_test/ray_rllib.py
@@ -1,8 +1,8 @@
from ray import tune
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer

tune.run(
- PPO,
+ PPOTrainer,
stop={"episode_len_mean": 20},
config={"env": "CartPole-v0", "framework": "torch", "log_level": "INFO"},
)
6 changes: 3 additions & 3 deletions doc/source/rllib/core-concepts.rst
@@ -25,14 +25,14 @@ Trainers also implement the :ref:`Tune Trainable API <tune-60-seconds>` for easy

You have three ways to interact with a trainer. You can use the basic Python API or the command line to train it, or you
can use Ray Tune to tune hyperparameters of your reinforcement learning algorithm.
- The following example shows three equivalent ways of interacting with the ``PPO`` Trainer,
+ The following example shows three equivalent ways of interacting with the ``PPOTrainer``,
which implements the proximal policy optimization algorithm in RLlib.

.. tabbed:: Basic RLlib Trainer

.. code-block:: python
- trainer = PPO(env="CartPole-v0", config={"train_batch_size": 4000})
+ trainer = PPOTrainer(env="CartPole-v0", config={"train_batch_size": 4000})
while True:
print(trainer.train())
@@ -47,7 +47,7 @@ which implements the proximal policy optimization algorithm in RLlib.
.. code-block:: python
from ray import tune
- tune.run(PPO, config={"env": "CartPole-v0", "train_batch_size": 4000})
+ tune.run(PPOTrainer, config={"env": "CartPole-v0", "train_batch_size": 4000})
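A hedged sketch of the same Tune call with a stop condition and periodic checkpointing added (standard `tune.run` options, assumed here purely for illustration):

    from ray import tune
    from ray.rllib.agents.ppo import PPOTrainer

    # Stop at a mean episode reward of 150; checkpoint every 10 iterations.
    tune.run(
        PPOTrainer,
        config={"env": "CartPole-v0", "train_batch_size": 4000},
        stop={"episode_reward_mean": 150},
        checkpoint_freq=10,
        checkpoint_at_end=True,
    )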
4 changes: 2 additions & 2 deletions doc/source/rllib/doc_code/training.py
@@ -22,9 +22,9 @@
# __query_action_dist_start__
# Get a reference to the policy
import numpy as np
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer

- trainer = PPO(env="CartPole-v0", config={"framework": "tf2", "num_workers": 0})
+ trainer = PPOTrainer(env="CartPole-v0", config={"framework": "tf2", "num_workers": 0})
policy = trainer.get_policy()
# <ray.rllib.policy.eager_tf_policy.PPOTFPolicy_eager object at 0x7fd020165470>
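To show what the policy reference above is typically used for, a hedged sketch of a batched `Policy.compute_actions()` call (observation values are made up; the exact contents of the extra-fetches dict depend on the algorithm):

    import numpy as np

    # Batch of two CartPole observations (4 floats each); values are illustrative.
    obs_batch = np.array(
        [[0.0, 0.1, 0.0, 0.0],
         [0.0, -0.1, 0.0, 0.0]],
        dtype=np.float32,
    )

    # Returns (actions, RNN state-outs, extra fetches).
    actions, state_outs, extra_fetches = policy.compute_actions(obs_batch)
    print(actions)        # e.g. array([0, 1])
    print(extra_fetches)  # for PPO, typically value predictions and action-dist inputs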

2 changes: 1 addition & 1 deletion doc/source/rllib/package_ref/policy/custom_policies.rst
@@ -23,4 +23,4 @@ framework-agnostic policy),
* :py:meth:`~ray.rllib.policy.policy.Policy.postprocess_trajectory`
* :py:meth:`~ray.rllib.policy.policy.Policy.loss`

- `See here for an example on how to override TorchPolicy <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo_torch_policy.py>`_.
+ `See here for an example on how to override TorchPolicy <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo_torch_policy.py>`_.
20 changes: 10 additions & 10 deletions doc/source/rllib/rllib-algorithms.rst
@@ -130,7 +130,7 @@ Importance Weighted Actor-Learner Architecture (IMPALA)
-------------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1802.01561>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/impala/impala.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/impala/impala.py>`__
In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference `V-trace code <https://github.com/deepmind/scalable_agent/blob/master/vtrace.py>`__. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a `custom model <rllib-models.html#custom-models-tensorflow>`__. Multiple learner GPUs and experience replay are also supported.

.. figure:: images/impala-arch.svg
@@ -168,7 +168,7 @@ SpaceInvaders 843 ~300

**IMPALA-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/impala/impala.py
+ .. literalinclude:: ../../../rllib/agents/impala/impala.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
@@ -179,7 +179,7 @@ Asynchronous Proximal Policy Optimization (APPO)
------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1707.06347>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/appo/appo.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo.py>`__
We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. This is similar to IMPALA but using a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.

.. tip::
@@ -190,11 +190,11 @@ We include an asynchronous variant of Proximal Policy Optimization (PPO) based o

APPO architecture (same as IMPALA)

- Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/appo/pong-appo.yaml>`__
+ Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/pong-appo.yaml>`__

**APPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/appo/appo.py
+ .. literalinclude:: ../../../rllib/agents/ppo/appo.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
@@ -205,7 +205,7 @@ Decentralized Distributed Proximal Policy Optimization (DD-PPO)
---------------------------------------------------------------
|pytorch|
`[paper] <https://arxiv.org/abs/1911.00357>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ddppo/ddppo.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ddppo.py>`__
Unlike APPO or PPO, with DD-PPO policy improvement is no longer done centralized in the trainer process. Instead, gradients are computed remotely on each rollout worker and all-reduced at each mini-batch using `torch distributed <https://pytorch.org/docs/stable/distributed.html>`__. This allows each worker's GPU to be used both for sampling and for training.

.. tip::
@@ -216,11 +216,11 @@ Unlike APPO or PPO, with DD-PPO policy improvement is no longer done centralized

DD-PPO architecture (both sampling and learning are done on worker GPUs)

- Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddppo/cartpole-ddppo.yaml>`__, `BreakoutNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddppo/atari-ddppo.yaml>`__
+ Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/cartpole-ddppo.yaml>`__, `BreakoutNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/atari-ddppo.yaml>`__

**DDPPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/ddppo/ddppo.py
+ .. literalinclude:: ../../../rllib/agents/ppo/ddppo.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
@@ -396,7 +396,7 @@ Proximal Policy Optimization (PPO)
----------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1707.06347>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py>`__
PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.

.. tip::
@@ -445,7 +445,7 @@ HalfCheetah 9664 ~7700

**PPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/ppo/ppo.py
+ .. literalinclude:: ../../../rllib/agents/ppo/ppo.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
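As a usage illustration for the PPO-specific configs referenced above, a hedged sketch that overrides a few of them and launches training via Tune (`num_sgd_iter`, `sgd_minibatch_size`, and `clip_param` are long-standing PPO options; the values here are arbitrary):

    from ray import tune
    from ray.rllib.agents.ppo import PPOTrainer

    config = {
        "env": "CartPole-v0",
        # PPO-specific settings (see the config listing referenced above).
        "num_sgd_iter": 10,
        "sgd_minibatch_size": 128,
        "clip_param": 0.2,
        # Common settings.
        "num_workers": 2,
        "framework": "torch",
    }

    tune.run(PPOTrainer, config=config, stop={"training_iteration": 20})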
12 changes: 6 additions & 6 deletions doc/source/rllib/rllib-concepts.rst
@@ -210,11 +210,11 @@ You might be wondering how RLlib makes the advantages placeholder automatically

In the above section you saw how to compose a simple policy gradient algorithm with RLlib.
In this example, we'll dive into how PPO is defined within RLlib and how you can modify it.
- First, check out the `PPO trainer definition <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py>`__:
+ First, check out the `PPO trainer definition <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py>`__:

.. code-block:: python
- class PPO(Trainer):
+ class PPOTrainer(Trainer):
@classmethod
@override(Trainer)
def get_default_config(cls) -> TrainerConfigDict:
@@ -280,7 +280,7 @@ Suppose we want to customize PPO to use an asynchronous-gradient optimization st

.. code-block:: python
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.execution.rollout_ops import AsyncGradients
from ray.rllib.execution.train_ops import ApplyGradients
from ray.rllib.execution.metric_ops import StandardMetricsReporting
@@ -307,7 +307,7 @@ Now let's look at each PPO policy definition:
PPOTFPolicy = build_tf_policy(
name="PPOTFPolicy",
- get_default_config=lambda: ray.rllib.algorithms.ppo.ppo.PPOConfig().to_dict(),
+ get_default_config=lambda: ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG,
loss_fn=ppo_surrogate_loss,
stats_fn=kl_and_loss_stats,
extra_action_out_fn=vf_preds_and_logits_fetches,
@@ -562,8 +562,8 @@ You can use the ``with_updates`` method on Trainers and Policy objects built wit

.. code-block:: python
- from ray.rllib.algorithms.ppo import PPO
- from ray.rllib.algorithms.ppo.ppo_tf_policy import PPOTFPolicy
+ from ray.rllib.agents.ppo import PPOTrainer
+ from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy
CustomPolicy = PPOTFPolicy.with_updates(
name="MyCustomPPOTFPolicy",
4 changes: 2 additions & 2 deletions doc/source/rllib/rllib-env.rst
@@ -33,7 +33,7 @@ You can pass either a string name or a Python class to specify an environment. B
return <obs>, <reward: float>, <done: bool>, <info: dict>
ray.init()
- trainer = ppo.PPO(env=MyEnv, config={
+ trainer = ppo.PPOTrainer(env=MyEnv, config={
"env_config": {}, # config to pass to env class
})
@@ -50,7 +50,7 @@ You can also register a custom env creator function with a string name. This fun
return MyEnv(...) # return an env instance
register_env("my_env", env_creator)
- trainer = ppo.PPO(env="my_env")
+ trainer = ppo.PPOTrainer(env="my_env")
For a full runnable code example using the custom environment API, see `custom_env.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__.
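For completeness, a minimal sketch of an environment class that satisfies the `MyEnv` interface outlined above (a hypothetical one-dimensional corridor, not taken from this diff):

    import gym
    import numpy as np
    from gym.spaces import Box, Discrete

    class CorridorEnv(gym.Env):
        """Hypothetical 1-D corridor: start at 0, reach position 9 to finish."""

        def __init__(self, env_config=None):
            self.end_pos = 9
            self.cur_pos = 0
            self.action_space = Discrete(2)  # 0 = left, 1 = right
            self.observation_space = Box(0.0, float(self.end_pos), shape=(1,), dtype=np.float32)

        def reset(self):
            self.cur_pos = 0
            return np.array([self.cur_pos], dtype=np.float32)

        def step(self, action):
            if action == 0 and self.cur_pos > 0:
                self.cur_pos -= 1
            elif action == 1:
                self.cur_pos += 1
            done = self.cur_pos >= self.end_pos
            reward = 1.0 if done else -0.1
            return np.array([self.cur_pos], dtype=np.float32), reward, done, {}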

10 changes: 5 additions & 5 deletions doc/source/rllib/rllib-models.rst
@@ -215,7 +215,7 @@ Once implemented, your TF model can then be registered and used in place of a bu
.. code-block:: python
import ray
- import ray.rllib.algorithms.ppo as ppo
+ import ray.rllib.agents.ppo as ppo
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
@@ -227,7 +227,7 @@ Once implemented, your TF model can then be registered and used in place of a bu
ModelCatalog.register_custom_model("my_tf_model", MyModelClass)
ray.init()
- trainer = ppo.PPO(env="CartPole-v0", config={
+ trainer = ppo.PPOTrainer(env="CartPole-v0", config={
"model": {
"custom_model": "my_tf_model",
# Extra kwargs to be passed to your model's c'tor.
@@ -282,7 +282,7 @@ Once implemented, your PyTorch model can then be registered and used in place of
ModelCatalog.register_custom_model("my_torch_model", CustomTorchModel)
ray.init()
- trainer = ppo.PPO(env="CartPole-v0", config={
+ trainer = ppo.PPOTrainer(env="CartPole-v0", config={
"framework": "torch",
"model": {
"custom_model": "my_torch_model",
@@ -488,7 +488,7 @@ Similar to custom models and preprocessors, you can also specify a custom action
.. code-block:: python
import ray
- import ray.rllib.algorithms.ppo as ppo
+ import ray.rllib.agents.ppo as ppo
from ray.rllib.models import ModelCatalog
from ray.rllib.models.preprocessors import Preprocessor
@@ -508,7 +508,7 @@ Similar to custom models and preprocessors, you can also specify a custom action
ModelCatalog.register_custom_action_dist("my_dist", MyActionDist)
ray.init()
- trainer = ppo.PPO(env="CartPole-v0", config={
+ trainer = ppo.PPOTrainer(env="CartPole-v0", config={
"model": {
"custom_action_dist": "my_dist",
},
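As an illustration of the `CustomTorchModel` referenced above, a hedged sketch of a minimal `TorchModelV2` subclass (layer sizes are arbitrary; the `forward()`/`value_function()` interface follows the ModelV2 API of this era):

    import torch.nn as nn
    from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

    class CustomTorchModel(TorchModelV2, nn.Module):
        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
            nn.Module.__init__(self)
            hidden = 64  # arbitrary hidden size
            self._body = nn.Sequential(nn.Linear(obs_space.shape[0], hidden), nn.ReLU())
            self._logits = nn.Linear(hidden, num_outputs)
            self._value_branch = nn.Linear(hidden, 1)
            self._features = None

        def forward(self, input_dict, state, seq_lens):
            self._features = self._body(input_dict["obs"].float())
            return self._logits(self._features), state

        def value_function(self):
            return self._value_branch(self._features).squeeze(1)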
4 changes: 2 additions & 2 deletions doc/source/rllib/rllib-offline.rst
@@ -238,7 +238,7 @@ You can configure experience input for an agent using the following options:
objects, which have the advantage of being type safe, allowing users to set different config settings within
meaningful sub-categories (e.g. ``my_config.offline_data(input_=[xyz])``), and offer the ability to
construct a Trainer instance from these config objects (via their ``.build()`` method).
- So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO`,
+ So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.agents.ppo.ppo.PPOTrainer`,
but we are rolling this out right now across all RLlib.


@@ -335,7 +335,7 @@ You can configure experience output for an agent using the following options:
objects, which have the advantage of being type safe, allowing users to set different config settings within
meaningful sub-categories (e.g. ``my_config.offline_data(input_=[xyz])``), and offer the ability to
construct a Trainer instance from these config objects (via their ``.build()`` method).
- So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO`,
+ So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.agents.ppo.ppo.PPOTrainer`,
but we are rolling this out right now across all RLlib.

.. code-block:: python
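As an illustration of the config-object pattern described in that paragraph, a hedged sketch assuming `PPOConfig` is importable from the restored `ray.rllib.agents.ppo` package and that `build()` accepts an `env` argument:

    from ray.rllib.agents.ppo import PPOConfig

    config = (
        PPOConfig()
        .offline_data(input_="/tmp/cartpole-out")  # read experiences from JSON files
        .training(lr=0.0003)
    )

    # Build a Trainer instance directly from the config object.
    trainer = config.build(env="CartPole-v0")
    result = trainer.train()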
14 changes: 7 additions & 7 deletions doc/source/rllib/rllib-training.rst
@@ -164,7 +164,7 @@ Common Parameters
objects, which have the advantage of being type safe, allowing users to set different config settings within
meaningful sub-categories (e.g. ``my_config.training(lr=0.0003)``), and offer the ability to
construct a Trainer instance from these config objects (via their ``build()`` method).
- So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO`,
+ So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.agents.ppo.ppo.PPOTrainer`,
but we are rolling this out right now across all RLlib.

The following is a list of the common algorithm hyper-parameters:
@@ -705,14 +705,14 @@ Here is an example of the basic usage (for a more complete example, see `custom_
.. code-block:: python
import ray
- import ray.rllib.algorithms.ppo as ppo
+ import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 1
- trainer = ppo.PPO(config=config, env="CartPole-v0")
+ trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
# Can optionally call trainer.restore(path) to load a checkpoint.
@@ -783,7 +783,7 @@ It also simplifies saving the trained agent. For example:
# tune.run() allows setting a custom log directory (other than ``~/ray-results``)
# and automatically saving the trained agent
analysis = ray.tune.run(
- ppo.PPO,
+ ppo.PPOTrainer,
config=config,
local_dir=log_dir,
stop=stop_criteria,
@@ -807,7 +807,7 @@ Loading and restoring a trained agent from a checkpoint is simple:

.. code-block:: python
- agent = ppo.PPO(config=config, env=env_class)
+ agent = ppo.PPOTrainer(config=config, env=env_class)
agent.restore(checkpoint_path)
@@ -1340,10 +1340,10 @@ customizations to your training loop.
import ray
from ray import tune
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer
def train(config, reporter):
- trainer = PPO(config=config, env=YourEnv)
+ trainer = PPOTrainer(config=config, env=YourEnv)
while True:
result = trainer.train()
reporter(**result)
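A hedged sketch tying the basic-usage snippets above into a simple train/checkpoint loop (the `save()` and `pretty_print` usage follows the agents-era Trainer API; iteration counts are arbitrary):

    import ray
    import ray.rllib.agents.ppo as ppo
    from ray.tune.logger import pretty_print

    ray.init()
    config = ppo.DEFAULT_CONFIG.copy()
    config["num_gpus"] = 0
    config["num_workers"] = 1

    trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

    for i in range(10):
        result = trainer.train()
        print(pretty_print(result))
        if (i + 1) % 5 == 0:
            checkpoint = trainer.save()  # returns the checkpoint path
            print("checkpoint saved at", checkpoint)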