Revert "[RLlib] Move (A/DD)?PPO and IMPALA algos to algorithms dir …
Browse files Browse the repository at this point in the history
…and rename policy and trainer classes. (ray-project#25346)" (ray-project#25420)

This reverts commit e4ceae1.

Reverts ray-project#25346

linux://python/ray/tests:test_client_library_integration never failed before this PR.

In the CI of the reverted PR, the same test also fails (https://buildkite.com/ray-project/ray-builders-pr/builds/34079#01812442-c541-4145-af22-2a012655c128), so the failure is very likely caused by that PR.

The test output of the failure also appears related (https://buildkite.com/ray-project/ray-builders-branch/builds/7923#018125c2-4812-4ead-a42f-7fddb344105b).
fishbone authored Jun 3, 2022
1 parent 6589a4f commit fd0f967
Showing 110 changed files with 508 additions and 649 deletions.
6 changes: 3 additions & 3 deletions .buildkite/pipeline.ml.yml
@@ -129,7 +129,7 @@
# Test all tests in the `agents` (soon to be "trainers") dir:
- bazel test --config=ci $(./ci/run/bazel_export_options)
--build_tests_only
- --test_tag_filters=algorithms_dir_generic,-multi_gpu
+ --test_tag_filters=trainers_dir_generic,-multi_gpu
--test_env=RAY_USE_MULTIPROCESSING_CPU_COUNT=1
rllib/...

@@ -141,7 +141,7 @@
# Test all tests in the `agents` (soon to be "trainers") dir:
- bazel test --config=ci $(./ci/run/bazel_export_options)
--build_tests_only
- --test_tag_filters=algorithms_dir,-algorithms_dir_generic,-multi_gpu
+ --test_tag_filters=trainers_dir,-trainers_dir_generic,-multi_gpu
--test_env=RAY_USE_MULTIPROCESSING_CPU_COUNT=1
rllib/...

@@ -154,7 +154,7 @@
# "learning_tests|quick_train|examples|tests_dir".
- bazel test --config=ci $(./ci/run/bazel_export_options)
--build_tests_only
- --test_tag_filters=-learning_tests,-quick_train,-memory_leak_tests,-examples,-tests_dir,-algorithms_dir,-documentation,-multi_gpu
+ --test_tag_filters=-learning_tests,-quick_train,-memory_leak_tests,-examples,-tests_dir,-trainers_dir,-documentation,-multi_gpu
--test_env=RAY_USE_MULTIPROCESSING_CPU_COUNT=1
rllib/...

4 changes: 2 additions & 2 deletions README.rst
@@ -179,7 +179,7 @@ It offers high scalability and unified APIs for a
.. code-block:: python
import gym
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer
# Define your problem using python and openAI's gym API:
@@ -229,7 +229,7 @@ It offers high scalability and unified APIs for a
# Create an RLlib Trainer instance.
- trainer = PPO(
+ trainer = PPOTrainer(
config={
# Env class to use (here: our gym.Env sub-class from above).
"env": SimpleCorridor,
4 changes: 2 additions & 2 deletions doc/source/ray-overview/doc_test/ray_rllib.py
@@ -1,8 +1,8 @@
from ray import tune
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer

tune.run(
- PPO,
+ PPOTrainer,
stop={"episode_len_mean": 20},
config={"env": "CartPole-v0", "framework": "torch", "log_level": "INFO"},
)
6 changes: 3 additions & 3 deletions doc/source/rllib/core-concepts.rst
@@ -25,14 +25,14 @@ Trainers also implement the :ref:`Tune Trainable API <tune-60-seconds>` for easy

You have three ways to interact with a trainer. You can use the basic Python API or the command line to train it, or you
can use Ray Tune to tune hyperparameters of your reinforcement learning algorithm.
- The following example shows three equivalent ways of interacting with the ``PPO`` Trainer,
+ The following example shows three equivalent ways of interacting with the ``PPOTrainer``,
which implements the proximal policy optimization algorithm in RLlib.

.. tabbed:: Basic RLlib Trainer

.. code-block:: python
- trainer = PPO(env="CartPole-v0", config={"train_batch_size": 4000})
+ trainer = PPOTrainer(env="CartPole-v0", config={"train_batch_size": 4000})
while True:
print(trainer.train())
@@ -47,7 +47,7 @@ which implements the proximal policy optimization algorithm in RLlib.
.. code-block:: python
from ray import tune
- tune.run(PPO, config={"env": "CartPole-v0", "train_batch_size": 4000})
+ tune.run(PPOTrainer, config={"env": "CartPole-v0", "train_batch_size": 4000})
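A hedged sketch of the same Tune call with a stop condition and periodic checkpointing added (standard `tune.run` options, assumed here purely for illustration):

    from ray import tune
    from ray.rllib.agents.ppo import PPOTrainer

    # Stop at a mean episode reward of 150; checkpoint every 10 iterations.
    tune.run(
        PPOTrainer,
        config={"env": "CartPole-v0", "train_batch_size": 4000},
        stop={"episode_reward_mean": 150},
        checkpoint_freq=10,
        checkpoint_at_end=True,
    )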
4 changes: 2 additions & 2 deletions doc/source/rllib/doc_code/training.py
@@ -22,9 +22,9 @@
# __query_action_dist_start__
# Get a reference to the policy
import numpy as np
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer

- trainer = PPO(env="CartPole-v0", config={"framework": "tf2", "num_workers": 0})
+ trainer = PPOTrainer(env="CartPole-v0", config={"framework": "tf2", "num_workers": 0})
policy = trainer.get_policy()
# <ray.rllib.policy.eager_tf_policy.PPOTFPolicy_eager object at 0x7fd020165470>
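To show what the policy reference above is typically used for, a hedged sketch of a batched `Policy.compute_actions()` call (observation values are made up; the exact contents of the extra-fetches dict depend on the algorithm):

    import numpy as np

    # Batch of two CartPole observations (4 floats each); values are illustrative.
    obs_batch = np.array(
        [[0.0, 0.1, 0.0, 0.0],
         [0.0, -0.1, 0.0, 0.0]],
        dtype=np.float32,
    )

    # Returns (actions, RNN state-outs, extra fetches).
    actions, state_outs, extra_fetches = policy.compute_actions(obs_batch)
    print(actions)        # e.g. array([0, 1])
    print(extra_fetches)  # for PPO, typically value predictions and action-dist inputs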

2 changes: 1 addition & 1 deletion doc/source/rllib/package_ref/policy/custom_policies.rst
@@ -23,4 +23,4 @@ framework-agnostic policy),
* :py:meth:`~ray.rllib.policy.policy.Policy.postprocess_trajectory`
* :py:meth:`~ray.rllib.policy.policy.Policy.loss`

- `See here for an example on how to override TorchPolicy <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo_torch_policy.py>`_.
+ `See here for an example on how to override TorchPolicy <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo_torch_policy.py>`_.
20 changes: 10 additions & 10 deletions doc/source/rllib/rllib-algorithms.rst
@@ -130,7 +130,7 @@ Importance Weighted Actor-Learner Architecture (IMPALA)
-------------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1802.01561>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/impala/impala.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/impala/impala.py>`__
In IMPALA, a central learner runs SGD in a tight loop while asynchronously pulling sample batches from many actor processes. RLlib's IMPALA implementation uses DeepMind's reference `V-trace code <https://github.com/deepmind/scalable_agent/blob/master/vtrace.py>`__. Note that we do not provide a deep residual network out of the box, but one can be plugged in as a `custom model <rllib-models.html#custom-models-tensorflow>`__. Multiple learner GPUs and experience replay are also supported.

.. figure:: images/impala-arch.svg
@@ -168,7 +168,7 @@ SpaceInvaders 843 ~300

**IMPALA-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/impala/impala.py
+ .. literalinclude:: ../../../rllib/agents/impala/impala.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
@@ -179,7 +179,7 @@ Asynchronous Proximal Policy Optimization (APPO)
------------------------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1707.06347>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/appo/appo.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo.py>`__
We include an asynchronous variant of Proximal Policy Optimization (PPO) based on the IMPALA architecture. This is similar to IMPALA but using a surrogate policy loss with clipping. Compared to synchronous PPO, APPO is more efficient in wall-clock time due to its use of asynchronous sampling. Using a clipped loss also allows for multiple SGD passes, and therefore the potential for better sample efficiency compared to IMPALA. V-trace can also be enabled to correct for off-policy samples.

.. tip::
@@ -190,11 +190,11 @@ We include an asynchronous variant of Proximal Policy Optimization (PPO) based o

APPO architecture (same as IMPALA)

- Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/appo/pong-appo.yaml>`__
+ Tuned examples: `PongNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/pong-appo.yaml>`__

**APPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/appo/appo.py
+ .. literalinclude:: ../../../rllib/agents/ppo/appo.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
@@ -205,7 +205,7 @@ Decentralized Distributed Proximal Policy Optimization (DD-PPO)
---------------------------------------------------------------
|pytorch|
`[paper] <https://arxiv.org/abs/1911.00357>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ddppo/ddppo.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ddppo.py>`__
Unlike APPO or PPO, with DD-PPO policy improvement is no longer done centralized in the trainer process. Instead, gradients are computed remotely on each rollout worker and all-reduced at each mini-batch using `torch distributed <https://pytorch.org/docs/stable/distributed.html>`__. This allows each worker's GPU to be used both for sampling and for training.

.. tip::
@@ -216,11 +216,11 @@ Unlike APPO or PPO, with DD-PPO policy improvement is no longer done centralized

DD-PPO architecture (both sampling and learning are done on worker GPUs)

- Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddppo/cartpole-ddppo.yaml>`__, `BreakoutNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ddppo/atari-ddppo.yaml>`__
+ Tuned examples: `CartPole-v0 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/cartpole-ddppo.yaml>`__, `BreakoutNoFrameskip-v4 <https://github.com/ray-project/ray/blob/master/rllib/tuned_examples/ppo/atari-ddppo.yaml>`__

**DDPPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/ddppo/ddppo.py
+ .. literalinclude:: ../../../rllib/agents/ppo/ddppo.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
@@ -396,7 +396,7 @@ Proximal Policy Optimization (PPO)
----------------------------------
|pytorch| |tensorflow|
`[paper] <https://arxiv.org/abs/1707.06347>`__
- `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py>`__
+ `[implementation] <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py>`__
PPO's clipped objective supports multiple SGD passes over the same batch of experiences. RLlib's multi-GPU optimizer pins that data in GPU memory to avoid unnecessary transfers from host memory, substantially improving performance over a naive implementation. PPO scales out using multiple workers for experience collection, and also to multiple GPUs for SGD.

.. tip::
@@ -445,7 +445,7 @@ HalfCheetah 9664 ~7700

**PPO-specific configs** (see also `common configs <rllib-training.html#common-parameters>`__):

- .. literalinclude:: ../../../rllib/algorithms/ppo/ppo.py
+ .. literalinclude:: ../../../rllib/agents/ppo/ppo.py
:language: python
:start-after: __sphinx_doc_begin__
:end-before: __sphinx_doc_end__
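As a usage illustration for the PPO-specific configs referenced above, a hedged sketch that overrides a few of them and launches training via Tune (`num_sgd_iter`, `sgd_minibatch_size`, and `clip_param` are long-standing PPO options; the values here are arbitrary):

    from ray import tune
    from ray.rllib.agents.ppo import PPOTrainer

    config = {
        "env": "CartPole-v0",
        # PPO-specific settings (see the config listing referenced above).
        "num_sgd_iter": 10,
        "sgd_minibatch_size": 128,
        "clip_param": 0.2,
        # Common settings.
        "num_workers": 2,
        "framework": "torch",
    }

    tune.run(PPOTrainer, config=config, stop={"training_iteration": 20})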
12 changes: 6 additions & 6 deletions doc/source/rllib/rllib-concepts.rst
@@ -210,11 +210,11 @@ You might be wondering how RLlib makes the advantages placeholder automatically

In the above section you saw how to compose a simple policy gradient algorithm with RLlib.
In this example, we'll dive into how PPO is defined within RLlib and how you can modify it.
- First, check out the `PPO trainer definition <https://github.com/ray-project/ray/blob/master/rllib/algorithms/ppo/ppo.py>`__:
+ First, check out the `PPO trainer definition <https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py>`__:

.. code-block:: python
- class PPO(Trainer):
+ class PPOTrainer(Trainer):
@classmethod
@override(Trainer)
def get_default_config(cls) -> TrainerConfigDict:
@@ -280,7 +280,7 @@ Suppose we want to customize PPO to use an asynchronous-gradient optimization st

.. code-block:: python
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.execution.rollout_ops import AsyncGradients
from ray.rllib.execution.train_ops import ApplyGradients
from ray.rllib.execution.metric_ops import StandardMetricsReporting
@@ -307,7 +307,7 @@ Now let's look at each PPO policy definition:
PPOTFPolicy = build_tf_policy(
name="PPOTFPolicy",
- get_default_config=lambda: ray.rllib.algorithms.ppo.ppo.PPOConfig().to_dict(),
+ get_default_config=lambda: ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG,
loss_fn=ppo_surrogate_loss,
stats_fn=kl_and_loss_stats,
extra_action_out_fn=vf_preds_and_logits_fetches,
@@ -562,8 +562,8 @@ You can use the ``with_updates`` method on Trainers and Policy objects built wit

.. code-block:: python
- from ray.rllib.algorithms.ppo import PPO
- from ray.rllib.algorithms.ppo.ppo_tf_policy import PPOTFPolicy
+ from ray.rllib.agents.ppo import PPOTrainer
+ from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy
CustomPolicy = PPOTFPolicy.with_updates(
name="MyCustomPPOTFPolicy",
4 changes: 2 additions & 2 deletions doc/source/rllib/rllib-env.rst
@@ -33,7 +33,7 @@ You can pass either a string name or a Python class to specify an environment. B
return <obs>, <reward: float>, <done: bool>, <info: dict>
ray.init()
- trainer = ppo.PPO(env=MyEnv, config={
+ trainer = ppo.PPOTrainer(env=MyEnv, config={
"env_config": {}, # config to pass to env class
})
@@ -50,7 +50,7 @@ You can also register a custom env creator function with a string name. This fun
return MyEnv(...) # return an env instance
register_env("my_env", env_creator)
- trainer = ppo.PPO(env="my_env")
+ trainer = ppo.PPOTrainer(env="my_env")
For a full runnable code example using the custom environment API, see `custom_env.py <https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py>`__.
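For completeness, a minimal sketch of an environment class that satisfies the `MyEnv` interface outlined above (a hypothetical one-dimensional corridor, not taken from this diff):

    import gym
    import numpy as np
    from gym.spaces import Box, Discrete

    class CorridorEnv(gym.Env):
        """Hypothetical 1-D corridor: start at 0, reach position 9 to finish."""

        def __init__(self, env_config=None):
            self.end_pos = 9
            self.cur_pos = 0
            self.action_space = Discrete(2)  # 0 = left, 1 = right
            self.observation_space = Box(0.0, float(self.end_pos), shape=(1,), dtype=np.float32)

        def reset(self):
            self.cur_pos = 0
            return np.array([self.cur_pos], dtype=np.float32)

        def step(self, action):
            if action == 0 and self.cur_pos > 0:
                self.cur_pos -= 1
            elif action == 1:
                self.cur_pos += 1
            done = self.cur_pos >= self.end_pos
            reward = 1.0 if done else -0.1
            return np.array([self.cur_pos], dtype=np.float32), reward, done, {}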

10 changes: 5 additions & 5 deletions doc/source/rllib/rllib-models.rst
@@ -215,7 +215,7 @@ Once implemented, your TF model can then be registered and used in place of a bu
.. code-block:: python
import ray
- import ray.rllib.algorithms.ppo as ppo
+ import ray.rllib.agents.ppo as ppo
from ray.rllib.models import ModelCatalog
from ray.rllib.models.tf.tf_modelv2 import TFModelV2
@@ -227,7 +227,7 @@ Once implemented, your TF model can then be registered and used in place of a bu
ModelCatalog.register_custom_model("my_tf_model", MyModelClass)
ray.init()
- trainer = ppo.PPO(env="CartPole-v0", config={
+ trainer = ppo.PPOTrainer(env="CartPole-v0", config={
"model": {
"custom_model": "my_tf_model",
# Extra kwargs to be passed to your model's c'tor.
@@ -282,7 +282,7 @@ Once implemented, your PyTorch model can then be registered and used in place of
ModelCatalog.register_custom_model("my_torch_model", CustomTorchModel)
ray.init()
- trainer = ppo.PPO(env="CartPole-v0", config={
+ trainer = ppo.PPOTrainer(env="CartPole-v0", config={
"framework": "torch",
"model": {
"custom_model": "my_torch_model",
@@ -488,7 +488,7 @@ Similar to custom models and preprocessors, you can also specify a custom action
.. code-block:: python
import ray
- import ray.rllib.algorithms.ppo as ppo
+ import ray.rllib.agents.ppo as ppo
from ray.rllib.models import ModelCatalog
from ray.rllib.models.preprocessors import Preprocessor
@@ -508,7 +508,7 @@ Similar to custom models and preprocessors, you can also specify a custom action
ModelCatalog.register_custom_action_dist("my_dist", MyActionDist)
ray.init()
- trainer = ppo.PPO(env="CartPole-v0", config={
+ trainer = ppo.PPOTrainer(env="CartPole-v0", config={
"model": {
"custom_action_dist": "my_dist",
},
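As an illustration of the `CustomTorchModel` referenced above, a hedged sketch of a minimal `TorchModelV2` subclass (layer sizes are arbitrary; the `forward()`/`value_function()` interface follows the ModelV2 API of this era):

    import torch.nn as nn
    from ray.rllib.models.torch.torch_modelv2 import TorchModelV2

    class CustomTorchModel(TorchModelV2, nn.Module):
        def __init__(self, obs_space, action_space, num_outputs, model_config, name):
            TorchModelV2.__init__(self, obs_space, action_space, num_outputs, model_config, name)
            nn.Module.__init__(self)
            hidden = 64  # arbitrary hidden size
            self._body = nn.Sequential(nn.Linear(obs_space.shape[0], hidden), nn.ReLU())
            self._logits = nn.Linear(hidden, num_outputs)
            self._value_branch = nn.Linear(hidden, 1)
            self._features = None

        def forward(self, input_dict, state, seq_lens):
            self._features = self._body(input_dict["obs"].float())
            return self._logits(self._features), state

        def value_function(self):
            return self._value_branch(self._features).squeeze(1)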
4 changes: 2 additions & 2 deletions doc/source/rllib/rllib-offline.rst
@@ -238,7 +238,7 @@ You can configure experience input for an agent using the following options:
objects, which have the advantage of being type safe, allowing users to set different config settings within
meaningful sub-categories (e.g. ``my_config.offline_data(input_=[xyz])``), and offer the ability to
construct a Trainer instance from these config objects (via their ``.build()`` method).
- So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO`,
+ So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.agents.ppo.ppo.PPOTrainer`,
but we are rolling this out right now across all RLlib.


@@ -335,7 +335,7 @@ You can configure experience output for an agent using the following options:
objects, which have the advantage of being type safe, allowing users to set different config settings within
meaningful sub-categories (e.g. ``my_config.offline_data(input_=[xyz])``), and offer the ability to
construct a Trainer instance from these config objects (via their ``.build()`` method).
- So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO`,
+ So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.agents.ppo.ppo.PPOTrainer`,
but we are rolling this out right now across all RLlib.

.. code-block:: python
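As an illustration of the config-object pattern described in that paragraph, a hedged sketch assuming `PPOConfig` is importable from the restored `ray.rllib.agents.ppo` package and that `build()` accepts an `env` argument:

    from ray.rllib.agents.ppo import PPOConfig

    config = (
        PPOConfig()
        .offline_data(input_="/tmp/cartpole-out")  # read experiences from JSON files
        .training(lr=0.0003)
    )

    # Build a Trainer instance directly from the config object.
    trainer = config.build(env="CartPole-v0")
    result = trainer.train()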
14 changes: 7 additions & 7 deletions doc/source/rllib/rllib-training.rst
@@ -164,7 +164,7 @@ Common Parameters
objects, which have the advantage of being type safe, allowing users to set different config settings within
meaningful sub-categories (e.g. ``my_config.training(lr=0.0003)``), and offer the ability to
construct a Trainer instance from these config objects (via their ``build()`` method).
- So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.algorithms.ppo.ppo.PPO`,
+ So far, this is only supported for some Trainer classes, such as :py:class:`~ray.rllib.agents.ppo.ppo.PPOTrainer`,
but we are rolling this out right now across all RLlib.

The following is a list of the common algorithm hyper-parameters:
@@ -705,14 +705,14 @@ Here is an example of the basic usage (for a more complete example, see `custom_
.. code-block:: python
import ray
- import ray.rllib.algorithms.ppo as ppo
+ import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print
ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 1
- trainer = ppo.PPO(config=config, env="CartPole-v0")
+ trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")
# Can optionally call trainer.restore(path) to load a checkpoint.
@@ -783,7 +783,7 @@ It also simplifies saving the trained agent. For example:
# tune.run() allows setting a custom log directory (other than ``~/ray-results``)
# and automatically saving the trained agent
analysis = ray.tune.run(
- ppo.PPO,
+ ppo.PPOTrainer,
config=config,
local_dir=log_dir,
stop=stop_criteria,
@@ -807,7 +807,7 @@ Loading and restoring a trained agent from a checkpoint is simple:

.. code-block:: python
- agent = ppo.PPO(config=config, env=env_class)
+ agent = ppo.PPOTrainer(config=config, env=env_class)
agent.restore(checkpoint_path)
@@ -1340,10 +1340,10 @@ customizations to your training loop.
import ray
from ray import tune
- from ray.rllib.algorithms.ppo import PPO
+ from ray.rllib.agents.ppo import PPOTrainer
def train(config, reporter):
- trainer = PPO(config=config, env=YourEnv)
+ trainer = PPOTrainer(config=config, env=YourEnv)
while True:
result = trainer.train()
reporter(**result)
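A hedged sketch tying the basic-usage snippets above into a simple train/checkpoint loop (the `save()` and `pretty_print` usage follows the agents-era Trainer API; iteration counts are arbitrary):

    import ray
    import ray.rllib.agents.ppo as ppo
    from ray.tune.logger import pretty_print

    ray.init()
    config = ppo.DEFAULT_CONFIG.copy()
    config["num_gpus"] = 0
    config["num_workers"] = 1

    trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

    for i in range(10):
        result = trainer.train()
        print(pretty_print(result))
        if (i + 1) % 5 == 0:
            checkpoint = trainer.save()  # returns the checkpoint path
            print("checkpoint saved at", checkpoint)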