[RLLib] Readme.md Documentation for Almost All Algorithms in rllib/agents (ray-project#13035)
1 parent d811d65, commit eae7a1f
Showing 6 changed files with 146 additions and 0 deletions.

@@ -0,0 +1,22 @@
# Advantage Actor-Critic (A2C, A3C)

## Overview

[Advantage Actor-Critic](https://arxiv.org/pdf/1602.01783.pdf) proposes two distributed model-free on-policy RL algorithms, A3C and A2C. Both are distributed versions of the vanilla Policy Gradient (PG) algorithm that differ in their execution patterns. The paper suggests accelerating training by scaling up data collection: worker nodes carry copies of the central node's policy network and collect data from the environment in parallel. Each worker uses its data to compute gradients; the central node applies these gradients and then sends the updated weights back to the workers.

In A2C, the worker nodes collect data synchronously; the collected data forms one large batch, from which the central node (the central policy) computes gradient updates. In A3C, by contrast, the worker nodes generate data asynchronously, compute gradients locally, and send those gradients to the central node. Because of this asynchrony, the workers in A3C may be slightly out of sync with the central node, which can introduce bias into learning.

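As a quick orientation, here is a minimal usage sketch (illustrative, not part of this commit) that trains the synchronous variant with RLlib; swapping in `A3CTrainer` gives the asynchronous variant. The environment name and config values are placeholders.

```python
import ray
from ray.rllib.agents.a3c import A2CTrainer  # A3CTrainer for the asynchronous variant

ray.init()
# Two rollout workers collect experience in parallel; A2C waits for all of them
# each iteration and applies a single synchronous gradient update on the batch.
trainer = A2CTrainer(env="CartPole-v0", config={"num_workers": 2, "lr": 0.001})
for _ in range(5):
    result = trainer.train()
    print(result["episode_reward_mean"])
```
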
## Documentation & Implementation:

1) A2C.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#a3c)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a2c.py)**

2) A3C.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#a3c)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/a3c/a3c.py)**

@@ -0,0 +1,13 @@
# Augmented Random Search (ARS)

## Overview

[ARS](https://arxiv.org/abs/1803.07055) is a sample-efficient random-search method that can outperform model-free RL algorithms. In each iteration, ARS generates candidate policies by adding random noise to a central policy and ranks them by their performance in the environment. At the end of the iteration, the top-ranked candidates are used to compute the update to the central policy.

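Because the search operates directly on policy parameters, the core loop is short. Below is a toy sketch of that loop on a linear policy with a stand-in reward function (an illustration of the idea, not RLlib's implementation): sample random directions, evaluate the policy perturbed in both directions, rank the directions, and update with the top performers.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim = 4, 2
theta = np.zeros((act_dim, obs_dim))              # central linear policy

def rollout_reward(policy):
    # Stand-in for an environment rollout: reward peaks at a hidden target policy.
    target = np.ones((act_dim, obs_dim))
    return -np.sum((policy - target) ** 2)

n_dirs, top_k, noise_std, lr = 8, 4, 0.05, 0.02
for _ in range(200):
    deltas = rng.normal(size=(n_dirs, act_dim, obs_dim))
    r_pos = np.array([rollout_reward(theta + noise_std * d) for d in deltas])
    r_neg = np.array([rollout_reward(theta - noise_std * d) for d in deltas])
    # Rank directions by the better of their two rollouts and keep the top k.
    order = np.argsort(np.maximum(r_pos, r_neg))[::-1][:top_k]
    sigma_r = np.concatenate([r_pos[order], r_neg[order]]).std() + 1e-8
    update = sum((r_pos[i] - r_neg[i]) * deltas[i] for i in order)
    theta += lr / (top_k * sigma_r) * update      # weighted update from the best directions
```
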
## Documentation & Implementation:

Augmented Random Search (ARS).

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#ars)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/ars/ars.py)**

@@ -1 +1,60 @@
# Deep Q Networks (DQN)

Code in this package is adapted from https://github.com/openai/baselines/tree/master/baselines/deepq.

## Overview

[DQN](https://www.cs.toronto.edu/~vmnih/docs/dqn.pdf) is a model-free off-policy RL algorithm and one of the first deep RL algorithms developed. DQN uses a neural network as a function approximator for the Q-function in Q-learning and minimizes the L2 loss between its Q-value predictions and the Q-value targets, which are computed as 1-step TD targets. The paper introduces two important components: a target network and an experience replay buffer. The target network is a copy of the main Q-network and is used to compute the Q-value targets in the loss; to stabilize training, it lags slightly behind the main Q-network. The experience replay buffer stores the data encountered by the agent during training and is sampled uniformly to generate gradient updates for the Q-network.

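To make the loss concrete, here is a small numpy sketch of the 1-step TD target and L2 loss with a lagging target network (an illustration with linear stand-ins for the networks, not this package's code).

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, n_actions, gamma, batch = 4, 3, 0.99, 32

# Stand-in "networks": linear Q-functions mapping observations to per-action values.
q_weights = rng.normal(scale=0.1, size=(obs_dim, n_actions))   # main Q-network
target_weights = q_weights.copy()                              # lagging target network

# A batch sampled uniformly from the replay buffer: (s, a, r, s', done).
s = rng.normal(size=(batch, obs_dim))
a = rng.integers(0, n_actions, size=batch)
r = rng.normal(size=batch)
s_next = rng.normal(size=(batch, obs_dim))
done = rng.integers(0, 2, size=batch).astype(float)

# 1-step TD target: bootstrap from the *target* network, cut off at terminal states.
td_target = r + gamma * (1.0 - done) * (s_next @ target_weights).max(axis=1)
q_pred = (s @ q_weights)[np.arange(batch), a]
loss = np.mean((q_pred - td_target) ** 2)                      # L2 loss to minimize

# Periodically the target network is synced to (or Polyak-averaged toward) the main one.
target_weights = q_weights.copy()
```
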
## Supported DQN Algorithms

[Double DQN](https://arxiv.org/pdf/1509.06461.pdf) - As opposed to learning a single Q-network in vanilla DQN, Double DQN learns two Q-networks, akin to double Q-learning. This addresses vanilla DQN's overly optimistic Q-value estimates, which limit performance.

[Dueling DQN](https://arxiv.org/pdf/1511.06581.pdf) - Dueling DQN splits the Q-value function approximator into two streams, a state-value approximator and an advantage approximator, whose outputs are combined into Q-values.

[Distributional DQN](https://arxiv.org/pdf/1707.06887.pdf) - Usually, the Q-network outputs a single predicted Q-value per state-action pair. Distributional DQN goes further by predicting the full distribution of the return (in the linked C51 paper, a categorical distribution over a fixed set of value atoms) for each state-action pair. Modeling the full return distribution rather than only its mean can improve the performance of DQN algorithms.

[APEX-DQN](https://arxiv.org/pdf/1803.00933.pdf) - Standard DQN algorithms sample data uniformly from an experience replay buffer and compute gradients from the sampled data. APEX introduces weighted (prioritized) replay, where elements in the replay buffer are more or less likely to be sampled depending on their TD-error, and scales data collection across many distributed actors.

[Rainbow](https://arxiv.org/pdf/1710.02298.pdf) - Rainbow DQN, as the name suggests, aggregates many of the improvements found in DQN research. These include a multi-step distributional loss (extending Distributional DQN), prioritized replay (also used in APEX-DQN), double Q-networks (as in Double DQN), dueling networks (as in Dueling DQN), and noisy networks for exploration. The config sketch below shows how these pieces map onto RLlib's DQN settings.

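In RLlib, these variants are enabled through the DQN trainer's config rather than separate classes. The sketch below is illustrative only (the values are examples, not tuned settings); it combines the flags for double Q-learning, dueling heads, distributional (C51-style) outputs, noisy exploration, multi-step targets, and prioritized replay into a Rainbow-style configuration.

```python
import ray
from ray.rllib.agents.dqn import DQNTrainer

ray.init()
config = {
    "double_q": True,            # Double DQN target
    "dueling": True,             # dueling value/advantage heads
    "num_atoms": 51,             # distributional (C51) output
    "v_min": -10.0,
    "v_max": 10.0,
    "noisy": True,               # noisy-net exploration (Rainbow)
    "n_step": 3,                 # multi-step TD targets
    "prioritized_replay": True,  # TD-error-weighted replay (as in APEX/Rainbow)
}
trainer = DQNTrainer(env="CartPole-v0", config=config)
print(trainer.train()["episode_reward_mean"])
```

For the fully distributed APEX-DQN setup, RLlib provides a separate `ApexTrainer` in the same package.
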
## Documentation & Implementation:

1) Vanilla DQN (DQN).

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dqn)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/simple_q.py)**

2) Double DQN.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dqn)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/dqn.py)**

3) Dueling DQN.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dqn)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/dqn.py)**

4) Distributional DQN.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dqn)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/dqn.py)**

5) APEX DQN.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dqn)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/apex.py)**

6) Rainbow DQN.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dqn)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/dqn.py)**

@@ -0,0 +1,19 @@
# Dreamer



## Overview

[Dreamer](https://arxiv.org/abs/1912.01603) is a model-based off-policy RL algorithm that learns by imagining ahead and works well in visual environments. Like other model-based algorithms, Dreamer learns the environment's transition dynamics via a latent-space model called [PlaNet](https://ai.googleblog.com/2019/02/introducing-planet-deep-planning.html). PlaNet learns to encode visual observations into latent vectors, which Dreamer uses as pseudo-observations.

Dreamer is a gradient-based RL algorithm: the agent imagines ahead using its learned transition dynamics model (PlaNet) to generate future states and rewards. Because imagining ahead is fully differentiable, the RL objective (maximizing the sum of imagined rewards) can be optimized directly by backpropagation rather than indirectly, as in policy-gradient methods. This gradient-based learning, in conjunction with PlaNet's latent space, gives the agent much better sample complexity and performance than other visual-based agents.

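The toy sketch below (a stand-in, not the actual Dreamer/PlaNet code) illustrates why imagining ahead yields a differentiable objective: with a learned latent transition model and reward head, the sum of imagined rewards is a differentiable function of the policy parameters, so the policy can be trained by backpropagating through the imagined trajectory.

```python
import torch
import torch.nn as nn

latent_dim, action_dim, horizon = 8, 2, 15

dynamics = nn.Linear(latent_dim + action_dim, latent_dim)  # stand-in for the latent (PlaNet) model
reward_head = nn.Linear(latent_dim, 1)                     # predicts reward from a latent state
policy = nn.Sequential(nn.Linear(latent_dim, action_dim), nn.Tanh())
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)

z = torch.randn(32, latent_dim)          # batch of imagined start states
imagined_return = 0.0
for _ in range(horizon):                 # imagine ahead entirely in latent space
    a = policy(z)
    z = dynamics(torch.cat([z, a], dim=-1))
    imagined_return = imagined_return + reward_head(z).mean()

loss = -imagined_return                  # maximize the imagined return
opt.zero_grad()
loss.backward()                          # gradients flow through the model into the policy
opt.step()
```
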
For more details, there is a Ray/RLlib [blog post](https://medium.com/distributed-computing-with-ray/model-based-reinforcement-learning-with-ray-rllib-73f47df33839) that covers the components of PlaNet and the distributed execution plan.

## Documentation & Implementation:

Dreamer.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#dreamer)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/dreamer/dreamer.py)**

@@ -0,0 +1,16 @@
# Model Agnostic Meta-learning (MAML)

## Overview

[MAML](https://arxiv.org/abs/1703.03400) is an on-policy meta RL algorithm. Unlike standard RL algorithms, which aim to maximize the sum of rewards into the future for a single task (e.g. HalfCheetah), meta RL algorithms seek to maximize the sum of rewards for *a given distribution of tasks*.

At a high level, MAML seeks to learn quick adaptation across different tasks (e.g. different target velocities for HalfCheetah), where quick adaptation is measured by the number of gradient steps needed to adapt. MAML maximizes the RL objective for each task after `X` gradient steps of adaptation. This requires partitioning the algorithm into two steps. The first step is data collection: for each task, data is collected at every step of adaptation (from `1, 2, ..., X`). The second step is the meta-update step, which takes all the data aggregated in the first step and computes the meta-gradient. A toy sketch of this two-step structure follows below.

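To make the two-step structure concrete, here is a toy sketch of the MAML update on a supervised sine-regression stand-in (chosen for brevity; RLlib's MAML applies the same inner/outer structure to the RL objective with `X` adaptation steps).

```python
import torch

def task_loss(params, task):
    # One batch of data for a task defined by (amplitude, phase).
    amp, phase = task
    x = torch.rand(16, 1) * 6.28
    y = amp * torch.sin(x + phase)
    hidden = torch.tanh(x @ params["w1"] + params["b1"])
    pred = hidden @ params["w2"] + params["b2"]
    return ((pred - y) ** 2).mean()

params = {
    "w1": (0.1 * torch.randn(1, 32)).requires_grad_(),
    "b1": torch.zeros(32, requires_grad=True),
    "w2": (0.1 * torch.randn(32, 1)).requires_grad_(),
    "b2": torch.zeros(1, requires_grad=True),
}
meta_opt = torch.optim.Adam(params.values(), lr=1e-3)
inner_lr = 0.01
tasks = [(1.0, 0.0), (2.0, 1.0), (0.5, 2.0)]      # a small "distribution" of tasks

for _ in range(100):                               # meta-iterations
    meta_loss = 0.0
    for task in tasks:
        # Step 1 (adaptation): one inner gradient step on this task (X = 1 here).
        loss = task_loss(params, task)
        grads = torch.autograd.grad(loss, list(params.values()), create_graph=True)
        adapted = {k: p - inner_lr * g for (k, p), g in zip(params.items(), grads)}
        # Step 2 (meta-objective): performance *after* adaptation, on fresh task data.
        meta_loss = meta_loss + task_loss(adapted, task)
    meta_opt.zero_grad()
    meta_loss.backward()                           # meta-gradient flows through the inner step
    meta_opt.step()
```
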
## Documentation & Implementation:

MAML.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#model-agnostic-meta-learning-maml)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/maml/maml.py)**

@@ -0,0 +1,17 @@
# Model-based Meta-Policy Optimization (MB-MPO)

Code in this package is adapted from https://github.com/jonasrothfuss/model_ensemble_meta_learning.

## Overview

[MBMPO](https://arxiv.org/abs/1809.05214) is an on-policy model-based algorithm. At a high level, MBMPO is a model-based version of [MAML](https://arxiv.org/abs/1703.03400). On top of MAML, MBMPO learns an *ensemble of dynamics models*. The dynamics models are trained with real environment data, while the actor/critic networks are trained with fake data generated by the dynamics models; the actor and critic are updated via the MAML algorithm. In its distributed execution plan, MBMPO alternates between training the dynamics models and training the actor and critic networks. The sketch below illustrates the ensemble idea.

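Here is a toy numpy illustration of the model-ensemble idea (a stand-in for brevity, not RLlib's implementation): fit several dynamics models on real transitions, then generate "fake" rollouts by sampling a different ensemble member at each imagined step.

```python
import numpy as np

rng = np.random.default_rng(0)
obs_dim, act_dim, ensemble_size = 3, 1, 5

# Real transitions (s, a, s') collected from the environment (random stand-ins here).
S = rng.normal(size=(256, obs_dim))
A = rng.normal(size=(256, act_dim))
S_next = S + 0.1 * A + 0.01 * rng.normal(size=S.shape)

# Each "dynamics model" is a linear least-squares fit s' ~ [s, a] on a bootstrap sample.
X = np.hstack([S, A])
models = []
for _ in range(ensemble_size):
    idx = rng.integers(0, len(X), size=len(X))
    W, *_ = np.linalg.lstsq(X[idx], S_next[idx], rcond=None)
    models.append(W)

# Fake rollout for actor/critic training: pick a random ensemble member at each step.
s = S[0]
for _ in range(10):
    a = rng.normal(size=act_dim)                   # placeholder for the policy's action
    W = models[rng.integers(ensemble_size)]
    s = np.concatenate([s, a]) @ W                 # imagined next state
```
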
More details can be found [here](https://medium.com/distributed-computing-with-ray/model-based-reinforcement-learning-with-ray-rllib-73f47df33839).

## Documentation & Implementation:

MBMPO.

**[Detailed Documentation](https://docs.ray.io/en/master/rllib-algorithms.html#mbmpo)**

**[Implementation](https://github.com/ray-project/ray/blob/master/rllib/agents/mbmpo/mbmpo.py)**