This project aims to reproduce the results of several model-free RL algorithms in continuous action domains (MuJoCo environments).
This project:
- uses PyTorch
- implements each algorithm independently, in a separate, minimal file
- is written in the simplest possible style
- tries to follow the original papers and reproduce their results
My first stage of work is to reproduce the algorithm-comparison figure in the PPO paper, covering:
- A2C
- ACER (A2C + trust region): this implementation seems to have some problems (bug reports welcome)
- TRPO (TRPO single path)
- PPO (PPO clip)
- Vanilla PG
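For reference, the clipped surrogate objective that "PPO clip" refers to can be sketched in a few lines. This is a minimal illustration, not this repository's code: it uses plain Python instead of PyTorch tensors so that it runs standalone, and the function name `ppo_clip_loss` and the default `clip_eps=0.2` are assumptions made for this example.

```python
import math


def ppo_clip_loss(new_logp, old_logp, advantages, clip_eps=0.2):
    """Negative PPO clipped surrogate objective, averaged over samples (minimize this)."""
    total = 0.0
    for lp_new, lp_old, adv in zip(new_logp, old_logp, advantages):
        # Probability ratio r_t = pi_new(a_t | s_t) / pi_old(a_t | s_t), computed from log-probs
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        # Pessimistic bound: keep the smaller of the unclipped and clipped terms
        total += min(ratio * adv, clipped_ratio * adv)
    return -total / len(advantages)


if __name__ == "__main__":
    # Ratio of 2 with a positive advantage: the gain is capped at (1 + 0.2) * 1.0
    print(ppo_clip_loss([math.log(2.0)], [0.0], [1.0]))
```

Taking the minimum makes the objective pessimistic: when the ratio leaves [1 - ε, 1 + ε] in the direction that would increase the objective, the clipped term caps the gain, while moves that decrease it are not clipped.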
In the next stage, I plan to implement:
- Random Search (see "Simple random search provides a competitive approach to reinforcement learning")
- NPG (natural policy gradient)
- SAC (soft actor-critic)
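Random search is the easiest of these to sketch. The code below is a minimal sketch of the basic random search update from the cited paper: perturb the parameters θ along random Gaussian directions δ, compare the rewards at θ + νδ and θ - νδ, and step along each direction weighted by the reward difference. It omits the ARS refinements (state normalization, scaling by the reward standard deviation, keeping only the top directions). `reward_fn` is a hypothetical stand-in for rolling out a linear policy in an environment, and the default step size and noise are chosen for the toy problem, not tuned values.

```python
import random


def basic_random_search(reward_fn, theta, step_size=0.5, noise=0.1,
                        num_directions=8, iterations=200, seed=0):
    """Basic random search: finite-difference updates along random Gaussian directions."""
    rng = random.Random(seed)
    theta = list(theta)
    for _ in range(iterations):
        update = [0.0] * len(theta)
        for _ in range(num_directions):
            delta = [rng.gauss(0.0, 1.0) for _ in theta]
            # Evaluate the mirrored perturbations theta + nu*delta and theta - nu*delta
            r_plus = reward_fn([t + noise * d for t, d in zip(theta, delta)])
            r_minus = reward_fn([t - noise * d for t, d in zip(theta, delta)])
            for i, d in enumerate(delta):
                update[i] += (r_plus - r_minus) * d
        # Average over the sampled directions and take a step of size alpha
        theta = [t + step_size / num_directions * u for t, u in zip(theta, update)]
    return theta


if __name__ == "__main__":
    # Hypothetical toy reward: negative squared distance to the point [3.0, 3.0]
    def toy_reward(params):
        return -sum((x - 3.0) ** 2 for x in params)

    print(basic_random_search(toy_reward, [0.0, 0.0]))
```

No gradient of the reward is ever computed; only reward evaluations are needed, which is what makes the method a useful baseline for the policy-gradient algorithms above.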
After that, I want to tackle discrete action spaces with raw video input (Atari):
- Rainbow: DQN and related techniques (target network / double Q-learning / prioritized experience replay / dueling network structure / distributional RL)
Rainbow on Atari with only 3M: it works, but may need further tuning.
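Of the Rainbow ingredients listed, double Q-learning changes only how the bootstrap target is computed: the online network picks the greedy next action and the target network evaluates it. Below is a minimal plain-Python sketch, assuming per-transition lists of next-state Q-values; the function name and argument layout are hypothetical and not this repository's interface.

```python
def double_dqn_targets(rewards, dones, next_q_online, next_q_target, gamma=0.99):
    """One-step double Q-learning targets for a batch of transitions."""
    targets = []
    for r, done, q_online, q_target in zip(rewards, dones, next_q_online, next_q_target):
        if done:
            # Terminal transition: no bootstrapping from the next state
            targets.append(r)
            continue
        # The online network selects the greedy next action ...
        best_action = max(range(len(q_online)), key=lambda a: q_online[a])
        # ... and the target network evaluates it
        targets.append(r + gamma * q_target[best_action])
    return targets


if __name__ == "__main__":
    print(double_dqn_targets(
        rewards=[1.0, 0.5],
        dones=[False, True],
        next_q_online=[[0.1, 0.9], [5.0, 5.0]],
        next_q_target=[[2.0, 0.0], [1.0, 1.0]],
        gamma=0.5,
    ))  # → [1.0, 0.5]
```

Plain DQN would instead take the target network's own maximum, giving 1.0 + 0.5 * 2.0 = 2.0 for the first transition; decoupling action selection from evaluation is what reduces the overestimation bias.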
After that, model-based algorithms (not planned yet).
- Change how the reward is counted: the current approach may underestimate it (evaluate a deterministic policy rather than a stochastic/exploratory one)
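The reward-counting item above amounts to evaluating the policy's mean action instead of sampling from it. This is a minimal sketch, assuming a diagonal Gaussian policy whose mean is the deterministic action; `select_action`, `average_reward`, and the one-step toy reward are hypothetical names for illustration, and plain Python stands in for PyTorch.

```python
import random


def select_action(mean, std, deterministic, rng):
    """Gaussian policy head: the mean at evaluation time, a sample while exploring."""
    if deterministic:
        return list(mean)
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mean, std)]


def average_reward(reward_fn, mean, std, episodes, deterministic, seed=0):
    """Average one-step reward of the policy over a number of episodes."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        total += reward_fn(select_action(mean, std, deterministic, rng))
    return total / episodes


if __name__ == "__main__":
    # Hypothetical one-step task whose best action is 0.0
    def toy_reward(action):
        return -sum(a * a for a in action)

    print("stochastic:", average_reward(toy_reward, [0.0], [0.5], 1000, deterministic=False))
    print("deterministic:", average_reward(toy_reward, [0.0], [0.5], 1000, deterministic=True))
```

On a reward that peaks at the mean action, as in this toy task, exploration noise can only lower the average, which is one way stochastic evaluation may understate what the learned policy achieves.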
The PPO implementation is of high quality: it matches the performance of openai.baselines.