A collection of Python implementations of the RL algorithms for the examples and figures in Sutton & Barto, *Reinforcement Learning: An Introduction*.
- Numbering of the examples is based on the January 1, 2018 complete draft of the 2nd edition.
- Epsilon-greedy action-value methods (see the sketch after this list)
- Upper-Confidence-Bound action selection
- Gradient bandit algorithms
- State-value function estimation under uniform and optimal policy
- Iterative policy evaluation
- Policy iteration
- Value iteration
- First-visit MC
- Exploring starts MC
- Off-policy prediction via importance sampling
- TD(0)
- Batch updating TD(0) and constant-alpha MC
- Sarsa on-policy TD control
- Q-learning off-policy TD control
- Expected Sarsa
- Double Q-learning
- n-step TD
- n-step Sarsa
- Tabular Dyna-Q
- Planning and non-planning Dyna-Q
- Dyna-Q+ and prioritized sweeping for deterministic environments
- Trajectory sampling
- Gradient Monte Carlo
- Semi-gradient TD(0)
- n-step semi-gradient TD
- Gradient MC with Fourier and polynomial bases
- Coarse coding
- Tile coding
- State aggregation
- Episodic semi-gradient Sarsa
- n-step semi-gradient Sarsa
- Differential semi-gradient Sarsa
- Semi-gradient off-policy TD
- Semi-gradient DP
- TD(0) with gradient correction (TDC)
- Expected TDC
- Expected Emphatic TD
- Offline λ-return
- TD(λ)
- True online TD(λ)
- Sarsa(λ)
- REINFORCE
- REINFORCE with baseline
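
As a flavor of what these implementations look like, here is a minimal sketch of the epsilon-greedy action-value method from Chapter 2. The function and variable names are illustrative only, not the ones used in this repo:

```python
import numpy as np

def epsilon_greedy_bandit(q_true, epsilon=0.1, steps=1000, rng=None):
    """One run of epsilon-greedy on a k-armed Gaussian bandit.

    q_true  -- true action values, one per arm
    epsilon -- probability of taking a random (exploratory) action
    Returns the sequence of rewards received.
    """
    rng = rng or np.random.default_rng()
    k = len(q_true)
    q_est = np.zeros(k)       # sample-average action-value estimates
    counts = np.zeros(k)      # number of times each action was taken
    rewards = np.empty(steps)
    for t in range(steps):
        if rng.random() < epsilon:
            a = rng.integers(k)            # explore
        else:
            a = int(np.argmax(q_est))      # exploit (greedy action)
        r = rng.normal(q_true[a], 1.0)     # reward drawn from N(q*(a), 1)
        counts[a] += 1
        q_est[a] += (r - q_est[a]) / counts[a]  # incremental sample average
        rewards[t] = r
    return rewards

# Example: a 10-armed testbed with true values drawn from N(0, 1)
rewards = epsilon_greedy_bandit(np.random.default_rng(0).normal(size=10))
print(rewards.mean())
```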
A full list of the generated figures and tables is here.
The easiest way to run an example is to clone this repo and run
`python filename.py`
Dependencies:
- Python 3.6
- numpy
- scipy
- matplotlib
- seaborn
- tqdm
- tabulate
The key examples of each chapter are kept in separate files. There are inter-chapter dependencies, as examples are extended across topics. Base classes for a base RL agent, Gridworld, and tile coding are kept in separate modules and imported where relevant.
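
As an illustration of how that layout might be used, a chapter example could import and extend a shared base agent roughly as follows. The module and class names here are hypothetical, not necessarily those used in this repo:

```python
# Hypothetical sketch of the structure described above; actual class and
# module names in this repo may differ.
import numpy as np

class BaseAgent:
    """Shared tabular agent: epsilon-greedy action selection over Q-values."""
    def __init__(self, n_states, n_actions, epsilon=0.1, alpha=0.5, gamma=1.0):
        self.q = np.zeros((n_states, n_actions))
        self.epsilon, self.alpha, self.gamma = epsilon, alpha, gamma
        self.rng = np.random.default_rng()

    def act(self, state):
        if self.rng.random() < self.epsilon:
            return int(self.rng.integers(self.q.shape[1]))   # explore
        return int(np.argmax(self.q[state]))                 # greedy action

class QLearningAgent(BaseAgent):
    """A Chapter 6 style example: Q-learning update reusing the base agent."""
    def update(self, s, a, r, s_next):
        target = r + self.gamma * np.max(self.q[s_next])
        self.q[s, a] += self.alpha * (target - self.q[s, a])
```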