experimenting with predictor model
rgilman33 committed Jan 7, 2018
1 parent b7ab871 commit 13404bf
Showing 9 changed files with 21,087 additions and 7,149 deletions.
1,688 changes: 1,688 additions & 0 deletions .ipynb_checkpoints/0_A2C-monte-carlo-pred-checkpoint.ipynb

Large diffs are not rendered by default.

9,502 changes: 9,502 additions & 0 deletions .ipynb_checkpoints/1_A2C-monte-carlo-checkpoint.ipynb

Large diffs are not rendered by default.

641 changes: 641 additions & 0 deletions .ipynb_checkpoints/2_A2C-nstep-checkpoint.ipynb

Large diffs are not rendered by default.

413 changes: 413 additions & 0 deletions .ipynb_checkpoints/3_A2C-nstep-TUTORIAL-checkpoint.ipynb

Large diffs are not rendered by default.

1,688 changes: 1,688 additions & 0 deletions 0_A2C-monte-carlo-pred.ipynb

Large diffs are not rendered by default.

14,191 changes: 7,099 additions & 7,092 deletions 1_A2C-monte-carlo.ipynb

Large diffs are not rendered by default.

4 changes: 0 additions & 4 deletions 2_A2C-nstep.ipynb
@@ -12,8 +12,6 @@
"import matplotlib.pyplot as plt\n",
"\n",
"import torch\n",
"\n",
"\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
@@ -39,8 +37,6 @@
}
],
"source": [
"#N_STEPS = 5\n",
"\n",
"SEED = 2\n",
"N_ACTIONS = 2\n",
"N_INPUTS = 4\n",
103 changes: 52 additions & 51 deletions 3_A2C-nstep-TUTORIAL.ipynb
@@ -25,14 +25,14 @@
"source": [
"Simple A2C--Code\n",
"\n",
"This is a simple implementation of an Actor-Advantage-Critic (A2C) model. For an intuitive guide to the mechanics of the model itself please check out Simple A2C--Intuition. \n",
"This is a simple implementation of an Actor-Advantage-Critic (A2C) model. For an intuitive guide to the mechanics of the model itself please check out the comic in this repository. \n",
"\n",
"To keep things clear, we're using an easy challenge--Cartpole--and have pruned the A2C to only the necessary bits. We're building an n-step A2C with a single agent that takes in a simple Cartpole state as 4 float values, but notebooks for a Monte Carlo, multiple parallel agents, and raw pixels versions are in this directory as well. For a more industrial-strength A2C, check out our PyTorch implementation of the OpenAI Baselines A2C."
"To keep things clear, we're using an easy challenge--Cartpole--and have pruned the A2C to only the necessary bits, sacrificing a bit of performance. We're building an n-step A2C with a single agent that takes in a simple Cartpole state as 4 float values, but notebooks for a Monte Carlo, multiple parallel agents, and raw pixels versions are in this directory as well. For a more industrial-strength A2C, check out our PyTorch implementation of the OpenAI Baselines A2C."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 12,
"metadata": {
"scrolled": true
},
@@ -43,7 +43,7 @@
"\n",
"# LR of 3e-2 explodes the gradients, LR of 3e-4 trains slower\n",
"LR = 3e-3\n",
"N_GAMES = 1000\n",
"N_GAMES = 2000\n",
"\n",
"# OpenAI baselines uses nstep of 5.\n",
"N_STEPS = 20\n",
@@ -67,20 +67,11 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<ipython-input-22-c0659aff9b30>:94: SyntaxWarning: name 'action_probs' is assigned to before global declaration\n",
" global state_values, action_probs\n"
]
}
],
"outputs": [],
"source": [
"state = env.reset()\n",
"finished_games = 0\n",
@@ -112,13 +103,6 @@
"The cell above contains everything. Now we'll go through and look at the individual parts of it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -175,7 +159,7 @@
"metadata": {},
"outputs": [],
"source": [
"pic of MC returns"
"#pic of MC returns"
]
},
{
@@ -191,7 +175,7 @@
"metadata": {},
"outputs": [],
"source": [
"Pic of df backing up from bootstrapped v(s). Note that last value is an estimate."
"#Pic of df backing up from bootstrapped v(s). Note that last value is an estimate."
]
},
{
@@ -205,7 +189,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@@ -217,7 +201,8 @@
" if dones[-1] == True: next_return = 0\n",
" \n",
" # If not terminal state, bootstrap v(s) using our critic\n",
" else: # just take from last value of states estimate\n",
" # TODO: don't need to estimate again, just take from last value of v(s) estimates\n",
" else: \n",
" s = torch.from_numpy(states[-1]).float().unsqueeze(0)\n",
" next_return = model.get_state_value(Variable(s)).data[0][0] \n",
" \n",
@@ -236,25 +221,6 @@
" return state_values_true"
]
},
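Since the backed-up loop itself is hidden by the collapsed hunk above, here is a self-contained sketch of how that n-step target calculation typically looks. The model.get_state_value call mirrors the lines visible in the diff; the loop body, the default gamma value, and treating states/rewards/dones as plain Python lists are assumptions rather than the notebook's exact code:

import torch
from torch.autograd import Variable

def calc_actual_state_values_sketch(model, states, rewards, dones, gamma=0.95):
    # gamma default is an assumed value; the notebook defines its own GAMMA constant.
    # Bootstrap from the critic unless the window ended on a terminal state.
    if dones[-1]:
        next_return = 0.0
    else:
        s = torch.from_numpy(states[-1]).float().unsqueeze(0)
        next_return = model.get_state_value(Variable(s)).data[0][0]

    # The last target in the window is the bootstrapped estimate itself; earlier
    # entries back it up one discounted reward at a time, resetting the running
    # return whenever an episode ended inside the window.
    targets = [next_return]
    for r, done in zip(reversed(rewards[:-1]), reversed(dones[:-1])):
        next_return = r if done else r + gamma * next_return
        targets.append(next_return)
    targets.reverse()

    return Variable(torch.FloatTensor(targets)).unsqueeze(1)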
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand All @@ -263,14 +229,12 @@
"\n",
"Once we have labels for our minibatch of training data, we treat it like we would any other supervised learning problem. We calculate the loss and backpropagate it through the model.\n",
"\n",
"Our first step is to send our states as input into the NN. In return we get a list of state values and a list of action recommendations. We use these, along with our lists of calculated true state values and actual actions taken to compute the advantage / TD error. \n",
"\n",
"Why couldn’t we just save our v(s) and action probabilities during the “gather data” phase? Because PyTorch needs a clear line of sight on all the calculations we’ve done in order to trace our graph during the backwards pass of our model. Check out PyTorch's docs on autodiff for more info."
"Our first step is to send our states as input into the NN. In return we get a list of state value predictions and a list of action recommendations. We use these, along with our lists of bootstrapped target state values and actual actions taken to compute the advantage / TD error. "
]
},
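To make that concrete outside the collapsed cell, here is a minimal sketch of how such a reflect step is typically assembled for an A2C like this one. The forward signature (probabilities and values returned together), the gather-based log-prob lookup, and the 0.5 critic weight are assumptions, not the notebook's exact code:

import numpy as np
import torch
from torch.autograd import Variable

def reflect_sketch(model, optimizer, states, actions, state_values_true):
    # Re-run the forward pass so PyTorch can trace the graph for the backward pass.
    s = Variable(torch.from_numpy(np.vstack(states)).float())   # (n, 4) Cartpole states
    action_probs, state_values_est = model(s)                   # assumed forward signature
    a = Variable(torch.LongTensor(actions)).view(-1, 1)

    # Advantage / TD error: how much better the target return is than the critic's guess.
    advantages = state_values_true - state_values_est

    # Critic loss: regress predicted v(s) toward the bootstrapped targets.
    value_loss = advantages.pow(2).mean()

    # Actor loss: raise log-probabilities of taken actions in proportion to advantage.
    chosen_log_probs = torch.log(action_probs.gather(1, a))
    action_loss = -(chosen_log_probs * advantages.detach()).mean()

    loss = action_loss + 0.5 * value_loss   # the 0.5 critic weight is an assumed choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss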
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -311,7 +275,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -364,7 +328,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -386,6 +350,43 @@
" state = next_state\n",
" return score"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"124"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_model(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a number of improvements that we can make to this model:\n",
"\n",
"--You'll notice that after reaching a perfect score of 200, the model's performance fluctuates wildly. We can fix this by only training on episodes where a failure occured--perfect games have no training signal bc returns for all states are identical. \n",
"\n",
"-- Further limiting our training to only those frames directly before a failure seems to speed training as well (more work required here), perhaps because we're downsampling \"uninteresting\" observations that have no variation in returns. \n",
"\n",
"-- We're not recording our scores or losses throughout training. Other versions in this repo chart progress.\n",
"\n",
"-- We haven't experimented with other step sizes, which significantly affect training. Other a2cs in this repo show experiments along these lines\n",
"\n",
"-- We haven't added in multiple actors, whichs help by decorrelating training data"
]
}
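A minimal sketch of that first improvement, assuming the training loop keeps a list of per-step done flags; the reflect name below is illustrative, not the notebook's:

def batch_has_training_signal(dones):
    # Skip windows with no failure: every state then has an (essentially) identical
    # return, so the advantages are ~zero and the batch carries no useful gradient.
    return any(dones)

# Illustrative usage inside the gather/reflect loop:
# if batch_has_training_signal(dones):
#     reflect(states, actions, rewards, dones)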
],
"metadata": {
6 changes: 4 additions & 2 deletions README.md
@@ -5,10 +5,12 @@ The notebooks in this repo build an A2C from scratch in PyTorch, starting with a
Notebooks:
1) Monte Carlo A2C
2) Adding N-Step
3) A simplified version of 2a used for teaching purposes. Complement to comic (show link).
3) Code walk-through TUTORIAL: A simplified version of 2a used for teaching purposes. Complement to the comic.
4) Adding in multiple actors
5) Allowing the model to take in a stack of "frames" rather than a single frame. This is in preparation for the next step, when we add in a stack of frames from raw pixels.
6) Transitioning to raw pixel input. Changing FC NN to CNN.
6) Transitioning to raw pixel input. Changing FC NN to CNN. Takes hours to train on a p2.xlarge rather than seconds on a laptop.

0) MC A2C which is also trained to predict its own next state and reward. Currently being used for experiments in transfer learning, prediction, and data generation. If a model can predict its own future states, can it use this predictor to generate data for "mental training"?
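A hedged sketch of what such a predictor-augmented A2C could look like: a shared trunk with actor, critic, next-state, and reward heads. Layer sizes, head names, and the action-conditioning are illustrative assumptions, not the notebook's actual architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticPredictor(nn.Module):
    # Illustrative sketch only: shared trunk plus actor, critic, and
    # next-state / reward prediction heads.
    def __init__(self, n_inputs=4, n_actions=2, hidden=64):
        super(ActorCriticPredictor, self).__init__()
        self.trunk = nn.Linear(n_inputs, hidden)
        self.actor = nn.Linear(hidden, n_actions)                   # action probabilities
        self.critic = nn.Linear(hidden, 1)                          # v(s)
        self.next_state = nn.Linear(hidden + n_actions, n_inputs)   # predicted next state
        self.reward = nn.Linear(hidden + n_actions, 1)              # predicted reward

    def forward(self, s, a_onehot):
        h = F.relu(self.trunk(s))
        probs = F.softmax(self.actor(h), dim=1)
        value = self.critic(h)
        ha = torch.cat([h, a_onehot], dim=1)   # condition predictions on the chosen action
        return probs, value, self.next_state(ha), self.reward(ha)

The prediction heads would be trained with a supervised loss (e.g. MSE against the observed next state and reward) added to the usual actor-critic loss.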

For a deeper dive into deep RL, these are my favorite resources:

