experimenting with predictor model
rgilman33 committed Jan 7, 2018
1 parent b7ab871 commit 13404bf
Showing 9 changed files with 21,087 additions and 7,149 deletions.
1,688 changes: 1,688 additions & 0 deletions .ipynb_checkpoints/0_A2C-monte-carlo-pred-checkpoint.ipynb

Large diffs are not rendered by default.

9,502 changes: 9,502 additions & 0 deletions .ipynb_checkpoints/1_A2C-monte-carlo-checkpoint.ipynb

Large diffs are not rendered by default.

641 changes: 641 additions & 0 deletions .ipynb_checkpoints/2_A2C-nstep-checkpoint.ipynb

Large diffs are not rendered by default.

413 changes: 413 additions & 0 deletions .ipynb_checkpoints/3_A2C-nstep-TUTORIAL-checkpoint.ipynb

Large diffs are not rendered by default.

1,688 changes: 1,688 additions & 0 deletions 0_A2C-monte-carlo-pred.ipynb

Large diffs are not rendered by default.

14,191 changes: 7,099 additions & 7,092 deletions 1_A2C-monte-carlo.ipynb

Large diffs are not rendered by default.

4 changes: 0 additions & 4 deletions 2_A2C-nstep.ipynb
@@ -12,8 +12,6 @@
"import matplotlib.pyplot as plt\n",
"\n",
"import torch\n",
"\n",
"\n",
"import torch.nn as nn\n",
"import torch.nn.functional as F\n",
"import torch.optim as optim\n",
@@ -39,8 +37,6 @@
}
],
"source": [
"#N_STEPS = 5\n",
"\n",
"SEED = 2\n",
"N_ACTIONS = 2\n",
"N_INPUTS = 4\n",
103 changes: 52 additions & 51 deletions 3_A2C-nstep-TUTORIAL.ipynb
@@ -25,14 +25,14 @@
"source": [
"Simple A2C--Code\n",
"\n",
"This is a simple implementation of an Actor-Advantage-Critic (A2C) model. For an intuitive guide to the mechanics of the model itself please check out Simple A2C--Intuition. \n",
"This is a simple implementation of an Actor-Advantage-Critic (A2C) model. For an intuitive guide to the mechanics of the model itself please check out the comic in this repository. \n",
"\n",
"To keep things clear, we're using an easy challenge--Cartpole--and have pruned the A2C to only the necessary bits. We're building an n-step A2C with a single agent that takes in a simple Cartpole state as 4 float values, but notebooks for a Monte Carlo, multiple parallel agents, and raw pixels versions are in this directory as well. For a more industrial-strength A2C, check out our PyTorch implementation of the OpenAI Baselines A2C."
"To keep things clear, we're using an easy challenge--Cartpole--and have pruned the A2C to only the necessary bits, sacrificing a bit of performance. We're building an n-step A2C with a single agent that takes in a simple Cartpole state as 4 float values, but notebooks for a Monte Carlo, multiple parallel agents, and raw pixels versions are in this directory as well. For a more industrial-strength A2C, check out our PyTorch implementation of the OpenAI Baselines A2C."
]
},
{
"cell_type": "code",
"execution_count": 1,
"execution_count": 12,
"metadata": {
"scrolled": true
},
@@ -43,7 +43,7 @@
"\n",
"# LR of 3e-2 explodes the gradients, LR of 3e-4 trains slower\n",
"LR = 3e-3\n",
"N_GAMES = 1000\n",
"N_GAMES = 2000\n",
"\n",
"# OpenAI baselines uses nstep of 5.\n",
"N_STEPS = 20\n",
@@ -67,20 +67,11 @@
},
{
"cell_type": "code",
"execution_count": 22,
"execution_count": 13,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"<ipython-input-22-c0659aff9b30>:94: SyntaxWarning: name 'action_probs' is assigned to before global declaration\n",
" global state_values, action_probs\n"
]
}
],
"outputs": [],
"source": [
"state = env.reset()\n",
"finished_games = 0\n",
@@ -112,13 +103,6 @@
"The cell above contains everything. Now we'll go through and look at the individual parts of it."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
@@ -175,7 +159,7 @@
"metadata": {},
"outputs": [],
"source": [
"pic of MC returns"
"#pic of MC returns"
]
},
{
@@ -191,7 +175,7 @@
"metadata": {},
"outputs": [],
"source": [
"Pic of df backing up from bootstrapped v(s). Note that last value is an estimate."
"#Pic of df backing up from bootstrapped v(s). Note that last value is an estimate."
]
},
{
@@ -205,7 +189,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 7,
"metadata": {},
"outputs": [],
"source": [
@@ -217,7 +201,8 @@
" if dones[-1] == True: next_return = 0\n",
" \n",
" # If not terminal state, bootstrap v(s) using our critic\n",
" else: # just take from last value of states estimate\n",
" # TODO: don't need to estimate again, just take from last value of v(s) estimates\n",
" else: \n",
" s = torch.from_numpy(states[-1]).float().unsqueeze(0)\n",
" next_return = model.get_state_value(Variable(s)).data[0][0] \n",
" \n",
@@ -236,25 +221,6 @@
" return state_values_true"
]
},
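Since the backed-up loop itself is hidden by the collapsed hunk above, here is a self-contained sketch of how that n-step target calculation typically looks. The model.get_state_value call mirrors the lines visible in the diff; the loop body, the default gamma value, and treating states/rewards/dones as plain Python lists are assumptions rather than the notebook's exact code:

import torch
from torch.autograd import Variable

def calc_actual_state_values_sketch(model, states, rewards, dones, gamma=0.95):
    # gamma default is an assumed value; the notebook defines its own GAMMA constant.
    # Bootstrap from the critic unless the window ended on a terminal state.
    if dones[-1]:
        next_return = 0.0
    else:
        s = torch.from_numpy(states[-1]).float().unsqueeze(0)
        next_return = model.get_state_value(Variable(s)).data[0][0]

    # The last target in the window is the bootstrapped estimate itself; earlier
    # entries back it up one discounted reward at a time, resetting the running
    # return whenever an episode ended inside the window.
    targets = [next_return]
    for r, done in zip(reversed(rewards[:-1]), reversed(dones[:-1])):
        next_return = r if done else r + gamma * next_return
        targets.append(next_return)
    targets.reverse()

    return Variable(torch.FloatTensor(targets)).unsqueeze(1)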
{
"cell_type": "markdown",
"metadata": {},
"source": []
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
Expand All @@ -263,14 +229,12 @@
"\n",
"Once we have labels for our minibatch of training data, we treat it like we would any other supervised learning problem. We calculate the loss and backpropagate it through the model.\n",
"\n",
"Our first step is to send our states as input into the NN. In return we get a list of state values and a list of action recommendations. We use these, along with our lists of calculated true state values and actual actions taken to compute the advantage / TD error. \n",
"\n",
"Why couldn’t we just save our v(s) and action probabilities during the “gather data” phase? Because PyTorch needs a clear line of sight on all the calculations we’ve done in order to trace our graph during the backwards pass of our model. Check out PyTorch's docs on autodiff for more info."
"Our first step is to send our states as input into the NN. In return we get a list of state value predictions and a list of action recommendations. We use these, along with our lists of bootstrapped target state values and actual actions taken to compute the advantage / TD error. "
]
},
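To make that concrete outside the collapsed cell, here is a minimal sketch of how such a reflect step is typically assembled for an A2C like this one. The forward signature (probabilities and values returned together), the gather-based log-prob lookup, and the 0.5 critic weight are assumptions, not the notebook's exact code:

import numpy as np
import torch
from torch.autograd import Variable

def reflect_sketch(model, optimizer, states, actions, state_values_true):
    # Re-run the forward pass so PyTorch can trace the graph for the backward pass.
    s = Variable(torch.from_numpy(np.vstack(states)).float())   # (n, 4) Cartpole states
    action_probs, state_values_est = model(s)                   # assumed forward signature
    a = Variable(torch.LongTensor(actions)).view(-1, 1)

    # Advantage / TD error: how much better the target return is than the critic's guess.
    advantages = state_values_true - state_values_est

    # Critic loss: regress predicted v(s) toward the bootstrapped targets.
    value_loss = advantages.pow(2).mean()

    # Actor loss: raise log-probabilities of taken actions in proportion to advantage.
    chosen_log_probs = torch.log(action_probs.gather(1, a))
    action_loss = -(chosen_log_probs * advantages.detach()).mean()

    loss = action_loss + 0.5 * value_loss   # the 0.5 critic weight is an assumed choice
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss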
{
"cell_type": "code",
"execution_count": null,
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
@@ -311,7 +275,7 @@
},
{
"cell_type": "code",
"execution_count": null,
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
@@ -364,7 +328,7 @@
},
{
"cell_type": "code",
"execution_count": 4,
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
@@ -386,6 +350,43 @@
" state = next_state\n",
" return score"
]
},
{
"cell_type": "code",
"execution_count": 17,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"124"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"test_model(model)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are a number of improvements that we can make to this model:\n",
"\n",
"--You'll notice that after reaching a perfect score of 200, the model's performance fluctuates wildly. We can fix this by only training on episodes where a failure occured--perfect games have no training signal bc returns for all states are identical. \n",
"\n",
"-- Further limiting our training to only those frames directly before a failure seems to speed training as well (more work required here), perhaps because we're downsampling \"uninteresting\" observations that have no variation in returns. \n",
"\n",
"-- We're not recording our scores or losses throughout training. Other versions in this repo chart progress.\n",
"\n",
"-- We haven't experimented with other step sizes, which significantly affect training. Other a2cs in this repo show experiments along these lines\n",
"\n",
"-- We haven't added in multiple actors, whichs help by decorrelating training data"
]
}
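A minimal sketch of that first improvement, assuming the training loop keeps a list of per-step done flags; the reflect name below is illustrative, not the notebook's:

def batch_has_training_signal(dones):
    # Skip windows with no failure: every state then has an (essentially) identical
    # return, so the advantages are ~zero and the batch carries no useful gradient.
    return any(dones)

# Illustrative usage inside the gather/reflect loop:
# if batch_has_training_signal(dones):
#     reflect(states, actions, rewards, dones)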
],
"metadata": {
6 changes: 4 additions & 2 deletions README.md
@@ -5,10 +5,12 @@ The notebooks in this repo build an A2C from scratch in PyTorch, starting with a
Notebooks:
1) Monte Carlo A2C
2) Adding N-Step
3) A simplified version of 2a used for teaching purposes. Complement to comic (show link).
3) Code walk-through TUTORIAL: A simplified version of 2a used for teaching purposes. Complement to the comic.
4) Adding in multiple actors
5) Allowing the model to take in a stack of "frames" rather than a single frame. This is in preparation for the next step, when we add in a stack of frames from raw pixels.
6) Transitioning to raw pixel input. Changing FC NN to CNN.
6) Transitioning to raw pixel input. Changing FC NN to CNN. Takes hours to train on a p2.xlarge rather than seconds on a laptop.

0) MC A2C which is also trained to predict its own next state and reward. Currently being used for experiments in transfer learning, prediction, and data generation. If a model can predict its own future states, can it use this predictor to generate data for "mental training"?
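A hedged sketch of what such a predictor-augmented A2C could look like: a shared trunk with actor, critic, next-state, and reward heads. Layer sizes, head names, and the action-conditioning are illustrative assumptions, not the notebook's actual architecture:

import torch
import torch.nn as nn
import torch.nn.functional as F

class ActorCriticPredictor(nn.Module):
    # Illustrative sketch only: shared trunk plus actor, critic, and
    # next-state / reward prediction heads.
    def __init__(self, n_inputs=4, n_actions=2, hidden=64):
        super(ActorCriticPredictor, self).__init__()
        self.trunk = nn.Linear(n_inputs, hidden)
        self.actor = nn.Linear(hidden, n_actions)                   # action probabilities
        self.critic = nn.Linear(hidden, 1)                          # v(s)
        self.next_state = nn.Linear(hidden + n_actions, n_inputs)   # predicted next state
        self.reward = nn.Linear(hidden + n_actions, 1)              # predicted reward

    def forward(self, s, a_onehot):
        h = F.relu(self.trunk(s))
        probs = F.softmax(self.actor(h), dim=1)
        value = self.critic(h)
        ha = torch.cat([h, a_onehot], dim=1)   # condition predictions on the chosen action
        return probs, value, self.next_state(ha), self.reward(ha)

The prediction heads would be trained with a supervised loss (e.g. MSE against the observed next state and reward) added to the usual actor-critic loss.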

For a deeper dive into deep RL, these are my favorite resources:

