Fix Value Targets formula and description in A2C task in week6 (yande…

…xdataschool#476) Some time ago, in PR yandexdataschool#405 formula for Value Targets in A2C task in week6 was broken. Why it happened? Because actually there is some ambiguity in definitions of T. 1. T is partial trajectory while partial trajectory is "trajectory" input argument of "ComputeValueTargets.__call__". 2. T is still partial trajectory while partial trajectory for given moment "t" is part of input "trajectory" from moment "t" to the end (i.e. it's different at each iteration in the algorithm). And for different definitions different variants of formula can be created. But all these variants would be equivalent. Then how we can consider that formula is broken? E.g. because it doesn't pass sanity check: valid sequence of gamma powers. So, the original formula was for definition (2). And PR yandexdataschool#405 looks like an attempt to implement (1). So, let's revert it to previous version (2), i.e. change upper index of summation T-1-t back to T-1. What's also important, apart from formula fix, is update description for this formula. Because currently, as per student chat conversation, some students are confused because they don't know how to parse this phrase: "... list of interactions with the environment of specified length T — the size of partial trajectory." Either: "... list of interactions with the (environment of specified length T) — the size of partial trajectory." Or: "... (list of interactions with the environment) of specified length T — the size of partial trajectory." So, to make it less ambiguous, let's change it to: "... list of interactions with the environment. This list has length T that is the size of partial trajectory. Partial trajectory for given moment "t" is part of "ComputeValueTargets.__call__" input argument "trajectory" from moment "t" to the end (i.e. different at each iteration in the algorithm)."
suryabdev · Mar 30, 2021 · ea2dcdf · ea2dcdf
1 parent a17b5ec
commit ea2dcdf
Showing 1 changed file with 3 additions and 3 deletions.
diff --git a/week06_policy_based/a2c-optional.ipynb b/week06_policy_based/a2c-optional.ipynb
@@ -134,7 +134,7 @@
     "* 'actions'\n",
     "* all other keys that you defined in `Policy`\n",
     "\n",
-    "under each of these keys there is a python `list` of interactions with the environment of specified length $T$ &mdash; the size of partial trajectory. "
+    "under each of these keys there is a python `list` of interactions with the environment. This list has length $T$ that is size of partial trajectory. Partial trajectory for given moment `t` is part of `ComputeValueTargets.__call__` input argument `trajectory` from moment `t` to the end (i.e. it's different at each iteration in the algorithm)."
    ]
   },
   {
@@ -147,7 +147,7 @@
     "The formula for the value targets is simple:\n",
     "\n",
     "$$\n",
-    "\\hat v(s_t) = \\left( \\sum_{t'=0}^{T - 1 - t} \\gamma^{t'}r_{t+t'} \\right) + \\gamma^T \\hat{v}(s_{t+T}),\n",
+    "\\hat v(s_t) = \\left( \\sum_{t'=0}^{T - 1} \\gamma^{t'}r_{t+t'} \\right) + \\gamma^T \\hat{v}(s_{t+T}),\n",
     "$$\n",
     "\n",
     "In implementation, however, do not forget to use \n",
@@ -177,7 +177,7 @@
    "metadata": {},
    "source": [
     "After computing value targets we will transform lists of interactions into tensors\n",
-    "with the first dimension `batch_size` which is equal to `T * nenvs`, i.e. you essentially need\n",
+    "with the first dimension `batch_size` which is equal to `env_steps * nenvs`, i.e. you essentially need\n",
     "to flatten the first two dimensions. "
    ]
   },