
Commit

Yishay's changes to Ch2
avivt committed Feb 24, 2022
1 parent 804f4f3 commit 2162b3d
Showing 3 changed files with 121 additions and 38 deletions.
81 changes: 81 additions & 0 deletions bib-lecture.bib
@@ -405,4 +405,85 @@ @article{mnih2015human
pages={529--533},
year={2015},
publisher={Nature Publishing Group}
}

@article{Samuel62,
author = {Arthur L. Samuel},
title = {Artificial intelligence - a frontier of automation},
journal = {Elektron. Rechenanlagen},
volume = {4},
number = {4},
pages = {173--177},
year = {1962},
url = {https://doi.org/10.1524/itit.1962.4.16.173},
doi = {10.1524/itit.1962.4.16.173},
timestamp = {Mon, 18 May 2020 12:40:49 +0200},
biburl = {https://dblp.org/rec/journals/it/Samuel62.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DeepBlue,
title = {Deep Blue},
journal = {Artificial Intelligence},
volume = {134},
number = {1},
pages = {57--83},
year = {2002},
issn = {0004-3702},
doi = {10.1016/S0004-3702(01)00129-1},
url = {https://www.sciencedirect.com/science/article/pii/S0004370201001291},
author = {Murray Campbell and A. Joseph Hoane and Feng-hsiung Hsu},
keywords = {Computer chess, Game tree search, Parallel search, Selective search, Search extensions, Evaluation function},
abstract = {Deep Blue is the chess machine that defeated then-reigning World Chess Champion Garry Kasparov in a six-game match in 1997. There were a number of factors that contributed to this success, including: a single-chip chess search engine, a massively parallel system with multiple levels of parallelism, a strong emphasis on search extensions, a complex evaluation function, and effective use of a Grandmaster game database. This paper describes the Deep Blue system, and gives some of the rationale that went into the design decisions behind Deep Blue.}
}

@article{Karp78,
author = {Richard M. Karp},
title = {A characterization of the minimum cycle mean in a digraph},
journal = {Discrete Mathematics},
volume = {23},
number = {3},
pages = {309--311},
year = {1978},
url = {https://doi.org/10.1016/0012-365X(78)90011-0},
doi = {10.1016/0012-365X(78)90011-0},
timestamp = {Fri, 12 Feb 2021 13:44:46 +0100},
biburl = {https://dblp.org/rec/journals/dm/Karp78.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{ChaturvediM17,
author = {Mmanu Chaturvedi and
Ross M. McConnell},
title = {A note on finding minimum mean cycle},
journal = {Information Processing Letters},
volume = {127},
pages = {21--22},
year = {2017},
url = {https://doi.org/10.1016/j.ipl.2017.06.007},
doi = {10.1016/j.ipl.2017.06.007},
timestamp = {Tue, 12 Sep 2017 17:58:15 +0200},
biburl = {https://dblp.org/rec/journals/ipl/ChaturvediM17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@book{cormen2009introduction,
title={Introduction to Algorithms},
author={Cormen, Thomas H and Leiserson, Charles E and Rivest, Ronald L and Stein, Clifford},
year={2009},
publisher={MIT Press}
}

@book{KleinbergTardos06,
author = {Kleinberg, Jon and Tardos, \'Eva},
publisher = {Addison Wesley},
title = {Algorithm Design},
year = 2006
}
@book{DasguptaPapadimitriouVazirani08,
author = {Sanjoy Dasgupta and
Christos H. Papadimitriou and
Umesh V. Vazirani},
title = {Algorithms},
publisher = {McGraw-Hill},
year = {2008}
}
6 changes: 3 additions & 3 deletions current_chapters/chapter1-intro.tex
@@ -36,7 +36,7 @@ \section{Motivation for RL}

Over the years, reinforcement learning has proven to be highly
successful for playing board games that require long horizon planning.
Early in 1962, Arthur Samuel developed a checkers game, which was at
As early as 1962, Arthur Samuel \cite{Samuel62} developed a checkers-playing program, which was at
the level of the best human players. His original framework included many of
the ingredients which later contributed to RL,
as well as search heuristics for large domains.
@@ -51,7 +51,7 @@ \section{Motivation for RL}

To complete the picture of computer board games, we should mention
Deep Blue, which in 1997 was able to beat the then-reigning world champion,
Kasparov. This program mainly built on heuristic search, and new hardware was developed to support it. Recently, DeepMind's
Kasparov \cite{DeepBlue}. This program was built mainly on heuristic search, and new hardware was developed to support it. Recently, DeepMind's
AlphaZero matched the best chess
programs (which are already much better than any human players), using a reinforcement learning approach \cite{silver2017mastering}.

@@ -100,7 +100,7 @@ \section{Book Organization}
then, in Chapter \ref{chapter:MDP-FH} we introduce the finite horizon MDP model and a fundamental dynamic programming approach. Chapter \ref{chapter:disc} covers the infinite horizon discounted setting.
% and episodic settings, respectively.

\paragraph{Learning:} The learning theme covers decision making when the MDP model is \textit{not known in advance}. Chapter \ref{chapter-model-based} introduces the \textit{model-based} approach, where the agent explicitly learns an MDP model from its experience and uses it for planning decisions. Chapter \ref{chapter:learning-model-free} covers an alternative \textit{model-free} approach, where decisions are learned without explicitly building a model. Chapters \ref{chapter:function-approximation} and \ref{chapter:policy-gradient} address learning of approximately optimal solutions in \textit{large} problems, that is, problems where the underlying MDP model is intractable to solve. Chapter \ref{chapter:function-approximation} approaches this topic using approximation of the value function, while Chapter \ref{chapter:policy-gradient} considers policy approximations.
\paragraph{Learning:} The learning theme covers decision making when the MDP model is \textit{not known in advance}. Chapter \ref{chapter-model-based} introduces the \textit{model-based} approach, where the agent explicitly learns an MDP model from its experience and uses it for planning decisions. Chapter \ref{chapter:learning-model-free} covers an alternative \textit{model-free} approach, where decisions are learned without explicitly building a model. Chapters \ref{chapter:function-approximation} and \ref{chapter:policy-gradient} address learning of approximately optimal solutions in \textit{large} problems, that is, problems where the underlying MDP model is intractable to solve. Chapter \ref{chapter:function-approximation} approaches this topic using approximation of the value function, while Chapter \ref{chapter:policy-gradient} considers policy approximations. In Chapter \ref{chapter:MAB} we consider the special case of Multi-Armed Bandits, which can be viewed as an MDP with a single state and unknown rewards.
% To complete the picture, Chapter \ref{chapter:tree-based-search} considers online planning using tree-search methods.
% \section{Markov Decision Process (MDP)}

72 changes: 37 additions & 35 deletions current_chapters/chapter2-ddp.tex
@@ -35,7 +35,7 @@ \section{Discrete Dynamic Systems}
\Actions({\state_{\ttime}}).\]
\end{remark}
\begin{remark}
The state dynamics may be augmented by an output equation:
The state dynamics may be augmented by an observation equation:
\[{\observation_{\ttime}} = {\fObservation_{\ttime}}({\state_{\ttime}},{\action_{\ttime}}),\]
where $\observation_{\ttime}$ is the system observation, or the
output. In most of this book we implicitly assume that
@@ -413,7 +413,7 @@ \subsection{Reduction between control policies classes}
cost to go from $\state_\ttime$, given that we follow $\policy$ from
$\ttime+1$ to $\tHorizon$. Therefore the cost can only decrease.
Formally, let $\E^\policy[\cdot]$ be the expectation with respect to
policy $\policy$.
policy $\policy$. We have,
\begin{align*}
\E^\policy_{\state_\ttime}[\Cost_{\ttime}(\state_\ttime)]
%=\E^\policy[\Cost(\state_\ttime, \ldots , \state_\tHorizon)]
@@ -506,7 +506,7 @@ \subsection{Optimal Control Policies}
them all.

Fortunately, Dynamic Programming offers a drastic reduction of the
computational complexity for this problem.
computational complexity for this problem, as presented in the next section.
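
As a preview of that section, the following is a minimal Python sketch of the backward recursion for a deterministic system with finite state and action spaces. The names (\texttt{states}, \texttt{actions}, \texttt{f} for the dynamics, \texttt{c} for the stage cost, \texttt{c\_final} for the terminal cost) are illustrative and not the book's notation.

\begin{verbatim}
# Illustrative sketch of finite-horizon dynamic programming for a
# deterministic system; names are not the book's notation.
def finite_horizon_dp(states, actions, f, c, c_final, T):
    """f(t, s, a) -> next state, c(t, s, a) -> stage cost,
    c_final(s) -> terminal cost, T -> horizon length."""
    V = {s: c_final(s) for s in states}          # V_T
    policy = [dict() for _ in range(T)]
    for t in reversed(range(T)):                 # t = T-1, ..., 0
        V_new = {}
        for s in states:
            best_a, best_v = None, float("inf")
            for a in actions(s):
                v = c(t, s, a) + V[f(t, s, a)]   # cost-to-go via a
                if v < best_v:
                    best_a, best_v = a, v
            V_new[s], policy[t][s] = best_v, best_a
        V = V_new
    return V, policy   # V[s] = optimal cost-to-go from s at time 0
\end{verbatim}

The total work is on the order of $\tHorizon$ times the number of state--action pairs, in contrast to enumerating all action sequences, whose number grows exponentially with $\tHorizon$.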

\section{Finite Horizon Dynamic Programming}

@@ -695,7 +695,7 @@ \section{Finite Horizon Dynamic Programming}

% \section{Shortest Paths}
% We can formulate a DDP problems similar to shortest path problems.
% Given a directed graph $G(V,E)$, there is a set of goal states
% Given a directed graph $\graph(\nodes,\edges)$, there is a set of goal states
% $\States_G$, and the goal is to reach one of the goal states.
% Formally, when we reach a goal state we stay there and have a zero
% cost. For such a DDP the optimal policy would be to compute a
@@ -857,10 +857,11 @@ \section{Finite Horizon Dynamic Programming}

\section{Shortest Path on a Graph}
The problem of finding the shortest path over a graph is one of the most fundamental problems in graph theory and computer science. We shall briefly consider here three major algorithms for this problem that are closely related to dynamic programming, namely the Bellman-Ford algorithm, Dijkstra's algorithm, and A$^*$.
An extensive presentation of the topic can be found in almost any book on algorithms, such as \cite{cormen2009introduction,KleinbergTardos06,DasguptaPapadimitriouVazirani08}.

\subsection{Problem Statement}
We introduce several definitions from graph theory.
\begin{definition}\textbf{Weighted Graphs:} Consider a graph $\graph = (\nodes,\edges)$ that consists of a finite set of vertices (or nodes) $\nodes = \{ \node\} $ and a finite set of edges (or links) $\edges = \{ \edge\} $. We will consider directed graphs, where each edge $\edge$ is equivalent to an ordered pair $({\node_1},{\node_2}) \equiv (s(\edge),d(\edge))$ of vertices. To each edge we assign a real-valued weight (or cost) $\cost(\edge) = \cost({\node_1},{\node_2})$.
\begin{definition}\textbf{Weighted Graphs:} Consider a graph $\graph = (\nodes,\edges)$ that consists of a finite set of vertices (or nodes) $\nodes = \{ \node\} $ and a finite set of edges (or links) $\edges = \{ \edge\} \subseteq \nodes\times\nodes$. We will consider directed graphs, where each edge $\edge$ is equivalent to an ordered pair $({\node_1},{\node_2}) \equiv (s(\edge),d(\edge))$ of vertices. To each edge we assign a real-valued weight (or cost) $\cost(\edge) = \cost({\node_1},{\node_2})$.
\end{definition}
\begin{definition}\textbf{Path:}
A path $\spath$ on $\graph$ from ${\node_0}$ to ${\node_k}$ is a sequence $({\node_0},{\node_1},{\node_2}, \ldots ,{\node_k})$ of vertices such that $({\node_i},{\node_{i + 1}}) \in \edges$. A path is \textbf{simple} if all edges in the path are distinct.
@@ -874,10 +875,10 @@ \subsection{Problem Statement}
A \textbf{shortest path} from $u$ to $v$ is a path from $u$ to $v$ that has the smallest length $\pathlen(\spath)$ among such paths. Denote this minimal length as $\minlen(u,v)$ (with $\minlen(u,v) = \infty $ if no path exists from $u$ to $v$).
The shortest path problem has the following variants:
\begin{itemize}
\item Single pair problem: Find the shortest path from a given source vertex $s$ to a given destination vertex $t$.
\item Single source problem: Find the shortest path from a given source vertex $s$ to all other vertices.
\item Single destination: Find the shortest path to a given destination node $t$ from all other vertices.
\item All pair problem.
\item Single pair problem: Find the shortest path from a given source vertex $u$ to a given destination vertex $v$.
\item Single source problem: Find the shortest path from a given source vertex $u$ to all other vertices.
\item Single destination: Find the shortest path to a given destination node $v$ from all other vertices.
\item All-pairs problem: Find the shortest path from every source vertex $u$ to every destination vertex $v$.
\end{itemize}

We note that the single-source and single-destination problems are symmetric and can be treated as one. The all-pairs problem can of course be solved by multiple applications of the other algorithms, but there exist algorithms that are especially suited for this problem.
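
For illustration only, the sketches in this section assume a weighted directed graph represented as a dictionary mapping each vertex to its list of outgoing edges; this representation is not part of the text.

\begin{verbatim}
# A small weighted directed graph (illustrative): each vertex maps to a
# list of (neighbor, cost) pairs for its outgoing edges.
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 6.0)],
    "c": [("d", 3.0)],
    "d": [],
}
\end{verbatim}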
@@ -942,7 +943,10 @@ \subsection{The Bellman-Ford Algorithm}
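To make the relaxation step at the heart of Bellman-Ford concrete, here is a minimal Python sketch of the single-destination version, using the dictionary representation above; the function name and the early-exit check are illustrative and not the book's pseudocode.

\begin{verbatim}
def bellman_ford_to_dest(graph, dest):
    """Shortest-path cost from every vertex to dest, assuming no
    negative-cost cycles.  graph: {v: [(u, cost), ...]}."""
    d = {v: float("inf") for v in graph}
    d[dest] = 0.0
    for _ in range(len(graph) - 1):        # at most |V|-1 rounds
        changed = False
        for v, edges in graph.items():
            for u, w in edges:
                if w + d[u] < d[v]:        # relax the edge (v, u)
                    d[v] = w + d[u]
                    changed = True
        if not changed:                    # no update: distances are final
            break
    return d
\end{verbatim}

Each round relaxes every edge once, which gives the $O(|\nodes|\cdot|\edges|)$ running time of Bellman-Ford.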
\subsection{Dijkstra's Algorithm}
Dijkstra's algorithm (introduced in 1959) provides a more efficient solution to the single-destination shortest path problem. The algorithm is restricted to non-negative link weights, i.e., $\cost(\nodev,\nodeu) \ge 0$.

The algorithm essentially determines the minimal distance $d(\nodev,\nodev_d)$ of the vertices to the destination in order of that distance, namely the closest vertex first, then the second-closest, etc. The algorithm is roughly described below, with more details in the recitation.
The algorithm essentially determines the minimal distance $d(\nodev,\nodev_d)$ of the vertices to the destination in order of that distance, namely the closest vertex first, then the second-closest, etc. The algorithm is
%roughly
described below.
%, with more details in the recitation.
The algorithm maintains a set $\vertexset$ of vertices whose minimal distance to the destination has been determined. The other vertices $\nodes\backslash \vertexset$ are held in a queue. It proceeds as follows.

\begin{algorithm_}\textbf{Dijkstra's Algorithm}
@@ -966,7 +970,7 @@ \subsection{Dijkstra's Algorithm}
\tab{\tab{if $d[\nodev] > \cost(\nodev,\nodeu) + d[\nodeu]$,}}

\tab{\tab{\tab{ set $d[\nodev] = \cost(\nodev,\nodeu) + d[\nodeu]$, $\policy [\nodev] = \nodeu$ }}}
\item return $\{ d[\nodev],\policy [\nodev] \ |\ v \in V\} $
\item return $\{ d[\nodev],\policy [\nodev] \ |\ v \in \nodes\} $
\end{enumerate}
\end{algorithm_}
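
The following is a minimal Python sketch of this procedure, with a binary heap playing the role of the queue; as in the pseudocode above, $d[\nodev]$ is the cost to the destination and $\policy[\nodev]$ the next hop. The outgoing-edge dictionary representation and the names are illustrative.

\begin{verbatim}
import heapq

def dijkstra_to_dest(graph, dest):
    """Single-destination Dijkstra; all edge costs must be >= 0.
    graph: {v: [(u, cost), ...]} (outgoing edges)."""
    preds = {u: [] for u in graph}   # for each u, the edges (v, cost) entering u
    for v, edges in graph.items():
        for u, w in edges:
            preds[u].append((v, w))
    d = {v: float("inf") for v in graph}
    policy = {v: None for v in graph}       # next hop toward dest
    d[dest] = 0.0
    heap, done = [(0.0, dest)], set()
    while heap:
        du, u = heapq.heappop(heap)         # closest unfinished vertex
        if u in done:
            continue
        done.add(u)                         # d[u] is now final
        for v, w in preds[u]:               # relax incoming edges (v, u)
            if w + du < d[v]:
                d[v] = w + du
                policy[v] = u
                heapq.heappush(heap, (d[v], v))
    return d, policy
\end{verbatim}

For example, calling \texttt{dijkstra\_to\_dest} on the small graph above with destination \texttt{d} returns the cost-to-go values and the next-hop policy toward \texttt{d}. With a binary heap the running time is $O(|\edges|\log|\nodes|)$.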

Expand Down Expand Up @@ -1083,8 +1087,8 @@ \section{Average cost criteria}
simple cycle, and the average cost is the average cost of the edges
on the cycle. (Recall, we are considering only DDP.)

Given a directed graph $G(V,E)$, let $\Omega$ be the collection of
all cycles in $G(V,E)$. For each cycle $\omega=(v_1, \ldots ,
Given a directed graph $\graph(\nodes,\edges)$, let $\Omega$ be the collection of
all cycles in $\graph(\nodes,\edges)$. For each cycle $\omega=(v_1, \ldots ,
v_{k})$, we define $c(\omega)=\sum_{i=1}^k c(v_i,v_{i+1})$, where
$(v_i,v_{i+1})$ is the $i$-th edge in the cycle $\omega$. Let
$\mu(\omega)=\frac{c(\omega)}{k}$. The {\em minimum average cost cycle}
@@ -1120,7 +1124,7 @@ \section{Average cost criteria}
Delete $\omega$ from $\theta$, reducing the number of edges by
$|\omega|$ and the cumulative cost by $\mu(\omega)|\omega|$. We
continue the process until there are no remaining cycles, which
implies that we have at most $|V|=n$ nodes remaining. Therefore, the
implies that we have at most $|\nodes|=n$ nodes remaining. Therefore, the
cost of the deleted cycles was at least $(\tHorizon-n)\mu^*$. This implies
that the average cost of $\theta$ is at least $\E[\Cost_{avg}^{\policy'}]=\mu^*-\epsilon\geq
(1-\frac{n}{\tHorizon})\mu^*$. For $\epsilon>\mu^* n/\tHorizon$ we
@@ -1129,21 +1133,20 @@ \section{Average cost criteria}

Next we develop an algorithm for computing the minimum average cost
cycle, which implies an optimal policy for DDP for average costs.
The input is a directed graph $G(V,E)$ with edge cost $\cost:E\rightarrow {\mathbb R}$.
The input is a directed graph $\graph(\nodes,\edges)$ with edge cost $\cost:\edges\rightarrow {\mathbb R}$.

%The algorithm (due to Karp [1978]). \begin Set a
We first give a characterization of $\mu^*$. Set a root $r\in V$.
We first give a characterization of $\mu^*$. Set a root $r\in \nodes$.
Let $F_{k}(v)$ be the set of paths of length $k$ from $r$ to $v$. Let
$d_{k}(v)=\min_{p\in F_{k}(v)} \cost(p)$, where if
$F_{k}(v)=\emptyset$ then $d_{k}(v)=\infty$. The following theorem
(due to [R. Karp, A characterization of the minimum cycle mean in
digraph, Discrete mathematics, 1978]) gives a characterization of
$\mu^*$.
$F_{k}(v)=\emptyset$ then $d_{k}(v)=\infty$. The following theorem of Karp \cite{Karp78}
%(due to [R. Karp, A characterization of the minimum cycle mean in digraph, Discrete mathematics, 1978])
gives a characterization of $\mu^*$.

\begin{theorem}
The value of the minimum average cost cycle is
\[
\mu^*=\min_{v\in V} \max_{0\leq k \leq n-1}
\mu^*=\min_{v\in \nodes} \max_{0\leq k \leq n-1}
\frac{d_n(v)-d_{k}(v)}{n-k}\;,
\]
where we define $\infty-\infty$ as $\infty$.
@@ -1154,20 +1157,20 @@ \section{Average cost criteria}
has no negative cycle (we can guarantee this by adding a large
number $M$ to all the weights).

We start with $\mu^*=0$. This implies that we have in $G(V,E)$ a
We first consider the case $\mu^*=0$. This implies that we have in $\graph(\nodes,\edges)$ a
cycle of weight zero, but no negative cycle. For the theorem it is
sufficient to show that
\[
\min_{v\in \nodes} \max_{0\leq k \leq n-1} \{d_n(v)-d_{k}(v)\}=0.
\]

For every node $v\in V$ there is a path of length $k\in[0,n-1]$ of
For every node $v\in \nodes$ there is a path of length $k\in[0,n-1]$ of
cost $d(v)$, the cost of the shortest path from $r$ to $v$. This
implies that
\[
\max_{0\leq k \leq n-1} \{d_n(v)-d_{k}(v)\}=d_n(v)-d(v)\geq 0
\]
We need to show that for some $v\in V$ we have $d_n(v)=d(v)$, which
We need to show that for some $v\in \nodes$ we have $d_n(v)=d(v)$, which
implies that $\min_{v\in \nodes} \{d_n(v)-d(v)\}=0$.

Consider a cycle $\omega$ of cost $\Cost(\omega)=0$ (there is one,
@@ -1176,7 +1179,7 @@ \section{Average cost criteria}
length at least $n$. The path $P$ is a shortest path to $v$
(although not necessarily simple). This implies that any sub-path of
$P$ is also a shortest path. Let $P'$ be a sub-path of $P$ of length
$n$ and let it end in $u\in V$.
$n$ and let it end in $u\in \nodes$.
%
Path $P'$ is a shortest path to $u$, since it is a prefix of a
shortest path $P$.
@@ -1192,24 +1195,24 @@ \section{Average cost criteria}
case. It only remains to show that the formula changes by exactly
$\Delta=\mu^*$.

Formally, for every edge $e\in E$ let $\cost'(e)=\cost(e)-\Delta$.
Formally, for every edge $e\in \edges$ let $\cost'(e)=\cost(e)-\Delta$.
For any path $p$ we have $\Cost'(p)=\Cost(p)-|p|\Delta$, and for any
cycle $\omega$ we have $\mu'(\omega)=\mu(\omega)-\Delta$. This
implies that for $\Delta=\mu^*$ we have a cycle of cost zero and no
negative cycles. We now consider the formula,
\begin{align*}
0=(\mu')^*=&\min_{v\in V} \max_{0\leq k\leq n-1}
0=(\mu')^*=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d'_n(v)-d'_{k}(v)}{n-k}\}\\
=&\min_{v\in V} \max_{0\leq k\leq n-1}
=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-n\Delta-d_{k}(v)+k\Delta}{n-k}\}\\
=&\min_{v\in V} \max_{0\leq k\leq n-1}
=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-d_{k}(v)}{n-k}-\Delta\}\\
=&\min_{v\in V} \max_{0\leq k\leq n-1}
=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-d_{k}(v)}{n-k}\}-\Delta
\end{align*}
Therefore we have
\[
\mu^*=\Delta=\min_{v\in V} \max_{0\leq k\leq n-1}
\mu^*=\Delta=\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-d_{k}(v)}{n-k}\}
\]
which completes the proof.
@@ -1221,12 +1224,11 @@ \section{Average cost criteria}
minimizing pair $(v,k)$ the path of length $n$ from $r$ to $v$ has a
cycle of length $n-k$, which is the suffix of the path. The solution
is that for the path $p$, from $r$ to $v$ of length $n$, any simple
cycle is a minimum average cost cycle. (See, [``A note of finding
minimum mean cycle'', Mmamu Chaturvedi and Ross M. McConnell, IPL
2017]).
cycle is a minimum average cost cycle. (See \cite{ChaturvediM17}.)
%(See, [``A note of finding minimum mean cycle'', Mmamu Chaturvedi and Ross M. McConnell, IPL 2017]).

The running time of computing the minimum average cost cycle is
$O(|V|\cdot |E|)$.
$O(|\nodes|\cdot |\edges|)$.
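
The characterization above yields the algorithm directly: compute $d_k(v)$ for $k=0,\ldots,n$ by one relaxation pass over the edges per level, and then evaluate Karp's formula. Below is a minimal Python sketch; the names are illustrative, and it assumes every vertex is reachable from the chosen root (which can always be arranged, without changing $\mu^*$, by adding an artificial root with zero-cost edges to all vertices).

\begin{verbatim}
def karp_min_mean_cycle(graph):
    """Karp's formula for the minimum average cost cycle.
    graph: {v: [(u, cost), ...]}; assumes every vertex is reachable
    from the chosen root."""
    nodes = list(graph)
    n = len(nodes)
    INF = float("inf")
    root = nodes[0]
    # d[k][v] = minimum cost of a walk with exactly k edges from root to v.
    d = [{v: INF for v in nodes} for _ in range(n + 1)]
    d[0][root] = 0.0
    for k in range(1, n + 1):
        for v, edges in graph.items():
            if d[k - 1][v] == INF:
                continue
            for u, w in edges:
                if d[k - 1][v] + w < d[k][u]:
                    d[k][u] = d[k - 1][v] + w
    best = INF                                # will hold mu^*
    for v in nodes:
        if d[n][v] == INF:                    # infinity - infinity = infinity
            continue
        worst = max((d[n][v] - d[k][v]) / (n - k)
                    for k in range(n) if d[k][v] < INF)
        best = min(best, worst)
    return best
\end{verbatim}

The nested loops perform $n$ relaxation passes over all edges, matching the $O(|\nodes|\cdot|\edges|)$ bound stated above.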



