
Commit

Yishay's changes to Ch2
avivt committed Feb 24, 2022
1 parent 804f4f3 commit 2162b3d
Showing 3 changed files with 121 additions and 38 deletions.
81 changes: 81 additions & 0 deletions bib-lecture.bib
@@ -405,4 +405,85 @@ @article{mnih2015human
pages={529--533},
year={2015},
publisher={Nature Publishing Group}
}

@article{Samuel62,
author = {Arthur L. Samuel},
title = {Artificial intelligence - a frontier of automation},
journal = {Elektron. Rechenanlagen},
volume = {4},
number = {4},
pages = {173--177},
year = {1962},
url = {https://doi.org/10.1524/itit.1962.4.16.173},
doi = {10.1524/itit.1962.4.16.173},
timestamp = {Mon, 18 May 2020 12:40:49 +0200},
biburl = {https://dblp.org/rec/journals/it/Samuel62.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}
@article{DeepBlue,
title = {Deep Blue},
journal = {Artificial Intelligence},
volume = {134},
number = {1},
pages = {57--83},
year = {2002},
issn = {0004-3702},
doi = {10.1016/S0004-3702(01)00129-1},
url = {https://www.sciencedirect.com/science/article/pii/S0004370201001291},
author = {Murray Campbell and A. Joseph Hoane and Feng-hsiung Hsu},
keywords = {Computer chess, Game tree search, Parallel search, Selective search, Search extensions, Evaluation function},
abstract = {Deep Blue is the chess machine that defeated then-reigning World Chess Champion Garry Kasparov in a six-game match in 1997. There were a number of factors that contributed to this success, including: a single-chip chess search engine, a massively parallel system with multiple levels of parallelism, a strong emphasis on search extensions, a complex evaluation function, and effective use of a Grandmaster game database. This paper describes the Deep Blue system, and gives some of the rationale that went into the design decisions behind Deep Blue.}
}

@article{Karp78,
author = {Richard M. Karp},
title = {A characterization of the minimum cycle mean in a digraph},
journal = {Discrete Mathematics},
volume = {23},
number = {3},
pages = {309--311},
year = {1978},
url = {https://doi.org/10.1016/0012-365X(78)90011-0},
doi = {10.1016/0012-365X(78)90011-0},
timestamp = {Fri, 12 Feb 2021 13:44:46 +0100},
biburl = {https://dblp.org/rec/journals/dm/Karp78.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@article{ChaturvediM17,
author = {Mmanu Chaturvedi and
Ross M. McConnell},
title = {A note on finding minimum mean cycle},
journal = {Information Processing Letters},
volume = {127},
pages = {21--22},
year = {2017},
url = {https://doi.org/10.1016/j.ipl.2017.06.007},
doi = {10.1016/j.ipl.2017.06.007},
timestamp = {Tue, 12 Sep 2017 17:58:15 +0200},
biburl = {https://dblp.org/rec/journals/ipl/ChaturvediM17.bib},
bibsource = {dblp computer science bibliography, https://dblp.org}
}

@book{cormen2009introduction,
title={Introduction to Algorithms},
author={Cormen, Thomas H and Leiserson, Charles E and Rivest, Ronald L and Stein, Clifford},
year={2009},
publisher={MIT Press}
}

@book{KleinbergTardos06,
author = {Kleinberg, Jon and Tardos, \'Eva},
publisher = {Addison Wesley},
title = {Algorithm Design},
year = 2006
}
@book{DasguptaPapadimitriouVazirani08,
author = {Sanjoy Dasgupta and
Christos H. Papadimitriou and
Umesh V. Vazirani},
title = {Algorithms},
publisher = {McGraw-Hill},
year = {2008}
}
6 changes: 3 additions & 3 deletions current_chapters/chapter1-intro.tex
@@ -36,7 +36,7 @@ \section{Motivation for RL}

Over the years, reinforcement learning has proven to be highly
successful for playing board games that require long horizon planning.
Early in 1962, Arthur Samuel developed a checkers game, which was at
As early as 1962, Arthur Samuel \cite{Samuel62} developed a checkers-playing program, which was at
the level of the best human players. His original framework included many of
the ingredients which later contributed to RL,
as well as search heuristics for large domains.
@@ -51,7 +51,7 @@ \section{Motivation for RL}

To complete the picture of computer board games, we should mention
Deep Blue, which in 1997 was able to beat the then-reigning world champion,
Kasparov. This program mainly built on heuristic search, and new hardware was developed to support it. Recently, DeepMind's
Kasparov \cite{DeepBlue}. This program was built mainly on heuristic search, and new hardware was developed to support it. Recently, DeepMind's
AlphaZero matched the best chess
programs (which are already much better than any human players), using a reinforcement learning approach \cite{silver2017mastering}.

@@ -100,7 +100,7 @@ \section{Book Organization}
then, in Chapter \ref{chapter:MDP-FH} we introduce the finite horizon MDP model and a fundamental dynamic programming approach. Chapter \ref{chapter:disc} covers the infinite horizon discounted setting.
% and episodic settings, respectively.

\paragraph{Learning:} The learning theme covers decision making when the MDP model is \textit{not known in advance}. Chapter \ref{chapter-model-based} introduces the \textit{model-based} approach, where the agent explicitly learns an MDP model from its experience and uses it for planning decisions. Chapter \ref{chapter:learning-model-free} covers an alternative \textit{model-free} approach, where decisions are learned without explicitly building a model. Chapters \ref{chapter:function-approximation} and \ref{chapter:policy-gradient} address learning of approximately optimal solutions in \textit{large} problems, that is, problems where the underlying MDP model is intractable to solve. Chapter \ref{chapter:function-approximation} approaches this topic using approximation of the value function, while Chapter \ref{chapter:policy-gradient} considers policy approximations.
\paragraph{Learning:} The learning theme covers decision making when the MDP model is \textit{not known in advance}. Chapter \ref{chapter-model-based} introduces the \textit{model-based} approach, where the agent explicitly learns an MDP model from its experience and uses it for planning decisions. Chapter \ref{chapter:learning-model-free} covers an alternative \textit{model-free} approach, where decisions are learned without explicitly building a model. Chapters \ref{chapter:function-approximation} and \ref{chapter:policy-gradient} address learning of approximately optimal solutions in \textit{large} problems, that is, problems where the underlying MDP model is intractable to solve. Chapter \ref{chapter:function-approximation} approaches this topic using approximation of the value function, while Chapter \ref{chapter:policy-gradient} considers policy approximations. In Chapter \ref{chapter:MAB} we consider the special case of Multi-Armed Bandits, which can be viewed as an MDP with a single state and unknown rewards.
% To complete the picture, Chapter \ref{chapter:tree-based-search} considers online planning using tree-search methods.
% \section{Markov Decision Process (MDP)}

72 changes: 37 additions & 35 deletions current_chapters/chapter2-ddp.tex
@@ -35,7 +35,7 @@ \section{Discrete Dynamic Systems}
\Actions({\state_{\ttime}}).\]
\end{remark}
\begin{remark}
The state dynamics may be augmented by an output equation:
The state dynamics may be augmented by an observation equation:
\[{\observation_{\ttime}} = {\fObservation_{\ttime}}({\state_{\ttime}},{\action_{\ttime}}),\]
where $\observation_{\ttime}$ is the system observation, or the
output. In most of this book we implicitly assume that
@@ -413,7 +413,7 @@ \subsection{Reduction between control policies classes}
cost to go from $\state_\ttime$, given that we follow $\policy$ from
$\ttime+1$ to $\tHorizon$. Therefore the cost can only decrease.
Formally, let $\E^\policy[\cdot]$ be the expectation with respect to
policy $\policy$.
policy $\policy$. We have,
\begin{align*}
\E^\policy_{\state_\ttime}[\Cost_{\ttime}(\state_\ttime)]
%=\E^\policy[\Cost(\state_\ttime, \ldots , \state_\tHorizon)]
@@ -506,7 +506,7 @@ \subsection{Optimal Control Policies}
them all.

Fortunately, Dynamic Programming offers a drastic reduction of the
computational complexity for this problem.
computational complexity for this problem, as presented in the next section.
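
As a preview of that section, the following is a minimal Python sketch of the backward recursion for a deterministic system with finite state and action spaces. The names (\texttt{states}, \texttt{actions}, \texttt{f} for the dynamics, \texttt{c} for the stage cost, \texttt{c\_final} for the terminal cost) are illustrative and not the book's notation.

\begin{verbatim}
# Illustrative sketch of finite-horizon dynamic programming for a
# deterministic system; names are not the book's notation.
def finite_horizon_dp(states, actions, f, c, c_final, T):
    """f(t, s, a) -> next state, c(t, s, a) -> stage cost,
    c_final(s) -> terminal cost, T -> horizon length."""
    V = {s: c_final(s) for s in states}          # V_T
    policy = [dict() for _ in range(T)]
    for t in reversed(range(T)):                 # t = T-1, ..., 0
        V_new = {}
        for s in states:
            best_a, best_v = None, float("inf")
            for a in actions(s):
                v = c(t, s, a) + V[f(t, s, a)]   # cost-to-go via a
                if v < best_v:
                    best_a, best_v = a, v
            V_new[s], policy[t][s] = best_v, best_a
        V = V_new
    return V, policy   # V[s] = optimal cost-to-go from s at time 0
\end{verbatim}

The total work is on the order of $\tHorizon$ times the number of state--action pairs, in contrast to enumerating all action sequences, whose number grows exponentially with $\tHorizon$.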

\section{Finite Horizon Dynamic Programming}

@@ -695,7 +695,7 @@ \section{Finite Horizon Dynamic Programming}

% \section{Shortest Paths}
% We can formulate a DDP problems similar to shortest path problems.
% Given a directed graph $G(V,E)$, there is a set of goal states
% Given a directed graph $\graph(\nodes,\edges)$, there is a set of goal states
% $\States_G$, and the goal is to reach one of the goal states.
% Formally, when we reach a goal state we stay there and have a zero
% cost. For such a DDP the optimal policy would be to compute a
@@ -857,10 +857,11 @@ \section{Finite Horizon Dynamic Programming}

\section{Shortest Path on a Graph}
The problem of finding the shortest path over a graph is one of the most fundamental problems in graph theory and computer science. We shall briefly consider here three major algorithms for this problem that are closely related to dynamic programming, namely the Bellman-Ford algorithm, Dijkstra's algorithm, and A$^*$.
An extensive presentation of the topic can be found in almost any book on algorithms, such as \cite{cormen2009introduction,KleinbergTardos06,DasguptaPapadimitriouVazirani08}.

\subsection{Problem Statement}
We introduce several definitions from graph theory.
\begin{definition}\textbf{Weighted Graphs:} Consider a graph $\graph = (\nodes,\edges)$ that consists of a finite set of vertices (or nodes) $\nodes = \{ \node\} $ and a finite set of edges (or links) $\edges = \{ \edge\} $. We will consider directed graphs, where each edge $\edge$ is equivalent to an ordered pair $({\node_1},{\node_2}) \equiv (s(\edge),d(\edge))$ of vertices. To each edge we assign a real-valued weight (or cost) $\cost(\edge) = \cost({\node_1},{\node_2})$.
\begin{definition}\textbf{Weighted Graphs:} Consider a graph $\graph = (\nodes,\edges)$ that consists of a finite set of vertices (or nodes) $\nodes = \{ \node\} $ and a finite set of edges (or links) $\edges = \{ \edge\} \subseteq \nodes\times\nodes$. We will consider directed graphs, where each edge $\edge$ is equivalent to an ordered pair $({\node_1},{\node_2}) \equiv (s(\edge),d(\edge))$ of vertices. To each edge we assign a real-valued weight (or cost) $\cost(\edge) = \cost({\node_1},{\node_2})$.
\end{definition}
\begin{definition}\textbf{Path:}
A path $\spath$ on $\graph$ from ${\node_0}$ to ${\node_k}$ is a sequence $({\node_0},{\node_1},{\node_2}, \ldots ,{\node_k})$ of vertices such that $({\node_i},{\node_{i + 1}}) \in \edges$. A path is \textbf{simple} if all edges in the path are distinct.
@@ -874,10 +875,10 @@ \subsection{Problem Statement}
A \textbf{shortest path} from $u$ to $v$ is a path from $u$ to $v$ that has the smallest length $\pathlen(\spath)$ among such paths. Denote this minimal length as $\minlen(u,v)$ (with $\minlen(u,v) = \infty $ if no path exists from $u$ to $v$).
The shortest path problem has the following variants:
\begin{itemize}
\item Single pair problem: Find the shortest path from a given source vertex $s$ to a given destination vertex $t$.
\item Single source problem: Find the shortest path from a given source vertex $s$ to all other vertices.
\item Single destination: Find the shortest path to a given destination node $t$ from all other vertices.
\item All pair problem.
\item Single pair problem: Find the shortest path from a given source vertex $u$ to a given destination vertex $v$.
\item Single source problem: Find the shortest path from a given source vertex $u$ to all other vertices.
\item Single destination: Find the shortest path to a given destination node $v$ from all other vertices.
\item All-pairs problem: Find the shortest path from every source vertex $u$ to every destination vertex $v$.
\end{itemize}

We note that the single-source and single-destination problems are symmetric and can be treated as one. The all-pairs problem can of course be solved by multiple applications of the other algorithms, but there exist algorithms that are especially suited for this problem.
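
For illustration only, the sketches in this section assume a weighted directed graph represented as a dictionary mapping each vertex to its list of outgoing edges; this representation is not part of the text.

\begin{verbatim}
# A small weighted directed graph (illustrative): each vertex maps to a
# list of (neighbor, cost) pairs for its outgoing edges.
graph = {
    "a": [("b", 1.0), ("c", 4.0)],
    "b": [("c", 2.0), ("d", 6.0)],
    "c": [("d", 3.0)],
    "d": [],
}
\end{verbatim}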
@@ -942,7 +943,10 @@ \subsection{The Bellman-Ford Algorithm}
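To make the relaxation step at the heart of Bellman-Ford concrete, here is a minimal Python sketch of the single-destination version, using the dictionary representation above; the function name and the early-exit check are illustrative and not the book's pseudocode.

\begin{verbatim}
def bellman_ford_to_dest(graph, dest):
    """Shortest-path cost from every vertex to dest, assuming no
    negative-cost cycles.  graph: {v: [(u, cost), ...]}."""
    d = {v: float("inf") for v in graph}
    d[dest] = 0.0
    for _ in range(len(graph) - 1):        # at most |V|-1 rounds
        changed = False
        for v, edges in graph.items():
            for u, w in edges:
                if w + d[u] < d[v]:        # relax the edge (v, u)
                    d[v] = w + d[u]
                    changed = True
        if not changed:                    # no update: distances are final
            break
    return d
\end{verbatim}

Each round relaxes every edge once, which gives the $O(|\nodes|\cdot|\edges|)$ running time of Bellman-Ford.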
\subsection{Dijkstra's Algorithm}
Dijkstra's algorithm (introduced in 1959) provides a more efficient solution to the single-destination shortest path problem. The algorithm is restricted to non-negative link weights, i.e., $\cost(\nodev,\nodeu) \ge 0$.

The algorithm essentially determines the minimal distance $d(\nodev,\nodev_d)$ of the vertices to the destination in order of that distance, namely the closest vertex first, then the second-closest, etc. The algorithm is roughly described below, with more details in the recitation.
The algorithm essentially determines the minimal distance $d(\nodev,\nodev_d)$ of the vertices to the destination in order of that distance, namely the closest vertex first, then the second-closest, etc. The algorithm is
%roughly
described below.
%, with more details in the recitation.
The algorithm maintains a set $\vertexset$ of vertices whose minimal distance to the destination has been determined. The other vertices $\nodes\backslash \vertexset$ are held in a queue. It proceeds as follows.

\begin{algorithm_}\textbf{Dijkstra's Algorithm}
@@ -966,7 +970,7 @@ \subsection{Dijkstra's Algorithm}
\tab{\tab{if $d[\nodev] > \cost(\nodev,\nodeu) + d[\nodeu]$,}}

\tab{\tab{\tab{ set $d[\nodev] = \cost(\nodev,\nodeu) + d[\nodeu]$, $\policy [\nodev] = \nodeu$ }}}
\item return $\{ d[\nodev],\policy [\nodev] \ |\ v \in V\} $
\item return $\{ d[\nodev],\policy [\nodev] \ |\ v \in \nodes\} $
\end{enumerate}
\end{algorithm_}
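
The following is a minimal Python sketch of this procedure, with a binary heap playing the role of the queue; as in the pseudocode above, $d[\nodev]$ is the cost to the destination and $\policy[\nodev]$ the next hop. The outgoing-edge dictionary representation and the names are illustrative.

\begin{verbatim}
import heapq

def dijkstra_to_dest(graph, dest):
    """Single-destination Dijkstra; all edge costs must be >= 0.
    graph: {v: [(u, cost), ...]} (outgoing edges)."""
    preds = {u: [] for u in graph}   # for each u, the edges (v, cost) entering u
    for v, edges in graph.items():
        for u, w in edges:
            preds[u].append((v, w))
    d = {v: float("inf") for v in graph}
    policy = {v: None for v in graph}       # next hop toward dest
    d[dest] = 0.0
    heap, done = [(0.0, dest)], set()
    while heap:
        du, u = heapq.heappop(heap)         # closest unfinished vertex
        if u in done:
            continue
        done.add(u)                         # d[u] is now final
        for v, w in preds[u]:               # relax incoming edges (v, u)
            if w + du < d[v]:
                d[v] = w + du
                policy[v] = u
                heapq.heappush(heap, (d[v], v))
    return d, policy
\end{verbatim}

For example, calling \texttt{dijkstra\_to\_dest} on the small graph above with destination \texttt{d} returns the cost-to-go values and the next-hop policy toward \texttt{d}. With a binary heap the running time is $O(|\edges|\log|\nodes|)$.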

Expand Down Expand Up @@ -1083,8 +1087,8 @@ \section{Average cost criteria}
simple cycle, and the average cost is the average cost of the edges
on the cycle. (Recall, we are considering only DDP.)

Given a directed graph $G(V,E)$, let $\Omega$ be the collection of
all cycles in $G(V,E)$. For each cycle $\omega=(v_1, \ldots ,
Given a directed graph $\graph(\nodes,\edges)$, let $\Omega$ be the collection of
all cycles in $\graph(\nodes,\edges)$. For each cycle $\omega=(v_1, \ldots ,
v_{k})$, we define $c(\omega)=\sum_{i=1}^k c(v_i,v_{i+1})$, where
$(v_i,v_{i+1})$ is the $i$-th edge in the cycle $\omega$. Let
$\mu(\omega)=\frac{c(\omega)}{k}$. The {\em minimum average cost cycle}
@@ -1120,7 +1124,7 @@ \section{Average cost criteria}
Delete $\omega$ from $\theta$, reducing the number of edges by
$|\omega|$ and the cumulative cost by $\mu(\omega)|\omega|$. We
continue the process until there are no remaining cycles, which
implies that we have at most $|V|=n$ nodes remaining. Therefore, the
implies that we have at most $|\nodes|=n$ nodes remaining. Therefore, the
cost of the deleted cycles was at least $(\tHorizon-n)\mu^*$. This implies
that the average cost of $\theta$ is at least $\E[\Cost_{avg}^{\policy'}]=\mu^*-\epsilon\geq
(1-\frac{n}{\tHorizon})\mu^*$. For $\epsilon>\mu^* n/\tHorizon$ we
@@ -1129,21 +1133,20 @@ \section{Average cost criteria}

Next we develop an algorithm for computing the minimum average cost
cycle, which implies an optimal policy for DDP for average costs.
The input is a directed graph $G(V,E)$ with edge cost $\cost:E\rightarrow {\mathbb R}$.
The input is a directed graph $\graph(\nodes,\edges)$ with edge cost $\cost:\edges\rightarrow {\mathbb R}$.

%The algorithm (due to Karp [1978]). \begin Set a
We first give a characterization of $\mu^*$. Set a root $r\in V$.
We first give a characterization of $\mu^*$. Set a root $r\in \nodes$.
Let $F_{k}(v)$ be the set of paths of length $k$ from $r$ to $v$. Let
$d_{k}(v)=\min_{p\in F_{k}(v)} \cost(p)$, where if
$F_{k}(v)=\emptyset$ then $d_{k}(v)=\infty$. The following theorem
(due to [R. Karp, A characterization of the minimum cycle mean in
digraph, Discrete mathematics, 1978]) gives a characterization of
$\mu^*$.
$F_{k}(v)=\emptyset$ then $d_{k}(v)=\infty$. The following theorem of Karp \cite{Karp78}
%(due to [R. Karp, A characterization of the minimum cycle mean in digraph, Discrete mathematics, 1978])
gives a characterization of $\mu^*$.

\begin{theorem}
The value of the minimum average cost cycle is
\[
\mu^*=\min_{v\in V} \max_{0\leq k \leq n-1}
\mu^*=\min_{v\in \nodes} \max_{0\leq k \leq n-1}
\frac{d_n(v)-d_{k}(v)}{n-k}\;,
\]
where we define $\infty-\infty$ as $\infty$.
@@ -1154,20 +1157,20 @@ \section{Average cost criteria}
has no negative cycle (we can guarantee this by adding a large
number $M$ to all the weights).

We start with $\mu^*=0$. This implies that we have in $G(V,E)$ a
We first consider the case $\mu^*=0$. This implies that we have in $\graph(\nodes,\edges)$ a
cycle of weight zero, but no negative cycle. For the theorem it is
sufficient to show that
\[
\min_{v\in \nodes} \max_{0\leq k \leq n-1} \{d_n(v)-d_{k}(v)\}=0.
\]

For every node $v\in V$ there is a path of length $k\in[0,n-1]$ of
For every node $v\in \nodes$ there is a path of length $k\in[0,n-1]$ of
cost $d(v)$, the cost of the shortest path from $r$ to $v$. This
implies that
\[
\max_{0\leq k \leq n-1} \{d_n(v)-d_{k}(v)\}=d_n(v)-d(v)\geq 0
\]
We need to show that for some $v\in V$ we have $d_n(v)=d(v)$, which
We need to show that for some $v\in \nodes$ we have $d_n(v)=d(v)$, which
implies that $\min_{v\in \nodes} \{d_n(v)-d(v)\}=0$.

Consider a cycle $\omega$ of cost $\Cost(\omega)=0$ (there is one,
@@ -1176,7 +1179,7 @@ \section{Average cost criteria}
length at least $n$. The path $P$ is a shortest path to $v$
(although not necessarily simple). This implies that any sub-path of
$P$ is also a shortest path. Let $P'$ be a sub-path of $P$ of length
$n$ and let it end in $u\in V$.
$n$ and let it end in $u\in \nodes$.
%
Path $P'$ is a shortest path to $u$, since it is a prefix of a
shortest path $P$.
@@ -1192,24 +1195,24 @@ \section{Average cost criteria}
case. It only remains to show that the formula changes by exactly
$\Delta=\mu^*$.

Formally, for every edge $e\in E$ let $\cost'(e)=\cost(e)-\Delta$.
Formally, for every edge $e\in \edges$ let $\cost'(e)=\cost(e)-\Delta$.
For any path $p$ we have $\Cost'(p)=\Cost(p)-|p|\Delta$, and for any
cycle $\omega$ we have $\mu'(\omega)=\mu(\omega)-\Delta$. This
implies that for $\Delta=\mu^*$ we have a cycle of cost zero and no
negative cycles. We now consider the formula,
\begin{align*}
0=(\mu')^*=&\min_{v\in V} \max_{0\leq k\leq n-1}
0=(\mu')^*=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d'_n(v)-d'_{k}(v)}{n-k}\}\\
=&\min_{v\in V} \max_{0\leq k\leq n-1}
=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-n\Delta-d_{k}(v)+k\Delta}{n-k}\}\\
=&\min_{v\in V} \max_{0\leq k\leq n-1}
=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-d_{k}(v)}{n-k}-\Delta\}\\
=&\min_{v\in V} \max_{0\leq k\leq n-1}
=&\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-d_{k}(v)}{n-k}\}-\Delta
\end{align*}
Therefore we have
\[
\mu^*=\Delta=\min_{v\in V} \max_{0\leq k\leq n-1}
\mu^*=\Delta=\min_{v\in \nodes} \max_{0\leq k\leq n-1}
\{\frac{d_n(v)-d_{k}(v)}{n-k}\}
\]
which completes the proof.
@@ -1221,12 +1224,11 @@ \section{Average cost criteria}
minimizing pair $(v,k)$ the path of length $n$ from $r$ to $v$ has a
cycle of length $n-k$, which is the suffix of the path. The solution
is that for the path $p$, from $r$ to $v$ of length $n$, any simple
cycle is a minimum average cost cycle. (See, [``A note of finding
minimum mean cycle'', Mmamu Chaturvedi and Ross M. McConnell, IPL
2017]).
cycle is a minimum average cost cycle. (See \cite{ChaturvediM17}.)
%(See, [``A note of finding minimum mean cycle'', Mmamu Chaturvedi and Ross M. McConnell, IPL 2017]).

The running time of computing the minimum average cost cycle is
$O(|V|\cdot |E|)$.
$O(|\nodes|\cdot |\edges|)$.
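
The characterization above yields the algorithm directly: compute $d_k(v)$ for $k=0,\ldots,n$ by one relaxation pass over the edges per level, and then evaluate Karp's formula. Below is a minimal Python sketch; the names are illustrative, and it assumes every vertex is reachable from the chosen root (which can always be arranged, without changing $\mu^*$, by adding an artificial root with zero-cost edges to all vertices).

\begin{verbatim}
def karp_min_mean_cycle(graph):
    """Karp's formula for the minimum average cost cycle.
    graph: {v: [(u, cost), ...]}; assumes every vertex is reachable
    from the chosen root."""
    nodes = list(graph)
    n = len(nodes)
    INF = float("inf")
    root = nodes[0]
    # d[k][v] = minimum cost of a walk with exactly k edges from root to v.
    d = [{v: INF for v in nodes} for _ in range(n + 1)]
    d[0][root] = 0.0
    for k in range(1, n + 1):
        for v, edges in graph.items():
            if d[k - 1][v] == INF:
                continue
            for u, w in edges:
                if d[k - 1][v] + w < d[k][u]:
                    d[k][u] = d[k - 1][v] + w
    best = INF                                # will hold mu^*
    for v in nodes:
        if d[n][v] == INF:                    # infinity - infinity = infinity
            continue
        worst = max((d[n][v] - d[k][v]) / (n - k)
                    for k in range(n) if d[k][v] < INF)
        best = min(best, worst)
    return best
\end{verbatim}

The nested loops perform $n$ relaxation passes over all edges, matching the $O(|\nodes|\cdot|\edges|)$ bound stated above.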



