From a28e232a872ec48e6411405a2fa8f61664685d57 Mon Sep 17 00:00:00 2001
From: Marc Lanctot
Date: Sun, 16 Nov 2014 18:39:54 +0000
Subject: [PATCH] small change to non-locality section

---
 iioos.tex | 70 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 37 insertions(+), 33 deletions(-)

diff --git a/iioos.tex b/iioos.tex
index 05edcdc..af2480c 100644
--- a/iioos.tex
+++ b/iioos.tex
@@ -104,8 +104,7 @@
 \newcommand{\cI}{\mathcal{I}}
 \newcommand{\cC}{\mathcal{C}}
 \newcommand{\tta}{\mathtt{a}}
-\newcommand{\tth}{\mathtt{h}}
-\newcommand{\ttz}{\mathtt{z}}
+\newcommand{\ttm}{\mathtt{m}}
 \newcommand{\PW}{\mbox{PW}}
 \newcommand{\BR}{\mbox{BR}}
 \newcommand{\defword}[1]{\textbf{\boldmath{#1}}}
@@ -415,11 +414,11 @@ \subsection{Extensive-Form Games}
 % Merge-Nov15: Did you move this paragraph out of this section?
 In a \defword{match} (online game), each player is allowed little or no preparation time before playing (preventing the offline advance computation of approximate equilibrium solutions).
-There is a current \defword{match history}, $\tth$, initially the empty history $\emptyset$ representing the start of the match. Each turn,
-the agent controlling $P(\tth)$ is given $t$ time units to decide on a \defword{match action} $\tta \in A(\tth)$ and the
-match history then changes using $\tth \leftarrow \tth \tta$. There is a single referee who knows $\tth$, samples chance outcomes
-as needed from $\sigma_c(\tth)$, and reveals $I(\tth)$ to $P(\tth)$ on their turn. The players play until the match is terminated,
-giving each player $i$ a payoff of $u_i(\ttz)$.
+The current \defword{match history}, $\ttm \in H$, is initially the empty history $\emptyset$, representing the start of the match. Each turn,
+the agent controlling $P(\ttm)$ is given $t$ time units to decide on a \defword{match action} $\tta \in A(\ttm)$, and the
+match history then changes using $\ttm \leftarrow \ttm \tta$. There is a single referee who knows $\ttm$, samples chance outcomes
+as needed from $\sigma_c(\ttm)$, and reveals $I(\ttm)$ to $P(\ttm)$ on their turn. The players play until the match is terminated
+($\ttm \in Z$), giving each player $i$ a payoff of $u_i(\ttm)$.
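
To make the referee loop just described concrete, here is a minimal sketch in Python. The \texttt{game} and \texttt{agents} interfaces (\texttt{is\_terminal}, \texttt{current\_player}, \texttt{chance\_strategy}, \texttt{information\_set}, \texttt{legal\_actions}, \texttt{payoffs}, \texttt{act}) are hypothetical names introduced only for this illustration; they are not part of the paper.

\begin{verbatim}
import random

def run_match(game, agents, time_limit):
    """Referee loop for one match, following the protocol above.

    The referee keeps the true match history m, samples chance outcomes
    itself, and only ever reveals the acting player's information set I(m).
    """
    m = ()  # the match history, initially the empty history
    while not game.is_terminal(m):
        player = game.current_player(m)
        if player == "chance":
            # The referee samples chance outcomes from sigma_c(m).
            outcomes, probs = game.chance_strategy(m)
            a = random.choices(outcomes, weights=probs)[0]
        else:
            # The acting player sees only I(m) and has time_limit time
            # units to choose a match action a in A(m).
            info_set = game.information_set(m, player)
            a = agents[player].act(info_set, game.legal_actions(m), time_limit)
        m = m + (a,)  # m <- m a
    return game.payoffs(m)  # the payoffs u_i(m)
\end{verbatim}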

 \subsection{The Problem of Non-Locality}
 \label{sec:nonlocality}
@@ -468,7 +467,7 @@ \subsection{The Problem of Non-Locality}
         \draw [dashed] (i2_1) -- (i2_2) node[midway, above] {I};
     \end{tikzpicture}
 \end{center}
-\caption{An game demonstrating the problem of non-locality with maximizing $\bigtriangleup$, minimizing $\bigtriangledown$ and chance $\bigcirc$ players.
+\caption{An extensive-form game demonstrating the problem of non-locality with maximizing $\bigtriangleup$, minimizing $\bigtriangledown$, and chance $\bigcirc$ players.
 \label{fig:coordGame}}
 \end{figure}

@@ -484,16 +483,20 @@ \subsection{The Problem of Non-Locality}
 %When hidden information is revealed for search, Russell and Norvig refer to this technique as ``averaging over clairvoyance''~\cite{russellnorvig}.
 However, even if the information structure is kept intact and information is aggregated during the searches, such as in Information Set Monte Carlo tree search (ISMCTS)~\cite{Cowling12ISMCTS}, the problem still occurs. If subtrees of $I$ are sampled equally often, a searching player will not have any preference between left and right and will recommend $(\frac{1}{2},\frac{1}{2})$, which is suboptimal.
 % However, mixing uniformly at $I$ is not part of an equilibrium in this game.
 The payoff to $\bigtriangleup$ for playing right would be $\frac{1}{2}\cdot 0 + \frac{1}{2} \cdot \frac{3}{2} = \frac{3}{4}$, which would give $\bigtriangleup$ incentive to switch to play left more often (since its expected value is $\frac{5}{4}$), in turn giving $\bigtriangledown$ incentive to deviate.
-Note that the uniform probability of being in each state of $I$ corresponds to the distribution over the states given the optimal play. Therefore, the problem occurs even if subtrees are sampled from the proper belief distribution. Therefore, no search algorithm starting only from the current state without some extra informaiton cannot converge to the optimal strategy.
+In this case, the uniform probability of being in each state of $I$ corresponds to the distribution over the states given the optimal play, so the problem occurs even if subtrees are sampled from the proper belief distribution.
+% this is mentioned later (after the proof)
+% and no search algorithm starting only from the current match history $\ttm$ without some extra information can converge to the optimal strategy.
+Note that this is a simple example; in larger games, this problem could occur over much longer paths or many times in different parts of the game.

 %To overcome this problem, we propose a new approach. Instead of adapting perfect information search techniques to imperfect information games, we present online variants of Monte Carlo equilibrium approximation algorithms that have been successful in the offline setting.

 OOS overcomes this problem by starting each sample from the root of the game. If the computed strategy tends to come closer to the uniform strategy in $I$, the updates in the maximizing player's information set will modify the strategy to choose left more often. It will cause the following samples to reach $I$ more often at the state on the left and consequently modify the strategy in $I$ in the right direction in the following iterations.
-
 To the best of our knowledge, OOS is the first online search algorithm that solves this problem.
 As suggested by the analysis in~\cite{Long10Understanding}, the effect may be critical in games with low {\it disambiguation factor}, where private information is very slowly (or never) revealed throughout a match.

 \subsection{Offline Equilibrium Approximation}

+There are many algorithms for computing approximate equilibrium strategies offline~\cite{Sandholm10The}. We focus on a popular choice among Poker researchers due to its sampling variants.
+
 Counterfactual Regret (CFR) is a notion of regret at the information set level for extensive-form games~\cite{CFR}.
 The CFR algorithm iteratively learns strategies in self-play, converging to an equilibrium.
 The \defword{counterfactual value} of reaching information set $I$ is the expected payoff given that player $i$ played to reach $I$, the opponents played
@@ -543,15 +546,14 @@ \subsection{Offline Equilibrium Approximation}
 on the sampled information sets values also eventually converges to the approximate equilibrium of the game with high probability.
 The required number of iterations for convergence is much larger, but each iteration is much faster.

-In Poker, CFR and MCCFR have been used with much success as offline methods
-for pre-computing approximate equilibria in abstract games~\cite{CFR,Johanson12CFRBR}; the same general
-approach has also been used in Liar's Dice~\cite{Neller11,Lanctot12IR}.
+%In Poker, CFR and MCCFR have been used with much success as offline methods
+%for pre-computing approximate equilibria in abstract games~\cite{CFR,Johanson12CFRBR}; the same general
+%approach has also been used in Liar's Dice~\cite{Neller11,Lanctot12IR}.

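All of the CFR-family algorithms above maintain a cumulative regret for each action at every visited information set and derive the current strategy from those regrets, typically by regret matching: each action is played in proportion to its positive cumulative regret, and uniformly if no action has positive regret. The short Python sketch below is only an illustration of this rule; the function name and the toy numbers are ours, not the paper's.

\begin{verbatim}
def regret_matching(cumulative_regret):
    """Turn cumulative regrets (action -> regret) into a strategy."""
    positive = {a: max(r, 0.0) for a, r in cumulative_regret.items()}
    total = sum(positive.values())
    if total > 0.0:
        return {a: r / total for a, r in positive.items()}
    # No action has positive regret: play uniformly at random.
    return {a: 1.0 / len(cumulative_regret) for a in cumulative_regret}

# Toy example: 'left' has accumulated more regret than 'right', so the
# current strategy prefers it; 'fold' has negative regret and gets 0.
print(regret_matching({"left": 3.0, "right": 1.0, "fold": -2.0}))
# {'left': 0.75, 'right': 0.25, 'fold': 0.0}
\end{verbatim}
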
 \section{Online Outcome Sampling}

 When outcome sampling is used in the offline setting, data structures for all information sets are allocated and created before the first iteration starts. In each iteration, every information set that is sampled gets updated.
-
 We make two essential modifications to adapt outcome sampling to the online search setting.

 {\bf Incremental Game Tree Building.} Before the match begins, only the very first (root) information set is added to memory.
@@ -563,8 +565,8 @@ \section{Online Outcome Sampling}
 This way, only the relevant information sets will be stored in memory.

 {\bf In-Match Search Targeting.}
-Suppose several moves have been played since the start of the match leading to $\tth$.
-Plain outcome sampling would continue to sample from the root of the game (not the current match history $\tth$), entirely
+Suppose several moves have been played since the start of the match, leading to $\ttm$.
+Plain outcome sampling would continue to sample from the root of the game (not the current match history $\ttm$), entirely
 disregarding the region of the game space that the match has headed toward.
 Hence, the second modification we propose is directing the search towards the histories that are more likely to occur during the match currently being played.
 Note that the complete history is typically unknown to the players, who know only their information sets.
@@ -574,19 +576,20 @@ \section{Online Outcome Sampling}

 \subsection{Information Set Targeting (IST)}

-Suppose the match history is $\tth$. IST samples histories reaching the current information set ($I(\tth)$),
-i.e., $(h,z) \in Z_{I(\tth)}$, with higher probability than other histories.
+Suppose the match history is $\ttm$. IST samples histories reaching the current information set ($I(\ttm)$),
+i.e., $(h,z) \in Z_{I(\ttm)}$, with higher probability than other histories.
 The intuition is that these histories are particularly relevant since the searching player {\it knows} that one of these $z$
 will describe the match at its completion.
-However, focusing fully only on these histories may cause problems because of the non-locality and the convergence guarantees are lost.
+However, focusing {\it only} on these histories may cause problems because of non-locality, and the convergence guarantees are lost.
+
 Consider again the game in Figure~\ref{fig:coordGame}. If the minimizing player knows she is in the information set $I$ and focuses all of her search on this
 information set for sufficiently long, she computes the suboptimal uniform strategy.
 Any fixed non-zero probability of sampling the left chance action will eventually solve the problem.
 The regrets are multiplied by the reciprocal of the sampling probability; hence, they influence the strategy in the information set proportionally more strongly if the samples are rare.

-Note that previous methods, such as PIMC and ISMCTS, {\it always} target $I(\tth)$, \ie with probability 1, and do not
-update predecessors of $I(\tth)$. In contrast, in IST {\it all} information sets in memory reached during each iteration requires updating
+Note that previous methods, such as PIMC and ISMCTS, {\it always} target $I(\ttm)$, \ie with probability 1, and do not
+update predecessors of $I(\ttm)$. In contrast, in IST {\it all} information sets in memory reached during each iteration require updating
 to ensure eventual convergence to an equilibrium.

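The role of this reciprocal weighting can be checked numerically. The self-contained Python sketch below is our illustration, not code from the paper: it estimates the value of a two-outcome chance node while one outcome is sampled only rarely, as a targeting scheme might do. Because each sampled value is weighted by its true probability divided by its sampling probability, the estimate remains unbiased as long as both outcomes keep a non-zero sampling probability; the rarely sampled branch simply contributes larger but less frequent updates.

\begin{verbatim}
import random

def weighted_estimate(values, true_probs, sampling_probs, n, seed=0):
    """Importance-sampled estimate of sum_a true_probs[a] * values[a]."""
    rng = random.Random(seed)
    outcomes = list(values)
    weights = [sampling_probs[a] for a in outcomes]
    total = 0.0
    for _ in range(n):
        a = rng.choices(outcomes, weights=weights)[0]
        # Weight by the reciprocal of the sampling probability.
        total += (true_probs[a] / sampling_probs[a]) * values[a]
    return total / n

values = {"left": 5.0, "right": 1.0}
true_probs = {"left": 0.5, "right": 0.5}    # the chance distribution
targeted = {"left": 0.05, "right": 0.95}    # 'left' is rarely sampled

print(weighted_estimate(values, true_probs, true_probs, 100000))  # close to 3.0
print(weighted_estimate(values, true_probs, targeted, 100000))    # also close to 3.0, higher variance
\end{verbatim}
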
 \subsection{Public Subgame Targeting (PST)}

@@ -600,8 +603,8 @@ \subsection{Public Subgame Targeting (PST)}
 Given a history $h$, let $p(h)$ be the sequence of public actions along $h$ in the same order that they were taken in $h$.
 Define the \defword{public subgame} induced by $I$ to be the one whose terminal history set is
 \[Z_{p,I(h)} = \{(h',z)~|~z \in Z, h' \in H, p(h') = p(h), h' \sqsubset z \}.\]
-Now, suppose the match history is $\tth$.
-Public subgame targeting samples $z \in Z_{p,I(\tth)}$ with higher probability than terminal histories outside this set.
+Now, suppose the match history is $\ttm$.
+Public subgame targeting samples $z \in Z_{p,I(\ttm)}$ with higher probability than terminal histories outside this set.
 A public subgame, then, contains all the terminal histories consistent with the public actions played over the match and each
 combination of private chance events for both players.
 So, in a game of two-player limit Texas Hold'em poker, suppose that nature decides on the private cards of the players,
@@ -619,7 +622,8 @@ \subsection{Algorithm}
 }
 \ElsIf{$P(h) = c$}{
   Sample an outcome $a$ and let $\rho_1,\rho_2$ be its probability in the targeted and untargeted settings \; \label{alg:chancesample}
-  \breturn OOS$(ha, \pi_i, \rho_2 \pi_{-i}, \rho_1 s_1 , \rho_2 s_2, i)$ \;
+  Let $(x, l, u) \gets $ OOS$(ha, \pi_i, \rho_2 \pi_{-i}, \rho_1 s_1 , \rho_2 s_2, i)$ \;
+  \breturn $(\rho_2 x, l, u)$ \;
 }
 $I \gets $ getInfoset$(h, P(h))$ \;
 Let $(a,s_1',s_2') \leftarrow $ Sample$(h, I, i, \epsilon)$ \; \label{alg:sample}
@@ -661,7 +665,7 @@ \subsection{Algorithm}
 regret for each action $a \in A(I)$, and $s_I$ stores the cumulative average
 strategy probability of each action.

-Depending on the targeting method that is chosen (IST or PST), $Z_{sub}$ is one of $Z_{I(\tth)}$ or $Z_{p,I(\tth)}$.
+Depending on the targeting method that is chosen (IST or PST), $Z_{sub}$ is either $Z_{I(\ttm)}$ or $Z_{p,I(\ttm)}$.
 The pseudo-code is presented as Algorithm~\ref{alg}. Each iteration is represented by two calls of OOS where the update player $i \in \{1,2\}$ is alternated.
 Before each iteration, a {\it scenario} is decided:
@@ -706,20 +710,20 @@ \subsection{Algorithm}
 update player histories, while average strategy tables at opponent histories.
 Now we explain the role of the weighting factor $w_T$. Note that on lines 22 and 28, the regret and strategy updates are multiplied by the reciprocal of the probability of sampling the sampled leaf and the current state.
-Consider the updates in the current information set of the game $I_\mathtt{h}$. In the initial samples of the algorithm with empty match history, this information set was on average sampled with a very low probability $s_0$. Let's say that out of $10^6$ samples started from the root, 1000 samples reached this particular information set. In that case, $s_0=0.001$ and the regret updates caused by each of these samples were multiplied by $\frac{1}{s_0}=1000$.
-Now the game has actually reached the history $\mathtt{h}$ and due to targeting, half of the next $10^6$ samples reach $I_\mathtt{h}$. It means that $s'_0=0.5$ and the regrets will be multiplied only by 2.
+Consider the updates in the current information set of the game $I_\ttm$. In the initial samples of the algorithm with empty match history, this information set was on average sampled with a very low probability $s_0$. Let's say that out of $10^6$ samples started from the root, 1000 samples reached this particular information set. In that case, $s_0=0.001$ and the regret updates caused by each of these samples were multiplied by $\frac{1}{s_0}=1000$.
+Now the game has actually reached the history $\ttm$ and, due to targeting, half of the next $10^6$ samples reach $I_\ttm$. This means that $s'_0=0.5$ and the regrets will be multiplied only by 2.
 As a result, the updates from the first (generally less precise) 1000 samples are altogether weighted the same as the $5\times 10^5$ later samples, which makes it almost impossible to compensate for the initial errors.
 In order to prevent this effect, we add the weighting factor $w_T$ to compensate for the change of targeting and make each of the samples have a similar weight.
-In our example, $w_T=\frac{s'_0}{s_0}$. More formally, when running the iterations from match history $\mathtt{h}$, we define the weighting factor as the probability of reaching $I(\mathtt{h})$ without any targeting devided by the probability of reaching $I(\mathtt{h})$ with the current targeting, assuming the players play according to the current mean strategy profile $\bar{\pi}$:
-\[\frac{1}{w_T(\mathtt{h})} = (1-\delta) + \delta\frac{\sum_{(h,z)\in I(\mathtt{h})} \bar{\pi}(h)}{\sum_{z\in Z_{sub(\mathtt{h})}} \bar{\pi}(z)}.\]
+In our example, $w_T=\frac{s'_0}{s_0}$. More formally, when running the iterations from match history $\ttm$, we define the weighting factor as the probability of reaching $I(\ttm)$ without any targeting divided by the probability of reaching $I(\ttm)$ with the current targeting, assuming the players play according to the current mean strategy profile $\bar{\pi}$:
+\[\frac{1}{w_T(\ttm)} = (1-\delta) + \delta\frac{\sum_{(h,z)\in I(\ttm)} \bar{\pi}(h)}{\sum_{z\in Z_{sub(\ttm)}} \bar{\pi}(z)}.\]

 \subsubsection{Consistency}

 \begin{theorem}
-Let $\bar{\sigma}^t_m(\delta,\tth)$ be a strategy produced by OOS with scheme 1$m \in \{ \mbox{IST}, \mbox{PST} \}$
-using $\delta < 1$ started from $\tth$ run for $t$ iterations, with exploration $\epsilon > 0$.
+Let $\bar{\sigma}^t_m(\delta,\ttm)$ be a strategy produced by OOS with targeting scheme $m \in \{ \mbox{IST}, \mbox{PST} \}$
+using $\delta < 1$, started from $\ttm$, and run for $t$ iterations with exploration $\epsilon > 0$.
 For any $p \in (0, 1], \varepsilon > 0$ there exists $t < \infty$ such that with
-probability $1-p$ the strategy $\bar{\sigma}^t_m(\delta,\tth)$ is a $\varepsilon$-equilibrium strategy.
+probability $1-p$ the strategy $\bar{\sigma}^t_m(\delta,\ttm)$ is an $\varepsilon$-equilibrium strategy.
 \label{thm:consistency}
 \end{theorem}
 \begin{proof}(Sketch) Each terminal history has nonzero probability of being sampled, eventually every information
@@ -728,7 +732,7 @@ \subsubsection{Consistency}
 \end{proof}

 Note that due to non-locality, this consistency property cannot hold generally for any search
-algorithm that does not modify $\sigma(I)$ at previous $I(h)$ such that $h \sqsubset \tth$. However,
+algorithm that does not modify $\sigma(I)$ at previous $I(h)$ such that $h \sqsubset \ttm$. However,
 it is an open question whether any of the previous algorithms could be modified to ensure consistency.
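
Finally, the definition of the weighting factor $w_T$ above translates directly into a few lines of code. The Python sketch below is only a transcription of the displayed formula (the function name and arguments are ours, not the paper's); the two sums over reach probabilities under the current mean strategy profile $\bar{\pi}$ are taken here as given numbers.

\begin{verbatim}
def targeting_weight(delta, reach_current_infoset, reach_targeted_set):
    """Weighting factor w_T, transcribed from the displayed formula.

    reach_current_infoset stands for the numerator sum (reach of I(m)) and
    reach_targeted_set for the denominator sum (reach of Z_sub(m)), both
    under the current mean strategy profile.
    """
    inverse_w = (1.0 - delta) + delta * reach_current_infoset / reach_targeted_set
    return 1.0 / inverse_w

# With delta = 0 (no targeting) the factor is 1 and nothing is re-weighted;
# with these example numbers, more aggressive targeting needs a larger correction.
print(targeting_weight(0.0, 0.001, 0.5))  # 1.0
print(targeting_weight(0.9, 0.001, 0.5))  # roughly 9.8
\end{verbatim}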