# Beam Search
:label:`sec_beam-search`

In :numref:`sec_seq2seq`,
we predicted the output sequence token by token
until the special end-of-sequence "&lt;eos&gt;" token
was predicted.
In this section,
we will begin by formalizing this *greedy search* strategy
and exploring issues with it,
then compare it with two alternatives:
*exhaustive search* and *beam search*.

Before a formal introduction to greedy search,
let us formalize the search problem
using the same mathematical notation from :numref:`sec_seq2seq`.
At any time step $t'$,
the probability of the decoder output $y_{t'}$
is conditional on the output subsequence
$y_1, \ldots, y_{t'-1}$ before $t'$ and
the context variable $\mathbf{c}$ that
encodes the information of the input sequence.
To quantify computational cost,
denote by $\mathcal{Y}$ (which contains "&lt;eos&gt;")
the output vocabulary,
so the cardinality $\left|\mathcal{Y}\right|$ of this set
is the vocabulary size.
Let us also specify the maximum number of tokens
of an output sequence as $T'$.
Our goal is then to search for an ideal output
from all the
$\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$
possible output sequences.
Of course,
for all these output sequences,
portions including and after "&lt;eos&gt;" will be discarded
in the actual output.

## Greedy Search

First, let us take a look at
a simple strategy: *greedy search*.
This strategy was used to predict sequences in :numref:`sec_seq2seq`.
In greedy search,
at any time step $t'$ of the output sequence,
we search for the token
with the highest conditional probability from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

as the output.
Once "&lt;eos&gt;" is output or the output sequence has reached its maximum length $T'$, the output sequence is completed.
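To make this procedure concrete, here is a minimal sketch of greedy search in Python. The `cond_prob` function is a hypothetical stand-in for a trained decoder: just a fixed table of toy conditional probabilities, not a real model.

```python
# A minimal sketch of greedy search. `cond_prob` is a hypothetical
# stand-in for a trained decoder: a fixed table of toy conditional
# probabilities P(y | prefix, c) rather than a real model.
def cond_prob(prefix):
    table = {
        (): {"A": 0.5, "B": 0.2, "C": 0.2, "<eos>": 0.1},
        ("A",): {"A": 0.1, "B": 0.4, "C": 0.3, "<eos>": 0.2},
        ("A", "B"): {"A": 0.2, "B": 0.2, "C": 0.4, "<eos>": 0.2},
        ("A", "B", "C"): {"A": 0.1, "B": 0.2, "C": 0.1, "<eos>": 0.6},
    }
    return table[tuple(prefix)]

def greedy_search(max_len):
    seq = []
    for _ in range(max_len):
        probs = cond_prob(seq)
        # Pick the argmax token at this time step.
        y = max(probs, key=probs.get)
        seq.append(y)
        if y == "<eos>":
            break
    return seq

print(greedy_search(4))  # ['A', 'B', 'C', '<eos>']
```

Note that each step depends only on the prefix chosen so far, which is exactly why greedy search can miss a better sequence overall.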
So what can go wrong with greedy search?
In fact,
the *optimal sequence*
should be the output sequence
with the maximum
$\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$,
which is
the conditional probability of generating an output sequence based on the input sequence.
Unfortunately, there is no guarantee
that the optimal sequence will be obtained
by greedy search.


:label:`fig_s2s-prob1`

Let us illustrate this with an example.
Suppose that there are four tokens
"A", "B", "C", and "&lt;eos&gt;" in the output dictionary.
In :numref:`fig_s2s-prob1`,
the four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "&lt;eos&gt;" at that time step, respectively.
At each time step,
greedy search selects the token with the highest conditional probability.
Therefore, the output sequence "A", "B", "C", and "&lt;eos&gt;" will be predicted
in :numref:`fig_s2s-prob1`.
The conditional probability of this output sequence is $0.5\times0.4\times0.4\times0.6 = 0.048$.

:label:`fig_s2s-prob2`


Next, let us look at another example
in :numref:`fig_s2s-prob2`.
Unlike in :numref:`fig_s2s-prob1`,
at time step 2
we select the token "C"
in :numref:`fig_s2s-prob2`,
which has the *second* highest conditional probability.
Since the output subsequences at time steps 1 and 2,
on which time step 3 is based,
have changed from "A" and "B" in :numref:`fig_s2s-prob1` to "A" and "C" in :numref:`fig_s2s-prob2`,
the conditional probability of each token
at time step 3 has also changed in :numref:`fig_s2s-prob2`.
Suppose that we choose the token "B" at time step 3.
Now time step 4 is conditional on
the output subsequence at the first three time steps
"A", "C", and "B",
which is different from "A", "B", and "C" in :numref:`fig_s2s-prob1`.
Therefore, the conditional probability of generating each token at time step 4 in :numref:`fig_s2s-prob2` is also different from that in :numref:`fig_s2s-prob1`.
As a result,
the conditional probability of the output sequence "A", "C", "B", and "&lt;eos&gt;"
in :numref:`fig_s2s-prob2`
is $0.5\times0.3 \times0.6\times0.6=0.054$,
which is greater than that of greedy search in :numref:`fig_s2s-prob1`.
In this example,
the output sequence "A", "B", "C", and "&lt;eos&gt;" obtained by greedy search is not an optimal sequence.
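The two sequence probabilities above can be checked directly by multiplying the per-step conditional probabilities from the two figures:

```python
# Product of per-step conditional probabilities for the two sequences.
p_greedy = 0.5 * 0.4 * 0.4 * 0.6  # "A", "B", "C", "<eos>"
p_alt = 0.5 * 0.3 * 0.6 * 0.6     # "A", "C", "B", "<eos>"
print(round(p_greedy, 3), round(p_alt, 3))  # 0.048 0.054
print(p_alt > p_greedy)  # True: greedy search missed a better sequence
```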
## Exhaustive Search

If the goal is to obtain the optimal sequence, we may consider using *exhaustive search*:
exhaustively enumerate all the possible output sequences with their conditional probabilities,
then output the one
with the highest conditional probability.

Although we can use exhaustive search to obtain the optimal sequence,
its computational cost $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ is likely to be excessively high.
For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we will need to evaluate $10000^{10} = 10^{40}$ sequences. This is next to impossible!
On the other hand,
the computational cost of greedy search is
$\mathcal{O}(\left|\mathcal{Y}\right|T')$:
it is usually significantly smaller than
that of exhaustive search. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we only need to evaluate $10000\times10=10^5$ sequences.

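The gap between these two costs can be verified with the numbers used above:

```python
# Compare search costs for |Y| = 10000 and T' = 10, as in the text.
vocab_size, max_len = 10000, 10

exhaustive_cost = vocab_size ** max_len  # number of sequences to evaluate
greedy_cost = vocab_size * max_len       # number of token evaluations

print(exhaustive_cost == 10 ** 40)  # True
print(greedy_cost == 10 ** 5)       # True
```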

## Beam Search

Decisions about sequence searching strategies
lie on a spectrum,
with easy questions at either extreme.
What if only accuracy matters?
Obviously, exhaustive search.
What if only computational cost matters?
Clearly, greedy search.
Real-world applications usually ask
a more complicated question,
somewhere in between those two extremes.
*Beam search* is an improved version of greedy search. It has a hyperparameter named *beam size*, $k$.
At time step 1,
we select the $k$ tokens with the highest conditional probabilities.
Each of them will be the first token of
one of $k$ candidate output sequences.
At each subsequent time step,
based on the $k$ candidate output sequences
at the previous time step,
we continue to select the $k$ candidate output sequences
with the highest conditional probabilities
from the $k\left|\mathcal{Y}\right|$ possible choices.

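The selection step just described can be sketched in Python as follows. This is only an illustrative sketch: `cond_prob(prefix)` is a hypothetical interface standing in for a trained decoder, and for simplicity candidates ending in "&lt;eos&gt;" are still extended, which a full implementation would avoid.

```python
import math

# A minimal sketch of beam search. `cond_prob(prefix)` is a hypothetical
# stand-in for a trained decoder, returning P(y | prefix, c) as a dict.
# For simplicity, sequences ending in "<eos>" are still extended here;
# a full implementation would set them aside instead.
def beam_search(cond_prob, k, max_len):
    beams = [(0.0, [])]  # (log probability, sequence) pairs
    candidates = []      # candidates collected at every time step
    for _ in range(max_len):
        expansions = []
        for logp, seq in beams:
            for y, p in cond_prob(seq).items():
                expansions.append((logp + math.log(p), seq + [y]))
        # Keep the k best of the k * |Y| expansions.
        expansions.sort(key=lambda c: c[0], reverse=True)
        beams = expansions[:k]
        candidates.extend(beams)
    return candidates

# Toy model whose probabilities do not depend on the prefix.
def toy(prefix):
    return {"A": 0.5, "B": 0.3, "<eos>": 0.2}

for logp, seq in beam_search(toy, k=2, max_len=2):
    print(seq)
```

With $k=2$ and a maximum length of 2, the sketch collects $k$ candidates at each of the two time steps, four in total, mirroring the six candidates in the worked example below.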

:label:`fig_beam-search`


:numref:`fig_beam-search` demonstrates the
process of beam search with an example.
Suppose that the output vocabulary
contains only five elements:
$\mathcal{Y} = \{A, B, C, D, E\}$,
where one of them is "&lt;eos&gt;".
Let the beam size be 2 and
the maximum length of an output sequence be 3.
At time step 1,
suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$.
Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$.
As a result, we get six candidate output sequences: (i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.


In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., by discarding portions including and after "&lt;eos&gt;").
Then
we choose the sequence with the highest value of the following score as the output sequence:

$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:`eq_beam-search-score`

where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75.
Since a longer sequence has more logarithmic terms in the summation of :eqref:`eq_beam-search-score`,
the term $L^\alpha$ in the denominator penalizes
long sequences.

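Once the per-token log probabilities of a candidate are available, this score is a one-line computation. The following is a minimal sketch; the `sequence_score` name is our own, not from any library.

```python
import math

def sequence_score(log_probs, alpha=0.75):
    """Length-normalized log-probability with the L**alpha penalty."""
    L = len(log_probs)
    return sum(log_probs) / (L ** alpha)

# Score the candidate with per-step probabilities 0.5, 0.4, 0.4, 0.6
# (the greedy sequence from the earlier example).
score = sequence_score([math.log(p) for p in [0.5, 0.4, 0.4, 0.6]])
print(score)
```

Dividing by $L^\alpha$ rather than $L$ keeps some preference for sequences whose every token is probable, while still dampening the raw advantage that short sequences have under an unnormalized log-probability.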
The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$.
This result is in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with
a beam size of 1.
With a flexible choice of the beam size,
beam search provides a tradeoff between
accuracy and computational cost.


## Summary

* Sequence searching strategies include greedy search, exhaustive search, and beam search.
* Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.


## Exercises

1. Can we treat exhaustive search as a special type of beam search? Why or why not?
1. Apply beam search in the machine translation problem in :numref:`sec_seq2seq`. How does the beam size affect the translation results and the prediction speed?
1. We used language modeling for generating text following user-provided prefixes in :numref:`sec_rnn_scratch`. Which kind of search strategy does it use? Can you improve it?

[Discussions](https://discuss.d2l.ai/t/338)