
Commit b46ed48

committed: add up to encoder decoder
1 parent 3342ef2 commit b46ed48

8 files changed: +1287 -0 lines changed

+75
@@ -0,0 +1,75 @@
# Beam Search
:label:`sec_beam-search`

In :numref:`sec_seq2seq`, we predicted the output sequence token by token until the special end-of-sequence "&lt;eos&gt;" token is predicted. In this section, we will begin by formalizing this *greedy search* strategy and exploring issues with it, then compare it with the alternatives:
*exhaustive search* and *beam search*.

Before a formal introduction to greedy search, let us formalize the search problem using the same mathematical notation as in :numref:`sec_seq2seq`. At any time step $t'$, the probability of the decoder output $y_{t'}$ is conditional on the output subsequence $y_1, \ldots, y_{t'-1}$ before $t'$ and the context variable $\mathbf{c}$ that encodes the information of the input sequence. To quantify computational cost, denote by $\mathcal{Y}$ (which contains "&lt;eos&gt;") the output vocabulary. The cardinality $\left|\mathcal{Y}\right|$ of this vocabulary set is the vocabulary size. Let us also specify the maximum number of tokens of an output sequence as $T'$. Our goal is then to search for an ideal output from all the $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ possible output sequences. Of course, for all these output sequences, portions including and after "&lt;eos&gt;" will be discarded in the actual output.

## Greedy Search

First, let us take a look at a simple strategy: *greedy search*. This strategy has been used to predict sequences in :numref:`sec_seq2seq`. In greedy search, at any time step $t'$ of the output sequence, we search for the token with the highest conditional probability from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

as the output. Once "&lt;eos&gt;" is outputted or the output sequence has reached its maximum length $T'$, the output sequence is completed.
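
To make this concrete, below is a minimal sketch of greedy decoding, assuming a hypothetical `predict_step(prefix, context)` that returns the decoder's probability distribution over $\mathcal{Y}$ given the current output prefix and the context variable.

```python
import numpy as np

def greedy_search(predict_step, context, eos_id, max_len):
    """Greedy decoding: at each time step, emit the token with the highest
    conditional probability given the prefix and the context variable."""
    prefix = []
    for _ in range(max_len):
        probs = predict_step(prefix, context)  # distribution over the vocabulary
        y = int(np.argmax(probs))
        prefix.append(y)
        if y == eos_id:  # stop once "<eos>" is emitted
            break
    return prefix
```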

So what can go wrong with greedy search? In fact, the *optimal sequence* should be the output sequence that maximizes $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$, the conditional probability of generating an output sequence based on the input sequence. Unfortunately, there is no guarantee that greedy search will obtain the optimal sequence.

![At each time step, greedy search selects the token with the highest conditional probability.](../img/s2s-prob1.svg)
:label:`fig_s2s-prob1`

Let us illustrate it with an example. Suppose that there are four tokens "A", "B", "C", and "&lt;eos&gt;" in the output dictionary. In :numref:`fig_s2s-prob1`, the four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "&lt;eos&gt;" at that time step, respectively. At each time step, greedy search selects the token with the highest conditional probability. Therefore, the output sequence "A", "B", "C", and "&lt;eos&gt;" will be predicted in :numref:`fig_s2s-prob1`. The conditional probability of this output sequence is $0.5\times0.4\times0.4\times0.6 = 0.048$.

![The four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "&lt;eos&gt;" at that time step. At time step 2, the token "C", which has the second highest conditional probability, is selected.](../img/s2s-prob2.svg)
:label:`fig_s2s-prob2`

Next, let us look at another example in :numref:`fig_s2s-prob2`. Unlike in :numref:`fig_s2s-prob1`, at time step 2 we select the token "C" in :numref:`fig_s2s-prob2`, which has the *second* highest conditional probability. Since the output subsequences at time steps 1 and 2, on which time step 3 is based, have changed from "A" and "B" in :numref:`fig_s2s-prob1` to "A" and "C" in :numref:`fig_s2s-prob2`, the conditional probability of each token at time step 3 has also changed in :numref:`fig_s2s-prob2`. Suppose that we choose the token "B" at time step 3. Now time step 4 is conditional on the output subsequence at the first three time steps, "A", "C", and "B", which is different from "A", "B", and "C" in :numref:`fig_s2s-prob1`. Therefore, the conditional probability of generating each token at time step 4 in :numref:`fig_s2s-prob2` is also different from that in :numref:`fig_s2s-prob1`. As a result, the conditional probability of the output sequence "A", "C", "B", and "&lt;eos&gt;" in :numref:`fig_s2s-prob2` is $0.5\times0.3\times0.6\times0.6=0.054$, which is greater than that of greedy search in :numref:`fig_s2s-prob1`. In this example, the output sequence "A", "B", "C", and "&lt;eos&gt;" obtained by greedy search is not an optimal sequence.
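
As a quick check of the arithmetic:

```python
# Conditional probabilities read off fig_s2s-prob1 and fig_s2s-prob2
p_greedy = 0.5 * 0.4 * 0.4 * 0.6  # "A", "B", "C", "<eos>"
p_alt = 0.5 * 0.3 * 0.6 * 0.6     # "A", "C", "B", "<eos>"
print(f'{p_greedy:.3f} vs. {p_alt:.3f}')  # 0.048 vs. 0.054
```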

## Exhaustive Search

If the goal is to obtain the optimal sequence, we may consider using *exhaustive search*: exhaustively enumerate all the possible output sequences with their conditional probabilities, then output the one with the highest conditional probability.

Although we can use exhaustive search to obtain the optimal sequence, its computational cost $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ is likely to be excessively high. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we would need to evaluate $10000^{10} = 10^{40}$ sequences. This is next to impossible! On the other hand, the computational cost of greedy search is $\mathcal{O}(\left|\mathcal{Y}\right|T')$: it is usually significantly smaller than that of exhaustive search. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we only need to evaluate $10000\times10=10^5$ sequences.
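
As a sketch, exhaustive search is brute-force enumeration. Here `seq_prob` is a hypothetical function that returns the conditional probability of a complete output sequence; the loop body runs $\left|\mathcal{Y}\right|^{T'}$ times, which is exactly why this is only feasible for tiny vocabularies and lengths.

```python
import itertools

def exhaustive_search(seq_prob, vocab, max_len):
    """Brute force: score every one of the |Y|**T' candidate sequences
    and return the most probable one."""
    best_seq, best_prob = None, -1.0
    for seq in itertools.product(vocab, repeat=max_len):
        p = seq_prob(seq)
        if p > best_prob:
            best_seq, best_prob = seq, p
    return best_seq, best_prob
```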

## Beam Search

Decisions about sequence searching strategies lie on a spectrum, with easy questions at either extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computational cost matters? Clearly, greedy search. A real-world application usually asks a complicated question, somewhere in between those two extremes.

*Beam search* is an improved version of greedy search. It has a hyperparameter named *beam size*, $k$. At time step 1, we select the $k$ tokens with the highest conditional probabilities. Each of them will be the first token of one of the $k$ candidate output sequences. At each subsequent time step, based on the $k$ candidate output sequences at the previous time step, we continue to select the $k$ candidate output sequences with the highest conditional probabilities from $k\left|\mathcal{Y}\right|$ possible choices.
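
Below is a minimal sketch of this procedure. It reuses the hypothetical `predict_step(prefix, context)` from the greedy sketch above, works in log space for numerical stability, and omits "&lt;eos&gt;" handling and length normalization (discussed below) for clarity.

```python
import numpy as np

def beam_search(predict_step, context, vocab_size, k, max_len):
    """Keep the k most probable prefixes at every time step, chosen from
    the k * vocab_size single-token extensions of the current beams."""
    beams = [([], 0.0)]  # (prefix, log-probability) pairs
    for _ in range(max_len):
        candidates = []
        for prefix, log_p in beams:
            log_probs = np.log(predict_step(prefix, context))
            for y in range(vocab_size):
                candidates.append((prefix + [y], log_p + log_probs[y]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]  # prune to the k best candidates
    return beams
```
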
![The process of beam search (beam size: 2, maximum length of an output sequence: 3). The candidate output sequences are $A$, $C$, $AB$, $CE$, $ABD$, and $CED$.](../img/beam-search.svg)
:label:`fig_beam-search`

:numref:`fig_beam-search` demonstrates the process of beam search with an example. Suppose that the output vocabulary contains only five elements: $\mathcal{Y} = \{A, B, C, D, E\}$, where one of them is "&lt;eos&gt;". Let the beam size be 2 and the maximum length of an output sequence be 3. At time step 1, suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say $P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$. Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say $P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$. As a result, we get six candidate output sequences: (i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.

In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., discard portions including and after "&lt;eos&gt;"). Then we choose the sequence with the highest of the following scores as the output sequence:

$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:`eq_beam-search-score`

where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75. Since a longer sequence has more logarithmic terms in the summation of :eqref:`eq_beam-search-score`, the term $L^\alpha$ in the denominator penalizes long sequences.
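
For instance, the score in :eqref:`eq_beam-search-score` can be computed as follows; `step_probs` is assumed to hold the per-step conditional probabilities of a final candidate sequence:

```python
import math

def sequence_score(step_probs, alpha=0.75):
    """Length-normalized log-probability of a final candidate sequence."""
    L = len(step_probs)
    return sum(math.log(p) for p in step_probs) / (L ** alpha)

# E.g., the greedy sequence "A", "B", "C", "<eos>" from fig_s2s-prob1:
print(sequence_score([0.5, 0.4, 0.4, 0.6]))  # approx. -1.07
```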

The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$. This result is in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with a beam size of 1. With a flexible choice of the beam size, beam search provides a tradeoff between accuracy and computational cost.

## Summary

* Sequence searching strategies include greedy search, exhaustive search, and beam search.
* Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.

## Exercises

1. Can we treat exhaustive search as a special type of beam search? Why or why not?
1. Apply beam search to the machine translation problem in :numref:`sec_seq2seq`. How does the beam size affect the translation results and the prediction speed?
1. We used language modeling to generate text following user-provided prefixes in :numref:`sec_rnn_scratch`. Which kind of search strategy does it use? Can you improve it?

[Discussions](https://discuss.d2l.ai/t/338)
@@ -0,0 +1,217 @@
# Beam Search
:label:`sec_beam-search`

In :numref:`sec_seq2seq`, we predicted the output sequence token by token until the special end-of-sequence "&lt;eos&gt;" token is predicted. In this section, we will begin with formalizing this *greedy search* strategy and exploring issues with it, then compare this strategy with other alternatives: *exhaustive search* and *beam search*.

Before a formal introduction to greedy search, let us formalize the search problem using the same mathematical notation from :numref:`sec_seq2seq`. At any time step $t'$, the probability of the decoder output $y_{t'}$ is conditional on the output subsequence $y_1, \ldots, y_{t'-1}$ before $t'$ and the context variable $\mathbf{c}$ that encodes the information of the input sequence. To quantify computational cost, denote by $\mathcal{Y}$ (it contains "&lt;eos&gt;") the output vocabulary. So the cardinality $\left|\mathcal{Y}\right|$ of this vocabulary set is the vocabulary size. Let us also specify the maximum number of tokens of an output sequence as $T'$. As a result, our goal is to search for an ideal output from all the $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ possible output sequences. Of course, for all these output sequences, portions including and after "&lt;eos&gt;" will be discarded in the actual output.

## Greedy Search

First, let us take a look at a simple strategy: *greedy search*. This strategy has been used to predict sequences in :numref:`sec_seq2seq`. In greedy search, at any time step $t'$ of the output sequence, we search for the token with the highest conditional probability from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

as the output. Once "&lt;eos&gt;" is outputted or the output sequence has reached its maximum length $T'$, the output sequence is completed.
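
To make this concrete, below is a minimal sketch of greedy decoding, assuming a hypothetical `predict_step(prefix, context)` that returns the decoder's probability distribution over $\mathcal{Y}$ given the current output prefix and the context variable.

```python
import numpy as np

def greedy_search(predict_step, context, eos_id, max_len):
    """Greedy decoding: at each time step, emit the token with the highest
    conditional probability given the prefix and the context variable."""
    prefix = []
    for _ in range(max_len):
        probs = predict_step(prefix, context)  # distribution over the vocabulary
        y = int(np.argmax(probs))
        prefix.append(y)
        if y == eos_id:  # stop once "<eos>" is emitted
            break
    return prefix
```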

So what can go wrong with greedy search? In fact, the *optimal sequence* should be the output sequence with the maximum $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$, which is the conditional probability of generating an output sequence based on the input sequence. Unfortunately, there is no guarantee that the optimal sequence will be obtained by greedy search.

![At each time step, greedy search selects the token with the highest conditional probability.](../img/s2s-prob1.svg)
:label:`fig_s2s-prob1`

Let us illustrate it with an example. Suppose that there are four tokens "A", "B", "C", and "&lt;eos&gt;" in the output dictionary. In :numref:`fig_s2s-prob1`, the four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "&lt;eos&gt;" at that time step, respectively. At each time step, greedy search selects the token with the highest conditional probability. Therefore, the output sequence "A", "B", "C", and "&lt;eos&gt;" will be predicted in :numref:`fig_s2s-prob1`. The conditional probability of this output sequence is $0.5\times0.4\times0.4\times0.6 = 0.048$.

![The four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "&lt;eos&gt;" at that time step. At time step 2, the token "C", which has the second highest conditional probability, is selected.](../img/s2s-prob2.svg)
:label:`fig_s2s-prob2`

Next, let us look at another example in :numref:`fig_s2s-prob2`. Unlike in :numref:`fig_s2s-prob1`, at time step 2 we select the token "C" in :numref:`fig_s2s-prob2`, which has the *second* highest conditional probability. Since the output subsequences at time steps 1 and 2, on which time step 3 is based, have changed from "A" and "B" in :numref:`fig_s2s-prob1` to "A" and "C" in :numref:`fig_s2s-prob2`, the conditional probability of each token at time step 3 has also changed in :numref:`fig_s2s-prob2`. Suppose that we choose the token "B" at time step 3. Now time step 4 is conditional on the output subsequence at the first three time steps, "A", "C", and "B", which is different from "A", "B", and "C" in :numref:`fig_s2s-prob1`. Therefore, the conditional probability of generating each token at time step 4 in :numref:`fig_s2s-prob2` is also different from that in :numref:`fig_s2s-prob1`. As a result, the conditional probability of the output sequence "A", "C", "B", and "&lt;eos&gt;" in :numref:`fig_s2s-prob2` is $0.5\times0.3\times0.6\times0.6=0.054$, which is greater than that of greedy search in :numref:`fig_s2s-prob1`. In this example, the output sequence "A", "B", "C", and "&lt;eos&gt;" obtained by greedy search is not an optimal sequence.
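
As a quick check of the arithmetic:

```python
# Conditional probabilities read off fig_s2s-prob1 and fig_s2s-prob2
p_greedy = 0.5 * 0.4 * 0.4 * 0.6  # "A", "B", "C", "<eos>"
p_alt = 0.5 * 0.3 * 0.6 * 0.6     # "A", "C", "B", "<eos>"
print(f'{p_greedy:.3f} vs. {p_alt:.3f}')  # 0.048 vs. 0.054
```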

## Exhaustive Search

If the goal is to obtain the optimal sequence, we may consider using *exhaustive search*: exhaustively enumerate all the possible output sequences with their conditional probabilities, then output the one with the highest conditional probability.

Although we can use exhaustive search to obtain the optimal sequence, its computational cost $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ is likely to be excessively high. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we will need to evaluate $10000^{10} = 10^{40}$ sequences. This is next to impossible! On the other hand, the computational cost of greedy search is $\mathcal{O}(\left|\mathcal{Y}\right|T')$: it is usually significantly smaller than that of exhaustive search. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we only need to evaluate $10000\times10=10^5$ sequences.
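
As a sketch, exhaustive search is brute-force enumeration. Here `seq_prob` is a hypothetical function that returns the conditional probability of a complete output sequence; the loop body runs $\left|\mathcal{Y}\right|^{T'}$ times, which is exactly why this is only feasible for tiny vocabularies and lengths.

```python
import itertools

def exhaustive_search(seq_prob, vocab, max_len):
    """Brute force: score every one of the |Y|**T' candidate sequences
    and return the most probable one."""
    best_seq, best_prob = None, -1.0
    for seq in itertools.product(vocab, repeat=max_len):
        p = seq_prob(seq)
        if p > best_prob:
            best_seq, best_prob = seq, p
    return best_seq, best_prob
```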

## Beam Search

Decisions about sequence searching strategies lie on a spectrum, with easy questions at either extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computational cost matters? Clearly, greedy search. A real-world application usually asks a complicated question, somewhere in between those two extremes.

*Beam search* is an improved version of greedy search. It has a hyperparameter named *beam size*, $k$. At time step 1, we select the $k$ tokens with the highest conditional probabilities. Each of them will be the first token of one of the $k$ candidate output sequences. At each subsequent time step, based on the $k$ candidate output sequences at the previous time step, we continue to select the $k$ candidate output sequences with the highest conditional probabilities from $k\left|\mathcal{Y}\right|$ possible choices.
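
Below is a minimal sketch of this procedure. It reuses the hypothetical `predict_step(prefix, context)` from the greedy sketch above, works in log space for numerical stability, and omits "&lt;eos&gt;" handling and length normalization (discussed below) for clarity.

```python
import numpy as np

def beam_search(predict_step, context, vocab_size, k, max_len):
    """Keep the k most probable prefixes at every time step, chosen from
    the k * vocab_size single-token extensions of the current beams."""
    beams = [([], 0.0)]  # (prefix, log-probability) pairs
    for _ in range(max_len):
        candidates = []
        for prefix, log_p in beams:
            log_probs = np.log(predict_step(prefix, context))
            for y in range(vocab_size):
                candidates.append((prefix + [y], log_p + log_probs[y]))
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:k]  # prune to the k best candidates
    return beams
```
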
![The process of beam search (beam size: 2, maximum length of an output sequence: 3). The candidate output sequences are $A$, $C$, $AB$, $CE$, $ABD$, and $CED$.](../img/beam-search.svg)
:label:`fig_beam-search`

:numref:`fig_beam-search` demonstrates the process of beam search with an example. Suppose that the output vocabulary contains only five elements: $\mathcal{Y} = \{A, B, C, D, E\}$, where one of them is "&lt;eos&gt;". Let the beam size be 2 and the maximum length of an output sequence be 3. At time step 1, suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say $P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$. Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say $P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$. As a result, we get six candidate output sequences: (i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.

In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., discard portions including and after "&lt;eos&gt;"). Then we choose the sequence with the highest of the following scores as the output sequence:

$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:`eq_beam-search-score`

where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75. Since a longer sequence has more logarithmic terms in the summation of :eqref:`eq_beam-search-score`, the term $L^\alpha$ in the denominator penalizes long sequences.
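
For instance, the score in :eqref:`eq_beam-search-score` can be computed as follows; `step_probs` is assumed to hold the per-step conditional probabilities of a final candidate sequence:

```python
import math

def sequence_score(step_probs, alpha=0.75):
    """Length-normalized log-probability of a final candidate sequence."""
    L = len(step_probs)
    return sum(math.log(p) for p in step_probs) / (L ** alpha)

# E.g., the greedy sequence "A", "B", "C", "<eos>" from fig_s2s-prob1:
print(sequence_score([0.5, 0.4, 0.4, 0.6]))  # approx. -1.07
```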

The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$. This result is in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with a beam size of 1. With a flexible choice of the beam size, beam search provides a tradeoff between accuracy and computational cost.

## Summary

* Sequence searching strategies include greedy search, exhaustive search, and beam search.
* Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.

## Exercises

1. Can we treat exhaustive search as a special type of beam search? Why or why not?
1. Apply beam search in the machine translation problem in :numref:`sec_seq2seq`. How does the beam size affect the translation results and the prediction speed?
1. We used language modeling for generating text following user-provided prefixes in :numref:`sec_rnn_scratch`. Which kind of search strategy does it use? Can you improve it?

[Discussions](https://discuss.d2l.ai/t/338)
