:label:sec_beam-search
In :numref:sec_seq2seq,
we introduced the encoder-decoder architecture,
and the standard techniques for training such models end-to-end.
However, when it came to test-time prediction,
we mentioned only the greedy strategy,
where we select at each time step
the token with the highest
predicted probability of coming next,
until, at some time step,
we find that we have predicted
the special end-of-sequence "<eos>" token.
In this section, we will begin
by formalizing this greedy search strategy
and identifying some problems
that practitioners tend to run into.
Subsequently, we compare this strategy
with two alternatives:
exhaustive search (illustrative but not practical)
and beam search (the standard method in practice).
Let's begin by setting up our mathematical notation,
borrowing conventions from :numref:sec_seq2seq.
At any time step $t'$,
the decoder outputs predictions
for the probability of each token in the vocabulary
coming next in the sequence,
conditioned on the previous tokens $y_1, \ldots, y_{t'-1}$
and the context variable $\mathbf{c}$
produced by the encoder to represent the input sequence.
To quantify computational cost,
denote by $\mathcal{Y}$ the output vocabulary
(including the special end-of-sequence token "<eos>")
and by $T'$ the maximum number of tokens of an output sequence,
so that our goal is to search for an ideal output
among all $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ possible output sequences.
Consider the simple greedy search strategy from :numref:sec_seq2seq.
Here, at any time step $t'$,
we simply select the token with the highest conditional probability
from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}).$$
Once our model outputs "<eos>"
(or we reach the maximum length $T'$),
the output sequence is completed.
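To make the procedure concrete, here is a minimal sketch of greedy decoding in plain Python with NumPy. The `predict_step` function is a hypothetical stand-in, not an API from this book: we assume it maps the tokens generated so far, together with the context variable, to a probability distribution over the vocabulary.

```python
import numpy as np

def greedy_decode(predict_step, context, eos_id, max_len):
    """Greedy search: repeatedly take the argmax of the next-token
    distribution until "<eos>" is produced or max_len is reached.

    `predict_step(prefix, context)` is assumed to return an array of
    next-token probabilities over the vocabulary (a modeling assumption).
    """
    prefix = []
    for _ in range(max_len):
        probs = predict_step(prefix, context)
        next_token = int(np.argmax(probs))  # the argmax in the equation above
        prefix.append(next_token)
        if next_token == eos_id:
            break
    return prefix
```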
This strategy might look reasonable,
and in fact it's not so bad!
Considering how computationally undemanding it is,
you'd be hard pressed to get more bang for your buck.
However, if we put aside efficiency for a minute,
it might seem more reasonable to search
for the most likely sequence,
not the sequence of (greedily selected) most likely tokens.
It turns out that these two objects can be quite different.
The most likely sequence is the one that maximizes the expression
$\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$,
and there is no guarantee that greedy search will give us this sequence.
Let's illustrate it with an example.
Suppose that there are four tokens
"A", "B", "C", and "<eos>" in the output dictionary.
In :numref:fig_s2s-prob1,
the four numbers under each time step represent
the conditional probabilities of generating "A", "B", "C",
and "<eos>" at that time step, respectively.
At each time step, greedy search selects
the token with the highest conditional probability.
Therefore, the output sequence "A", "B", "C", and "<eos>"
will be predicted (:numref:fig_s2s-prob1).
The conditional probability of this output sequence
is $0.5\times0.4\times0.4\times0.6 = 0.048$.
Next, let's look at another example in :numref:fig_s2s-prob2.
Unlike in :numref:fig_s2s-prob1,
at time step 2 we select the token "C"
in :numref:fig_s2s-prob2,
which has the second highest conditional probability.
Since the output subsequences at time steps 1 and 2,
on which time step 3 is based,
have changed from "A" and "B" in :numref:fig_s2s-prob1
to "A" and "C" in :numref:fig_s2s-prob2,
the conditional probability of each token
at time step 3 has also changed in :numref:fig_s2s-prob2.
Suppose that we choose the token "B" at time step 3.
Now time step 4 is conditional on
the output subsequence at the first three time steps
"A", "C", and "B",
which is different from "A", "B", and "C" in :numref:fig_s2s-prob1.
Therefore, the conditional probability of generating
each token at time step 4 in :numref:fig_s2s-prob2
is also different from that in :numref:fig_s2s-prob1.
As a result, the conditional probability of the output sequence
"A", "C", "B", and "<eos>" in :numref:fig_s2s-prob2
is $0.5\times0.3\times0.6\times0.6 = 0.054$,
which is greater than that of greedy search in :numref:fig_s2s-prob1.
In this example, the output sequence "A", "B", "C", and "<eos>"
obtained by the greedy search is not the optimal sequence.
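A quick computation confirms the two products above:

```python
from math import prod

greedy = prod([0.5, 0.4, 0.4, 0.6])       # "A", "B", "C", "<eos>"
alternative = prod([0.5, 0.3, 0.6, 0.6])  # "A", "C", "B", "<eos>"
print(greedy, alternative)  # roughly 0.048 and 0.054
```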
If the goal is to obtain the most likely sequence, we may consider using exhaustive search: exhaustively enumerate all the possible output sequences with their conditional probabilities, and then output the one that scores the highest predicted probability.
While this would certainly give us what we desire,
it would come at a prohibitive computational cost
of $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$:
exponential in the sequence length,
with an enormous base given by the vocabulary size.
By contrast, the computational cost of greedy search is only
$\mathcal{O}(\left|\mathcal{Y}\right|T')$:
miraculously cheap but far from optimal.
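To get a feel for the difference, suppose (as a made-up but plausible setting) a vocabulary of size $|\mathcal{Y}| = 10000$ and a maximum length of $T' = 10$:

```python
vocab_size, max_len = 10000, 10

exhaustive = vocab_size ** max_len  # every possible sequence of length T'
greedy = vocab_size * max_len       # one scan of the vocabulary per step
print(f"exhaustive: {exhaustive:.1e} sequences, greedy: {greedy} evaluations")
# exhaustive: 1.0e+40 sequences, greedy: 100000 evaluations
```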
You could view sequence decoding strategies as lying on a spectrum,
with beam search striking a compromise
between the efficiency of greedy search
and the optimality of exhaustive search.
The most straightforward version of beam search
is characterized by a single hyperparameter,
the beam size, $k$.
At time step 1, we select the $k$ tokens
with the highest predicted probabilities.
Each of them will be the first token
of $k$ candidate output sequences, respectively.
At each subsequent time step,
based on the $k$ candidate output sequences
at the previous time step,
we continue to select $k$ candidate output sequences
with the highest predicted probabilities
from the $k\left|\mathcal{Y}\right|$ possible choices.
:numref:fig_beam-search
demonstrates the
process of beam search with an example.
Suppose that the output vocabulary
contains only five elements:
$\mathcal{Y} = \{A, B, C, D, E\}$,
where one of them is "<eos>".
Let the beam size be 2
and the maximum length of an output sequence be 3.
At time step 1,
suppose that the tokens with the highest conditional probabilities
$P(y_1 \mid \mathbf{c})$ are $A$ and $C$.
At time step 2, for all $y_2 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$.
Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\ P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$.
As a result, we get six candidate output sequences:
(i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$;
(v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.
In the end, we obtain the set of final candidate output sequences
based on these six sequences
(e.g., discard portions including and after "<eos>").
Then we choose as the output sequence
the one with the highest value of the following score:
$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}\mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:eq_beam-search-score
where $L$ is the length of the final candidate sequence
and $\alpha$ is usually set to 0.75.
Since a longer sequence has more logarithmic terms
in the summation of :eqref:eq_beam-search-score,
the term $L^\alpha$ in the denominator penalizes long sequences.
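Putting the pieces together, below is a compact sketch of beam search that scores finished candidates with :eqref:eq_beam-search-score. As with the greedy sketch above, `predict_step` is an assumed stand-in for the model; we accumulate log probabilities for numerical stability.

```python
import numpy as np

def beam_search(predict_step, context, beam_size, eos_id, max_len, alpha=0.75):
    """Keep the `beam_size` most likely prefixes at each step and pick
    the finished candidate with the best length-normalized score."""
    beams = [([], 0.0)]  # (prefix, accumulated log-probability)
    finished = []
    for _ in range(max_len):
        # Expand every live prefix by every vocabulary token.
        candidates = []
        for prefix, logp in beams:
            log_probs = np.log(predict_step(prefix, context))
            for token, lp in enumerate(log_probs):
                candidates.append((prefix + [token], logp + lp))
        # Keep the beam_size expansions with the highest probabilities.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for prefix, logp in candidates[:beam_size]:
            if prefix[-1] == eos_id:
                finished.append((prefix, logp))  # sequence is complete
            else:
                beams.append((prefix, logp))
        if not beams:
            break
    finished.extend(beams)  # prefixes still live at the maximum length
    # Choose by the score (1 / L**alpha) * log P(y_1, ..., y_L | c).
    best = max(finished, key=lambda c: c[1] / len(c[0]) ** alpha)
    return best[0]
```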
The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$,
which lies in between that of greedy search and that of exhaustive search.
Indeed, greedy search can be treated as the special case of beam search
arising when the beam size is set to 1.
Sequence searching strategies include greedy search, exhaustive search, and beam search. Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.
- Can we treat exhaustive search as a special type of beam search? Why or why not?
- Apply beam search in the machine translation problem in :numref:sec_seq2seq. How does the beam size affect the translation results and the prediction speed?
- We used language modeling for generating text following user-provided prefixes in :numref:sec_rnn-scratch. Which kind of search strategy does it use? Can you improve it?