:label:sec_beam-search
In :numref:sec_seq2seq, we predicted the output sequence token by token until the special end-of-sequence "<eos>" token was predicted.
In this section, we will begin by formalizing this greedy search strategy and exploring issues with it, then compare it with two alternatives: exhaustive search and beam search.
Before a formal introduction to greedy search, let us formalize the search problem using the same mathematical notation from :numref:sec_seq2seq. At any time step $t'$, the probability of the decoder output $y_{t'}$ is conditional on the output subsequence $y_1, \ldots, y_{t'-1}$ before $t'$ and the context variable $\mathbf{c}$ that encodes the information of the input sequence. To quantify computational cost, denote by $\mathcal{Y}$ (which contains "<eos>") the output vocabulary, whose cardinality $\left|\mathcal{Y}\right|$ is the vocabulary size, and denote by $T'$ the maximum number of tokens of an output sequence. Our goal is then to search for an ideal output from all $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ possible output sequences. Of course, for all these output sequences, portions including and after "<eos>" will be discarded in the actual output.
First, let us take a look at
a simple strategy: greedy search.
This strategy has been used to predict sequences in :numref:sec_seq2seq.
In greedy search, at any time step $t'$ of the output sequence, we search for the token with the highest conditional probability from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

as the output. Once "<eos>" is outputted or the output sequence has reached its maximum length $T'$, the output sequence is completed.
So what can go wrong with greedy search? In fact, the optimal sequence should be the output sequence with the maximum $\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$, which is the conditional probability of generating an output sequence based on the input sequence. Unfortunately, there is no guarantee that greedy search will obtain the optimal sequence.
Let us illustrate it with an example. Suppose that there are four tokens "A", "B", "C", and "<eos>" in the output dictionary. In :numref:fig_s2s-prob1, the four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "<eos>" at that time step, respectively. At each time step, greedy search selects the token with the highest conditional probability. Therefore, the output sequence "A", "B", "C", and "<eos>" will be predicted in :numref:fig_s2s-prob1. The conditional probability of this output sequence is $0.5\times0.4\times0.4\times0.6 = 0.048$.
Next, let us look at another example in :numref:fig_s2s-prob2. Unlike in :numref:fig_s2s-prob1, at time step 2 we select the token "C" in :numref:fig_s2s-prob2, which has the second highest conditional probability. Since the output subsequences at time steps 1 and 2, on which time step 3 is based, have changed from "A" and "B" in :numref:fig_s2s-prob1 to "A" and "C" in :numref:fig_s2s-prob2, the conditional probability of each token at time step 3 has also changed in :numref:fig_s2s-prob2.
Suppose that we choose the token "B" at time step 3. Now time step 4 is conditional on the output subsequence at the first three time steps, "A", "C", and "B", which is different from "A", "B", and "C" in :numref:fig_s2s-prob1. Therefore, the conditional probability of generating each token at time step 4 in :numref:fig_s2s-prob2 is also different from that in :numref:fig_s2s-prob1. As a result, the conditional probability of the output sequence "A", "C", "B", and "<eos>" in :numref:fig_s2s-prob2 is $0.5\times0.3\times0.6\times0.6=0.054$, which is greater than that of the greedy search output in :numref:fig_s2s-prob1.
In this example,
the output sequence "A", "B", "C", and "<eos>" obtained by the greedy search is not an optimal sequence.
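As a sanity check, we can multiply the per-step conditional probabilities quoted above (a toy calculation for this example, not code from the book):

```python
from math import prod

# Conditional probabilities of "A", "B", "C", "<eos>" along the greedy path in fig_s2s-prob1
greedy_path = [0.5, 0.4, 0.4, 0.6]
# Conditional probabilities of "A", "C", "B", "<eos>" along the alternative path in fig_s2s-prob2
alternative_path = [0.5, 0.3, 0.6, 0.6]

print(prod(greedy_path))       # ~0.048
print(prod(alternative_path))  # ~0.054 > 0.048: greedy search missed a better sequence
```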
If the goal is to obtain the optimal sequence, we may consider using exhaustive search: exhaustively enumerate all the possible output sequences with their conditional probabilities, then output the one with the highest conditional probability.
Although we can use exhaustive search to obtain the optimal sequence, its computational cost of $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ is likely to be excessively high. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we would need to evaluate $10000^{10} = 10^{40}$ sequences: this is next to impossible. On the other hand, the computational cost of greedy search is $\mathcal{O}(\left|\mathcal{Y}\right|T')$, usually significantly smaller than that of exhaustive search. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we only need to evaluate $10000\times10=10^5$ tokens.
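A back-of-the-envelope calculation (a toy snippet, not from the book's code) makes the gap vivid:

```python
vocab_size, max_len = 10000, 10  # |Y| = 10000, T' = 10

print(vocab_size ** max_len)  # exhaustive search: 10**40 sequences to score
print(vocab_size * max_len)   # greedy search: only 10**5 token evaluations
```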
Decisions about sequence searching strategies lie on a spectrum, with easy questions at either extreme. What if only accuracy matters? Obviously, exhaustive search. What if only computational cost matters? Clearly, greedy search. A real-world application usually asks a complicated question, somewhere in between those two extremes.
Beam search is an improved version of greedy search. It has a hyperparameter named beam size, $k$. At time step 1, we select the $k$ tokens with the highest conditional probabilities. Each of them will be the first token of $k$ candidate output sequences, respectively. At each subsequent time step, based on the $k$ candidate output sequences at the previous time step, we continue to select the $k$ candidate output sequences with the highest conditional probabilities from $k\left|\mathcal{Y}\right|$ possible choices.
:numref:fig_beam-search
demonstrates the
process of beam search with an example.
Suppose that the output vocabulary contains only five elements: $\mathcal{Y} = \{A, B, C, D, E\}$, where one of them is "<eos>". Let the beam size be 2 and the maximum length of an output sequence be 3. At time step 1, suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute

$$P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),$$

$$P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),$$

and pick the largest two among these ten values, say $P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$. Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),$$

$$P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),$$

and pick the largest two among these ten values, say $P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$. As a result, we get six candidate output sequences: (i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.
In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., discard portions including and after "<eos>"). Then we choose, as the output sequence, the candidate that maximizes the following score:
$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L} \mid \mathbf{c}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:eq_beam-search-score
where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75. Since a longer sequence has more logarithmic terms in the summation of :eqref:eq_beam-search-score, the term $L^\alpha$ in the denominator penalizes long sequences.
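The following is a minimal Python sketch of this procedure, reusing the hypothetical `next_token_probs` interface from the greedy search sketch above. It follows the common variant that retires a sequence once it emits "<eos>" rather than collecting every intermediate prefix as in :numref:fig_beam-search, and it ranks the final candidates by the length-normalized score in :eqref:eq_beam-search-score.

```python
import math

def beam_search(next_token_probs, context, beam_size, max_len,
                eos="<eos>", alpha=0.75):
    """Beam search sketch; each beam entry is (token list, accumulated log-probability)."""
    beams = [([], 0.0)]
    finished = []  # candidate sequences that have emitted "<eos>"
    for _ in range(max_len):
        expansions = []
        for prefix, log_p in beams:
            for token, p in next_token_probs(prefix, context).items():
                if p > 0:  # skip zero-probability tokens to avoid log(0)
                    expansions.append((prefix + [token], log_p + math.log(p)))
        # Keep only the k expansions with the highest accumulated log-probability
        expansions.sort(key=lambda pair: pair[1], reverse=True)
        beams = expansions[:beam_size]
        finished += [b for b in beams if b[0][-1] == eos]
        beams = [b for b in beams if b[0][-1] != eos]
        if not beams:
            break
    finished += beams  # candidates that never emitted "<eos>" within max_len
    # Rank all candidates by the length-normalized score (1 / L**alpha) * log P
    def score(candidate):
        seq, log_p = candidate
        return log_p / len(seq) ** alpha
    return max(finished, key=score)[0]
```

Setting `beam_size=1` makes this sketch behave like greedy search.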
The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$, which lies in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with a beam size of 1. With a flexible choice of the beam size, beam search provides a tradeoff between accuracy and computational cost.
- Sequence searching strategies include greedy search, exhaustive search, and beam search.
- Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.
- Can we treat exhaustive search as a special type of beam search? Why or why not?
- Apply beam search in the machine translation problem in :numref:sec_seq2seq. How does the beam size affect the translation results and the prediction speed?
- We used language modeling for generating text following user-provided prefixes in :numref:sec_rnn_scratch. Which kind of search strategy does it use? Can you improve it?