# Beam Search
:label:`sec_beam-search`

In :numref:`sec_seq2seq`,
we predicted the output sequence token by token
until the special end-of-sequence "&lt;eos&gt;" token
was predicted.
In this section,
we will begin by formalizing this *greedy search* strategy
and exploring issues with it,
then compare it with two alternatives:
*exhaustive search* and *beam search*.

Before a formal introduction to greedy search,
let us formalize the search problem
using the same mathematical notation from :numref:`sec_seq2seq`.
At any time step $t'$,
the probability of the decoder output $y_{t'}$
is conditional on the output subsequence
$y_1, \ldots, y_{t'-1}$ before $t'$ and
the context variable $\mathbf{c}$ that
encodes the information of the input sequence.
To quantify computational cost,
denote by $\mathcal{Y}$ (which contains "&lt;eos&gt;")
the output vocabulary,
so the cardinality $\left|\mathcal{Y}\right|$ of this set
is the vocabulary size.
Let us also specify the maximum number of tokens
of an output sequence as $T'$.
Our goal is then to search for an ideal output
from all the
$\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$
possible output sequences.
Of course,
for all these output sequences,
portions including and after "&lt;eos&gt;" will be discarded
in the actual output.

## Greedy Search

First, let us take a look at
a simple strategy: *greedy search*.
This strategy was used to predict sequences in :numref:`sec_seq2seq`.
In greedy search,
at any time step $t'$ of the output sequence,
we search for the token
with the highest conditional probability from $\mathcal{Y}$, i.e.,

$$y_{t'} = \operatorname*{argmax}_{y \in \mathcal{Y}} P(y \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$

as the output.
Once "&lt;eos&gt;" is output or the output sequence has reached its maximum length $T'$, the output sequence is completed.
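To make this procedure concrete, here is a minimal sketch of greedy search in Python. The `cond_prob` function is a hypothetical stand-in for a trained decoder: just a fixed table of toy conditional probabilities, not a real model.

```python
# A minimal sketch of greedy search. `cond_prob` is a hypothetical
# stand-in for a trained decoder: a fixed table of toy conditional
# probabilities P(y | prefix, c) rather than a real model.
def cond_prob(prefix):
    table = {
        (): {"A": 0.5, "B": 0.2, "C": 0.2, "<eos>": 0.1},
        ("A",): {"A": 0.1, "B": 0.4, "C": 0.3, "<eos>": 0.2},
        ("A", "B"): {"A": 0.2, "B": 0.2, "C": 0.4, "<eos>": 0.2},
        ("A", "B", "C"): {"A": 0.1, "B": 0.2, "C": 0.1, "<eos>": 0.6},
    }
    return table[tuple(prefix)]

def greedy_search(max_len):
    seq = []
    for _ in range(max_len):
        probs = cond_prob(seq)
        # Pick the argmax token at this time step.
        y = max(probs, key=probs.get)
        seq.append(y)
        if y == "<eos>":
            break
    return seq

print(greedy_search(4))  # ['A', 'B', 'C', '<eos>']
```

Note that each step depends only on the prefix chosen so far, which is exactly why greedy search can miss a better sequence overall.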
So what can go wrong with greedy search?
In fact,
the *optimal sequence*
should be the output sequence
with the maximum
$\prod_{t'=1}^{T'} P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c})$,
which is
the conditional probability of generating an output sequence based on the input sequence.
Unfortunately, there is no guarantee
that the optimal sequence will be obtained
by greedy search.


:label:`fig_s2s-prob1`

Let us illustrate this with an example.
Suppose that there are four tokens
"A", "B", "C", and "&lt;eos&gt;" in the output dictionary.
In :numref:`fig_s2s-prob1`,
the four numbers under each time step represent the conditional probabilities of generating "A", "B", "C", and "&lt;eos&gt;" at that time step, respectively.
At each time step,
greedy search selects the token with the highest conditional probability.
Therefore, the output sequence "A", "B", "C", and "&lt;eos&gt;" will be predicted
in :numref:`fig_s2s-prob1`.
The conditional probability of this output sequence is $0.5\times0.4\times0.4\times0.6 = 0.048$.

:label:`fig_s2s-prob2`


Next, let us look at another example
in :numref:`fig_s2s-prob2`.
Unlike in :numref:`fig_s2s-prob1`,
at time step 2
we select the token "C"
in :numref:`fig_s2s-prob2`,
which has the *second* highest conditional probability.
Since the output subsequences at time steps 1 and 2,
on which time step 3 is based,
have changed from "A" and "B" in :numref:`fig_s2s-prob1` to "A" and "C" in :numref:`fig_s2s-prob2`,
the conditional probability of each token
at time step 3 has also changed in :numref:`fig_s2s-prob2`.
Suppose that we choose the token "B" at time step 3.
Now time step 4 is conditional on
the output subsequence at the first three time steps
"A", "C", and "B",
which is different from "A", "B", and "C" in :numref:`fig_s2s-prob1`.
Therefore, the conditional probability of generating each token at time step 4 in :numref:`fig_s2s-prob2` is also different from that in :numref:`fig_s2s-prob1`.
As a result,
the conditional probability of the output sequence "A", "C", "B", and "&lt;eos&gt;"
in :numref:`fig_s2s-prob2`
is $0.5\times0.3 \times0.6\times0.6=0.054$,
which is greater than that of greedy search in :numref:`fig_s2s-prob1`.
In this example,
the output sequence "A", "B", "C", and "&lt;eos&gt;" obtained by greedy search is not an optimal sequence.
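The two sequence probabilities above can be checked directly by multiplying the per-step conditional probabilities from the two figures:

```python
# Product of per-step conditional probabilities for the two sequences.
p_greedy = 0.5 * 0.4 * 0.4 * 0.6  # "A", "B", "C", "<eos>"
p_alt = 0.5 * 0.3 * 0.6 * 0.6     # "A", "C", "B", "<eos>"
print(round(p_greedy, 3), round(p_alt, 3))  # 0.048 0.054
print(p_alt > p_greedy)  # True: greedy search missed a better sequence
```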
## Exhaustive Search

If the goal is to obtain the optimal sequence, we may consider using *exhaustive search*:
exhaustively enumerate all the possible output sequences with their conditional probabilities,
then output the one
with the highest conditional probability.

Although we can use exhaustive search to obtain the optimal sequence,
its computational cost $\mathcal{O}(\left|\mathcal{Y}\right|^{T'})$ is likely to be excessively high.
For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we will need to evaluate $10000^{10} = 10^{40}$ sequences. This is next to impossible!
On the other hand,
the computational cost of greedy search is
$\mathcal{O}(\left|\mathcal{Y}\right|T')$:
it is usually significantly smaller than
that of exhaustive search. For example, when $|\mathcal{Y}|=10000$ and $T'=10$, we only need to evaluate $10000\times10=10^5$ sequences.

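The gap between these two costs can be verified with the numbers used above:

```python
# Compare search costs for |Y| = 10000 and T' = 10, as in the text.
vocab_size, max_len = 10000, 10

exhaustive_cost = vocab_size ** max_len  # number of sequences to evaluate
greedy_cost = vocab_size * max_len       # number of token evaluations

print(exhaustive_cost == 10 ** 40)  # True
print(greedy_cost == 10 ** 5)       # True
```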

## Beam Search

Decisions about sequence searching strategies
lie on a spectrum,
with easy questions at either extreme.
What if only accuracy matters?
Obviously, exhaustive search.
What if only computational cost matters?
Clearly, greedy search.
Real-world applications usually ask
a more complicated question,
somewhere in between those two extremes.
*Beam search* is an improved version of greedy search. It has a hyperparameter named *beam size*, $k$.
At time step 1,
we select the $k$ tokens with the highest conditional probabilities.
Each of them will be the first token of
one of $k$ candidate output sequences.
At each subsequent time step,
based on the $k$ candidate output sequences
at the previous time step,
we continue to select the $k$ candidate output sequences
with the highest conditional probabilities
from the $k\left|\mathcal{Y}\right|$ possible choices.

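The selection step just described can be sketched in Python as follows. This is only an illustrative sketch: `cond_prob(prefix)` is a hypothetical interface standing in for a trained decoder, and for simplicity candidates ending in "&lt;eos&gt;" are still extended, which a full implementation would avoid.

```python
import math

# A minimal sketch of beam search. `cond_prob(prefix)` is a hypothetical
# stand-in for a trained decoder, returning P(y | prefix, c) as a dict.
# For simplicity, sequences ending in "<eos>" are still extended here;
# a full implementation would set them aside instead.
def beam_search(cond_prob, k, max_len):
    beams = [(0.0, [])]  # (log probability, sequence) pairs
    candidates = []      # candidates collected at every time step
    for _ in range(max_len):
        expansions = []
        for logp, seq in beams:
            for y, p in cond_prob(seq).items():
                expansions.append((logp + math.log(p), seq + [y]))
        # Keep the k best of the k * |Y| expansions.
        expansions.sort(key=lambda c: c[0], reverse=True)
        beams = expansions[:k]
        candidates.extend(beams)
    return candidates

# Toy model whose probabilities do not depend on the prefix.
def toy(prefix):
    return {"A": 0.5, "B": 0.3, "<eos>": 0.2}

for logp, seq in beam_search(toy, k=2, max_len=2):
    print(seq)
```

With $k=2$ and a maximum length of 2, the sketch collects $k$ candidates at each of the two time steps, four in total, mirroring the six candidates in the worked example below.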

:label:`fig_beam-search`


:numref:`fig_beam-search` demonstrates the
process of beam search with an example.
Suppose that the output vocabulary
contains only five elements:
$\mathcal{Y} = \{A, B, C, D, E\}$,
where one of them is "&lt;eos&gt;".
Let the beam size be 2 and
the maximum length of an output sequence be 3.
At time step 1,
suppose that the tokens with the highest conditional probabilities $P(y_1 \mid \mathbf{c})$ are $A$ and $C$. At time step 2, for all $y_2 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, y_2 \mid \mathbf{c}) = P(A \mid \mathbf{c})P(y_2 \mid A, \mathbf{c}),\\ P(C, y_2 \mid \mathbf{c}) = P(C \mid \mathbf{c})P(y_2 \mid C, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B \mid \mathbf{c})$ and $P(C, E \mid \mathbf{c})$.
Then at time step 3, for all $y_3 \in \mathcal{Y}$, we compute

$$\begin{aligned}P(A, B, y_3 \mid \mathbf{c}) = P(A, B \mid \mathbf{c})P(y_3 \mid A, B, \mathbf{c}),\\P(C, E, y_3 \mid \mathbf{c}) = P(C, E \mid \mathbf{c})P(y_3 \mid C, E, \mathbf{c}),\end{aligned}$$

and pick the largest two among these ten values, say
$P(A, B, D \mid \mathbf{c})$ and $P(C, E, D \mid \mathbf{c})$.
As a result, we get six candidate output sequences: (i) $A$; (ii) $C$; (iii) $A$, $B$; (iv) $C$, $E$; (v) $A$, $B$, $D$; and (vi) $C$, $E$, $D$.


In the end, we obtain the set of final candidate output sequences based on these six sequences (e.g., by discarding portions including and after "&lt;eos&gt;").
Then
we choose the sequence with the highest value of the following score as the output sequence:

$$ \frac{1}{L^\alpha} \log P(y_1, \ldots, y_{L}) = \frac{1}{L^\alpha} \sum_{t'=1}^L \log P(y_{t'} \mid y_1, \ldots, y_{t'-1}, \mathbf{c}),$$
:eqlabel:`eq_beam-search-score`

where $L$ is the length of the final candidate sequence and $\alpha$ is usually set to 0.75.
Since a longer sequence has more logarithmic terms in the summation of :eqref:`eq_beam-search-score`,
the term $L^\alpha$ in the denominator penalizes
long sequences.

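Once the per-token log probabilities of a candidate are available, this score is a one-line computation. The following is a minimal sketch; the `sequence_score` name is our own, not from any library.

```python
import math

def sequence_score(log_probs, alpha=0.75):
    """Length-normalized log-probability with the L**alpha penalty."""
    L = len(log_probs)
    return sum(log_probs) / (L ** alpha)

# Score the candidate with per-step probabilities 0.5, 0.4, 0.4, 0.6
# (the greedy sequence from the earlier example).
score = sequence_score([math.log(p) for p in [0.5, 0.4, 0.4, 0.6]])
print(score)
```

Dividing by $L^\alpha$ rather than $L$ keeps some preference for sequences whose every token is probable, while still dampening the raw advantage that short sequences have under an unnormalized log-probability.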
The computational cost of beam search is $\mathcal{O}(k\left|\mathcal{Y}\right|T')$.
This result is in between that of greedy search and that of exhaustive search. In fact, greedy search can be treated as a special type of beam search with
a beam size of 1.
With a flexible choice of the beam size,
beam search provides a tradeoff between
accuracy and computational cost.


## Summary

* Sequence searching strategies include greedy search, exhaustive search, and beam search.
* Beam search provides a tradeoff between accuracy and computational cost via its flexible choice of the beam size.


## Exercises

1. Can we treat exhaustive search as a special type of beam search? Why or why not?
1. Apply beam search in the machine translation problem in :numref:`sec_seq2seq`. How does the beam size affect the translation results and the prediction speed?
1. We used language modeling for generating text following user-provided prefixes in :numref:`sec_rnn_scratch`. Which kind of search strategy does it use? Can you improve it?

[Discussions](https://discuss.d2l.ai/t/338)