Post edit bi and deep rnns
astonzhang committed Aug 26, 2022
1 parent 9affcf0 commit 5571434
Showing 2 changed files with 14 additions and 52 deletions.
47 changes: 5 additions & 42 deletions chapter_recurrent-modern/bi-rnn.md
@@ -3,7 +3,7 @@

So far, our working example of a sequence learning task has been language modeling,
where we aim to predict the next token given all previous tokens in a sequence.
In this scenario, we wish only to condition upon the left-ward context,
In this scenario, we wish only to condition upon the leftward context,
and thus the unidirectional chaining of a standard RNN seems appropriate.
However, there are many other sequence learning tasks
where it's perfectly fine to condition the prediction at every time step
@@ -29,10 +29,10 @@ but "not" seems incompatible with the third sentences.


Fortunately, a simple technique transforms any unidirectional RNN
into a bidirectional RNN.
into a bidirectional RNN :cite:`Schuster.Paliwal.1997`.
We simply implement two unidirectional RNN layers
chained together in opposite directions
and acting on the same input.
and acting on the same input (:numref:`fig_birnn`).
For the first RNN layer,
the first input is $\mathbf{x}_1$
and the last input is $\mathbf{x}_T$,
@@ -67,7 +67,7 @@ $$
\end{aligned}
$$

where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}, \mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}, \text{ and } \mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h} \text{ and } \mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.
where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}, \mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}, \text{ and } \mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $\mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.

Next, we concatenate the forward and backward hidden states
$\overrightarrow{\mathbf{H}}_t$ and $\overleftarrow{\mathbf{H}}_t$
@@ -144,24 +144,6 @@ def forward(self, inputs, Hs=None):
    return outputs, (f_H, b_H)
```
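Since the `BiRNNScratch` code is mostly collapsed in this diff, here is a minimal self-contained sketch of the idea (plain PyTorch with arbitrary toy sizes, `tanh` standing in for the activation $\phi$, and hypothetical variable names); it illustrates the shapes in the text rather than reproducing the book's implementation:

```python
import torch

T, n, d, h = 5, 2, 8, 16            # time steps, batch size, inputs, hidden units
X = torch.randn(T, n, d)            # toy input sequence

# Separate parameters for the forward (f) and backward (b) directions
W_xh_f, W_hh_f, b_h_f = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)
W_xh_b, W_hh_b, b_h_b = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)

f_H, b_H = torch.zeros(n, h), torch.zeros(n, h)
f_states, b_states = [], [None] * T
for t in range(T):                                      # left to right
    f_H = torch.tanh(X[t] @ W_xh_f + f_H @ W_hh_f + b_h_f)
    f_states.append(f_H)
for t in reversed(range(T)):                            # right to left
    b_H = torch.tanh(X[t] @ W_xh_b + b_H @ W_hh_b + b_h_b)
    b_states[t] = b_H

# Concatenate the two directions' states along the feature dimension
H = [torch.cat((f, b), dim=-1) for f, b in zip(f_states, b_states)]
print(H[0].shape)                    # torch.Size([2, 32]), i.e. (n, 2h)
```

An output layer acting on each concatenated state would therefore take inputs of width $2h$.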

The training procedure is the same
as in :numref:`sec_rnn-scratch`.

```{.python .input}
%%tab all
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
if tab.selected('mxnet', 'pytorch'):
    birnn = BiRNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
    model = d2l.RNNLMScratch(birnn, vocab_size=len(data.vocab), lr=2)
    trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1, num_gpus=1)
if tab.selected('tensorflow'):
    with d2l.try_gpu():
        birnn = BiRNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
        model = d2l.RNNLMScratch(birnn, vocab_size=len(data.vocab), lr=2)
    trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1)
trainer.fit(model, data)
```

### Concise Implementation

Using the high-level APIs,
@@ -181,28 +163,9 @@ class BiGRU(d2l.RNN):
        self.num_hiddens *= 2
```
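The collapsed `BiGRU` class above only shows that the hidden width is doubled. As a rough sketch of the same idea in plain PyTorch (an assumption for illustration, not the d2l wrapper), a GRU is made bidirectional with a single flag, after which its outputs carry `2 * num_hiddens` features:

```python
import torch
from torch import nn

num_inputs, num_hiddens = 28, 32        # arbitrary example sizes
gru = nn.GRU(num_inputs, num_hiddens, bidirectional=True)

X = torch.randn(35, 4, num_inputs)      # (num_steps, batch_size, num_inputs)
outputs, state = gru(X)
print(outputs.shape)                    # torch.Size([35, 4, 64]): 2 * num_hiddens features
print(state.shape)                      # torch.Size([2, 4, 32]): one final state per direction
```

This doubling of the feature dimension is exactly what the `self.num_hiddens *= 2` line above accounts for.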

```{.python .input}
%%tab mxnet, pytorch
gru = BiGRU(num_inputs=len(data.vocab), num_hiddens=32)
if tab.selected('mxnet', 'pytorch'):
    model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
if tab.selected('tensorflow'):
    with d2l.try_gpu():
        model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
```

```{.python .input}
%%tab mxnet, pytorch
model.predict('it has', 20, data.vocab, d2l.try_gpu())
```

For a discussion of more effective uses of bidirectional RNNs,
please see the sentiment analysis application in :numref:`sec_sentiment_rnn`.

## Summary

In bidirectional RNNs, the hidden state for each time step is simultaneously determined by the data prior to and after the current time step. Bidirectional RNNs bear a striking resemblance to the forward-backward algorithm in probabilistic graphical models. Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observations given bidirectional context. Bidirectional RNNs are very costly to train due to long gradient chains.
In bidirectional RNNs, the hidden state for each time step is simultaneously determined by the data prior to and after the current time step. Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observations given bidirectional context. Bidirectional RNNs are very costly to train due to long gradient chains.

## Exercises

19 changes: 9 additions & 10 deletions chapter_recurrent-modern/deep-rnn.md
@@ -20,13 +20,13 @@ However, we often also wish to retain the ability
to express complex relationships
between the inputs at a given time step
and the outputs at that same time step.
Thus we often construct RNN's that are deep
Thus we often construct RNNs that are deep
not only in the time direction
but also in the input-to-output direction.
This is precisely the notion of depth
that we have already encountered
in our development of multilayer perceptrons
and deep convolutional neural networks.
in our development of MLPs
and deep CNNs.


The standard method for building this sort of deep RNN
@@ -38,9 +38,9 @@ In this short section, we illustrate this design pattern
and present a simple example for how to code up such stacked RNNs.
Below, in :numref:`fig_deep_rnn`, we illustrate
a deep RNN with $L$ hidden layers.
Each hidden state operates on a sequentual input
Each hidden state operates on a sequential input
and produces a sequential output.
Moreover each RNN cell at each time step
Moreover, any RNN cell (white box in :numref:`fig_deep_rnn`) at each time step
depends on both the same layer's
value at the previous time step
and the previous layer's value
@@ -53,8 +53,7 @@ Formally, suppose that we have a minibatch input
$\mathbf{X}_t \in \mathbb{R}^{n \times d}$
(number of examples: $n$, number of inputs in each example: $d$) at time step $t$.
At the same time step,
let the hidden state of the $l^\mathrm{th}$ hidden layer
($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$
let the hidden state of the $l^\mathrm{th}$ hidden layer ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$
(number of hidden units: $h$)
and the output layer variable be
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$
@@ -85,11 +84,11 @@ are the model parameters of the output layer.
Just as with MLPs, the number of hidden layers $L$
and the number of hidden units $h$ are hyperparameters
that we can tune.
Common RNN layer widths are in the range (64,2056),
and common depths are in the range (1, 8).
Common RNN layer widths ($h$) are in the range $(64, 2056)$,
and common depths ($L$) are in the range $(1, 8)$.
In addition, we can easily get a deep gated RNN
by replacing the hidden state computation in :eqref:`eq_deep_rnn_H`
with that from a GRU or an LSTM.
with that from an LSTM or a GRU.
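As a quick illustration of these hyperparameters, here is a hedged sketch in plain PyTorch (rather than the d2l classes used in the code below), where a deep gated RNN is obtained simply by passing a `num_layers` value greater than one; the sizes are arbitrary examples taken from the ranges quoted above:

```python
import torch
from torch import nn

num_inputs, num_hiddens, num_layers = 28, 64, 2   # h and L chosen from the ranges above
lstm = nn.LSTM(num_inputs, num_hiddens, num_layers=num_layers)

X = torch.randn(35, 4, num_inputs)                # (num_steps, batch_size, num_inputs)
outputs, (H, C) = lstm(X)
print(outputs.shape)  # torch.Size([35, 4, 64]): outputs come from the topmost layer
print(H.shape)        # torch.Size([2, 4, 64]): one final hidden state per layer
```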

```{.python .input}
%load_ext d2lbook.tab
