Post edit bi and deep rnns
astonzhang committed Aug 26, 2022
1 parent 9affcf0 commit 5571434
Showing 2 changed files with 14 additions and 52 deletions.
47 changes: 5 additions & 42 deletions chapter_recurrent-modern/bi-rnn.md
@@ -3,7 +3,7 @@

So far, our working example of a sequence learning task has been language modeling,
where we aim to predict the next token given all previous tokens in a sequence.
In this scenario, we wish only to condition upon the left-ward context,
In this scenario, we wish only to condition upon the leftward context,
and thus the unidirectional chaining of a standard RNN seems appropriate.
However, there are many other sequence learning tasks
where it's perfectly fine to condition the prediction at every time step
@@ -29,10 +29,10 @@ but "not" seems incompatible with the third sentences.


Fortunately, a simple technique transforms any unidirectional RNN
into a bidirectional RNN.
into a bidirectional RNN :cite:`Schuster.Paliwal.1997`.
We simply implement two unidirectional RNN layers
chained together in opposite directions
and acting on the same input.
and acting on the same input (:numref:`fig_birnn`).
For the first RNN layer,
the first input is $\mathbf{x}_1$
and the last input is $\mathbf{x}_T$,
@@ -67,7 +67,7 @@ $$
\end{aligned}
$$

where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}, \mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}, \text{ and } \mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h} \text{ and } \mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.
where the weights $\mathbf{W}_{xh}^{(f)} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh}^{(f)} \in \mathbb{R}^{h \times h}, \mathbf{W}_{xh}^{(b)} \in \mathbb{R}^{d \times h}, \text{ and } \mathbf{W}_{hh}^{(b)} \in \mathbb{R}^{h \times h}$, and biases $\mathbf{b}_h^{(f)} \in \mathbb{R}^{1 \times h}$ and $\mathbf{b}_h^{(b)} \in \mathbb{R}^{1 \times h}$ are all the model parameters.

Next, we concatenate the forward and backward hidden states
$\overrightarrow{\mathbf{H}}_t$ and $\overleftarrow{\mathbf{H}}_t$
@@ -144,24 +144,6 @@ def forward(self, inputs, Hs=None):
    return outputs, (f_H, b_H)
```
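Since the `BiRNNScratch` code is mostly collapsed in this diff, here is a minimal self-contained sketch of the idea (plain PyTorch with arbitrary toy sizes, `tanh` standing in for the activation $\phi$, and hypothetical variable names); it illustrates the shapes in the text rather than reproducing the book's implementation:

```python
import torch

T, n, d, h = 5, 2, 8, 16            # time steps, batch size, inputs, hidden units
X = torch.randn(T, n, d)            # toy input sequence

# Separate parameters for the forward (f) and backward (b) directions
W_xh_f, W_hh_f, b_h_f = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)
W_xh_b, W_hh_b, b_h_b = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)

f_H, b_H = torch.zeros(n, h), torch.zeros(n, h)
f_states, b_states = [], [None] * T
for t in range(T):                                      # left to right
    f_H = torch.tanh(X[t] @ W_xh_f + f_H @ W_hh_f + b_h_f)
    f_states.append(f_H)
for t in reversed(range(T)):                            # right to left
    b_H = torch.tanh(X[t] @ W_xh_b + b_H @ W_hh_b + b_h_b)
    b_states[t] = b_H

# Concatenate the two directions' states along the feature dimension
H = [torch.cat((f, b), dim=-1) for f, b in zip(f_states, b_states)]
print(H[0].shape)                    # torch.Size([2, 32]), i.e. (n, 2h)
```

An output layer acting on each concatenated state would therefore take inputs of width $2h$.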

The training procedure is the same
as in :numref:`sec_rnn-scratch`.

```{.python .input}
%%tab all
data = d2l.TimeMachine(batch_size=1024, num_steps=32)
if tab.selected('mxnet', 'pytorch'):
    birnn = BiRNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
    model = d2l.RNNLMScratch(birnn, vocab_size=len(data.vocab), lr=2)
    trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1, num_gpus=1)
if tab.selected('tensorflow'):
    with d2l.try_gpu():
        birnn = BiRNNScratch(num_inputs=len(data.vocab), num_hiddens=32)
        model = d2l.RNNLMScratch(birnn, vocab_size=len(data.vocab), lr=2)
    trainer = d2l.Trainer(max_epochs=50, gradient_clip_val=1)
trainer.fit(model, data)
```

### Concise Implementation

Using the high-level APIs,
@@ -181,28 +163,9 @@ class BiGRU(d2l.RNN):
        self.num_hiddens *= 2
```
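The collapsed `BiGRU` class above only shows that the hidden width is doubled. As a rough sketch of the same idea in plain PyTorch (an assumption for illustration, not the d2l wrapper), a GRU is made bidirectional with a single flag, after which its outputs carry `2 * num_hiddens` features:

```python
import torch
from torch import nn

num_inputs, num_hiddens = 28, 32        # arbitrary example sizes
gru = nn.GRU(num_inputs, num_hiddens, bidirectional=True)

X = torch.randn(35, 4, num_inputs)      # (num_steps, batch_size, num_inputs)
outputs, state = gru(X)
print(outputs.shape)                    # torch.Size([35, 4, 64]): 2 * num_hiddens features
print(state.shape)                      # torch.Size([2, 4, 32]): one final state per direction
```

This doubling of the feature dimension is exactly what the `self.num_hiddens *= 2` line above accounts for.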

```{.python .input}
%%tab mxnet, pytorch
gru = BiGRU(num_inputs=len(data.vocab), num_hiddens=32)
if tab.selected('mxnet', 'pytorch'):
    model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
if tab.selected('tensorflow'):
    with d2l.try_gpu():
        model = d2l.RNNLM(gru, vocab_size=len(data.vocab), lr=2)
trainer.fit(model, data)
```

```{.python .input}
%%tab mxnet, pytorch
model.predict('it has', 20, data.vocab, d2l.try_gpu())
```

For a discussion of more effective uses of bidirectional RNNs,
please see the sentiment analysis application in :numref:`sec_sentiment_rnn`.

## Summary

In bidirectional RNNs, the hidden state for each time step is simultaneously determined by the data prior to and after the current time step. Bidirectional RNNs bear a striking resemblance to the forward-backward algorithm in probabilistic graphical models. Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observations given bidirectional context. Bidirectional RNNs are very costly to train due to long gradient chains.
In bidirectional RNNs, the hidden state for each time step is simultaneously determined by the data prior to and after the current time step. Bidirectional RNNs are mostly useful for sequence encoding and the estimation of observations given bidirectional context. Bidirectional RNNs are very costly to train due to long gradient chains.

## Exercises

19 changes: 9 additions & 10 deletions chapter_recurrent-modern/deep-rnn.md
@@ -20,13 +20,13 @@ However, we often also wish to retain the ability
to express complex relationships
between the inputs at a given time step
and the outputs at that same time step.
Thus we often construct RNN's that are deep
Thus we often construct RNNs that are deep
not only in the time direction
but also in the input-to-output direction.
This is precisely the notion of depth
that we have already encountered
in our development of multilayer perceptrons
and deep convolutional neural networks.
in our development of MLPs
and deep CNNs.


The standard method for building this sort of deep RNN
@@ -38,9 +38,9 @@ In this short section, we illustrate this design pattern
and present a simple example for how to code up such stacked RNNs.
Below, in :numref:`fig_deep_rnn`, we illustrate
a deep RNN with $L$ hidden layers.
Each hidden state operates on a sequentual input
Each hidden state operates on a sequential input
and produces a sequential output.
Moreover each RNN cell at each time step
Moreover, any RNN cell (white box in :numref:`fig_deep_rnn`) at each time step
depends on both the same layer's
value at the previous time step
and the previous layer's value
@@ -53,8 +53,7 @@ Formally, suppose that we have a minibatch input
$\mathbf{X}_t \in \mathbb{R}^{n \times d}$
(number of examples: $n$, number of inputs in each example: $d$) at time step $t$.
At the same time step,
let the hidden state of the $l^\mathrm{th}$ hidden layer
($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$
let the hidden state of the $l^\mathrm{th}$ hidden layer ($l=1,\ldots,L$) be $\mathbf{H}_t^{(l)} \in \mathbb{R}^{n \times h}$
(number of hidden units: $h$)
and the output layer variable be
$\mathbf{O}_t \in \mathbb{R}^{n \times q}$
@@ -85,11 +84,11 @@ are the model parameters of the output layer.
Just as with MLPs, the number of hidden layers $L$
and the number of hidden units $h$ are hyperparameters
that we can tune.
Common RNN layer widths are in the range (64,2056),
and common depths are in the range (1, 8).
Common RNN layer widths ($h$) are in the range $(64, 2056)$,
and common depths ($L$) are in the range $(1, 8)$.
In addition, we can easily get a deep gated RNN
by replacing the hidden state computation in :eqref:`eq_deep_rnn_H`
with that from a GRU or an LSTM.
with that from an LSTM or a GRU.
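As a quick illustration of these hyperparameters, here is a hedged sketch in plain PyTorch (rather than the d2l classes used in the code below), where a deep gated RNN is obtained simply by passing a `num_layers` value greater than one; the sizes are arbitrary examples taken from the ranges quoted above:

```python
import torch
from torch import nn

num_inputs, num_hiddens, num_layers = 28, 64, 2   # h and L chosen from the ranges above
lstm = nn.LSTM(num_inputs, num_hiddens, num_layers=num_layers)

X = torch.randn(35, 4, num_inputs)                # (num_steps, batch_size, num_inputs)
outputs, (H, C) = lstm(X)
print(outputs.shape)  # torch.Size([35, 4, 64]): outputs come from the topmost layer
print(H.shape)        # torch.Size([2, 4, 64]): one final hidden state per layer
```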

```{.python .input}
%load_ext d2lbook.tab
