# Recurrent Neural Networks
🏷️sec_rnn

In :numref:sec_language_model
we introduced $n$-gram models,
where the conditional probability of word $x_t$ at time step $t$ only depends on the $n-1$ previous words.
If we want to incorporate the possible effect of words earlier than time step $t-(n-1)$ on $x_t$,
we need to increase $n$.
However, the number of model parameters would also increase exponentially with it,
as we need to store $|\mathcal{V}|^n$ numbers for a vocabulary set $\mathcal{V}$.
Hence, rather than modeling $P(x_t \mid x_{t-1}, \ldots, x_{t-n+1})$ it is preferable to use a latent variable model:

$$P(x_t \mid x_{t-1}, \ldots, x_1) \approx P(x_t \mid h_{t-1}),$$

where $h_{t-1}$ is a *hidden state* (also known as a hidden variable) that stores the sequence information up to time step $t-1$.
In general,
the hidden state at any time step $t$ could be computed based on both the current input $x_t$ and the previous hidden state $h_{t-1}$:

$$h_t = f(x_t, h_{t-1}).$$
:eqlabel:eq_ht_xt

For a sufficiently powerful function $f$ in :eqref:eq_ht_xt
, the latent variable model is not an approximation. After all, $h_t$ may simply store all the data it has observed so far. However, it could potentially make both computation and storage expensive.
Recall that we have discussed hidden layers with hidden units in :numref:chap_perceptrons
.
It is noteworthy that
hidden layers and hidden states refer to two very different concepts.
Hidden layers are, as explained, layers that are hidden from view on the path from input to output.
Hidden states are technically speaking inputs to whatever we do at a given step,
and they can only be computed by looking at data at previous time steps.
Recurrent neural networks (RNNs) are neural networks with hidden states. Before introducing the RNN model, we first revisit the MLP model introduced in :numref:sec_mlp
.
## Neural Networks without Hidden States

Let us take a look at an MLP with a single hidden layer.
Let the hidden layer's activation function be $\phi$.
Given a minibatch of examples $\mathbf{X} \in \mathbb{R}^{n \times d}$ with batch size $n$ and $d$ inputs, the hidden layer's output $\mathbf{H} \in \mathbb{R}^{n \times h}$ is calculated as

$$\mathbf{H} = \phi(\mathbf{X} \mathbf{W}_{xh} + \mathbf{b}_h).$$
:eqlabel:rnn_h_without_state

In :eqref:rnn_h_without_state
, we have the weight parameter $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}$, the bias parameter $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$, and the number of hidden units $h$ of the hidden layer.
Next, the hidden variable $\mathbf{H}$ is used as the input of the output layer, which is given by

$$\mathbf{O} = \mathbf{H} \mathbf{W}_{hq} + \mathbf{b}_q,$$

where $\mathbf{O} \in \mathbb{R}^{n \times q}$ is the output variable, $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$ is the weight parameter, and $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$ is the bias parameter of the output layer. For a classification problem, we can use $\text{softmax}(\mathbf{O})$ to compute the probability distribution over output categories.
This is entirely analogous to the regression problem we solved previously in :numref:sec_sequence
, hence we omit details.
Suffice it to say that we can pick feature-label pairs at random and learn the parameters of our network via automatic differentiation and stochastic gradient descent.
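To make the shapes concrete, the following is a minimal sketch of :eqref:rnn_h_without_state and the output layer, assuming PyTorch, hypothetical sizes, and $\tanh$ as the activation $\phi$:

```python
import torch

# A minimal sketch with hypothetical sizes: batch size n=3, inputs d=5,
# hidden units h=4, outputs q=2; tanh is chosen as the activation phi
n, d, h, q = 3, 5, 4, 2
X = torch.randn(n, d)                             # minibatch of examples
W_xh, b_h = torch.randn(d, h), torch.zeros(1, h)  # hidden-layer parameters
W_hq, b_q = torch.randn(h, q), torch.zeros(1, q)  # output-layer parameters
H = torch.tanh(X @ W_xh + b_h)                    # hidden variable, shape (n, h)
O = H @ W_hq + b_q                                # output variable, shape (n, q)
```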
## Recurrent Neural Networks with Hidden States
🏷️subsec_rnn_w_hidden_states
Matters are entirely different when we have hidden states. Let us look at the structure in some more detail.
Assume that we have
a minibatch of inputs $\mathbf{X}_t \in \mathbb{R}^{n \times d}$ at time step $t$.
In other words, for a minibatch of $n$ sequence examples, each row of $\mathbf{X}_t$ corresponds to one example at time step $t$ from the sequence.
Next, denote by $\mathbf{H}_t \in \mathbb{R}^{n \times h}$ the hidden variable of time step $t$.
Unlike the MLP, here we save the hidden variable $\mathbf{H}_{t-1}$ from the previous time step and introduce a new weight parameter $\mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$ to describe how to use the hidden variable of the previous time step in the current time step.
Specifically, the calculation of the hidden variable of the current time step is determined by the input of the current time step together with the hidden variable of the previous time step:

$$\mathbf{H}_t = \phi(\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh} + \mathbf{b}_h).$$
:eqlabel:rnn_h_with_state
Compared with :eqref:rnn_h_without_state
, :eqref:rnn_h_with_state
adds one more term $\mathbf{H}_{t-1} \mathbf{W}_{hh}$ and thus
instantiates :eqref:eq_ht_xt
.
From the relationship between hidden variables $\mathbf{H}_t$ and $\mathbf{H}_{t-1}$ of adjacent time steps,
we know that these variables capture and retain the sequence's historical information up to their current time step, just like the state or memory of the neural network at the current time step. Therefore, such a hidden variable is called a *hidden state*.
Since the hidden state at the current time step is defined in the same way as at the previous time step, the computation of :eqref:rnn_h_with_state
is recurrent. Hence, neural networks with hidden states
based on recurrent computation are named
recurrent neural networks.
Layers that perform
the computation of :eqref:rnn_h_with_state
in RNNs
are called recurrent layers.
There are many different ways of constructing RNNs.
RNNs with a hidden state defined by :eqref:rnn_h_with_state
are very common.
For time step $t$,
the output of the output layer is similar to the computation in the MLP:

$$\mathbf{O}_t = \mathbf{H}_t \mathbf{W}_{hq} + \mathbf{b}_q.$$
Parameters of the RNN
include the weights $\mathbf{W}_{xh} \in \mathbb{R}^{d \times h}, \mathbf{W}_{hh} \in \mathbb{R}^{h \times h}$,
and the bias $\mathbf{b}_h \in \mathbb{R}^{1 \times h}$
of the hidden layer,
together with the weights $\mathbf{W}_{hq} \in \mathbb{R}^{h \times q}$
and the bias $\mathbf{b}_q \in \mathbb{R}^{1 \times q}$
of the output layer.
It is worth mentioning that
even at different time steps,
RNNs always use these model parameters.
Therefore, the parameterization cost of an RNN
does not grow as the number of time steps increases.
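The following is a minimal sketch, assuming PyTorch, hypothetical sizes, and $\tanh$ as $\phi$, of how :eqref:rnn_h_with_state and the output computation unroll over time; note how the same parameters are reused at every time step:

```python
import torch

# Hypothetical sizes: batch n=2, inputs d=5, hidden units h=4, outputs q=3, T=6 steps
n, d, h, q, T = 2, 5, 4, 3, 6
W_xh, W_hh, b_h = torch.randn(d, h), torch.randn(h, h), torch.zeros(1, h)
W_hq, b_q = torch.randn(h, q), torch.zeros(1, q)

X_seq = torch.randn(T, n, d)  # inputs X_t for time steps t = 1, ..., T
H = torch.zeros(n, h)         # initial hidden state
outputs = []
for X_t in X_seq:             # the same parameters are reused at every time step
    H = torch.tanh(X_t @ W_xh + H @ W_hh + b_h)  # hidden state update
    outputs.append(H @ W_hq + b_q)               # output O_t of the current time step
```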
:numref:fig_rnn
illustrates the computational logic of an RNN at three adjacent time steps.
At any time step $t$,
the computation of the hidden state can be treated as:
(i) concatenating the input $\mathbf{X}_t$ at the current time step $t$ and the hidden state $\mathbf{H}_{t-1}$ at the previous time step $t-1$;
(ii) feeding the concatenation result into a fully-connected layer with the activation function $\phi$.
The output of such a fully-connected layer is the hidden state $\mathbf{H}_t$ of the current time step $t$.
In this case,
the model parameters are the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$, and a bias of $\mathbf{b}_h$, all from :eqref:rnn_h_with_state.
The hidden state of the current time step $t$, $\mathbf{H}_t$, will participate in computing the hidden state $\mathbf{H}_{t+1}$ of the next time step $t+1$.
Moreover, $\mathbf{H}_t$ will also be fed into the fully-connected output layer to compute the output $\mathbf{O}_t$ of the current time step $t$.
We just mentioned that the calculation of $\mathbf{X}_t \mathbf{W}_{xh} + \mathbf{H}_{t-1} \mathbf{W}_{hh}$ for the hidden state is equivalent to
matrix multiplication of
the concatenation of $\mathbf{X}_t$ and $\mathbf{H}_{t-1}$
and
the concatenation of $\mathbf{W}_{xh}$ and $\mathbf{W}_{hh}$.
Though this can be proven mathematically,
in the following we just use a simple code snippet to show this.
To begin with,
we define matrices X
, W_xh
, H
, and W_hh
, whose shapes are (3, 1), (1, 4), (3, 4), and (4, 4), respectively.
Multiplying X
by W_xh
, and H
by W_hh
, respectively, and then adding these two multiplications,
we obtain a matrix of shape (3, 4).
```python
from d2l import mxnet as d2l
from mxnet import np, npx
npx.set_np()
```

```python
#@tab pytorch
from d2l import torch as d2l
import torch
```

```python
#@tab tensorflow
from d2l import tensorflow as d2l
import tensorflow as tf
```

```python
#@tab mxnet, pytorch
X, W_xh = d2l.normal(0, 1, (3, 1)), d2l.normal(0, 1, (1, 4))
H, W_hh = d2l.normal(0, 1, (3, 4)), d2l.normal(0, 1, (4, 4))
d2l.matmul(X, W_xh) + d2l.matmul(H, W_hh)
```

```python
#@tab tensorflow
X, W_xh = d2l.normal((3, 1), 0, 1), d2l.normal((1, 4), 0, 1)
H, W_hh = d2l.normal((3, 4), 0, 1), d2l.normal((4, 4), 0, 1)
d2l.matmul(X, W_xh) + d2l.matmul(H, W_hh)
```
Now we concatenate the matrices X
and H
along columns (axis 1),
and the matrices
W_xh
and W_hh
along rows (axis 0).
These two concatenations
result in
matrices of shape (3, 5)
and of shape (5, 4), respectively.
Multiplying these two concatenated matrices,
we obtain the same output matrix of shape (3, 4)
as above.
```python
#@tab all
d2l.matmul(d2l.concat((X, H), 1), d2l.concat((W_xh, W_hh), 0))
```
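The two results agree up to floating-point error. For instance, reusing the matrices defined above, the PyTorch tab could verify this with torch.allclose:

```python
#@tab pytorch
# Reuses X, W_xh, H, and W_hh defined above
torch.allclose(d2l.matmul(X, W_xh) + d2l.matmul(H, W_hh),
               d2l.matmul(d2l.concat((X, H), 1), d2l.concat((W_xh, W_hh), 0)))
```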
## RNN-based Character-Level Language Models

Recall that for language modeling in :numref:sec_language_model,
we aim to predict the next token based on
the current and past tokens,
thus we shift the original sequence by one token
as the labels.
Now we illustrate how RNNs can be used to build a language model.
Let the minibatch size be 1, and the sequence of the text be "machine".
To simplify training in subsequent sections,
we tokenize text into characters rather than words
and consider a character-level language model.
:numref:fig_rnn_train
demonstrates how to predict the next character based on the current and previous characters via an RNN for character-level language modeling.
During the training process,
we run a softmax operation on the output from the output layer for each time step, and then use the cross-entropy loss to compute the error between the model output and the label.
Due to the recurrent computation of the hidden state in the hidden layer, the output of time step 3 in :numref:fig_rnn_train,
$\mathbf{O}_3$, is determined by the text sequence "m", "a", and "c". Since the next character of the sequence in the training data is "h", the loss of time step 3 will depend on the probability distribution over the next character generated based on the feature sequence "m", "a", "c" and the label "h" of this time step.

In practice, each token is represented by a $d$-dimensional vector, and we use a batch size $n>1$. Therefore, the input $\mathbf{X}_t$ at time step $t$ will be an $n \times d$ matrix, which is identical to what we discussed in :numref:subsec_rnn_w_hidden_states.
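As a minimal sketch in plain Python, with a hypothetical vocabulary built from the text itself, shifting the character sequence by one token yields the feature and label sequences depicted in :numref:fig_rnn_train:

```python
text = "machine"
# Hypothetical character-level vocabulary: map each character to an integer index
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
inputs = [vocab[ch] for ch in text[:-1]]  # features: "machin"
labels = [vocab[ch] for ch in text[1:]]   # labels:   "achine" (shifted by one token)
```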
## Perplexity
🏷️subsec_perplexity

Last, let us discuss how to measure the quality of a language model, which we will use to evaluate our RNN-based models in the subsequent sections. One way is to check how surprising the text is. A good language model is able to predict, with high accuracy, the tokens that we will see next. Consider the following continuations of the phrase "It is raining", as proposed by different language models:
- "It is raining outside"
- "It is raining banana tree"
- "It is raining piouw;kcj pwepoiut"
In terms of quality, example 1 is clearly the best. The words are sensible and logically coherent. While it might not quite accurately reflect which word follows semantically ("in San Francisco" and "in winter" would have been perfectly reasonable extensions), the model is able to capture which kind of word follows. Example 2 is considerably worse, producing a nonsensical extension. Nonetheless, at least the model has learned how to spell words and some degree of correlation between words. Last, example 3 indicates a poorly trained model that does not fit the data properly.
We might measure the quality of the model by computing the likelihood of the sequence. Unfortunately, this is a number that is hard to understand and difficult to compare. After all, shorter sequences are much more likely to occur than longer ones, hence evaluating the model on Tolstoy's magnum opus *War and Peace* will inevitably produce a much smaller likelihood than, say, on Saint-Exupery's novella *The Little Prince*. What is missing is the equivalent of an average.
Information theory comes in handy here.
We have defined entropy, surprisal, and cross-entropy
when we introduced the softmax regression
(:numref:subsec_info_theory_basics
)
and more of information theory is discussed in the online appendix on information theory.
If we want to compress text, we can ask about
predicting the next token given the current set of tokens.
A better language model should allow us to predict the next token more accurately.
Thus, it should allow us to spend fewer bits in compressing the sequence.
So we can measure it by the cross-entropy loss averaged
over all the $n$ tokens of a sequence:

$$\frac{1}{n} \sum_{t=1}^n -\log P(x_t \mid x_{t-1}, \ldots, x_1),$$
:eqlabel:eq_avg_ce_for_lm

where $P$ is given by a language model and $x_t$ is the actual token observed at time step $t$ from the sequence.
This makes the performance on documents of different lengths comparable.
For historical reasons, scientists in natural language processing prefer to use a quantity called *perplexity*. In a nutshell, it is the exponential of :eqref:eq_avg_ce_for_lm:

$$\exp\left(-\frac{1}{n} \sum_{t=1}^n \log P(x_t \mid x_{t-1}, \ldots, x_1)\right).$$
Perplexity can be best understood as the harmonic mean of the number of real choices that we have when deciding which token to pick next. Let us look at a number of cases:
- In the best case scenario, the model always perfectly estimates the probability of the label token as 1. In this case the perplexity of the model is 1.
- In the worst case scenario, the model always predicts the probability of the label token as 0. In this situation, the perplexity is positive infinity.
- At the baseline, the model predicts a uniform distribution over all the available tokens of the vocabulary. In this case, the perplexity equals the number of unique tokens of the vocabulary. In fact, if we were to store the sequence without any compression, this would be the best we could do to encode it. Hence, this provides a nontrivial upper bound that any useful model must beat.
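The following minimal sketch, in plain Python with hypothetical probabilities, computes perplexity as the exponential of the average negative log-likelihood and illustrates the uniform baseline from the list above:

```python
import math

def perplexity(probs):
    """Exponential of the average negative log-likelihood of the label tokens."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Hypothetical probabilities that a model assigns to the actual next tokens
print(perplexity([0.9, 0.8, 0.95, 0.7]))  # close to 1 for a confident, accurate model
vocab_size = 28
print(perplexity([1 / vocab_size] * 10))  # equals vocab_size for a uniform prediction
```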
In the following sections, we will implement RNNs for character-level language models and use perplexity to evaluate such models.
## Summary

- A neural network that uses recurrent computation for hidden states is called a recurrent neural network (RNN).
- The hidden state of an RNN can capture historical information of the sequence up to the current time step.
- The number of RNN model parameters does not grow as the number of time steps increases.
- We can create character-level language models using an RNN.
- We can use perplexity to evaluate the quality of language models.
## Exercises

1. If we use an RNN to predict the next character in a text sequence, what is the required dimension for any output?
2. Why can RNNs express the conditional probability of a token at some time step based on all the previous tokens in the text sequence?
3. What happens to the gradient if you backpropagate through a long sequence?
4. What are some of the problems associated with the language model described in this section?
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab:
:begin_tab:tensorflow
Discussions
:end_tab: