:label:sec_word2vec_pretraining
In this section, we will train a skip-gram model defined in :numref:sec_word2vec.
First, import the packages and modules required for the experiment, and load the PTB dataset.
from d2l import mxnet as d2l
from mxnet import autograd, gluon, np, npx
from mxnet.gluon import nn
npx.set_np()
batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)
#@tab pytorch
from d2l import torch as d2l
import torch
from torch import nn
batch_size, max_window_size, num_noise_words = 512, 5, 5
data_iter, vocab = d2l.load_data_ptb(batch_size, max_window_size,
                                     num_noise_words)
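As a quick sanity check (not part of the original code), we can print the shape of each component of one minibatch. The names below are just descriptive labels for the four tensors the iterator yields, as constructed in the previous section.
#@tab all
# Optional sanity check: each minibatch consists of four tensors, which we
# label here for readability
names = ['centers', 'contexts_negatives', 'masks', 'labels']
for batch in data_iter:
    for name, data in zip(names, batch):
        print(name, 'shape:', data.shape)
    break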
We will implement the skip-gram model by using embedding layers and minibatch multiplication. These methods are also often used to implement other natural language processing applications.
As described in :numref:sec_seq2seq, the layer that maps a word index to its word vector is called the embedding layer, which can be obtained by creating an nn.Embedding instance in high-level APIs. The weight of the embedding layer is a matrix whose number of rows is the dictionary size (input_dim) and whose number of columns is the dimension of each word vector (output_dim). We set the dictionary size to 20 and the word vector dimension to 4.
embed = nn.Embedding(input_dim=20, output_dim=4)
embed.initialize()
embed.weight
#@tab pytorch
embed = nn.Embedding(num_embeddings=20, embedding_dim=4)
print(f'Parameter embedding_weight ({embed.weight.shape}, '
      f'dtype={embed.weight.dtype})')
The input of the embedding layer is the index of a word. When we input the index i of a word, the embedding layer returns the ith row of the weight matrix as its word vector. Below we input a minibatch of indices of shape (2, 3) into the embedding layer.
#@tab all
x = d2l.tensor([[1, 2, 3], [4, 5, 6]])
embed(x)
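Since an embedding lookup simply selects rows of the weight matrix, we can verify this directly. Below is a minimal check for the PyTorch tab, assuming the embed instance created above: the vector returned for index 1 should equal row 1 of embed.weight.
#@tab pytorch
# Minimal check (assumes the `embed` instance above): looking up index 1
# returns row 1 of the weight matrix
torch.equal(embed(torch.tensor([1]))[0], embed.weight.data[1])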
In the forward calculation, the input of the skip-gram model contains the central target word index center and the concatenated context and noise word indices contexts_and_negatives. Here, the center variable has the shape (batch size, 1), while the contexts_and_negatives variable has the shape (batch size, max_len). These two variables are first transformed from word indices to word vectors by the word embedding layer, and then the output of shape (batch size, 1, max_len) is obtained by minibatch multiplication. Each element in the output is the inner product of the central target word vector and a context word vector or noise word vector.
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = npx.batch_dot(v, u.swapaxes(1, 2))
    return pred
#@tab pytorch
def skip_gram(center, contexts_and_negatives, embed_v, embed_u):
    v = embed_v(center)
    u = embed_u(contexts_and_negatives)
    pred = torch.bmm(v, u.permute(0, 2, 1))
    return pred
Let us verify that the output shape is (batch size, 1, max_len).
skip_gram(np.ones((2, 1)), np.ones((2, 4)), embed, embed).shape
#@tab pytorch
skip_gram(torch.ones((2, 1), dtype=torch.long),
          torch.ones((2, 4), dtype=torch.long), embed, embed).shape
Before training the word embedding model, we need to define the loss function of the model.
According to the definition of the loss function in negative sampling, we can directly use the binary cross-entropy loss function from high-level APIs.
loss = gluon.loss.SigmoidBCELoss()
#@tab pytorch
class SigmoidBCELoss(nn.Module):
    "BCEWithLogitsLoss with masking on call."
    def __init__(self):
        super().__init__()

    def forward(self, inputs, target, mask=None):
        out = nn.functional.binary_cross_entropy_with_logits(
            inputs, target, weight=mask, reduction="none")
        return out.mean(dim=1)

loss = SigmoidBCELoss()
It is worth mentioning that we can use the mask variable to specify which predicted values and labels in the minibatch participate in the loss calculation: when an element of the mask is 1, the predicted value and label at the corresponding position contribute to the loss; when it is 0, they do not. As we mentioned earlier, mask variables can be used to avoid the effect of padding on the loss calculation.
Given two identical examples, different masks lead to different loss values.
#@tab all
pred = d2l.tensor([[.5]*4]*2)
label = d2l.tensor([[1., 0., 1., 0.]]*2)
mask = d2l.tensor([[1, 1, 1, 1], [1, 1, 0, 0]])
loss(pred, label, mask)
Because the valid (non-padding) lengths differ across examples, we can normalize the loss in each example by its valid length.
#@tab all
loss(pred, label, mask) / mask.sum(axis=1) * mask.shape[1]
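As a sanity check (not part of the original code), we can reproduce the two normalized values above with plain Python. For a logit $x$ and label $y$, the element-wise binary cross-entropy loss is $-[y \log \sigma(x) + (1 - y)\log(1 - \sigma(x))]$; averaging only over the positions kept by each mask gives the same number for both examples.
#@tab all
import math

def bce(x, y):
    # Element-wise binary cross-entropy for logit x and label y
    p = 1 / (1 + math.exp(-x))
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

# First example: all four positions are valid (mask is all ones)
print(f'{(bce(0.5, 1) + bce(0.5, 0) + bce(0.5, 1) + bce(0.5, 0)) / 4:.4f}')
# Second example: only the first two positions are kept by the mask
print(f'{(bce(0.5, 1) + bce(0.5, 0)) / 2:.4f}')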
We construct two embedding layers, one for the central target words and one for the context words, and set the word vector dimension hyperparameter embed_size to 100.
embed_size = 100
net = nn.Sequential()
net.add(nn.Embedding(input_dim=len(vocab), output_dim=embed_size),
        nn.Embedding(input_dim=len(vocab), output_dim=embed_size))
#@tab pytorch
embed_size = 100
net = nn.Sequential(nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size),
                    nn.Embedding(num_embeddings=len(vocab),
                                 embedding_dim=embed_size))
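Before writing the full training loop, it can help to confirm that the data pipeline and the model fit together. The sketch below (PyTorch tab only, and purely optional) feeds one real minibatch through skip_gram using the untrained net; the output shape should be (batch size, 1, max_len).
#@tab pytorch
# Optional check: run one real minibatch through the untrained model.
# Each batch yields (centers, contexts_negatives, masks, labels).
center, contexts_negatives, mask, label = next(iter(data_iter))
skip_gram(center, contexts_negatives, net[0], net[1]).shape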
The training function is defined below. Because of the existence of padding, the calculation of the loss function is slightly different compared to the previous training functions.
def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
    net.initialize(ctx=device, force_reinit=True)
    trainer = gluon.Trainer(net.collect_params(), 'adam',
                            {'learning_rate': lr})
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs])
    metric = d2l.Accumulator(2)  # Sum of losses, no. of tokens
    for epoch in range(num_epochs):
        timer, num_batches = d2l.Timer(), len(data_iter)
        for i, batch in enumerate(data_iter):
            center, context_negative, mask, label = [
                data.as_in_ctx(device) for data in batch]
            with autograd.record():
                pred = skip_gram(center, context_negative, net[0], net[1])
                l = (loss(pred.reshape(label.shape), label, mask)
                     / mask.sum(axis=1) * mask.shape[1])
            l.backward()
            trainer.step(batch_size)
            metric.add(l.sum(), l.size)
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')
#@tab pytorch
def train(net, data_iter, lr, num_epochs, device=d2l.try_gpu()):
    def init_weights(m):
        if type(m) == nn.Embedding:
            nn.init.xavier_uniform_(m.weight)
    net.apply(init_weights)
    net = net.to(device)
    optimizer = torch.optim.Adam(net.parameters(), lr=lr)
    animator = d2l.Animator(xlabel='epoch', ylabel='loss',
                            xlim=[1, num_epochs])
    metric = d2l.Accumulator(2)  # Sum of losses, no. of tokens
    for epoch in range(num_epochs):
        timer, num_batches = d2l.Timer(), len(data_iter)
        for i, batch in enumerate(data_iter):
            optimizer.zero_grad()
            center, context_negative, mask, label = [
                data.to(device) for data in batch]
            pred = skip_gram(center, context_negative, net[0], net[1])
            l = (loss(pred.reshape(label.shape).float(), label.float(), mask)
                 / mask.sum(axis=1) * mask.shape[1])
            l.sum().backward()
            optimizer.step()
            metric.add(l.sum(), l.numel())
            if (i + 1) % (num_batches // 5) == 0 or i == num_batches - 1:
                animator.add(epoch + (i + 1) / num_batches,
                             (metric[0] / metric[1],))
    print(f'loss {metric[0] / metric[1]:.3f}, '
          f'{metric[1] / timer.stop():.1f} tokens/sec on {str(device)}')
Now, we can train a skip-gram model using negative sampling.
#@tab all
lr, num_epochs = 0.01, 5
train(net, data_iter, lr, num_epochs)
After training the word embedding model, we can measure semantic similarity between words by the cosine similarity of their word vectors. As we can see, when using the trained word embedding model, the words closest in meaning to the word "chip" are mostly related to chips.
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data()
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = np.dot(W, x) / np.sqrt(np.sum(W * W, axis=1) * np.sum(x * x) + 1e-9)
    topk = npx.topk(cos, k=k+1, ret_typ='indices').asnumpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print(f'cosine sim={float(cos[i]):.3f}: {vocab.idx_to_token[i]}')

get_similar_tokens('chip', 3, net[0])
#@tab pytorch
def get_similar_tokens(query_token, k, embed):
    W = embed.weight.data
    x = W[vocab[query_token]]
    # Compute the cosine similarity. Add 1e-9 for numerical stability
    cos = torch.mv(W, x) / torch.sqrt(torch.sum(W * W, dim=1) *
                                      torch.sum(x * x) + 1e-9)
    topk = torch.topk(cos, k=k+1)[1].cpu().numpy().astype('int32')
    for i in topk[1:]:  # Remove the input words
        print(f'cosine sim={float(cos[i]):.3f}: {vocab.idx_to_token[i]}')

get_similar_tokens('chip', 3, net[0])
- We can pretrain a skip-gram model through negative sampling.
- Set sparse_grad=True when creating an instance of nn.Embedding. Does it accelerate training? Look up the MXNet documentation to learn the meaning of this argument.
- Try to find synonyms for other words.
- Tune the hyperparameters and observe and analyze the experimental results.
- When the dataset is large, we usually sample the context words and the noise words for the central target word in the current minibatch only when updating the model parameters. In other words, the same central target word may have different context words or noise words in different epochs. What are the benefits of this sort of training? Try to implement this training method.
:begin_tab:mxnet
Discussions
:end_tab:
:begin_tab:pytorch
Discussions
:end_tab: