Learning effective feature combinations is critical to the success of the click-through rate prediction task. Factorization machines model feature interactions in a linear paradigm (e.g., bilinear interactions), which is often insufficient for real-world data whose inherent feature-crossing structures are complex and nonlinear. What is worse, factorization machines are generally limited to second-order feature interactions in practice. Modeling higher degrees of feature combinations with factorization machines is possible in theory, but it is rarely adopted due to numerical instability and high computational complexity.
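For concreteness, recall the 2-way factorization machine from the previous section (a sketch in standard FM notation; $d$ is the number of features, $w_0$ the global bias, $w_i$ the first-order weights, and $\mathbf{v}_i$ the latent vectors):

$$\hat{y}^{(FM)}(\mathbf{x}) = w_0 + \sum_{i=1}^d w_i x_i + \sum_{i=1}^d\sum_{j=i+1}^d \langle \mathbf{v}_i, \mathbf{v}_j \rangle x_i x_j.$$

The pairwise term stops at second order, which is exactly the limitation that motivates adding a deep component.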
One effective solution is using deep neural networks. Deep neural networks are powerful in feature representation learning and have the potential to learn sophisticated feature interactions. As such, it is natural to integrate deep neural networks with factorization machines. Adding nonlinear transformation layers to factorization machines gives them the capability to model both low-order and high-order feature combinations. Moreover, nonlinear inherent structures in the inputs can also be captured with deep neural networks. In this section, we will introduce a representative model named deep factorization machines (DeepFM) :cite:Guo.Tang.Ye.ea.2017 which combines FM and deep neural networks.
DeepFM consists of an FM component and a deep component, which are integrated in a parallel structure. The FM component is the same as the 2-way factorization machine and is used to model the low-order feature interactions. The deep component is an MLP that is used to capture high-order feature interactions and nonlinearities. These two components share the same inputs/embeddings, and their outputs are summed up as the final prediction. It is worth pointing out that the spirit of DeepFM resembles that of the Wide & Deep architecture, which can capture both memorization and generalization. The advantage of DeepFM over the Wide & Deep model is that it reduces the effort of hand-crafted feature engineering by identifying feature combinations automatically.
We omit the description of the FM component for brevity and denote the output as $\hat{y}^{(FM)}$. Readers are referred to the last section for more details. Let $\mathbf{e}_i \in \mathbb{R}^{k}$ denote the latent feature vector of the $i^\mathrm{th}$ field. The input of the deep component is the concatenation of the dense embeddings of all fields that are looked up with the sparse categorical feature input, denoted as:

$$\mathbf{z}^{(0)} = [\mathbf{e}_1, \mathbf{e}_2, ..., \mathbf{e}_f],$$

where $f$ is the number of fields. It is then fed into the following neural network:

$$\mathbf{z}^{(l)} = \alpha(\mathbf{W}^{(l)}\mathbf{z}^{(l-1)} + \mathbf{b}^{(l)}),$$

where $\alpha$ is the activation function, and $\mathbf{W}^{(l)}$ and $\mathbf{b}^{(l)}$ are the weight and bias at the $l^\mathrm{th}$ layer. Let $\hat{y}^{(DNN)}$ denote the output of the MLP. The ultimate prediction of DeepFM is the summation of the outputs from both the FM and the deep component:

$$\hat{y} = \sigma(\hat{y}^{(FM)} + \hat{y}^{(DNN)}),$$

where $\sigma$ is the sigmoid function.
It is worth noting that DeepFM is not the only way to combine deep neural networks with FM. We can also add nonlinear layers over the feature interactions :cite:He.Chua.2017.
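For instance, one could feed the element-wise (bi-interaction) pooling of the embedding pairs into an MLP instead of summing the FM term directly. The block below is a minimal sketch of such a design in Gluon; the class name `BiInteractionMLP` and its hyperparameters are illustrative and not taken from this section's implementation.

#@tab mxnet
class BiInteractionMLP(nn.Block):
    """Sketch: nonlinear layers stacked over second-order feature interactions."""
    def __init__(self, field_dims, num_factors, mlp_dims, drop_rate=0.1):
        super(BiInteractionMLP, self).__init__()
        num_inputs = int(sum(field_dims))
        self.embedding = nn.Embedding(num_inputs, num_factors)
        self.mlp = nn.Sequential()
        input_dim = num_factors  # bi-interaction pooling keeps the factor dimension
        for dim in mlp_dims:
            self.mlp.add(nn.Dense(dim, 'relu', True, in_units=input_dim))
            self.mlp.add(nn.Dropout(rate=drop_rate))
            input_dim = dim
        self.mlp.add(nn.Dense(in_units=input_dim, units=1))

    def forward(self, x):
        embed_x = self.embedding(x)
        # Bi-interaction pooling: 0.5 * ((sum of embeddings)^2 - sum of squares)
        square_of_sum = np.sum(embed_x, axis=1) ** 2
        sum_of_square = np.sum(embed_x ** 2, axis=1)
        interactions = 0.5 * (square_of_sum - sum_of_square)
        return npx.sigmoid(self.mlp(interactions))

Compared with DeepFM's parallel design, this variant stacks the nonlinearity on top of the interaction term rather than beside it.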
#@tab mxnet
from d2l import mxnet as d2l
from mxnet import init, gluon, np, npx
from mxnet.gluon import nn
import os
npx.set_np()
The implementation of DeepFM is similar to that of FM. We keep the FM part unchanged and use an MLP block with relu as the activation function. Dropout is also used to regularize the model. The number of neurons in each MLP layer can be adjusted with the mlp_dims hyperparameter.
#@tab mxnet
class DeepFM(nn.Block):
    def __init__(self, field_dims, num_factors, mlp_dims, drop_rate=0.1):
        super(DeepFM, self).__init__()
        num_inputs = int(sum(field_dims))
        # Shared embeddings used by both the FM and the deep component
        self.embedding = nn.Embedding(num_inputs, num_factors)
        # First-order (linear) term of the FM component
        self.fc = nn.Embedding(num_inputs, 1)
        self.linear_layer = nn.Dense(1, use_bias=True)
        input_dim = self.embed_output_dim = len(field_dims) * num_factors
        # Deep component: an MLP with relu activations and dropout
        self.mlp = nn.Sequential()
        for dim in mlp_dims:
            self.mlp.add(nn.Dense(dim, 'relu', True, in_units=input_dim))
            self.mlp.add(nn.Dropout(rate=drop_rate))
            input_dim = dim
        self.mlp.add(nn.Dense(in_units=input_dim, units=1))

    def forward(self, x):
        embed_x = self.embedding(x)
        # Second-order FM interactions: 0.5 * ((sum)^2 - sum of squares)
        square_of_sum = np.sum(embed_x, axis=1) ** 2
        sum_of_square = np.sum(embed_x ** 2, axis=1)
        # Flatten the field embeddings as the input of the deep component
        inputs = np.reshape(embed_x, (-1, self.embed_output_dim))
        # Sum the linear term, the FM interaction term, and the MLP output
        x = self.linear_layer(self.fc(x).sum(1)) \
            + 0.5 * (square_of_sum - sum_of_square).sum(1, keepdims=True) \
            + self.mlp(inputs)
        x = npx.sigmoid(x)
        return x
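As a quick sanity check (not part of the original implementation), we can run a toy batch through the network and confirm that it emits one probability per example. The field cardinalities and indices below are made up for illustration.

#@tab mxnet
# Hypothetical smoke test: three categorical fields with made-up cardinalities
toy_field_dims = [10, 20, 30]
toy_net = DeepFM(toy_field_dims, num_factors=4, mlp_dims=[16, 8])
toy_net.initialize(init.Xavier())
# A batch of 2 examples; each value indexes into the shared embedding table
toy_x = np.array([[0, 12, 35], [3, 25, 59]])
print(toy_net(toy_x).shape)  # expected: (2, 1)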
The data loading process is the same as that of FM. We set the MLP component of DeepFM to a three-layer dense network with a pyramid structure (30-20-10). All other hyperparameters remain the same as in FM.
#@tab mxnet
batch_size = 2048
data_dir = d2l.download_extract('ctr')
# Reuse the feature mapping of the training set when loading the test set
train_data = d2l.CTRDataset(os.path.join(data_dir, 'train.csv'))
test_data = d2l.CTRDataset(os.path.join(data_dir, 'test.csv'),
                           feat_mapper=train_data.feat_mapper,
                           defaults=train_data.defaults)
field_dims = train_data.field_dims
train_iter = gluon.data.DataLoader(
    train_data, shuffle=True, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
test_iter = gluon.data.DataLoader(
    test_data, shuffle=False, last_batch='rollover', batch_size=batch_size,
    num_workers=d2l.get_dataloader_workers())
devices = d2l.try_all_gpus()
# A pyramid MLP (30-20-10) on top of 10-dimensional embeddings
net = DeepFM(field_dims, num_factors=10, mlp_dims=[30, 20, 10])
net.initialize(init.Xavier(), ctx=devices)
lr, num_epochs, optimizer = 0.01, 30, 'adam'
trainer = gluon.Trainer(net.collect_params(), optimizer,
                        {'learning_rate': lr})
loss = gluon.loss.SigmoidBinaryCrossEntropyLoss()
d2l.train_ch13(net, train_iter, test_iter, loss, trainer, num_epochs, devices)
Compared with FM, DeepFM converges faster and achieves better performance.
- Integrating deep neural networks with FM enables it to model complex and high-order feature interactions.
- DeepFM outperforms the original FM on the advertising dataset.
- Vary the structure of the MLP to check its impact on model performance.
- Change the dataset to Criteo and compare it with the original FM model.
:begin_tab:mxnet
Discussions
:end_tab: