This file describes a set of TF-Ranking extensions.
[TOC]
We first provide a step-by-step guide on quickly developing a TF-Ranking model
using `RankingPipeline`. Overall, there are three steps, as shown below.
To use `RankingPipeline`, you first need to install the TensorFlow Ranking
library with pip (see the installation instructions).
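For instance, in a standard Python environment:

```shell
pip install tensorflow_ranking
```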
The `RankingPipeline` requires a dictionary of hparams to construct an
instance. Most of them are standard TensorFlow model parameters, including:

*   `train_input_pattern`, `eval_input_pattern`
*   `train_batch_size`, `eval_batch_size`
*   `checkpoint_secs`, `num_checkpoints`
*   `num_train_steps`, `num_eval_steps`
*   `model_dir`
It should also include TF-Ranking specific parameters such as:

*   `loss`: the ranking loss to optimize. Here is the list of supported
    ranking losses.
*   `list_size`: defaults to `None`. If `None`, the actual list size per batch
    is determined by the maximum number of examples in that batch. If a fixed
    `list_size` is set, lists with more examples than `list_size` are
    truncated, and lists with fewer are padded (see the sketch below). Please
    refer to `tfr.data.make_parsing_fn` for more details.
*   `listwise_inference`: defaults to `False`. If `False`, the features are
    encoded using `tfr.feature.encode_pointwise_features`; otherwise, they are
    encoded by `tfr.feature.encode_listwise_features`.
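To make the `list_size` semantics concrete, here is a minimal sketch (not
`RankingPipeline`'s internal code) of the truncate-or-pad behavior on a single
query's example list:

```python
import tensorflow as tf

def truncate_or_pad(example_list, list_size, pad_value=0.0):
  """Returns `example_list` truncated or padded to exactly `list_size`."""
  truncated = example_list[:list_size]
  num_padding = list_size - tf.shape(truncated)[0]
  return tf.pad(truncated, [[0, num_padding]], constant_values=pad_value)

print(truncate_or_pad(tf.constant([1.0, 2.0, 3.0]), 2))  # [1. 2.]
print(truncate_or_pad(tf.constant([1.0, 2.0, 3.0]), 5))  # [1. 2. 3. 0. 0.]
```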
In addition, we add a parameter `convert_labels_to_binary` for label
processing. If `convert_labels_to_binary` is set to `True`, a label is
converted to 1 if its value is larger than 0, and to 0 otherwise. With
`convert_labels_to_binary` set to `False`, label values are kept intact.
Putting these together, an example hparams dictionary looks like the
following:
```python
hparams = dict(
    train_input_pattern="/path/to/train/files",
    eval_input_pattern="/path/to/eval/files",
    train_batch_size=8,
    eval_batch_size=8,
    checkpoint_secs=120,
    num_checkpoints=1000,
    num_train_steps=10000,
    num_eval_steps=100,
    loss="softmax_loss",
    list_size=10,
    listwise_inference=False,
    convert_labels_to_binary=False,
    model_dir="/path/to/your/model/directory")
```
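As a side note, the label binarization described above behaves like the
following snippet (a minimal sketch; the actual conversion happens inside
`RankingPipeline`):

```python
import tensorflow as tf

labels = tf.constant([[3.0, 1.0, 0.0]])  # Graded relevance labels.
binary_labels = tf.cast(tf.greater(labels, 0.0), tf.float32)
# binary_labels is [[1.0, 1.0, 0.0]].
```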
If you need to change the default values, you may pass the parameters through
flags. For example, if you need a `list_size` of 100 and the loss type to be
`approx_ndcg_loss`, you can do the following:
```shell
bazel build -c opt \
tensorflow_ranking/extension/examples/pipeline_example_py_binary && \
./bazel-bin/tensorflow_ranking/extension/examples/pipeline_example_py_binary \
--train_input_pattern=/path/to/train/files \
--eval_input_pattern=/path/to/eval/files \
--model_dir=/path/to/your/model/directory \
--list_size=100 \
--loss="approx_ndcg_loss"
```
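The `--list_size` and `--loss` flags above are defined in
`pipeline_example.py`; here is a minimal, hypothetical sketch of how such
flags can be wired into the hparams dictionary (the real binary defines more
flags and more logic):

```python
from absl import app
from absl import flags

flags.DEFINE_integer("list_size", 10, "List size used for training.")
flags.DEFINE_string("loss", "softmax_loss", "The ranking loss key.")

FLAGS = flags.FLAGS

def main(_):
  # Flag values override the defaults when building the hparams dict.
  hparams = dict(list_size=FLAGS.list_size, loss=FLAGS.loss)
  print(hparams)

if __name__ == "__main__":
  app.run(main)
```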
`RankingPipeline` requires a `tf.estimator.Estimator` to define the TF-Ranking
model computation details. The simplest approach is to construct an
`Estimator` using `tfr.estimator.EstimatorBuilder`. The following provides a
simplified version of `pipeline_example.py` using the `RankingPipeline`. With
a minimal definition of `context_feature_columns`, `example_feature_columns`
and `scoring_function`, one can create a `tf.estimator.Estimator`.
```python
import tensorflow as tf
import tensorflow_ranking as tfr

def context_feature_columns():
  return {}

def example_feature_columns():
  sparse_column = tf.feature_column.categorical_column_with_hash_bucket(
      key="document_tokens", hash_bucket_size=100)
  document_embedding = tf.feature_column.embedding_column(
      categorical_column=sparse_column, dimension=20)
  return {"document_tokens": document_embedding}

def scoring_function(context_features, example_features, mode):
  """A feed-forward network to score query-document pairs."""
  input_layer = [
      tf.compat.v1.layers.flatten(example_features["document_tokens"])
  ]
  cur_layer = tf.concat(input_layer, 1)
  for dim in [256, 128, 64, 32]:
    cur_layer = tf.compat.v1.layers.dense(cur_layer, units=dim)
    cur_layer = tf.nn.relu(cur_layer)
  return tf.compat.v1.layers.dense(cur_layer, units=1)

ranking_estimator = tfr.estimator.EstimatorBuilder(
    context_feature_columns(),
    example_feature_columns(),
    scoring_function=scoring_function,
    hparams=hparams).make_estimator()
```
One benefit of using `tfr.estimator.EstimatorBuilder` is that it allows
clients to override the default model behaviors. For example, if you need
evaluation metrics other than the default NDCG@5 and NDCG@10, you may override
`_eval_metric_fns` as below to obtain MAP@10 and NDCG@20. For function
overriding, please refer to the constructor of
`tfr.estimator.EstimatorBuilder` for more details.
```python
class MyEstimatorBuilder(tfr.estimator.EstimatorBuilder):

  def _eval_metric_fns(self):
    metric_fns = {}
    metric_fns.update({
        "metric/ndcg_%d" % topn: tfr.metrics.make_ranking_metric_fn(
            tfr.metrics.RankingMetricKey.NDCG, topn=topn) for topn in [20]
    })
    metric_fns.update({
        "metric/map_%d" % topn: tfr.metrics.make_ranking_metric_fn(
            tfr.metrics.RankingMetricKey.MAP, topn=topn) for topn in [10]
    })
    return metric_fns

ranking_estimator = MyEstimatorBuilder(
    context_feature_columns(),
    example_feature_columns(),
    scoring_function=scoring_function,
    hparams=hparams).make_estimator()
```
If `tfr.estimator.EstimatorBuilder` still cannot fulfill your needs, you can
create your own `tf.estimator.Estimator`. Here, you might still want to refer
to `tfr.estimator.EstimatorBuilder`, as it provides a template for building an
`Estimator`.
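For illustration, here is a minimal sketch of what a hand-rolled `Estimator`
might look like using TF-Ranking's groupwise model utilities
(`tfr.model.make_groupwise_ranking_fn` and `tfr.head.create_ranking_head`). It
reuses `scoring_function`, the feature columns, and `hparams` from above; the
optimizer, learning rate, metric choice, and `group_size=1` are illustrative
assumptions, not the library's defaults:

```python
import tensorflow as tf
import tensorflow_ranking as tfr

def _transform_fn(features, mode):
  # Encode raw features into dense context/example features.
  return tfr.feature.encode_listwise_features(
      features=features,
      context_feature_columns=context_feature_columns(),
      example_feature_columns=example_feature_columns(),
      mode=mode)

def _group_score_fn(context_features, group_features, mode, params, config):
  del params, config  # Unused by our scoring function.
  return scoring_function(context_features, group_features, mode)

def _train_op_fn(loss):
  return tf.compat.v1.train.AdagradOptimizer(learning_rate=0.05).minimize(
      loss=loss, global_step=tf.compat.v1.train.get_global_step())

ranking_head = tfr.head.create_ranking_head(
    loss_fn=tfr.losses.make_loss_fn(hparams["loss"]),
    eval_metric_fns={
        "metric/ndcg_10": tfr.metrics.make_ranking_metric_fn(
            tfr.metrics.RankingMetricKey.NDCG, topn=10),
    },
    train_op_fn=_train_op_fn)

ranking_estimator = tf.estimator.Estimator(
    model_fn=tfr.model.make_groupwise_ranking_fn(
        group_score_fn=_group_score_fn,
        transform_fn=_transform_fn,
        group_size=1,
        ranking_head=ranking_head),
    model_dir=hparams["model_dir"])
```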
With the above `tf.estimator.Estimator`, `hparams`,
`context_feature_columns()` and `example_feature_columns()`, the next step is
to create a `RankingPipeline` instance and simply call `train_and_eval()` for
model training and evaluation.
```python
import tensorflow_ranking as tfr

ranking_pipeline = tfr.ext.pipeline.RankingPipeline(
    context_feature_columns(),
    example_feature_columns(),
    hparams=hparams,
    estimator=ranking_estimator,
    label_feature_name="utility",
    label_feature_type=tf.float32)
ranking_pipeline.train_and_eval()
```
Again, if you want to customize certain `RankingPipeline` behaviors, you may
create a subclass of `RankingPipeline` and override the related functions. For
example, if you want to remove the best exporters, you may do the following.
For function overriding, please refer to the constructor of `RankingPipeline`
for more details.
```python
class NoBestExporterRankingPipeline(tfr.ext.pipeline.RankingPipeline):

  def _export_strategies(self, event_file_pattern):
    del event_file_pattern  # Unused: no best exporter tracks eval events.
    latest_exporter = tf.estimator.LatestExporter(
        "latest_model",
        serving_input_receiver_fn=self._make_serving_input_fn())
    return [latest_exporter]

ranking_pipeline = NoBestExporterRankingPipeline(
    context_feature_columns(),
    example_feature_columns(),
    hparams=hparams,
    estimator=ranking_estimator)
ranking_pipeline.train_and_eval()
```
To summarize the above steps and put everything together, a complete code
example is provided below. Here, we only provide a simplified illustration of
adopting the `RankingPipeline`. For a running example with the provided data,
please refer to `pipeline_example.py`.
```python
import tensorflow as tf
import tensorflow_ranking as tfr

def context_feature_columns():
  return {}

def example_feature_columns():
  sparse_column = tf.feature_column.categorical_column_with_hash_bucket(
      key="document_tokens", hash_bucket_size=100)
  document_embedding = tf.feature_column.embedding_column(
      categorical_column=sparse_column, dimension=20)
  return {"document_tokens": document_embedding}

def scoring_function(context_features, example_features, mode):
  """A feed-forward network to score query-document pairs."""
  input_layer = [
      tf.compat.v1.layers.flatten(example_features["document_tokens"])
  ]
  cur_layer = tf.concat(input_layer, 1)
  for dim in [256, 128, 64, 32]:
    cur_layer = tf.compat.v1.layers.dense(cur_layer, units=dim)
    cur_layer = tf.nn.relu(cur_layer)
  return tf.compat.v1.layers.dense(cur_layer, units=1)

hparams = dict(
    train_input_pattern="/path/to/train/files",
    eval_input_pattern="/path/to/eval/files",
    train_batch_size=8,
    eval_batch_size=8,
    checkpoint_secs=120,
    num_checkpoints=1000,
    num_train_steps=10000,
    num_eval_steps=100,
    loss="softmax_loss",
    list_size=10,
    listwise_inference=False,
    convert_labels_to_binary=False,
    model_dir="/path/to/your/model/directory")  # step 1

ranking_estimator = tfr.estimator.EstimatorBuilder(  # step 2
    context_feature_columns(),
    example_feature_columns(),
    scoring_function=scoring_function,
    hparams=hparams).make_estimator()

ranking_pipeline = tfr.ext.pipeline.RankingPipeline(  # step 3
    context_feature_columns(),
    example_feature_columns(),
    hparams=hparams,
    estimator=ranking_estimator,
    label_feature_name="utility",
    label_feature_type=tf.float32)
ranking_pipeline.train_and_eval()
```