RecFlow: An Industrial Full Flow Recommendation Dataset

Download the data

Download manually through the following links:

link: Drive

Motivation

To provide the recommendation systems (RS) research community with an industrial full flow dataset, we propose RecFlow, which includes samples from the exposure space as well as unexposed items filtered at each stage of Kuaishou's multi-stage RS. Compared with existing public RS datasets, RecFlow can be used not only to optimize conventional recommendation tasks but also to study challenges such as the interplay between stages, data distribution shift, auxiliary ranking tasks, and user behavior sequence modeling. It is the first public RS dataset that allows researchers to study a real industrial multi-stage RS.

The following figure illustrates the process of RecFlow's data collection.

Usage

RecFlow can be applied to the following tasks.
(1) By recording items from the serving space, RecFlow enables the study of how to alleviate the discrepancy between training and serving for specific stages during both learning and evaluation.
(2) RecFlow records the stage information for samples from each stage, facilitating research on joint modeling of multiple stages, such as stage consistency or optimal multi-stage RS.
(3) The positive and negative samples from the exposure space are suitable for classical click-through rate prediction or sequential recommendation tasks.
(4) RecFlow stores multiple types of positive feedback (e.g., effective view, long view, like, follow, share, comment), supporting research on multi-task recommendation.
(5) Video duration and playing time are recorded for each exposed video, allowing the study of learning from implicit feedback, such as predicting playing time.
(6) RecFlow includes a request identifier feature, which can contribute to studying the re-ranking problem.
(7) Timestamps for each sample enable the aggregation of user feedback in chronological order, facilitating the study of user behavior sequence modeling algorithms.
(8) RecFlow incorporates context, user, and video features beyond identity features (e.g., user ID and video ID), making it suitable for context-based recommendation.
(9) The rich information recorded about the RS and user feedback allows the construction of more accurate RS simulators or user models in feed scenarios.
(10) Rich stage data may help estimate selection bias more accurately and design better debiasing algorithms.
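Task (7) can be illustrated with a minimal sketch. It assumes each exposure sample is available as a plain dict carrying the user_id, video_id, request_timestamp, and effective_view fields. The helper name `build_effective_sequences` and the toy rows are hypothetical, and the default cap of 50 only mirrors the name of seq_effective_50_dict; this is not necessarily how the released sequences were built.

```python
from collections import defaultdict


def build_effective_sequences(rows, max_len=50):
    """Group effective views per user in chronological order.

    Keeps only the most recent `max_len` videos for each user.
    `rows` is an iterable of dicts (hypothetical in-memory format).
    """
    # Sort all samples by request time so each user's list is chronological.
    ordered = sorted(rows, key=lambda r: r["request_timestamp"])
    sequences = defaultdict(list)
    for row in ordered:
        if row["effective_view"] == 1:
            sequences[row["user_id"]].append(row["video_id"])
    # Truncate from the left so the latest behaviors survive.
    return {user: vids[-max_len:] for user, vids in sequences.items()}


# Toy samples, deliberately out of chronological order.
rows = [
    {"user_id": 1, "video_id": 30, "request_timestamp": 300, "effective_view": 1},
    {"user_id": 1, "video_id": 10, "request_timestamp": 100, "effective_view": 1},
    {"user_id": 1, "video_id": 20, "request_timestamp": 200, "effective_view": 0},
    {"user_id": 2, "video_id": 40, "request_timestamp": 150, "effective_view": 1},
]
sequences = build_effective_sequences(rows)
```

Sorting once up front keeps the per-user lists ordered without a second pass; for the full dataset a dataframe-based group-by would likely be more practical.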

Dataset Organization

The RecFlow dataset has the following folders and files.
all_stage contains data from all stages.
realshow contains data from the exposure space.
seq_effective_50_dict contains each user's effective_view behavior sequence of length 50.
request_id_dict stores the data from all stages in a first_level_key-second_level_key-value structure. The first_level_key is the request_id, the second_level_key is the stage label (i.e., realshow, rerank_pos, rerank_neg, rank_pos, rank_neg, coarse_neg, prerank_neg), and the value is the corresponding videos of that stage.
ubm_seq_request_id_dict is for user behavior sequence modeling tasks and has the same structure as request_id_dict.
id_cnt.pkl records the number of unique IDs in each feature field.
retrieval_test.feather is the testing dataset for retrieval experiments.
coarse_rank_test.feather is the testing dataset for coarse ranking experiments.
rank_test.feather is the testing dataset for ranking experiments.
realshow_video_info.feather contains the video information from the exposure space.
realshow_video_info_daily contains the accumulated video information from the exposure space.

RecFlow
   ├── all_stage
   |   ├──2024-01-13.feather
   |   ├──2024-01-14.feather
   |   ├──...  
   |   └──2024-02-18.feather
   |       
   ├── realshow
   |   ├──2024-01-13.feather
   |   ├──2024-01-14.feather
   |   ├──...
   |   └──2024-02-18.feather
   |
   ├── seq_effective_50_dict
   |   ├──2024-01-13.pkl
   |   ├──2024-01-14.pkl
   |   ├──...
   |   └──2024-02-18.pkl
   |
   ├── request_id_dict
   |   ├──2024-01-13.pkl
   |   ├──2024-01-14.pkl
   |   ├──...
   |   └──2024-02-18.pkl
   |
   ├── ubm_seq_request_id_dict
   |   ├──2024-01-13.pkl
   |   ├──2024-01-14.pkl
   |   ├──...
   |   └──2024-02-18.pkl
   |
   └── others
      ├──id_cnt.pkl
      ├──retrieval_test.feather
      ├──coarse_rank_test.feather 
      ├──rank_test.feather
      ├──realshow_video_info.feather
      └──realshow_video_info_daily
         ├──2024-01-13.feather
         ├──2024-01-14.feather
         ├──...
         └──2024-02-18.feather
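
The two-level layout of request_id_dict can be sketched as follows. This is an illustration under assumptions, not the repository's loader: it assumes each daily .pkl file unpickles to a plain dict of dicts whose innermost values are lists of video IDs, and that a stage with no videos may simply be absent. The helper names `load_request_dict` and `stage_sizes` and the toy data are hypothetical.

```python
import pickle
import tempfile
from pathlib import Path

# Stage labels used as second-level keys, as listed in this README.
STAGE_KEYS = (
    "realshow", "rerank_pos", "rerank_neg",
    "rank_pos", "rank_neg", "coarse_neg", "prerank_neg",
)


def load_request_dict(path):
    """Load one daily request_id_dict pickle (assumed to be a dict of dicts)."""
    with open(path, "rb") as f:
        return pickle.load(f)


def stage_sizes(request_dict):
    """Count the videos recorded at each stage for every request_id."""
    return {
        request_id: {stage: len(stages.get(stage, [])) for stage in STAGE_KEYS}
        for request_id, stages in request_dict.items()
    }


# Toy stand-in for a daily file; real files are far larger.
toy = {
    1001: {
        "realshow": [11, 12],
        "rerank_pos": [11, 12, 13],
        "rank_neg": [21, 22, 23, 24],
    },
}
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "2024-01-13.pkl"
    path.write_bytes(pickle.dumps(toy))
    loaded = load_request_dict(path)

sizes = stage_sizes(loaded)
```

Only unpickle files obtained from a trusted source, since `pickle.load` can execute arbitrary code.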

Descriptions of the feature fields in RecFlow.

Field Name	Description	Type
request_id	The unique ID of each recommendation request.	Integer
request_timestamp	The timestamp of each recommendation request.	Integer
user_id	The unique ID of each user.	Integer
device_id	The unique ID of each device.	Integer
age	The user's age.	Integer
gender	The user's gender.	Integer
province	The user's province.	Integer
video_id	The unique ID of each video.	Integer
author_id	The unique ID of each author.	Integer
category_level_one	The first level category ID of each video.	Integer
category_level_two	The second level category ID of each video.	Integer
upload_type	The upload type ID of each video.	Integer
upload_timestamp	The upload timestamp of each video.	Integer
duration	The time duration of each video in milliseconds.	Integer
realshow	A binary feedback signal indicating the video is exposed to the user.	Integer
rerank_pos	A binary feedback signal indicating the video ranks top-10 in rerank stage.	Integer
rerank_neg	A binary feedback signal indicating the video ranks out of top-10 in rerank stage.	Integer
rank_pos	A binary feedback signal indicating the video ranks top-10 in rank stage.	Integer
rank_neg	A binary feedback signal indicating the video ranks out of top-10 in rank stage.	Integer
coarse_neg	A binary feedback signal indicating the video ranks out of top-500 in coarse rank stage.	Integer
prerank_neg	A binary feedback signal indicating the video ranks out of top-500 in pre-rank stage.	Integer
rank_index	The rank position of the video in the rank stage.	Integer
rerank_index	The rank position of the video in rerank stage.	Integer
playing_time	The time duration of the user watching the video.	Integer
effective_view	A binary feedback signal indicating the user watches at least 30% of the video.	Integer
long_view	A binary feedback signal indicating the user watches at least 100% of the video.	Integer
like	A binary feedback signal indicating the user hit the like button.	Integer
follow	A binary feedback signal indicating the user hit the follow-author button.	Integer
forward	A binary feedback signal indicating the user forwards this video.	Integer
comment	A binary feedback signal indicating the user writes a comment in the comments section of this video.	Integer
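
The thresholds given for effective_view and long_view can be expressed as a small worked example. It assumes playing_time is measured in the same unit as duration (milliseconds); the table does not state the unit of playing_time, so treat this as an assumption. The function name `view_labels` is hypothetical, and the released labels come from the dataset itself rather than from this code.

```python
def view_labels(playing_time, duration):
    """Derive effective_view / long_view flags from watch time and video length.

    Both arguments are assumed to be integers in the same unit.
    """
    if duration <= 0:
        # Guard against missing or invalid durations.
        return {"effective_view": 0, "long_view": 0}
    return {
        # At least 30% watched; integer arithmetic avoids float rounding.
        "effective_view": int(10 * playing_time >= 3 * duration),
        # At least 100% watched.
        "long_view": int(playing_time >= duration),
    }


# A 10-second video watched for 2 s, 3 s, and 12 s (replays can exceed duration).
examples = [
    view_labels(2_000, 10_000),
    view_labels(3_000, 10_000),
    view_labels(12_000, 10_000),
]
```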

Code

To run the code in this repository, download the data from Drive and place it in the data folder, following the dataset organization described above.
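
Before launching the scripts, it may help to check that the downloaded data matches the expected layout. The sketch below is a hypothetical helper, not part of the repository. It checks only the folders and the individually named files from the dataset organization, and it does not check the daily files because the full list of dates is elided there. The root path is passed in by the caller, since the exact location inside the data folder is an assumption.

```python
import tempfile
from pathlib import Path

# Folders and files named in the dataset organization (relative to the root).
EXPECTED_DIRS = [
    "all_stage",
    "realshow",
    "seq_effective_50_dict",
    "request_id_dict",
    "ubm_seq_request_id_dict",
    "others",
    "others/realshow_video_info_daily",
]
EXPECTED_FILES = [
    "others/id_cnt.pkl",
    "others/retrieval_test.feather",
    "others/coarse_rank_test.feather",
    "others/rank_test.feather",
    "others/realshow_video_info.feather",
]


def missing_entries(root):
    """Return the expected folders and files that are absent under `root`."""
    root = Path(root)
    missing = [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
    missing += [f for f in EXPECTED_FILES if not (root / f).is_file()]
    return missing


# Demo on a temporary directory with one file deliberately left out.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for d in EXPECTED_DIRS:
        (root / d).mkdir(parents=True, exist_ok=True)
    for f in EXPECTED_FILES[:-1]:
        (root / f).touch()
    demo_missing = missing_entries(root)
```

An empty list from `missing_entries` means every named folder and file was found.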

Retrieval

Baseline

bash ./retrieval/run_sasrec.sh

Hard Negative Mining

bash ./retrieval/run_sasrec_hardnegmining.sh

Interplay between Retrieval and Subsequent Stages

bash ./retrieval/run_sasrec_fsltr.sh

Coarse Ranking

Baseline

bash ./coarse/run_dssm.sh

Data Distribution Shift

bash ./coarse/run_dssm_data_dist_shift_sampling.sh
bash ./coarse/run_dssm_data_dist_shift_all.sh

Interplay between Retrieval and Subsequent Stages

bash ./coarse/run_dssm_fsltr.sh

Auxiliary Ranking

bash ./coarse/run_dssm_auxiliary_ranking.sh

User Behavior Sequence Modeling

bash ./coarse/run_dssm_ubm.sh

Ranking

Baseline

bash ./rank/run_din.sh

Data Distribution Shift

bash ./rank/run_din_data_dist_shift_sampling.sh
bash ./rank/run_din_data_dist_shift_all.sh

Interplay between Retrieval and Subsequent Stages

bash ./rank/run_din_fsltr.sh

Auxiliary Ranking

bash ./rank/run_din_auxiliary_ranking.sh

User Behavior Sequence Modeling

bash ./rank/run_din_ubm.sh

Requirements

python=3.7
numpy=1.19.2
pandas=1.3.5
pyarrow=8.0.0
scikit-learn=1.0.2
pytorch=1.6
faiss-gpu=1.7.1
