This was a competition between universities and research institutes around the globe, hosted by datasciencegame as an in-class Kaggle challenge. The task, based on Deezer's music streaming data, was to predict the probability that a user listens to a recommended song. I led a team of three fellow data science students, and our final submission, a blend of gradient boosting models, placed us ahead of teams from other top universities such as the University of Cambridge, Imperial College, Berkeley, and LSE.
The main challenge of this dataset was that the training and test sets came from different distributions, which made local cross-validation results inconsistent with the public leaderboard scores on Kaggle. To mitigate this, we used "adversarial validation" (http://fastml.com/adversarial-validation-part-one/): a classifier is trained to distinguish training from test samples, and the training data is then sorted by its predicted probability of belonging to the test distribution. From this ranking we built a validation set out of the training samples that resembled the test set most, which gave us a consistent evaluation process. We also reduced the dataset (8+ million samples) considerably by deleting 75% of the provided samples, leading to a huge speed-up and even a slight performance gain. Engineered features such as the number of days between the listening and release dates also proved powerful. The final predictions consisted of a blend of different gradient boosting models from the xgboost and lightgbm Python libraries.