The goal of this project is to improve the basket prediction algorithm for Instacart, aiming to increase the F1 score from 0.25 to at least 0.28. Various techniques were used to explore customer purchasing behavior and enhance prediction accuracy. The final model surpassed the success threshold, achieving an F1 score of 0.30. Detailed insights and findings are available in the report.
- Improve the F1 score of the current basket prediction algorithm by at least 0.03 (from 0.25 to 0.28).
- Analyze purchasing patterns and explore feature engineering to enhance decision-making.
The project uses the Instacart dataset of approximately 30 million purchased items across roughly 3 million orders placed by 200,000 customers. The data is split across multiple files covering product information, user orders, and the products in each prior order.
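For orientation, here is a minimal loading sketch with Polars. The file names assume the public Instacart release rather than anything stated in the report; adjust paths to your own layout.

```python
import polars as pl

# File names assume the public Instacart release; adjust to your layout.
orders = pl.read_csv("data/orders.csv")                  # one row per order
products = pl.read_csv("data/products.csv")              # product catalogue
prior = pl.read_csv("data/order_products__prior.csv")    # items in prior orders
train = pl.read_csv("data/order_products__train.csv")    # items in training orders

print(orders.shape, prior.shape)
```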
The project relies on a handful of data processing and modeling frameworks; the key technologies (Polars, PySpark, XGBoost, LightGBM, and H2O) are covered in the approach below.
To tackle the problem, I used a combination of feature engineering, distributed computing, and GPU-accelerated training:
- Feature Engineering: Focused on user-, product-, and time-based features, such as reorder rates and reorder intervals; several of these proved strong predictors (see the sketches after this list).
- Data Processing: Initially used Polars for efficient data manipulation, then transitioned to PySpark for distributed data processing as the dataset grew in size.
- Modeling: Models were trained using XGBoost, LightGBM, and H2O, with distributed computing and GPU training for scalability.
- Validation: Employed a time-based validation strategy so the train/validation split respects the sequential nature of purchases (see the training sketch after this list).
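To make the feature-engineering step concrete, here is a minimal Polars sketch that builds user- and product-level reorder features. Column names follow the public Instacart schema; the feature set actually used in the project was broader than this.

```python
import polars as pl

prior = pl.read_csv("data/order_products__prior.csv")
orders = pl.read_csv("data/orders.csv")

# Attach user_id and order timing to every purchased item.
prior_orders = prior.join(orders, on="order_id", how="left")

# User-level behaviour: volume, reorder tendency, typical gap between orders.
user_feats = prior_orders.group_by("user_id").agg(
    pl.len().alias("user_total_items"),
    pl.col("reordered").mean().alias("user_reorder_rate"),
    pl.col("days_since_prior_order").mean().alias("user_mean_days_between_orders"),
)

# Product-level behaviour: popularity and reorder tendency.
product_feats = prior_orders.group_by("product_id").agg(
    pl.len().alias("product_order_count"),
    pl.col("reordered").mean().alias("product_reorder_rate"),
)

# One candidate row per (user, product) pair seen in the prior data.
features = (
    prior_orders.select("user_id", "product_id").unique()
    .join(user_feats, on="user_id")
    .join(product_feats, on="product_id")
)
print(features.head())
```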
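As intermediate tables grow, the same aggregations port to PySpark for distributed processing. A hypothetical sketch of the user-level aggregation above:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("instacart-features").getOrCreate()

prior = spark.read.csv("data/order_products__prior.csv", header=True, inferSchema=True)
orders = spark.read.csv("data/orders.csv", header=True, inferSchema=True)

# Same user-level aggregation as the Polars version, expressed with Spark SQL functions.
user_feats = (
    prior.join(orders, "order_id")
    .groupBy("user_id")
    .agg(
        F.count("*").alias("user_total_items"),
        F.avg("reordered").alias("user_reorder_rate"),
    )
)
user_feats.show(5)
```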
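And a hedged sketch of the training and validation step. It assumes a hypothetical `features.parquet` table holding the engineered features, a binary `reordered` label, and each user's `order_number`; LightGBM on GPU stands in for the full XGBoost/LightGBM/H2O setup, and the 0.2 decision threshold is purely illustrative.

```python
import pandas as pd
import lightgbm as lgb
from sklearn.metrics import f1_score

# Hypothetical table produced by the feature step: one row per (user, product)
# candidate with engineered features, a `reordered` label, and `order_number`.
data = pd.read_parquet("features.parquet")

# Time-based split: each user's most recent order becomes the validation fold.
valid_mask = data["order_number"] == data.groupby("user_id")["order_number"].transform("max")
train_df, valid_df = data[~valid_mask], data[valid_mask]

feature_cols = [c for c in data.columns
                if c not in ("user_id", "product_id", "order_number", "reordered")]

model = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    device="gpu",            # GPU training; drop this on CPU-only machines
)
model.fit(
    train_df[feature_cols], train_df["reordered"],
    eval_set=[(valid_df[feature_cols], valid_df["reordered"])],
    callbacks=[lgb.early_stopping(50)],
)

# Illustrative 0.2 threshold; in practice the threshold is tuned for F1.
pred = (model.predict_proba(valid_df[feature_cols])[:, 1] > 0.2).astype(int)
print("validation F1:", f1_score(valid_df["reordered"], pred))
```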
- Reorder Patterns: Users tend to reorder on the same day, the 7th day, or the 30th day after a previous order.
- Peak Ordering Time: Orders are mostly placed between 8 AM and 4 PM.
- Product Preference: Organic products are reordered 8% more frequently than non-organic products.
- Department Reorder Rates: Dairy, Eggs, Produce, Beverages, and Bakery have reorder rates above 65%, while Personal Care and Pantry have rates below 35%.
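The kind of query behind these insights can be reproduced with a short Polars sketch; file names again assume the public Instacart release, and the exact figures depend on the data snapshot.

```python
import polars as pl

prior = pl.read_csv("data/order_products__prior.csv")
products = pl.read_csv("data/products.csv")
departments = pl.read_csv("data/departments.csv")
orders = pl.read_csv("data/orders.csv")

# Department-level reorder rates (Dairy & Eggs, Produce, etc.).
dept_reorder = (
    prior.join(products, on="product_id")
    .join(departments, on="department_id")
    .group_by("department")
    .agg(pl.col("reordered").mean().alias("reorder_rate"))
    .sort("reorder_rate", descending=True)
)

# Distribution of days between consecutive orders (peaks around 7 and 30).
gap_counts = (
    orders.drop_nulls("days_since_prior_order")
    .group_by("days_since_prior_order")
    .agg(pl.len().alias("n_orders"))
    .sort("n_orders", descending=True)
)

print(dept_reorder.head(10))
print(gap_counts.head(5))
```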
The model achieved an F1 score of 0.30, surpassing the success threshold of 0.28. Further gains could come from additional feature engineering and from sequence models such as LSTMs, GRUs, and Transformers, which handle the order history more directly.