d-sutariya/Instacart-Basket-Recommendation

Instacart Next Basket Prediction

Overview

The goal of this project is to improve the basket prediction algorithm for Instacart by raising its F1 score from 0.25 to at least 0.28. Feature engineering, distributed data processing, and gradient-boosted models were used to explore customer purchasing behavior and improve prediction accuracy. The final model surpassed the success threshold, achieving an F1 score of 0.30. Detailed insights and findings are available in the report.

Objective

  • Improve the F1 score of the current basket prediction algorithm by at least 0.03 (from 0.25 to 0.28).
  • Analyze purchasing patterns and explore feature engineering to enhance decision-making.
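The README does not spell out how the F1 score is computed. As a minimal sketch, assuming the common next-basket convention of scoring each order's predicted product set against its actual product set and averaging over orders, the metric could look like this. The function names and toy baskets are hypothetical, and the handling of empty baskets is an assumption.

```python
def basket_f1(predicted, actual):
    """F1 between a predicted basket and the actual basket (iterables of product IDs)."""
    predicted, actual = set(predicted), set(actual)
    if not predicted and not actual:
        # Assumption: predicting an empty basket for an empty order counts as perfect.
        return 1.0
    true_positives = len(predicted & actual)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(actual)
    return 2 * precision * recall / (precision + recall)


def mean_basket_f1(predictions, actuals):
    """Average per-order F1 over every order in `actuals` (dicts keyed by order ID)."""
    scores = [
        basket_f1(predictions.get(order_id, ()), basket)
        for order_id, basket in actuals.items()
    ]
    return sum(scores) / len(scores)


# Hypothetical toy data: two orders, one partly correct and one entirely wrong.
actuals = {1: {"banana", "milk", "eggs"}, 2: {"bread"}}
predictions = {1: {"banana", "milk", "yogurt"}, 2: {"butter"}}
score = mean_basket_f1(predictions, actuals)
print(round(score, 3))
```

Under this convention, a gain of 0.03 means the average overlap between predicted and actual baskets improves across all evaluated orders, not just for a few heavy users.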

Dataset

The project uses a dataset of approximately 30 million ordered-product records, drawn from 3 million orders placed by 200,000 customers. The data is split across multiple files covering product information, user orders, and the products in prior orders.

Frameworks and Tools

The project uses the following frameworks and tools for data processing, modeling, and visualization:

Core Technologies

  • Polars
  • PySpark
  • XGBoost
  • LightGBM
  • H2O

Additional Tools

  • Plotly
  • Pandas
  • Matplotlib
  • Seaborn

Methodology

To tackle the problem, I used a combination of feature engineering, distributed computing, and GPU-accelerated training:

  • Feature Engineering: Focused on user, product, and time-based features. Many engineered features performed exceptionally well.
  • Data Processing: Initially used Polars for efficient data manipulation, then transitioned to PySpark for distributed data processing as the dataset grew in size.
  • Modeling: Models were trained using XGBoost, LightGBM, and H2O, with distributed computing and GPU training for scalability.
  • Validation: Employed a time-based validation strategy to ensure the model accounted for the sequential nature of purchases.
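The feature engineering and time-based validation steps above can be sketched in plain Python. This is an illustration under stated assumptions, not the project's actual pipeline: it assumes each row carries a user ID, a per-user order sequence number, and a product ID, holds out each user's most recent order as the validation target, and computes features from earlier orders only so that no information leaks from the held-out order. The row layout, function names, and feature names are hypothetical.

```python
from collections import defaultdict

# Hypothetical toy rows: (user_id, order_number, product_id).
rows = [
    (1, 1, "banana"), (1, 1, "milk"),
    (1, 2, "banana"),
    (1, 3, "banana"), (1, 3, "eggs"),
    (2, 1, "bread"),
    (2, 2, "bread"), (2, 2, "milk"),
]


def split_by_last_order(rows):
    """Hold out each user's most recent order; everything earlier is history."""
    last_order = defaultdict(int)
    for user_id, order_number, _ in rows:
        last_order[user_id] = max(last_order[user_id], order_number)
    history = [r for r in rows if r[1] < last_order[r[0]]]
    holdout = [r for r in rows if r[1] == last_order[r[0]]]
    return history, holdout


def user_product_features(history):
    """Per (user, product): number of past orders containing the product,
    and that count as a share of the user's past orders."""
    user_orders = defaultdict(set)
    pair_orders = defaultdict(set)
    for user_id, order_number, product_id in history:
        user_orders[user_id].add(order_number)
        pair_orders[(user_id, product_id)].add(order_number)
    return {
        pair: {
            "times_ordered": len(orders),
            "order_rate": len(orders) / len(user_orders[pair[0]]),
        }
        for pair, orders in pair_orders.items()
    }


history, holdout = split_by_last_order(rows)
features = user_product_features(history)
# Positive labels: (user, product) pairs that appear in the held-out order.
labels = {(user_id, product_id) for user_id, _, product_id in holdout}
print(features[(1, "banana")], (1, "banana") in labels)
```

At the scale described in the Dataset section, the same logic would more plausibly be written as group-by aggregations in Polars or PySpark; the split rule stays the same.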

Key Findings

  1. Reorder Patterns: Users tend to reorder on the same day, the 7th day, or the 30th day after a previous order. (Figure: order count by days_since_prior_order.)
  2. Peak Ordering Time: Orders are mostly placed between 8 AM and 4 PM. (Figure: order distribution by hour of day.)
  3. Product Preference: Organic products are reordered 8% more frequently than non-organic products. (Figure: product type, organic vs. non-organic.)
  4. Department Reorder Rates: Dairy, Eggs, Produce, Beverages, and Bakery have reorder rates above 65%, while Personal Care and Pantry have rates below 35%. (Figure: reorder percentage by department.)
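A department-level reorder rate like the one in finding 4 is a grouped ratio: reordered lines divided by all ordered lines in that department. The sketch below shows the calculation on made-up rows. It assumes each ordered-product record has a department name and a 0/1 reordered flag; the data and function name are hypothetical and do not reproduce the reported percentages.

```python
from collections import defaultdict

# Hypothetical toy rows: (department, reordered), where reordered is 0 or 1.
order_lines = [
    ("produce", 1), ("produce", 1), ("produce", 0),
    ("pantry", 0), ("pantry", 1), ("pantry", 0), ("pantry", 0),
]


def reorder_rate_by_department(order_lines):
    """Share of ordered-product lines flagged as reorders, per department."""
    totals = defaultdict(int)
    reorders = defaultdict(int)
    for department, reordered in order_lines:
        totals[department] += 1
        reorders[department] += reordered
    return {dept: reorders[dept] / totals[dept] for dept in totals}


rates = reorder_rate_by_department(order_lines)
# Print departments from highest to lowest reorder rate.
for department, rate in sorted(rates.items(), key=lambda item: -item[1]):
    print(f"{department}: {rate:.0%}")
```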

Conclusion

The model achieved an F1 score of 0.30, surpassing the success threshold of 0.28. Future improvements could come from further feature engineering and from exploring sequence architectures such as LSTMs, GRUs, and Transformers, which are better suited to sequential data.
