
kickstarter-mlops

Built with: Poetry · Python · Weights & Biases · Prefect · Docker · Flask · Jupyter Notebook · Pandas · scikit-learn · PyCharm

1. Context

Kickstarter is a popular crowdfunding platform where creative minds seek support for their projects. It hosts a wide variety of projects, from technology startups to creative arts ventures and social impact initiatives. On Kickstarter, a project secures its funds only if it reaches its predefined funding goal. Numerous factors influence the outcome of a project (e.g., project category, funding goal, country, ...), which makes it feasible to build a predictive model of a project's likelihood of success.

2. Goal

The objective of this project is to put into practice the knowledge acquired during the mlops-zoomcamp course (offered by DataTalks.Club). We aim to build an MLOps pipeline to predict if a Kickstarter project will succeed or fail.

3. Project Structure

├── cli.py                # Command-line interface for Poetry to interact with the project.
├── compose.yaml          # Docker Compose file for LocalStack & Flask App services.
├── data
│   ├── interim               # (Intermediate) cleaned data storage directory.
│   ├── processed             # (Final) feature-engineered data storage directory.
│   └── raw                   # (Original) raw data storage directory.
│
├── images                # Directory for storing project images.
├── LICENSE               # Project license file.
├── models
│   ├── interim               # Cleaning pipeline storage directory.
│   ├── processed             # Feature engineering pipeline storage directory.
│   ├── registry              # Model registry storage directory.
│   └── trained               # Trained models storage directory.
│
├── notebooks
│   └── Kickstarter_Prototyping_and_EDA.ipynb  # Jupyter notebook for exploratory data analysis.
│
├── poetry.lock           # Lock file for Poetry package manager.
├── pyproject.toml        # Project configuration file using Poetry.
├── README.md             # Main project README file.
├── sample.env            # Sample environment variable configuration file.
└── src
    ├── data
    │   ├── __init__.py          # Initialization for data module.
    │   ├── cleaner.py           # Data cleaning script.
    │   └── downloader.py        # Data downloading script.
    │
    ├── deployment
    │   ├── __init__.py           # Initialization for deployment module.
    │   └── web_service
    │       ├── __init__.py                      # Initialization for web service module.
    │       ├── Dockerfile                       # Docker configuration for the web service.
    │       ├── poetry.lock                      # Lock file for the web service.
    │       ├── predict.py                       # Prediction script for the web service.
    │       ├── pyproject.toml                   # Project configuration for the web service (Poetry).
    │       ├── sample_kickstarter_project.json  # Sample input data for the web service.
    │       └── test.py                          # Test script for the web service.
    │
    ├── features
    │   ├── __init__.py           # Initialization for feature engineering module.
    │   └── build_features.py     # Feature engineering script.
    │
    ├── models
    │   ├── __init__.py           # Initialization for models module.
    │   ├── register_model.py     # Script for model registry in W&B.
    │   ├── sweep_config.yaml     # Configuration for hyperparameter tuning.
    │   └── train.py              # Script for model training.
    │
    ├── orchestration
    │   ├── __init__.py           # Initialization for orchestration module.
    │   └── orchestrate_train.py  # Orchestration script for model training (Prefect flow).
    │
    └── utils
        ├── __init__.py           # Initialization for utility module.
        ├── aws_s3.py             # Utility for Amazon S3 operations.
        ├── io.py                 # Utility for file I/O operations.
        ├── pipelines.py          # Utility for data processing pipelines.
        └── wandb.py              # Utility for W&B integration.

4. Dataset

The latest available dataset is automatically downloaded from webrobots.io. It contains data on projects hosted on Kickstarter from 2009 to 2023. The raw data contains 39 features (see the Jupyter notebook), from which we selected the 11 features we consider most relevant as inputs to our ML model:

Feature                     Type    Description
creation_to_launch_hours    float   Time from project creation to launch on Kickstarter
campaign_hours              float   Time from launch to deadline
name_length                 int     Length (in words) of the project's name
description_length          int     Length (in words) of the project's description
usd_goal                    float   Funding goal in USD
main_category               str     Project's main category (e.g., journalism)
sub_category                str     Project's sub-category (e.g., print)
country                     str     Country code of the project creator's country of origin
staff_pick                  bool    Whether the project was highlighted as a staff pick
diff_main_category_goal     float   Difference between the project's goal and the median goal of its main category
diff_sub_category_goal      float   Difference between the project's goal and the median goal of its sub-category
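
For reference, a single input record to the model carries these 11 fields. Below is a minimal sketch in Python with made-up values; the schema mirrors the features above, though the exact layout of sample_kickstarter_project.json may differ:

    # Illustrative input record with the 11 final features (values are invented).
    sample_project = {
        "creation_to_launch_hours": 312.5,   # time from project creation to launch
        "campaign_hours": 720.0,             # a 30-day campaign
        "name_length": 6,                    # words in the project name
        "description_length": 18,            # words in the description
        "usd_goal": 15000.0,
        "main_category": "journalism",
        "sub_category": "print",
        "country": "US",
        "staff_pick": False,
        "diff_main_category_goal": 4200.0,   # difference vs. median goal of the main category
        "diff_sub_category_goal": 6100.0,    # difference vs. median goal of the sub-category
    }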

5. Prerequisites

  1. conda (optional): install conda on your system.
  2. python: make sure you have Python 3.10 installed (included with conda).
  3. docker: required for containerization.
  4. docker compose: you'll also need Compose V2 for managing multi-container applications.

6. Setup

We will build everything on top of a baseline virtual environment (base), which in our case is provided by conda.

6.1. Clone the repo

(base) $ git clone https://github.com/BoKatanKrize/kickstarter-mlops.git

6.2. Set up the project using poetry

(base) $ pip install poetry

The idea is to use poetry as the package manager and dependency resolver while leveraging conda to manage the Python interpreter. Thus, if we run:

(base) $ poetry install

poetry has now created a virtual environment (on top of (base)) dedicated to the project, and installed the packages listed in pyproject.toml. The package dependencies are recorded in the lock file, named poetry.lock.

Finally, install Poe the Poet as a poetry plugin:

(base) $ poetry self add 'poethepoet[poetry_plugin]'

6.3. Set up S3 bucket with localstack

  1. Start the localstack container
     (base) $ docker compose up -d localstack
  2. Install the AWS Command Line Interface
     (base) $ pip install awscli
  3. Configure the AWS CLI
    1. Make sure you have localstack up and running
    2. Run
       (base) $ aws configure --profile localstack-profile
    3. You'll be prompted to enter your AWS access key, secret key, region, and output format, e.g.:

       • AWS Access Key ID [None]: localstack
       • AWS Secret Access Key [None]: password
       • Default region name [None]: eu-west-1
       • Default output format [None]: JSON

       which are the values saved in sample.env.

    4. The configuration will be stored in ~/.aws/config

    5. Create .env based on sample.env
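
If you'd rather check the bucket from Python than through the AWS CLI, here is a minimal boto3 sketch. It assumes the kickstarter-bucket name and the LocalStack endpoint http://localhost:4566 used later in this README, plus the sample.env credentials shown above:

    # Verify the LocalStack S3 bucket from Python (assumes sample.env values).
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:4566",    # LocalStack edge port
        aws_access_key_id="localstack",
        aws_secret_access_key="password",
        region_name="eu-west-1",
    )

    # Create the bucket if the pipeline hasn't created it yet, then list its contents.
    existing = [b["Name"] for b in s3.list_buckets()["Buckets"]]
    if "kickstarter-bucket" not in existing:
        s3.create_bucket(
            Bucket="kickstarter-bucket",
            CreateBucketConfiguration={"LocationConstraint": "eu-west-1"},
        )
    for obj in s3.list_objects_v2(Bucket="kickstarter-bucket").get("Contents", []):
        print(obj["Key"])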

6.4. Set up Weights & Biases

  1. Create a W&B account
  2. Create a project called kickstarter-mlops
  3. Save the following variables to .env:
    1. Your W&B API Key
    2. Your entity (user name)
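
To confirm the credentials work before training, you can authenticate programmatically. A small sketch, assuming the key and entity are stored in .env under WANDB_API_KEY and WANDB_ENTITY (check sample.env for the variable names actually used):

    # Smoke-test the W&B credentials stored in .env.
    import os

    import wandb
    from dotenv import load_dotenv   # pip install python-dotenv

    load_dotenv()                                   # reads .env from the project root
    wandb.login(key=os.environ["WANDB_API_KEY"])    # authenticates against wandb.ai

    # A throwaway run confirms the entity/project pair is reachable.
    run = wandb.init(project="kickstarter-mlops",
                     entity=os.environ.get("WANDB_ENTITY"),
                     job_type="smoke-test")
    run.finish()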

6.5. Set up Prefect Cloud

  1. Create a prefect cloud account
  2. Create an API key and a workspace
    1. Save them to .env
    2. Finally, provide the [USER-ID] to complete PREFECT_API_URL
  3. Authenticate
    (base) $ prefect cloud login -k <your-api-key>

7. Training

One of the key advantages of using poetry is the streamlined management of script executions without the need for a Makefile. Poetry runs the entire workflow from the repository root.

  1. Launch the localstack S3 bucket (if it's not running already)
    (base) $ poetry poe launch-localstack-s3
  2. The downloader script downloads the latest available data and saves it as raw data, ensuring you always work with the most up-to-date dataset.
    (base) $ poetry run downloader
  3. The cleaner script employs scikit-learn pipelines to clean the data and remove unnecessary features.
    (base) $ poetry run cleaner
  4. The build_features script utilizes scikit-learn pipelines to perform feature engineering. This step aims to enhance the quality of the final features.
    (base) $ poetry run build_features
  5. The train script employs a Weights & Biases Sweep to perform hyperparameter optimization with both XGBoost and LightGBM, identifying the best-performing model by tuning various hyperparameters (see the sketch after this list).
    (base) $ poetry run train
  6. The register_model script searches for the best model in terms of the ROC AUC metric among the XGBoost and LightGBM candidates. Once identified, it stores this model in the W&B model registry, allowing for easy access and tracking.
    (base) $ poetry run register_model
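
The hyperparameter search in step 5 is driven by train.py together with sweep_config.yaml. The outline below is only a hedged sketch of the W&B Sweeps pattern it relies on, not the project's actual code (the search space shown is illustrative):

    # Hedged outline of a W&B sweep: define a search space, create the sweep,
    # and let an agent call the training function repeatedly.
    import wandb

    sweep_config = {                        # illustrative; the real values live in sweep_config.yaml
        "method": "bayes",
        "metric": {"name": "roc_auc", "goal": "maximize"},
        "parameters": {
            "learning_rate": {"min": 0.01, "max": 0.3},
            "max_depth": {"values": [4, 6, 8]},
        },
    }

    def train_once():
        with wandb.init() as run:           # each agent call gets its own run and sampled config
            params = dict(run.config)
            # ... fit XGBoost/LightGBM with `params`, evaluate on a validation split, then:
            run.log({"roc_auc": 0.5})        # placeholder metric value

    sweep_id = wandb.sweep(sweep_config, project="kickstarter-mlops")
    wandb.agent(sweep_id, function=train_once, count=10)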

Throughout each of these steps, the data and models are saved to an S3 bucket provided by Localstack, ensuring that all artifacts are preserved for future reference. E.g., to check the trained models from W&B Sweep:

    (base) $ aws s3 --endpoint-url http://localhost:4566 ls s3://kickstarter-bucket/models/trained/

8. Orchestration

The previous training workflow can also be executed automatically using a Prefect deployment.

  1. Set up orchestration with Prefect
    (base) $ poetry poe setup-orchestration
  • Creates the Prefect work pool
  • Creates the Prefect deployment (unfortunately, the prompts can't be skipped; select the defaults)
  • Starts the Prefect worker
  2. Launch the Prefect orchestration
    (base) $ poetry poe setup-orchestration
  • Launches a localstack S3 bucket (if it's not running already)
  • Runs the Prefect deployment + work pool + worker
  • The best ML model is saved in the W&B Registry
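
Under the hood, orchestrate_train.py wires the training steps into a Prefect flow. The snippet below is a simplified sketch of that pattern (task names and bodies are placeholders, not the repository's actual implementation):

    # Simplified sketch of a Prefect 2 flow chaining the training steps.
    from prefect import flow, task

    @task(retries=2)
    def download_data():
        ...   # e.g. the logic behind `poetry run downloader`

    @task
    def clean_data():
        ...   # e.g. the logic behind `poetry run cleaner`

    @task
    def build_features():
        ...   # e.g. the logic behind `poetry run build_features`

    @task
    def train_and_register():
        ...   # sweep + push the best model to the W&B registry

    @flow(name="kickstarter-training")
    def training_flow():
        download_data()
        clean_data()
        build_features()
        train_and_register()

    if __name__ == "__main__":
        training_flow()   # a Prefect deployment runs this same flow via the work pool/worker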

9. Deployment as web-service

We deploy the best model in the W&B Registry as a web service (in a Docker container). Make sure that the S3 bucket from either direct training (sec. 7) or orchestration (sec. 8) is running. Then execute

    (base) $ poetry poe launch-flask-app

to set up the web service. The web service automatically connects to the W&B Registry and fetches the best model. It reads a single Kickstarter project and its features (sample_kickstarter_project.json) and returns a prediction of "Successful" or "Failed". To send this data and obtain the prediction, execute

    (base) $ poetry poe predict-flask-app
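
The same request can also be sent by hand with a few lines of Python. A minimal sketch, assuming the Flask app listens on port 9696 and exposes a /predict route (check src/deployment/web_service/test.py for the actual port and endpoint):

    # Send one Kickstarter project to the web service and print the prediction.
    import json

    import requests

    with open("src/deployment/web_service/sample_kickstarter_project.json") as f:
        project = json.load(f)

    # NOTE: port and route are assumptions; see test.py for the real values.
    resp = requests.post("http://localhost:9696/predict", json=project, timeout=10)
    resp.raise_for_status()
    print(resp.json())   # e.g. {"prediction": "Successful"} (exact payload may differ)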
