Project for MLOps Zoomcamp

The main goal of this project is to apply MLOps tools and best practices to a prediction task.
Dataset source: Kaggle House Prices Prediction challenge (link)
The modelling notebook is inspired by Serigne's notebook.
This project automates the prediction of house prices from features of a house such as location, shape, available utilities, condition, and style. It covers the different stages of the process, including training, deployment, and keeping the model running in production.
The project was developed on an AWS EC2 instance, and it is highly recommended to run it on an EC2 instance as well. Model artifacts are stored in an AWS S3 bucket, so it is advised to create an S3 bucket with your own custom name.
Programs installed on EC2: Anaconda, Docker, docker-compose
Clone this repository to your local machine:
git clone https://github.com/bryskulov/mlops-house-prices.git
Folder explanations:
- notebooks: Jupyter notebooks for prototyping
- model_training: Automated model training scripts
- web_service: Deployment of the model as a web-service
First, install the pipenv package and then install the other packages from the Pipfile. It is important to be in the same directory as the Pipfile when running the commands below.
pip install pipenv
pipenv install
Activate the pipenv environment:
pipenv shell
Set your AWS S3 bucket name as an environment variable:
export S3_BUCKET_PATH="s3://mlflow-models-bryskulov"
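Optionally, you can verify that the bucket is reachable before going further. The short check below is not part of the repository, just a convenience sketch that assumes your AWS credentials are already configured:

```python
# check_bucket.py -- optional sanity check, not part of the repo.
import os

import boto3

# S3_BUCKET_PATH looks like "s3://<bucket-name>"; extract the bucket name.
bucket = os.environ["S3_BUCKET_PATH"].replace("s3://", "").split("/")[0]
boto3.client("s3").head_bucket(Bucket=bucket)  # raises if missing or forbidden
print(f"OK: bucket '{bucket}' is accessible")
```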
This script trains the model once, using the data under the "model_training/data/" path. The idea is that new models are trained from whatever data is placed in that folder; in the future, it would of course be better to pull the data from a relational database.
Start the MLflow tracking server. If the backend database does not exist yet, the following command will also create it:
mlflow ui --backend-store-uri sqlite:///mlflow.db --default-artifact-root=$S3_BUCKET_PATH
To start model training, run:
python train.py --data_path data/train.csv
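For orientation, here is a rough sketch of what such a training script typically does. The model choice, feature columns, and metric are illustrative and not necessarily the repo's actual pipeline; it assumes the tracking server from the previous step is running on the default port 5000:

```python
# train_sketch.py -- illustrative only; the real train.py may differ.
import argparse

import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("house-prices")

parser = argparse.ArgumentParser()
parser.add_argument("--data_path", required=True)
args = parser.parse_args()

df = pd.read_csv(args.data_path)
# A few numeric columns from the Kaggle dataset, for illustration.
features = ["OverallQual", "GrLivArea", "GarageCars", "TotalBsmtSF"]
X_train, X_val, y_train, y_val = train_test_split(
    df[features].fillna(0), df["SalePrice"], test_size=0.2, random_state=42
)

with mlflow.start_run():
    model = GradientBoostingRegressor(random_state=42)
    model.fit(X_train, y_train)
    rmse = mean_squared_error(y_val, model.predict(X_val), squared=False)
    mlflow.log_metric("rmse", rmse)
    # The artifact ends up under $S3_BUCKET_PATH/<experiment_id>/<run_id>/artifacts/model
    mlflow.sklearn.log_model(model, artifact_path="model")
```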
Note: The MLflow model registry is used in the Jupyter notebook; in production, however, I decided to reference models by their RUN_ID.
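Concretely, with the default MLflow artifact layout the production code can load a model straight from S3 by its RUN_ID, along these lines (a sketch assuming the layout <bucket>/<experiment_id>/<run_id>/artifacts/model):

```python
import os

import mlflow.pyfunc

# Assumes the default artifact layout <bucket>/<experiment_id>/<run_id>/artifacts/model
model_uri = (
    f"{os.environ['S3_BUCKET_PATH']}/"
    f"{os.environ['MLFLOW_EXPERIMENT_ID']}/{os.environ['RUN_ID']}/artifacts/model"
)
model = mlflow.pyfunc.load_model(model_uri)
```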
Here, model training is scheduled with the workflow orchestration tool Prefect.
Start the MLflow tracking server:
mlflow ui --backend-store-uri sqlite:///mlflow.db --default-artifact-root=$S3_BUCKET_PATH
Start Prefect UI with the following bash command:
prefect orion start
Note: This starts the Prefect Orion server, which can be accessed from the browser.
Create a new deployment with Prefect CLI command:
prefect deployment create prefect_deploy.py
Note: This creates a new deployment in Prefect, but it does not run it.
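For reference, a prefect_deploy.py for the Prefect 2.0 beta CLI used here would look roughly like the sketch below. The flow body, schedule, and names are illustrative; in the repo the flow presumably wraps the actual training code:

```python
# prefect_deploy_sketch.py -- illustrative only, based on the Prefect 2.0b API.
from datetime import timedelta

from prefect import flow
from prefect.deployments import DeploymentSpec
from prefect.flow_runners import SubprocessFlowRunner
from prefect.orion.schemas.schedules import IntervalSchedule


@flow
def train_flow(data_path: str = "data/train.csv"):
    # In the real project this would call the training logic from train.py.
    print(f"Training on {data_path}")


DeploymentSpec(
    flow=train_flow,
    name="scheduled-model-training",
    schedule=IntervalSchedule(interval=timedelta(days=7)),
    flow_runner=SubprocessFlowRunner(),
    tags=["house-prices"],
)
```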
To run the deployment, we need to create a work queue; this can be done in the Prefect UI.
After creating the work queue, start the agent:
prefect agent start <work queue ID>
Now, you can observe all the scheduled, completed and failed flows in the Prefect UI.
After training, inspect the models and choose the one you prefer. Make sure the chosen model has an artifact attached.
Define the chosen run's values as environment variables:
export MLFLOW_EXPERIMENT_ID='1'
export RUN_ID='be58cd18afc44f5ab13b3409613e04f9'
Don't forget to change the directory and activate a separate Pipenv environment:
cd ..
cd web_service
pipenv shell
The web application is served with Flask on localhost:9696.
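A minimal sketch of such a Flask service is shown below; the endpoint name, input handling, and model loading are assumptions, and the repo's actual predict script may differ:

```python
# predict_sketch.py -- illustrative Flask service; the real web_service code may differ.
import os

import mlflow.pyfunc
import pandas as pd
from flask import Flask, jsonify, request

# Same assumed artifact layout as in the loading snippet above.
MODEL_URI = (
    f"{os.environ['S3_BUCKET_PATH']}/"
    f"{os.environ['MLFLOW_EXPERIMENT_ID']}/{os.environ['RUN_ID']}/artifacts/model"
)
model = mlflow.pyfunc.load_model(MODEL_URI)

app = Flask("house-price-prediction")


@app.route("/predict", methods=["POST"])
def predict():
    house = request.get_json()
    prediction = model.predict(pd.DataFrame([house]))
    return jsonify({"predicted_price": float(prediction[0])})


if __name__ == "__main__":
    app.run(debug=True, host="0.0.0.0", port=9696)
```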
You can deploy the model with a couple of commands; the Makefile runs all the checks first and only then deploys the service.
First, adjust the environment variables in the ".env" file to match your setup.
Second, run the Makefile:
make setup
make deploy
To build the Docker image, run:
docker build -t house-price-prediction-service:v2 .
Run the Docker container:
docker run -it --rm -p 9696:9696 \
-e S3_BUCKET_PATH=$S3_BUCKET_PATH \
-e MLFLOW_EXPERIMENT_ID=$MLFLOW_EXPERIMENT_ID \
-e RUN_ID=$RUN_ID \
house-price-prediction-service:v2
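Once the container is up, you can send a quick test request. The endpoint path and feature names below are assumptions for illustration:

```python
# test_request.py -- quick manual check against the running service.
import requests

house = {
    "OverallQual": 7,
    "GrLivArea": 1710,
    "GarageCars": 2,
    "TotalBsmtSF": 856,
}
response = requests.post("http://localhost:9696/predict", json=house)
print(response.json())
```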
I run both unit tests and an integration test on the deployed application.
Pytest is used for unit testing. The tests can be run from an IDE or with the following command:
pytest unit_tests
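As an illustration, a unit test in this style might look like the sketch below; prepare_features is a hypothetical stand-in defined inline here, whereas in the repo the function under test would be imported from the service code:

```python
# unit_tests/test_features_sketch.py -- illustrative only.
def prepare_features(house: dict) -> dict:
    # Hypothetical helper: in the real tests this would be imported
    # from the web service code instead of being defined here.
    features = dict(house)
    features.setdefault("LotFrontage", 0)
    return features


def test_prepare_features_fills_missing_lot_frontage():
    house = {"Neighborhood": "NAmes", "OverallQual": 6}
    features = prepare_features(house)
    assert features["LotFrontage"] == 0
    assert features["OverallQual"] == 6
```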
The integration test is automated, so you only need to run the "run.sh" script in the "integration_test" folder:
cd integration_test
source run.sh
Note: If you get an error, check that you have activated the pipenv environment and passed the environment variables S3_BUCKET_PATH, MLFLOW_EXPERIMENT_ID, and RUN_ID.