Model cards are a succinct approach to documenting the creation, use, and shortcomings of a model. The idea is to write the documentation such that a non-expert can understand the model card's contents. For additional information, see the Model Card paper: https://arxiv.org/pdf/1810.03993.pdf
Ivanovitch Silva created the model. A complete data pipeline was built using DVC and Scikit-Learn to train an XGBoost model. For the sake of understanding, simple hyperparameter tuning was conducted, and the hyperparameter values adopted in training are described in a YAML file.
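As an illustration, such a YAML file might look like the excerpt below. The file name, parameter names, and values here are hypothetical and do not reflect the project's actual configuration:

```yaml
# Hypothetical params.yaml excerpt; actual keys and values may differ
train:
  model: xgboost
  params:
    n_estimators: 200
    max_depth: 6
    learning_rate: 0.1
    subsample: 0.8
```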
This model is used as a proof of concept for the evaluation of an entire data pipeline incorporating MLOps assumptions. The data pipeline is composed of the following stages: a) data, b) eda, c) preprocess, d) check data, e) segregate, f) train, g) evaluate, and h) check model.
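In DVC, a pipeline like this is declared in a `dvc.yaml` file, one entry per stage. The sketch below mirrors two of the stages named above; the commands, dependency paths, and output paths are illustrative only, not the project's actual files:

```yaml
# Hypothetical dvc.yaml excerpt; stage names mirror the pipeline,
# but commands and paths are placeholders
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py
    params:
      - train
    deps:
      - data/clean.csv
    outs:
      - models/model.pkl
```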
The dataset used in this project is based on individual income in the United States. The data comes from the 1994 Census and contains information on an individual's marital status, age, type of work, and more. The target column, or what we want to predict, is whether an individual makes less than or equal to 50K a year, or more than 50K a year.
You can download the data from the University of California, Irvine's website.
After the EDA stage of the data pipeline, it was noted that the training data is imbalanced with respect to the target variable and some features (sex, race, and workclass).
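A quick way to surface this kind of imbalance during EDA is to look at relative class frequencies with pandas. The sketch below uses a tiny toy sample in place of the census data, and the target column name (`salary`) is an assumption:

```python
import pandas as pd

# Toy sample standing in for the census training data; the real dataset
# has many more rows, and the "salary" column name is an assumption.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female"],
    "salary": ["<=50K", "<=50K", ">50K", "<=50K"],
})

# Relative frequency of each class: a strong skew signals imbalance.
print(df["salary"].value_counts(normalize=True))

# Cross-tabulating a feature against the target exposes joint imbalance.
print(pd.crosstab(df["sex"], df["salary"], normalize="index"))
```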
The dataset under study is split into Train and Test during the Segregate stage of the data pipeline. 70% of the clean data is used for Train and the remaining 30% for Test. Additionally, 30% of the Train data is used for validation purposes (hyperparameter tuning). This configuration is set in a YAML file.
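The split described above can be sketched with Scikit-Learn's `train_test_split`; the toy arrays below stand in for the clean census data, and the exact arguments (random seed, stratification) are assumptions, not the project's actual code:

```python
from sklearn.model_selection import train_test_split

# Toy feature/target arrays standing in for the clean census data.
X = list(range(100))
y = [i % 2 for i in range(100)]

# 70% Train / 30% Test, as configured for the Segregate stage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# 30% of the Train portion is held out for hyperparameter tuning.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.30, random_state=42, stratify=y_train
)

print(len(X_train), len(X_test), len(X_tr), len(X_val))  # 70 30 49 21
```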
In order to follow the performance of the machine learning experiments, the project marked certain stage outputs of the data pipeline as metrics. The metrics adopted are: accuracy, f1, precision, and recall.
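Metrics like these are typically computed with Scikit-Learn and written to a JSON file that DVC can track as a metrics output. The sketch below uses hypothetical predictions and an illustrative file name, not the project's actual evaluation code:

```python
import json
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical labels; in the pipeline these come from the trained model.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}

# Writing the scores to JSON lets DVC treat the file as a metrics output.
with open("test_scores.json", "w") as fh:
    json.dump(scores, fh, indent=2)

print(scores)
```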
To display the evaluation metrics, it is only necessary to run:
```bash
dvc metrics show
```
The following results will be shown:
| Path | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| pipeline/data/train_scores.json | 0.8328 | 0.6321 | 0.6728 | 0.5959 |
| pipeline/data/test_scores.json | 0.8382 | 0.6347 | 0.6960 | 0.5833 |
We may be tempted to claim that this dataset contains the only attributes capable of predicting someone's income. However, we know that is not true, and we will need to deal with the class imbalances somehow.
It should be noted that the model trained in this project was used only for validation of a complete data pipeline. It is noteworthy that some important issues related to dataset imbalance exist, and adequate techniques need to be adopted in order to address them.