Model cards are a succinct approach to documenting the creation, use, and shortcomings of a model. The idea is to write the documentation such that a non-expert can understand the model card's contents. For additional information, see the Model Card paper: https://arxiv.org/pdf/1810.03993.pdf
Ivanovitch Silva created the model. A complete data pipeline was built using DVC and Scikit-Learn to train an XGBoost model. For the sake of understanding, simple hyperparameter tuning was conducted, and the hyperparameter values adopted in training are described in a YAML file.
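As an illustration, such a YAML file might look like the excerpt below. The file name, parameter names, and values here are hypothetical and do not reflect the project's actual configuration:

```yaml
# Hypothetical params.yaml excerpt; actual keys and values may differ
train:
  model: xgboost
  params:
    n_estimators: 200
    max_depth: 6
    learning_rate: 0.1
    subsample: 0.8
```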
This model is used as a proof of concept for the evaluation of an entire data pipeline incorporating MLOps assumptions. The data pipeline is composed of the following stages: a) data, b) eda, c) preprocess, d) check data, e) segregate, f) train, g) evaluate, and h) check model.
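In DVC, a pipeline like this is declared in a `dvc.yaml` file, one entry per stage. The sketch below mirrors two of the stages named above; the commands, dependency paths, and output paths are illustrative only, not the project's actual files:

```yaml
# Hypothetical dvc.yaml excerpt; stage names mirror the pipeline,
# but commands and paths are placeholders
stages:
  preprocess:
    cmd: python preprocess.py
    deps:
      - data/raw.csv
    outs:
      - data/clean.csv
  train:
    cmd: python train.py
    params:
      - train
    deps:
      - data/clean.csv
    outs:
      - models/model.pkl
```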
The dataset used in this project is based on individual income in the United States. The data comes from the 1994 Census and contains information on an individual's marital status, age, type of work, and more. The target column, or what we want to predict, is whether an individual makes less than or equal to 50K a year, or more than 50K a year.
You can download the data from the University of California, Irvine's website.
After the EDA stage of the data pipeline, it was noted that the training data is imbalanced with respect to the target variable and some features (sex, race, and workclass).
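A quick way to surface this kind of imbalance during EDA is to look at relative class frequencies with pandas. The sketch below uses a tiny toy sample in place of the census data, and the target column name (`salary`) is an assumption:

```python
import pandas as pd

# Toy sample standing in for the census training data; the real dataset
# has many more rows, and the "salary" column name is an assumption.
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Female"],
    "salary": ["<=50K", "<=50K", ">50K", "<=50K"],
})

# Relative frequency of each class: a strong skew signals imbalance.
print(df["salary"].value_counts(normalize=True))

# Cross-tabulating a feature against the target exposes joint imbalance.
print(pd.crosstab(df["sex"], df["salary"], normalize="index"))
```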
The dataset under study is split into Train and Test during the Segregate stage of the data pipeline. 70% of the clean data is used for Train and the remaining 30% for Test. Additionally, 30% of the Train data is used for validation purposes (hyperparameter tuning). This configuration is set in a YAML file.
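The split described above can be sketched with Scikit-Learn's `train_test_split`; the toy arrays below stand in for the clean census data, and the exact arguments (random seed, stratification) are assumptions, not the project's actual code:

```python
from sklearn.model_selection import train_test_split

# Toy feature/target arrays standing in for the clean census data.
X = list(range(100))
y = [i % 2 for i in range(100)]

# 70% Train / 30% Test, as configured for the Segregate stage.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.30, random_state=42, stratify=y
)

# 30% of the Train portion is held out for hyperparameter tuning.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.30, random_state=42, stratify=y_train
)

print(len(X_train), len(X_test), len(X_tr), len(X_val))  # 70 30 49 21
```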
In order to follow the performance of the machine learning experiments, the project marked certain stage outputs of the data pipeline as metrics. The metrics adopted are: accuracy, f1, precision, and recall.
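Metrics like these are typically computed with Scikit-Learn and written to a JSON file that DVC can track as a metrics output. The sketch below uses hypothetical predictions and an illustrative file name, not the project's actual evaluation code:

```python
import json
from sklearn.metrics import (accuracy_score, f1_score,
                             precision_score, recall_score)

# Hypothetical labels; in the pipeline these come from the trained model.
y_true = [0, 1, 1, 0, 1, 0, 1, 1]
y_pred = [0, 1, 0, 0, 1, 1, 1, 1]

scores = {
    "accuracy": accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
}

# Writing the scores to JSON lets DVC treat the file as a metrics output.
with open("test_scores.json", "w") as fh:
    json.dump(scores, fh, indent=2)

print(scores)
```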
To display the evaluation metrics, it is only necessary to run:
```bash
dvc metrics show
```
The following results will be shown:
| Path | Accuracy | F1 | Precision | Recall |
|---|---|---|---|---|
| pipeline/data/train_scores.json | 0.8328 | 0.6321 | 0.6728 | 0.5959 |
| pipeline/data/test_scores.json | 0.8382 | 0.6347 | 0.6960 | 0.5833 |
We may be tempted to claim that this dataset contains the only attributes capable of predicting someone's income. However, we know that is not true, and we will need to deal with the class imbalances somehow.
It should be noted that the model trained in this project was used only for validation of a complete data pipeline. It is noteworthy that some important issues related to dataset imbalance exist, and adequate techniques need to be adopted in order to address them.