The tool creates a train/validation split for a machine learning dataset stored in BigQuery. The output is a set of .csv files in a Google Cloud Storage bucket.
The data is processed in several phases (a conceptual sketch of the shuffling and splitting steps follows the list):
- Reading data from the BigQuery source table.
- Shuffling the data and copying it to a temporary BigQuery table in the destination dataset.
- Splitting the temporary table into two: training and validation tables.
- Exporting the training and validation data to the corresponding .csv files in the destination GCS bucket.
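The sketch below is only a conceptual illustration of the shuffling and splitting phases, not the tool's actual implementation (the real queries live in the ml_dataprep.queries module). It assumes the google-cloud-bigquery client library and uses hypothetical project, dataset and table names.

```python
# Conceptual sketch of the shuffle and split phases only; the real queries are
# generated by the ml_dataprep.queries module. All project, dataset and table
# names are hypothetical placeholders.
from google.cloud import bigquery

client = bigquery.Client(project="my_bq_project")
split_ratio = 0.8  # fraction of rows assigned to the training set

# Copy the source rows into a temporary table together with a random key,
# which provides both the shuffle and the basis for the split.
client.query("""
    CREATE TABLE `my_bq_project.my_census.tmp_shuffled` AS
    SELECT *, RAND() AS split_key
    FROM `bigquery-public-data.census_bureau_international.age_specific_fertility_rates`
""").result()

# Split the temporary table into training and validation tables.
for name, condition in [("training", f"split_key < {split_ratio}"),
                        ("validation", f"split_key >= {split_ratio}")]:
    client.query(f"""
        CREATE TABLE `my_bq_project.my_census.tmp_{name}` AS
        SELECT * EXCEPT(split_key)
        FROM `my_bq_project.my_census.tmp_shuffled`
        WHERE {condition}
    """).result()
```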
Beyond moving and splitting the data, you can limit the output to the columns you want to keep and generate new features using SQL syntax. You can also supply parameters for dynamic column names and a specific train/validation split ratio.
To make this happen, several parameters have to be set when running the code:
Parameter name | Description | Required
---|---|---
source_project | GCP project from which to extract the data | Y
source_dataset | BigQuery dataset from which to extract the data | Y
source_table | BigQuery table from which to extract the data | Y
destination_project | GCP project in which to save the intermediary data | Y
destination_dataset | BigQuery dataset in which to save the intermediary data | Y
destination_gcs_path | Google Cloud Storage path to which the training and validation data are exported in CSV format | Y
parameters | Values to substitute for the placeholders in parameterized column names | N
split_ratio | Split ratio for the train/validation split | N
The code is divided into several modules that serve different purposes:
- The ml_dataprep.config module is where the user specifies the features and target values to be extracted, by defining the COLUMNS, TARGET_COLUMNS_SHUFFLE and TARGET_COLUMNS_EXPORT arrays.
- The ml_dataprep.queries module defines the queries used for shuffling and splitting the data.
- The ml_dataprep.bigquery module is used to connect to and use the capabilities of BigQuery.
- The processing logic is defined in the ml_dataprep.dataprep module. Using the above modules it implements the creation of the temporary tables, the data movement and export.
- The code can be run using the ml_dataprep.runner module as shown below.
This code requires Python, preferably a stable 3.x version. You can download and install Python from https://www.python.org/downloads/. The requirements.txt file lists the modules needed to run the tool. You can install the required packages using the following command:
pip install -r requirements.txt
In order to run the tool successfully you also need the Google Cloud SDK and a Google Cloud project that will contain the destination BigQuery dataset and Cloud Storage bucket. This project is used as the destination_project parameter. Similarly, you need a BigQuery dataset and a Google Cloud Storage bucket, which are used as the destination_dataset and destination_gcs_path parameters respectively.
The next step is to authorize the gcloud tools using your credentials and set up environment properties. You can do it using the following command:
gcloud init
You should set the environment to use the destination project (for example with the gcloud config set project command).
Make sure you have the access rights required to read the data from the BigQuery source table.
This section describes how to get the default configuration working and then guides you through adjusting it to your specific use case. The input is the source data in BigQuery. The output is the training and validation datasets saved as BigQuery tables and Google Cloud Storage files.
The inputs of the tool are the following:
- Source BigQuery project, dataset and table - provided as parameters to the script.
- Destination BigQuery project and dataset - provided as parameters to the script.
- Destination Cloud Storage path - passed as a parameter to the script.
- Columns related parameters (column names, SQL transformations, target columns) - configured in the ml_dataprep.config module.
- Split ratio between training and validation data - optional parameter passed to the script.
- Column names parameters - optional parameter passed to the script.
The result of the script execution is:
- Two BigQuery tables containing the training and validation data, named <source_table_name>_training_<timestamp> and <source_table_name>_validation_<timestamp> respectively, located in the <destination_project>.<destination_dataset> dataset
- A Cloud Storage folder named <source_table_name>_<timestamp>, located in <destination_gcs_path>, containing the training and validation data files
- Training data files named training_<index>.csv located in the above mentioned folder
- Validation data files named validation_<index>.csv located in the above mentioned folder
Depending on the size of the source table, more than one file may be created for each of the training and validation datasets (BigQuery splits exports larger than 1 GB into multiple files).
The default tool configuration allows you to experiment with one of the public BigQuery datasets, the International Census Data. We use the age_specific_fertility_rates table of this dataset.
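If you want to inspect which columns this table (or any other source table) exposes, a small sketch using the google-cloud-bigquery client could look like the one below. It is a convenience snippet, not part of the tool, and assumes your default gcloud project is already set.

```python
# List the column names and types of the example source table.
from google.cloud import bigquery

client = bigquery.Client()  # uses your default gcloud project and credentials
table = client.get_table(
    "bigquery-public-data.census_bureau_international.age_specific_fertility_rates")
for field in table.schema:
    print(f"{field.name}: {field.field_type}")
```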
To make the example work and generate the training and validation data in your project, you need to fill in your Google Cloud project ID, BigQuery dataset name and path to the Google Cloud Storage bucket in the run.sh script. Running the script will create temporary tables in your BigQuery dataset and the corresponding .csv files containing the training and validation datasets.
For example, if your destination project ID is my_bq_project, your dataset name is my_census and the path to the bucket is gs://my_ages_ratios, run.sh will look like this:
python -m ml_dataprep.runner \
--parameters 15_19 20_24 25_29 30_34 35_39 40_44 45_49 \
--source_project=bigquery-public-data \
--source_dataset=census_bureau_international \
--source_table=age_specific_fertility_rates \
--destination_project=my_bq_project \
--destination_dataset=my_census \
--destination_gcs_path=gs://my_ages_ratios
After running the script you will find the training and validation datasets in the following BigQuery tables:
- my_bq_project.my_census.age_specific_fertility_rates_training_<timestamp>
- my_bq_project.my_census.age_specific_fertility_rates_validation_<timestamp>
The training and validation datasets will also be present in the following Cloud Storage files (a sketch for locating the outputs programmatically follows the list):
- gs://my_ages_ratios/age_specific_fertility_rates_<timestamp>/training_000000000000.csv
- gs://my_ages_ratios/age_specific_fertility_rates_<timestamp>/validation_000000000000.csv
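Because the <timestamp> part of the names is generated at run time, it can be convenient to list the outputs programmatically. The sketch below is a convenience snippet, not part of the tool; it assumes the google-cloud-bigquery and google-cloud-storage client libraries and reuses the hypothetical names from the example above.

```python
# Convenience sketch for locating the generated outputs.
from google.cloud import bigquery, storage

# List the generated training/validation tables in the destination dataset.
bq_client = bigquery.Client(project="my_bq_project")
for table in bq_client.list_tables("my_bq_project.my_census"):
    if table.table_id.startswith("age_specific_fertility_rates_"):
        print("BigQuery table:", table.table_id)

# List the exported .csv files in the destination bucket.
gcs_client = storage.Client(project="my_bq_project")
for blob in gcs_client.list_blobs("my_ages_ratios",
                                  prefix="age_specific_fertility_rates_"):
    print("GCS file:", blob.name)
```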
Starting from the default tool configuration, to change the BigQuery source you need to provide the source_project, source_dataset and source_table parameters in the run.sh file.
IMPORTANT Changing the source data means the table structure (and hence the columns) will be different. You will need to specify the following variables in the ml_dataprep.config module:
- COLUMNS - array of column names that you want to be in the final dataset unchanged.
- TARGET_COLUMNS_SHUFFLE - array of the column names that are treated as target values in the machine learning training; as in the default example, you can create additional target columns as SQL combinations of source columns and give them names.
- TARGET_COLUMNS_EXPORT - array of the target column names that are copied to the final train/validation split.
All arrays can contain parameterized names; the values for these parameters are supplied via the parameters argument when running the run.sh script. Columns can be specified directly or constructed using SQL syntax. However, it is recommended that TARGET_COLUMNS_EXPORT uses the names of the columns as they were defined in TARGET_COLUMNS_SHUFFLE.
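As a purely hypothetical illustration (the placeholder syntax and the column names below are assumptions, not the tool's actual defaults), the three arrays in ml_dataprep.config might look roughly like this:

```python
# Hypothetical illustration only; the placeholder syntax and column names are
# assumptions, not the tool's actual defaults.

# Feature columns copied to the final dataset unchanged.
COLUMNS = [
    'country_name',
    'year',
]

# Target columns used during the shuffle step; a target can also be a SQL
# expression over source columns with an alias. Here '{0}' stands for a column
# name parameter supplied via the parameters argument (e.g. 15_19).
TARGET_COLUMNS_SHUFFLE = [
    'fertility_rate_{0} AS target_fertility_rate_{0}',
]

# Target columns copied to the final train/validation split; they reuse the
# aliases defined in TARGET_COLUMNS_SHUFFLE.
TARGET_COLUMNS_EXPORT = [
    'target_fertility_rate_{0}',
]
```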