machine-learning-walkthrough-2-upload-data.md

File metadata and controls

97 lines (63 loc) · 5.48 KB
title: Step 2: Upload data into a Machine Learning experiment | Microsoft Docs
description: Step 2 of the Develop a predictive solution walkthrough: Upload stored public data into Azure Machine Learning Studio.
services: machine-learning
documentationcenter:
author: garyericson
manager: jhubbard
editor: cgronlun
ms.assetid: 9f4bc52e-9919-4dea-90ea-5cf7cc506d85
ms.service: machine-learning
ms.workload: tbd
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 12/16/2016
ms.author: garye

Walkthrough Step 2: Upload existing data into an Azure Machine Learning experiment

This is the second step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning:

  1. Create a Machine Learning workspace
  2. Upload existing data
  3. Create a new experiment
  4. Train and evaluate the models
  5. Deploy the Web service
  6. Access the Web service

To develop a predictive model for credit risk, we need data that we can use to train and then test the model. For this walkthrough, we'll use the "UCI Statlog (German Credit Data) Data Set" from the UC Irvine Machine Learning repository. You can find it here:
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)

We'll use the file named german.data. Download this file to your local hard drive.

This dataset contains 1,000 rows, one for each past credit applicant, and each row holds 20 variables. These 20 variables are the dataset's features (the feature vector), which describe the characteristics of each credit applicant. An additional column in each row represents the applicant's calculated credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.
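To check that the downloaded file matches this description, a short script can count the feature columns and the rows in each risk class. The following is a minimal sketch, not part of the original walkthrough. It assumes the risk label is the last field on each line and that, as in the UCI description, 1 means a good (low) risk and 2 means a bad (high) risk. The function name `summarize` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple

# Assumption: the last field is the risk label, coded 1 = low risk, 2 = high risk.
LABEL_NAMES = {"1": "low risk", "2": "high risk"}


def summarize(lines: Iterable[str]) -> Tuple[int, Counter]:
    """Return (number of feature columns, number of rows per risk label)."""
    feature_counts = set()
    labels = Counter()
    for line in lines:
        fields = line.split()  # the raw file is whitespace-separated
        if not fields:         # skip blank lines
            continue
        feature_counts.add(len(fields) - 1)  # every field except the label
        labels[LABEL_NAMES.get(fields[-1], fields[-1])] += 1
    if len(feature_counts) != 1:
        raise ValueError(f"inconsistent column counts: {sorted(feature_counts)}")
    return feature_counts.pop(), labels
```

Passing an open handle to german.data, for example `summarize(open("german.data"))`, should report 20 feature columns, 700 low-risk rows, and 300 high-risk rows if the file matches the description above.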

The UCI website provides a description of the attributes of the feature vector for this data. These include financial information, credit history, employment status, and personal information. Each applicant has been given a binary rating that indicates whether they are a low or high credit risk.

We'll use this data to train a predictive analytics model. When we're done, our model should be able to accept a feature vector for a new individual and predict whether they are a low or high credit risk.

Here's one interesting twist. According to the dataset description, misclassifying a person as a low credit risk when they are actually a high credit risk is 5 times more costly to the financial institution than misclassifying a low credit risk as high. One simple way to account for this in our experiment is to replicate each high-risk entry so that it appears 5 times. Then, if the model misclassifies a high credit risk as low, it makes that error 5 times, once for each copy, which increases the cost of this error in the training results.
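The replication idea can be illustrated outside of Studio with a few lines of Python. This is only a sketch of the technique, not the method the walkthrough itself uses inside the experiment. It assumes each row is a list of strings whose last element is the risk label, that the label "2" marks a high credit risk (as in the UCI description), and that "5 times" means five copies in total. The name `replicate_high_risk` is hypothetical.

```python
from typing import Iterable, Iterator, List

HIGH_RISK_LABEL = "2"  # assumption: UCI codes a bad (high) risk as 2
COST_RATIO = 5         # cost ratio stated in the dataset description


def replicate_high_risk(
    rows: Iterable[List[str]], copies: int = COST_RATIO
) -> Iterator[List[str]]:
    """Yield each low-risk row once and each high-risk row `copies` times."""
    for row in rows:
        repeat = copies if row[-1] == HIGH_RISK_LABEL else 1
        for _ in range(repeat):
            yield list(row)  # emit a copy so callers can modify rows independently
```

With the class counts described earlier, this would turn 700 low-risk and 300 high-risk rows into 700 + 5 × 300 = 2,200 rows.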

Convert the dataset format

The original dataset uses a space-separated format. Machine Learning Studio works better with a comma-separated values (CSV) file, so we'll convert the dataset by replacing spaces with commas.

There are many ways to convert this data. One way is by using the following Windows PowerShell command:

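# Replace every space with a comma and write the result to german.csv
Get-Content german.data | ForEach-Object { $_ -replace " ", "," } | Set-Content german.csv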

Another way is by using the Unix sed command:

sed 's/ /,/g' german.data > german.csv  

In either case, we have created a comma-separated version of the data in a file named german.csv that we'll use in our experiment.
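If neither shell is available, the same conversion can be sketched in Python. This is an illustrative alternative rather than part of the original walkthrough. The function names are hypothetical, and the UTF-8 encoding is an assumption (the file is expected to contain only plain ASCII characters).

```python
from typing import Iterable, Iterator


def spaces_to_commas(line: str) -> str:
    """Replace every space with a comma, mirroring sed 's/ /,/g'."""
    return line.replace(" ", ",")


def convert_lines(lines: Iterable[str]) -> Iterator[str]:
    """Convert each space-separated line to a comma-separated line."""
    for line in lines:
        yield spaces_to_commas(line)


def convert_file(src: str, dst: str) -> None:
    """Read src, replace spaces with commas, and write the result to dst."""
    # newline="" leaves the original line endings untouched.
    with open(src, encoding="utf-8", newline="") as fin, \
            open(dst, "w", encoding="utf-8", newline="") as fout:
        fout.writelines(convert_lines(fin))
```

Calling `convert_file("german.data", "german.csv")` in the directory that holds the download produces the same german.csv as the commands above.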

Upload the dataset to Machine Learning Studio

Once the data has been converted to CSV format, we need to upload it into Machine Learning Studio.
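Before uploading, it can be worth confirming that the conversion produced a well-formed file. The file has no header row, so every line should be a data row with 21 fields (20 features plus the risk label). The following is a minimal sketch of such a check. The function name `find_bad_rows` and the constant are hypothetical, and the check is not something Studio requires.

```python
import csv
from typing import Iterable, List, Tuple

EXPECTED_COLUMNS = 21  # 20 features plus the risk label, per the dataset description


def find_bad_rows(
    lines: Iterable[str], expected: int = EXPECTED_COLUMNS
) -> List[Tuple[int, int]]:
    """Return (row number, field count) for every row whose width is not `expected`."""
    bad = []
    for number, row in enumerate(csv.reader(lines), start=1):
        if len(row) != expected:
            bad.append((number, len(row)))
    return bad
```

For example, `find_bad_rows(open("german.csv", newline=""))` returns an empty list when every row has the expected width; any entries in the result point to lines worth inspecting before the upload.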

  1. Open the Machine Learning Studio home page (https://studio.azureml.net).

  2. Click the menu icon in the upper-left corner of the window, click Azure Machine Learning, select Studio, and sign in.

  3. Click +NEW at the bottom of the window.

  4. Select DATASET.

  5. Select FROM LOCAL FILE.

    [Figure: Add a dataset from a local file]

  6. In the Upload a new dataset dialog, click Browse and find the german.csv file you created.

  7. Enter a name for the dataset. For this walkthrough, we'll call it "UCI German Credit Card Data".

  8. For data type, select Generic CSV File With no header (.nh.csv).

  9. Add a description if you’d like.

  10. Click the OK check mark.

    [Figure: Upload the dataset]

This uploads the data into a dataset module that we can use in an experiment.

You can manage datasets that you've uploaded to Studio by clicking the DATASETS tab on the left side of the Studio window.

[Figure: Manage datasets]

For more information about importing other types of data into an experiment, see Import your training data into Azure Machine Learning Studio.

Next: Create a new experiment