title | description | services | documentationcenter | author | manager | editor | ms.assetid | ms.service | ms.workload | ms.tgt_pltfrm | ms.devlang | ms.topic | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Step 2: Upload data into a Machine Learning experiment \| Microsoft Docs | Step 2 of the Develop a predictive solution walkthrough: Upload stored public data into Azure Machine Learning Studio. | machine-learning | | garyericson | jhubbard | cgronlun | 9f4bc52e-9919-4dea-90ea-5cf7cc506d85 | machine-learning | tbd | na | na | article | 12/16/2016 | garye |
This is the second step of the walkthrough, Develop a predictive analytics solution in Azure Machine Learning:
- Create a Machine Learning workspace
- Upload existing data
- Create a new experiment
- Train and evaluate the models
- Deploy the Web service
- Access the Web service
To develop a predictive model for credit risk, we need data that we can use to train and then test the model. For this walkthrough, we'll use the "UCI Statlog (German Credit Data) Data Set" from the UC Irvine Machine Learning repository. You can find it here:
http://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
We'll use the file named german.data. Download this file to your local hard drive.
This dataset contains 1,000 rows, one for each past applicant for credit, and each row holds 20 variables. These 20 variables are the dataset's features (the feature vector), which provide identifying characteristics for each credit applicant. An additional column in each row represents the applicant's calculated credit risk, with 700 applicants identified as a low credit risk and 300 as a high risk.
The UCI website provides a description of the attributes of the feature vector for this data. This includes financial information, credit history, employment status, and personal information. For each applicant, a binary rating has been given indicating whether they are a low or high credit risk.
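To make the layout described above concrete, here is a minimal sketch that checks each row for 20 features plus one label and counts the applicants in each risk class. The function names are hypothetical, and the label coding (1 for low risk, 2 for high risk in the last column) is an assumption taken from the UCI description, so verify it against the copy you download.

```python
from collections import Counter

# Assumption: the last field of each row is the risk label,
# coded 1 for low risk and 2 for high risk (per the UCI description).
LABELS = {"1": "low risk", "2": "high risk"}
NUM_FEATURES = 20


def summarize_rows(lines):
    """Count rows per risk label, checking each row has 20 features plus a label."""
    counts = Counter()
    for line_number, line in enumerate(lines, start=1):
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != NUM_FEATURES + 1:
            raise ValueError(
                f"line {line_number}: expected {NUM_FEATURES + 1} fields, "
                f"got {len(fields)}"
            )
        counts[LABELS.get(fields[-1], "unknown")] += 1
    return counts


def summarize_file(path="german.data"):
    """Summarize the space-separated file at `path`."""
    with open(path, encoding="utf-8") as handle:
        return summarize_rows(handle)
```

If the coding assumption holds, calling `summarize_file()` on the downloaded german.data should report 700 low-risk and 300 high-risk rows, matching the counts given above.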
We'll use this data to train a predictive analytics model. When we're done, our model should be able to accept a feature vector for a new individual and predict whether that person is a low or high credit risk.
Here's one interesting twist. The description of the dataset explains that misclassifying a person as a low credit risk when they are actually a high credit risk is 5 times more costly to the financial institution than misclassifying a low credit risk as high. One simple way to take this into account in our experiment is to replicate each entry that represents a high credit risk so that it appears 5 times. Then, if the model misclassifies that high credit risk as low, it makes the same error 5 times, once for each copy, which increases the cost of this error in the training results.
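The replication idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not necessarily how the walkthrough implements it: `replicate_high_risk` is a hypothetical helper, each row is assumed to be a list of fields with the label last, and the label value 2 for high risk is an assumption based on the UCI description.

```python
HIGH_RISK_LABEL = "2"  # assumption: high risk is coded as 2 in the last column
COST_RATIO = 5         # from the dataset description: the error is 5 times as costly


def replicate_high_risk(rows, label=HIGH_RISK_LABEL, copies=COST_RATIO):
    """Return a new list in which each high-risk row appears `copies` times."""
    weighted = []
    for row in rows:
        # Low-risk rows are kept once; high-risk rows are repeated.
        repeat = copies if row[-1] == label else 1
        weighted.extend([row] * repeat)
    return weighted
```

Applied to the whole dataset, this would turn 700 low-risk and 300 high-risk rows into 700 + 300 × 5 = 2,200 rows. Replication like this is normally applied to the data used for training, so that errors on high-risk applicants weigh more heavily there.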
The original dataset uses a space-separated format. Machine Learning Studio works better with a comma-separated value (CSV) file, so we'll convert the dataset by replacing spaces with commas.
There are many ways to convert this data. One way is by using the following Windows PowerShell command:
cat german.data | %{$_ -replace " ",","} | sc german.csv
Another way is by using the Unix sed command:
sed 's/ /,/g' german.data > german.csv
In either case, we have created a comma-separated version of the data in a file named german.csv that we'll use in our experiment.
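If neither PowerShell nor sed is available, the same conversion can be sketched in Python. This is a minimal example with a hypothetical function name and default file names taken from this walkthrough. Unlike the one-liners above, it splits on any run of whitespace, so stray double or trailing spaces do not produce empty fields.

```python
import csv


def convert_to_csv(source_path="german.data", target_path="german.csv"):
    """Rewrite a space-separated file as headerless CSV and return the row count."""
    row_count = 0
    with open(source_path, encoding="utf-8") as source, \
         open(target_path, "w", newline="", encoding="utf-8") as target:
        # Use "\n" line endings to mirror the output of the sed command.
        writer = csv.writer(target, lineterminator="\n")
        for line in source:
            fields = line.split()  # splits on any run of whitespace
            if fields:             # ignore blank lines
                writer.writerow(fields)
                row_count += 1
    return row_count
```

For the file described in this walkthrough, the returned row count should be 1,000; a different value suggests the download is incomplete or the file contains unexpected lines.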
Once the data has been converted to CSV format, we need to upload it into Machine Learning Studio.
- Open the Machine Learning Studio home page (https://studio.azureml.net).
- Click the menu in the upper-left corner of the window, click Azure Machine Learning, select Studio, and sign in.
- Click +NEW at the bottom of the window.
- Select DATASET.
- Select FROM LOCAL FILE.
- In the Upload a new dataset dialog, click Browse and find the german.csv file you created.
- Enter a name for the dataset. For this walkthrough, we'll call it "UCI German Credit Card Data".
- For data type, select Generic CSV File With no header (.nh.csv).
- Add a description if you’d like.
- Click the OK check mark.
This uploads the data into a dataset module that we can use in an experiment.
You can manage datasets that you've uploaded to Studio by clicking the DATASETS tab on the left side of the Studio window.
For more information about importing other types of data into an experiment, see Import your training data into Azure Machine Learning Studio.
Next: Create a new experiment