---
title: Sample data in Azure blob storage | Microsoft Docs
description: Sample data in Azure Blob Storage
services: machine-learning
author: marktab
manager: cgronlun
editor: cgronlun
ms.service: machine-learning
ms.component: team-data-science-process
ms.topic: article
ms.date: 11/13/2017
ms.author: tdsp
ms.custom: "(previous author=deguhath, ms.author=deguhath)"
---

# Sample data in Azure blob storage

This article covers sampling data stored in Azure Blob Storage by downloading it programmatically and then down-sampling it with procedures written in Python.

Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample it to a smaller but representative, more manageable size. Down-sampling facilitates data understanding, exploration, and feature engineering. Its role in the Cortana Analytics Process is to enable fast prototyping of data processing functions and machine learning models.

This sampling task is a step in the Team Data Science Process (TDSP).

## Download and down-sample data

  1. Download the data from Azure Blob Storage using the blob service, as in the following sample Python code (a sketch that uses the newer `azure-storage-blob` v12 SDK appears at the end of this article):

     import time
     from azure.storage.blob import BlobService

     STORAGEACCOUNTNAME = "<storage_account_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     LOCALFILENAME = "<local_file_name>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     # Download the blob to a local file and time the operation
     t1 = time.time()
     blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
     t2 = time.time()
     print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
    
  2. Read the data into a Pandas DataFrame from the file downloaded above.

     import pandas as pd

     # Read directly from the file on disk
     dataframe_blobdata = pd.read_csv(LOCALFILENAME)
    
  3. Down-sample the data using NumPy's `random.choice`, as follows (an equivalent one-liner that uses pandas' built-in sampling is shown after this list):

     import numpy as np

     # A 1 percent sample, drawn without replacement
     sample_ratio = 0.01
     sample_size = int(np.round(dataframe_blobdata.shape[0] * sample_ratio))
     sample_rows = np.random.choice(dataframe_blobdata.index.values, sample_size, replace=False)
     dataframe_blobdata_sample = dataframe_blobdata.loc[sample_rows]
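
As an alternative to the NumPy approach above, pandas provides a built-in `DataFrame.sample` method that draws the same kind of sample in one call; a minimal sketch:

    # A reproducible 1 percent sample without replacement
    dataframe_blobdata_sample = dataframe_blobdata.sample(frac=0.01, random_state=42)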
    

Now you can work with the data frame containing the 1 percent sample for further exploration and feature generation, as sketched below.
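
For example, a first pass over the sample might look like the following (a minimal sketch; `text_column` is a hypothetical column name, not one from the original data, so substitute a column from your own dataset):

    # Shape and summary statistics of the sampled data
    print(dataframe_blobdata_sample.shape)
    print(dataframe_blobdata_sample.describe())

    # Example derived feature: the length of a string column
    # ('text_column' is a hypothetical placeholder name)
    if 'text_column' in dataframe_blobdata_sample.columns:
        dataframe_blobdata_sample['text_length'] = dataframe_blobdata_sample['text_column'].str.len()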

## Upload data and read it into Azure Machine Learning

You can use the following sample code to write the down-sampled data to a local file, upload it to an Azure blob, and then read it into Azure Machine Learning:

  1. Write the data frame to a local file:

     import os

     dataframe_blobdata_sample.to_csv(os.path.join(os.getcwd(), LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
    
  2. Upload the local file to an Azure blob using the following sample code (again, a sketch that uses the newer `azure-storage-blob` v12 SDK appears at the end of this article):

     import os
     from azure.storage.blob import BlobService

     STORAGEACCOUNTNAME = "<storage_account_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     LOCALFILENAME = "<local_file_name>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     output_blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     localfileprocessed = os.path.join(os.getcwd(), LOCALFILENAME)  # assuming the file is in the current working directory

     try:
         # Perform the upload
         output_blob_service.put_block_blob_from_path(CONTAINERNAME, BLOBNAME, localfileprocessed)
     except Exception:
         print("Something went wrong with uploading to the blob: " + BLOBNAME)
    
  3. Read the data from the Azure blob using the Azure Machine Learning Import Data module, as shown in the image below:

[Image: the Import Data module configured to read from an Azure blob]
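
The `BlobService` class used in this article comes from the legacy `azure-storage` Python package. If you work with the current `azure-storage-blob` (v12) SDK instead, a minimal equivalent sketch for the download and upload steps, assuming the same placeholder values as above, would be:

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url="https://" + STORAGEACCOUNTNAME + ".blob.core.windows.net",
        credential=STORAGEACCOUNTKEY)
    blob_client = service.get_blob_client(container=CONTAINERNAME, blob=BLOBNAME)

    # Download the blob to a local file
    with open(LOCALFILENAME, "wb") as f:
        f.write(blob_client.download_blob().readall())

    # Upload a local file to the blob, overwriting it if it already exists
    with open(LOCALFILENAME, "rb") as f:
        blob_client.upload_blob(f, overwrite=True)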