---
title: Sample data in Azure blob storage | Microsoft Docs
description: Sample data in Azure Blob Storage
services: machine-learning
author: marktab
manager: cgronlun
editor: cgronlun
ms.service: machine-learning
ms.component: team-data-science-process
ms.topic: article
ms.date: 11/13/2017
ms.author: tdsp
ms.custom: "(previous author=deguhath, ms.author=deguhath)"
---

# Sample data in Azure blob storage

This article covers sampling data stored in Azure Blob Storage by downloading it programmatically and then down-sampling it with procedures written in Python.

Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample it to a smaller but representative, more manageable size. Down-sampling facilitates data understanding, exploration, and feature engineering. Its role in the Cortana Analytics Process is to enable fast prototyping of data processing functions and machine learning models.

This sampling task is a step in the Team Data Science Process (TDSP).

## Download and down-sample data

  1. Download the data from Azure Blob Storage using the blob service, as in the following sample Python code (a sketch that uses the newer `azure-storage-blob` v12 SDK appears at the end of this article):

     import time
     from azure.storage.blob import BlobService

     STORAGEACCOUNTNAME = "<storage_account_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     LOCALFILENAME = "<local_file_name>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     # Download the blob to a local file and time the operation
     t1 = time.time()
     blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
     t2 = time.time()
     print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
    
  2. Read the data into a Pandas DataFrame from the file downloaded above.

     import pandas as pd

     # Read directly from the file on disk
     dataframe_blobdata = pd.read_csv(LOCALFILENAME)
    
  3. Down-sample the data using NumPy's `random.choice`, as follows (an equivalent one-liner that uses pandas' built-in sampling is shown after this list):

     import numpy as np

     # A 1 percent sample, drawn without replacement
     sample_ratio = 0.01
     sample_size = int(np.round(dataframe_blobdata.shape[0] * sample_ratio))
     sample_rows = np.random.choice(dataframe_blobdata.index.values, sample_size, replace=False)
     dataframe_blobdata_sample = dataframe_blobdata.loc[sample_rows]
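
As an alternative to the NumPy approach above, pandas provides a built-in `DataFrame.sample` method that draws the same kind of sample in one call; a minimal sketch:

    # A reproducible 1 percent sample without replacement
    dataframe_blobdata_sample = dataframe_blobdata.sample(frac=0.01, random_state=42)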
    

Now you can work with the data frame containing the 1 percent sample for further exploration and feature generation, as sketched below.
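
For example, a first pass over the sample might look like the following (a minimal sketch; `text_column` is a hypothetical column name, not one from the original data, so substitute a column from your own dataset):

    # Shape and summary statistics of the sampled data
    print(dataframe_blobdata_sample.shape)
    print(dataframe_blobdata_sample.describe())

    # Example derived feature: the length of a string column
    # ('text_column' is a hypothetical placeholder name)
    if 'text_column' in dataframe_blobdata_sample.columns:
        dataframe_blobdata_sample['text_length'] = dataframe_blobdata_sample['text_column'].str.len()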

## Upload data and read it into Azure Machine Learning

You can use the following sample code to write the down-sampled data to a local file, upload it to an Azure blob, and then read it into Azure Machine Learning:

  1. Write the data frame to a local file:

     import os

     dataframe_blobdata_sample.to_csv(os.path.join(os.getcwd(), LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
    
  2. Upload the local file to an Azure blob using the following sample code (again, a sketch that uses the newer `azure-storage-blob` v12 SDK appears at the end of this article):

     import os
     from azure.storage.blob import BlobService

     STORAGEACCOUNTNAME = "<storage_account_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     LOCALFILENAME = "<local_file_name>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     output_blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     localfileprocessed = os.path.join(os.getcwd(), LOCALFILENAME)  # assuming the file is in the current working directory

     try:
         # Perform the upload
         output_blob_service.put_block_blob_from_path(CONTAINERNAME, BLOBNAME, localfileprocessed)
     except Exception:
         print("Something went wrong with uploading to the blob: " + BLOBNAME)
    
  3. Read the data from the Azure blob using the Azure Machine Learning Import Data module, as shown in the image below:

[Image: the Import Data module configured to read from an Azure blob]
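
The `BlobService` class used in this article comes from the legacy `azure-storage` Python package. If you work with the current `azure-storage-blob` (v12) SDK instead, a minimal equivalent sketch for the download and upload steps, assuming the same placeholder values as above, would be:

    from azure.storage.blob import BlobServiceClient

    service = BlobServiceClient(
        account_url="https://" + STORAGEACCOUNTNAME + ".blob.core.windows.net",
        credential=STORAGEACCOUNTKEY)
    blob_client = service.get_blob_client(container=CONTAINERNAME, blob=BLOBNAME)

    # Download the blob to a local file
    with open(LOCALFILENAME, "wb") as f:
        f.write(blob_client.download_blob().readall())

    # Upload a local file to the blob, overwriting it if it already exists
    with open(LOCALFILENAME, "rb") as f:
        blob_client.upload_blob(f, overwrite=True)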