---
title: Explore data in Azure blob storage with Pandas | Microsoft Docs
description: How to explore data that is stored in Azure blob container using Pandas.
services: machine-learning,storage
documentationcenter: ''
author: bradsev
manager: jhubbard
editor: cgronlun

ms.assetid: feaa9e54-01e0-48c8-a917-1eba0f9d9ec7
ms.service: machine-learning
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 12/09/2016
ms.author: bradsev
---

# Explore data in Azure blob storage with Pandas

This document covers how to explore data that is stored in an Azure blob container by using the Pandas Python package.

The following menu links to topics that describe how to use tools to explore data from various storage environments. This task is a step in the Data Science Process.

[!INCLUDE cap-explore-data-selector]

## Prerequisites

This article assumes that you have an Azure storage account and that your data is stored in an Azure blob container in that account.

## Load the data into a Pandas DataFrame

To explore and manipulate a dataset, first download it from the blob source to a local file, and then load that file into a Pandas DataFrame. Here are the steps to follow for this procedure:

  1. Download the data from the Azure blob with the following Python code sample that uses the Blob service. Replace the variables in the following code with your specific values:

     from azure.storage.blob import BlobService
     import time

     STORAGEACCOUNTNAME = "<storage_account_name>"
     STORAGEACCOUNTKEY = "<storage_account_key>"
     LOCALFILENAME = "<local_file_name>"
     CONTAINERNAME = "<container_name>"
     BLOBNAME = "<blob_name>"

     # Download the blob to a local file and time the transfer
     t1 = time.time()
     blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
     blob_service.get_blob_to_path(CONTAINERNAME, BLOBNAME, LOCALFILENAME)
     t2 = time.time()
     print(("It takes %s seconds to download " + BLOBNAME) % (t2 - t1))
    
  2. Read the data into a Pandas DataFrame from the downloaded file.

     import pandas as pd

     # LOCALFILENAME is the path of the file downloaded in the previous step
     dataframe_blobdata = pd.read_csv(LOCALFILENAME)
    

Now you are ready to explore the data and generate features on this dataset.
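
As a minimal sketch of what feature generation can look like, the snippet below derives a log-transformed copy and a binned version of a numeric column. The column name `<column_name>` is a placeholder for a column in your own data, not something defined earlier in this article:

    import numpy as np
    import pandas as pd

    # Log-transform a skewed numeric column; add 1 to avoid taking log(0)
    dataframe_blobdata['<column_name>_log'] = np.log(dataframe_blobdata['<column_name>'] + 1)

    # Bucket the same column into 5 equal-width bins, encoded as integer bin indices
    dataframe_blobdata['<column_name>_binned'] = pd.cut(
        dataframe_blobdata['<column_name>'], bins=5, labels=False)

Log transforms and binning are common starting points for skewed numeric columns; adjust the number of bins to fit your data.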

## Examples of data exploration using Pandas

Here are a few examples of ways to explore data using Pandas:

  1. Inspect the number of rows and columns:

     print('The size of the data is: %d rows and %d columns' % dataframe_blobdata.shape)
    
  2. Inspect the first or last few rows of the dataset:

     dataframe_blobdata.head(10)
    
     dataframe_blobdata.tail(10)
    
  3. Check the data type that each column was imported as, using the following sample code:

     for col in dataframe_blobdata.columns:
         print(dataframe_blobdata[col].name, ':\t', dataframe_blobdata[col].dtype)
    
  4. Check the basic stats for the columns in the dataset as follows:

     dataframe_blobdata.describe()
    
  5. Look at the number of entries for each value of a column as follows:

     dataframe_blobdata['<column_name>'].value_counts()
    
  6. Count missing values versus the actual number of entries in each column using the following sample code:

     miss_num = dataframe_blobdata.shape[0] - dataframe_blobdata.count()
     print(miss_num)
    
  7. If you have missing values for a specific column in the data, you can drop them as follows:

     dataframe_blobdata_noNA = dataframe_blobdata.dropna()
     dataframe_blobdata_noNA.shape

     Instead of dropping rows, you can replace missing values with the mode of the column:

     dataframe_blobdata_mode = dataframe_blobdata.fillna({'<column_name>': dataframe_blobdata['<column_name>'].mode()[0]})

  8. Create a histogram plot using a variable number of bins to plot the distribution of a variable:

     import numpy as np

     # Bar chart of value counts, then a histogram of the log-transformed column
     dataframe_blobdata['<column_name>'].value_counts().plot(kind='bar')

     np.log(dataframe_blobdata['<column_name>'] + 1).hist(bins=50)
    
  9. Look at correlations between variables using a scatter plot or using the built-in correlation function (a sketch that computes the full correlation matrix follows this list):

     import matplotlib.pyplot as plt

     # Relationship between column_a and column_b using a scatter plot
     plt.scatter(dataframe_blobdata['<column_a>'], dataframe_blobdata['<column_b>'])
     plt.show()

     # Correlation between column_a and column_b
     dataframe_blobdata[['<column_a>', '<column_b>']].corr()
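
As a follow-up to the last step, here is a minimal sketch that computes the correlation matrix across all numeric columns at once; the `select_dtypes` filter simply excludes non-numeric columns before calling `corr()`:

    # Correlation matrix for all numeric columns in the DataFrame
    corr_matrix = dataframe_blobdata.select_dtypes(include='number').corr()
    print(corr_matrix)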