---
title: Sample Data in SQL Server on Azure | Microsoft Docs
description: Sample Data in SQL Server on Azure
services: machine-learning
documentationcenter: ''
author: deguhath
manager: cgronlun
editor: cgronlun
ms.assetid: 33c030d4-5cca-4cc9-99d7-2bd13a3926af
ms.service: machine-learning
ms.component: team-data-science-process
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 11/13/2017
ms.author: deguhath
---
This article shows how to sample data stored in SQL Server on Azure using either SQL or the Python programming language. It also shows how to move sampled data into Azure Machine Learning by saving it to a file, uploading it to an Azure blob, and then reading it into Azure Machine Learning Studio.
The Python sampling uses the pyodbc ODBC library to connect to SQL Server on Azure and the Pandas library to do the sampling.
Note
The sample SQL code in this document assumes that the data is in SQL Server on Azure. If it is not, see the Move data to SQL Server on Azure article for instructions on how to move your data there.
Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. This facilitates data understanding, exploration, and feature engineering. Its role in the Team Data Science Process (TDSP) is to enable fast prototyping of the data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
This section describes several methods using SQL to perform simple random sampling against the data in the database. Choose a method based on your data size and its distribution.
The following two items show how to use newid in SQL Server to perform the sampling. The method you choose depends on how random you want the sample to be (<primary_key> in the following sample code is assumed to be an auto-generated primary key).
- Less strict random sample:

```sql
select * from <table_name> where <primary_key> in
(select top 10 percent <primary_key> from <table_name> order by newid())
```

- More random sample:

```sql
SELECT * FROM <table_name>
WHERE 0.1 >= CAST(CHECKSUM(NEWID(), <primary_key>) & 0x7fffffff AS float) / CAST(0x7fffffff AS int)
```
Tablesample can also be used to sample the data. This may be a better approach if your data size is large (assuming that data on different pages is not correlated) and you need the query to complete in a reasonable time.
```sql
SELECT *
FROM <table_name>
TABLESAMPLE (10 PERCENT)
```
Note
You can explore and generate features from this sampled data by storing it in a new table.
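As an illustration, here is a minimal sketch of persisting the sample into a new table from Python, assuming a pyodbc connection like the one set up in the Python section below; the table name sampled_table is a placeholder, not part of the original article:

```python
import pyodbc

# Assumed connection string; replace the placeholders with your server, database, and credentials
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

# SELECT ... INTO materializes the 10 percent sample in a new table (sampled_table is a placeholder name)
cursor = conn.cursor()
cursor.execute('SELECT * INTO sampled_table FROM <table_name> TABLESAMPLE (10 PERCENT)')
conn.commit()
```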
You can directly use the sample queries above in the Azure Machine Learning Import Data module to down-sample the data on the fly and bring it into an Azure Machine Learning experiment. A screenshot of using the Import Data module to read the sampled data is shown here:
This section demonstrates using the pyodbc library to establish an ODBC connection to a SQL Server database in Python. The database connection string is as follows (replace servername, dbname, username, and password with your configuration):
```python
# Set up the SQL Azure connection
import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')
```
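Optionally, you can run a quick sanity check on the connection before sampling. This is a minimal sketch, and the SELECT @@VERSION query is only an example:

```python
# Run a lightweight query to confirm the connection works
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")
print(cursor.fetchone()[0])
```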
The Pandas library in Python provides a rich set of data structures and data analysis tools for data manipulation. The following code reads a 0.1% sample of the data from a table in Azure SQL Database into a Pandas data frame:
```python
import pandas as pd

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select column1, column2... from <table_name> tablesample (0.1 percent)''', conn)
```
You can now work with the sampled data in the Pandas data frame.
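For example, here is a minimal sketch of inspecting the sample before moving on; the available columns are whatever your query returned:

```python
# Quick checks on the sampled data frame
print(data_frame.shape)       # number of sampled rows and columns
print(data_frame.head())      # first few rows
print(data_frame.describe())  # summary statistics for numeric columns
```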
You can use the following sample code to save the down-sampled data to a file and upload it to an Azure blob. The data in the blob can be directly read into an Azure Machine Learning Experiment using the Import Data module. The steps are as follows:
- Write the pandas data frame to a local file:

```python
import os

# LOCALFILENAME is the name of the file to write, for example 'sampled_data.tsv'
data_frame.to_csv(os.path.join(os.getcwd(), LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
```
- Upload the local file to an Azure blob (a sketch for verifying the upload follows this list):

```python
from azure.storage import BlobService
import os

# Replace the placeholder values with your own storage account, file, and container details
STORAGEACCOUNTNAME = "<storage_account_name>"
STORAGEACCOUNTKEY = "<storage_account_key>"
LOCALFILENAME = "<local_file_name>"
CONTAINERNAME = "<container_name>"
BLOBNAME = "<blob_name>"

output_blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(), LOCALFILENAME)  # assuming the file is in the current working directory
try:
    # Perform the upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME, BLOBNAME, localfileprocessed)
except:
    print("Something went wrong with uploading blob: " + BLOBNAME)
```
- Read data from the Azure blob using the Azure Machine Learning Import Data module, as shown in the following screenshot:
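Before switching to the Import Data module, you may want to confirm that the upload succeeded. The following is a small sketch under the assumption that the same legacy BlobService client and variable names from the upload step above are still in scope:

```python
# List blobs in the container and check that the uploaded blob is present
blobs = output_blob_service.list_blobs(CONTAINERNAME)
if any(blob.name == BLOBNAME for blob in blobs):
    print("Blob " + BLOBNAME + " uploaded successfully")
else:
    print("Blob " + BLOBNAME + " not found in container " + CONTAINERNAME)
```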
To walk through an example of the Team Data Science Process using a public dataset, see Team Data Science Process in Action: using SQL Server.