---
title: Sample Data in SQL Server on Azure | Microsoft Docs
description: Sample Data in SQL Server on Azure
services: machine-learning
documentationcenter: ''
author: deguhath
manager: cgronlun
editor: cgronlun
ms.assetid: 33c030d4-5cca-4cc9-99d7-2bd13a3926af
ms.service: machine-learning
ms.component: team-data-science-process
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 11/13/2017
ms.author: deguhath
---
This article shows how to sample data stored in SQL Server on Azure using either SQL or the Python programming language. It also shows how to move sampled data into Azure Machine Learning by saving it to a file, uploading it to an Azure blob, and then reading it into Azure Machine Learning Studio.
The Python sampling uses the pyodbc ODBC library to connect to SQL Server on Azure and the Pandas library to do the sampling.
Note
The sample SQL code in this document assumes that the data is in SQL Server on Azure. If it is not, see the Move data to SQL Server on Azure article for instructions on how to move your data there.
Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. This facilitates data understanding, exploration, and feature engineering. Its role in the Team Data Science Process (TDSP) is to enable fast prototyping of the data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
This section describes several methods using SQL to perform simple random sampling against the data in the database. Choose a method based on your data size and its distribution.
The following two items show how to use newid in SQL Server to perform the sampling. The method you choose depends on how random you want the sample to be (<primary_key> in the following sample code is assumed to be an auto-generated primary key).
- Less strict random sample:

```sql
select * from <table_name> where <primary_key> in
(select top 10 percent <primary_key> from <table_name> order by newid())
```

- More random sample:

```sql
SELECT * FROM <table_name>
WHERE 0.1 >= CAST(CHECKSUM(NEWID(), <primary_key>) & 0x7fffffff AS float) / CAST(0x7fffffff AS int)
```
Tablesample can also be used to sample the data. This may be a better approach if your data size is large (assuming that data on different pages is not correlated) and you need the query to complete in a reasonable time.
```sql
SELECT *
FROM <table_name>
TABLESAMPLE (10 PERCENT)
```
Note
You can explore and generate features from this sampled data by storing it in a new table.
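As an illustration, here is a minimal sketch of persisting the sample into a new table from Python, assuming a pyodbc connection like the one set up in the Python section below; the table name sampled_table is a placeholder, not part of the original article:

```python
import pyodbc

# Assumed connection string; replace the placeholders with your server, database, and credentials
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')

# SELECT ... INTO materializes the 10 percent sample in a new table (sampled_table is a placeholder name)
cursor = conn.cursor()
cursor.execute('SELECT * INTO sampled_table FROM <table_name> TABLESAMPLE (10 PERCENT)')
conn.commit()
```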
You can directly use the sample queries above in the Azure Machine Learning Import Data module to down-sample the data on the fly and bring it into an Azure Machine Learning experiment. A screenshot of using the Import Data module to read the sampled data is shown here:
This section demonstrates using the pyodbc library to establish an ODBC connection to a SQL Server database in Python. The database connection string is as follows (replace servername, dbname, username, and password with your configuration):
```python
# Set up the SQL Azure connection
import pyodbc
conn = pyodbc.connect('DRIVER={SQL Server};SERVER=<servername>;DATABASE=<dbname>;UID=<username>;PWD=<password>')
```
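Optionally, you can run a quick sanity check on the connection before sampling. This is a minimal sketch, and the SELECT @@VERSION query is only an example:

```python
# Run a lightweight query to confirm the connection works
cursor = conn.cursor()
cursor.execute("SELECT @@VERSION")
print(cursor.fetchone()[0])
```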
The Pandas library in Python provides a rich set of data structures and data analysis tools for data manipulation. The following code reads a 0.1% sample of the data from a table in Azure SQL Database into a Pandas data frame:
```python
import pandas as pd

# Query database and load the returned results in pandas data frame
data_frame = pd.read_sql('''select column1, column2... from <table_name> tablesample (0.1 percent)''', conn)
```
You can now work with the sampled data in the Pandas data frame.
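For example, here is a minimal sketch of inspecting the sample before moving on; the available columns are whatever your query returned:

```python
# Quick checks on the sampled data frame
print(data_frame.shape)       # number of sampled rows and columns
print(data_frame.head())      # first few rows
print(data_frame.describe())  # summary statistics for numeric columns
```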
You can use the following sample code to save the down-sampled data to a file and upload it to an Azure blob. The data in the blob can be directly read into an Azure Machine Learning Experiment using the Import Data module. The steps are as follows:
- Write the pandas data frame to a local file:

```python
import os

# LOCALFILENAME is the name of the file to write, for example 'sampled_data.tsv'
data_frame.to_csv(os.path.join(os.getcwd(), LOCALFILENAME), sep='\t', encoding='utf-8', index=False)
```
- Upload the local file to an Azure blob (a sketch for verifying the upload follows this list):

```python
from azure.storage import BlobService
import os

# Replace the placeholder values with your own storage account, file, and container details
STORAGEACCOUNTNAME = "<storage_account_name>"
STORAGEACCOUNTKEY = "<storage_account_key>"
LOCALFILENAME = "<local_file_name>"
CONTAINERNAME = "<container_name>"
BLOBNAME = "<blob_name>"

output_blob_service = BlobService(account_name=STORAGEACCOUNTNAME, account_key=STORAGEACCOUNTKEY)
localfileprocessed = os.path.join(os.getcwd(), LOCALFILENAME)  # assuming the file is in the current working directory
try:
    # Perform the upload
    output_blob_service.put_block_blob_from_path(CONTAINERNAME, BLOBNAME, localfileprocessed)
except:
    print("Something went wrong with uploading blob: " + BLOBNAME)
```
- Read data from the Azure blob using the Azure Machine Learning Import Data module, as shown in the following screenshot:
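Before switching to the Import Data module, you may want to confirm that the upload succeeded. The following is a small sketch under the assumption that the same legacy BlobService client and variable names from the upload step above are still in scope:

```python
# List blobs in the container and check that the uploaded blob is present
blobs = output_blob_service.list_blobs(CONTAINERNAME)
if any(blob.name == BLOBNAME for blob in blobs):
    print("Blob " + BLOBNAME + " uploaded successfully")
else:
    print("Blob " + BLOBNAME + " not found in container " + CONTAINERNAME)
```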
To walk through an example of the Team Data Science Process using a public dataset, see Team Data Science Process in Action: using SQL Server.