machine-learning-data-science-sample-data.md

title: Sample data in Azure blob containers, SQL Server, and Hive tables | Microsoft Docs
description: How to explore data stored in various Azure environments.
services: machine-learning
documentationcenter:
author: bradsev
manager: jhubbard
editor: cgronlun
ms.assetid: 80a9dfae-e3a6-4cfb-aecc-5701cfc7e39d
ms.service: machine-learning
ms.workload: data-services
ms.tgt_pltfrm: na
ms.devlang: na
ms.topic: article
ms.date: 12/19/2016
ms.author: fashah;garye;bradsev

Sample data in Azure blob containers, SQL Server, and Hive tables

This document links to topics that cover how to sample data stored in one of three different Azure locations:

  • Azure blob container data is sampled by downloading it programmatically and then sampling it with sample Python code.
  • SQL Server data is sampled using both SQL and the Python programming language.
  • Hive table data is sampled using Hive queries.
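
As a rough illustration of the Python-side sampling mentioned above, the following minimal sketch keeps a random fraction of the rows of a CSV file that has already been downloaded. It uses only the Python standard library. The function name `sample_csv_rows`, its parameters, and the demo data are hypothetical and are not taken from the linked topics, which may use different tools.

```python
import csv
import io
import random


def sample_csv_rows(csv_text, fraction, seed=0):
    """Return the header and roughly `fraction` of the data rows.

    Hypothetical helper for illustration; not part of any Azure SDK.
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)  # seeded so the sample is reproducible
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)
    # Bernoulli sampling: each row is kept independently
    # with probability `fraction`.
    kept = [row for row in reader if rng.random() < fraction]
    return header, kept


# Hypothetical stand-in for the text of a downloaded blob.
demo_text = "id,value\n" + "\n".join(f"{i},{i * i}" for i in range(1000))
header, rows = sample_csv_rows(demo_text, fraction=0.1, seed=42)
print(header, len(rows))
```

Because each row is kept independently, the sample size is only approximately `fraction` times the row count. Fixing the seed makes repeated runs return the same rows.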

The following menu links to the topics that describe how to sample data from each of these Azure storage environments.

[!INCLUDE cap-sample-data-selector]

This sampling task is a step in the Team Data Science Process (TDSP).

Why sample data?

If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. Down-sampling makes data understanding, exploration, and feature engineering easier. Its role in the Cortana Analytics Process is to enable fast prototyping of the data processing functions and machine learning models.
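
One way to keep a down-sampled dataset representative is to sample the same fraction from each group of records (stratified sampling). The sketch below is an assumption about how this could be done with the Python standard library; it is not a method described in the linked topics. The function `stratified_sample`, the `label` field, and the demo records are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(records, key, fraction, seed=0):
    """Sample about `fraction` of the records from every group.

    Hypothetical helper; `key` maps a record to its (sortable) group label.
    """
    rng = random.Random(seed)  # seeded so the sample is reproducible
    groups = defaultdict(list)
    for record in records:
        groups[key(record)].append(record)
    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one record per group so rare groups do not vanish.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Hypothetical data: 900 records labeled "a" and 100 labeled "b".
demo_records = [
    {"id": i, "label": "a" if i < 900 else "b"} for i in range(1000)
]
demo_sample = stratified_sample(
    demo_records, key=lambda r: r["label"], fraction=0.1
)
```

In this demo the sample keeps the original 9:1 ratio between the two labels, which a plain random sample only approximates.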