title | description | services | documentationcenter | author | manager | editor | ms.assetid | ms.service | ms.workload | ms.tgt_pltfrm | ms.devlang | ms.topic | ms.date | ms.author |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Sample data in Azure HDInsight Hive tables | Microsoft Docs |
Down sampling data in Azure HDInsight (Hadopop) Hive Tables |
machine-learning,hdinsight |
bradsev |
jhubbard |
cgronlun |
f31e8d01-0fd4-4a10-b1a7-35de3c327521 |
machine-learning |
data-services |
na |
na |
article |
12/19/2016 |
hangzh;bradsev |
In this article, we describe how to down-sample data stored in Azure HDInsight Hive tables using Hive queries. We cover three popularly used sampling methods:
- Uniform random sampling
- Random sampling by groups
- Stratified sampling
The following menu links to topics that describe how to sample data from various storage environments.
[!INCLUDE cap-sample-data-selector]
Why sample your data? If the dataset you plan to analyze is large, it's usually a good idea to down-sample the data to reduce it to a smaller but representative and more manageable size. This facilitates data understanding, exploration, and feature engineering. Its role in the Team Data Science Process is to enable fast prototyping of the data processing functions and machine learning models.
This sampling task is a step in the Team Data Science Process (TDSP).
Hive queries can be submitted from the Hadoop Command Line console on the head node of the Hadoop cluster. To do this, log into the head node of the Hadoop cluster, open the Hadoop Command Line console, and submit the Hive queries from there. For instructions on submitting Hive queries in the Hadoop Command Line console, see How to Submit Hive Queries.
Uniform random sampling means that each row in the data set has an equal chance of being sampled. This can be implemented by adding an extra field rand() to the data set in the inner "select" query, and in the outer "select" query that condition on that random field.
Here is an example query:
SET sampleRate=<sample rate, 0-1>;
select
field1, field2, …, fieldN
from
(
select
field1, field2, …, fieldN, rand() as samplekey
from <hive table name>
)a
where samplekey<='${hiveconf:sampleRate}'
Here, <sample rate, 0-1>
specifies the proportion of records that the users want to sample.
When sampling categorical data, you may want to either include or exclude all of the instances of some particular value of a categorical variable. This is what is meant by "sampling by group". For example, if you have a categorical variable "State", which has values NY, MA, CA, NJ, PA, etc, you want records of the same state be always together, whether they are sampled or not.
Here is an example query that samples by group:
SET sampleRate=<sample rate, 0-1>;
select
b.field1, b.field2, …, b.catfield, …, b.fieldN
from
(
select
field1, field2, …, catfield, …, fieldN
from <table name>
)b
join
(
select
catfield
from
(
select
catfield, rand() as samplekey
from <table name>
group by catfield
)a
where samplekey<='${hiveconf:sampleRate}'
)c
on b.catfield=c.catfield
Random sampling is stratified with respect to a categorical variable when the samples obtained have values of that categorical that are in the same ratio as in the parent population from which the samples were obtained. Using the same example as above, suppose your data has sub-populations by states, say NJ has 100 observations, NY has 60 observations, and WA has 300 observations. If you specify the rate of stratified sampling to be 0.5, then the sample obtained should have approximately 50, 30, and 150 observations of NJ, NY, and WA respectively.
Here is an example query:
SET sampleRate=<sample rate, 0-1>;
select
field1, field2, field3, ..., fieldN, state
from
(
select
field1, field2, field3, ..., fieldN, state,
count(*) over (partition by state) as state_cnt,
rank() over (partition by state order by rand()) as state_rank
from <table name>
) a
where state_rank <= state_cnt*'${hiveconf:sampleRate}'
For information on more advanced sampling methods that are available in Hive, see LanguageManual Sampling.