Skip to content

Commit

Permalink
[SPARK-6040][SQL] Fix the percent bug in tablesample
Browse files Browse the repository at this point in the history
HiveQL expression like `select count(1) from src tablesample(1 percent);` means take 1% sample to select. But it means 100% in the current version of the Spark.

Author: q00251598 <[email protected]>

Closes apache#4789 from watermen/SPARK-6040 and squashes the following commits:

2453ebe [q00251598] check and adjust the fraction.
  • Loading branch information
watermen authored and marmbrus committed Mar 2, 2015
1 parent 3f9def8 commit 582e5a2
Show file tree
Hide file tree
Showing 2 changed files with 11 additions and 1 deletion.
11 changes: 10 additions & 1 deletion sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveQl.scala
Original file line number Diff line number Diff line change
Expand Up @@ -40,6 +40,7 @@ import org.apache.spark.sql.execution.ExplainCommand
import org.apache.spark.sql.sources.DescribeCommand
import org.apache.spark.sql.hive.execution.{HiveNativeCommand, DropTable, AnalyzeTable, HiveScriptIOSchema}
import org.apache.spark.sql.types._
import org.apache.spark.util.random.RandomSampler

/* Implicit conversions */
import scala.collection.JavaConversions._
Expand Down Expand Up @@ -850,7 +851,15 @@ https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation%2C+Cube%2C
case Token("TOK_TABLESPLITSAMPLE",
Token("TOK_PERCENT", Nil) ::
Token(fraction, Nil) :: Nil) =>
Sample(fraction.toDouble, withReplacement = false, (math.random * 1000).toInt, relation)
// The range of fraction accepted by Sample is [0, 1]. Because Hive's block sampling
// function takes X PERCENT as the input and the range of X is [0, 100], we need to
// adjust the fraction.
require(
fraction.toDouble >= (0.0 - RandomSampler.roundingEpsilon)
&& fraction.toDouble <= (100.0 + RandomSampler.roundingEpsilon),
s"Sampling fraction ($fraction) must be on interval [0, 100]")
Sample(fraction.toDouble / 100, withReplacement = false, (math.random * 1000).toInt,
relation)
case Token("TOK_TABLEBUCKETSAMPLE",
Token(numerator, Nil) ::
Token(denominator, Nil) :: Nil) =>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -467,6 +467,7 @@ class HiveQuerySuite extends HiveComparisonTest with BeforeAndAfter {

test("sampling") {
sql("SELECT * FROM src TABLESAMPLE(0.1 PERCENT) s")
sql("SELECT * FROM src TABLESAMPLE(100 PERCENT) s")
}

test("DataFrame toString") {
Expand Down

0 comments on commit 582e5a2

Please sign in to comment.