[SPARK-15056][SQL] Parse Unsupported Sampling Syntax and Issue Better Exceptions

#### What changes were proposed in this pull request?
Compared with the current Spark parser, Hive supports two additional sampling syntaxes:
- In `ON` clauses, `rand()` indicates sampling on the entire row instead of an individual column. For example,

   ```SQL
   SELECT * FROM source TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s;
   ```
- Users can specify the total length to be read. For example,

   ```SQL
   SELECT * FROM source TABLESAMPLE(100M) s;
   ```

Below is the link for references:
   https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Sampling

This PR parses and captures these two additional syntaxes and issues a better error message for each.
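
The intended behavior can be sketched outside Spark as a small classifier over the text inside `TABLESAMPLE(...)`. This is only an illustration under stated assumptions: `TableSampleSketch`, `Outcome`, `Fraction`, `Unsupported`, and `classify` are hypothetical names, regexes stand in for the ANTLR grammar, matching is assumed to be case-insensitive, and the `ROWS` form and the range check on the fraction are left out. Only the three "is not supported" messages are taken from this patch.

```scala
// Hypothetical, standalone sketch; this is not the Spark parser.
object TableSampleSketch {
  sealed trait Outcome
  final case class Fraction(value: Double) extends Outcome
  final case class Unsupported(message: String) extends Outcome

  // Regex stand-ins for the grammar alternatives (case-insensitivity is an assumption).
  private val Percent      = """(?i)(\d+(?:\.\d+)?)\s+PERCENT""".r
  private val ByteLength   = """(?i)\d+[BKMG]""".r
  private val Bucket       = """(?i)BUCKET\s+(\d+)\s+OUT\s+OF\s+(\d+)(?:\s+ON\s+(.+))?""".r
  private val FunctionCall = """[\w.]+\s*\(\s*\)""".r

  def classify(arg: String): Outcome = arg.trim match {
    case Percent(p) =>
      Fraction(p.toDouble / 100.0)
    case ByteLength() =>
      Unsupported("TABLESAMPLE(byteLengthLiteral) is not supported")
    // The optional ON group is null when the clause is absent.
    case Bucket(_, _, on) if on != null =>
      on.trim match {
        case FunctionCall() =>
          Unsupported("TABLESAMPLE(BUCKET x OUT OF y ON function) is not supported")
        case _ =>
          Unsupported("TABLESAMPLE(BUCKET x OUT OF y ON colname) is not supported")
      }
    case Bucket(numerator, denominator, _) =>
      Fraction(numerator.toDouble / denominator.toDouble)
    case other =>
      // Hypothetical fallback message; not part of the patch.
      Unsupported(s"unrecognized sampling clause: $other")
  }
}
```

For instance, `TableSampleSketch.classify("BUCKET 3 OUT OF 32 ON rand()")` yields the "ON function" message, while `classify("bucket 4 out of 10")` yields a sampling fraction.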

#### How was this patch tested?
Added test cases to verify the thrown exceptions
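
As a rough illustration of this style of test, the sketch below checks that a rejected query fails with a message containing an expected fragment. It does not use the Spark test helpers: `InterceptSketch`, `parse`, `messageOf`, and `assertRejected` are hypothetical names, and `parse` is a stand-in that rejects only the byte-length form.

```scala
// Hypothetical, standalone sketch of an "expect this error message" test helper.
object InterceptSketch {
  // Stand-in for a parser entry point (assumption): rejects TABLESAMPLE(<digits><B|K|M|G>).
  def parse(sql: String): String =
    if (sql.toUpperCase.matches(""".*TABLESAMPLE\s*\(\s*\d+[BKMG]\s*\).*"""))
      throw new IllegalArgumentException("TABLESAMPLE(byteLengthLiteral) is not supported")
    else sql

  // Runs `body`; returns the exception message if it throws, None otherwise.
  def messageOf(body: => Any): Option[String] =
    try { body; None } catch { case e: IllegalArgumentException => Some(e.getMessage) }

  // Fails unless parsing `sql` throws and the message contains every fragment.
  def assertRejected(sql: String, fragments: String*): Unit = {
    val message = messageOf(parse(sql)).getOrElse(
      throw new AssertionError(s"expected a parse failure for: $sql"))
    fragments.foreach(f => assert(message.contains(f), s"missing '$f' in '$message'"))
  }
}
```

A call such as `InterceptSketch.assertRejected("SELECT * FROM parquet_t0 TABLESAMPLE(300M) s", "is not supported")` passes silently, while a query the stand-in accepts raises an `AssertionError`.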

Author: gatorsmile <[email protected]>

Closes apache#12838 from gatorsmile/bucketOnRand.
gatorsmile authored and hvanhovell committed May 3, 2016
1 parent 2e2a621 commit 71296c0
Showing 3 changed files with 22 additions and 3 deletions.
@@ -389,7 +389,8 @@ sample
     : TABLESAMPLE '('
       ( (percentage=(INTEGER_VALUE | DECIMAL_VALUE) sampleType=PERCENTLIT)
       | (expression sampleType=ROWS)
-      | (sampleType=BUCKET numerator=INTEGER_VALUE OUT OF denominator=INTEGER_VALUE (ON identifier)?))
+      | sampleType=BYTELENGTH_LITERAL
+      | (sampleType=BUCKET numerator=INTEGER_VALUE OUT OF denominator=INTEGER_VALUE (ON (identifier | qualifiedName '(' ')'))?))
       ')'
     ;

@@ -895,6 +896,10 @@ TINYINT_LITERAL
     : DIGIT+ 'Y'
     ;

+BYTELENGTH_LITERAL
+    : DIGIT+ ('B' | 'K' | 'M' | 'G')
+    ;
+
 INTEGER_VALUE
     : DIGIT+
     ;
@@ -632,8 +632,18 @@ class AstBuilder extends SqlBaseBaseVisitor[AnyRef] with Logging {
           val fraction = ctx.percentage.getText.toDouble
           sample(fraction / 100.0d)

+        case SqlBaseParser.BYTELENGTH_LITERAL =>
+          throw new ParseException(
+            "TABLESAMPLE(byteLengthLiteral) is not supported", ctx)
+
         case SqlBaseParser.BUCKET if ctx.ON != null =>
-          throw new ParseException("TABLESAMPLE(BUCKET x OUT OF y ON id) is not supported", ctx)
+          if (ctx.identifier != null) {
+            throw new ParseException(
+              "TABLESAMPLE(BUCKET x OUT OF y ON colname) is not supported", ctx)
+          } else {
+            throw new ParseException(
+              "TABLESAMPLE(BUCKET x OUT OF y ON function) is not supported", ctx)
+          }

         case SqlBaseParser.BUCKET =>
           sample(ctx.numerator.getText.toDouble / ctx.denominator.getText.toDouble)
@@ -372,9 +372,13 @@ class PlanParserSuite extends PlanTest {
     assertEqual(s"$sql tablesample(bucket 4 out of 10) as x",
       Sample(0, .4d, withReplacement = false, 10L, table("t").as("x"))(true).select(star()))
     intercept(s"$sql tablesample(bucket 4 out of 10 on x) as x",
-      "TABLESAMPLE(BUCKET x OUT OF y ON id) is not supported")
+      "TABLESAMPLE(BUCKET x OUT OF y ON colname) is not supported")
     intercept(s"$sql tablesample(bucket 11 out of 10) as x",
       s"Sampling fraction (${11.0/10.0}) must be on interval [0, 1]")
+    intercept("SELECT * FROM parquet_t0 TABLESAMPLE(300M) s",
+      "TABLESAMPLE(byteLengthLiteral) is not supported")
+    intercept("SELECT * FROM parquet_t0 TABLESAMPLE(BUCKET 3 OUT OF 32 ON rand()) s",
+      "TABLESAMPLE(BUCKET x OUT OF y ON function) is not supported")
   }

   test("sub-query") {