[SPARK-34212][SQL][FOLLOWUP] Parquet vectorized reader can read decimal fields with a larger precision

### What changes were proposed in this pull request?

This is a followup of apache#31357.

apache#31357 added a very strict restriction to the vectorized Parquet reader: when reading decimal fields, the Spark data type must exactly match the physical Parquet type. This restriction is not actually necessary, because a Parquet decimal can safely be read as a Spark decimal with a larger precision. This PR relaxes the restriction slightly.
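
As an illustration of why widening the precision is safe while changing the scale is not, here is a minimal sketch using plain `java.math.BigDecimal`. It is not Spark or Parquet code; the class `DecimalWideningSketch` and the helper `fits` are hypothetical names introduced only for this example. A decimal is an unscaled integer plus a scale, so a value written with a given scale can be reused unchanged under any precision that is at least as large, whereas a different scale would require rescaling the unscaled integer.

```java
import java.math.BigDecimal;
import java.math.BigInteger;

// Hypothetical illustration class; not part of Spark or Parquet.
final class DecimalWideningSketch {

    // True if `value` can be stored as DECIMAL(precision, scale) with its
    // unscaled integer reused as is (no rescaling, no overflow).
    static boolean fits(BigDecimal value, int precision, int scale) {
        return value.scale() == scale && value.precision() <= precision;
    }

    public static void main(String[] args) {
        // 1.23 as written for a DECIMAL(17, 2) column: unscaled integer 123, scale 2.
        BigDecimal stored = new BigDecimal(BigInteger.valueOf(123), 2);

        // Same scale, larger precision: the unscaled integer 123 is still valid.
        System.out.println(fits(stored, 18, 2)); // true

        // Different scale: the unscaled integer would have to change.
        System.out.println(fits(stored, 18, 3)); // false
        System.out.println(stored.setScale(3).unscaledValue()); // 1230
    }
}
```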

### Why are the changes needed?

To not fail queries unnecessarily.

### Does this PR introduce _any_ user-facing change?

Yes. Users can now read Parquet decimals with a mismatched `DecimalType`, as long as the scale is the same and the requested precision is at least the precision stored in the file.
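
The rule can be sketched as a standalone predicate. This is a simplified stand-in written for illustration, assuming only the precision and scale matter; the class `DecimalCompatibilitySketch` and the method `canReadAs` are hypothetical names, not Spark APIs. The sample values are taken from the schemas used in the updated test.

```java
// Hypothetical illustration class; not part of Spark.
final class DecimalCompatibilitySketch {

    // Relaxed rule: the requested precision may be larger than or equal to the
    // precision recorded in the file, but the scale must match exactly.
    static boolean canReadAs(int filePrecision, int fileScale,
                             int requestedPrecision, int requestedScale) {
        return filePrecision <= requestedPrecision && fileScale == requestedScale;
    }

    public static void main(String[] args) {
        // DECIMAL(17, 2) in the file, read as DECIMAL(18, 2): accepted.
        System.out.println(canReadAs(17, 2, 18, 2)); // true
        // DECIMAL(36, 2) in the file, read as DECIMAL(38, 2): accepted.
        System.out.println(canReadAs(36, 2, 38, 2)); // true
        // Same file column read as DECIMAL(18, 3): scale differs, rejected.
        System.out.println(canReadAs(17, 2, 18, 3)); // false
        // Requested precision smaller than the file precision: rejected.
        System.out.println(canReadAs(17, 2, 16, 2)); // false
    }
}
```

Under the previous exact-match rule, only a requested type with the same precision and scale as the file's would have been accepted.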

### How was this patch tested?

Updated the existing test.

Closes apache#31443 from cloud-fan/improve.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
cloud-fan authored and HyukjinKwon committed Feb 3, 2021
1 parent 6386602 commit 00120ea
Showing 2 changed files with 11 additions and 1 deletion.
@@ -111,7 +111,9 @@ public class VectorizedColumnReader {
   private boolean isDecimalTypeMatched(DataType dt) {
     DecimalType d = (DecimalType) dt;
     DecimalMetadata dm = descriptor.getPrimitiveType().getDecimalMetadata();
-    return dm != null && dm.getPrecision() == d.precision() && dm.getScale() == d.scale();
+    // It's OK if the required decimal precision is larger than or equal to the physical decimal
+    // precision in the Parquet metadata, as long as the decimal scale is the same.
+    return dm != null && dm.getPrecision() <= d.precision() && dm.getScale() == d.scale();
   }
 
   private boolean canReadAsIntDecimal(DataType dt) {
@@ -3891,6 +3891,14 @@ class SQLQuerySuite extends QueryTest with SharedSparkSession with AdaptiveSpark
       val df = sql("SELECT 1.0 a, CAST(1.23 AS DECIMAL(17, 2)) b, CAST(1.23 AS DECIMAL(36, 2)) c")
       df.write.parquet(path.toString)
 
+      Seq(true, false).foreach { vectorizedReader =>
+        withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> vectorizedReader.toString) {
+          // We can read the decimal parquet field with a larger precision, if scale is the same.
+          val schema = "a DECIMAL(9, 1), b DECIMAL(18, 2), c DECIMAL(38, 2)"
+          checkAnswer(readParquet(schema, path), df)
+        }
+      }
+
       withSQLConf(SQLConf.PARQUET_VECTORIZED_READER_ENABLED.key -> "false") {
         val schema1 = "a DECIMAL(3, 2), b DECIMAL(18, 3), c DECIMAL(37, 3)"
         checkAnswer(readParquet(schema1, path), df)