[SPARK-23456][SPARK-21783] Turn on native ORC impl and PPD by default
## What changes were proposed in this pull request?

Apache Spark 2.3 introduced `native` ORC support with vectorization and many fixes, but shipped it as a non-default option. This PR enables the `native` ORC implementation and ORC predicate pushdown by default for Apache Spark 2.4. The ORC data source will be further improved and stabilized before the Apache Spark 2.4 release, and the old Hive-based ORC code will eventually be dropped.
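
For reference, a minimal sketch of how a session can opt back into the pre-2.4 behavior at runtime (this snippet assumes an existing `SparkSession` bound to `spark`; it is illustrative, not part of this patch):

```scala
// Sketch: restore the pre-2.4 ORC defaults for the current session.
// Assumes an existing SparkSession bound to `spark`.
spark.conf.set("spark.sql.orc.impl", "hive")            // use the ORC library in Hive 1.2.1
spark.conf.set("spark.sql.orc.filterPushdown", "false") // disable ORC predicate pushdown
```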

## How was this patch tested?

Passes Jenkins with the existing tests.

Author: Dongjoon Hyun <[email protected]>

Closes apache#20634 from dongjoon-hyun/SPARK-23456.
dongjoon-hyun authored and gatorsmile committed Feb 20, 2018
1 parent 189f56f commit 83c0087
Showing 2 changed files with 8 additions and 4 deletions.
docs/sql-programming-guide.md (5 additions, 1 deletion):

```diff
@@ -1018,7 +1018,7 @@ the vectorized reader is used when `spark.sql.hive.convertMetastoreOrc` is also
 <tr>
   <td><code>spark.sql.orc.impl</code></td>
   <td><code>hive</code></td>
-  <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4.1. `hive` means the ORC library in Hive 1.2.1.</td>
+  <td>The name of ORC implementation. It can be one of <code>native</code> and <code>hive</code>. <code>native</code> means the native ORC support that is built on Apache ORC 1.4. `hive` means the ORC library in Hive 1.2.1.</td>
 </tr>
 <tr>
   <td><code>spark.sql.orc.enableVectorizedReader</code></td>
@@ -1797,6 +1797,10 @@ working with timestamps in `pandas_udf`s to get the best performance, see

 # Migration Guide

+## Upgrading From Spark SQL 2.3 to 2.4
+
+- Since Spark 2.4, Spark maximizes the usage of a vectorized ORC reader for ORC files by default. To do that, `spark.sql.orc.impl` and `spark.sql.orc.filterPushdown` change their default values to `native` and `true` respectively.
+
 ## Upgrading From Spark SQL 2.2 to 2.3

 - Since Spark 2.3, the queries from raw JSON/CSV files are disallowed when the referenced columns only include the internal corrupt record column (named `_corrupt_record` by default). For example, `spark.read.schema(schema).json(file).filter($"_corrupt_record".isNotNull).count()` and `spark.read.schema(schema).json(file).select("_corrupt_record").show()`. Instead, you can cache or save the parsed results and then send the same query. For example, `val df = spark.read.schema(schema).json(file).cache()` and then `df.filter($"_corrupt_record".isNotNull).count()`.
```
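
In practice, the migration note in the diff above means that ORC scans like the following benefit from both new defaults. A minimal, self-contained sketch; the path and column name here are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

object OrcScanExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("orc-scan-example")
    .getOrCreate()
  import spark.implicits._

  // With the Spark 2.4 defaults, this scan uses the vectorized `native`
  // ORC reader, and the filter below is pushed down into the ORC library.
  val events = spark.read.orc("/tmp/events.orc")                 // hypothetical path
  println(events.filter($"event_date" >= "2018-01-01").count())  // hypothetical column

  spark.stop()
}
```
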
sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (3 additions, 3 deletions):

```diff
@@ -399,11 +399,11 @@ object SQLConf {

   val ORC_IMPLEMENTATION = buildConf("spark.sql.orc.impl")
     .doc("When native, use the native version of ORC support instead of the ORC library in Hive " +
-      "1.2.1. It is 'hive' by default.")
+      "1.2.1. It is 'hive' by default prior to Spark 2.4.")
     .internal()
     .stringConf
     .checkValues(Set("hive", "native"))
-    .createWithDefault("hive")
+    .createWithDefault("native")

   val ORC_VECTORIZED_READER_ENABLED = buildConf("spark.sql.orc.enableVectorizedReader")
     .doc("Enables vectorized orc decoding.")
@@ -426,7 +426,7 @@
   val ORC_FILTER_PUSHDOWN_ENABLED = buildConf("spark.sql.orc.filterPushdown")
     .doc("When true, enable filter pushdown for ORC files.")
     .booleanConf
-    .createWithDefault(false)
+    .createWithDefault(true)

   val HIVE_VERIFY_PARTITION_PATH = buildConf("spark.sql.hive.verifyPartitionPath")
     .doc("When true, check all the partition paths under the table\'s root directory " +
```
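
As a quick way to observe the new defaults from user code, a minimal standalone sketch (the local master and app name are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

object CheckOrcDefaults extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("check-orc-defaults")
    .getOrCreate()

  // With this patch applied (Spark 2.4+), these print "native" and "true".
  println(spark.conf.get("spark.sql.orc.impl"))
  println(spark.conf.get("spark.sql.orc.filterPushdown"))

  spark.stop()
}
```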
