IMPALA-4252: [DOCS] Document min/max filters for Kudu tables
Change-Id: I15d8c952ab5b90e89fdd57640dfb4da882f7ecb2
Reviewed-on: http://gerrit.cloudera.org:8080/8986
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
John Russell authored and cloudera-hudson committed Jan 11, 2018
1 parent f660191 commit ad54bf0
Showing 8 changed files with 71 additions and 6 deletions.
6 changes: 6 additions & 0 deletions docs/shared/impala_common.xml
@@ -1194,6 +1194,12 @@ drop database temp;
<codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive.
</p>

<p rev="2.11.0 IMPALA-4252" id="filter_option_bloom_only">
This query option affects only Bloom filters, not the min/max filters
that are applied to Kudu tables. Therefore, it does not affect the
performance of queries against Kudu tables.
</p>

<p rev="2.9.0 IMPALA-5333" id="adls_dml_performance">
<draft-comment>
Currently nothing to say on this subject. Leaving this placeholder
14 changes: 14 additions & 0 deletions docs/topics/impala_disable_row_runtime_filtering.xml
@@ -72,6 +72,20 @@ under the License.
unsetting it immediately afterward.
</p>

<p conref="../shared/impala_common.xml#common/file_format_blurb"/>

<p rev="2.11.0 IMPALA-4252">
This query option applies only to queries against HDFS-based tables
using the Parquet file format.
</p>

<p conref="../shared/impala_common.xml#common/kudu_blurb"/>

<p rev="2.11.0 IMPALA-4252">
When applied to a query involving a Kudu table, this option turns off
all runtime filtering for the Kudu table.
</p>

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
17 changes: 17 additions & 0 deletions docs/topics/impala_kudu.xml
@@ -1373,6 +1373,23 @@ kudu.table_name | impala::some_database.table_name_demo
parallelize the query very efficiently.
</p>

<p rev="2.11.0 IMPALA-4252">
In <keyword keyref="impala211_full"/> and higher, Impala can push down additional
information to optimize join queries involving Kudu tables. If the join clause
contains predicates of the form
<codeph><varname>column</varname> = <varname>expression</varname></codeph>,
then after Impala constructs a hash table of possible matching values for the
join columns from the bigger table (either an HDFS table or a Kudu table), Impala
can <q>push down</q> the minimum and maximum matching column values to Kudu,
so that Kudu can more efficiently locate matching rows in the second (smaller) table.
These min/max filters are affected by the <codeph>RUNTIME_FILTER_MODE</codeph>,
<codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph>, and <codeph>DISABLE_ROW_RUNTIME_FILTERING</codeph>
query options. The min/max filters are not affected by the
<codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph>, <codeph>RUNTIME_FILTER_MIN_SIZE</codeph>,
<codeph>RUNTIME_FILTER_MAX_SIZE</codeph>, and <codeph>MAX_NUM_RUNTIME_FILTERS</codeph>
query options.
</p>

<p>
See <xref keyref="explain"/> for examples of evaluating the effectiveness of
the predicate pushdown for a specific query against a Kudu table.
4 changes: 4 additions & 0 deletions docs/topics/impala_max_num_runtime_filters.xml
@@ -67,6 +67,10 @@ under the License.

<p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>

<p conref="../shared/impala_common.xml#common/kudu_blurb"/>

<p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
4 changes: 4 additions & 0 deletions docs/topics/impala_runtime_bloom_filter_size.xml
@@ -98,6 +98,10 @@ under the License.
unsetting it immediately afterward.
</p>

<p conref="../shared/impala_common.xml#common/kudu_blurb"/>

<p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
4 changes: 4 additions & 0 deletions docs/topics/impala_runtime_filter_max_size.xml
@@ -57,6 +57,10 @@ under the License.

<p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>

<p conref="../shared/impala_common.xml#common/kudu_blurb"/>

<p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
4 changes: 4 additions & 0 deletions docs/topics/impala_runtime_filter_min_size.xml
@@ -57,6 +57,10 @@ under the License.

<p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>

<p conref="../shared/impala_common.xml#common/kudu_blurb"/>

<p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>

<p conref="../shared/impala_common.xml#common/related_info"/>
<p>
<xref href="impala_runtime_filtering.xml"/>,
24 changes: 18 additions & 6 deletions docs/topics/impala_runtime_filtering.xml
@@ -169,16 +169,23 @@ under the License.
of values for join key columns. When this list of values is transmitted in time to a scan node,
Impala can filter out non-matching values immediately after reading them, rather than transmitting
the raw data to another host to compare against the in-memory hash table on that host.
-This data structure is implemented as a <term>Bloom filter</term>, which uses a probability-based
-algorithm to determine all possible matching values. (The probability-based aspects means that the
-filter might include some non-matching values, but if so, that does not cause any inaccuracy
</p>
<p>
For HDFS-based tables, this data structure is implemented as a <term>Bloom filter</term>, which uses
a probability-based algorithm to determine all possible matching values. (The probability-based aspect
means that the filter might include some non-matching values, but if so, that does not cause any inaccuracy
in the final results.)
</p>
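Reviewer sketch, not part of the commit: a toy Bloom filter in Python illustrating the property described in the paragraph above. It may let some non-matching values through, but it never rejects a value that was added, so a later exact join comparison still yields correct results. The class name `BloomFilter`, the helper `scan_with_filter`, and the sizing parameters are hypothetical and do not reflect Impala's actual implementation.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: possible false positives, no false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value):
        # Derive num_hashes bit positions by salting one stable hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # True if every bit for this value is set; can be a false positive.
        return all(self.bits[pos] for pos in self._positions(value))


def scan_with_filter(rows, key, bloom):
    """Drop rows at scan time whose join key cannot possibly match."""
    return [row for row in rows if bloom.might_contain(row[key])]
```

Rows that survive the filter still go through the exact join comparison, which is why an occasional false positive costs only some extra work.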
<p rev="2.11.0 IMPALA-4252">
Another kind of filter is the <q>min-max</q> filter. It currently applies only to Kudu tables. The
filter is a data structure representing a minimum and maximum value. These filters are passed to
Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join.
</p>
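Reviewer sketch, not part of the commit: the min-max filter described above can be pictured as a pair of bounds computed from one side's join-key values and applied as a range check while scanning the probe side. The names `MinMaxFilter`, `build_min_max`, and `scan_probe_side` are hypothetical, and treating an empty filter as "nothing matches" is an assumption of this sketch rather than documented behavior.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class MinMaxFilter:
    lo: Optional[int] = None
    hi: Optional[int] = None

    def insert(self, value: int) -> None:
        # Widen the bounds so that they cover every inserted value.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def might_contain(self, value: int) -> bool:
        # Assumption: an empty filter means no value can match.
        if self.lo is None or self.hi is None:
            return False
        return self.lo <= value <= self.hi


def build_min_max(join_keys: Iterable[int]) -> MinMaxFilter:
    """Compute the bounds from the join-key values seen on one side."""
    flt = MinMaxFilter()
    for key in join_keys:
        flt.insert(key)
    return flt


def scan_probe_side(values: Iterable[int], flt: MinMaxFilter) -> List[int]:
    """Return only the values that fall inside the pushed-down range."""
    return [v for v in values if flt.might_contain(v)]


flt = build_min_max([10, 42, 17])
print(scan_probe_side([5, 10, 30, 42, 99], flt))  # prints [10, 30, 42]
```

The value 30 passes the range check even though it is not one of the original keys. Like a Bloom filter, a min-max filter only narrows the scan, and the join itself still decides the final matches.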
<p>
There are different kinds of filters to match the different kinds of joins (partitioned and broadcast).
-A broadcast filter is a complete list of relevant values that can be immediately evaluated by a scan node.
-A partitioned filter is a partial list of relevant values (based on the data processed by one host in the
-cluster); all the partitioned filters must be combined into one (by the coordinator node) before the
A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node.
A partitioned filter reflects only the values processed by one host in the
cluster; all the partitioned filters must be combined into one (by the coordinator node) before the
scan nodes can use the results to accurately filter the data as it is read from storage.
</p>
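Reviewer sketch, not part of the commit: one way to picture the combining step for partitioned filters. Each host contributes a partial filter that reflects only the values it processed, and the merged filter must accept everything that any partial filter accepts. The function names are hypothetical. Representing a Bloom filter's bits as a Python integer and a min-max filter as a `(lo, hi)` tuple is a simplification for illustration, not Impala's data layout.

```python
from functools import reduce
from typing import Iterable, Optional, Tuple


def merge_min_max(
    partials: Iterable[Tuple[int, int]],
) -> Optional[Tuple[int, int]]:
    """Combine per-host (lo, hi) ranges into one range covering all of them."""
    partials = list(partials)
    if not partials:
        return None  # no host produced a filter
    lo = min(p[0] for p in partials)
    hi = max(p[1] for p in partials)
    return (lo, hi)


def merge_bloom_bits(partials: Iterable[int]) -> int:
    """Combine per-host Bloom bit arrays (held as ints) with bitwise OR."""
    return reduce(lambda a, b: a | b, partials, 0)


# Three hosts each saw a different slice of the join keys.
print(merge_min_max([(10, 20), (5, 12), (18, 40)]))  # prints (5, 40)
print(bin(merge_bloom_bits([0b0011, 0b0100])))       # prints 0b111
```

Until every partial filter has been merged, a scan node applying just one of them could wrongly discard rows that match values seen on another host, which is why the combined result is needed for accurate filtering.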
<p>
@@ -331,6 +338,9 @@ under the License.
<codeph>runtime filters: <varname>filter_id</varname> &lt;- <varname>table</varname>.<varname>column</varname></codeph>,
while a plan fragment that consumes a filter includes an annotation such as
<codeph>runtime filters: <varname>filter_id</varname> -&gt; <varname>table</varname>.<varname>column</varname></codeph>.
<ph rev="2.11.0 IMPALA-4252">Setting the query option <codeph>EXPLAIN_LEVEL=2</codeph> adds
annotations showing the type of the filter, either <codeph><varname>filter_id</varname>[bloom]</codeph>
(for HDFS-based tables) or <codeph><varname>filter_id</varname>[min_max]</codeph> (for Kudu tables).</ph>
</p>

<p>
@@ -507,6 +517,8 @@ select c1 from huge_t1 join [shuffle] huge_t2
The runtime filtering feature is most effective for the Parquet file format.
For other file formats, filtering only applies for partitioned tables.
See <xref href="impala_runtime_filtering.xml#runtime_filtering_file_formats"/>.
For the ways in which runtime filtering works for Kudu tables, see
<xref href="impala_kudu.xml#kudu_performance"/>.
</p>

<!-- To do: check if this restriction is lifted in 5.8 / 2.6. -->