IMPALA-4252: [DOCS] Document min/max filters for Kudu tables

Change-Id: I15d8c952ab5b90e89fdd57640dfb4da882f7ecb2
Reviewed-on: http://gerrit.cloudera.org:8080/8986
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
joemcdonnell · Jan 11, 2018 · ad54bf0
1 parent f660191

Showing 8 changed files with 71 additions and 6 deletions.
diff --git a/docs/shared/impala_common.xml b/docs/shared/impala_common.xml
@@ -1194,6 +1194,12 @@ drop database temp;
         <codeph>hadoop fs -cp</codeph>, or <codeph>INSERT</codeph> in Impala or Hive.
       </p>
 
+      <p rev="2.11.0 IMPALA-4252" id="filter_option_bloom_only">
+        This query option affects only Bloom filters, not the min/max filters
+        that are applied to Kudu tables. Therefore, it does not affect the
+        performance of queries against Kudu tables.
+      </p>
+
       <p rev="2.9.0 IMPALA-5333" id="adls_dml_performance">
         <draft-comment>
           Currently nothing to say on this subject. Leaving this placeholder

diff --git a/docs/topics/impala_disable_row_runtime_filtering.xml b/docs/topics/impala_disable_row_runtime_filtering.xml
@@ -72,6 +72,20 @@ under the License.
       unsetting it immediately afterward.
     </p>
 
+    <p conref="../shared/impala_common.xml#common/file_format_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252">
+      This query option applies only to queries against HDFS-based tables
+      using the Parquet file format.
+    </p>
+
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252">
+      When applied to a query involving a Kudu table, this option turns off
+      all runtime filtering for the Kudu table.
+    </p>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

diff --git a/docs/topics/impala_kudu.xml b/docs/topics/impala_kudu.xml
@@ -1373,6 +1373,23 @@ kudu.table_name  | impala::some_database.table_name_demo
         parallelize the query very efficiently.
       </p>
 
+      <p rev="2.11.0 IMPALA-4252">
+        In <keyword keyref="impala211_full"/> and higher, Impala can push down additional
+        information to optimize join queries involving Kudu tables. If the join clause
+        contains predicates of the form
+        <codeph><varname>column</varname> = <varname>expression</varname></codeph>,
+        after Impala constructs a hash table of possible matching values for the
+        join columns from the bigger table (either an HDFS table or a Kudu table), Impala
+        can <q>push down</q> the minimum and maximum matching column values to Kudu,
+        so that Kudu can more efficiently locate matching rows in the second (smaller) table.
+        These min/max filters are affected by the <codeph>RUNTIME_FILTER_MODE</codeph>,
+        <codeph>RUNTIME_FILTER_WAIT_TIME_MS</codeph>, and <codeph>DISABLE_ROW_RUNTIME_FILTERING</codeph>
+        query options; the min/max filters are not affected by the
+        <codeph>RUNTIME_BLOOM_FILTER_SIZE</codeph>, <codeph>RUNTIME_FILTER_MIN_SIZE</codeph>,
+        <codeph>RUNTIME_FILTER_MAX_SIZE</codeph>, and <codeph>MAX_NUM_RUNTIME_FILTERS</codeph>
+        query options.
+      </p>
+
       <p>
         See <xref keyref="explain"/> for examples of evaluating the effectiveness of
         the predicate pushdown for a specific query against a Kudu table.
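
The min/max pushdown described in the `impala_kudu.xml` paragraph above can be illustrated with a small sketch. This is not Impala or Kudu code: `MinMaxFilter`, `build_filter`, and `scan_with_filter` are hypothetical names, and integer join keys are an assumption made for brevity. The sketch only shows the idea that the side building the hash table also records the smallest and largest join key, and the scan that receives the filter discards rows outside that range before the join sees them.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class MinMaxFilter:
    """Hypothetical stand-in for a min/max runtime filter (not Impala's class)."""
    lo: Optional[int] = None
    hi: Optional[int] = None

    def insert(self, value: int) -> None:
        # Widen the range so that it covers every join key seen so far.
        self.lo = value if self.lo is None else min(self.lo, value)
        self.hi = value if self.hi is None else max(self.hi, value)

    def might_match(self, value: int) -> bool:
        # An empty filter means no join keys were seen, so nothing can match.
        if self.lo is None or self.hi is None:
            return False
        return self.lo <= value <= self.hi


def build_filter(join_keys: Iterable[int]) -> MinMaxFilter:
    """Collect the min and max while the hash table side is being read."""
    flt = MinMaxFilter()
    for key in join_keys:
        flt.insert(key)
    return flt


def scan_with_filter(rows: Iterable[int], flt: MinMaxFilter) -> List[int]:
    """Simulate the filtered scan: keep only rows inside the pushed-down range."""
    return [row for row in rows if flt.might_match(row)]


hash_side_keys = [105, 110, 120]
scanned_keys = [1, 50, 105, 111, 120, 999]

flt = build_filter(hash_side_keys)
survivors = scan_with_filter(scanned_keys, flt)

# 111 is inside the range but has no partner; the join itself still removes it,
# so the filter reduces work without changing the final result.
joined = [key for key in survivors if key in set(hash_side_keys)]
```

Like a Bloom filter, a range check of this kind can let through values that do not actually join, which is harmless because the join still verifies every row. Unlike a Bloom filter, it has no size to tune, which is consistent with the size-related query options not applying to it.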

diff --git a/docs/topics/impala_max_num_runtime_filters.xml b/docs/topics/impala_max_num_runtime_filters.xml
@@ -67,6 +67,10 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

diff --git a/docs/topics/impala_runtime_bloom_filter_size.xml b/docs/topics/impala_runtime_bloom_filter_size.xml
@@ -98,6 +98,10 @@ under the License.
       unsetting it immediately afterward.
     </p>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

diff --git a/docs/topics/impala_runtime_filter_max_size.xml b/docs/topics/impala_runtime_filter_max_size.xml
@@ -57,6 +57,10 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

diff --git a/docs/topics/impala_runtime_filter_min_size.xml b/docs/topics/impala_runtime_filter_min_size.xml
@@ -57,6 +57,10 @@ under the License.
 
     <p conref="../shared/impala_common.xml#common/runtime_filtering_option_caveat"/>
 
+    <p conref="../shared/impala_common.xml#common/kudu_blurb"/>
+
+    <p rev="2.11.0 IMPALA-4252" conref="../shared/impala_common.xml#common/filter_option_bloom_only"/>
+
     <p conref="../shared/impala_common.xml#common/related_info"/>
     <p>
       <xref href="impala_runtime_filtering.xml"/>,

diff --git a/docs/topics/impala_runtime_filtering.xml b/docs/topics/impala_runtime_filtering.xml
@@ -169,16 +169,23 @@ under the License.
         of values for join key columns. When this list is values is transmitted in time to a scan node,
         Impala can filter out non-matching values immediately after reading them, rather than transmitting
         the raw data to another host to compare against the in-memory hash table on that host.
-        This data structure is implemented as a <term>Bloom filter</term>, which uses a probability-based
-        algorithm to determine all possible matching values. (The probability-based aspects means that the
-        filter might include some non-matching values, but if so, that does not cause any inaccuracy
+      </p>
+      <p>
+        For HDFS-based tables, this data structure is implemented as a <term>Bloom filter</term>, which uses
+        a probability-based algorithm to determine all possible matching values. (The probability-based aspect
+        means that the filter might include some non-matching values, but if so, that does not cause any inaccuracy
         in the final results.)
       </p>
+      <p rev="2.11.0 IMPALA-4252">
+        Another kind of filter is the <q>min-max</q> filter. It currently applies only to Kudu tables. The
+        filter is a data structure representing a minimum and maximum value. These filters are passed to
+        Kudu to reduce the number of rows returned to Impala when scanning the probe side of the join.
+      </p>
       <p>
         There are different kinds of filters to match the different kinds of joins (partitioned and broadcast).
-        A broadcast filter is a complete list of relevant values that can be immediately evaluated by a scan node.
-        A partitioned filter is a partial list of relevant values (based on the data processed by one host in the
-        cluster); all the partitioned filters must be combined into one (by the coordinator node) before the
+        A broadcast filter reflects the complete list of relevant values and can be immediately evaluated by a scan node.
+        A partitioned filter reflects only the values processed by one host in the
+        cluster; all the partitioned filters must be combined into one (by the coordinator node) before the
         scan nodes can use the results to accurately filter the data as it is read from storage.
       </p>
       <p>
@@ -331,6 +338,9 @@ under the License.
         <codeph>runtime filters: <varname>filter_id</varname> &lt;- <varname>table</varname>.<varname>column</varname></codeph>,
         while a plan fragment that consumes a filter includes an annotation such as
         <codeph>runtime filters: <varname>filter_id</varname> -&gt; <varname>table</varname>.<varname>column</varname></codeph>.
+        <ph rev="2.11.0 IMPALA-4252">Setting the query option <codeph>EXPLAIN_LEVEL=2</codeph> adds
+        annotations showing the type of the filter, either <codeph><varname>filter_id</varname>[bloom]</codeph>
+        (for HDFS-based tables) or <codeph><varname>filter_id</varname>[min_max]</codeph> (for Kudu tables).</ph>
       </p>
 
       <p>
@@ -507,6 +517,8 @@ select c1 from huge_t1 join [shuffle] huge_t2
         The runtime filtering feature is most effective for the Parquet file formats.
         For other file formats, filtering only applies for partitioned tables.
         See <xref href="impala_runtime_filtering.xml#runtime_filtering_file_formats"/>.
+        For the ways in which runtime filtering works for Kudu tables, see
+        <xref href="impala_kudu.xml#kudu_performance"/>.
       </p>
 
       <!-- To do: check if this restriction is lifted in 5.8 / 2.6. -->
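
The `impala_runtime_filtering.xml` hunk above says that partitioned filters reflect only the values processed by one host and must be combined into one before scan nodes can use them. A minimal sketch of what such a combination could look like follows. The function names, the 64-bit toy bitmap standing in for a Bloom filter, and the single modulo hash are all assumptions for illustration; none of this is Impala's actual implementation.

```python
from typing import List, Optional, Tuple

Range = Optional[Tuple[int, int]]


def partial_range(keys: List[int]) -> Range:
    """Min/max of the join keys that one host happened to process."""
    return (min(keys), max(keys)) if keys else None


def merge_ranges(partials: List[Range]) -> Range:
    """Combine per-host ranges: smallest minimum, largest maximum."""
    present = [p for p in partials if p is not None]
    if not present:
        return None
    return (min(lo for lo, _ in present), max(hi for _, hi in present))


def partial_bits(keys: List[int], num_bits: int = 64) -> int:
    """Toy Bloom-style bitmap with a single hash function (key modulo size)."""
    bits = 0
    for key in keys:
        bits |= 1 << (key % num_bits)
    return bits


def merge_bits(partials: List[int]) -> int:
    """Combine per-host bitmaps with bitwise OR."""
    merged = 0
    for bits in partials:
        merged |= bits
    return merged


# Three hypothetical hosts, each seeing a different slice of the join keys.
host_keys = [[10, 12], [], [40, 7]]

ranges = [partial_range(keys) for keys in host_keys]
merged_range = merge_ranges(ranges)

bitmaps = [partial_bits(keys) for keys in host_keys]
merged_bitmap = merge_bits(bitmaps)

# Using only the first host's range would wrongly reject key 40,
# which is why a partial filter cannot be applied on its own.
first_lo, first_hi = ranges[0]
key_40_in_first_only = first_lo <= 40 <= first_hi
key_40_in_merged = merged_range[0] <= 40 <= merged_range[1]
```

In this sketch the merged filter accepts every key that any host saw, while each partial filter alone would drop valid rows. That matches the diff's statement that the combined result is needed to filter accurately.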