pull new version #3

Merged
merged 331 commits into from
Nov 2, 2017

Conversation

yuzzjj
Owner

@yuzzjj yuzzjj commented Nov 2, 2017

No description provided.

smukil and others added 30 commits August 17, 2017 16:09
HAVE_PIPE2 is a variable that tracks whether the platform provides the
pipe2() system call.

This value was not propagated to the appropriate file that uses it,
causing its value to always be 0, and the wrong branch to be taken
at compile time.

This fixes it by propagating the value to the file.
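The pattern involved looks roughly like the sketch below (illustrative only;
MakePipe is a hypothetical name, not the actual call site):

  // HAVE_PIPE2 must be propagated correctly, otherwise the fallback branch is
  // always compiled in even on platforms that provide pipe2().
  #include <fcntl.h>
  #include <unistd.h>

  int MakePipe(int fds[2]) {
  #if defined(HAVE_PIPE2) && HAVE_PIPE2
    // Create the pipe and set close-on-exec atomically.
    return pipe2(fds, O_CLOEXEC);
  #else
    // Fallback: create the pipe, then set close-on-exec separately.
    if (pipe(fds) != 0) return -1;
    fcntl(fds[0], F_SETFD, FD_CLOEXEC);
    fcntl(fds[1], F_SETFD, FD_CLOEXEC);
    return 0;
  #endif
  }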

Change-Id: I6cdc343da35a34be8d95fbea3543d080dbc1ec29
Reviewed-on: http://gerrit.cloudera.org:8080/7705
Reviewed-by: Henry Robinson <[email protected]>
Tested-by: Impala Public Jenkins
The original test case accessed the 'pos' field of nested
collections. The query results could vary when reloading
the data because the order of items within a nested
collection is not necessarily the same across loads.

This patch reformulates the test to avoid 'pos'.

Change-Id: I32e47f0845da8b27652faaceae834e025ecff42a
Reviewed-on: http://gerrit.cloudera.org:8080/7708
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Adds the following flag:

  -buffer_pool_clean_pages_limit ((Advanced) Limit on bytes of clean spilled
    pages that will be accumulated in the buffer pool. Specified as number of
    bytes ('<int>[bB]?'), megabytes ('<float>[mM]'), gigabytes
    ('<float>[gG]'), or percentage of the buffer pool limit ('<int>%').
    Defaults to bytes if no unit is given..) type: string default: "10%"
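As a rough illustration (not Impala's actual flag parser; the function name and
error handling are simplified assumptions), a spec in this format could be
interpreted like so:

  #include <cstdint>
  #include <cstdlib>
  #include <string>

  // Returns the limit in bytes for specs such as "1000", "512m", "1g" or "10%".
  int64_t ParseCleanPagesLimit(const std::string& spec, int64_t buffer_pool_limit) {
    char* end = nullptr;
    double val = strtod(spec.c_str(), &end);
    switch (end != nullptr && *end != '\0' ? *end : 'b') {
      case '%': return static_cast<int64_t>(buffer_pool_limit * val / 100.0);
      case 'm': case 'M': return static_cast<int64_t>(val * (1LL << 20));
      case 'g': case 'G': return static_cast<int64_t>(val * (1LL << 30));
      default: return static_cast<int64_t>(val);  // plain bytes
    }
  }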

This helps prevent Impala accumulating excessive amounts of clean pages,
which can cause some problems in practice:
* The OS buffer cache is squeezed, reducing I/O performance from HDFS
  and potentially reducing the ability of the OS to absorb writes from
  Impala without blocking.
* Impala process memory consumption can expand more than users or admins
  might expect. E.g. if one query is running with a mem_limit of 1GB,
  many people will be surprised if the process inflates to the full
  process limit of 100GB. Impala doesn't provide any guarantees beyond
  staying within the process mem_limit, but this could be a
  surprising divergence from past behaviour.

Observability:
A new metric buffer-pool.clean-pages-limit is added.

Testing:
Added a backend test to directly test that clean pages are evicted.
Ran in a loop to flush out any flakiness.

Ran exhaustive tests.

Change-Id: Ib6b687ab4bdddf07d9d6c997ff814aa3976042f9
Reviewed-on: http://gerrit.cloudera.org:8080/7653
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
The change to address IMPALA-5769 added periodic cleaning for minidumps,
which got in the way of the other minidump tests.

This change sets max_minidumps to the default value (9) for all tests to
keep the cleanup thread from interfering, and then sets a smaller limit
where needed.

Change-Id: I977930ae87b8d4671a89c1e07ba76b12eb92fa55
Reviewed-on: http://gerrit.cloudera.org:8080/7716
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Addresses some comments from the CR:
https://gerrit.cloudera.org/#/c/7630/8

Notably changes the name of initial_reservation_total_bytes
to initial_reservation_total_claims and attempts to clarify
comments.

Change-Id: I1391d99ca15d2ebcaabd564fbe1242806be09f72
Reviewed-on: http://gerrit.cloudera.org:8080/7681
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
In the text scanner, we were writing the partial tuple variable length
data to data_buffer_pool_ mempool which caused strange behavior, such
as incorrect results.

If we are scanning compressed data, the pool gets attached to the row
batch at the end of a GetNext() call and gets freed before the next
GetNext() call. This is wrong because we expect the data in the partial
tuple to survive between the GetNext() calls. If we are scanning non
compressed data, data_buffer_pool_ never gets cleared and grows over
time until the scanner finishes reading the scan range.

We fix the problem by writing the varlen partial tuple data to
boundary_pool, which is where the constant length partial tuple data is
written. We also make sure that boundary pool does not hold any tuple
data of returned batches by always deep copying it to output batches.

Testing:
- Ran some tests locally on ASAN build.
- Updated test_scanners_fuzz.py to make slightly more significant
  changes to the data files. This change was helpful for finding
  issues while developing this patch.

Change-Id: I60ba5c113aefd17f697c1888fd46a237ef396540
Reviewed-on: http://gerrit.cloudera.org:8080/7639
Reviewed-by: Taras Bobrovytsky <[email protected]>
Tested-by: Impala Public Jenkins
There was a bug in the BE evaluation logic of the
TupleIsNullPredicate which could lead to wrong results
for certain plan shapes.

A TupleIsNullPredicate should evaluate to true only if
all specified tuples are NULL. This was always the intent
of the FE and is also documented in the BE as the required
behavior.

Testing:
- Added regression test
- Core tests passed

Change-Id: Id659f849a68d88cfe22c65dd1747dd6d6a916163
Reviewed-on: http://gerrit.cloudera.org:8080/7737
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Fix: Sessions could be closed some time after their timeout expired
     because the session expiration thread only woke up every
     session-timeout/2 seconds. This fix changes the logic so that the
     session expiration thread wakes up every 1 sec to check whether
     sessions can be expired.
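A minimal sketch of the loop after the fix (names are illustrative, not the
actual impalad code):

  #include <chrono>
  #include <thread>

  // Stand-in for the real check: expire any session idle longer than its timeout.
  void ExpireIdleSessions() { /* iterate sessions, close expired ones */ }

  void SessionExpirationLoop() {
    // Wake every 1 second instead of every session-timeout/2, so sessions are
    // expired close to when their idle timeout actually lapses.
    while (true) {
      std::this_thread::sleep_for(std::chrono::seconds(1));
      ExpireIdleSessions();
    }
  }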

    The test changes in session-expiry-test introduce variable names
    for the numeric constants and reduce max_idle_timeout_ms.

Change-Id: Ic62f1ece6c128a80676c6eb331ef190a038bee76
Reviewed-on: http://gerrit.cloudera.org:8080/7709
Tested-by: Impala Public Jenkins
Reviewed-by: Henry Robinson <[email protected]>
The previous commit for IMPALA-5800 imported a slightly old version of
squeasel.c from the upstream project. In particular, the strings used to
choose the TLS version were incompatible with Impala's. This patch
corrects the error.

Testing: Add tests to make sure that the expected flags are supported by
Squeasel.

Change-Id: I03b76f2d3b3e2e6133e5c57a4f33ac315c889e15
Reviewed-on: http://gerrit.cloudera.org:8080/7752
Reviewed-by: Bharath Vissapragada <[email protected]>
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Impala Public Jenkins
The DCHECK is only valid if the scanner successfully processes
the entire scan range. If the scanner is cancelled or hits an
error in-between buffers, boundary_column_ may be non-empty.

The fix is to move the DCHECK to the end of FinishScanRange(),
where it only executes if the scan range is fully processed.

Testing:
Ran TestCancellation in a loop locally for an hour.

Change-Id: I3d63f0b0c1ac82103a60b68b244156b9fbd575e0
Reviewed-on: http://gerrit.cloudera.org:8080/7756
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I1e86ce13d1fdc73487b8067f3670ee73b9269366
Reviewed-on: http://gerrit.cloudera.org:8080/7767
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
We were missing accounting for this, since it is part of the expected
difference between query and process memory consumption. The identity
that applies is:

  buffers allocated from system =
      reservation + cached buffers - unused reservation

Where "cached buffers" includes free buffers and buffers attached to
clean pages. The reservation is accounted against queries and "buffers
allocated from system" is accounted against the process MemTracker.

Reporting this in a direct way required adding a MemTracker with a
negative consumption, which fortunately did not require
any major changes to the MemTracker code.

Example output when applied to buffer pool branch:

  Process: Limit=8.35 GB Total=579.18 MB Peak=590.41 MB
    Buffer Pool: Free Buffers: Total=268.25 MB
    Buffer Pool: Clean Pages: Total=172.25 MB
    Buffer Pool: Unused Reservation: Total=-8.25 MB
    Free Disk IO Buffers: Total=21.98 MB Peak=21.98 MB
    RequestPool=default-pool: Total=12.07 MB Peak=71.58 MB
      ... <snip> ...
    RequestPool=fe-eval-exprs: Total=0 Peak=4.00 KB
    Untracked Memory: Total=112.88 MB

Testing:
Added a basic test for MemTrackers with negative metrics.

Change-Id: Idb1fa3110dc893321f9f4e8ced6b7ede12194dad
Reviewed-on: http://gerrit.cloudera.org:8080/7380
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
This is implemented in the ResourceProfileBuilder to avoid duplicating
the logic in every plan node.

Testing:
Updated planner tests.

Change-Id: I1e2853300371e31b13d81a763dbafb21709b16c4
Reviewed-on: http://gerrit.cloudera.org:8080/7703
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Rejects queries during admission control if:
* the largest (across all backends) min buffer reservation is
  greater than the query mem_limit or buffer_pool_limit
* the sum of the min buffer reservations across the cluster
  is larger than the pool max mem resources

There are some other interesting cases to consider later:
* every per-backend min buffer reservation is less than the
  associated backend's process mem_limit; the current
  admission control code doesn't know about other backends'
  proc mem_limits.

Also reduces minimum non-reservation memory (IMPALA-5810).
See the JIRA for experimental results that show this
slightly improves min memory requirements for small queries.
One reason to tweak this is to compensate for the fact that
BufferedBlockMgr didn't count small buffers against the
BlockMgr limit, but BufferPool counts all buffers against
it.

Testing:
* Adds new test cases in test_admission_controller.py
* Adds BE tests in reservation-tracker-test for the
  reservation-util code.

Change-Id: Iabe87ce8f460356cfe4d1be4d7092c5900f9d79b
Reviewed-on: http://gerrit.cloudera.org:8080/7678
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I9b1698ee423c0db419b5de74c3664c88294fff49
Reviewed-on: http://gerrit.cloudera.org:8080/7768
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
The join inversion optimisation did not factor in the degree of
parallelism that the join executed with after inversion. In some cases
this led to bad decisions, e.g. executing a join on a single node
instead of 20 nodes.

This patch adds a more sophisticated cost model that factors degree
of parallelism into the join inversion decision.

The behaviour is unchanged if inversion does not change the degree of
parallelism.

Perf:
Ran cluster TPC-H and TPC-DS benchmarks. Average changes were small:
< 3%. Saw a mix of improvements and regressions. We were satisfied
that the regressions were cases when the planner "got lucky" previously.
E.g. on TPC-H Q2 a join was flipped to put lineitem on the left as a
result of inaccurate cardinality estimates.

Mostafa also ran a TPC-DS benchmark where the dimension tables were
loaded with num_nodes=1 to minimise the number of files. We saw some
huge speedups there on the unmodified queries, e.g. TPCDS-Q10 went from
291s to 32.25s. The worst percentage regression was Q50, which went
from 1.61s to 2.4s and the worst absolute regression was Q72, which
went from 694s to 874s (25%).

Change-Id: Icacea4565ce25ef15aaab014684c9440dd501d4e
Reviewed-on: http://gerrit.cloudera.org:8080/7351
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Make the call to format() compatible with older versions of python
(< 2.7), which expect explicit indices in the string being formatted,
e.g. "{0} {1} {2}".format('foo', 'bar', 'baz'). Without the numbers,
format() raises an exception.

Tested by running this test suite using python 2.6.6. Before the
patch, the tests failed. After the patch, they pass.

Change-Id: I5384aaf83a6a1f3c7643ed9f15de2dba1a5913a5
Reviewed-on: http://gerrit.cloudera.org:8080/7761
Reviewed-by: David Knupp <[email protected]>
Tested-by: Impala Public Jenkins
…eIPA

FreeIPA is a piece of software that automates and simplifies management
of MIT krb5, SSSD, some LDAP service, etc. FreeIPA configures a
localauth plugin[1] in krb5.conf to map Kerberos principals to local
usernames. In this configuration, Kudu daemons were failing to start up
due to failure to map their own service principals back to a username.
This is due to a number of issues:

1) FreeIPA distinguishes between service principals and user principals
and doesn't store a 'uid' field in LDAP for service principals. Thus,
when 'sssd' tries to map a service principal to a local unix user, it
determines that there is no such user (ie getpwnam() fails). This is by
design, best I can tell.

2) sssd's implementation of krb5_auth_to_localname[1] uses getpwnam to try
to map the kerberos principal to the local username. Because of the
above, it fails for service principals.

3) Prior to el7.3, sssd configures krb5 with 'enable_only = sssd' in the
localauth plugin section. This means that if sssd fails to perform the
mapping, it does not fall back to other mappings defined in krb5.conf
(eg explicitly defined auth_to_local rules). See [2]

4) Even after 7.3, there is an additional bug in sssd which I just
filed[3], which causes the fallback to still not work. Because of this,
we're getting the KRB5_PLUGIN_NO_HANDLE error code back up at the Kudu
layer.

We already have our own fallback case for KRB5_LNAME_NO_TRANS, and it
seems like we should just be handling PLUGIN_NO_HANDLE in the same way
to workaround the above behavior.

I tested this patch on a FreeIPA-configured system on el6.7. I was able
to successfully start a master with a FreeIPA-provided keytab and
authentication required, and use 'kudu table list' to authenticate to
it.

[1] https://github.com/SSSD/sssd/blob/master/src/krb5_plugin/sssd_krb5_localauth_plugin.c
[2] https://bugzilla.redhat.com/show_bug.cgi?id=1297462
[3] https://pagure.io/SSSD/sssd/issue/3459

Change-Id: I7bc13b33053a73784350c9d30a3796a96d318c04
Reviewed-on: http://gerrit.cloudera.org:8080/7551
Tested-by: Kudu Jenkins
Reviewed-by: Alexey Serbin <[email protected]>
Reviewed-on: http://gerrit.cloudera.org:8080/7745
Reviewed-by: Michael Ho <[email protected]>
Tested-by: Impala Public Jenkins
This change enforces the rule that constants under be/src/kudu directory
should be named kFooBar and fixes the cases where we didn't adhere to this.
This is needed to avoid conflicts when cherry-picking other changes.

Change-Id: I7971659ef3152580d44d6ddfb18be7ebf41052c7
Reviewed-on: http://gerrit.cloudera.org:8080/7158
Reviewed-by: Adar Dembo <[email protected]>
Tested-by: Kudu Jenkins
Reviewed-on: http://gerrit.cloudera.org:8080/7740
Reviewed-by: Henry Robinson <[email protected]>
Tested-by: Impala Public Jenkins
In Coordinator::UpdateBackendExecStatus(), we check if the backend
has already completed with BackendState::IsDone() and return without
applying the update if so to avoid updating num_remaining_backends_
twice for the same completed backend.

The problem is that the value of BackendState::IsDone() is updated by
the call to BackendState::ApplyExecStatusReport() that comes after it,
but these operations are not performed atomically, so if there are
two simultaneous calls to UpdateBackendExecStatus(), they can both
call IsDone(), both get 'false', and then proceed to erroneously both
update num_remaining_backends_, hitting a DCHECK.

This patch modifies ApplyExecStatusReport to return true iff this
report transitioned the backend to a done status, and then only
updates num_remaining_backends_ in this case, ensuring it is only
updated once per backend.
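The shape of the fix is roughly as follows (hypothetical types; the real
Coordinator/BackendState code carries much more state):

  #include <mutex>

  class BackendState {
   public:
    // Returns true iff this call transitioned the backend to the done state,
    // so the caller decrements num_remaining_backends_ exactly once.
    bool ApplyExecStatusReport(bool report_marks_done) {
      std::lock_guard<std::mutex> l(lock_);
      if (done_ || !report_marks_done) return false;
      done_ = true;
      return true;
    }

   private:
    std::mutex lock_;
    bool done_ = false;
  };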

Testing:
- Ran test_finst_cancel_when_query_complete 10,000 times without
  hitting the DCHECK (previously, it would hit about once per 300
  runs).

Change-Id: I1528661e5df6d9732ebfeb414576c82ec5c92241
Reviewed-on: http://gerrit.cloudera.org:8080/7577
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I1212de7ed4e1f93767e00d1f3245b02bb88fa015
Reviewed-on: http://gerrit.cloudera.org:8080/7774
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Adds support for a "max_row_size" query option that instructs Impala
to reserve enough memory to process rows of the specified size. For
spilling operators, the planner reserves enough memory to process
rows of this size. The advantage of this compared to simply
specifying larger values for min_spillable_buffer_size and
default_spillable_buffer_size is that operators may be able to
handle larger rows without increasing the size of all their
buffers.

The default value is 512KB. I picked that number because it doesn't
increase minimum reservations *too* much even with smaller buffers
like 64kb but should be large enough for almost all reasonable
workloads.

This is implemented in the aggs and joins using the variable page size
support added to BufferedTupleStream in an earlier commit. The synopsis
is that each stream requires reservation for one default-sized page
per read and write iterator, and temporarily requires reservation
for a max-sized page when reading or writing larger pages. The
max-sized write reservation is released immediately after the row
is appended and the max-size read reservation is released after
advancing to the next row.

The sorter and analytic simply use max-sized buffers for all pages
in the stream.

Testing:
Updated existing planner tests to reflect default max_row_size. Added
new planner tests to test the effect of the query option.

Added "set" test to check validation of query option.

Added end-to-end tests exercising spilling operators with large rows
with and without spilling induced by SET_DENY_RESERVATION_PROBABILITY.

Change-Id: Ic70f6dddbcef124bb4b329ffa2e42a74a1826570
Reviewed-on: http://gerrit.cloudera.org:8080/7629
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
This patch adds a new command "rerun" and a shortcut "@" to impala-shell.
Users can rerun a certain query by its index as given by the history command.
A valid index is an integer in [1, history_length] or
[-history_length, -1]. Negative values index history in reverse order.
For example, "@1;" or "rerun 1;" reruns the first query shown in history
and "@-1;" reruns the last query. The rerun command itself won't appear
in history. The history index is 1-based and increasing. Old entries
might be truncated when impala-shell starts, and the indexes will be
realigned to 1, so the same index may refer to different commands among
multiple impala-shell instances.
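The index handling amounts to something like the sketch below (illustrative
only; impala-shell itself is Python, this just shows the arithmetic):

  #include <string>
  #include <vector>

  // Maps "@1".."@n" to history[0..n-1] and "@-1".."@-n" to entries counted
  // from the end; returns nullptr for an out-of-range index.
  const std::string* LookupHistory(const std::vector<std::string>& history, int idx) {
    const int n = static_cast<int>(history.size());
    if (idx >= 1 && idx <= n) return &history[idx - 1];
    if (idx <= -1 && idx >= -n) return &history[n + idx];
    return nullptr;
  }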

Testing: A test case test_rerun is added to
shell/test_shell_interactive.py

Change-Id: Ifc28e8ce07845343267224c3b9ccb71b29a524d2
Reviewed-on: http://gerrit.cloudera.org:8080/7674
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Impala Public Jenkins
Fixed a bug where impala crashes during execution of an aggregation
query using nondeterministic grouping expressions. This happens when
it tries to rebuild a spilled partition that can fit in memory and rows
get re-hashed to a partition other than the spilled one due to the use
of nondeterministic expressions.

Testing:
Added a query test to verify successful execution.

Change-Id: Ibdb09239577b3f0a19d710b0d148e882b0b73e23
Reviewed-on: http://gerrit.cloudera.org:8080/7714
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
Currently, a file handle in the file handle cache will
only be evicted if the cache reaches its capacity. This
means that file handles can be retained for an indefinite
amount of time. This is true even for files that have
been deleted, replaced, or modified. Since a file handle
maintains a file descriptor for local files, this can
prevent the disk space from being freed. Additionally,
unused file handles are wasted memory.

This adds code to evict file handles that have been
unused for longer than a specified threshold. A thread
periodically checks the file handle cache to see if
any file handle should be evicted. The threshold is
specified by 'unused_file_handle_timeout_sec'; it
defaults to 6 hours.
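A sketch of the periodic eviction pass (hypothetical types; the real cache is
keyed by file and more involved, this only shows the timeout comparison):

  #include <chrono>
  #include <list>
  #include <unistd.h>

  struct CachedHandle {
    int fd;                                            // open file descriptor
    std::chrono::steady_clock::time_point last_used;   // last access time
  };

  void EvictUnusedHandles(std::list<CachedHandle>* cache,
                          std::chrono::seconds unused_timeout) {
    const auto now = std::chrono::steady_clock::now();
    for (auto it = cache->begin(); it != cache->end();) {
      if (now - it->last_used >= unused_timeout) {
        close(it->fd);         // release the descriptor so disk space can be freed
        it = cache->erase(it);
      } else {
        ++it;
      }
    }
  }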

This adds a test to custom_cluster/test_hdfs_fd_caching.py
to verify the eviction behavior.

Change-Id: Iefe04b3e2e22123ecb8b3e494934c93dfb29682e
Reviewed-on: http://gerrit.cloudera.org:8080/7640
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Seems to have broken with some recent commits.

Change-Id: I9c22e197662228158d7935ebfb12d9b3691eb499
Reviewed-on: http://gerrit.cloudera.org:8080/6151
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
The bug is that OutputAllBuild() called BufferedTupleStream::GetNext()
while 'out_batch' still referenced data from the current page
of the stream. When iterating over an unpinned stream, GetNext()
sets the 'needs_deep_copy' flag when it hits the end of a page,
so that the caller has an opportunity to flush or deep copy the
data. On the next call to GetNext(), that page may be deleted
or unpinned.

The fix is to check whether the batch is at capacity before
calling BTS::GetNext().

This issue was masked by not using the 'delete_on_read' mode of the
stream, which would have freed the stream's buffers earlier and
increased the odds of ASAN detecting the problem.

Testing:
Running TestTPCHJoinQueries.test_outer_joins() reliably reproduced this
for me locally under ASAN. After the fix the problem does not reoccur.

Change-Id: Ia14148499ddaec41c2e70fef5d53e5d06ea0538d
Reviewed-on: http://gerrit.cloudera.org:8080/7772
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
This patch adds decimal type codegen support in text scanner. Currently
codegen would be disabled if there is a decimal column. With this patch
StringParser::StringToDecimal will be called in generated code. A new
file util/string-parser.cc is created and linked into libUtil. This file
contains proxy functions to StringToDecimal in ordered to keep
StringToDecimal out of LLVM IR.

In a benchmark query:
> select l_quantity, l_extendedprice, l_discount, l_tax from biglineitem where l_quantity > 100.0;
where biglineitem is tpch.lineitem repeated 6 times, the codegen version
is 19% faster than the non-codegen version in scanning, and 8% faster in
query time. Codegen time in this simple case is 69ms.

Simple performance tests show that putting the parser in libUtil instead
of impala-sse.bc would reduce codegen time by 2/3 in cases where only
one decimal column is parsed while the scanning time is nearly the same.

Change-Id: Ia65820e969d59094dc92d912a5279fa90f6b179d
Reviewed-on: http://gerrit.cloudera.org:8080/7683
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Sometimes the client is not open when the debug action fires at the
start of Open() or Prepare(). In that case we should set the
probability when the client is opened later.

This caused one of the large row tests to start failing with a "failed
to repartition" error in the aggregation. The error is a false positive
caused by two distinct keys hashing to the same partition. Removing the
check allows the query to succeed because the keys hash to different
partitions in the next round of repartitioning.

If we repeatedly get unlucky and have collisions, the query will still
fail when it reaches MAX_PARTITION_DEPTH.

Testing:
Ran TestSpilling in a loop for a couple of hours, including the
exhaustive-only tests.

Change-Id: Ib26b697544d6c2312a8e1fe91b0cf8c0917e5603
Reviewed-on: http://gerrit.cloudera.org:8080/7771
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
In ParquetLevelDecoder::Init() for RLE encoding, we read the metadata
size and advance the data buffer past it. If the metadata size is
corrupted, it can cause us to incorrectly read past the end of the
buffer.

This patch checks that the metadata size is less than the total size
of the buffer, and returns an error if it isn't.
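The added check boils down to the following (a sketch with simplified types,
not the actual ParquetLevelDecoder code):

  #include <cstdint>

  // Validates the RLE metadata size before advancing past it; a corrupted size
  // would otherwise let the decoder read beyond the end of the buffer.
  bool CheckAndSkipMetadata(const uint8_t** buffer, int64_t* buffer_len,
                            int32_t metadata_size) {
    if (metadata_size < 0 || metadata_size > *buffer_len) return false;  // corrupt
    *buffer += metadata_size;
    *buffer_len -= metadata_size;
    return true;
  }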

Testing:
- Ran test_scanners_fuzz.py under ASAN 500 times without hitting the
  use-after-poison (previously it would usually hit in < 100 runs).

Change-Id: I3f3d0d998f7581c7c935d98fde886f145efd61a8
Reviewed-on: http://gerrit.cloudera.org:8080/7769
Reviewed-by: Alex Behm <[email protected]>
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
glglwty and others added 29 commits October 24, 2017 15:26
Filter states indexed by FilterStats::ROW_GROUPS_KEY were only relevant
if the runtime filter was bound by partitioning columns. This is no
longer true since an always-false filter can filter out row groups as
well, which resulted in a crash in CheckForAlwaysFalse(). This patch
registers the row groups counters unconditionally to fix this bug.

Change-Id: I3eda43845b78516c1e76e07e0d2dd9d862168e1d
Reviewed-on: http://gerrit.cloudera.org:8080/8369
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Uses a thread pool to issue many compute stats commands in parallel to
Impala, rather than doing it serially. Where it was obvious, I combined
multiple stats commands into fewer, to reduce the number
of "show databses" and serialized "show tables" commands.

This speeds up the compute stats step in data loading significantly. My
measurements for testdata/bin/compute-table-stats.sh running before and
after this change, with the Impala daemons restarted (cold) or not
restarted (warm) on an 8-core, 32GB RAM machine were:

old, cold: 7m44s
new, cold: 1m42s

old, warm: 1m23s
new, warm:   48s

The data load in the full test build behaves in a cold fashion. It's
typical for https://jenkins.impala.io/job/ubuntu-16.04-from-scratch/ to
run this compute stats step for 9 or 10 minutes. With this change, this
will come down to about 2 minutes.

Change-Id: Ifb080f2552b9dbe304ecadd6e52429214094237d
Reviewed-on: http://gerrit.cloudera.org:8080/8354
Reviewed-by: David Knupp <[email protected]>
Tested-by: Impala Public Jenkins
This commit loads functional-query, TPC-H data, and TPC-DS data in
parallel. In parallel, these take about 37 minutes, dominated by
functional-query. Serially, these take about 30 minutes more, namely the
13 minutes of tpch and 16 minutes of tpcds. This works out nicely
because CPU usage during data load is very low in aggregate. (We don't
sustain more than 1 CPU of load, whereas build machines are likely to
have many CPUs.)

To do this, I added support to run-step.sh to have a notion of a
backgroundable task, and support waiting for all tasks.

I also increased the heapsize of our HiveServer2 server. When datasets
were being loaded in parallel, we ran out of memory at 256MB of heap.

The resulting log output is currently like so (but without the
timestamps):

15:58:04  Started Loading functional-query data in background; pid 8105.
15:58:04  Started Loading TPC-H data in background; pid 8106.
15:58:04  Loading functional-query data (logging to /home/impdev/Impala/logs/data_loading/load-functional-query.log)...
15:58:04  Started Loading TPC-DS data in background; pid 8107.
15:58:04  Loading TPC-H data (logging to /home/impdev/Impala/logs/data_loading/load-tpch.log)...
15:58:04  Loading TPC-DS data (logging to /home/impdev/Impala/logs/data_loading/load-tpcds.log)...
16:11:31    Loading workload 'tpch' using exploration strategy 'core' OK (Took: 13 min 27 sec)
16:14:33    Loading workload 'tpcds' using exploration strategy 'core' OK (Took: 16 min 29 sec)
16:35:08    Loading workload 'functional-query' using exploration strategy 'exhaustive' OK (Took: 37 min 4 sec)

I tested dataloading with the following command on an 8-core, 32GB
machine. I saw 19GB of available memory during my run:
  ./buildall.sh -testdata -build_shared_libs -start_minicluster -start_impala_cluster -format

Change-Id: I836c4e1586f229621c102c4f4ba22ce7224ab9ac
Reviewed-on: http://gerrit.cloudera.org:8080/8320
Reviewed-by: Jim Apple <[email protected]>
Reviewed-by: Michael Brown <[email protected]>
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Before this patch, decimal operations would never produce an error.
Division by and modulo zero would result in a NULL.

In this patch, we change this behavior so that we raise an error
instead of returning a NULL. We also modify the format of the decimal
expr tests to also include an error field.

Testing:
- Added several expr and end to end tests.

Change-Id: If7a7131e657fcdd293ade78d62f851dac0f1e3eb
Reviewed-on: http://gerrit.cloudera.org:8080/8344
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
The bug is that the pointer was a uint8_t*, which C++ iostreams
print as a C-style string. The intent was to instead print the
memory address, which happens for void*. To avoid the subtlety,
instead convert to uintptr_t and print as hex.
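A minimal demonstration of the bug and the chosen fix:

  #include <cstdint>
  #include <iostream>

  int main() {
    uint8_t buf[4] = {1, 2, 3, 4};
    uint8_t* p = buf;
    // Bug: iostreams treat (unsigned) char* as a NUL-terminated C string.
    // std::cout << p;
    // Fix: convert to uintptr_t and print the address as hex.
    std::cout << std::hex << reinterpret_cast<uintptr_t>(p) << std::endl;
    return 0;
  }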

Change-Id: I89250646bf683dd2d636dcb37a66ceb0428af8b2
Reviewed-on: http://gerrit.cloudera.org:8080/8371
Reviewed-by: anujphadke <[email protected]>
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
Testing:
Added test case to verify that CopyRows in select node is successfully
codegened.
Improved test coverage for select node with limit.

Performance:
Queries used (num_nodes set to 1):
500 Predicates: select * from (select * from tpch_parquet.lineitem
limit 6001215) t1 where l_partkey > 10 and l_extendedprice > 10000 and
l_linenumber > 1 and l_comment >'foo0' .... and l_comment >'foo500'
order by l_orderkey limit 10;

1 Predicate: select * from (select * from tpch_parquet.lineitem
limit 6001215) t1 where l_partkey > 10 order by l_orderkey limit 10;

+--------------+-----------------------------------------------------+
|              |      500 Predicates      |       1 Predicate        |
|              +------------+-------------+------------+-------------+
|              |   After    |   Before    |   After    |   Before    |
+--------------+------------+-------------+------------+-------------+
| Select Node  | 12s385ms   | 1m1s        | 234ms      | 797ms       |
| Codegen time | 2s619ms    | 1s962ms     | 200ms      | 181ms       |
+--------------+------------+-------------+------------+-------------+

Change-Id: Ie0d496d004418468e16b6f564f90f45ebbf87c1e
Reviewed-on: http://gerrit.cloudera.org:8080/8196
Reviewed-by: Bikramjeet Vig <[email protected]>
Tested-by: Impala Public Jenkins
Dataload typically follows a pattern of loading data into
a text version of a table, and then using an insert
overwrite from the text table to populate the table for
other file formats. This insert is always done in Impala
for Parquet and Kudu. Otherwise it runs in Hive.

Since Impala doesn't support writing nested data, the
population of complextypes_fileformat tries to hack
the insert to run in Hive by including it in the ALTER
part of the table definition. ALTER runs immediately
after CREATE and always runs in Hive. The problem is
that ALTER also runs before the base table
(functional.complextypes_fileformat) is populated.
The insert succeeds, but it is inserting zero rows.

This code change introduces a way to force the Parquet
load to run using Hive. This lets complextypes_fileformat
specify that the insert should happen in Hive and fixes
the ordering so that the table is populated correctly.

This is also useful for loading custom Parquet files
into Parquet tables. Hive supports the LOAD DATA LOCAL
syntax, which can read a file from the local filesystem.
This means that several locations that currently use
the hdfs commandline can be modified to use this SQL.
This change speeds up dataload by a few minutes, as it
avoids the overhead of the hdfs commandline.

Any other location that could use LOAD DATA LOCAL is
also switched over to use it. This includes the
testescape* tables, which now print the appropriate
LOAD DATA commands as a result of text_delims_table.py.
Any location that already uses LOAD DATA LOCAL is also
switched to indicate that it must run in Hive. Any
location that was doing an HDFS command in the LOAD
section is moved to the LOAD_DEPENDENT_HIVE section.

Testing: Ran dataload and core tests. Also verified that
functional_parquet.complextypes_fileformat has rows.

Change-Id: I7152306b2907198204a6d8d282a0bad561129b82
Reviewed-on: http://gerrit.cloudera.org:8080/8350
Reviewed-by: Joe McDonnell <[email protected]>
Tested-by: Impala Public Jenkins
'Test case 16' in test_row_filters has been failing occasionally on
ASAN as the runtime filters are not generated within the specified
RUNTIME_FILTER_WAIT_TIME_MS. The fix is to increase
RUNTIME_FILTER_WAIT_TIME_MS.

This patch updates all of the tests in test_row_filters to use the
same timeout, which is set to a higher value for ASAN builds.

Change-Id: Ia098735594b36a72f02bf7edd051171689618051
Reviewed-on: http://gerrit.cloudera.org:8080/8358
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I2c261f32aa90aec67dbd1e5f172323e4c3feaa1d
Reviewed-on: http://gerrit.cloudera.org:8080/8378
Reviewed-by: Jim Apple <[email protected]>
Tested-by: Impala Public Jenkins
This adds a warning to impala-shell if -r/--refresh_after_connect is
used, in anticipation of us removing the feature in a future version.

Change-Id: Id297f80c0f596a69ef8ecde948812b82d2a5c0fa
Reviewed-on: http://gerrit.cloudera.org:8080/8381
Reviewed-by: Philip Zeyliger <[email protected]>
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
The tpcds-q22a.test test file has a comment before a "set" command.
The regex used to match "set" commands does not handle preceding
comments, which are part of the query statement.

Testing:
Ran the test with the below command and confirmed that DECIMAL_V2 was
automatically set back to 0.

  impala-py.test tests/query_test/test_tpcds_queries.py -k 22a \
      --capture=no

Change-Id: Id549dd3369dd163f3b3c8fe5685a52e0e6b2d134
Reviewed-on: http://gerrit.cloudera.org:8080/8384
Reviewed-by: Michael Brown <[email protected]>
Tested-by: Impala Public Jenkins
We may be seeing a race with errors like "java.io.FileNotFoundException:
File /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist".

This reverts commit e020c37.

Change-Id: I46da93f4315a5a4bdaa96fa464cb51922bd6c419
Reviewed-on: http://gerrit.cloudera.org:8080/8386
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Currently, the FE generates a number of runtime filters and assigns
them to the single node plan without taking the value of
RUNTIME_FILTER_MODE and DISABLE_ROW_RUNTIME_FILTERING query options
into account.

The backend then removes filters from exec nodes, based on the
following rules:
1. If DISABLE_ROW_RUNTIME_FILTERING is set, filters are removed from
   the exec nodes that are marked as targets not bound by partition
   columns.

2. If RUNTIME_FILTER_MODE is set to LOCAL, filters are removed from
   the exec nodes that are marked as remote targets.

This may cause some confusion to users because they may see runtime
filters in the output of explain that are not applied when the query
is executed.

This change moves the logic of runtime filter pruning to the planner
in the FE. The runtime filter assignment is done on the distributed
plan and the above constraints are enforced there directly.

Change-Id: Id0f0b200e02442edcad8df3979f652d66c6e52eb
Reviewed-on: http://gerrit.cloudera.org:8080/7564
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <[email protected]>
…oter size

This commit introduces the compile-time constant READ_SIZE_MIN_VALUE
in the newly created common/global-flags.h

A static_assert checks if READ_SIZE_MIN_VALUE is greater than or equal to
HdfsParquetScanner::FOOTER_SIZE.

Also, moved the definition of read_size flag to global-flags.cc,
and declared it in global-flags.h.

Runtime checks test if read_size is greater than or equal to
READ_SIZE_MIN_VALUE. If not, the process is terminated.
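In outline (the constant values below are placeholders, not the real ones):

  #include <cstdint>
  #include <cstdlib>

  constexpr int64_t READ_SIZE_MIN_VALUE = 128 * 1024;  // placeholder minimum
  constexpr int64_t FOOTER_SIZE = 100 * 1024;          // placeholder footer size

  // Compile-time guarantee that a single read can hold a Parquet footer.
  static_assert(READ_SIZE_MIN_VALUE >= FOOTER_SIZE,
                "read_size must be large enough to hold a Parquet footer");

  // Runtime guard for the read_size flag; the commit terminates the process
  // if the configured value is below the minimum.
  void ValidateReadSize(int64_t read_size) {
    if (read_size < READ_SIZE_MIN_VALUE) std::abort();
  }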

Change-Id: Ib7364147e7daf5380f11368871f8479646b448da
Reviewed-on: http://gerrit.cloudera.org:8080/8366
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Implements multi threaded block metadata loading on the Catalog
server where we fetch block metadata for multiple partitions of a
single table in parallel. Number of threads to load the metadata is
controlled by the following two parameters (set on the Catalog server
startup and applies for each table load)

-max_hdfs_partitions_parallel_load(default=5)
-max_nonhdfs_partitions_parallel_load(default=20)

We use different thread pool sizes for HDFS and non-HDFS tables since
non-HDFS filesystems support a much higher throughput of RPC calls for
listStatus/listFiles. Based on our experiments, S3 showed a linear
speed-up (up to ~113x) with an increasing number of loading threads,
whereas the HDFS throughput was limited to ~5x in un-secure clusters
and up to ~3.7x in secure clusters. We narrowed it down to scalability
bottlenecks in HDFS RPC implementation (HADOOP-14558) on both the
server and the client side.

One thing to note here is that the thread pool based metadata fetching
is implemented only for loading HDFS block metadata and not for loading
HMS partition information. Our experiments showed that while loading
large partitioned tables, ~90% of the time is spent in connecting to NN
and loading the HDFS block information and optimizing the rest ~10% makes
the code unnecessarily complex without much gain.

Additional notes:

- The multithreading approach is implemented for
  * INVALIDATE (loading from scratch),
  * REFRESH (reusing existing md) code paths,
  * ALTER TABLE ADD/RECOVER PARTITIONS.

- This patch makes the implementation of ListMap thread-safe since
we use that data structure as a shared state between multiple partition
metadata loading threads.

Testing and Results:

- This patch doesn't add any new tests since there is enough test
coverage already. Passed core/exhaustive runs with HDFS/S3.

- We noticed up to ~113x speedup on S3 tables(thread_pool_size=160)
and up to ~5x speed up in un-secure HDFS clusters and ~3.7x in secure
HDFS clusters.

- Synthesized the following two large tables on HDFS and S3 and noticed
significant reduction in my test DDL queries.

  (1) 100K partitions + 1 million files
  (2) 80 partitions + 250K files

 100K-PARTITIONS-1M-FILES-CUSTOM-11-REFRESH-PARTITION     | -16.4%
 100K-PARTITIONS-1M-FILES-CUSTOM-08-ADD-PARTITION         | -17.25%
 80-PARTITIONS-250K-FILES-11-REFRESH-PARTITION            | -23.57%
 80-PARTITIONS-250K-FILES-S3-08-ADD-PARTITION             | -23.87%
 80-PARTITIONS-250K-FILES-09-INVALIDATE                   | -24.88%
 80-PARTITIONS-250K-FILES-03-RECOVER                      | -35.90%
 80-PARTITIONS-250K-FILES-07-REFRESH                      | -43.03%
 100K-PARTITIONS-1M-FILES-CUSTOM-12-QUERY-PARTITIONS      | -43.93%
 100K-PARTITIONS-1M-FILES-CUSTOM-05-QUERY-AFTER-INV       | -46.59%
 80-PARTITIONS-250K-FILES-10-REFRESH-AFTER-ADD-PARTITION  | -48.71%
 100K-PARTITIONS-1M-FILES-CUSTOM-07-REFRESH               | -49.02%
 80-PARTITIONS-250K-FILES-05-QUERY-AFTER-INV              | -49.05%
 100K-PARTITIONS-1M-FILES-CUSTOM-10-REFRESH-AFTER-ADD-PART | -51.87%
 80-PARTITIONS-250K-FILES-S3-03-RECOVER                   | -67.17%
 80-PARTITIONS-250K-FILES-S3-05-QUERY-AFTER-INV           | -76.45%
 80-PARTITIONS-250K-FILES-S3-07-REFRESH                   | -87.04%
 80-PARTITIONS-250K-FILES-S3-10-REFRESH-AFTER-ADD-PART    | -88.57%

Change-Id: I07eaa7151dfc4d56da8db8c2654bd65d8f808481
Reviewed-on: http://gerrit.cloudera.org:8080/8235
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
Impala currently kinits by forking off a child process. This
has proved to be expensive in many cases since the subprocess
tries to reserve as much memory as Impala is currently using
which can be quite a lot.

This patch adds a flag called 'use_kudu_kinit' that defaults to
true. When it's true, it uses the Kudu security library's kinit code
that programmatically uses the krb5 library to kinit.
When it's false, we run our current path which kicks off the
kinit-thread and forks off a kinit process periodically to reacquire
tickets based on FLAGS_kerberos_reinit_interval.

Converted existing tests in thrift-server-test to run with and
without kerberos. We now run this BE test with kerberos by using
Kudu's MiniKdc utility. This introduces a new dependency on some
kerberos binaries that are checked through FindKerberosPrograms.cmake.
Note that this is only a test dependency and not a dependency for
the impalad binaries and friends. Compilation will still succeed if
the kerberos binaries for the MiniKdc are not found, however, the
thrift-server-test will fail. We run with and without the
'use_kudu_kinit' flag.

TODO: Since the setting up and tearing down of our security code
isn't idempotent, we can run only any one test in a process with
Kerberos now (IMPALA-6085).

Updated bin/bootstrap_system.sh to install new sasl-gssapi
modules and the kerberos binaries required for the MiniKdc.
Also fixed a bug that didn't transfer the environment into 'sudo'
in bin/bootstrap_system.sh.

Testing: Verified with thrift-server-test and also manually on a
live kerberized cluster.

Change-Id: I9cea56cc6e7412d87f4c2e92399a2f91ea6af6c7
Reviewed-on: http://gerrit.cloudera.org:8080/7938
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Impala Public Jenkins
This patch removes BufferedTupleStream::GetRows. This function pins a
stream and reads all the rows into a single batch. It is not a good API
since it creates an arbitrarily large row batch. In this patch the call
sites pin the stream and then directly use GetNext to retrieve a single
batch at a time.

Testing: It passes existing tests. A test case for GetRows is removed.

Change-Id: I3831c38994da2b69775a9809ff01de5d23584414
Reviewed-on: http://gerrit.cloudera.org:8080/8226
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
We don't know the root cause yet but try to improve things:
* Eliminate one possible cause of flakiness - unfinished fragments left
  from previous queries.
* Print a profile if an assertion fails so we can see why it failed.

Testing:
Ran core tests.

Change-Id: Ic332dddd96931db807abb960db43b99e5fd0f256
Reviewed-on: http://gerrit.cloudera.org:8080/8403
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Release of backend resources for query execution on the coordinator
daemon should occur eagerly as soon as query execution is finished
or cancelled. Before this patch some resources managed by QueryState,
like scratch files, were only released when the query was closed
and all of the query's control structures were torn down.

These resources are referred to as "ExecResources" in various places to
distinguish them from resources associated with the client request (like
the result cache) that are still required after the query finishes
executing.

This first change does not solve the admission control problem for two
reasons:
* We don't release the "admitted" memory on the coordinator until
  the query is unregistered.
* Admission control still considers the memory reserved until the
  query memtracker is unregistered, which happens only when the
  QueryState is destroyed: see MemTracker::GetPoolMemReserved().

The flow is mostly similar to initial_reservation_refcnt_, except the
coordinator also holds onto a reference count, which is released
when either the final row is returned or cancellation is initiated.
After the coordinator releases its refcount, the resources can be
freed as soon as local fragments also release their refcounts.

Also clean up Coordinator slightly by preventing runtime_state() from
leaking out to ClientRequestState - instead it's possible to log
the query MemTracker by following parent links in the MemTracker.

This patch is partially based on Joe McDonnell's IMPALA-1575 patch.

Testing:
Ran core tests.

Change-Id: I41ff374b0403f10a145f7fee9b3145953ee32341
Reviewed-on: http://gerrit.cloudera.org:8080/8303
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
HdfsAvroScanner::CodegenMaterializeTuple generates a function whose size
is linear in the number of columns. On 1000-column tables, codegen time
is significant. This commit roughly halves it for such wide tables.
(Note that this had been much worse in recent history (<= Impala 2.9).)

It does so by breaking up MaterializeTuple() into multiple smaller
functions and then calling them in order. When breaking up into
200-column chunks, there is a noticeable speed-up.
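In plain C++ terms the chunking idea looks like this (illustrative only; the
real change emits LLVM IR helper functions rather than calling regular C++
functions):

  #include <algorithm>

  constexpr int kColumnsPerHelper = 200;  // the chosen step size

  // Stand-in for one generated helper that materializes columns [first, end).
  void MaterializeColumns(int first, int end) { /* per-column decode work */ }

  // Instead of one function linear in the total column count, call a sequence
  // of smaller helpers.
  void MaterializeTuple(int num_columns) {
    for (int first = 0; first < num_columns; first += kColumnsPerHelper) {
      MaterializeColumns(first, std::min(first + kColumnsPerHelper, num_columns));
    }
  }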

I've made the helper code for generating LLVM function prototypes
have a mutable function name, so that the builder can be re-used
multiple times.

I've checked by inspecting optimized LLVM that in the case where there's
only 1 helper function, code gets inlined so that there doesn't seem to
be an extra function.

I measured codegen time for various "step sizes." The case where there
are no helper functions is about 2.7s. The best case was about a step
size of 200, with timings of 1.3s.

For the query "select count(int_col16) from functional_avro.widetable_1000_cols",
codegen times as a function of step size are roughly as follows. This is
averaged across 5 executions, and rounded to 0.1s.

   step time
     10	2.4
     50	2.5
     75	2.9
    100	3.0
    125	3.0
    150	1.4
    175	1.3
    200	1.3 <-- chosen step size
    225	1.5
    250	1.4
    300	1.6
    400	1.6
    500	1.8
   1000	2.7

The raw data was generated like so, with some code that let me change the step size at runtime:

  $(for step in 10 50 75 100 125 150 175 200 225 250 300 400 500 1000; do for try in $(seq 5); do echo $step > /tmp/step_size.txt; echo -n "$step "; impala-shell.sh -q "select count(int_col16) from functional_avro.widetable_1000_cols; profile;" 2> /dev/null | grep -A9 'CodeGen:(Total: [0-9]*s' -m 1 | sed -e 's/ - / /' |
  sed -e 's/([0-9]*)//' | tr -d '\n' | tr -s ' ' ' '; echo; done; done) | tee out.txt
  ...
  200  CodeGen:(Total: 1s333ms, non-child: 1s333ms, % non-child: 100.00%) CodegenTime: 613.562us CompileTime: 605.320ms LoadTime: 0.000ns ModuleBitcodeSize: 1.95 MB NumFunctions: 38 NumInstructions: 8.44K OptimizationTime: 701.276ms PeakMemoryUsage: 4.12 MB PrepareTime: 10.014ms
  ...
  1000  CodeGen:(Total: 2s659ms, non-child: 2s659ms, % non-child: 100.00%) CodegenTime: 558.860us CompileTime: 1s267ms LoadTime: 0.000ns ModuleBitcodeSize: 1.95 MB NumFunctions: 34 NumInstructions: 8.41K OptimizationTime: 1s362ms PeakMemoryUsage: 4.11 MB PrepareTime: 10.574ms

I have run the core tests with this change.

Change-Id: I7f1b390be4adf6e6699a18344234f8ff7ee74476
Reviewed-on: http://gerrit.cloudera.org:8080/8211
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Currently a "select *" query in test_inline_view_limit fails during
exhaustive testing because Impala returns columns from HBase tables
in a different order (IMPALA-886) than the one expected. This fix
ensures the column order is consistent by specifying the output
columns in the right order in the select query.

Testing:
Tested locally, with and without exhaustive exploration strategy.

Change-Id: I11667872b8788a8b0040bf9252bf07b987b5d330
Reviewed-on: http://gerrit.cloudera.org:8080/8409
Reviewed-by: Lars Volker <[email protected]>
Tested-by: Impala Public Jenkins
Impala would previously update the ddl time of a table when dropping a
partition but not when adding one. This change removes updates to the
ddl time when partitions are added or removed to be consistent with
Hive.

Additionally the check in the ddl update test would fail if some
operations took longer than 20 seconds. Instead, this change makes sure
that the ddl time increases as intended.

To test this change I ran test_last_ddl_time_update in exhaustive mode
and also ran a private S3 build.

Change-Id: I3126252e7709304d3e1fa4bb06a0b847180bd6cf
Reviewed-on: http://gerrit.cloudera.org:8080/8411
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
Coordinator::UpdateFilter wrongly assumed std::string::capacity() to be
0 after std::string::clear(), resulting in a DCHECK failure in memory
tracking. This patch tracks size() instead of capacity() to fix the bug.
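A small demonstration of why capacity() is the wrong quantity to track:

  #include <iostream>
  #include <string>

  int main() {
    std::string s(1024, 'x');
    s.clear();
    // size() is guaranteed to be 0 after clear(); capacity() is
    // implementation-defined and usually stays at its previous value.
    std::cout << "size=" << s.size() << " capacity=" << s.capacity() << std::endl;
    return 0;
  }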

Change-Id: Ia28f9eb8d0b3e795d798949a1cfcfeaaec7040ed
Reviewed-on: http://gerrit.cloudera.org:8080/8410
Reviewed-by: Tim Armstrong <[email protected]>
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
A recent patch for IMPALA-5129 introduced a use-after-free bug in
thrift-server-test. It is fixed in this patch.

Change-Id: I2cd434757de2cd384def5b360a479e51812cccca
Reviewed-on: http://gerrit.cloudera.org:8080/8412
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I93e99aec2fcbc12f94678e60ebb9d150e72fc77d
Reviewed-on: http://gerrit.cloudera.org:8080/8421
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
test_wait_time has been flaky recently on ASAN due to hitting a
timeout. The fix is to increase the timeout for ASAN builds.

Change-Id: Iee005bee8e0a535ce59d2e23e56be6004f2eb9de
Reviewed-on: http://gerrit.cloudera.org:8080/8427
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
The approach is to re-materialize strings in those tuples that
survive conjunct evaluation and may reference disk I/O buffers
directly. This means that perf should not regress for the
following cases:
* Compressed Avro files.
* Non-string columns.
* Selective scans where the majority of tuples are filtered out.

This approach will also work for the Sequence and Text scanners.

Includes some improvements to Avro codegen to replace more constants to
help win back some performance (with limited success): replaced
InitTuple() with an optimised version and substituted
tuple_byte_size() with a constant.

Removes dead code for handling CHAR(n) - CHAR(n) is now always fixed
length.

Perf:
Did microbenchmarks on uncompressed Avro files, one with all columns
from lineitem and one with only l_comment. Tests were run with:
  set num_scanner_threads=1;

I ran the query 5 times and extracted MaterializeTupleTime from the
profile to measure CPU cost of materialization. Overall string
materialization got significantly slower, mainly because of the
extra memcpy() calls required.

Selecting one string from a table with multiple columns:

  select min(l_comment) from biglineitem_avro
  1.814 -> 2.096

Selecting one string from a table with one column:

  select min(l_comment) from biglineitem_comment; profile;
  1.708 -> 3.7

Selecting one string from a table with one column with predicate:

  select min(l_comment) from biglineitem_comment where length(l_comment) > 10000;
  1.691 -> 1.449

Selecting all columns:

  select min(l_orderkey), min(l_partkey), min(l_suppkey), min(l_linenumber),
        min(l_quantity), min(l_extendedprice), min(l_discount), min(l_tax),
        min(l_returnflag), min(l_linestatus), min(l_shipdate),
        min(l_commitdate), min(l_receiptdate), min(l_shipinstruct),
        min(l_shipmode), min(l_comment) from biglineitem_avro; profile;
  2.335 -> 3.711

Selecting an int column (no strings):
  select min(l_linenumber) from biglineitem_avro
  1.806 -> 1.819

Testing:
Ran exhaustive tests.

Change-Id: If1fc78790d778c874f5aafa5958c3c045a88d233
Reviewed-on: http://gerrit.cloudera.org:8080/8146
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
This is a revert of a revert, re-enabling parallel data load. It avoids
the race condition by explicitly configuring the temporary directory in
question in load-data.py.

When the parallel data load change went in, we discovered
a race with a signature of:

  java.io.FileNotFoundException: File
  /tmp/hadoop-jenkins/mapred/local/1508958341829_tmp does not exist

The number in this path is milliseconds since the epoch, and the race
occurs when two queries submitted to HiveServer2, running with the local
runner, hit the same millisecond time stamp.  The upstream bug is
https://issues.apache.org/jira/browse/MAPREDUCE-6441, and I described the
symptoms in https://issues.apache.org/jira/browse/MAPREDUCE-6992 (which
is now marked as a dupe).

I've tested this by running data load 5 times on the same machines
where it failed before. I also ran data load manually and inspected
the system to make sure that the temporary directories are getting
created as expected in /tmp/impala-data-load-*.

Change-Id: I60d65794da08de4bb3eb439a2414c095f5be0c10
Reviewed-on: http://gerrit.cloudera.org:8080/8405
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Setting a very large value could cause strange behaviour like crashing,
hanging, excessive memory usage, spinning etc. This patch rejects values
out of the range [0,65536].

Change-Id: Idd5a2490a73b6915224160d7604b4badc72c1d97
Reviewed-on: http://gerrit.cloudera.org:8080/8419
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
@yuzzjj yuzzjj merged commit 74718b7 into yuzzjj:cdh5-trunk Nov 2, 2017