merge cloudera impala branches #2

Merged
merged 319 commits into from
Jul 28, 2017

Conversation

yuzzjj
Owner

@yuzzjj yuzzjj commented Jul 28, 2017

merge

dtsirogiannis and others added 30 commits May 12, 2017 19:39
This commit fixes an issue where dropping a table that is not loaded
correctly (throws TableLoadingException) generates an access event that doesn't
use a fully qualified table name.

Change-Id: Icd63f7e4accc7fda9719e13059fa8d432981618a
Reviewed-on: http://gerrit.cloudera.org:8080/6879
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
The following Hadoop metrics have been added to the /metrics page:

hedgedReadOps - the number of hedged reads that have occurred

hedgedReadOpsWin - the number of times the hedged read returned
faster than the original read

The metrics will be updated only when --use_hdfs_pread is set to
'true'.

This change depends on the following new commit to HDFS:
apache/hadoop@8c81a16

Testing: Not adding tests since it requires some custom hadoop
configuration. Tested manually by setting the configurations and
verifying that the metrics work.

Change-Id: Id4a5d396abb3373d352ad2df8c2272db018114da
Reviewed-on: http://gerrit.cloudera.org:8080/6886
Reviewed-by: Matthew Jacobs <[email protected]>
Reviewed-by: Lars Volker <[email protected]>
Tested-by: Impala Public Jenkins
Allow users to keep a longer history of queries if desired.  I
personally find it useful to keep a long history of queries to
reference and want to bump this up to a very large value, but
keep the default reasonable.  Also change the config loader
to not freak out over unknown parameters so as not to break
for users that end up with new options set running on older
shells.

Testing: Created .impalarc as follows, now getting more history saved.
Put broken things in .impalarc and make sure they are logged as
warnings.

[impala]
history_max=1000
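
The tolerant loading described above can be sketched as follows. This is a minimal illustration, not the shell's actual loader: `KNOWN_OPTIONS` and `load_options` are hypothetical names, and only `history_max` is taken from the message.

```python
import configparser
import logging

# Hypothetical set of option names that this shell version understands.
KNOWN_OPTIONS = {"history_max", "verbose"}


def load_options(text, section="impala"):
    """Return (known options, unknown option names) read from `section`.

    Unknown options are logged as warnings instead of aborting the load,
    so a config written for a newer shell still works on an older one.
    """
    parser = configparser.ConfigParser()
    parser.read_string(text)
    loaded, unknown = {}, []
    if not parser.has_section(section):
        return loaded, unknown
    for name, value in parser.items(section):
        if name in KNOWN_OPTIONS:
            loaded[name] = value
        else:
            logging.warning("Ignoring unknown option %r in [%s]", name, section)
            unknown.append(name)
    return loaded, unknown
```

Calling `load_options("[impala]\nhistory_max=1000\nfuture_option=1\n")` keeps `history_max` and only warns about `future_option`.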

Change-Id: Iaf65bbecb8fd7f1105aac62b6745d6125a603d7f
Reviewed-on: http://gerrit.cloudera.org:8080/6335
Reviewed-by: Michael Brown <[email protected]>
Tested-by: Impala Public Jenkins
A memory intensive UDF test takes a while to completely finish and for
the memory in Impala to be completely freed. This caused a problem in
ASAN builds (and potentially in normal builds) because we would start
the next test right away, before the memory is freed.

We fix the issue by checking that all fragments finish executing before
starting the next test.

Testing:
- Ran a private ASAN build which passed.

Change-Id: I0555b5327945c522f70f449caa1214ee0bfd84fe
Reviewed-on: http://gerrit.cloudera.org:8080/6893
Reviewed-by: Alex Behm <[email protected]>
Reviewed-by: Michael Ho <[email protected]>
Tested-by: Impala Public Jenkins
This gflags patch adds DEFINE_int32_hidden() etc. macros, which suppress
flags from appearing in /varz, --help and other flag enumerations.

Our toolchain glog is statically linked against gflags, and therefore
had to be rebuilt, however its version number did not change. You will
likely need to do the following:

rm -rf ${IMPALA_TOOLCHAIN_DIR}/glog-0.3.4-p2/

before running bin/bootstrap_toolchain.py, otherwise building Impala may
fail with a linking error.

Change-Id: Ibc09a750879a8eae8b3549b9438241cb7c4448ed
Reviewed-on: http://gerrit.cloudera.org:8080/6889
Reviewed-by: Lars Volker <[email protected]>
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
This change switches to a new Breakpad version, which includes fixes for
Breakpad bugs #681 and #728. The toolchain change was reviewed here:
https://gerrit.cloudera.org/6866

The change also undoes the workaround introduced in IMPALA-3794.

In addition to running test_breakpad.py in a loop for a while, I tested
Then I verified that the test fails with the old toolchain version
(88e5b2) and works with the new one (ffe3e4).

To test #728 I added a sleep() call before SendContinueSignalToChild()
and then killed the parent process, manually observing that the child
would die, too.

Change-Id: Ic541ccd565f2bb51f68c085747fc47ae8c905d19
Reviewed-on: http://gerrit.cloudera.org:8080/6883
Reviewed-by: Lars Volker <[email protected]>
Tested-by: Impala Public Jenkins
The recent Kudu TIMESTAMP patch (IMPALA-5137) made an
inadvertent change [1] to alltypeserror_tmp and
alltypeserrornonulls_tmp, changing 'timestamp_col' from
STRING to TIMESTAMP.

This seems to cause failures on exhaustive jobs which run
test_hdfs_scan_node_errors against all file-formats.
I haven't been able to reproduce this failure myself, so
cannot test whether this fixes the jobs that are failing, but
this change to revert these tables seems warranted given
they were changed inadvertently.

1: https://gerrit.cloudera.org/#/c/6526/11/testdata/datasets/functional/functional_schema_template.sql

Change-Id: I533f1921662802ea6e076eefac973f50c014fcb5
Reviewed-on: http://gerrit.cloudera.org:8080/6891
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Matthew Jacobs <[email protected]>
By default, Kudu assumes it has 80% of system memory which
is far too high for the minicluster. This sets a mem limit
of 2gb and lowers the limit of the block cache. These values
were tested on a gerrit-verify-dryrun job as well as an
exhaustive run.

This patch also simplifies TestKuduMemLimits which was
unnecessarily creating a large table during test execution.

Change-Id: I7fd7e1cd9dc781aaa672a2c68c845cb57ec885d5
Reviewed-on: http://gerrit.cloudera.org:8080/6844
Reviewed-by: Todd Lipcon <[email protected]>
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
This change builds on the support for reading and writing
TIMESTAMP columns to Kudu tables (see [1]), adding support
for pushing TIMESTAMP predicates to Kudu for scans.

Binary predicates and IN list predicates are supported.

Testing: Added some planner and EE tests to validate the
behavior.

1: https://gerrit.cloudera.org/#/c/6526/

Change-Id: I08b6c8354a408e7beb94c1a135c23722977246ea
Reviewed-on: http://gerrit.cloudera.org:8080/6789
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Adds support in DDL for timestamps in Kudu range partition syntax.

For convenience, strings can be specified with or without
explicit casts to TIMESTAMP.

E.g.
create table ts_ranges (ts timestamp primary key, i int)
partition by range (
  partition '2009-01-02 00:00:00' <= VALUES < '2009-01-03 00:00:00'
) stored as kudu

Range bounds are converted to Kudu UNIXTIME_MICROS during
analysis.

Testing: Adds FE and EE tests.

Change-Id: Iae409b6106c073b038940f0413ed9d5859daaeff
Reviewed-on: http://gerrit.cloudera.org:8080/6849
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Non-deterministic exprs which evaluate as constant should not be
used during HDFS partition pruning.  We consider Exprs which have no
SlotRefs as bound by default, and thus we end up trying to apply
them indiscriminately.  Constant propagation makes this situation
easier to run into and the behavior is rather unexpected.

The fix for now is to explicitly disallow non-deterministic Exprs
in partition pruning.
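
A minimal sketch of the idea, assuming a toy `Predicate` type and a `prune_partitions` helper (both hypothetical, not the planner's classes): predicates flagged as non-deterministic are simply left out of pruning.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Predicate:
    # Evaluates the predicate against one partition's key values.
    evaluate: Callable[[Dict[str, object]], bool]
    deterministic: bool = True


def prune_partitions(partitions: List[Dict[str, object]],
                     predicates: List[Predicate]) -> List[Dict[str, object]]:
    """Keep the partitions that satisfy every deterministic predicate.

    Non-deterministic predicates are skipped here; they would have to be
    evaluated later, per row, rather than once per partition.
    """
    usable = [p for p in predicates if p.deterministic]
    return [part for part in partitions
            if all(p.evaluate(part) for p in usable)]


partitions = [{"year": 2009}, {"year": 2010}]
by_year = Predicate(lambda part: part["year"] == 2010)
# Stands in for something like rand() < 0.5 that has no slot references.
random_like = Predicate(lambda part: False, deterministic=False)
kept = prune_partitions(partitions, [by_year, random_like])
```

Here `kept` contains only the 2010 partition, because `random_like` is never consulted during pruning.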

Change-Id: I91054c6bf017401242259a1eff5e859085285546
Reviewed-on: http://gerrit.cloudera.org:8080/6575
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I17268bdb480230938f94559fe1eabe34ac2448b7
Reviewed-on: http://gerrit.cloudera.org:8080/5589
Reviewed-by: Jim Apple <[email protected]>
Tested-by: Impala Public Jenkins
IMPALA-4166 introduced a bug by duplicating code that adds sort
expressions. Upon re-analysis, this code would hit an
IndexOutOfBoundsException.

Change-Id: Ibebba29509ae7eaa691fe305500cda6bd41a179a
Reviewed-on: http://gerrit.cloudera.org:8080/6921
Reviewed-by: Lars Volker <[email protected]>
Tested-by: Impala Public Jenkins
Previously, updates to the query state in ClientRequestState were
not immediately reflected in the query profile, potentially
leading to the profile showing an incorrect state for an extended
period during execution.

In particular, queries were being shown in the 'CREATED' state
long after they had started 'RUNNING'.

The fix is to update the profile whenever the state is updated.

Testing:
- Extended existing hs2 tests and added a beeswax test to check
  for expected query states in the profile

Change-Id: I952319b7308a24d4e2dff924199c0c771bce25b3
Reviewed-on: http://gerrit.cloudera.org:8080/6923
Reviewed-by: Dan Hecht <[email protected]>
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
The sortby() hint is superseded by the SORT BY SQL clause, which has
been introduced in IMPALA-4166. This change removes the hint.

Change-Id: I83e1cd6fa7039035973676322deefbce00d3f594
Reviewed-on: http://gerrit.cloudera.org:8080/6885
Reviewed-by: Lars Volker <[email protected]>
Tested-by: Impala Public Jenkins
Without this, buildall.sh -ninja fails to run the backend tests or
runs them with the Makefiles that were created when buildall.sh was
last run without the -ninja flag.

Change-Id: Idb920dd4b08d8ef5fbc0bf1ea1b424a0c544e1db
Reviewed-on: http://gerrit.cloudera.org:8080/6942
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
The buffers contain the Parquet DataPages, which need to be
attached to the row batch if the rows point to var-len data
stored directly in the page. Otherwise the buffers can be
discarded once the values in the page have been materialized.

This reduces the amount of memory transferred between threads, which is
a known TCMalloc anti-pattern. It also allows us to free memory
earlier, which may help reduce memory consumption slightly.

Also fix a latent bug I noticed where needs_conversion_ is not
always initialised in the constructor.

Testing
Ran exhaustive build. Most of the Parquet tests use compressed Parquet,
which should exercise this code path.

Change-Id: I2dbd749f43078b222ff8e1ddcec840986c466de6
Reviewed-on: http://gerrit.cloudera.org:8080/6876
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Misc changes to improve usability of the profiles.

* Separate out detailed BufferPool metrics into a "Buffer pool"
  sub-profile.
* Only create the limit counter if there is a limit
* Show BufferPool usage in query MemTracker (it was accidentally
  disabled before because there was no query-level profile).
* Reduce clutter in MemTracker dump by only showing buffer pool
  reservation, not usage (the usage was misleading anyway because
  it didn't include child usage).
* Remove TotalUnpinnedBytes, which had limited value - WriteIoBytes
  and PeakUnpinnedBytes can answer most of the same questions - i.e.
  did it unpin any pages, and how many did it need to write to disk.
* Add buffer pool metrics to /memz (if buffer pool is enabled) and
  reorder /memz so more useful information is up the top.

Change-Id: I34b7f4d94c3d396ac89026c7559d6b2c6e02697c
Reviewed-on: http://gerrit.cloudera.org:8080/6690
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Add CLUSTERED hint.

Update hint syntax in INSERT topic.

Also modernize the hint syntax as shown under INSERT
to include the -- and /* */ formats also. List
the [] style last since it is the least-preferred
option.

Switch to preferring /* */ syntax for hints
instead of using the [ ] notation by default.

Finally, take out references to the SORTBY hint because
it didn't actually make it in. Intent for future is to have a way
to get this behavior without using a hint.
Change-Id: Id3c1da9a87ace361b096fa73d8504b2f54e75bed
Reviewed-on: http://gerrit.cloudera.org:8080/5655
Reviewed-by: John Russell <[email protected]>
Tested-by: Impala Public Jenkins
…n Parquet"

Reverting IMPALA-2716 as SparkSQL does not agree with the approach
taken.

More details can be found at:
https://issues.apache.org/jira/browse/SPARK-12297

Change-Id: Ic66de277c622748540c1b9969152c2cabed1f3bd
Reviewed-on: http://gerrit.cloudera.org:8080/6896
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
The assertion was incorrect and racy - it is ok if the write error wins
the race with the Unpin() calls, causing them to fail.

Change-Id: I023193b9ad6c6ac0ee114ad77ddf04d7d7185809
Reviewed-on: http://gerrit.cloudera.org:8080/6953
Reviewed-by: Henry Robinson <[email protected]>
Reviewed-by: Dan Hecht <[email protected]>
Tested-by: Impala Public Jenkins
We use the new libHDFS API hdfsGetLastExceptionRootCause() to return
the last seen HDFS error on that thread.

This patch depends on the recent HDFS commit:
apache/hadoop@fda86ef

Testing: A test has been added which puts HDFS in safe mode and then
verifies that we see a 255 error with the root cause.

Change-Id: I181e316ed63b70b94d4f7a7557d398a931bb171d
Reviewed-on: http://gerrit.cloudera.org:8080/6894
Tested-by: Impala Public Jenkins
Reviewed-by: Alex Behm <[email protected]>
Start with placeholder for 2.9 new features topic.

Initially just point to the changelog file.

Change-Id: I1f6cabc2427daf1243bd69dbed295c6923c4091b
Reviewed-on: http://gerrit.cloudera.org:8080/6954
Reviewed-by: Michael Brown <[email protected]>
Tested-by: Impala Public Jenkins
UBSan reports "runtime error: load of value 32, which is not a valid
value for type 'bool'".

Change-Id: I0ddc496019941048b3e0775606fa5e8e3f9c075a
Reviewed-on: http://gerrit.cloudera.org:8080/6937
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I636a6f2dcd0555ab9b46304e3a7298c598a511da
Reviewed-on: http://gerrit.cloudera.org:8080/6964
Reviewed-by: Michael Brown <[email protected]>
Tested-by: Impala Public Jenkins
The writeup for sortby() was removed during the gerrit review
process. This bullet in the New Features list was left behind,
and is now being removed.

Change-Id: Ib0c32df2dcfbde47a16e4692e5953b31cb144bcc
Reviewed-on: http://gerrit.cloudera.org:8080/6965
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Syntax:
<tableref> TABLESAMPLE SYSTEM(<number>) [REPEATABLE(<number>)]
The first number specifies the percent of table bytes to sample.
The second number specifies the random seed to use.

The sampling is coarse-grained. Impala keeps randomly adding
files to the sample until at least the desired percentage of
file bytes have been reached.

Examples:
SELECT * FROM t TABLESAMPLE SYSTEM(10)
SELECT * FROM t TABLESAMPLE SYSTEM(50) REPEATABLE(1234)
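
The coarse-grained, file-level sampling described above can be sketched like this. It is an illustration only: `sample_files` and the file names are hypothetical, and Python's `random` module stands in for whatever generator the planner uses.

```python
import random


def sample_files(file_sizes, percent, seed=None):
    """Randomly add files until at least `percent` of all bytes is selected.

    `file_sizes` maps a file name to its size in bytes. Passing the same
    `seed` yields the same sample, mirroring REPEATABLE(<number>).
    """
    rng = random.Random(seed)
    names = sorted(file_sizes)  # fixed starting order so a seed is repeatable
    rng.shuffle(names)
    target = sum(file_sizes.values()) * percent / 100.0
    picked, picked_bytes = [], 0
    for name in names:
        if picked_bytes >= target:
            break
        picked.append(name)
        picked_bytes += file_sizes[name]
    return picked, picked_bytes
```

Because whole files are added, the sample usually overshoots the requested percentage, especially when the table has a few large files.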

Testing:
- Added parser, analyser, planner, and end-to-end tests
- Private core/hdfs run passed

Change-Id: Ief112cfb1e4983c5d94c08696dc83da9ccf43f70
Reviewed-on: http://gerrit.cloudera.org:8080/6868
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Holding client_request_state_map_lock_ and CRS::lock_ together in certain
paths could potentially block the impalad from registering new queries.
The most common occurrence of this is while loading the webpage of a
query while the query planning is still in progress. Since we hold the
CRS::lock_ during planning, it blocks the web page from loading which
in turn blocks incoming queries by holding client_request_state_map_lock_.

This patch makes client_request_state_map_lock_ a terminal lock so that
we don't have interleaving locking with CRS::lock_.
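
A minimal sketch of the "terminal lock" pattern, using hypothetical Python classes rather than the backend's C++ types: the map lock is released before the per-query lock is taken, so a slow query cannot stall registration of new ones.

```python
import threading


class ClientRequestState:
    def __init__(self, query_id):
        self.query_id = query_id
        self.lock = threading.Lock()  # may be held for a long time, e.g. while planning
        self.state = "CREATED"


class Registry:
    def __init__(self):
        # Terminal lock: no other lock is acquired while this one is held.
        self._map_lock = threading.Lock()
        self._states = {}

    def register(self, crs):
        with self._map_lock:
            self._states[crs.query_id] = crs

    def get(self, query_id):
        with self._map_lock:
            return self._states.get(query_id)

    def read_state(self, query_id, timeout=-1):
        crs = self.get(query_id)  # the map lock is already released here
        if crs is None:
            return None
        if not crs.lock.acquire(timeout=timeout):
            return None  # query is busy; nothing else is blocked while waiting
        try:
            return crs.state
        finally:
            crs.lock.release()
```

With this ordering, a reader waiting on a busy query holds only a reference to that query's state, so `register()` stays responsive.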

Testing: Tested it locally by adding a long sleep in
JniFrontend.createExecRequest() and still was able to refresh the web UI
and run parallel queries. Also added a custom cluster test that does the
same sequence of actions by injecting a metadata loading pause.

Change-Id: Ie44daa93e3ae4d04d091261f3ec4891caffe8026
Reviewed-on: http://gerrit.cloudera.org:8080/6707
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
Previously the default was 0, switching to 1 for verbose. Now the
default is 1 (named 'standard') and extended is 2.

Change-Id: Ib18cfbfa35a4e3b324e6744da62de2fad86c1e67
Reviewed-on: http://gerrit.cloudera.org:8080/6966
Reviewed-by: Alex Behm <[email protected]>
Tested-by: Impala Public Jenkins
Matthew Jacobs and others added 28 commits July 20, 2017 02:40
Kudu tables did not treat some table properties correctly.

Change-Id: I69fa661419897f2aab4632015a147b784a6e7009
Reviewed-on: http://gerrit.cloudera.org:8080/7454
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Disables a test that seemed to get flaky recently, perhaps
related to testing with Java 8 or maybe even changes in YARN
(which get used by RequestPoolService).

Since we're not changing this code right now, let's disable
this test to unblock builds. Keeping the JIRA open to track
a better solution.

Change-Id: I616961457cd48d31d618c8b58f5279b89d3cdcd6
Reviewed-on: http://gerrit.cloudera.org:8080/7466
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Bugs:

- FunctionCallExpr's toSql() doesn't include IGNORE NULLS if present
  causing view definitions to break and leading to incorrect results.

- FunctionCallExpr's clone() implementation doesn't carry forward
  IGNORE NULLS option if present. One case that breaks with this is
  querying views containing analytic exprs causing wrong plans.

Fixed both the bugs and added a test that can reliably reproduce this.

Change-Id: I723897886c95763c3f29d6f24c4d9e7d43898ade
Reviewed-on: http://gerrit.cloudera.org:8080/7416
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
Doing an O(n) consistency check every time the read or write
page was advanced results in O(n^2) overall runtime.

The fix is to separate the O(1) and O(n) checks and only do the
O(n) checks if:
* The function does an O(n) pass over the pages anyway (e.g.
  PinStream())
* The function is called only once per read or write pass over the
  stream.

This should make the cost of the checks O(n) (if we make the reasonable
assumption that PrepareForWrite(), PrepareForRead(), PinStream() and
UnpinStream() are called a bounded number of times per stream).
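
The split between cheap and expensive checks can be sketched as below. `PageStream` is a hypothetical toy, not BufferedTupleStreamV2; the `full_checks` counter exists only to make the cost visible.

```python
class PageStream:
    def __init__(self):
        self.pages = []        # rows stored in each page
        self.total_rows = 0
        self.full_checks = 0   # number of O(n) passes, for illustration only

    def _check_fast(self):
        # O(1): cheap invariants that are safe to run on every page advance.
        assert self.total_rows >= 0

    def _check_full(self):
        # O(n): walks every page, so it runs only once per pass over the stream.
        self.full_checks += 1
        assert sum(self.pages) == self.total_rows

    def add_page(self, rows):
        self.pages.append(rows)
        self.total_rows += rows
        self._check_fast()

    def prepare_for_read(self):
        self._check_full()
```

Adding n pages and then preparing for a read performs n constant-time checks and one linear check, instead of n linear checks.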

Testing:
Ran BufferedTupleStreamV2Test.

Change-Id: I8b380fcd0568cb73b36a490954bcd316db969ede
Reviewed-on: http://gerrit.cloudera.org:8080/7459
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Fix to populate the non-default query options set by planner in the
runtime profile.

Added a corresponding test case.

Change-Id: I08e9dc2bebb83101976bbbd903ee48c5068dbaab
Reviewed-on: http://gerrit.cloudera.org:8080/7419
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Currently, creation of a Status object (non-OK and non-EXPECTED)
prints the stack trace to the log. Fetching the stack trace takes
a large chunk of CPU time close to 130ms and results in a significant
perf hit when encountered on hot paths.
Five such hot paths were identified and the following changes were
made to fix it:

1. In ImpalaServer::GetExecSummary(), create Status() without holding
the query_log_lock_.
2, 3 and 4. In impala::DeserializeThriftMsg<>(),
PartitionedAggregationNode::CodegenUpdateTuple() and
HdfsScanner::CodegenWriteCompleteTuple, use Status::Expected where
appropriate.
5. In Status::MemLimitExceeded(), create Status object without
printing stacktrace

Change-Id: Ief083f558fba587381aa7fe8f99da279da02f1f2
Reviewed-on: http://gerrit.cloudera.org:8080/7449
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
The change is mostly mechanical - added Status returns where
needed.

In one place I restructured the logic around
'current_encoding_' for Parquet to allow a cleaner solution
to the dropped status from FinalizeCurrentPage() call in
ProcessValue(): after the restructuring the call was no longer
needed. 'current_encoding_' was overloaded to represent both the
encoding of the current page and the preferred encoding
for subsequent pages.

Testing:
Ran exhaustive build.

Change-Id: I753d352c640faf5eaef650cd743e53de53761431
Reviewed-on: http://gerrit.cloudera.org:8080/7372
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
Privileges granted to a role assigned to a db/table whose name
contains upper case characters can disappear after a few seconds.
A privilege is inserted into the catalogObjectCache using a key
that uses the db/table name. The key gets converted to a lower
case before inserting.
Privilege name returned by sentryProxy is always lower case,
which might not match the privilegeName built in the catalog.
This triggers an update of the catalog object followed by a
removal of the old object. Since they both use the same key
in lower case it ends up deleting the newly updated object.
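
The failure mode can be reproduced with a toy cache. Everything here is hypothetical (class name, function names, key format), and `refresh_guarded` is only one possible guard, not necessarily what the patch does.

```python
class PrivilegeCache:
    """Toy cache that lower-cases the key on every operation."""

    def __init__(self):
        self._items = {}

    def add(self, name, privilege):
        self._items[name.lower()] = privilege

    def remove(self, name):
        self._items.pop(name.lower(), None)

    def contains(self, name):
        return name.lower() in self._items


def refresh_buggy(cache, catalog_name, sentry_name, privilege):
    # A case-sensitive comparison treats the two names as different objects.
    if catalog_name != sentry_name:
        cache.add(sentry_name, privilege)
        # Same lower-cased key as the entry just added, so this deletes it.
        cache.remove(catalog_name)


def refresh_guarded(cache, catalog_name, sentry_name, privilege):
    cache.add(sentry_name, privilege)
    # Remove the old entry only when it really lives under a different key.
    if catalog_name.lower() != sentry_name.lower():
        cache.remove(catalog_name)
```

With `catalog_name="db=Sales"` and `sentry_name="db=sales"`, the buggy refresh leaves the cache empty while the guarded one keeps the privilege.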

This change also adds a new catalogd startup option
(sentry_catalog_polling_frequency)
to configure the frequency at which catalogd polls the sentry service
to update any policy changes. The default value is 60 seconds.

Test:
Added a test which adds select privileges to 3 tables and dbs specified
in lower case, upper case and mixed case. The test verifies that the
privileges on the 3 tables do not disappear on a sentry update.

Change-Id: Ide3dfa601fcf77f5acc6adce9bea443aea600901
Reviewed-on: http://gerrit.cloudera.org:8080/7332
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Creating Kudu clients is very expensive as each will fetch
metadata from the Kudu master, so we should minimize the
number of Kudu clients that get created.

This patch stores a map from Kudu master addresses to Kudu
clients in KuduUtil to be used across the FE and catalog.
Another patch has already addressed the BE.
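
A minimal sketch of such a client map, assuming a hypothetical `FakeKuduClient` in place of the real Kudu client and a plain string of master addresses as the key:

```python
import threading


class FakeKuduClient:
    """Hypothetical stand-in for an expensive-to-create Kudu client."""

    created = 0  # counts constructions, for illustration only

    def __init__(self, master_addresses):
        FakeKuduClient.created += 1
        self.master_addresses = master_addresses


_clients = {}
_clients_lock = threading.Lock()


def get_client(master_addresses):
    """Return the shared client for these master addresses, creating it once."""
    with _clients_lock:
        client = _clients.get(master_addresses)
        if client is None:
            client = FakeKuduClient(master_addresses)
            _clients[master_addresses] = client
        return client
```

Repeated calls with the same addresses return the same object, so the cost of fetching metadata from the master is paid once per distinct set of addresses. The sketch has no invalidation path, which matches the follow-up work noted in the message.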

Future work will consider providing a way to invalidate
the stored Kudu clients in case something goes wrong
(IMPALA-5685)

This relies on two changes on the Kudu side: one that clears
non-covered range entries from the client's cache on table
open (d07ecd6ded01201c912d2e336611a6a941f48d98), and one
that automatically refreshes auth tokens when they expire
(603c1578c78c0377ffafdd9c427ebfd8a206bda3).

This patch disables some tests that no longer work as
they relied on Kudu metadata loading operations timing out,
but since we're reusing clients the metadata is already
loaded when the test is run.

Testing:
- Ran a stress test on a 10 node cluster: scan of a small
  Kudu table, 1000 concurrent queries, load on the Kudu
  master was reduced significantly, from ~50% cpu to ~5%.
  (with the BE changes included)
- Ran the Kudu e2e tests.
- Manually ran a test with concurrent INSERTs and
  'ALTER TABLE ADD PARTITION' (which is affected by the
  Kudu side change mentioned above) and verified
  correctness.

Change-Id: I9b0b346f37ee43f7f0eefe34a093eddbbdcf2a5e
Reviewed-on: http://gerrit.cloudera.org:8080/6898
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
Impala currently supports total sorts (the entire set of data
is sorted) and top-n sorts (only the highest/lowest n elements
are sorted). This patch adds the ability to do partial sorts,
where the data is divided up into some number of subsets, each
of which is sorted individually.

It accomplishes this by adding a new exec node, PartialSortNode.
When PartialSortNode::GetNext() is called, it retrieves input
up to the query memory limit, uses the existing Sorter class to sort
it, and outputs it. This is faster than a total sort with SortNode
as it avoids the need to spill if the input is larger than the
memory limit.
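
The idea of a partial sort can be sketched as a generator. This is a simplification: `partial_sort` is a hypothetical helper, and a row count stands in for the memory limit that bounds each run.

```python
def partial_sort(rows, max_rows_in_memory):
    """Yield rows sorted within consecutive runs of bounded size.

    Each run is sorted independently and emitted before the next run is
    read, so nothing ever needs to spill. The overall output is not
    globally sorted.
    """
    if max_rows_in_memory < 1:
        raise ValueError("max_rows_in_memory must be at least 1")
    run = []
    for row in rows:
        run.append(row)
        if len(run) == max_rows_in_memory:
            run.sort()
            yield from run
            run = []
    run.sort()  # final, possibly shorter, run
    yield from run
```

For example, `list(partial_sort([5, 1, 4, 2, 3, 0], 3))` gives `[1, 4, 5, 0, 2, 3]`: two sorted runs of three rows each.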

Future work will look into setting a more restrictive memory limit
on the PartialSortNode. (IMPALA-5669)

In the planner, the SortNode plan node is used, with an enum value
indicating if it is a total or partial sort.

This also adds a new counter 'RunSize' to the runtime profile which
tracks the min, max, and avg size of the generated runs, in tuples.

As a first use case, partial sort is used where a total sort was
used previously for inserts/upserts into Kudu tables only. Future
work can extend this to other table sinks. (IMPALA-5649)

Testing:
- E2E test with a large INSERT into a Kudu table with a mem limit.
  Checks that no spills occurred.
- Updated planner tests.
- Existing E2E tests and stress test verify correctness of INSERT.
- Perf tests on the 10 node cluster: inserting tpch_100.lineitem
  into a Kudu table with mem_limit=3gb:
  Previously: 5 runs are spilled, sort took 7m33s
  Now: no spills, sort takes 6m19s, for ~18% speedup

Change-Id: Ieec2a15a0cc5240b1c13682067ab64670d1e0a38
Reviewed-on: http://gerrit.cloudera.org:8080/7267
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
FLAGS_be_service_threads does nothing, and can be removed. Backend
Thrift servers do not use a fixed-size thread pool, instead using one
thread-per-connection.

Change-Id: I10e48014f24eebd22251bac4734bc3c90dee47c0
Reviewed-on: http://gerrit.cloudera.org:8080/7483
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Two tests (LongReverse and the base64 tests in StringFunctions)
run their tests over all lengths from 0..{{some length}}. Both take
several minutes to complete. This adds a lot of runtime for not much
more confidence.

Pick a set of 'interesting' (including powers-of-two, prime numbers,
edge-cases) lengths to run them over instead. Test time is reduced by
>150s on my desktop machine in debug mode.
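
One way to pick such a set is sketched below. The helper name and the particular primes are assumptions made for illustration; the actual tests may choose differently.

```python
def interesting_lengths(max_len):
    """Return a small sorted set of lengths worth testing up to `max_len`.

    Covers the edges (0, 1, max_len), each power of two with its
    neighbours, and a handful of primes.
    """
    lengths = {0, 1, max_len}
    power = 1
    while power <= max_len:
        lengths.update({power - 1, power, power + 1})
        power *= 2
    lengths.update(p for p in (2, 3, 5, 7, 11, 13, 101, 257) if p <= max_len)
    return sorted(n for n in lengths if 0 <= n <= max_len)
```

The size of the result grows roughly with the logarithm of `max_len`, rather than linearly as it does when every length is tested.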

Change-Id: I2962115734aff8dcaae0cc405274765105e31572
Reviewed-on: http://gerrit.cloudera.org:8080/7474
Reviewed-by: Henry Robinson <[email protected]>
Tested-by: Impala Public Jenkins
In a recent patch (IMPALA-5036) a bug was introduced where a count(*)
query with a group by a string partition column returned incorrect
results. Data was being written into the tuple at an incorrect offset.

Testing:
- Added an end to end test where we are selecting from a table
  partitioned by string.

Change-Id: I225547574c2b2259ca81cb642d082e151f3bed6b
Reviewed-on: http://gerrit.cloudera.org:8080/7481
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
While working on another patch, I noticed that a lot of includes and
forward declarations were spurious and possibly the result of bit rot.
This patch removes them and hopefully improves compile time a little.

Testing: Made sure that Impala and the BE tests compile successfully.

Change-Id: Id0afed224fad6a00698701487b51506d414f83ac
Reviewed-on: http://gerrit.cloudera.org:8080/7482
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Impala Public Jenkins
Change allocation pattern for Codec objects in RowBatch to be
stack-allocated. Make c'tors and Init() methods of codec implementations
publicly visible in order to do so.

Fix bit-rotting bug in row-batch-serialize-benchmark that made it abort
on start up.

Change-Id: I6641f4a08bd2711c4f4515ab29a6e5418cbd5f51
Reviewed-on: http://gerrit.cloudera.org:8080/7478
Reviewed-by: Henry Robinson <[email protected]>
Tested-by: Impala Public Jenkins
Formerly the project used SVN and the instructions
were posted on a public page. Now it's at github
and the user has to get the doc source from the
project to view it. Therefore I'm changing both
the URL and the descriptive text of the link.

Change-Id: I668dc3739a9c95c788408bfc73480793ae5ba4c3
Reviewed-on: http://gerrit.cloudera.org:8080/7447
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
For 2.9, I believe the only new reserved
keyword is TABLESAMPLE from IMPALA-5309.

Based on commit history from:
https://github.com/apache/incubator-impala/commits/master/fe/src/main/jflex/sql-scanner.flex

Change-Id: If4a340a033ff3f529061e48c4a5558b1ad1637ef
Reviewed-on: http://gerrit.cloudera.org:8080/7452
Reviewed-by: Michael Brown <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I0b9414c21faca00e4a64a35888bd50caac94318f
Reviewed-on: http://gerrit.cloudera.org:8080/7486
Reviewed-by: Thomas Tauber-Marshall <[email protected]>
Tested-by: Impala Public Jenkins
Change-Id: I127b7806feca810503ba3dd000a8e972835e715a
Reviewed-on: http://gerrit.cloudera.org:8080/7487
Reviewed-by: Greg Rahn <[email protected]>
Reviewed-by: Sailesh Mukil <[email protected]>
Tested-by: Impala Public Jenkins
If the coordinator, in UpdateBackendExecStatus(), receives a report that
includes a TInsertExecStatus, it will call UpdateInsertExecStatus()
which takes the coordinator-wide lock. Avoid doing this for fragment
instances that would only send an empty TInsertExecStatus (including
instances that belong to SELECT queries, not DML queries).

Future changes should fix the locking around the
UpdateBackendExecStatus() path to remove dependencies on
Coordinator::lock_, but this fix is simple and addresses one point of
needless contention.

Change-Id: I314dd8d96922d273c6329266970d249ec8c5c608
Reviewed-on: http://gerrit.cloudera.org:8080/7457
Reviewed-by: Henry Robinson <[email protected]>
Tested-by: Impala Public Jenkins
Remove untested / unused mini-impala-cluster binary.

Change-Id: I677314fc1a998dffa9120c016bfcf761b4e39f05
Reviewed-on: http://gerrit.cloudera.org:8080/7488
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
The fix for IMPALA-1654 has broken the compute incremental stats child
query generation logic for general partition expressions. This commit
fixes it and also adds new queries to fix the test gap. These tests
fail consistently without the patch.

Change-Id: I227fc06f580eb9174f60ad0f515a3641cec19268
Reviewed-on: http://gerrit.cloudera.org:8080/7379
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins
Queries with a null-aware anti-join joining on a large number of NULLs
can take a long time to cancel if threads are stuck in
PartitionedHashJoinNode::EvaluateNullProbe(). This change adds the
RETURN_IF_CANCELLED macro to the function.

Testing:
Added logs to PartitionedHashJoinNode::EvaluateNullProbe() and made sure
that the function returns right away on cancellation.

Change-Id: I0800754d4ad31cbadbdfadc630c640963f3f6053
Reviewed-on: http://gerrit.cloudera.org:8080/7393
Tested-by: Impala Public Jenkins
Reviewed-by: Tim Armstrong <[email protected]>
IMPALA-4000 added basic authorization support for Kudu
tables, but it had several limitations:
* Only the ALL privilege level can be granted to Kudu tables.
  (Finer-grained levels such as only SELECT or only INSERT are
  not supported.)
* Column level permissions on Kudu tables are not supported.
* Only users with ALL privileges on SERVER may create external
  Kudu tables.

This patch relaxes the restrictions to allow:
* Column-level permissions
* Fine-grained privileges SELECT and INSERT for those
  statement types.

DELETE/UPDATE/UPSERT privileges now require ALL privileges
because Sentry will eventually get fine grained privilege
actions, and at that point Impala should support the more
specific actions (IMPALA-3840). The assumption is that the
Kudu table authorization support is currently so limited
that most users are not using this functionality yet, but
this is a behavior change that needs to be clearly stated in
the Impala release notes.

Testing: Adds FE and EE tests.

Change-Id: Ib12d2b32fa3e142e69bd8b0f24f53f9e5cbf7460
Reviewed-on: http://gerrit.cloudera.org:8080/7307
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
Print the version info of each impalad that's used in a stress test run,
sorted by host name.

Testing done:
$ tests/stress/concurrent_select.py [redacted cluster options] --tpcds-db null --max-queries 0
Cluster Impalad Version Info:
host2.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385)
Built on Tue Jul 25 07:06:27 PDT 2017
host3.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385)
Built on Tue Jul 25 07:06:27 PDT 2017
host4.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385)
Built on Tue Jul 25 07:06:27 PDT 2017
host5.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385)
Built on Tue Jul 25 07:06:27 PDT 2017
host6.redacted: impalad version 2.10.0-SNAPSHOT RELEASE (build e862385)
Built on Tue Jul 25 07:06:27 PDT 2017
2017-07-25 12:38:52,732 12793 Thread-1 INFO:cluster[691]:Finding impalad binary location
...

Change-Id: Ie4b40783ddae6b1bfb2bb4e28c0e3bf97ab944c5
Reviewed-on: http://gerrit.cloudera.org:8080/7501
Reviewed-by: Michael Brown <[email protected]>
Tested-by: Michael Brown <[email protected]>
Read the start date and time of the impalad, catalogd and statestored processes
for the Debug Web UI. Uses the stat command on the /proc/<pid> directory and
formats the date with the date command to local time format.

Change-Id: I05ae2f80835b1b0e4bc3b38731778ba0871338a4
Reviewed-on: http://gerrit.cloudera.org:8080/7363
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
I ran the stress test binary search locally and it produced a slightly
higher number for Q18 than the hardcoded value. This is enough to move
it above one of the thresholds, so may reduce flakiness.

Testing:
I wasn't able to reproduce the flakiness locally, so can't confirm
this fixes it.

Change-Id: I1ffa969061a52730c5147d142dcd2e3cb3626590
Reviewed-on: http://gerrit.cloudera.org:8080/7512
Reviewed-by: Matthew Jacobs <[email protected]>
Tested-by: Impala Public Jenkins
If $IMPALA_HOME ends with a /, the clean_cmake_files function in
distcc_env.sh will emit a find command with a double // at the end for
the cmake_modules directory, and since it contains the substring cmake,
find will match and delete its contents.

Fix is to use a whitelist of locations and filenames to look for, and
delete only those.
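
The whitelist approach can be sketched as a pure path check. The whitelist contents below are assumptions (common CMake outputs), not the list used in distcc_env.sh, and the example paths are hypothetical.

```python
# Assumed whitelist of generated CMake artifacts; the script's real list may differ.
GENERATED_FILES = {"CMakeCache.txt", "cmake_install.cmake"}
GENERATED_DIRS = {"CMakeFiles"}


def is_generated(path):
    """Return True only for paths that match the whitelist exactly.

    Matching whole path components, instead of the substring 'cmake',
    keeps directories such as cmake_modules safe even when the root ends
    with a slash and the path contains '//'.
    """
    parts = [p for p in path.split("/") if p]  # tolerate empty components
    if not parts:
        return False
    return parts[-1] in GENERATED_FILES or any(p in GENERATED_DIRS for p in parts)
```

Only paths for which `is_generated` returns True would be passed on for deletion.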

Testing: manually ran enable_distcc, observed that my files were still
there.

Change-Id: I8a6e34bedf8000aed9e2b0597cfe86f73222c6ed
Reviewed-on: http://gerrit.cloudera.org:8080/7493
Reviewed-by: Tim Armstrong <[email protected]>
Tested-by: Impala Public Jenkins
@yuzzjj
Owner Author

yuzzjj commented Jul 28, 2017

could commit

@yuzzjj yuzzjj merged commit e500350 into yuzzjj:cdh5-trunk Jul 28, 2017