A Clojure library designed to run Apache Kudu consistency tests using the Jepsen framework. Currently, a simple linearizability test for a read/write register is implemented and run under several fault injection scenarios.
The Clojure code is integrated into the project using the nebula-clojure-plugin. The kudu-jepsen tests are invoked by executing the runJepsen task. The parameters are passed via the standard -D<property>=<value> notation. There is a dedicated Clojure wrapper script, kudu_test_runner.clj in $KUDU_HOME/java/kudu-jepsen/src/utils, which populates the test environment with appropriate properties and iteratively runs all the registered tests with different nemesis scenarios.
To build the library, the following components are required:
- JDK 8
To build the project, run the following in the parent directory (i.e. $KUDU_HOME/java):
$ ./gradlew clean assemble
The machines for Kudu master and tserver nodes should be created prior to running the tests: the tests do not create them. The machines should be up and running when a test starts.
To run the test, the following components are required at the control node:
- JDK 8
- SSH client (and optionally, SSH authentication agent)
- gnuplot (to visualize test results)
Jepsen uses SSH to perform operations at DB nodes. kudu-jepsen assumes that SSH keys are installed accordingly:
- The public part of the SSH key should be added to the authorized_keys file at all DB nodes for the root user.
- For the SSH private key the options are:
  - Add the key to the SSH authentication agent running at the control node.
  - Specify the path to the file with the key in plain (non-encrypted) format via the sshKeyPath property.
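As a rough illustration of the key setup described above, the following sketch only prints the commands one might run from the control node; it does not execute them. The function name print_ssh_setup_commands is hypothetical, the key path and node names are borrowed from the examples in this document, and the use of ssh-copy-id assumes that the root account on the DB nodes can still be reached in some other way (e.g. with a password), which may not hold in a given environment.

```shell
# Hypothetical dry-run helper (not part of kudu-jepsen): prints, but does not
# run, the commands for installing a key pair for the root user.
#   $1: path to the private key; the public key is assumed to be at "$1.pub"
#   $2: comma-separated list of DB node hostnames
print_ssh_setup_commands() (
  key_path=$1
  nodes_csv=$2
  set -f      # disable globbing while the node list is split
  IFS=','     # split on commas; the subshell keeps this change local
  for node in $nodes_csv; do
    # Option 1 from the list above: append the public key to root's
    # authorized_keys file on every DB node.
    printf 'ssh-copy-id -i %s.pub root@%s\n' "$key_path" "$node"
  done
  # Load the private key into the SSH authentication agent.
  printf 'ssh-add %s\n' "$key_path"
)

print_ssh_setup_commands "$HOME/.ssh/vm_root_id_rsa" "m0,t0,t1,t2,t3,t4"
```

Review the printed commands before running any of them by hand. If the sshKeyPath property is used instead of the agent, the ssh-add step is not needed.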
If using an SSH authentication agent to hold the SSH key for DB node access, run the following in the parent directory:
$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0"
If not using an SSH authentication agent, specify the location of the file with the SSH private key via the sshKeyPath property:
$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" \
    -DsshKeyPath="/home/user/.ssh/vm_root_id_rsa"
Note that commas (not spaces) are used to separate the names of the nodes. The DNS resolver should be properly configured to resolve the specified hostnames into IP addresses.
The tserverNodes property specifies the set of nodes on which to run Kudu tablet servers. The masterNodes property specifies the set of nodes on which to run Kudu master servers.
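Because the node lists must be comma-separated, it can be convenient to assemble the command line from individual host names. A minimal sketch, where join_nodes is a hypothetical helper and the host names are the ones used in the examples above; the script prints the resulting command for review rather than running it:

```shell
# Hypothetical helper: join all arguments with commas (no spaces).
# "$*" joins the arguments with the first character of IFS; the subshell
# body keeps the IFS change from leaking into the caller.
join_nodes() (
  IFS=','
  printf '%s\n' "$*"
)

tserver_nodes=$(join_nodes t0 t1 t2 t3 t4)
master_nodes=$(join_nodes m0)

# Print the command line instead of executing it.
printf './gradlew runJepsen -DtserverNodes="%s" -DmasterNodes="%s"\n' \
  "$tserver_nodes" "$master_nodes"
```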
In Jepsen terminology, the Kudu master and tserver nodes play the Jepsen DB node role. The machine where the above-mentioned Gradle command is run plays the Jepsen control node role.
The following Bourne-again shell script can be used as a reference to build Kudu from source and run Jepsen tests.
When Jepsen’s analysis doesn’t find inconsistencies in the history of operations, it outputs the following at the end of a test:
Everything looks good! ヽ(‘ー`)ノ
However, it might not be the case. If so, it’s crucial to understand why the test failed.
The majority of kudu-jepsen test failures fall into two classes:
- An error happened while setting up the testing environment, contacting machines at the Kudu cluster, starting up Kudu server-side components, or in any of the third-party components Jepsen uses (like clj-ssh), etc.
- Jepsen’s analysis detected an inconsistent history of operations.
The former class of failures might be a manifestation of a wrong configuration, a problem with the test environment, a bug in the test code itself, or some other intermittent failure. Usually, encountering issues like that means the consistency analysis (which is the last step of a test scenario) cannot run. Such issues are reported as errors in the summary message. E.g., the example summary message below reports 10 errors in 10 tests run:
21:41:42 Ran 10 tests containing 10 assertions.
21:41:42 0 failures, 10 errors.
To get more details, take a closer look at the output of ./gradlew runJepsen or at the particular jepsen.log files in the $KUDU_HOME/java/kudu-jepsen/store/rw-register/<test_timestamp> directory. A quick way to locate the corresponding section in the error log is to search for the ^ERROR in \( regex pattern. An example of an error message from Jepsen’s output:
ERROR in (register-test-tserver-random-halves) (KuduException.java:110)
expected: (:valid? (:results (jepsen/run! (tcasefun opts))))
  actual: org.apache.kudu.client.NonRecoverableException: can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=28, DeadlineTracker(timeout=30000, elapsed=28571), ...
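The search for error sections can be scripted along these lines. This is only a sketch: list_jepsen_errors is a hypothetical name, and the store directory is passed in by the caller. Extended regular expressions (grep -E) are used so that \( matches a literal parenthesis; in basic regular expressions \( would start a group instead.

```shell
# Hypothetical helper: print every "ERROR in (" line found in the jepsen.log
# files below the given directory, prefixed with file name and line number.
#   $1: directory to search, e.g. $KUDU_HOME/java/kudu-jepsen/store/rw-register
list_jepsen_errors() {
  # /dev/null is passed as an extra file so that grep always prints the
  # file name, even when find hands it a single jepsen.log.
  find "$1" -name jepsen.log -exec grep -nE '^ERROR in \(' /dev/null {} +
}

# Usage (path as described in this document):
#   list_jepsen_errors "$KUDU_HOME/java/kudu-jepsen/store/rw-register"
```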
The latter class represents a more serious issue: a manifestation of a non-linearizable history of operations. This is reported as a failure in the summary message. E.g., the summary message below reports 2 instances of non-linearizable history among 10 tests run:
22:21:52 Ran 10 tests containing 10 assertions.
22:21:52 2 failures, 0 errors.
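The distinction between the two classes can be read off the summary mechanically. The sketch below assumes the summary has the "N failures, M errors." shape shown in the examples above; classify_summary and the labels it prints are hypothetical names chosen for illustration.

```shell
# Hypothetical helper: read test output on stdin and print one label based on
# the last "N failures, M errors." summary seen.
classify_summary() {
  awk '
    / failures, [0-9]+ errors\./ {
      for (i = 2; i <= NF; i++) {
        if ($i == "failures,") failures = $(i - 1)
        if ($i == "errors.")   errors   = $(i - 1)
      }
      found = 1
    }
    END {
      if (!found)            print "no-summary"
      # Failures take precedence: they indicate a non-linearizable history.
      else if (failures > 0) print "non-linearizable-history"
      # Errors only: the consistency analysis most likely could not run.
      else if (errors > 0)   print "environment-or-test-error"
      else                   print "ok"
    }'
}

# Usage (the log file name is an assumption):
#   classify_summary < run_output.log
```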
If Jepsen’s analysis finds a non-linearizable history of operations, it outputs the following at the end of a test:
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
To troubleshoot, first it’s necessary to find where the failed test stores the results: it should be one of the timestamp-named sub-directories (e.g. 20170109T071938.000-0800) under $KUDU_HOME/java/kudu-jepsen/store/rw-register in case of a linearizability failure in one of the rw-register test scenarios. One of the possible ways to find the directory:
$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
$ find . -name jepsen.log | xargs grep 'Analysis invalid'
./20170109T071938.000-0800/jepsen.log:Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
$
Another way is to find sub-directories where the linear.svg file is present:
$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
$ find . -name linear.svg
./20170109T071938.000-0800/linear.svg
$
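The two lookups above can be combined into one helper that lists each affected run directory once. This is a sketch under the assumption that the store layout matches the description in this document; list_invalid_runs is a hypothetical name.

```shell
# Hypothetical helper: print the run directories that either contain a
# jepsen.log mentioning "Analysis invalid" or contain a linear.svg file.
#   $1: store directory, e.g. $KUDU_HOME/java/kudu-jepsen/store/rw-register
list_invalid_runs() {
  {
    # Runs whose log reports a failed analysis (grep -l prints file names).
    find "$1" -name jepsen.log -exec grep -l 'Analysis invalid' {} +
    # Runs for which a linearizability diagram was generated.
    find "$1" -name linear.svg
  } | while IFS= read -r path; do
    dirname "$path"
  done | sort -u
}

# Usage:
#   list_invalid_runs "$KUDU_HOME/java/kudu-jepsen/store/rw-register"
```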
Along with the jepsen.log and history.txt files, the failed test generates a linear.svg file (gnuplot is required for that). The diagram in linear.svg illustrates the part of the history which Jepsen found inconsistent: it shows the relationship between time, client operation status, and system state, as well as the sequences of legal and illegal operation paths. From this point, the next step is to locate the corresponding part of the history in the history.txt file. Usually the problem appears around an activation interval of the test nemesis scenario. Once found, it’s possible to tie the vicinity of the inconsistent operation sequence to the timestamps in the jepsen.log file.
Having the timestamps of the operations and their sequence, it’s possible to find the related messages in the kudu-tserver.log and kudu-master.log files in the sub-directories named after the Kudu cluster nodes. Hopefully, that information is enough to create a reproducible scenario for further troubleshooting and debugging.
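A small helper can collect the Kudu server log lines around a given moment from all node sub-directories of a run. This is a sketch only: grep_kudu_logs is a hypothetical name, and the timestamp fragment in the usage line assumes glog-style prefixes such as "I0109 07:19:38.123456" in the Kudu logs. Check the actual format in your log files, and whether they use the same time zone as jepsen.log, before relying on it.

```shell
# Hypothetical helper: search the Kudu server logs of one test run for a
# fixed string (typically a timestamp fragment copied from the Kudu logs).
#   $1: run directory, e.g. .../store/rw-register/20170109T071938.000-0800
#   $2: string to look for; matched literally (grep -F), not as a regex
grep_kudu_logs() {
  find "$1" \( -name kudu-tserver.log -o -name kudu-master.log \) \
    -exec grep -nF -e "$2" /dev/null {} +
}

# Usage (the timestamp format is an assumption, see above):
#   grep_kudu_logs ./20170109T071938.000-0800 "0109 07:19:38"
```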