jepsen.kudu

A Clojure library designed to run Apache Kudu consistency tests using the Jepsen framework. Currently, a simple linearizability test for a read/write register is implemented and run under several fault injection scenarios.

Prerequisites and Requirements

Operating System Requirements

Only Debian/Ubuntu Linux is supported as a platform for the master and tablet server nodes. Tested to work on Debian 8 Jessie.

Overview

The Clojure code is integrated into the project using the nebula-clojure-plugin. The kudu-jepsen tests are invoked by executing the runJepsen task. Parameters are passed via the standard -D<property>=<value> notation. A dedicated Clojure wrapper script, kudu_test_runner.clj in $KUDU_HOME/java/kudu-jepsen/src/utils, populates the test environment with the appropriate properties and iteratively runs all the registered tests with different nemesis scenarios.

Usage

Building

To build the library the following components are required:

  • JDK 8

To build the project, run the following in the parent directory (i.e. $KUDU_HOME/java):

$ ./gradlew clean assemble

Running

The machines for the Kudu master and tserver nodes must be created before running the test: the test does not create them itself. The machines must be up and running when the test starts.

To run the test, the following components are required on the control node:

  • JDK 8

  • SSH client (and optionally, SSH authentication agent)

  • gnuplot (to visualize test results)

Jepsen uses SSH to perform operations on the DB nodes. kudu-jepsen assumes that the SSH keys are installed as follows:

  • The public part of the SSH key should be added into the authorized_keys file at all DB nodes for the root user

  • For the SSH private key the options are:

    • Add the key to the SSH authentication agent running at the control node

    • Specify the path to the file with the key in plain (non-encrypted) format via the sshKeyPath property.
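Before starting a run, it can help to confirm that key-based root login works on every DB node. The sketch below is not part of kudu-jepsen: the function name check_root_ssh and the DRY_RUN variable are hypothetical, and only standard OpenSSH client options (-o BatchMode=yes, -o ConnectTimeout, -i) are used.

```shell
# Hypothetical helper (not part of kudu-jepsen): verify that the control node
# can log in as root on each DB node without being prompted for a password.
#
# Usage: check_root_ssh "t0,t1,m0" [path/to/private_key]
# Set DRY_RUN=1 to print the ssh commands instead of running them.
check_root_ssh() {
  local nodes_csv key_path node failed
  nodes_csv="$1"
  key_path="${2:-}"
  failed=0
  for node in $(printf '%s' "$nodes_csv" | tr ',' ' '); do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    set -- -o BatchMode=yes -o ConnectTimeout=5
    if [ -n "$key_path" ]; then
      set -- "$@" -i "$key_path"
    fi
    set -- "$@" "root@${node}" true

    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "ssh $*"
    elif ssh "$@"; then
      echo "OK:   ${node}"
    else
      echo "FAIL: ${node}" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

When the key is held by the SSH authentication agent, the second argument can be omitted, e.g. check_root_ssh "t0,t1,t2,t3,t4,m0".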

If using an SSH authentication agent to hold the SSH key for access to the DB nodes, run the following in the parent directory:

$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0"

If not using an SSH authentication agent, specify the location of the file with the SSH private key via the sshKeyPath property:

$ ./gradlew runJepsen -DtserverNodes="t0,t1,t2,t3,t4" -DmasterNodes="m0" \
  -DsshKeyPath="/home/user/.ssh/vm_root_id_rsa"

Note that commas (not spaces) are used to separate the names of the nodes. The DNS resolver should be properly configured to resolve the specified hostnames into IP addresses.
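The comma-separated format and the name resolution requirement can both be checked from the control node before a run. The following is a minimal sketch, not something shipped with kudu-jepsen: the function names split_nodes and check_nodes_resolve are hypothetical, and it assumes getent(1) is available, as it is on Debian/Ubuntu. Note that getent consults the system's name service configuration, so entries in /etc/hosts count as resolvable too.

```shell
# Hypothetical pre-flight check (not part of kudu-jepsen).

# Print one node name per line from a comma-separated list.
split_nodes() {
  printf '%s\n' "$1" | tr ',' '\n'
}

# Return non-zero if any node name in the comma-separated list
# does not resolve to an address.
check_nodes_resolve() {
  local node unresolved
  unresolved=0
  for node in $(split_nodes "$1"); do
    if getent hosts "$node" > /dev/null; then
      echo "resolved:   $node"
    else
      echo "unresolved: $node" >&2
      unresolved=1
    fi
  done
  return "$unresolved"
}
```

For example, check_nodes_resolve "t0,t1,t2,t3,t4,m0" covers the node names used in the commands above.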

The tserverNodes property specifies the set of nodes on which to run Kudu tablet servers. The masterNodes property specifies the set of nodes on which to run Kudu master servers.

In Jepsen terminology, the Kudu master and tserver nodes play the role of Jepsen DB nodes. The machine where the above-mentioned Gradle command is run plays the role of the Jepsen control node.

A reference script to build Kudu and run Jepsen tests

The following Bourne-again shell script can be used as a reference to build Kudu from source and run Jepsen tests.

Troubleshooting

When Jepsen’s analysis doesn’t find inconsistencies in the history of operations, it outputs the following at the end of a test:

Everything looks good! ヽ(‘ー`)ノ

However, that is not always the outcome. When a test fails, it’s crucial to understand why.

The majority of kudu-jepsen test failures fall into two classes:

  • An error happened while setting up the testing environment, contacting machines in the Kudu cluster, starting up Kudu server-side components, or in one of the third-party components Jepsen uses (like clj-ssh).

  • Jepsen’s analysis detected an inconsistent history of operations.

The former class of failures might be a manifestation of a wrong configuration, a problem with the test environment, a bug in the test code itself, or some other intermittent failure. Encountering such issues usually means that the consistency analysis (which is the last step of a test scenario) cannot run. Such issues are reported as errors in the summary message. For example, the summary message below reports 10 errors in 10 tests run:

21:41:42 Ran 10  tests containing 10 assertions.
21:41:42 0 failures, 10 errors.

To get more details, take a closer look at the output of ./gradlew runJepsen or at the jepsen.log files in the $KUDU_HOME/java/kudu-jepsen/store/rw-register/<test_timestamp> directories. A quick way to locate the corresponding section in the error log is to search for the regex pattern ^ERROR in \( there. An example of an error message from Jepsen’s output:

ERROR in (register-test-tserver-random-halves) (KuduException.java:110)
expected: (:valid? (:results (jepsen/run! (tcasefun opts))))
  actual: org.apache.kudu.client.NonRecoverableException: can not complete before timeout: KuduRpc(method=IsCreateTableDone, tablet=null, attempt=28, DeadlineTracker(timeout=30000, elapsed=28571), ...
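The search for the ^ERROR in \( pattern can be sketched as a small shell function. This assumes the output of ./gradlew runJepsen was saved to a file beforehand (for example with tee); the function name list_test_errors and any file name passed to it are hypothetical.

```shell
# Hypothetical helper: print every "ERROR in (" line found in a saved log,
# prefixed with its line number and followed by two lines of context
# (the "expected:" and "actual:" lines in the example above).
list_test_errors() {
  grep -n -A 2 -E '^ERROR in \(' "$1"
}
```

The same function can be pointed at an individual jepsen.log file. It returns a non-zero status when no error section is found, which is what grep does when nothing matches.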

The latter class represents a more serious issue: a manifestation of a non-linearizable history of operations. This is reported as a failure in the summary message. For example, the summary message below reports 2 instances of non-linearizable history among 10 tests run:

22:21:52 Ran 10  tests containing 10 assertions.
22:21:52 2 failures, 0 errors.
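The two failure classes can be told apart mechanically from the summary line. The sketch below is only an illustration of that distinction: the function name classify_summary and its output wording are hypothetical, and it assumes the summary has the "<N> failures, <M> errors." shape shown in the examples above.

```shell
# Hypothetical helper: read test output on standard input and classify the run
# based on the last "<N> failures, <M> errors." summary line.
classify_summary() {
  local line failures errors
  # Keep the last summary line in case the input holds several runs.
  line="$(grep -E '[0-9]+ failures, [0-9]+ errors\.' | tail -n 1)"
  if [ -z "$line" ]; then
    echo "no summary line found"
    return 2
  fi
  failures="$(printf '%s\n' "$line" | grep -oE '[0-9]+ failures' | grep -oE '[0-9]+')"
  errors="$(printf '%s\n' "$line" | grep -oE '[0-9]+ errors' | grep -oE '[0-9]+')"
  if [ "$failures" -gt 0 ]; then
    # Failures: the analysis ran and reported a non-linearizable history.
    echo "failures=$failures errors=$errors: non-linearizable history reported"
    return 1
  elif [ "$errors" -gt 0 ]; then
    # Errors: something broke before or outside the consistency analysis.
    echo "failures=$failures errors=$errors: setup or environment errors"
    return 1
  fi
  echo "failures=$failures errors=$errors: no problems reported"
}
```

It is meant to be fed saved test output on standard input, e.g. classify_summary < run.log, where run.log is a hypothetical file name.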

If Jepsen’s analysis finds a non-linearizable history of operations, it outputs the following at the end of a test:

Analysis invalid! (ノಥ益ಥ)ノ ┻━┻

To troubleshoot, first find where the failed test stored its results. For a linearizability failure in one of the rw-register test scenarios, this is one of the timestamp-named sub-directories (e.g. 20170109T071938.000-0800) under $KUDU_HOME/java/kudu-jepsen/store/rw-register. One way to find the directory:

$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
$ find . -name jepsen.log | xargs grep 'Analysis invalid'
./20170109T071938.000-0800/jepsen.log:Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
$

Another way is to find sub-directories where the linear.svg file is present:

$ cd $KUDU_HOME/java/kudu-jepsen/store/rw-register
$ find . -name linear.svg
./20170109T071938.000-0800/linear.svg
$

Along with the jepsen.log and history.txt files, the failed test generates a linear.svg file (gnuplot is required for that). The diagram in linear.svg illustrates the part of the history that Jepsen found inconsistent: it shows the relationship between time, client operation status, and system state, along with the legal and illegal sequences of operations. The next step is to locate the corresponding part of the history in the history.txt file. Usually the problem appears around an activation interval of the test nemesis scenario. Once found, the vicinity of the inconsistent operation sequence can be tied to the timestamps in the jepsen.log file. With the timestamps of the operations and their order, it’s possible to find the relevant messages in the kudu-tserver.log and kudu-master.log files in the sub-directories named after the Kudu cluster nodes. Hopefully, that information is enough to create a reproducible scenario for further troubleshooting and debugging.
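The last step, pulling Kudu server messages for a given time window out of the per-node sub-directories, can be sketched as follows. This is an illustration under stated assumptions, not a kudu-jepsen tool: the function name kudu_logs_around is hypothetical, and it assumes the Kudu logs use glog-style line prefixes (a severity letter followed by MMDD HH:MM:SS.microseconds). Clock skew and time zone differences between the control node and the DB nodes are not handled and need to be accounted for by hand.

```shell
# Hypothetical helper: print Kudu server log lines whose glog-style timestamp
# starts with the given prefix, for every node sub-directory of a test run.
#
# Usage: kudu_logs_around <run_dir> '<MMDD HH:MM:S...>'
# Example prefix '0109 07:19:3' selects 07:19:30 through 07:19:39 on Jan 9.
kudu_logs_around() {
  local run_dir time_prefix
  run_dir="$1"
  time_prefix="$2"
  # -H prints the file path, which includes the node's directory name.
  find "$run_dir" \( -name kudu-tserver.log -o -name kudu-master.log \) \
    -exec grep -H -E "^[IWEF]${time_prefix}" {} +
}
```

For the example run above, the call could look like kudu_logs_around ./20170109T071938.000-0800 '0109 07:19:3', with the prefix taken from the timestamps of interest in jepsen.log.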