A Clojure project based on the Jepsen framework to find bugs in the distributed coordination of Apache Flink®.
Jepsen is a framework built to test the behavior of distributed systems under faults. The tests in this particular project deploy Flink on either YARN or Mesos, submit a job, and examine the availability of the job after injecting faults. A job is said to be available if all the tasks of the job are running. The faults that can be currently introduced to the Flink cluster include:
- Killing of TaskManager/JobManager processes
- Stopping HDFS NameNode
- Network partitions
There are many more properties other than job availability that could be verified but are not yet covered by this test suite, e.g., end-to-end exactly-once processing semantics.
See the Jepsen documentation
for how to set up the environment to run tests. The script under scripts/run-tests.sh
documents how to invoke
tests. The Flink job used for testing is located under
flink-end-to-end-tests/flink-datastream-allround-test
. You have to build the job first and copy
the resulting jar (DataStreamAllroundTestProgram.jar
) to the ./bin
directory of this project's
root.
To simplify development, we have prepared Dockerfiles and a Docker Compose template so that you can run the tests locally in containers. To build the images and start the containers, simply run:
$ cd docker
$ ./up.sh
After the containers started, open a new terminal window and run docker exec -it jepsen-control bash
.
This will allow you to run arbitrary commands on the control node.
To start the tests, you can use the run-tests.sh
script in the docker
directory,
which expects the number of test iterations, and a URI to a Flink distribution, e.g.,
./docker/run-tests.sh 1 https://example.com/flink-dist.tgz
The project's root is mounted as a volume to all containers under the path /jepsen
.
This means that changes to the test sources are immediately reflected in the control node container.
Moreover, this allows you to test locally built Flink distributions by copying the tarball to the
project's root and passing a URI with the file://
scheme to the run-tests.sh
script, e.g.,
file:///jepsen/flink-dist.tgz
.
The tests have high memory demands due to the many processes that are started by the control node. For example, to test Flink on YARN in a HA setup, we require ZooKeeper, HDFS NameNode, HDFS DataNode, YARN NodeManager, and YARN ResourceManager, in addition to the Flink processes. We found that the tests can be run comfortably in Docker containers on a machine with 32 GiB RAM.
Consult the jepsen.log
file for the particular test run in the store
folder. The final output of every test will be either
Everything looks good! ヽ('ー`)ノ
or
Analysis invalid! (ノಥ益ಥ)ノ ┻━┻
depending on whether the test passed or not. If neither output is generated, the test did not finish properly due to problems of the environment, bugs in Jepsen or in the test suite, etc.
In addition, the test directories contain all relevant log files aggregated from all hosts.