forked from keith-turner/goraci
saintstack/goraci
=================================
=                               =
=         GORACI README         =
=     @author Keith Turner      =
=================================

BACKGROUND
------------

Apache Accumulo [0] has a simple test suite that verifies that data is not
lost at scale. This test suite is called continuous ingest. This test runs
many ingest clients that continually create linked lists containing 25
million nodes. At some point the clients are stopped and a map reduce job is
run to ensure no linked list has a hole. A hole indicates data was lost.

The nodes in the linked list are random. This causes each linked list to
spread across the table. Therefore if one part of a table loses data, then it
will be detected by references in another part of the table.

This project is a version of the test suite written using Apache Gora [1].
Goraci has been tested against Accumulo and HBase.

THE ANATOMY OF GORACI TESTS
----------------------------

Below is a rough sketch of how data is written. For specific details look at
the Generator code (src/main/java/goraci/Generator.java).

  1 Write out 1 million nodes
  2 Flush the client
  3 Write out 1 million that reference previous million
  4 If this is the 25th set of 1 million nodes, then update 1st set of
    million to point to last
  5 goto 1

The key is that nodes only reference flushed nodes. Therefore a node should
never reference a missing node, even if the ingest client is killed at any
point in time.

When running this test suite with Accumulo, a script called the Agitator runs
in parallel and randomly and continuously kills server processes. Doing this
uncovered many data loss bugs in Accumulo. This test suite can also help find
bugs that impact uptime and stability when run for days or weeks.

This test suite consists of the following:

  - a few Java programs
  - a little helper script to run the Java programs
  - a maven script to build it

BUILDING GORACI
---------------

This code currently depends on an unreleased version of Gora.
To build Gora 0.2 run the following commands.

  svn export http://svn.apache.org/repos/asf/gora/trunk gora
  cd gora
  mvn install -DskipTests

After this you can build goraci.

  git clone git://github.com/keith-turner/goraci.git
  cd goraci
  mvn compile

The maven pom file has some profiles that attempt to make it easier to run
goraci against different gora backends by copying the jars you need into lib.
Before packaging it's important to edit gora.properties and set it correctly
for your datastore. To run against Accumulo, do the following.

  vim src/main/resources/gora.properties  (set Accumulo properties)
  mvn package -Paccumulo-1.4

To run against HBase, do the following.

  vim src/main/resources/gora.properties  (set HBase properties)
  mvn package -Phbase-0.92

To run against Cassandra, do the following.

  vim src/main/resources/gora.properties  (set Cassandra properties)
  mvn package -Pcassandra-1.0.2

For other datastores mentioned in gora.properties, you will need to copy the
appropriate deps into lib. Feel free to update the pom with other profiles
and send me a pull request.

GORA AND HADOOP
-----------------

Gora uses Avro, which uses a JSON library (Jackson) that Hadoop ships an old
version of. The two libraries jackson-core and jackson-mapper need to be
updated in <HADOOP_HOME>/lib and <HADOOP_HOME>/share/hadoop/lib/. I updated
these to jackson-core-asl-1.4.2.jar and jackson-mapper-asl-1.4.2.jar. For
details see HADOOP-6945 [3].

GORACI AND HBASE
-----------------

The generator needs to be scaled back from 1,000,000 inserts at a time to
100,000 or it will hang for some unknown reason. This can be accomplished by
modifying a constant in goraci.Generator and rebuilding.

JAVA CLASS DESCRIPTION
-----------------

Below is a description of the Java programs.

  * goraci.Generator - A map only job that generates data.

  * goraci.Verify - A map reduce job that looks for holes. Look at the
    counts after running. REFERENCED and UNREFERENCED are ok, any UNDEFINED
    counts are bad.
    Do not run at the same time as the Generator.

  * goraci.Walker - A standalone program that starts following a linked
    list and emits timing info.

  * goraci.Print - A standalone program that prints nodes in the linked
    list.

  * goraci.Delete - A standalone program that deletes a single node.

goraci.sh is a helper script that you can use to run the above programs. It
assumes all needed jars are in the lib dir. It does not need the package
name. You can just run "./goraci.sh Generator". Below is an example.

  $ ./goraci.sh Generator
  Usage : Generator <num mappers> <num nodes>

For Gora to work, it needs a gora.properties file and a mapping file on the
classpath. The contents of both are datastore specific; more details can be
found here [2]. You can edit the ones in src/main/resources and build the
goraci-${version}-SNAPSHOT.jar with those. Alternatively, remove those and
put them on the classpath through some other means.

GORACI AND HBASE
-----------------

In order to make Goraci ingest quickly into HBase, the gora-hbase datastore
code must be modified to disable autoflush. Apply the patch attached to
https://issues.apache.org/jira/browse/GORA-114 (you may have it already, as
it has been committed to TRUNK).

To improve performance running read jobs such as the Verify step, enable
scanner caching on the command line. For example:

  $ ./goraci.sh Verify -Dhbase.client.scanner.caching=1000 \
      -Dmapred.map.tasks.speculative.execution=false verify_dir 1000

Depending on how you have your hadoop and hbase deployed, you may need to
change the goraci.sh script around some. Here is one suggestion that may help
in the case where your hadoop and hbase configuration are other than under
the hadoop and hbase home directories.

diff --git a/goraci.sh b/goraci.sh
index db1562a..31c3c94 100755
--- a/goraci.sh
+++ b/goraci.sh
@@ -95,6 +95,4 @@ done
 #run it
 export HADOOP_CLASSPATH="$CLASSPATH"
 LIBJARS=`echo $HADOOP_CLASSPATH | tr : ,`
-hadoop jar "$GORACI_HOME/lib/goraci-0.0.1-SNAPSHOT.jar" $CLASS -libjars "$LIBJARS" "$@"
-
-
+CLASSPATH="${HBASE_CONF_DIR}" hadoop --config "${HADOOP_CONF_DIR}" jar "$GORACI_HOME/lib/goraci-0.0.1-SNAPSHOT.jar" $CLASS -files "${HBASE_CONF_DIR}/hbase-site.xml" -libjars "$LIBJARS" "$@"

You will need to define HBASE_CONF_DIR and HADOOP_CONF_DIR before you run
your goraci jobs. For example:

  $ export HADOOP_CONF_DIR=/home/you/hadoop-conf
  $ export HBASE_CONF_DIR=/home/you/hbase-conf
  $ PATH=/home/you/hadoop-1.0.2/bin:$PATH ./goraci.sh Generator 1000 1000000

CONCLUSIONS
------------

This test suite does not do everything that the Accumulo test suite does;
mainly, it does not collect statistics and generate reports. The reports are
useful for assessing performance.

Below is a test of the test: ingest one linked list, delete a node in it,
and ensure the verification map reduce job notices that the node is missing.
Not all output is shown, just the important parts.

  $ ./goraci.sh Generator 1 25000000

  $ ./goraci.sh Print -s 2000000000000000 -l 1
  2000001f65dbd238:30350f9ae6f6e8f7:000004265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6

  $ ./goraci.sh Print -s 30350f9ae6f6e8f7 -l 1
  30350f9ae6f6e8f7:4867fe03de6ea6c8:000003265852:ef09f9dd-75b1-4c16-9f14-0fa84f3029b6

  $ ./goraci.sh Delete 30350f9ae6f6e8f7
  Delete returned true

  $ ./goraci.sh Verify gci_verify_1 2
  11/12/20 17:12:31 INFO mapred.JobClient:   goraci.Verify$Counts
  11/12/20 17:12:31 INFO mapred.JobClient:     UNDEFINED=1
  11/12/20 17:12:31 INFO mapred.JobClient:     REFERENCED=24999998
  11/12/20 17:12:31 INFO mapred.JobClient:     UNREFERENCED=1

  $ hadoop fs -cat gci_verify_1/part\*
  30350f9ae6f6e8f7	2000001f65dbd238

The map reduce job found the one undefined node and gave the node that
referenced it.
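
The counts in the example above follow from a simple classification rule. As
a rough in-memory sketch (not the actual goraci.Verify map reduce job), the
Java below treats the table as a map from node id to the id it references. A
defined node is REFERENCED or UNREFERENCED depending on whether some other
node points at it; an id that is pointed at but not defined is UNDEFINED. The
class name VerifySketch and the methods buildLoop and verify are hypothetical
names made up for this illustration; only the three counter names come from
goraci.

```java
import java.util.EnumMap;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not part of the goraci code base.
public class VerifySketch {

    // Counter names as reported by goraci.Verify.
    enum Counts { REFERENCED, UNREFERENCED, UNDEFINED }

    // Builds a closed loop of n nodes: node i references node (i + 1) % n.
    // Key = node id, value = id of the node it references.
    static Map<Long, Long> buildLoop(int n) {
        Map<Long, Long> table = new HashMap<>();
        for (long i = 0; i < n; i++) {
            table.put(i, (i + 1) % n);
        }
        return table;
    }

    // Classifies every id seen in the table, either as a key or a reference.
    static Map<Counts, Long> verify(Map<Long, Long> table) {
        Set<Long> referenced = new HashSet<>(table.values());
        Map<Counts, Long> counts = new EnumMap<>(Counts.class);
        for (Counts c : Counts.values()) {
            counts.put(c, 0L);
        }
        // Defined nodes are fine whether or not something points at them.
        for (Long id : table.keySet()) {
            Counts c = referenced.contains(id)
                    ? Counts.REFERENCED : Counts.UNREFERENCED;
            counts.merge(c, 1L, Long::sum);
        }
        // A reference to an id with no row is a hole: data was lost.
        for (Long ref : referenced) {
            if (!table.containsKey(ref)) {
                counts.merge(Counts.UNDEFINED, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<Long, Long> table = buildLoop(25);
        System.out.println("intact:  " + verify(table));
        table.remove(7L); // simulate losing one node
        System.out.println("deleted: " + verify(table));
    }
}
```

With a 25 node loop, removing node 7 leaves 23 REFERENCED nodes, 1
UNREFERENCED node (node 8, which only node 7 pointed at) and 1 UNDEFINED id
(node 7 itself, still referenced by node 6). This is the same pattern as the
25 million node run above.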

Below are some timing statistics for running goraci on a 10 node cluster.
For Accumulo 1 billion nodes were generated and verified. For HBase 100
million nodes were generated and verified.

Store           | Task                   | Time    | Undef  | Unref | Ref
----------------+------------------------+---------+--------+-------+------------
accumulo-1.4.0  | Generator 10 100000000 | 35m 22s | N/A    | N/A   | N/A
accumulo-1.4.0  | Verify /tmp/goraci1 19 | 9m 36s  | 0      | 0     | 1000000000
hbase-0.92.1    | Generator 10 10000000  | 17m 53s | N/A    | N/A   | N/A
hbase-0.92.1    | Verify /tmp/goraci2 40 | 33m 22s | 0      | 0     | 100000000

For the accumulo run, the table was presplit into 8 tablets using the
following command in the accumulo shell. This was done before the generator
was started.

  addsplits -t ci \x10 \x20 \x30 \x40 \x50 \x60 \x70

[0] http://accumulo.apache.org
[1] http://gora.apache.org
[2] http://gora.apache.org/docs/current/gora-conf.html
[3] https://issues.apache.org/jira/browse/HADOOP-6945
About
Simple test suite that ensures data is not lost at scale.
Languages
- Java 91.1%
- Shell 8.9%