Building and Running The Framework
IMPORTANT! This document assumes that you have Hadoop, HDFS, and HBase all running properly on whatever machine you wish to run the hadoop sleuthkit framework on! It is intended to help you get the framework running on a pseudo-distributed Hadoop setup. Note that this project uses HBase tables named 'entries' and 'hash', so make sure those tables do not exist before you try running the project.
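If those tables are left over from an earlier run, you can remove them from the HBase shell. For example (a sketch; only do this if you are certain nothing else uses these tables):
% hbase shell
hbase> disable 'entries'
hbase> drop 'entries'
hbase> disable 'hash'
hbase> drop 'hash'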
The pipeline code does end-to-end processing of a directory of documents (text extraction, document vectorization, cluster generation, etc.). To build it, do the following (a consolidated example shell session appears after this list):
- Run maven in the pre-build/ folder.
- Run maven in the root project folder. This builds all of the subprojects and then builds the pipeline jar, which is output to the pipeline/target folder. This is the final jar you will run the framework from.
- Check out fsrip (git clone https://github.com/jonstewart/fsrip.git) and build it with 'scons'. You may need to install some dependencies in order for fsrip to build successfully; the build process should help reveal which dependencies you are missing.
- Add FSRIP_ROOT/deps/lib to LD_LIBRARY_PATH and FSRIP_ROOT/build/src/ to your PATH.
- Set the HADOOP_HOME environment variable.
- Copy in the report template:
% rm -Rf reports/data
% hadoop fs -copyFromLocal reports /texaspete/template/reports
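Taken together, the steps above might look something like the following shell session. The checkout location, the Hadoop path, and the 'install' maven goal are illustrative assumptions; adjust them to your environment:
% cd pre-build && mvn install && cd ..
% mvn install
% git clone https://github.com/jonstewart/fsrip.git
% cd fsrip && scons && cd ..
% export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD/fsrip/deps/lib
% export PATH=$PATH:$PWD/fsrip/build/src
% export HADOOP_HOME=/usr/local/hadoop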
As a last step, extract the dependency jars from the output pipeline jar (these are in the /lib directory of the pipeline jar) to the $HADOOP_HOME/lib directory. This will ensure that these jars are always available to be loaded by any of the hadoop map/reduce jobs. You MUST RESTART HADOOP after performing this step or the pipeline will not work.
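One way to do this, assuming the jar name below matches the one produced by your build, is to unzip the lib/ directory out of the job jar, copy its contents into $HADOOP_HOME/lib, and then restart Hadoop:
% cd pipeline/target
% unzip sleuthkit-pipeline-1-SNAPSHOT-job.jar 'lib/*' -d /tmp/pipeline-deps
% cp /tmp/pipeline-deps/lib/*.jar $HADOOP_HOME/lib/
% $HADOOP_HOME/bin/stop-all.sh && $HADOOP_HOME/bin/start-all.sh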
Before running the project, you need to put a file with Java regexes on HDFS. After we extract text from files, we search for these regular expressions in the text. We cluster only files with regular expression matches. A file with a few (uninteresting) regexes is in the match project folder:
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% hadoop fs -put match/src/main/resources/regexes /texaspete/regexes
You can, of course, make your own regexes. Any standard Java regex will work, with one regex per line. If you use Java globbing, it should take the first glob as the "match", but this functionality is still experimental.
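For illustration only, a regexes file might contain patterns like these (one per line; they are examples, not part of the project):
\d{3}-\d{2}-\d{4}
[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}
(?i)confidential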
Now that all the code has been built, have a look at the output in the pipeline/target directory.
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk% cd pipeline/target
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk/pipeline/target% ls
archive-tmp/ maven-archiver/ *sleuthkit-pipeline-1-SNAPSHOT-job.jar*
classes/ sleuthkit-pipeline-1-SNAPSHOT.jar surefire/
jdoe@hpa-linux-jdoe:~/projects/TXPETE/trunk/pipeline/target%
The job jar is the one you should use with Hadoop. First, you will want to run fsrip on an image to create a JSON metadata file for it. You will then want to copy BOTH the JSON metadata file AND the image file onto HDFS (the usual directory for this is /texaspete/img, though you can put them wherever you like).
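As a rough sketch, assuming fsrip's dumpfs subcommand is the one that emits the JSON metadata (the subcommand name is a guess; run fsrip without arguments to see its actual usage) and that your image lives at /evidence/image.dd (an illustrative path):
% fsrip dumpfs /evidence/image.dd > image.json
% hadoop fs -mkdir /texaspete/img
% hadoop fs -put /evidence/image.dd /texaspete/img/image.dd
% hadoop fs -put image.json /texaspete/img/image.json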
The recommended way to run the full pipeline is with the tpkickoff.sh script located in the bin folder of the project directory. This runs the entire ingest/analysis/reporting cycle of mapreduce jobs on a single hard drive. You need to supply three parameters to this script. The first is a friendly name for the image (any alphanumeric name that is a valid HDFS file name). This is used only for convenience; most jobs relating to this image will carry this friendly name if you search for them in the Hadoop job tracker. The second is the path to the image on the local file system (NOT on HDFS; the file is copied there by the script), and the third is the path to the directory containing the job jar you built previously.
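A typical invocation might look like this (the friendly name, image path, and jar directory are illustrative):
% bin/tpkickoff.sh my-image-01 /evidence/image.dd pipeline/target/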
When the tpkickoff.sh script completes, the output will be a reports.zip file inside of the /texaspete/data/$IMAGE_HASH/ folder.
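You can then pull the report back to the local file system, substituting the hash of your image for $IMAGE_HASH:
% hadoop fs -get /texaspete/data/$IMAGE_HASH/reports.zip .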
Note that if you wish to run the individual components of the pipeline separately, you should be able to do so from this jar by invoking their Java classes directly. Most have usage/help lines which may be of use.
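For example, an individual step can be launched through Hadoop's jar runner; the class name below is a placeholder, so substitute the fully qualified class of the component you want to run:
% hadoop jar sleuthkit-pipeline-1-SNAPSHOT-job.jar com.example.SomePipelineStep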