ScaleUnlimited/wikipedia-ngrams

This project contains:

- A stand-alone tool to convert the Wikipedia article dump (as XML) into multiple
  text files, each consisting of one <page>xxxx</page> record per line. This is
  then suitable for input to Hadoop. The use of this tool is described in the
  separate README-Splitting file.

- A Hadoop-based workflow that processes the dump, extracts ngrams, and generates
  counts. The rest of this document describes that tool.

======================================================
Prerequisites
======================================================

This section assumes that you are set up to use Elastic MapReduce. If not, then
you should complete the steps described in the first module of this course. In
particular, you'll need your AWS Access Key, Secret Key, and a keypair generated
by AWS.

Separately, you may want to pre-configure FoxyProxy in Firefox if you wish to
view the job details via the Hadoop JobTracker GUI. For instructions, see the
"How to Install Foxy Proxy" section of the Amazon Elastic MapReduce Developer
Guide, which you can view or download from
http://aws.amazon.com/documentation/elasticmapreduce/

======================================================
Running the processor job using EMR
======================================================

1. Create a bucket in S3, using the AWS Console:

   https://console.aws.amazon.com/s3/home

   For example, call this bucket "aws-test-99". Inside of this bucket, create
   three directories called "job", "logs", and "results".

2. Build the job jar [on your dev machine]:

   % ant clean job

   This will create the wikipedia-ngrams-job.jar Hadoop job jar file in your
   build sub-directory.

   If you have Hadoop installed on your development machine, you can try running
   it locally via:

   % hadoop jar build/wikipedia-ngrams-job.jar -inputfile src/test/resources/enwiki-split.xml -outputdir build/test

   This will generate text output files in build/test/raw-counts and
   build/test/sorted-counts. To view the results, you can dump the output (these
   are text files), e.g.

   % cat build/test/sorted-counts/part-r-00000

3. Upload the job jar to <bucket name>/job/, using the AWS Console. For example,
   put it into aws-test-99/job/wikipedia-ngrams-job.jar

4. Start the Job Flow, using the AWS Console:

   https://console.aws.amazon.com/elasticmapreduce/home

   Click the "Create New Job Flow" button. This will start you down the
   six-dialog path to enlightenment...

   Define Job Flow
   ===============
   - Give it a reasonable name, and set the "Choose a Job Type" menu to
     "Custom JAR".
   - Click the "Continue" button.

   Specify Parameters
   ==================
   - Set the JAR Location to the job jar you uploaded
     (e.g. aws-test-99/job/wikipedia-ngrams-job.jar).
   - Set the JAR Arguments to "-inputfile s3n://datasets.elasticmapreduce/wikipediaxml/part-100.xml -outputdir s3n://<my bucket>/results -percent 10 -numreducers 1"
     [NOTE - you must change <my bucket> to the bucket you created above,
     e.g. aws-test-99]
   - Click the "Continue" button.

   Configure EC2 Instances
   =======================
   - Set the Master Instance Group's Instance Type menu to "Small (m1.small)".
   - Set the Core Instance Group's Instance Count to 2, and its Instance Type
     menu to "Large (m1.large)".
   - Leave the Task Instance Group's Instance Count set to 0.
   - Click the "Continue" button.

   Advanced Options
   ================
   - Set the Amazon EC2 Key Pair menu to the name of the key pair you created
     previously.
   - Set the Amazon S3 Log Path to s3n://<my bucket>/logs
     [NOTE - you must change <my bucket> to the bucket you created above,
     e.g. aws-test-99]
   - Leave everything else unchanged.
   - Click the "Continue" button.

   Bootstrap Actions
   =================
   - Leave the "Proceed with no Bootstrap Actions" radio button selected.
   - Click the "Continue" button.

   Review
   ======
   - Behold the myriad settings you have specified.
   - Click the "Create Job Flow" button.
   - Click the "Close" button on the final dialog.
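The console walkthrough above can also be approximated from the command line
with the AWS CLI's "aws emr create-cluster" command. The sketch below is not
part of the original instructions and is untested; the release label, instance
types, key pair name, and the s3:// vs. s3n:// path scheme are assumptions you
may need to adjust for what your account and EMR release actually support.

   # NOTE: illustrative only - release label, instance types, and the bucket
   # and key pair placeholders below are not values from this README.
   % aws emr create-cluster \
       --name "wikipedia-ngrams" \
       --release-label emr-5.36.0 \
       --ec2-attributes KeyName=<your keypair name> \
       --log-uri s3://<my bucket>/logs \
       --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
                         InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
       --steps Type=CUSTOM_JAR,Name=wikipedia-ngrams,Jar=s3://<my bucket>/job/wikipedia-ngrams-job.jar,Args=[-inputfile,s3://datasets.elasticmapreduce/wikipediaxml/part-100.xml,-outputdir,s3://<my bucket>/results,-percent,10,-numreducers,1] \
       --auto-terminate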
5. Monitor the Job Flow.

   The AWS Console will list your job in the Elastic MapReduce tab:

   https://console.aws.amazon.com/elasticmapreduce/home

   The state will initially be "STARTING", which will eventually change to
   "RUNNING". If the job fails for any reason, wait about 5 minutes, then
   download and inspect the log files that have been uploaded to S3. These will
   be found inside the <bucket name>/logs/ path you specified when defining the
   job, in the job-specific subdirectory. For example,
   aws-test-99/logs/j-T6AYPJJ31MRH/

6. Using the Hadoop GUI.

   If you want to view high-level details of the job as it's running, using your
   browser, then it's easy - just get the "Master Public DNS Name" by selecting
   the running job in the list of Elastic MapReduce Job Flows, paste that into
   your browser, set the port to 9100, and you'll be able to monitor things like
   percentage completion for the various map and reduce tasks, total data read
   and written, etc.

7. [Advanced] Proxying the Hadoop GUI.

   If you want to view all the details of the job as it's running, using the
   Hadoop GUI via your browser, then you will need to have previously installed
   and configured FoxyProxy in Firefox as described above.

   Once you have successfully configured FoxyProxy (e.g. to proxy port 8157),
   you need to set up an SSH SOCKS server. Open a new terminal window, and enter:

   % ssh -i <path to keypair file> -ND 8157 hadoop@<public DNS name for master server>

   The public DNS name is available via the AWS Management Console, as per
   above. Once the SSH SOCKS server is running, you can open a browser window
   to the URL:

   <public DNS name>:9100

   This will show you the Hadoop JobTracker GUI. Note that once the job
   terminates, this GUI will no longer be available, so you'll only have a few
   minutes to try this out.

8. When the job has completed (about 6-10 minutes with the above configuration),
   you can download and view the results. Use the AWS Management Console to
   download the <my bucket>/results/sorted-counts/part-r-00000 file to your
   local disk, and then open it with any text editor.
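   As an alternative to downloading via the console, a command-line copy also
   works if you have the AWS CLI installed. This is a sketch, not part of the
   original instructions; it assumes the example bucket name used above
   (aws-test-99) and that the CLI is already configured with your credentials.

   # Assumes the AWS CLI is set up with your AWS Access Key and Secret Key.
   % aws s3 cp s3://aws-test-99/results/sorted-counts/part-r-00000 .
   % less part-r-00000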