ScaleUnlimited/wikipedia-ngrams

This project contains:

- A stand-alone tool to convert the Wikipedia article dump (as XML) into multiple
  text files, each consisting of one <page>xxxx</page> record per line. This is
  then suitable for input to Hadoop. The use of this tool is described in the
  separate README-Splitting file.

- A Hadoop-based workflow that processes the dump, extracts ngrams, and generates
  counts. The rest of this document describes that tool.

======================================================
Prerequisites
======================================================

This section assumes that you are set up to use Elastic MapReduce. If not, then
you should complete the steps described in the first module of this course. In
particular, you'll need your AWS Access Key, Secret Key, and a keypair generated
by AWS.

Separately, you may want to pre-configure FoxyProxy in Firefox if you wish to
view the job details via the Hadoop JobTracker GUI. For instructions, see the
"How to Install Foxy Proxy" section of the Amazon Elastic MapReduce Developer
Guide, which you can view or download from
http://aws.amazon.com/documentation/elasticmapreduce/

======================================================
Running the processor job using EMR
======================================================

1. Create a bucket in S3, using the AWS Console:

   https://console.aws.amazon.com/s3/home

   For example, call this bucket "aws-test-99". Inside of this bucket, create
   three directories called "job", "logs", and "results".

2. Build the job jar [on your dev machine]:

   % ant clean job

   This will create the wikipedia-ngrams-job.jar Hadoop job jar file in your
   build sub-directory.

   If you have Hadoop installed on your development machine, you can try running
   it locally via:

   % hadoop jar build/wikipedia-ngrams-job.jar -inputfile src/test/resources/enwiki-split.xml -outputdir build/test

   This will generate text output files in build/test/raw-counts and
   build/test/sorted-counts. To view the results, you can dump the output (these
   are text files), e.g.

   % cat build/test/sorted-counts/part-r-00000

3. Upload the job jar to <bucket name>/job/, using the AWS Console. For example,
   put it into aws-test-99/job/wikipedia-ngrams-job.jar

4. Start the Job Flow, using the AWS Console:

   https://console.aws.amazon.com/elasticmapreduce/home

   Click the "Create New Job Flow" button. This will start you down the
   six-dialog path to enlightenment...

   Define Job Flow
   ===============
   - Give it a reasonable name, and set the "Choose a Job Type" menu to
     "Custom JAR".
   - Click the "Continue" button.

   Specify Parameters
   ==================
   - Set the JAR Location to the job jar you uploaded
     (e.g. aws-test-99/job/wikipedia-ngrams-job.jar).
   - Set the JAR Arguments to "-inputfile s3n://datasets.elasticmapreduce/wikipediaxml/part-100.xml -outputdir s3n://<my bucket>/results -percent 10 -numreducers 1"
     [NOTE - you must change <my bucket> to the bucket you created above,
     e.g. aws-test-99]
   - Click the "Continue" button.

   Configure EC2 Instances
   =======================
   - Set the Master Instance Group's Instance Type menu to "Small (m1.small)".
   - Set the Core Instance Group's Instance Count to 2, and its Instance Type
     menu to "Large (m1.large)".
   - Leave the Task Instance Group's Instance Count set to 0.
   - Click the "Continue" button.

   Advanced Options
   ================
   - Set the Amazon EC2 Key Pair menu to the name of the key pair you created
     previously.
   - Set the Amazon S3 Log Path to s3n://<my bucket>/logs
     [NOTE - you must change <my bucket> to the bucket you created above,
     e.g. aws-test-99]
   - Leave everything else unchanged.
   - Click the "Continue" button.

   Bootstrap Actions
   =================
   - Leave the "Proceed with no Bootstrap Actions" radio button selected.
   - Click the "Continue" button.

   Review
   ======
   - Behold the myriad settings you have specified.
   - Click the "Create Job Flow" button.
   - Click the "Close" button on the final dialog.
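The console walkthrough above can also be approximated from the command line
with the AWS CLI's "aws emr create-cluster" command. The sketch below is not
part of the original instructions and is untested; the release label, instance
types, key pair name, and the s3:// vs. s3n:// path scheme are assumptions you
may need to adjust for what your account and EMR release actually support.

   # NOTE: illustrative only - release label, instance types, and the bucket
   # and key pair placeholders below are not values from this README.
   % aws emr create-cluster \
       --name "wikipedia-ngrams" \
       --release-label emr-5.36.0 \
       --ec2-attributes KeyName=<your keypair name> \
       --log-uri s3://<my bucket>/logs \
       --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large \
                         InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large \
       --steps Type=CUSTOM_JAR,Name=wikipedia-ngrams,Jar=s3://<my bucket>/job/wikipedia-ngrams-job.jar,Args=[-inputfile,s3://datasets.elasticmapreduce/wikipediaxml/part-100.xml,-outputdir,s3://<my bucket>/results,-percent,10,-numreducers,1] \
       --auto-terminate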
5. Monitor the Job Flow.

   The AWS Console will list your job in the Elastic MapReduce tab:

   https://console.aws.amazon.com/elasticmapreduce/home

   The state will initially be "STARTING", which will eventually change to
   "RUNNING". If the job fails for any reason, wait about 5 minutes, then
   download and inspect the log files that have been uploaded to S3. These will
   be found inside the <bucket name>/logs/ path you specified when defining the
   job, in the job-specific subdirectory. For example,
   aws-test-99/logs/j-T6AYPJJ31MRH/

6. Using the Hadoop GUI.

   If you want to view high-level details of the job as it's running, using your
   browser, then it's easy - just get the "Master Public DNS Name" by selecting
   the running job in the list of Elastic MapReduce Job Flows, paste that into
   your browser, set the port to 9100, and you'll be able to monitor things like
   percentage completion for the various map and reduce tasks, total data read
   and written, etc.

7. [Advanced] Proxying the Hadoop GUI.

   If you want to view all the details of the job as it's running, using the
   Hadoop GUI via your browser, then you will need to have previously installed
   and configured FoxyProxy in Firefox as described above.

   Once you have successfully configured FoxyProxy (e.g. to proxy port 8157),
   you need to set up an SSH SOCKS server. Open a new terminal window, and enter:

   % ssh -i <path to keypair file> -ND 8157 hadoop@<public DNS name for master server>

   The public DNS name is available via the AWS Management Console, as per
   above. Once the SSH SOCKS server is running, you can open a browser window
   to the URL:

   <public DNS name>:9100

   This will show you the Hadoop JobTracker GUI. Note that once the job
   terminates, this GUI will no longer be available, so you'll only have a few
   minutes to try this out.

8. When the job has completed (about 6-10 minutes with the above configuration),
   you can download and view the results. Use the AWS Management Console to
   download the <my bucket>/results/sorted-counts/part-r-00000 file to your
   local disk, and then open it with any text editor.
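   As an alternative to downloading via the console, a command-line copy also
   works if you have the AWS CLI installed. This is a sketch, not part of the
   original instructions; it assumes the example bucket name used above
   (aws-test-99) and that the CLI is already configured with your credentials.

   # Assumes the AWS CLI is set up with your AWS Access Key and Secret Key.
   % aws s3 cp s3://aws-test-99/results/sorted-counts/part-r-00000 .
   % less part-r-00000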