Instructions For Running Hadoop
-------------------------------
0) If necessary, download your favorite version of Hadoop from the
Apache download site and install it into a location where it's
accessible on all cluster nodes. Usually this is an NFS home
directory.
See below in 'Hadoop Patching' about patches that may be necessary
for Hadoop depending on your environment and Hadoop version.
See 'Convenience Scripts' in README about
misc/magpie-download-and-setup.sh, which may make the
downloading and patching easier.
1) Select an appropriate submission script for running your job. You
can find them in the directory submission-scripts/, with Slurm
Sbatch scripts using srun in script-sbatch-srun, Moab Msub+Slurm
scripts using srun in script-msub-slurm-srun, Moab Msub+Torque
scripts using pdsh in script-msub-torque-pdsh, LSF scripts using
mpirun in script-lsf-mpirun, and Flux scripts in
script-flux-batch-run.
You'll likely want to start with the base hadoop script
(e.g. magpie.sbatch-srun-hadoop) for your scheduler/resource
manager. If you wish to configure more, you can choose to start
with the base script (e.g. magpie.sbatch-srun) which contains all
configuration options.
2) Set up your job essentials at the top of the submission script. As
an example, the following are the essentials for running with Moab:
#MSUB -l nodes : Set how many nodes you want in your job
#MSUB -l walltime : Set the time for this job to run
#MSUB -l partition : Set the job partition
#MSUB -q <my batch queue> : Set to batch queue
MOAB_JOBNAME : Set your job name.
MAGPIE_SCRIPTS_HOME : Set where your scripts are
MAGPIE_LOCAL_DIR : For scratch space files
MAGPIE_JOB_TYPE : This should be set to 'hadoop' initially
JAVA_HOME : Point this at your Java installation; Hadoop requires Java.
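For example, the top of a Moab submission script might contain
something like the following sketch; all values and paths here are
hypothetical and should be adjusted for your site:

  #MSUB -l nodes=9
  #MSUB -l walltime=01:30:00
  #MSUB -l partition=mycluster
  #MSUB -q pbatch

  export MAGPIE_SCRIPTS_HOME="${HOME}/magpie"
  export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
  export MAGPIE_JOB_TYPE="hadoop"
  export JAVA_HOME="/usr/lib/jvm/jre-1.7.0-oracle.x86_64/"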
3) Now set up the essentials for Hadoop.
HADOOP_SETUP : Set to yes
HADOOP_SETUP_TYPE : Set whether you want to run both Yarn & HDFS or
just one of them. Most users will want to set this to "MR".
HADOOP_VERSION : Make sure your build matches HADOOP_SETUP_TYPE
(i.e. don't say you want MapReduce 1 and point to a Hadoop 2.0 build)
HADOOP_HOME : Where your hadoop code is. Typically in an NFS mount.
HADOOP_LOCAL_DIR : A small place for conf files and log files local
to each node. Typically a directory under /tmp.
HADOOP_FILESYSTEM_MODE : Most will likely want "hdfsoverlustre" or
"hdfsovernetworkfs". See below for details on HDFS over Lustre and
HDFS over NetworkFS.
HADOOP_HDFSOVERLUSTRE_PATH or equivalent: For HDFS over Lustre, you
need to set this. If not using HDFS over Lustre, set the
appropriate path for your filesystem mode choice.
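Continuing the hypothetical sketch from step 2, the Hadoop essentials
might look like the following; the paths and version are illustrative
only:

  export HADOOP_SETUP=yes
  export HADOOP_SETUP_TYPE="MR"
  export HADOOP_VERSION="2.7.2"
  export HADOOP_HOME="${HOME}/hadoop/hadoop-${HADOOP_VERSION}"
  export HADOOP_LOCAL_DIR="/tmp/${USER}/hadoop"
  export HADOOP_FILESYSTEM_MODE="hdfsoverlustre"
  export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/${USER}/hdfsoverlustre/"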
4) Select how your job will run by setting MAGPIE_JOB_TYPE and/or
HADOOP_JOB. Initially, you'll likely want to set MAGPIE_JOB_TYPE
to 'hadoop' and HADOOP_JOB to 'terasort'. This will allow
you to run a pre-written job to make sure things are setup
correctly.
After this, you may want to run with MAGPIE_JOB_TYPE set to
'interactive' to play around and figure things out. In the job
output you will see output similar to the following:
ssh node70
setenv JAVA_HOME "/usr/lib/jvm/jre-1.7.0-oracle.x86_64/"
setenv HADOOP_HOME "/home/username/hadoop-2.7.2"
setenv HADOOP_CONF_DIR "/tmp/username/hadoop/ajobname/1174570/conf"
These instructions will inform you how to login to the master node
of your allocation and how to initialize your session. Once in
your session, you can do as you please. For example, you can
interact with the Hadoop filesystem (bin/hadoop fs ...) or run a
job (bin/hadoop jar ...). There will also be instructions in your
job output on how to tear the session down cleanly if you wish to
end your job early.
Once you have figured out how you wish to run your job, you will
likely want to run with MAGPIE_JOB_TYPE set to 'script' mode.
Create a script that will run your job/calculation automatically,
set it in MAGPIE_JOB_SCRIPT, and then run your job. You can find
an example job script in examples/hadoop-example-job-script.
See "Exported Environment Variables" in README for information on
common exported environment variables that may be useful in
scripts.
See below in "Hadoop Exported Environment Variables", for
information on Hadoop specific exported environment variables that
may be useful in scripts.
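As a rough sketch, a script set in MAGPIE_JOB_SCRIPT might look like
the following; the jar, example program, and HDFS paths are
hypothetical, while HADOOP_HOME and HADOOP_CONF_DIR are exported by
Magpie before the script runs:

  #!/bin/bash
  # Run an example MapReduce job against data already loaded into HDFS
  cd ${HADOOP_HOME}
  bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar \
      wordcount /user/${USER}/input /user/${USER}/output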
5) Submit your job to the cluster by running "sbatch -k
./magpie.sbatchfile" for Slurm, "msub ./magpie.msubfile" for
Moab, or "bsub < ./magpie.lsffile" for LSF.
Add any other options you see fit.
6) Look at your job output file to see your output. There will also
be some notes/instructions/tips in the output file for viewing the
status of your job in a web browser, environment variables you wish
to set if interacting with it, etc.
See "General Advanced Usage" in README for additional tips.
See below in "Hadoop Advanced Usage" for additional Hadoop tips.
Hadoop Exported Environment Variables
-------------------------------------
The following environment variables are exported when your job is run
and may be useful in scripts in your run or in pre/post run scripts.
HADOOP_MASTER_NODE : the master node of the Hadoop allocation
HADOOP_WORKER_COUNT : number of compute/data nodes in your allocation
for Hadoop. May be useful for adjusting run time
options such as reducer count.
HADOOP_WORKER_CORE_COUNT : Total cores on worker nodes in the
allocation. May be useful for adjusting run
time options such as reducer count.
HADOOP_NAMENODE : the master namenode of the Hadoop allocation. Often
used for accessing HDFS when the namenode + port
must be specified in a script.
(e.g. hdfs://${HADOOP_NAMENODE}:${HADOOP_NAMENODE_PORT}/user/...)
Exported only if an HDFS-type file system is used.
HADOOP_NAMENODE_PORT : the port of the namenode. Often used for
accessing HDFS when the namenode + port must be
specified in a script.
(e.g. hdfs://${HADOOP_NAMENODE}:${HADOOP_NAMENODE_PORT}/user/...)
Exported only if an HDFS-type file system is used.
HADOOP_CONF_DIR : the directory in which Hadoop configuration files
local to the node are stored
HADOOP_LOG_DIR : the directory in which Hadoop log files are stored
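For instance, a job script might use these variables to reference HDFS
explicitly and to scale the reduce count with the allocation size. The
jar, class, and paths below are hypothetical, and the -D option
assumes the job parses generic options via ToolRunner:

  # List a directory by addressing the namenode directly
  ${HADOOP_HOME}/bin/hdfs dfs -ls \
      hdfs://${HADOOP_NAMENODE}:${HADOOP_NAMENODE_PORT}/user/${USER}/

  # Use roughly two reducers per worker node
  ${HADOOP_HOME}/bin/hadoop jar myjob.jar MyJob \
      -Dmapreduce.job.reduces=$((HADOOP_WORKER_COUNT * 2)) \
      /user/${USER}/input /user/${USER}/output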
Hadoop Convenience Scripts
--------------------------
The following job scripts may be convenient. They can be run by
setting MAGPIE_JOB_TYPE to 'script' and setting MAGPIE_JOB_SCRIPT
to the script (see the example after this list).
job-scripts/hadoop-rebalance-hdfs-over-lustre-or-hdfs-over-networkfs-if-increasing-nodes-script.sh
- See "Basics of HDFS over Lustre/NetworkFS" section for details.
job-scripts/hadoop-hdfs-fsck-cleanup-corrupted-blocks-script.sh -
Cleanup/remove corrupted blocks in HDFS.
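For example, to run the HDFS fsck cleanup script, the relevant
settings in your submission script might be:

  export MAGPIE_JOB_TYPE="script"
  export MAGPIE_JOB_SCRIPT="${MAGPIE_SCRIPTS_HOME}/job-scripts/hadoop-hdfs-fsck-cleanup-corrupted-blocks-script.sh"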
Example Job Output for Hadoop running Terasort
----------------------------------------------
The following is an example job output of Magpie running Hadoop and
running a Terasort. This is run over HDFS over Lustre. Sections of
extraneous text have been left out.
While this output is specific to using Magpie with Hadoop, the output
when using Spark, Storm, Hbase, etc. is not all that different.
1) First we see that HDFS over Lustre is being setup by formatting the
HDFS Namenode.
*******************************************************
* Performing Post Setup
*******************************************************
*******************************************************
* Formatting HDFS Namenode
*******************************************************
16/07/18 23:18:20 INFO namenode.NameNode: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NameNode
STARTUP_MSG: host = apex69.llnl.gov/192.168.123.69
STARTUP_MSG: args = [-format]
STARTUP_MSG: version = 2.7.2
<snip>
<snip>
<snip>
16/07/18 23:18:26 INFO common.Storage: Storage directory /p/lscratchg/achu/testing/hdfsoverlustre/1174570/node-0/dfs/name has been successfully formatted.
16/07/18 23:18:26 INFO namenode.NNStorageRetentionManager: Going to retain 1 images with txid >= 0
16/07/18 23:18:26 INFO util.ExitUtil: Exiting with status 0
16/07/18 23:18:26 INFO namenode.NameNode: SHUTDOWN_MSG:
/************************************************************
SHUTDOWN_MSG: Shutting down NameNode at apex69.llnl.gov/192.168.123.69
************************************************************/
*******************************************************
* Post Setup Complete
*******************************************************
2) Next we get some details of the job
*******************************************************
* Magpie General Job Info
*
* Job Nodelist: apex[69-77]
* Job Nodecount: 9
* Job Timelimit in Minutes: 90
* Job Name: test
* Job ID: 1174570
*
*******************************************************
3) Hadoop begins to launch and startup daemons on all cluster nodes.
Starting hadoop
16/07/18 23:18:48 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Starting namenodes on [apex69]
apex69: starting namenode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-namenode-apex69.out
apex75: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex75.out
apex77: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex77.out
apex74: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex74.out
apex73: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex73.out
apex72: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex72.out
apex70: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex70.out
apex76: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex76.out
apex71: starting datanode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-datanode-apex71.out
Starting secondary namenodes [apex69]
apex69: starting secondarynamenode, logging to /tmp/achu/hadoop/test/1174570/log/hadoop-achu-secondarynamenode-apex69.out
16/07/18 23:19:10 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
starting yarn daemons
starting resourcemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-resourcemanager-apex69.out
apex70: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex70.out
apex73: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex73.out
apex77: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex77.out
apex75: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex75.out
apex76: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex76.out
apex71: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex71.out
apex72: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex72.out
apex74: starting nodemanager, logging to /tmp/achu/hadoop/test/1174570/log/yarn-achu-nodemanager-apex74.out
Waiting 30 seconds to allows Hadoop daemons to setup
4) Next, we see output with details of the Hadoop setup. You'll find
addresses indicating web services you can access to get detailed
job information. You'll also find information about how to login
to access Hadoop directly and how to shut down the job early if you
so desire.
*******************************************************
*
* Hadoop Information
*
* You can view your Hadoop status by launching a web browser and pointing to ...
*
* Yarn Resource Manager: http://apex69:8088
*
* Job History Server: http://apex69:19888
*
* HDFS Namenode: http://apex69:50070
* HDFS DataNode: http://<DATANODE>:50075
*
* HDFS can be accessed directly at:
*
* hdfs://apex69:54310
*
* To access Hadoop directly, you'll want to:
*
* ssh apex69
* setenv JAVA_HOME "/usr/lib/jvm/jre-1.7.0-oracle.x86_64/"
* setenv HADOOP_HOME "/home/achu/hadoop/hadoop-2.7.2"
* setenv HADOOP_CONF_DIR "/tmp/achu/hadoop/test/1174570/conf"
*
* Then you can do as you please. For example to interact with the Hadoop filesystem:
*
* $HADOOP_HOME/bin/hdfs dfs ...
*
* To launch jobs you'll want to:
*
* $HADOOP_HOME/bin/hadoop jar ...
*
* To end/cleanup your session & kill all daemons, login and set
* environment variables per the instructions above, then run:
*
* $HADOOP_HOME/sbin/stop-yarn.sh
* $HADOOP_HOME/sbin/stop-dfs.sh
* $HADOOP_HOME/sbin/mr-jobhistory-daemon.sh stop historyserver
*
*******************************************************
<snip>
<snip>
<snip>
*******************************************************
* Run
*
* ssh apex69 kill -s 10 27588
*
* to exit job early.
*
* Warning: killing early may not kill jobs submitted to an internally
* managed scheduler within Magpie. The job will be canceled during teardown
* of daemons. Extraneous error messages from your job may occur until then.
*
* Magpie is aware that Yarn has been launched. If user wishes to
* kill all currently submitted Yarn jobs in the PREP or
* RUNNING state to be killed before killing the job, run:
*
* ssh apex69 kill -s 28 27588
*******************************************************
5) The job then runs Teragen
*******************************************************
* Executing TeraGen
*******************************************************
Running bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar teragen -Dmapreduce.job.maps=38 -Ddfs.datanode.drop.cache.behind.reads=true -Ddfs.datanode.drop.cache.behind.writes=true 50000000 terasort-teragen
16/07/18 23:20:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/18 23:20:01 INFO client.RMProxy: Connecting to ResourceManager at apex69/192.168.123.69:8032
16/07/18 23:20:17 INFO terasort.TeraSort: Generating 50000000 using 38
16/07/18 23:20:29 INFO mapreduce.JobSubmitter: number of splits:38
16/07/18 23:20:35 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468909152215_0001
16/07/18 23:20:35 INFO impl.YarnClientImpl: Submitted application application_1468909152215_0001
16/07/18 23:20:36 INFO mapreduce.Job: The url to track the job: http://apex69:8088/proxy/application_1468909152215_0001/
16/07/18 23:20:36 INFO mapreduce.Job: Running job: job_1468909152215_0001
16/07/18 23:20:46 INFO mapreduce.Job: Job job_1468909152215_0001 running in uber mode : false
16/07/18 23:20:46 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 23:21:01 INFO mapreduce.Job: map 13% reduce 0%
16/07/18 23:21:02 INFO mapreduce.Job: map 16% reduce 0%
16/07/18 23:21:03 INFO mapreduce.Job: map 19% reduce 0%
16/07/18 23:21:07 INFO mapreduce.Job: map 21% reduce 0%
16/07/18 23:21:08 INFO mapreduce.Job: map 22% reduce 0%
16/07/18 23:21:10 INFO mapreduce.Job: map 27% reduce 0%
16/07/18 23:21:13 INFO mapreduce.Job: map 33% reduce 0%
16/07/18 23:21:17 INFO mapreduce.Job: map 34% reduce 0%
16/07/18 23:21:20 INFO mapreduce.Job: map 42% reduce 0%
16/07/18 23:21:22 INFO mapreduce.Job: map 55% reduce 0%
16/07/18 23:21:23 INFO mapreduce.Job: map 62% reduce 0%
16/07/18 23:21:24 INFO mapreduce.Job: map 67% reduce 0%
16/07/18 23:21:25 INFO mapreduce.Job: map 80% reduce 0%
16/07/18 23:21:26 INFO mapreduce.Job: map 82% reduce 0%
16/07/18 23:21:28 INFO mapreduce.Job: map 87% reduce 0%
16/07/18 23:21:29 INFO mapreduce.Job: map 95% reduce 0%
16/07/18 23:21:30 INFO mapreduce.Job: map 97% reduce 0%
16/07/18 23:21:31 INFO mapreduce.Job: map 100% reduce 0%
16/07/18 23:22:30 INFO mapreduce.Job: Job job_1468909152215_0001 completed successfully
16/07/18 23:22:30 INFO mapreduce.Job: Counters: 31
File System Counters
FILE: Number of bytes read=0
FILE: Number of bytes written=4536430
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=3252
HDFS: Number of bytes written=5000000000
HDFS: Number of read operations=152
HDFS: Number of large read operations=0
HDFS: Number of write operations=76
Job Counters
Launched map tasks=38
Other local map tasks=38
Total time spent by all maps in occupied slots (ms)=5960619
Total time spent by all reduces in occupied slots (ms)=0
Total time spent by all map tasks (ms)=1986873
Total vcore-milliseconds taken by all map tasks=1986873
Total megabyte-milliseconds taken by all map tasks=5086394880
Map-Reduce Framework
Map input records=50000000
Map output records=50000000
Input split bytes=3252
Spilled Records=0
Failed Shuffles=0
Merged Map outputs=0
GC time elapsed (ms)=1115
CPU time spent (ms)=145580
Physical memory (bytes) snapshot=14764904448
Virtual memory (bytes) snapshot=103846907904
Total committed heap usage (bytes)=38351667200
org.apache.hadoop.examples.terasort.TeraGen$Counters
CHECKSUM=107387891658806101
File Input Format Counters
Bytes Read=0
File Output Format Counters
Bytes Written=5000000000
6) The job then runs Terasort
*******************************************************
* Executing TeraSort
*******************************************************
Running bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar terasort -Dmapreduce.job.reduces=16 -Ddfs.datanode.drop.cache.behind.reads=true -Ddfs.datanode.drop.cache.behind.writes=true terasort-teragen terasort-sort
16/07/18 23:23:01 INFO terasort.TeraSort: starting
16/07/18 23:23:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/18 23:23:02 INFO input.FileInputFormat: Total input paths to process : 38
Spent 170ms computing base-splits.
Spent 3ms computing TeraScheduler splits.
Computing input splits took 173ms
Sampling 10 splits of 38
Making 16 from 100000 sampled records
Computing parititions took 4961ms
Spent 5137ms computing partitions.
16/07/18 23:23:07 INFO client.RMProxy: Connecting to ResourceManager at apex69/192.168.123.69:8032
16/07/18 23:23:21 INFO mapreduce.JobSubmitter: number of splits:38
16/07/18 23:23:22 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468909152215_0002
16/07/18 23:23:22 INFO impl.YarnClientImpl: Submitted application application_1468909152215_0002
16/07/18 23:23:22 INFO mapreduce.Job: The url to track the job: http://apex69:8088/proxy/application_1468909152215_0002/
16/07/18 23:23:22 INFO mapreduce.Job: Running job: job_1468909152215_0002
16/07/18 23:23:28 INFO mapreduce.Job: Job job_1468909152215_0002 running in uber mode : false
16/07/18 23:23:28 INFO mapreduce.Job: map 0% reduce 0%
16/07/18 23:23:41 INFO mapreduce.Job: map 16% reduce 0%
16/07/18 23:23:44 INFO mapreduce.Job: map 32% reduce 0%
16/07/18 23:23:45 INFO mapreduce.Job: map 96% reduce 0%
16/07/18 23:23:46 INFO mapreduce.Job: map 100% reduce 0%
16/07/18 23:23:53 INFO mapreduce.Job: map 100% reduce 25%
16/07/18 23:23:54 INFO mapreduce.Job: map 100% reduce 67%
16/07/18 23:23:56 INFO mapreduce.Job: map 100% reduce 69%
16/07/18 23:23:57 INFO mapreduce.Job: map 100% reduce 70%
16/07/18 23:23:59 INFO mapreduce.Job: map 100% reduce 73%
16/07/18 23:24:00 INFO mapreduce.Job: map 100% reduce 80%
16/07/18 23:24:02 INFO mapreduce.Job: map 100% reduce 83%
16/07/18 23:24:03 INFO mapreduce.Job: map 100% reduce 85%
16/07/18 23:24:05 INFO mapreduce.Job: map 100% reduce 88%
16/07/18 23:24:06 INFO mapreduce.Job: map 100% reduce 91%
16/07/18 23:24:08 INFO mapreduce.Job: map 100% reduce 92%
16/07/18 23:24:09 INFO mapreduce.Job: map 100% reduce 97%
16/07/18 23:24:11 INFO mapreduce.Job: map 100% reduce 98%
16/07/18 23:24:12 INFO mapreduce.Job: map 100% reduce 99%
16/07/18 23:24:15 INFO mapreduce.Job: map 100% reduce 100%
16/07/18 23:24:41 INFO mapreduce.Job: Job job_1468909152215_0002 completed successfully
16/07/18 23:24:42 INFO mapreduce.Job: Counters: 50
File System Counters
FILE: Number of bytes read=5200006366
FILE: Number of bytes written=10406540642
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=5000004712
HDFS: Number of bytes written=5000000000
HDFS: Number of read operations=162
HDFS: Number of large read operations=0
HDFS: Number of write operations=32
Job Counters
Launched map tasks=38
Launched reduce tasks=16
Data-local map tasks=26
Rack-local map tasks=12
Total time spent by all maps in occupied slots (ms)=1282143
Total time spent by all reduces in occupied slots (ms)=2726265
Total time spent by all map tasks (ms)=427381
Total time spent by all reduce tasks (ms)=545253
Total vcore-milliseconds taken by all map tasks=427381
Total vcore-milliseconds taken by all reduce tasks=545253
Total megabyte-milliseconds taken by all map tasks=1094095360
Total megabyte-milliseconds taken by all reduce tasks=2791695360
Map-Reduce Framework
Map input records=50000000
Map output records=50000000
Map output bytes=5100000000
Map output materialized bytes=5200003648
Input split bytes=4712
Combine input records=0
Combine output record=0
Reduce input groups=50000000
Reduce shuffle bytes=5200003648
Reduce input records=50000000
Reduce output records=50000000
Spilled Records=100000000
Shuffled Maps =608
Failed Shuffles=0
Merged Map outputs=608
GC time elapsed (ms)=4693
CPU time spent (ms)=440090
Physical memory (bytes) snapshot=40267489280
Virtual memory (bytes) snapshot=183285145600
Total committed heap usage (bytes)=63197675520
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=5000000000
File Output Format Counters
Bytes Written=5000000000
7) With the job complete, Magpie now tears down the session and cleans
up all daemons.
Stopping hadoop
stopping yarn daemons
stopping resourcemanager
apex71: stopping nodemanager
apex73: stopping nodemanager
apex70: stopping nodemanager
apex75: stopping nodemanager
apex76: stopping nodemanager
apex74: stopping nodemanager
apex72: stopping nodemanager
apex77: stopping nodemanager
no proxyserver to stop
stopping historyserver
Saving namespace before shutting down hdfs ...
Running bin/hdfs dfsadmin -safemode enter
16/07/18 23:31:21 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Safe mode is ON
Running bin/hdfs dfsadmin -saveNamespace
16/07/18 23:31:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Save namespace successful
Running bin/hdfs dfsadmin -safemode leave
16/07/18 23:31:36 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Safe mode is OFF
16/07/18 23:31:37 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Stopping namenodes on [apex69]
apex69: stopping namenode
apex73: stopping datanode
apex76: stopping datanode
apex74: stopping datanode
apex72: stopping datanode
apex77: stopping datanode
apex71: stopping datanode
apex70: stopping datanode
apex75: stopping datanode
Stopping secondary namenodes [apex69]
apex69: stopping secondarynamenode
Hadoop Patching
---------------
Hadoop
- A patch to support non-ssh remote execution may be needed in some
environments. The patch can be applied directly to the startup
scripts; no recompilation of the source is needed.
Patches for this can be found in the patches/hadoop/ directory with
'alternate-ssh' in the filename.
The alternate remote execution command must be specified in the
environment variable MAGPIE_REMOTE_CMD.
This patch should be applied first.
- Hadoop terasort may require recompilation if running over 'rawnetworkfs'.
There is a bug in the Terasort example, leading to issues with
running Terasort against a parallel file system directly. I submitted
a patch in this Jira:
https://issues.apache.org/jira/browse/MAPREDUCE-5528
The patch to correct this is included in the patches/hadoop/ directory.
- If MAGPIE_NO_LOCAL_DIR support is desired, patches with the
"no-local-dir.patch" suffix in the filename can be found in
patches/hadoop/. See README.no-local-dir for more details.
This patch should be applied second, after 'alternate-ssh' if one is
available.
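As a sketch, applying one of these patches typically looks like the
following; the exact patch filename depends on your Hadoop version, so
the one below is hypothetical:

  cd ${HOME}/hadoop/hadoop-2.7.2
  # Adjust the -p strip level if the patch does not apply cleanly
  patch -p1 < ${HOME}/magpie/patches/hadoop/hadoop-2.7.2-alternate-ssh.patch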
Basics of HDFS over Lustre/NetworkFS
------------------------------------
Instead of using local disk, we designate a Lustre/network file system
directory to "emulate" local disk for each compute node. For example,
let's say you have 4 compute nodes. If we create the following paths
in Lustre:
/lustre/myusername/node-0
/lustre/myusername/node-1
/lustre/myusername/node-2
/lustre/myusername/node-3
We can then give each of these paths to one of the compute nodes,
which treats it like a local disk. HDFS operates on top of these
directories just as though there were a local disk on the server.
Q: Does that mean HDFS starts with a clean slate every time I
start a job?
A: No, using node ranks, "disk-paths" can be consistently assigned to
nodes so that all your HDFS files from a previous run can exist on
a later run. The next time you run your job, it doesn't matter
what server you're running on, b/c your scheduler/resource manager
will assign that node its appropriate rank. The node will
subsequently load HDFS from its appropriate directory.
Q: But that'll mean I have to consistently use the same number of
cluster nodes?
A: Generally speaking no, but you can hit issues if you don't. Just
imagine what HDFS issues would arise if you were on a traditional
Hadoop cluster and added or removed nodes.
Generally speaking, increasing the number of nodes you use for a
job is fine. Data you currently have in HDFS is still there and
readable, but it is not viewed as "local" according to HDFS and
more network transfers will have to happen. You may wish to
rebalance the HDFS blocks though. The convenience script
hadoop-rebalance-hdfs-over-lustre-or-hdfs-over-networkfs-if-increasing-nodes-script.sh
can be used to do this.
(Special Note: The start-balancer.sh that is normally used probably
will not work. All of the paths are in Lustre/NetworkFS, so the "free
space" on each "local" path is identical, messing up calculations for
balancing (i.e. no "local disk" is more/less utilized than another).)
Decreasing nodes is a bit more dangerous, as data can "disappear"
just like if it were on a traditional Hadoop cluster. If you try
to scale down the number of nodes, you should go through the
process of "decommissioning" nodes like on a real cluster,
otherwise you may lose data. You can decommission nodes through
the "decommissionhdfsnodes" option in HADOOP_MODE.
Q: Can multiple Magpie instances be run in parallel based on the same
lustre/networkfs path?
A: No, only one HDFS namenode can operate out of a specific set of
paths. If you imagine a traditional Hadoop cluster, what you
effectively are trying to do is start two HDFS file systems out of
the same local disk.
If you wish to run multiple Magpie instances with HDFS over Lustre
or HDFS over a NetworkFS, all you need to do is configure different
base directories for those instances, via the
HADOOP_HDFSOVERLUSTRE_PATH or HADOOP_HDFSOVERNETWORKFS_PATH
settings respectively.
If you are not concerned about data persisting between jobs, you
may also consider using the MAGPIE_ONE_TIME_RUN option to always
force HDFS paths to be different for each job. This setting may be
particularly useful if you are initially running tests/experiments on
CPU counts, node counts, settings, etc. and want to run many jobs
in parallel.
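For example, two jobs could run concurrently against the same Lustre
filesystem by giving each its own base directory (paths hypothetical):

  # Job A's submission script
  export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/myusername/hdfsoverlustre-projectA/"

  # Job B's submission script
  export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/myusername/hdfsoverlustre-projectB/"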
Q: What should HDFS replication be?
A: The scripts in this package default to HDFS replication of 3 when
HDFS over Lustre is used. If HDFS replication is > 1, it can
improve performance of your job reading in HDFS input b/c there
will be fewer network transfers of data (i.e. Hadoop may need to
transfer "non-local" data to another node). In addition, if a
datanode were to die (i.e. a node crashes) Hadoop has the ability
to survive the crash just like in a traditional Hadoop cluster.
The trade-off is space and HDFS writes vs HDFS reads. With lower
HDFS replication (lowest is 1) you save space and decrease time for
writes. With increased HDFS replication, you perform better on
reads.
Q: What if I need to upgrade the HDFS version I'm using?
A: If you want to use a different Hadoop version than what you started
with, you will have to go through the normal upgrade or rollback
procedures for Hadoop.
With Hadoop versions 2.2.0 and newer, there is a seamless upgrade
path done by specifying "-upgrade" when running the "start-dfs.sh"
script. This is implemented in the "upgradehdfs" option for
HADOOP_MODE in the launch scripts.
Pro vs Con of HDFS over Lustre/NetworkFS vs. Posix FS (e.g. rawnetworkfs, etc.)
-------------------------------------------------------------------------------
Here are some pros vs. cons of using a network filesystem directly vs
HDFS over Lustre/NetworkFS.
HDFS over Lustre/NetworkFS:
Pro: Portability w/ code that runs against a "traditional" Hadoop
cluster. If it runs on a "traditional" Hadoop cluster w/ local disk,
it should run fine w/ HDFS over Lustre/NetworkFS.
Con: Must always run job w/ Hadoop & HDFS running as a job.
Con: Must "import" and "export" data from HDFS using job runs, cannot
read/write directly. On some clusters, this may involve a double copy
of data. e.g. first need to copy data into the cluster, then run job to
copy data into HDFS over Lustre/NetworkFS.
Con: Possible difficulty changing job size on clusters.
Con: If HDFS replication > 1, more space used up.
Posix FS directly:
Pro: You can read/write files to Lustre without Hadoop/HDFS running.
Pro: Less space used up.
Pro: Can adjust job size easily.
Con: Portability issues w/ code that usually runs on HDFS. As an
example, HDFS has no concept of a working directory while Posix
filesystems do. In addition, absolute paths will be different. Code
will have to be adjusted for this.
Con: User must "manage" and organize their files directly, not gaining
the block advantages of HDFS. If not handled well, this can lead to
performance issues. For example, a Hadoop job that creates a 1
terabyte file under HDFS is creating a file made up of smaller HDFS
blocks. The same job may create a single 1 terabyte file when
accessing the Posix FS directly. In the case of Lustre, striping of
the file must be handled by the user to ensure satisfactory
performance.
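For example, on Lustre a user can pre-set striping on an output
directory with the lfs tool; the stripe count below is illustrative
and should be tuned for your filesystem:

  # Stripe new files in this directory across 16 OSTs
  lfs setstripe -c 16 /lustre/myusername/output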
Hadoop Troubleshooting
----------------------
1) What does "Waiting 30 more seconds for namenode to exit safe mode"
mean?
When HDFS (regardless if it is HDFS over Lustre or otherwise) is
being brought up, the HDFS namenode must communicate with all
datanodes to discover what blocks of data are available. This
process can take a while if you have a very large amount of data in
HDFS.
In rarer circumstances, it can be due to a bug/error in which the
HDFS namenode has not come up at all (i.e. the HDFS namenode isn't
running) or a large number of datanodes have not come up due to
errors. Check the appropriate log files to debug further.
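If you suspect the namenode is stuck, you can check safe mode status
directly from the master node (after setting the environment variables
per your job output), for example:

  ${HADOOP_HOME}/bin/hdfs dfsadmin -safemode get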
Hadoop Advanced Usage
---------------------
1) If your cluster has a local SSD or NVRAM on each node, set a path
to it via the HADOOP_LOCALSTORE environment variable in your
submission scripts. It will allow Hadoop to store intermediate
shuffle data to it. It should significantly improve performance.
2) Magpie configures the default number of reducers in a Hadoop job to
the number of compute nodes in your allocation. This is
significantly superior to the Hadoop default of 1; however, it may
not be optimal for many jobs. Users should play around with the
number of reducers in their mapreduce jobs to improve performance.
The default can be tweaked in the submission scripts via the
HADOOP_DEFAULT_REDUCE_TASKS environment variable.
3) Similarly Magpie configures the default number of map tasks in a
Hadoop job depending on the number of compute nodes in your
allocation and the number of cores on nodes. However, this may not
be optimal in many situations, such as nodes that have a low memory
to cpu ratio. Users should play around with the number of map
tasks in their mapreduce jobs to improve performance. The default
can be tweaked in the submission scripts via the
HADOOP_DEFAULT_MAP_TASKS environment variable.
4) Magpie configures a relatively conservative amount of memory for
Hadoop, currently 80% of system memory. While there should always
be a buffer to allow the operating system, system daemons, and
Hadoop daemons to operate, the 80% value may be on the conservative
side, and users wishing to push it higher to 90% or 95% of system
memory may see benefits.
Users can adjust the amount of memory used on each node through the
YARN_RESOURCE_MEMORY environment variable in the submission
scripts. (Note: this applies only to Hadoop 2.0 and up.)
Users can also turn off memory checking for YARN containers with
YARN_VMEM_CHECK and YARN_PMEM_CHECK environment variables in the
submission scripts.
5) Depending on whether your job is map heavy or reduce heavy,
adjusting the memory container sizes of map and reduce tasks may
also benefit job performance. These adjustments could lead to
larger buffer sizes on individual tasks if memory sizes are
increased or allow more tasks to be run in parallel if memory size
is decreased.
This can be adjusted through the HADOOP_CHILD_HEAPSIZE,
HADOOP_CHILD_MAP_HEAPSIZE, and HADOOP_CHILD_REDUCE_HEAPSIZE
environment variables in the submission scripts.
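A sketch of such adjustments in a submission script might look like
the following; the values are illustrative only, and the expected
units/format are described in the comments of the submission scripts:

  export YARN_RESOURCE_MEMORY=115000
  export HADOOP_CHILD_MAP_HEAPSIZE=4096
  export HADOOP_CHILD_REDUCE_HEAPSIZE=8192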
6) The mapreduce slowstart configuration determines the percentage of
map tasks that must complete before reducers begin. It defaults to
a very conservative value of 0.05 (i.e. 5%). This will be
non-optimal for many jobs, including jobs that have relatively
non-computationally heavy reduce tasks. The reduce tasks will take
up job resources (task slots, memory, etc.) that might be otherwise
useful for map tasks. Many users may wish to experiment with higher
percentages. The default can be tweaked in the
submission scripts via the HADOOP_MAPREDUCE_SLOWSTART environment
variable.
7) By default Magpie disables compression in Hadoop. For some jobs, if
the data is particularly large, the savings from reduced I/O may
outweigh the time spent compressing/decompressing data, benefiting
overall job performance. Compression can be enabled through the
HADOOP_COMPRESSION environment variable in the submission scripts.
8) By default Magpie runs as the user who submitted the job. In the
event that other users need to use the same Yarn cluster, you can
allow individual users or groups by manually adding the following to
your Magpie script:
YARN_QUEUES_EXTRA_USERS : Allows extra system users to be added to
the default yarn queue in order to run jobs.
YARN_QUEUES_EXTRA_GROUPS : Allows extra system groups to be added to
the default yarn queue in order to run jobs.