Instructions For Spark
----------------------
0) If necessary, download your favorite version of Spark from the
Apache download site and install it into a location where it's
accessible on all cluster nodes. Usually this is on an NFS home
directory.
See below in 'Spark Patching' about patches that are necessary for
Spark and patches that may be necessary depending on your
environment and Spark version.
See 'Convenience Scripts' in README about
misc/magpie-download-and-setup.sh, which may make the
downloading and patching easier.
1) Select an appropriate submission script for running your job. You
can find them in the directory submission-scripts/, with Slurm
Sbatch scripts using srun in script-sbatch-srun, Moab Msub+Slurm
scripts using srun in script-msub-slurm-srun, Moab Msub+Torque
scripts using pdsh in script-msub-torque-pdsh, LSF scripts using
mpirun in script-lsf-mpirun, and Flux scripts in
script-flux-batch-run.
You'll likely want to start with the base spark script
(e.g. magpie.sbatch-srun-spark) or spark w/ hdfs
(e.g. magpie.sbatch-srun-spark-with-hdfs) for your
scheduler/resource manager (other script starters include yarn or yarn
& hdfs). If you wish to configure more, you can choose to start
with the base script (e.g. magpie.sbatch-srun) which contains all
configuration options.
Note that you can run Spark without HDFS; files can be accessed
normally through "file://<path>".
2) Setup your job essentials at the top of the submission script. As
an example, the following are the essentials for running with Moab
(a rough sketch of these settings follows the list).
#MSUB -l nodes : Set how many nodes you want in your job
#MSUB -l walltime : Set the time for this job to run
#MSUB -l partition : Set the job partition
#MSUB -q <my batch queue> : Set to batch queue
MOAB_JOBNAME : Set your job name.
MAGPIE_SCRIPTS_HOME : Set where your scripts are
MAGPIE_LOCAL_DIR : For scratch space files
MAGPIE_JOB_TYPE : This should be set to 'spark' initially
JAVA_HOME : Set to your Java installation; Java is required to run Spark.
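As a rough sketch, the top of an Msub submission script might look
like the following (the node count, walltime, partition, queue, and
paths are illustrative placeholders; consult the comments in the
actual submission script for the exact form each setting takes):

   #!/bin/sh
   #MSUB -l nodes=9
   #MSUB -l walltime=01:30:00
   #MSUB -l partition=mypartition
   #MSUB -q pbatch

   export MOAB_JOBNAME="test"
   export MAGPIE_SCRIPTS_HOME="${HOME}/magpie"
   export MAGPIE_LOCAL_DIR="/tmp/${USER}/magpie"
   export MAGPIE_JOB_TYPE="spark"
   export JAVA_HOME="/usr/lib/jvm/jre-1.7.0-oracle.x86_64/"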
3) Setup the essentials for Spark.
SPARK_SETUP : Set to yes
SPARK_SETUP_TYPE : Set if you want to run with or without Yarn.
Most users will want to set to "STANDALONE".
SPARK_VERSION : Set appropriately.
SPARK_HOME : Where your Spark code is. Typically in an NFS
mount.
SPARK_LOCAL_DIR : A small place for conf files and log files local
to each node. Typically a directory under /tmp.
SPARK_LOCAL_SCRATCH_DIR : A scratch directory for Spark to use. If
a local SSD/NVRAM is available, it would be preferable to set this
to that path.
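A rough sketch of these Spark essentials in the submission script
(the version string and paths are assumptions to adapt to your
installation and site filesystems):

   export SPARK_SETUP=yes
   export SPARK_SETUP_TYPE="STANDALONE"
   export SPARK_VERSION="1.6.2-bin-hadoop2.6"
   export SPARK_HOME="${HOME}/spark-${SPARK_VERSION}"
   export SPARK_LOCAL_DIR="/tmp/${USER}/spark"
   export SPARK_LOCAL_SCRATCH_DIR="/p/lscratchg/${USER}/sparkscratch"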
4) Select how your job will run by setting MAGPIE_JOB_TYPE and/or
SPARK_JOB. Initially, you'll likely want to set MAGPIE_JOB_TYPE to
'spark' and SPARK_JOB to 'sparkpi'. This will allow you to run a
pre-written job to make sure things are set up correctly.
After this, you may want to run with MAGPIE_JOB_TYPE set to
'interactive' to play around and figure things out. In the job
output you will see output similar to the following:
ssh node70
setenv JAVA_HOME "/usr/lib/jvm/jre-1.7.0-oracle.x86_64/"
setenv SPARK_HOME "/home/username/spark-1.6.2-bin-hadoop2.6"
setenv SPARK_CONF_DIR "/tmp/username/spark/ajobname/1174962/conf"
These instructions will inform you how to log in to the master
node of your allocation and how to initialize your session. Once in
your session, you can do as you please. For example, you can run a
job using spark-class (bin/spark-class ...). There will also be
instructions in your job output on how to tear the session down
cleanly if you wish to end your job early.
Once you have figured out how you wish to run your job, you will
likely want to run with MAGPIE_JOB_TYPE set to 'script' mode.
Create a script that will run your job/calculation automatically,
set it in MAGPIE_JOB_SCRIPT, and then run your job. You can find
an example job script in examples/spark-example-job-script.
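As a rough sketch (not the contents of the shipped example, which
may differ), such a job script can be a small shell script that
submits your application against the Magpie-provided master; the
class name and jar path below are placeholders:

   #!/bin/sh
   # Submit the user application to the Spark master Magpie started.
   $SPARK_HOME/bin/spark-submit \
       --master spark://${SPARK_MASTER_NODE}:${SPARK_MASTER_PORT} \
       --class org.myorg.MyApp \
       ${HOME}/myapp/myapp.jar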
See "Exported Environment Variables" in README for information on
common exported environment variables that may be useful in
scripts.
See below in "Spark Exported Environment Variables" for
information on Spark-specific exported environment variables that
may be useful in scripts.
5) Spark does not require HDFS, but many choose to use it. If you do,
setup Hadoop w/ HDFS in your submission script. See README.hadoop
for Hadoop setup instructions. Simply use the prefix "hdfs://" or
"file://" appropriately for the filesystem you will access files
from.
You may wish to run with SPARK_JOB set to 'sparkwordcount' to test
the HDFS setup.
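For example, the same application could read its input from either
filesystem just by changing the URI scheme. A sketch, assuming the
application takes an input path as its first argument (the paths,
class, and jar are placeholders):

   # Input stored in HDFS (requires HDFS setup in the submission script)
   $SPARK_HOME/bin/spark-submit --class <class> <jar> hdfs:///user/${USER}/input
   # Input read directly from Lustre/NFS, no HDFS needed
   $SPARK_HOME/bin/spark-submit --class <class> <jar> file:///p/lscratchg/${USER}/input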
6) Submit your job into the cluster by running "sbatch -k
./magpie.sbatchfile" for Slurm, "msub ./magpie.msubfile" for
Moab, or "bsub < ./magpie.lsffile" for LSF.
Add any other options you see fit.
7) Look at your job output file to see your output. There will also
be some notes/instructions/tips in the output file for viewing the
status of your job in a web browser, environment variables you may
wish to set when interacting with it, etc.
See "General Advanced Usage" in README for additional tips.
See below in "Spark Advanced Usage" for additional Spark tips.
Spark Exported Environment Variables
------------------------------------
The following environment variables are exported when your job is run
and may be useful in scripts in your run or in pre/post run scripts.
SPARK_MASTER_NODE : the master node of the Spark allocation. Often
used for launching Spark jobs
(e.g. spark://${SPARK_MASTER_NODE}:${SPARK_MASTER_PORT})
SPARK_MASTER_PORT : the master port for running Spark jobs. Often
used for launching Spark jobs
(e.g. spark://${SPARK_MASTER_NODE}:${SPARK_MASTER_PORT})
Not exported if using Yarn.
SPARK_WORKER_COUNT : number of compute/data nodes in your allocation
for Spark. May be useful for adjusting run time
options such as parallelism count.
SPARK_WORKER_CORE_COUNT : Total cores on worker nodes in the allocation.
May be useful for adjusting run time options
such as parallelism count.
SPARK_CONF_DIR : the directory in which node-local Spark
                 configuration files are stored.
SPARK_LOG_DIR : the directory in which Spark log files are stored
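For example, extending the job script sketch from step 4 above, a
script might scale its partition count to the allocation and point
spark-submit at the Magpie-provided master (the 4x multiplier, class
name, and jar path are illustrative assumptions):

   # Hypothetical sizing: four partitions per core in the allocation.
   PARTITIONS=$(( SPARK_WORKER_CORE_COUNT * 4 ))
   $SPARK_HOME/bin/spark-submit \
       --master spark://${SPARK_MASTER_NODE}:${SPARK_MASTER_PORT} \
       --conf spark.default.parallelism=${PARTITIONS} \
       --class org.myorg.MyApp ${HOME}/myapp/myapp.jar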
See "Hadoop Exported Environment Variables" in README.hadoop, for
Hadoop environment variables that may be useful.
Example Job Output for Spark running SparkPi
--------------------------------------------
The following is an example job output of Magpie running Spark and
running SparkPi. This is run over HDFS over Lustre. Sections of
extraneous text have been left out.
While this output is specific to using Magpie with Spark, the output
when using Hadoop, Storm, HBase, etc. is not all that different.
1) First we get some details of the job
*******************************************************
* Magpie General Job Info
*
* Job Nodelist: apex[203-211]
* Job Nodecount: 9
* Job Timelimit in Minutes: 90
* Job Name: test
* Job ID: 1174575
*
*******************************************************
2) Next, Spark begins to launch and start up daemons on all cluster nodes.
Starting spark
starting org.apache.spark.deploy.master.Master, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.master.Master-1-apex217.out
apex224: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex224.out
apex222: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex222.out
apex218: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex218.out
apex225: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex225.out
apex219: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex219.out
apex223: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex223.out
apex220: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex220.out
apex221: starting org.apache.spark.deploy.worker.Worker, logging to /tmp/achu/spark/test/1174784/log/spark-achu-org.apache.spark.deploy.worker.Worker-1-apex221.out
Waiting 30 seconds to allow Spark daemons to setup
3) Next, we see output with details of the Spark setup. You'll find
addresses indicating web services you can access to get detailed
job information. You'll also find information about how to log in
to access Spark directly and how to shut down the job early if you
so desire.
*******************************************************
*
* Spark Information
*
* You can view your Spark status by launching a web browser and pointing to ...
*
* Spark Master: http://apex217:8080
* Spark Worker: http://<WORKERNODE>:8081
* Spark Application Dashboard: http://apex217:4040
*
* The Spark Master for running jobs is
*
* spark://apex217:7077
*
* To access Spark directly, you'll want to:
*
* ssh apex217
* setenv JAVA_HOME "/usr/lib/jvm/jre-1.7.0-oracle.x86_64/"
* setenv SPARK_HOME "/home/achu/hadoop/spark-1.6.2-bin-hadoop2.6"
* setenv SPARK_CONF_DIR "/tmp/achu/spark/test/1174784/conf"
*
* Then you can do as you please. For example to run a job:
*
* $SPARK_HOME/bin/spark-submit --class <class> <jar>
*
* To end/cleanup your session & kill all daemons, login and set
* environment variables per the instructions above, then run:
*
* $SPARK_HOME/sbin/stop-all.sh
*
*******************************************************
4) Then the SparkPi job is run
Running bin/run-example org.apache.spark.examples.SparkPi 8
16/07/18 23:28:32 INFO SparkContext: Running Spark version 1.6.2
16/07/18 23:28:32 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
16/07/18 23:28:32 INFO SecurityManager: Changing view acls to: achu
16/07/18 23:28:32 INFO SecurityManager: Changing modify acls to: achu
16/07/18 23:28:32 INFO SecurityManager: SecurityManager: authentication enabled; ui acls disabled; users with view permissions: Set(achu); users with modify permissions: Set(achu)
16/07/18 23:28:36 WARN ThreadLocalRandom: Failed to generate a seed from SecureRandom within 3 seconds. Not enough entrophy?
16/07/18 23:28:36 INFO Utils: Successfully started service 'sparkDriver' on port 37197.
16/07/18 23:28:36 INFO Slf4jLogger: Slf4jLogger started
16/07/18 23:28:36 INFO Remoting: Starting remoting
16/07/18 23:28:36 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://[email protected]:56249]
16/07/18 23:28:36 INFO Utils: Successfully started service 'sparkDriverActorSystem' on port 56249.
16/07/18 23:28:36 INFO SparkEnv: Registering MapOutputTracker
16/07/18 23:28:36 INFO SparkEnv: Registering BlockManagerMaster
16/07/18 23:28:36 INFO DiskBlockManager: Created local directory at /p/lscratchg/achu/testing/sparkscratch/node-0/blockmgr-370e9495-d9eb-41f8-ba84-ca5e643f7375
16/07/18 23:28:36 INFO MemoryStore: MemoryStore started with capacity 35.7 GB
16/07/18 23:28:39 INFO SparkEnv: Registering OutputCommitCoordinator
16/07/18 23:28:39 INFO Utils: Successfully started service 'SparkUI' on port 4040.
16/07/18 23:28:39 INFO SparkUI: Started SparkUI at http://192.168.123.217:4040
16/07/18 23:28:39 INFO HttpFileServer: HTTP File server directory is /p/lscratchg/achu/testing/sparkscratch/node-0/spark-1456877b-4bf0-4c44-869f-b0a6f96c5546/httpd-9f583943-6701-40ca-8556-5537552a5b33
16/07/18 23:28:39 INFO HttpServer: Starting HTTP Server
16/07/18 23:28:39 INFO Utils: Successfully started service 'HTTP file server' on port 42304.
16/07/18 23:28:41 INFO SparkContext: Added JAR file:/home/achu/hadoop/spark-1.6.2-bin-hadoop2.6/lib/spark-examples-1.6.2-hadoop2.6.0.jar at http://192.168.123.217:42304/jars/spark-examples-1.6.2-hadoop2.6.0.jar with timestamp 1468909721269
16/07/18 23:28:41 INFO AppClient$ClientEndpoint: Connecting to master spark://apex217:7077...
16/07/18 23:28:41 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20160718232841-0000
16/07/18 23:28:41 INFO Utils: Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 41539.
16/07/18 23:28:41 INFO NettyBlockTransferService: Server created on 41539
16/07/18 23:28:41 INFO BlockManagerMaster: Trying to register BlockManager
16/07/18 23:28:41 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.123.217:41539 with 35.7 GB RAM, BlockManagerId(driver, 192.168.123.217, 41539)
16/07/18 23:28:41 INFO BlockManagerMaster: Registered BlockManager
16/07/18 23:28:41 INFO AppClient$ClientEndpoint: Executor added: app-20160718232841-0000/0 on worker-20160718232805-192.168.123.223-34505 (192.168.123.223:34505) with 16 cores
16/07/18 23:28:41 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160718232841-0000/0 on hostPort 192.168.123.223:34505 with 16 cores, 50.0 GB RAM
16/07/18 23:28:41 INFO AppClient$ClientEndpoint: Executor added: app-20160718232841-0000/1 on worker-20160718232805-192.168.123.224-49414 (192.168.123.224:49414) with 16 cores
16/07/18 23:28:41 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160718232841-0000/1 on hostPort 192.168.123.224:49414 with 16 cores, 50.0 GB RAM
16/07/18 23:28:41 INFO AppClient$ClientEndpoint: Executor added: app-20160718232841-0000/2 on worker-20160718232805-192.168.123.220-35442 (192.168.123.220:35442) with 16 cores
16/07/18 23:28:41 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160718232841-0000/2 on hostPort 192.168.123.220:35442 with 16 cores, 50.0 GB RAM
16/07/18 23:28:41 INFO AppClient$ClientEndpoint: Executor added: app-20160718232841-0000/3 on worker-20160718232805-192.168.123.222-50299 (192.168.123.222:50299) with 16 cores
16/07/18 23:28:41 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160718232841-0000/3 on hostPort 192.168.123.222:50299 with 16 cores, 50.0 GB RAM
16/07/18 23:28:41 INFO AppClient$ClientEndpoint: Executor added: app-20160718232841-0000/4 on worker-20160718232806-192.168.123.225-55852 (192.168.123.225:55852) with 16 cores
16/07/18 23:28:41 INFO SparkDeploySchedulerBackend: Granted executor ID app-20160718232841-0000/4 on hostPort 192.168.123.225:55852 with 16 cores, 50.0 GB RAM
[... additional executor and scheduler output elided ...]
16/07/18 23:28:53 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, apex219.llnl.gov, partition 0,PROCESS_LOCAL, 2158 bytes)
16/07/18 23:28:53 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, apex219.llnl.gov, partition 1,PROCESS_LOCAL, 2158 bytes)
16/07/18 23:28:53 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, apex219.llnl.gov, partition 2,PROCESS_LOCAL, 2158 bytes)
16/07/18 23:28:53 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, apex219.llnl.gov, partition 3,PROCESS_LOCAL, 2158 bytes)
16/07/18 23:28:53 INFO TaskSetManager: Starting task 4.0 in stage 0.0 ...
[... task progress output elided ...]
16/07/18 23:28:59 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 5259 ms on apex219.llnl.gov (7/8)
16/07/18 23:28:59 INFO TaskSetManager: Finished task 4.0 in stage 0.0 (TID 4) in 5259 ms on apex219.llnl.gov (8/8)
16/07/18 23:28:59 INFO DAGScheduler: ResultStage 0 (reduce at SparkPi.scala:36) finished in 16.885 s
16/07/18 23:28:59 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
16/07/18 23:28:59 INFO DAGScheduler: Job 0 finished: reduce at SparkPi.scala:36, took 17.096651 s
Pi is roughly 3.1385
16/07/18 23:28:59 INFO SparkUI: Stopped Spark web UI at http://192.168.123.217:4040
16/07/18 23:28:59 INFO SparkDeploySchedulerBackend: Shutting down all executors
16/07/18 23:28:59 INFO SparkDeploySchedulerBackend: Asking each executor to shut down
16/07/18 23:28:59 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
16/07/18 23:28:59 INFO MemoryStore: MemoryStore cleared
16/07/18 23:28:59 INFO BlockManager: BlockManager stopped
16/07/18 23:28:59 INFO BlockManagerMaster: BlockManagerMaster stopped
16/07/18 23:28:59 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
16/07/18 23:28:59 WARN Dispatcher: Message RemoteProcessDisconnected(apex217:7077) dropped.
16/07/18 23:28:59 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon.
16/07/18 23:28:59 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports.
16/07/18 23:28:59 INFO SparkContext: Successfully stopped SparkContext
16/07/18 23:28:59 INFO ShutdownHookManager: Shutdown hook called
16/07/18 23:28:59 INFO ShutdownHookManager: Deleting directory /p/lscratchg/achu/testing/sparkscratch/node-0/spark-1456877b-4bf0-4c44-869f-b0a6f96c5546/httpd-9f583943-6701-40ca-8556-5537552a5b33
16/07/18 23:28:59 INFO ShutdownHookManager: Deleting directory /p/lscratchg/achu/testing/sparkscratch/node-0/spark-1456877b-4bf0-4c44-869f-b0a6f96c5546
The Pi approximation (3.1385) can be seen in the output above.
5) With the job complete, Magpie now tears down the session and cleans
up all daemons.
Stopping spark
apex222: stopping org.apache.spark.deploy.worker.Worker
apex219: stopping org.apache.spark.deploy.worker.Worker
apex218: stopping org.apache.spark.deploy.worker.Worker
apex221: stopping org.apache.spark.deploy.worker.Worker
apex223: stopping org.apache.spark.deploy.worker.Worker
apex225: stopping org.apache.spark.deploy.worker.Worker
apex220: stopping org.apache.spark.deploy.worker.Worker
apex224: stopping org.apache.spark.deploy.worker.Worker
stopping org.apache.spark.deploy.master.Master
Spark Patching
--------------
- A patch to support alternate config file directories is required.
- A patch to support non-ssh remote execution may be needed in some
environments. The alternate remote execution command must be
specified in the environment variable MAGPIE_REMOTE_CMD.
- For versions >= 1.1.0, a single patch covering the "alternates"
listed above can be applied directly to the startup scripts, with no
recompilation of source needed. These patches can be found in the
patches/spark/ directory with the suffix "alternate.patch". This
patch should be applied first.
- If MAGPIE_NO_LOCAL_DIR support is desired, patches with the
"no-local-dir.patch" suffix in the filename can be found in
patches/spark/. See README.no-local-dir for more details. This
patch should be applied second, after the "alternate" patch.
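A rough sketch of applying these patches by hand (the patch
filenames below are illustrative; the exact names vary by Spark
version, so check patches/spark/ for the ones matching your download
and adjust the -p level if needed; misc/magpie-download-and-setup.sh
can also handle this for you):

   cd ${HOME}/spark-1.6.2-bin-hadoop2.6
   # Apply the "alternate" patch first ...
   patch -p1 < ${MAGPIE_SCRIPTS_HOME}/patches/spark/spark-1.6.2-bin-hadoop2.6-alternate.patch
   # ... then the no-local-dir patch, if MAGPIE_NO_LOCAL_DIR support is desired.
   patch -p1 < ${MAGPIE_SCRIPTS_HOME}/patches/spark/spark-1.6.2-bin-hadoop2.6-no-local-dir.patch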
Spark Troubleshooting
---------------------
1) What does the error "Initial job has not accepted any resources;
check your cluster UI to ensure that workers are registered and have
sufficient resources" mean?
This means that the Spark job you submitted requests more resources
than your allocation provides. For example, you may be requesting
more memory or CPUs than Spark can schedule.
Incorrect settings may occur in several ways, such as the
--executor-memory or --executor-cores options in spark-submit or
the SPARK_JOB_MEMORY environment variable in the job submission
script.
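For example, with workers offering 16 cores and 50G of memory per
node (as in the example job output above), the first request can be
scheduled while the second can never be satisfied and will hang with
the warning above (class and jar are placeholders):

   # Fits within a 16-core / 50G worker
   $SPARK_HOME/bin/spark-submit --executor-cores 8 --executor-memory 24G --class <class> <jar>
   # Exceeds any single worker and will never be scheduled
   $SPARK_HOME/bin/spark-submit --executor-cores 32 --executor-memory 64G --class <class> <jar>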
Spark Advanced Usage
---------------------
1) If your cluster has a local SSD/NVRAM on each node, set a path to
it via the SPARK_LOCAL_SCRATCH_DIR environment variable in your
submission scripts.
Setting this to local SSD/NVRAM serves two purposes. First, the
local scratch directory is used for spillover of map outputs during
shuffles, which can greatly improve shuffle performance. Second, it
can be used for quickly storing/caching RDDs to disk using the
MEMORY_AND_DISK and/or DISK_ONLY persistence levels.
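In the submission script this is just a path setting; the /ssd
mount point below is a hypothetical example of node-local storage,
in contrast to a parallel-filesystem path such as the Lustre
directory used in the example job output above:

   # Prefer node-local SSD/NVRAM when the cluster provides it.
   export SPARK_LOCAL_SCRATCH_DIR="/ssd/${USER}/sparkscratch"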
2) Magpie configures the default parallelism in a Spark job depending
on the Spark version. In versions < 1.3.0, Magpie defaults the
parallelism to the number of compute nodes in your allocation.
This is significantly superior to the original Spark default of 8
(pre-1.0). However, it may not be optimal for many jobs.
For versions >= 1.3.0, Magpie defaults to not setting this value.
Spark will instead use a default parallelism equal to the number of
cores in your allocation. In some cases, this number will be too
high and will cause too much overhead for your job.
Users should experiment with the parallelism of their job to
improve performance. A number of Spark operations (such as
reduceByKey) accept a partition count as an argument. The default
can be adjusted in the submission scripts via the
SPARK_DEFAULT_PARALLELISM environment variable.
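For example, in the submission script (the value is purely
illustrative; tune it for your job and allocation size):

   # Hypothetical starting point: a few partitions per core in the allocation.
   export SPARK_DEFAULT_PARALLELISM=256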
3) Magpie configures a relatively conservative amount of memory for
Spark, currently 80% of system memory. While there should always
be a buffer left so the operating system, system daemons, and
Spark (and potentially Hadoop HDFS) daemons can operate, the 80%
value may be on the conservative side; users wishing to push it
higher, to 90% or 95% of system memory, may see benefits.
Users can adjust the amount of memory used by each Spark Worker
through the SPARK_WORKER_MEMORY_PER_NODE environment variable in
the submission scripts.
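For example, in the submission script (the value is illustrative
and assumes it is given in megabytes; check the comments in your
submission script for the exact expected units):

   # Roughly 90% of a 128G node, assuming the value is in megabytes.
   export SPARK_WORKER_MEMORY_PER_NODE=117964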
4) There are two major memory fraction configuration variables,
SPARK_STORAGE_MEMORY_FRACTION and SPARK_SHUFFLE_MEMORY_FRACTION,
which may have major effects on performance depending on your job.
SPARK_STORAGE_MEMORY_FRACTION controls the percentage of memory
used for the memory cache while SPARK_SHUFFLE_MEMORY_FRACTION
controls the percentage used for shuffles.
You may wish to adjust these for your specific job, as they can
have a large influence on job performance. Please see submission
scripts for more information.
Note that beginning with Spark 1.6.0, these memory fractions have
been deprecated.
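For Spark versions before 1.6.0, a sketch of adjusting these in the
submission script (the values are purely illustrative and should be
tuned for your job):

   # Hypothetical split: half of memory for cached RDDs, 30% for shuffles.
   export SPARK_STORAGE_MEMORY_FRACTION=0.5
   export SPARK_SHUFFLE_MEMORY_FRACTION=0.3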
Known Issues
------------
From at least Spark 2.0.0 to Spark 2.1.1, using Yarn with RawnetworkFS
worked within Magpie and its testing. This test broke in Spark 2.2.0.
Upstream issue: https://issues.apache.org/jira/browse/SPARK-21570
The suggestion is that Spark w/ Yarn and non-HDFS is not supported.
Using the Spark standalone scheduler, Spark w/ RawnetworkFS still
works.
Notes
-----
Beginning in Spark 3.1.1, Python 2.7, 3.4, and 3.5 support was removed.