Magpie
------

Magpie contains a number of scripts for running Big Data software in
HPC environments.  Thus far, Hadoop, Spark, Hbase, Hive, Storm, Pig,
Phoenix, Kafka, Zeppelin, and Zookeeper are supported.  It currently
supports running over the parallel file system Lustre and over any
generic network filesystem.  There is scheduler/resource manager
support for Slurm, Moab, Torque, and LSF.

Some of the features presently supported:

- Run jobs interactively or via scripts.
- Run MapReduce 1.0 or 2.0 jobs via Hadoop 1.0 or 2.0.
- Run against a number of filesystem options, such as HDFS, HDFS over
  Lustre, HDFS over a generic network filesystem, Lustre directly, or
  a generic network filesystem.
- Take advantage of SSDs/NVRAM for local caching if available.
- Make decent optimizations for your hardware.

Experimental support for several distributed machine learning
frameworks has also been added.  Presently TensorFlow and TensorFlow
w/ Horovod are supported.

Basic Idea
----------

The basic idea behind these scripts is to:

1) Submit a Magpie batch script to allocate nodes on a cluster using
   your HPC scheduler/resource manager.  Slurm, Slurm+mpirun,
   Moab+Slurm, Moab+Torque and LSF+mpirun are currently supported.

2) The batch script will create configuration files for all
   appropriate projects (Hadoop, Spark, etc.).  The configuration
   files will be set up so the rank 0 node is the "master".  All
   compute nodes will have configuration files created that point to
   the node designated as the master server.

   The configuration files will be populated with values for your
   filesystem choice and the hardware that exists in your cluster.
   Reasonable attempts are made to determine optimal values for your
   system and hardware (they are almost certainly better than the
   default values).  A number of options exist in the batch scripts to
   adjust these values for individual jobs.

3) Launch daemons on all nodes.  The rank 0 node will run master
   daemons, such as the Hadoop Namenode.  All remaining nodes will run
   appropriate worker daemons, such as the Hadoop Datanodes.

4) Now you have a mini big data cluster at your disposal.  You can log
   into the master node and interact with it however you want, or you
   can have Magpie run a script to execute your big data calculation
   for you.

5) When your job completes or your allocation time has run out, Magpie
   will clean up your job by tearing down daemons.  When appropriate,
   Magpie may also do some additional cleanup work to hopefully make
   re-execution on later runs cleaner and faster.
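
With Slurm, for example, this is just a normal batch submission (the
script name below is illustrative; use the submission script generated
for your site and project):

   # submit a generated Magpie submission script to Slurm
   sbatch submission-scripts/script-sbatch-srun/magpie.sbatch-srun-hadoop

   # monitor the allocation like any other batch job
   squeue -u $USER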

Requirements
------------

1) Magpie and all big data projects (Hadoop, Spark, etc.) should be
   installed on all cluster nodes.  They can be installed in a known
   local location or made available via a network file system; many
   users simply install them into their NFS home directories.  These
   paths will be specified later in the job submission scripts.

   Note that not all distributions of big data projects (Hadoop,
   Spark, etc.) are supported.  Generally speaking, only versions from
   Apache have been tested.  Your mileage may vary with other
   distributions.

   Some projects may need patches applied.  You can find patches in
   Magpie's 'patches' directory.  Most patches are only needed against
   scripts within the projects, but on occasion a recompilation of the
   source may also be necessary.

   If you are unfamiliar with patches, see the documentation for the
   `patch` command.  In most cases you can patch your project via:

   cd PROJECT-VERSION
   patch -p1 < PATH-TO-MAGPIE/patches/PROJECT/PROJECT-VERSION.patch

   For example, to apply the alternate-ssh patch to Hadoop:

   cd hadoop-2.9.2
   patch -p1 < ../magpie/patches/hadoop/hadoop-2.9.2-alternate-ssh.patch

2) A passwordless remote shell execution mechanism must be available
   for scripts to launch big data daemons (e.g. Hadoop Datanodes) on
   all appropriate nodes.  The most popular (and default) mechanism is
   passwordless ssh, but other mechanisms are also suitable.

3) A temporary local scratch space is needed on each node for Magpie
   to store configuration files, log files, and other miscellaneous
   files.  Only a very small amount of scratch space is needed.

   This local scratch space need not be a local disk; it could
   hypothetically be a memory-based tmpfs.

   Beginning with Magpie 1.60, network file paths can also be used for
   "local scratch" space, but this requires some extra work.  See
   README.no-local-dir for details.

4) Magpie and the projects it supports generally assume that all
   software and the OS environment consistently use short hostnames or
   fully qualified domain names.  For example, if the "hostname"
   command returns a short hostname (e.g. 'foo' and not
   'foo.host.com'), then the scheduler/resource manager should output
   shortened hostnames in its output environment variables
   (e.g. SLURM_JOB_NODELIST w/ Slurm, MOAB_NODELIST w/ Moab, etc.).
   A quick check is sketched after this list.

   There are mechanisms in place to work around this if your
   environment does not match in this way.  See README.hostname for
   details.

5) A small set of software dependencies is required, depending on your
   environment.

   The Moab+Torque submission scripts use Pdsh
   (https://github.com/chaos/pdsh) to launch/run scripts across
   cluster nodes.

   The LSF submission scripts use mpirun to launch/run scripts across
   cluster nodes.

   The 'hostlist' command from lua-hostlist
   (https://github.com/grondo/lua-hostlist) is preferred for a variety
   of hostrange parsing needs in Magpie.  If it is not available,
   Magpie will use its internal tool 'magpie-expand-nodes', which
   should be sufficient for most hostrange parsing, but may not
   function for a number of nuanced corner cases.

   Several checks for Zookeeper functionality assume netcat and its
   'nc' command are available.  If netcat is not available, those
   checks cannot be performed.
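
Regarding the hostname consistency requirement above, a quick
illustrative check within a Slurm allocation (adjust the environment
variable for your scheduler) is to compare the two outputs by hand:

   # compare the local hostname style to what the scheduler reports
   hostname
   echo "${SLURM_JOB_NODELIST}"

If one uses short hostnames and the other fully qualified domain
names, see README.hostname for workarounds.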

Local Configuration
-------------------

All HPC sites will have local differences and nuances to running jobs.
The job submission scripts in submission-scripts/ have a number of
defaults, such as the default location for network file systems, local
scratch space, etc.

You can adjust these defaults by editing the defaults listed in
submission-scripts/script-templates/Makefile and running 'make'
afterwards.

In addition, if your site has special local requirements, such as
setting unique paths or loading specific modules before executing a
job, these can also be configured via the LOCAL_REQUIREMENTS setting
in the same Makefile.
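
For example, a minimal sketch of the workflow (which defaults you edit
will depend on your site):

   cd submission-scripts/script-templates
   # edit the defaults (e.g. LOCAL_REQUIREMENTS or default filesystem
   # paths) in Makefile, then regenerate the submission scripts
   make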

Supported Packages & Versions
-----------------------------

The following packages and their versions have been tested for minimal
support in this version of Magpie.

Versions not listed below should work with Magpie if the
configuration/setup of those versions is compatible with the versions
listed below.  However, certain features or options may not work with
those versions.

* + - Requires patch against binary distro's scripts, no re-compilation needed
* ^ - Requires patch against source, requires re-compilation
* ! - Some issues may exist, see project readmes (i.e. README.hadoop) for details

Hadoop - 2.2.0+, 2.3.0+, 2.4.0+, 2.4.1+, 2.5.0+, 2.5.1+, 2.5.2+,
         2.6.0+, 2.6.1+, 2.6.2+, 2.6.3+, 2.6.4+, 2.6.5+, 2.7.0+,
         2.7.1+, 2.7.2+, 2.7.3+, 2.7.4+, 2.7.5+, 2.7.6+, 2.7.7+,
         2.8.0+, 2.8.1+, 2.8.2+, 2.8.3+, 2.8.4+, 2.8.5+, 2.9.0+,
         2.9.1+, 2.9.2+, 3.0.0+, 3.0.1+, 3.0.2+, 3.0.3+, 3.1.0+,
         3.1.1+, 3.1.2+, 3.1.3+, 3.1.4+, 3.2.0+, 3.2.1+, 3.2.2+,
         3.2.3+, 3.2.4+, 3.3.0+, 3.3.1+, 3.3.2+, 3.3.3+, 3.3.4+,
         3.3.5+, 3.3.6+

Spark - 1.1.0-bin-hadoop2.3+, 1.1.0-bin-hadoop2.4+,
        1.1.1-bin-hadoop2.3+, 1.1.1-bin-hadoop2.4+,
        1.2.0-bin-hadoop2.3+, 1.2.0-bin-hadoop2.4+,
        1.2.1-bin-hadoop2.3+, 1.2.1-bin-hadoop2.4+,
        1.2.2-bin-hadoop2.3+, 1.2.2-bin-hadoop2.4+,
        1.3.0-bin-hadoop2.3+, 1.3.0-bin-hadoop2.4+,
        1.3.1-bin-hadoop2.3+, 1.3.1-bin-hadoop2.4+,
        1.3.1-bin-hadoop2.6+, 1.4.0-bin-hadoop2.3+,
        1.4.0-bin-hadoop2.4+, 1.4.0-bin-hadoop2.6+,
        1.4.1-bin-hadoop2.3+, 1.4.1-bin-hadoop2.4+,
        1.4.1-bin-hadoop2.6+, 1.5.0-bin-hadoop2.6+,
        1.5.1-bin-hadoop2.6+, 1.5.2-bin-hadoop2.6+,
        1.6.0-bin-hadoop2.6+, 1.6.1-bin-hadoop2.6+,
        1.6.2-bin-hadoop2.6+, 1.6.3-bin-hadoop2.6+,
        2.0.0-bin-hadoop2.6+, 2.0.0-bin-hadoop2.7+,
        2.0.1-bin-hadoop2.6+, 2.0.1-bin-hadoop2.7+,
        2.0.2-bin-hadoop2.6+, 2.0.2-bin-hadoop2.7+,
        2.1.0-bin-hadoop2.6+, 2.1.0-bin-hadoop2.7+,
        2.1.1-bin-hadoop2.6+, 2.1.1-bin-hadoop2.7+,
        2.1.2-bin-hadoop2.6+, 2.1.2-bin-hadoop2.7+,
        2.2.0-bin-hadoop2.6+!, 2.2.0-bin-hadoop2.7+!,
        2.2.1-bin-hadoop2.6+!, 2.2.1-bin-hadoop2.7+!,
        2.3.0-bin-hadoop2.6+!, 2.3.0-bin-hadoop2.7+!,
        2.3.1-bin-hadoop2.6+!, 2.3.1-bin-hadoop2.7+!,
        2.3.2-bin-hadoop2.6+!, 2.3.2-bin-hadoop2.7+!,
        2.3.3-bin-hadoop2.6+!, 2.3.3-bin-hadoop2.7+!,
        2.3.4-bin-hadoop2.6+!, 2.3.4-bin-hadoop2.7+!,
        2.4.0-bin-hadoop2.6+!, 2.4.0-bin-hadoop2.7+!,
        2.4.1-bin-hadoop2.6+!, 2.4.1-bin-hadoop2.7+!,
        2.4.2-bin-hadoop2.6+!, 2.4.2-bin-hadoop2.7+!,
        2.4.3-bin-hadoop2.6+!, 2.4.3-bin-hadoop2.7+!,
        2.4.4-bin-hadoop2.6+!, 2.4.4-bin-hadoop2.7+!,
        2.4.5-bin-hadoop2.6+!, 2.4.5-bin-hadoop2.7+!,
        2.4.6-bin-hadoop2.6+!, 2.4.6-bin-hadoop2.7+!,
        2.4.7-bin-hadoop2.6+!, 2.4.7-bin-hadoop2.7+!,
        2.4.8-bin-hadoop2.6+!, 2.4.8-bin-hadoop2.7+!,
        3.0.0-bin-hadoop2.7+!, 3.0.0-bin-hadoop3.2+!,
        3.0.1-bin-hadoop2.7+!, 3.0.1-bin-hadoop3.2+!,
        3.0.2-bin-hadoop2.7+!, 3.0.2-bin-hadoop3.2+!,
        3.0.3-bin-hadoop2.7+!, 3.0.3-bin-hadoop3.2+!,
        3.1.1-bin-hadoop2.7+!, 3.1.1-bin-hadoop3.2+!,
        3.1.2-bin-hadoop2.7+!, 3.1.2-bin-hadoop3.2+!,
        3.1.3-bin-hadoop2.7+!, 3.1.3-bin-hadoop3.2+!,
        3.2.0-bin-hadoop2.7+!, 3.2.0-bin-hadoop3.2+!,
        3.2.1-bin-hadoop2.7+!, 3.2.1-bin-hadoop3.2+!,
        3.2.2-bin-hadoop2.7+!, 3.2.2-bin-hadoop3.2+!,
        3.2.3-bin-hadoop2.7+!, 3.2.3-bin-hadoop3.2+!,
        3.2.4-bin-hadoop2.7+!, 3.2.4-bin-hadoop3.2+!,
        3.3.0-bin-hadoop2.7+!, 3.3.0-bin-hadoop3.2+!,
        3.3.1-bin-hadoop2.7+!, 3.3.1-bin-hadoop3.2+!,
        3.3.2-bin-hadoop2.7+!, 3.3.2-bin-hadoop3.2+!,
        3.3.3-bin-hadoop3+!

TensorFlow - 1.9, 1.12

Hbase - 1.0.0+, 1.0.1+, 1.0.1.1+, 1.0.2+, 1.0.3+, 1.1.0+, 1.1.0.1+,
        1.1.1+, 1.1.2+, 1.1.3+, 1.1.4+, 1.1.5+, 1.1.6+, 1.1.7+,
        1.1.8+, 1.1.9+, 1.1.10+, 1.1.11+, 1.1.12+, 1.1.13+, 1.2.0+,
        1.2.1+, 1.2.2+, 1.2.3+, 1.2.4+, 1.2.5+, 1.2.6+, 1.2.6.1+,
        1.2.7+, 1.3.0+, 1.3.1+, 1.3.2+, 1.3.2.1+, 1.3.3+, 1.3.4+,
        1.3.5+, 1.4.0+!, 1.4.1+, 1.4.2+, 1.4.3+, 1.4.4+, 1.4.5+,
        1.4.6+, 1.4.7+, 1.4.8+, 1.4.9+, 1.4.10+, 1.4.13+, 1.5.0+,
        1.6.0+

Hive - 2.3.0 [HiveNote]

Pig - 0.13.0, 0.14.0, 0.15.0, 0.16.0, 0.17.0

Zookeeper - 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7,
            3.4.8, 3.4.9, 3.4.10, 3.4.11, 3.4.12, 3.4.13, 3.4.14

Storm - 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.10.0, 0.10.1, 0.10.2,
        1.0.0, 1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.2.0,
        1.2.1, 1.2.2, 1.2.3

Phoenix - 4.5.0-Hbase-1.0+, 4.5.0-Hbase-1.1+, 4.5.1-Hbase-1.0+,
          4.5.1-Hbase-1.1+, 4.5.2-HBase-1.0+, 4.5.2-HBase-1.1+,
          4.6.0-Hbase-1.0+, 4.6.0-Hbase-1.1, 4.7.0-Hbase-1.0+,
          4.7.0-Hbase-1.1, 4.8.0-Hbase-1.0+, 4.8.0-Hbase-1.1,
          4.8.0-Hbase-1.2, 4.8.1-Hbase-1.0+, 4.8.1-Hbase-1.1,
          4.8.1-Hbase-1.2, 4.8.2-Hbase-1.0+, 4.8.2-Hbase-1.1,
          4.8.2-Hbase-1.2, 4.9.0-Hbase-1.1, 4.9.0-Hbase-1.2,
          4.10.0-Hbase-1.1, 4.10.0-Hbase-1.2, 4.11.0-Hbase-1.1,
          4.11.0-Hbase-1.2, 4.11.0-Hbase-1.3, 4.12.0-Hbase-1.1,
          4.12.0-Hbase-1.2, 4.12.0-Hbase-1.3, 4.13.0-Hbase-1.3,
          4.13.1-Hbase-1.1, 4.13.1-Hbase-1.2, 4.13.1-Hbase-1.3,
          4.14.0-Hbase-1.1, 4.14.0-Hbase-1.2, 4.14.0-Hbase-1.3,
          4.14.0-Hbase-1.4

Kafka - 2.11-0.9.0.0

Zeppelin - 0.6.0, 0.6.1, 0.6.2, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0,
           0.8.1, 0.8.2

[HiveNote] - Hive uses PostgreSQL; the minimum version required is 9.1.13.
      PostgreSQL can be found at: https://www.postgresql.org/download/

Package Version Combinations
----------------------------

Many packages function together, for example Pig requires Hadoop,
Spark may use Hadoop to access HDFS, Hbase and Storm require
Zookeeper, and Phoenix requires Hbase.  While the range of project
versions that work together is very large, we've found the following
to be a good starting point to use in running jobs.

Pig 0.13.X, 0.14.X w/ Hadoop 2.6.X
Pig 0.15.X -> 0.17.X w/ Hadoop 2.7.X

Hbase 1.0.X -> 1.6.X w/ Hadoop 2.7.X, Zookeeper 3.4.X

Phoenix 4.4.X -> 4.13.X - Beginning w/ Phoenix 4.4.0, versions
  prebuilt for Hbase 1.0, 1.1, etc. are available.  Use the version
  prebuilt for the Hbase version you are running.

Spark 1.X - Beginning w/ Spark 1.1, versions prebuilt for Hadoop 2.3,
  2.4, 2.6, etc. are available.  Use the version prebuilt for the
  Hadoop version you are running.  See above for supported versions.
Spark 2.X - Prebuilt versions against Hadoop 2.3, 2.4, 2.6, and 2.7
  are available.  Use the version prebuilt for the Hadoop version you
  are running.  See above for supported versions.
Spark 3.X - Prebuilt versions against Hadoop 2.7 and 3.2 are
  available.  Use the version prebuilt for the Hadoop version you are
  running.  See above for supported versions.

Storm 0.9.X, 0.10.0, 1.X.0 w/ Zookeeper 3.4.X

Kafka 2.11-0.9.0.0 w/ Zookeeper 3.4.X

Zeppelin 0.6.0 w/ Spark 1.6.X

Package Java Versions
---------------------

Some package versions from Apache require minimum Java versions.
Although the minimums may be lower than those listed here, these are
our recommendations based on testing & experience.

Hadoop 2.0 -> 2.5 - Java 1.6
Hadoop 2.6 -> 2.7.3 - Java 1.7
Hadoop 2.7.4 -> 2.7.X - Java 1.8
Hadoop 2.8.0 -> ... - Java 1.7
Hadoop 3.0.0 -> ... - Java 1.8

Hbase 1.0 -> ... - Java 1.7
Hbase 1.5 -> ... - Java 1.8

Spark 1.1 -> 1.3 - Java 1.6
Spark 1.4 -> 1.6 - Java 1.7
Spark 2.0 -> 2.1 - Java 1.7
Spark 2.2 - ... - Java 1.8

Storm 0.9.3 -> 0.9.4 - Java 1.6
Storm 0.9.5 -> ... - Java 1.7

Zeppelin 0.6 -> 0.7 - Java 1.7
Zeppelin 0.8 -> ... - Java 1.8

Package Attention
-----------------

Not all software packages and features have been given the same level
of attention in Magpie, so we feel it is important to inform you of
the level of trust you can have in Magpie support for individual
projects and/or features.

Core packages/features receive the most attention.  Magpie developers
are confident in their functionality under a wide range of use cases
and scenarios.

Well supported packages/features are not given quite the same
attention as core ones.  Magpie developers are confident they will
work with common use cases, but less common scenarios may not have
been tested or tried.

Experimentally supported packages/features are not maintained with
deep attention.  They may have been developed against a specific
project version or a specific use scenario, and their support should
be considered experimental.

- Core

  Packages: Hadoop, Spark, Hbase, Pig, Zookeeper

- Well supported

  Packages: Storm, Phoenix

  Features: No-local-dir

- Experimental

  Packages: Kafka, Zeppelin, Hive, TensorFlow w/ & w/o
  Horovod, Ray

Documentation
-------------

General information about all of Magpie can be found below.  For
information on individual projects, please see the following README
files.

Hadoop - See README.hadoop
Pig - See README.pig
Hbase - See README.hbase
Hive - See README.hive
Spark - See README.spark
TensorFlow - See README.tensorflow
TensorFlow Horovod - See README.tensorflow-horovod
Ray - See README.ray
Storm - See README.storm
Phoenix - See README.phoenix
Kafka - See README.kafka
Zeppelin - See README.zeppelin
Zookeeper - See README.zookeeper

Documentation on some optional features:

- Support HPC systems without (or very small) /tmp filesystems - See README.no-local-dir

Some miscellaneous documentation:

- Testsuite information - See README.testsuite
- FAQ of random common questions - See README.faq

Exported Environment Variables
------------------------------

The following environment variables are exported when your job is run
and may be useful in scripts in your run or in pre/post run scripts.

Note that they may not be automatically exported if you remotely log
into your master node.  See MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT for a
convenient mechanism to export commonly used environment variables
during a remote login session.

Project specific environment variable exports are also available, see
those sections for more information.

MAGPIE_CLUSTER_NODERANK : the rank of the node you are on.  It's often
                          convenient to do something like the
                          following to act on only one node of your
                          allocation:

if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]
then
   ....
fi

MAGPIE_NODE_COUNT : Number of nodes in this allocation.

MAGPIE_NODELIST : Nodes in your allocation.

MAGPIE_JOB_NAME : Job name

MAGPIE_JOB_ID : Job ID

MAGPIE_TIMELIMIT_MINUTES : Timelimit of job in minutes
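
A small illustrative snippet, e.g. within a MAGPIE_JOB_SCRIPT, that
uses these variables to report allocation details from the rank 0 node
only:

   #!/bin/bash
   # print allocation details on the rank 0 node only
   if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]
   then
       echo "Job ${MAGPIE_JOB_NAME} (id ${MAGPIE_JOB_ID})"
       echo "Running on ${MAGPIE_NODE_COUNT} nodes: ${MAGPIE_NODELIST}"
       echo "Time limit: ${MAGPIE_TIMELIMIT_MINUTES} minutes"
   fi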

Convenience Scripts
-------------------

A number of convenience scripts are included in the scripts/
directory, both for possible usefulness and as examples.  They are
organized within the directory as follows:

job-scripts - These are scripts that you would run as a possible job
in Magpie.  You would set these scripts in the MAGPIE_JOB_SCRIPT
environment variable.

pre-job-run-scripts - These are scripts that you would run before the
actual calculation is executed.  You would set these scripts in the
MAGPIE_PRE_JOB_RUN environment variable.

post-job-run-scripts - These are scripts that you would run after the
actual calculation is executed.  You would set these scripts in the
MAGPIE_POST_JOB_RUN environment variable.
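
For example, these might be set in your submission script as follows
(all paths are illustrative; point them at your own scripts and your
Magpie installation):

   # illustrative settings within a Magpie submission script
   export MAGPIE_PRE_JOB_RUN="/path/to/magpie/scripts/pre-job-run-scripts/magpie-output-config-files-script.sh"
   export MAGPIE_JOB_SCRIPT="${HOME}/my-scripts/my-big-data-job.sh"
   export MAGPIE_POST_JOB_RUN="/path/to/magpie/scripts/post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh"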

Notable scripts worth mentioning:

pre-job-run-scripts/magpie-output-config-files-script.sh - This script
will output all of the conf files from your job.  It's convenient for
debugging.

post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh -
This script will get all of the conf files and log files from Hadoop,
Hbase, Pig, Spark, Storm, and/or Zookeeper and store them in a
location for post-analysis of your job.  It's convenient for
debugging.  By default files are stored in ${HOME}/${MAGPIE_JOB_NAME},
but the base directory can be altered with the first argument passed
into the script.

In addition, the misc/magpie-download-and-setup.sh script may be
convenient for initially downloading and patching Apache projects for
you so you don't have to manually download them.  It'll also configure
several paths for you in the launch scripts automatically.

General Advanced Usage
----------------------

The following are additional tips for advanced usage of Magpie.

1) The Magpie environment variables of MAGPIE_PRE_JOB_RUN and
   MAGPIE_POST_JOB_RUN can be used to run scripts before and after
   your primary job script executes.

   The MAGPIE_POST_JOB_RUN is particularly useful, as it can gather
   logs and/or other debugging data for you.  The convenience script
   post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh
   gathers most configuration and log data and stores it to your home
   directory.

2) The Magpie environment variable MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT
   is useful for creating a file of popular and useful environment
   variables.  The file it creates can be used within scripts you
   write, or it can be sourced into your environment when you interact
   with your job (see the sketch after this list).

3) All configuration files in conf/ can be modified to be tuned for
   individual applications.  For the brave and adventurous, various
   configurations such as JVM options and other tunables can be
   adjusted.  If you wish to experiment with different sets of
   configuration files, consider making different directories with
   different conf files in them.  Then a quick change to the project
   CONF_FILES settings (e.g. HADOOP_CONF_FILES, SPARK_CONF_FILES,
   HBASE_CONF_FILES, etc.) allows different sets of files to be
   experimented with (see the sketch after this list).

4) It is possible to run multiple instances of Hadoop, Hbase,
   etc. simultaneously on a cluster.  However, it is important to
   isolate each of those instances.  In particular, if using default
   configurations, multiple instances may attempt to read/write
   identical locations on network filesystems, leading to problems
   between jobs.  For example, if you configure HDFS to operate out of
   /lustre/hdfsoverlustre/ on multiple jobs, only one namenode will be
   able to operate correctly at a time.

   In order to solve this problem, all you need to do is create
   different directories for each service operating out of a network
   file system.  For example, /lustre/hdfsoverlustre1 and
   /lustre/hdfsoverlustre2 for two different jobs using HDFS.

   If you are not concerned about the specific path you are using,
   perhaps because you never intend to reuse those paths, consider
   using MAGPIE_ONE_TIME_RUN.  This setting may be particularly useful
   if you are initially running tests/experiments on different CPU
   counts, node counts, settings, etc. and want to run many jobs in
   parallel.  Be careful to clean up these directories from time to
   time, as Magpie will not clear data from prior jobs.
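
Referring back to items 2 and 3 above, a minimal sketch (all paths and
file names here are illustrative):

   # in your submission script: have Magpie write out a file of useful
   # environment variables, and point Hadoop at an alternate conf dir
   export MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT="${HOME}/magpie-job-env.sh"
   export HADOOP_CONF_FILES="${HOME}/my-experimental-hadoop-conf"

   # later, after remotely logging into the master node
   source ${HOME}/magpie-job-env.sh
   echo "${MAGPIE_NODELIST}"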

Security
--------

Users should be aware that running Magpie w/ the big data software
supported here may be insecure in your environment.  While Magpie
makes attempts to configure software with good "sanity"
configurations, they are not foolproof.  In addition, some software
may not yet have security infrastructure built in.

If you are not running in an environment where your cluster allocation
is isolated (through a private virtualized network or something
similar), other users on the cluster may be able to communicate with a
number of the big data services set up by Magpie.

These issues are due to a variety of factors, including:

1) In "traditional" big data clusters, system administrators control
what users are allowed on the cluster and who is not, limiting the
exposure of data stored there.  In the Magpie model, a "big data
cluster" is instantiated within a larger multi-user HPC cluster.  The
Magpie user cannot control what other users have access to the HPC
cluster.  This population of HPC users could access to the data of the
Magpie user without the Magpie user's knowledge.

2) In "traditional" big data clusters, important daemons are
owned/executed by a special user (e.g. hdfs, yarn, etc.).  This may
limit the type of the exposure a nefarious/rogue process can have on
the system.  When running in an HPC environment with Magpie, the
processes are run under the user's ownership.  Since users are
typically not root, they have no way to change the ownership of the
process to a "special" user.

3) Some big data software has Kerberos or similar security functions
built into it.  However, it is beyond the scope of most HPC users to
get a proper Kerberos configuration of Hadoop, HDFS, etc. from their
site staff before running their job.

4) Some big data software simply has no security built in at all.

A few examples of security issues are listed below:

Hadoop HDFS - The Hadoop Namenode is generally available on an open
and public port.  While HDFS has been configured with a good default
umask and ACLs, other users on the system can override this by setting
the HADOOP_USER_NAME environment variable.

Hadoop YARN - Similar to Hadoop HDFS, good default configurations have
been set up.  However, they can be overridden with the
HADOOP_USER_NAME environment variable.  This potentially allows users
to run jobs as another user on the cluster, which in turn can open up
all of a user's data to others within the system.

Spark - Spark shared secret keys have been configured as a basic
sanity measure.  However, since the shared secret may be easy to
determine, it may allow a user to run jobs as another user on the
cluster, which in turn can open up all of a user's data to others
within the system.

Web UIs - Generally speaking, most web UIs will be viewable by other
users on the cluster if firewall rules (or similar) are not set up on
your cluster by default.

Contributions
-------------

Feel free to send me patches for new environment variables, new
adjustments, new optimization possibilities, alternate defaults that
you feel are better, etc.

Any patches you submit to me for fixes will be appreciated.  I am by
no means a bash expert ... in fact I'm quite bad at it.

Other Projects
--------------

We welcome additions of other projects into Magpie.  Here's a somewhat
general, high level guide to including other projects in Magpie;
please see the internal implementation for details.

1) Add appropriate "templates" into
   submission-scripts/script-templates/ for the new project so the
   project can be setup.  You can copy templates from other projects
   to begin.  Although there can be variations depending on the
   project's purpose, you'll most likely want to add:

   magpie-XXX
   magpie-magpie-customizations-job-XXX
   magpie-magpie-customizations-testall-XXX

   files for project XXX.

   Then after that, update
   submission-scripts/script-templates/Makefile to add your project
   into the primary job submission files.  Generate additional
   submission scripts for the new project if you desire to.  After
   this you can run make and ensure your new project has been added
   correctly into the job submission scripts.

2) Add appropriate input checks to 'magpie-check-inputs'

3) Add an appropriate "setup" file to magpie/setup/

4) Add an appropriate "run" file to magpie/run/

5) Update 'magpie-setup-projects' and 'magpie-run' appropriately for
   new calls.

6) If necessary, create new directories and set up master/worker
   files in 'magpie-setup-core'

7) If necessary, the following libraries may warrant updates:

   magpie/lib/magpie-lib-node-identification - to identify
     master/worker nodes

   magpie/lib/magpie-lib-paths - set various path defaults

   magpie/lib/magpie-lib-defaults - set various defaults

8) Add any necessary patches to patches/

9) Add new tests into Magpie's testsuite

   - Add new test-generate-XXX.sh file to generate new tests.

   - Update test-generate.sh appropriately for new test generation.

   - Add new test test-submit-XXX file to submit new tests.

   - Update test-validate.sh to validate that jobs succeeded.

   - Update test-download-projects.sh to download & patch projects if
     necessary.

10) (Optional) Add download options for the project in
   misc/magpie-download-and-setup.sh

Other Schedulers/Resource Managers
----------------------------------

While Slurm, Moab+Slurm, Moab+Torque, and LSF+mpirun are the currently
supported schedulers/resource managers, there's no reason to believe
that other schedulers/resource managers couldn't be supported.  I'd
gladly welcome patches to support them.

To support another scheduler or resource manager, you'll want to make
your equivalent scheduler/resource manager header, similar to
submission-scripts/script-templates/magpie-config-sbatch-srun.  You
may also need to create a new job running variant, such as
submission-scripts/script-templates/magpie-run-job-srun.  Then add an
appropriate new section to
submission-scripts/script-templates/Makefile and a new directory for
these new submission scripts in submission-scripts.

If a new MAGPIE_SUBMISSION_TYPE is needed, you'll want to update
magpie/exports/magpie-exports-submission-type and add appropriate
input checks in magpie-check-inputs.

I'd be glad to accept patches back for other schedulers/resource
managers.  Please send me a pull request.

Author
------

This is me.  Feel free to contact me about Magpie, however please
consider posting support questions to Github's issue tracker so
everyone can see the questions & solutions to your problem.

Albert Chu
[email protected]

Credit
------

Credit must be given to Kevin Regimbal @ PNNL.  Initial experiments
were done using heavily modified versions of scripts Kevin developed
for running Hadoop w/ Slurm & Lustre.  A number of the ideas from
Kevin's scripts continue in spirit in these scripts.

Special thanks to David Buttler who came up with the clever name for
this project.

Thanks
------

Thanks to the following for contributions:

Felix-Antoine Fortin ([email protected]) - Msub-Torque-Pdsh support & other misc patches
Brian Panneton ([email protected]) - LSF support, Phoenix, Kafka and Zeppelin support, & a number of misc patches
Adam Childs ([email protected]) - Hive/Tez support