Magpie
------

Magpie contains a number of scripts for running Big Data software in HPC
environments. Thus far, Hadoop, Spark, Hbase, Hive, Storm, Pig, Phoenix,
Kafka, Zeppelin, and Zookeeper are supported. It currently supports running
over the parallel file system Lustre and running over any generic network
filesystem. There is scheduler/resource manager support for Slurm, Moab,
Torque, and LSF.

Some of the features presently supported:

- Run jobs interactively or via scripts.
- Run MapReduce 1.0 or 2.0 jobs via Hadoop 1.0 or 2.0
- Run against a number of filesystem options, such as HDFS, HDFS over
  Lustre, HDFS over a generic network filesystem, Lustre directly, or a
  generic network filesystem.
- Take advantage of SSDs/NVRAM for local caching if available
- Make decent optimizations for your hardware

Experimental support for several distributed machine learning frameworks
has also been added. Presently TensorFlow and TensorFlow w/ Horovod are
supported.

Basic Idea
----------

The basic idea behind these scripts is to:

1) Submit a Magpie batch script to allocate nodes on a cluster using your
   HPC scheduler/resource manager. Slurm, Slurm+mpirun, Moab+Slurm,
   Moab+Torque and LSF+mpirun are currently supported.

2) The batch script will create configuration files for all appropriate
   projects (Hadoop, Spark, etc.). The configuration files will be setup so
   the rank 0 node is the "master". All compute nodes will have
   configuration files created that point to the node designated as the
   master server.

   The configuration files will be populated with values for your
   filesystem choice and the hardware that exists in your cluster.
   Reasonable attempts are made to determine optimal values for your system
   and hardware (they are almost certainly better than the default values).
   A number of options exist in the batch scripts to adjust these values
   for individual jobs.

3) Launch daemons on all nodes. The rank 0 node will run master daemons,
   such as the Hadoop Namenode. All remaining nodes will run appropriate
   worker daemons, such as the Hadoop Datanodes.

4) Now you have a mini big data cluster to do whatever you want. You can
   log into the master node and interact with your mini big data cluster
   however you want. Or you could have Magpie run a script to execute your
   big data calculation instead.

5) When your job completes or your allocation time has run out, Magpie will
   cleanup your job by tearing down daemons. When appropriate, Magpie may
   also do some additional cleanup work to hopefully make re-execution on
   later runs cleaner and faster.
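For example, with Slurm the whole flow typically reduces to submitting one
of the generated batch scripts. The script path below is only illustrative;
the exact file names depend on your scheduler and on which projects you
generated submission scripts for.

  # Submit a generated Magpie batch script (path/name illustrative)
  sbatch submission-scripts/script-sbatch-srun/magpie.sbatch-srun-hadoop

  # Confirm the job was queued
  squeue -u $USER

With Moab+Torque or LSF the equivalent would be msub or bsub on the
corresponding script from the matching submission-scripts subdirectory.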
Requirements
------------

1) Magpie and all big data projects (Hadoop, Spark, etc.) should be
   installed on all cluster nodes. They can be installed in a known local
   location or via a network file system location. Many users may simply
   install them into their NFS home directories. These paths will later be
   specified in job submission scripts.

   Note that not all distributions of big data projects (Hadoop, Spark,
   etc.) are supported. Generally speaking, only versions from Apache have
   been tested. Your mileage may vary with other distributions.

   Some projects may need patches applied. You can find patches in Magpie's
   'patches' directory. Most patches are only needed against scripts within
   the projects, but on occasion a recompilation of the source may also be
   necessary. If you are unfamiliar with patches, see documentation for the
   `patch` command. In most cases you can patch your project via:

     cd PROJECT-VERSION
     patch -p1 < PATH-TO-MAGPIE/patches/PROJECT/PROJECT-VERSION.patch

   For example, to apply the alternate-ssh patch to Hadoop:

     cd hadoop-2.9.2
     patch -p1 < ../magpie/patches/hadoop/hadoop-2.9.2-alternate-ssh.patch

2) A passwordless remote shell execution mechanism must be available for
   scripts to launch big data daemons (e.g. Hadoop Datanodes) on all
   appropriate nodes. The most popular (and default) mechanism is
   passwordless ssh, but other mechanisms may be suitable as well.

3) A temporary local scratch space is needed on each node for Magpie to
   store configuration files, log files, and other miscellaneous files.
   Only a very small amount of scratch space is needed. This local scratch
   space need not be a local disk; it could hypothetically be memory based
   tmpfs. Beginning with Magpie 1.60, network file paths can be used for
   "local scratch" space, but this requires some extra work. See
   README.no-local-dir for details.

4) Magpie and the projects it supports generally assume that all software
   and the OS environment consistently use either short hostnames or fully
   qualified domain names. For example, if the "hostname" command returns a
   short hostname (e.g. 'foo' and not 'foo.host.com'), then the
   scheduler/resource manager should output shortened hostnames in its
   output environment variables (e.g. SLURM_JOB_NODELIST w/ Slurm,
   MOAB_NODELIST w/ Moab, etc.). There are mechanisms in place to work
   around this if your environment does not match in this way. See
   README.hostname for details.

5) A small set of software dependencies is required, depending on your
   environment.

   The Moab+Torque submission scripts use Pdsh
   (https://github.com/chaos/pdsh) to launch/run scripts across cluster
   nodes. The LSF submission scripts use mpirun to launch/run scripts
   across cluster nodes.

   The 'hostlist' command from lua-hostlist
   (https://github.com/grondo/lua-hostlist) is preferred for a variety of
   hostrange parsing needs in Magpie. If it is not available, Magpie will
   use its internal tool 'magpie-expand-nodes', which should be sufficient
   for most hostrange parsing, but may not function for a number of nuanced
   corner cases.

   Several checks for Zookeeper functionality assume netcat (the 'nc'
   command) is available. If it is not, those checks cannot be done.

Local Configuration
-------------------

All HPC sites will have local differences and nuances to running jobs. The
job submission scripts in submission-scripts/ have a number of defaults,
such as the default location for network file systems, local scratch space,
etc. You can adjust these defaults by editing the defaults listed in
submission-scripts/script-templates/Makefile and running 'make' afterwards.

In addition, if your site has special local requirements, such as setting
unique paths or loading specific modules before executing a job, this can
also be configured via the LOCAL_REQUIREMENTS configuration in the same
Makefile.
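As a minimal sketch, a site-wide adjustment boils down to editing the
Makefile defaults and regenerating the submission scripts; the particular
settings you change will depend on your site.

  cd submission-scripts/script-templates
  # Edit the defaults in the Makefile, e.g. network filesystem paths,
  # local scratch space, and LOCAL_REQUIREMENTS for site-specific module
  # loads, then regenerate the submission scripts:
  make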
Supported Packages & Versions
-----------------------------

The following packages and their versions have been tested for minimal
support in this version of Magpie. Versions not listed below should work
with Magpie if the configuration/setup of those versions is compatible with
the versions listed below. However, certain features or options may not
work with those versions.

* + - Requires patch against binary distro's scripts, no re-compilation needed
* ^ - Requires patch against source, requires re-compilation
* ! - Some issues may exist, see project readmes (i.e. README.hadoop) for details

Hadoop - 2.2.0+, 2.3.0+, 2.4.0+, 2.4.1+, 2.5.0+, 2.5.1+, 2.5.2+, 2.6.0+,
         2.6.1+, 2.6.2+, 2.6.3+, 2.6.4+, 2.6.5+, 2.7.0+, 2.7.1+, 2.7.2+,
         2.7.3+, 2.7.4+, 2.7.5+, 2.7.6+, 2.7.7+, 2.8.0+, 2.8.1+, 2.8.2+,
         2.8.3+, 2.8.4+, 2.8.5+, 2.9.0+, 2.9.1+, 2.9.2+, 3.0.0+, 3.0.1+,
         3.0.2+, 3.0.3+, 3.1.0+, 3.1.1+, 3.1.2+, 3.1.3+, 3.1.4+, 3.2.0+,
         3.2.1+, 3.2.2+, 3.2.3+, 3.2.4+, 3.3.0+, 3.3.1+, 3.3.2+, 3.3.3+,
         3.3.4+, 3.3.5+, 3.3.6+

Spark - 1.1.0-bin-hadoop2.3+, 1.1.0-bin-hadoop2.4+, 1.1.1-bin-hadoop2.3+,
        1.1.1-bin-hadoop2.4+, 1.2.0-bin-hadoop2.3+, 1.2.0-bin-hadoop2.4+,
        1.2.1-bin-hadoop2.3+, 1.2.1-bin-hadoop2.4+, 1.2.2-bin-hadoop2.3+,
        1.2.2-bin-hadoop2.4+, 1.3.0-bin-hadoop2.3+, 1.3.0-bin-hadoop2.4+,
        1.3.1-bin-hadoop2.3+, 1.3.1-bin-hadoop2.4+, 1.3.1-bin-hadoop2.6+,
        1.4.0-bin-hadoop2.3+, 1.4.0-bin-hadoop2.4+, 1.4.0-bin-hadoop2.6+,
        1.4.1-bin-hadoop2.3+, 1.4.1-bin-hadoop2.4+, 1.4.1-bin-hadoop2.6+,
        1.5.0-bin-hadoop2.6+, 1.5.1-bin-hadoop2.6+, 1.5.2-bin-hadoop2.6+,
        1.6.0-bin-hadoop2.6+, 1.6.1-bin-hadoop2.6+, 1.6.2-bin-hadoop2.6+,
        1.6.3-bin-hadoop2.6+, 2.0.0-bin-hadoop2.6+, 2.0.0-bin-hadoop2.7+,
        2.0.1-bin-hadoop2.6+, 2.0.1-bin-hadoop2.7+, 2.0.2-bin-hadoop2.6+,
        2.0.2-bin-hadoop2.7+, 2.1.0-bin-hadoop2.6+, 2.1.0-bin-hadoop2.7+,
        2.1.1-bin-hadoop2.6+, 2.1.1-bin-hadoop2.7+, 2.1.2-bin-hadoop2.6+,
        2.1.2-bin-hadoop2.7+, 2.2.0-bin-hadoop2.6+!, 2.2.0-bin-hadoop2.7+!,
        2.2.1-bin-hadoop2.6+!, 2.2.1-bin-hadoop2.7+!, 2.3.0-bin-hadoop2.6+!,
        2.3.0-bin-hadoop2.7+!, 2.3.1-bin-hadoop2.6+!, 2.3.1-bin-hadoop2.7+!,
        2.3.2-bin-hadoop2.6+!, 2.3.2-bin-hadoop2.7+!, 2.3.3-bin-hadoop2.6+!,
        2.3.3-bin-hadoop2.7+!, 2.3.4-bin-hadoop2.6+!, 2.3.4-bin-hadoop2.7+!,
        2.4.0-bin-hadoop2.6+!, 2.4.0-bin-hadoop2.7+!, 2.4.1-bin-hadoop2.6+!,
        2.4.1-bin-hadoop2.7+!, 2.4.2-bin-hadoop2.6+!, 2.4.2-bin-hadoop2.7+!,
        2.4.3-bin-hadoop2.6+!, 2.4.3-bin-hadoop2.7+!, 2.4.4-bin-hadoop2.6+!,
        2.4.4-bin-hadoop2.7+!, 2.4.5-bin-hadoop2.6+!, 2.4.5-bin-hadoop2.7+!,
        2.4.6-bin-hadoop2.6+!, 2.4.6-bin-hadoop2.7+!, 2.4.7-bin-hadoop2.6+!,
        2.4.7-bin-hadoop2.7+!, 2.4.8-bin-hadoop2.6+!, 2.4.8-bin-hadoop2.7+!,
        3.0.0-bin-hadoop2.7+!, 3.0.0-bin-hadoop3.2+!, 3.0.1-bin-hadoop2.7+!,
        3.0.1-bin-hadoop3.2+!, 3.0.2-bin-hadoop2.7+!, 3.0.2-bin-hadoop3.2+!,
        3.0.3-bin-hadoop2.7+!, 3.0.3-bin-hadoop3.2+!, 3.1.1-bin-hadoop2.7+!,
        3.1.1-bin-hadoop3.2+!, 3.1.2-bin-hadoop2.7+!, 3.1.2-bin-hadoop3.2+!,
        3.1.3-bin-hadoop2.7+!, 3.1.3-bin-hadoop3.2+!, 3.2.0-bin-hadoop2.7+!,
        3.2.0-bin-hadoop3.2+!, 3.2.1-bin-hadoop2.7+!, 3.2.1-bin-hadoop3.2+!,
        3.2.2-bin-hadoop2.7+!, 3.2.2-bin-hadoop3.2+!, 3.2.3-bin-hadoop2.7+!,
        3.2.3-bin-hadoop3.2+!, 3.2.4-bin-hadoop2.7+!, 3.2.4-bin-hadoop3.2+!,
        3.3.0-bin-hadoop2.7+!, 3.3.0-bin-hadoop3.2+!, 3.3.1-bin-hadoop2.7+!,
        3.3.1-bin-hadoop3.2+!, 3.3.2-bin-hadoop2.7+!, 3.3.2-bin-hadoop3.2+!,
        3.3.3-bin-hadoop3+!
TensorFlow - 1.9, 1.12

Hbase - 1.0.0+, 1.0.1+, 1.0.1.1+, 1.0.2+, 1.0.3+, 1.1.0+, 1.1.0.1+, 1.1.1+,
        1.1.2+, 1.1.3+, 1.1.4+, 1.1.5+, 1.1.6+, 1.1.7+, 1.1.8+, 1.1.9+,
        1.1.10+, 1.1.11+, 1.1.12+, 1.1.13+, 1.2.0+, 1.2.1+, 1.2.2+, 1.2.3+,
        1.2.4+, 1.2.5+, 1.2.6+, 1.2.6.1+, 1.2.7+, 1.3.0+, 1.3.1+, 1.3.2+,
        1.3.2.1+, 1.3.3+, 1.3.4+, 1.3.5+, 1.4.0+!, 1.4.1+, 1.4.2+, 1.4.3+,
        1.4.4+, 1.4.5+, 1.4.6+, 1.4.7+, 1.4.8+, 1.4.9+, 1.4.10+, 1.4.13+,
        1.5.0+, 1.6.0+

Hive - 2.3.0 [HiveNote]

Pig - 0.13.0, 0.14.0, 0.15.0, 0.16.0, 0.17.0

Zookeeper - 3.4.0, 3.4.1, 3.4.2, 3.4.3, 3.4.4, 3.4.5, 3.4.6, 3.4.7, 3.4.8,
            3.4.9, 3.4.10, 3.4.11, 3.4.12, 3.4.13, 3.4.14

Storm - 0.9.3, 0.9.4, 0.9.5, 0.9.6, 0.9.7, 0.10.0, 0.10.1, 0.10.2, 1.0.0,
        1.0.1, 1.0.2, 1.0.3, 1.0.4, 1.1.0, 1.1.1, 1.1.2, 1.2.0, 1.2.1,
        1.2.2, 1.2.3

Phoenix - 4.5.0-Hbase-1.0+, 4.5.0-Hbase-1.1+, 4.5.1-Hbase-1.0+,
          4.5.1-Hbase-1.1+, 4.5.2-HBase-1.0+, 4.5.2-HBase-1.1+,
          4.6.0-Hbase-1.0+, 4.6.0-Hbase-1.1, 4.7.0-Hbase-1.0+,
          4.7.0-Hbase-1.1, 4.8.0-Hbase-1.0+, 4.8.0-Hbase-1.1,
          4.8.0-Hbase-1.2, 4.8.1-Hbase-1.0+, 4.8.1-Hbase-1.1,
          4.8.1-Hbase-1.2, 4.8.2-Hbase-1.0+, 4.8.2-Hbase-1.1,
          4.8.2-Hbase-1.2, 4.9.0-Hbase-1.1, 4.9.0-Hbase-1.2,
          4.10.0-Hbase-1.1, 4.10.0-Hbase-1.2, 4.11.0-Hbase-1.1,
          4.11.0-Hbase-1.2, 4.11.0-Hbase-1.3, 4.12.0-Hbase-1.1,
          4.12.0-Hbase-1.2, 4.12.0-Hbase-1.3, 4.13.0-Hbase-1.3,
          4.13.1-Hbase-1.1, 4.13.1-Hbase-1.2, 4.13.1-Hbase-1.3,
          4.14.0-Hbase-1.1, 4.14.0-Hbase-1.2, 4.14.0-Hbase-1.3,
          4.14.0-Hbase-1.4

Kafka - 2.11-0.9.0.0

Zeppelin - 0.6.0, 0.6.1, 0.6.2, 0.7.0, 0.7.1, 0.7.2, 0.7.3, 0.8.0, 0.8.1,
           0.8.2

[HiveNote] - Hive uses PostgreSQL; the minimum version required is 9.1.13.
             PostgreSQL can be found at:
             https://www.postgresql.org/download/

Package Version Combinations
----------------------------

Many packages function together. For example, Pig requires Hadoop, Spark
may use Hadoop to access HDFS, Hbase and Storm require Zookeeper, and
Phoenix requires Hbase. While the range of project versions that work
together is very large, we've found the following to be a good starting
point to use in running jobs.

Pig 0.13.X, 0.14.X w/ Hadoop 2.6.X

Pig 0.15.X -> 0.17.X w/ Hadoop 2.7.X

Hbase 1.0.X -> 1.6.X w/ Hadoop 2.7.X, Zookeeper 3.4.X

Phoenix 4.4.X -> 4.13.X - Beginning w/ Phoenix 4.4.0, versions prebuilt for
Hbase 1.0, 1.1, etc. are available. Use the version it is prebuilt for
appropriately.

Spark 1.X - Beginning w/ Spark 1.1, versions prebuilt for Hadoop 2.3, 2.4,
2.6, etc. are available. Use the version it is prebuilt for appropriately.
See above for supported versions.

Spark 2.X - Builds against Hadoop 2.3, 2.4, 2.6, and 2.7. Use the version
it is prebuilt for appropriately. See above for supported versions.

Spark 3.X - Builds against Hadoop 2.7 and 3.2. Use the version it is
prebuilt for appropriately. See above for supported versions.

Storm 0.9.X, 0.10.0, 1.X.0 w/ Zookeeper 3.4.X

Kafka 2.11-0.9.0.0 w/ Zookeeper 3.4.X

Zeppelin 0.6.0 w/ Spark 1.6.X
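For example, pairing Pig 0.17.X with Hadoop 2.7.X per the list above would
translate into submission-script settings roughly like the following. The
variable names (HADOOP_SETUP, HADOOP_VERSION, PIG_SETUP, PIG_VERSION)
reflect typical Magpie submission scripts; confirm the exact names against
the script generated for your site.

  # In your Hadoop+Pig submission script (variable names assumed)
  export HADOOP_SETUP=yes
  export HADOOP_VERSION="2.7.7"

  export PIG_SETUP=yes
  export PIG_VERSION="0.17.0"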
Package Java Versions
---------------------

Some package versions from Apache require minimum Java versions. Although
the minimums may be lower than those listed here, these are our
recommendations based on testing & experience.

Hadoop 2.0 -> 2.5       - Java 1.6
Hadoop 2.6 -> 2.7.3     - Java 1.7
Hadoop 2.7.4 -> 2.7.X   - Java 1.8
Hadoop 2.8.0 -> ...     - Java 1.7
Hadoop 3.0.0 -> ...     - Java 1.8
Hbase 1.0 -> ...        - Java 1.7
Hbase 1.5 -> ...        - Java 1.8
Spark 1.1 -> 1.3        - Java 1.6
Spark 1.4 -> 1.6        - Java 1.7
Spark 2.0 -> 2.1        - Java 1.7
Spark 2.2 -> ...        - Java 1.8
Storm 0.9.3 -> 0.9.4    - Java 1.6
Storm 0.9.5 -> ...      - Java 1.7
Zeppelin 0.6 -> 0.7     - Java 1.7
Zeppelin 0.8 -> ...     - Java 1.8

Package Attention
-----------------

Not all software packages and features have been given the same level of
attention in Magpie, so we feel it is important to inform you of the level
of trust you can have in Magpie's support for individual projects and/or
features.

Core packages/features are the most thoroughly supported in Magpie. Magpie
developers are confident in their functionality under a wide range of use
cases and scenarios.

Well supported packages/features are not given quite the same attention as
core ones. Magpie developers are confident they will work with common use
case scenarios, but less common scenarios may not have been tested or
tried.

Experimental packages/features are not maintained with deep attention. They
may have been developed against a specific project version or with a
specific use scenario in mind. Their support should be considered
experimental.

- Core
  Packages: Hadoop, Spark, Hbase, Pig, Zookeeper

- Well Supported
  Packages: Storm, Phoenix
  Features: No-local-dir

- Experimental
  Packages: Kafka, Zeppelin, Hive, TensorFlow w/ & w/o Horovod, Ray

Documentation
-------------

General information about all of Magpie can be found below. For information
on individual projects, please see the following README files.

Hadoop             - See README.hadoop
Pig                - See README.pig
Hbase              - See README.hbase
Hive               - See README.hive
Spark              - See README.spark
TensorFlow         - See README.tensorflow
TensorFlow Horovod - See README.tensorflow-horovod
Ray                - See README.ray
Storm              - See README.storm
Phoenix            - See README.phoenix
Kafka              - See README.kafka
Zeppelin           - See README.zeppelin
Zookeeper          - See README.zookeeper

Documentation on some optional features:

- Support for HPC systems without (or with very small) /tmp filesystems -
  See README.no-local-dir

Some miscellaneous documentation:

- Testsuite information - See README.testsuite
- FAQ of random common questions - See README.faq

Exported Environment Variables
------------------------------

The following environment variables are exported when your job is run and
may be useful in scripts in your run or in pre/post run scripts. Note that
they may not be automatically exported if you remotely log into your master
node. See MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT for a convenient mechanism to
export commonly used environment variables during a remote login session.

Project specific environment variable exports are also available; see those
sections for more information.

MAGPIE_CLUSTER_NODERANK : the rank of the node you are on. It's often
convenient to do something like

  if [ $MAGPIE_CLUSTER_NODERANK == 0 ]
  then
      ....
  fi

to only do something on one node of your allocation.

MAGPIE_NODE_COUNT : Number of nodes in this allocation.

MAGPIE_NODELIST : Nodes in your allocation.

MAGPIE_JOB_NAME : Job name

MAGPIE_JOB_ID : Job ID

MAGPIE_TIMELIMIT_MINUTES : Timelimit of job in minutes
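For instance, a script set via MAGPIE_JOB_SCRIPT (the script below is
hypothetical) might use these variables like so:

  #!/bin/bash
  # Print a short job summary, but only from the rank 0 node
  if [ "${MAGPIE_CLUSTER_NODERANK}" == "0" ]
  then
      echo "Job ${MAGPIE_JOB_NAME} (id ${MAGPIE_JOB_ID})"
      echo "Running on ${MAGPIE_NODE_COUNT} nodes: ${MAGPIE_NODELIST}"
      echo "Time limit: ${MAGPIE_TIMELIMIT_MINUTES} minutes"
  fi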
Convenience Scripts
-------------------

A number of convenience scripts are included in the scripts/ directory,
both for possible usefulness and as examples. They are organized within the
directory as follows:

job-scripts - These are scripts that you would run as a possible job in
Magpie. You would set these scripts in the MAGPIE_JOB_SCRIPT environment
variable.

pre-job-run-scripts - These are scripts that you would run before the
actual calculation is executed. You would set these scripts in the
MAGPIE_PRE_JOB_RUN environment variable.

post-job-run-scripts - These are scripts that you would run after the
actual calculation is executed. You would set these scripts in the
MAGPIE_POST_JOB_RUN environment variable.

Notable scripts worth mentioning:

pre-job-run-scripts/magpie-output-config-files-script.sh - This script will
output all of the conf files from your job. It's convenient for debugging.

post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh - This
script will gather all of the conf files and log files from Hadoop, Hbase,
Pig, Spark, Storm, and/or Zookeeper and store them in a location for
post-analysis of your job. It's convenient for debugging. By default files
are stored in ${HOME}/${MAGPIE_JOB_NAME}, but the base directory can be
altered with the first argument passed into the script.

In addition, the misc/magpie-download-and-setup.sh script may be convenient
for initially downloading and patching Apache projects so you don't have to
do it manually. It will also configure several paths in the launch scripts
automatically.

General Advanced Usage
----------------------

The following are additional tips for advanced usage of Magpie.

1) The Magpie environment variables MAGPIE_PRE_JOB_RUN and
   MAGPIE_POST_JOB_RUN can be used to run scripts before and after your
   primary job script executes. MAGPIE_POST_JOB_RUN is particularly useful,
   as it can gather logs and/or other debugging data for you. The
   convenience script
   post-job-run-scripts/magpie-gather-config-files-and-logs-script.sh
   gathers most configuration and log data and stores it to your home
   directory.

2) The Magpie environment variable MAGPIE_ENVIRONMENT_VARIABLE_SCRIPT is
   useful for creating a file of popular and useful environment variables.
   The file it creates can be used within scripts you write, or it can be
   sourced into your environment when you interact with your job.

3) All configuration files in conf/ can be modified and tuned for
   individual applications. For the brave and adventurous, various
   configurations such as JVM options and other tunables can be adjusted.
   If you wish to experiment with different sets of configuration files,
   consider making different directories with different conf files in them.
   Then a quick change to a project's CONF_FILES setting (e.g.
   HADOOP_CONF_FILES, SPARK_CONF_FILES, HBASE_CONF_FILES, etc.) allows
   different files to be experimented with quickly.

4) It is possible to run multiple instances of Hadoop, Hbase, etc.
   simultaneously on a cluster. However, it is important to isolate each of
   those instances. In particular, if using default configurations,
   multiple instances may attempt to read/write identical locations on
   network filesystems, leading to problems between jobs. For example, if
   you configure HDFS to operate out of /lustre/hdfsoverlustre/ on multiple
   jobs, only one Namenode will be able to operate correctly at a time.

   To solve this problem, simply create different directories for each
   service operating out of a network file system, for example
   /lustre/hdfsoverlustre1 and /lustre/hdfsoverlustre2 for two different
   jobs using HDFS.

   If you are not concerned about the specific path you are using, perhaps
   because you never intend to reuse those paths, consider using
   MAGPIE_ONE_TIME_RUN. This setting may be particularly useful if you are
   initially running tests/experiments on different CPU counts, node
   counts, settings, etc. and want to run many jobs in parallel. Be careful
   to cleanup these directories from time to time, as Magpie will not clear
   data from prior jobs.
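As a sketch of such isolation, two concurrently submitted HDFS-over-Lustre
jobs could simply point at different directories. The variable name
HADOOP_HDFSOVERLUSTRE_PATH below reflects typical Magpie Hadoop submission
scripts; confirm the exact name in your generated script.

  # Job A's submission script
  export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/${USER}/hdfsoverlustre1"

  # Job B's submission script
  export HADOOP_HDFSOVERLUSTRE_PATH="/lustre/${USER}/hdfsoverlustre2"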
Security
--------

Users should be aware that running Magpie w/ the big data software
supported here may be insecure in your environment. While Magpie attempts
to configure software with good "sanity" configurations, they are not
foolproof. In addition, some software may not yet have security
infrastructure built in. If you are not running in an environment where
your cluster allocation is isolated (through a private virtualized network
or something similar), other users on the cluster may be able to
communicate with a number of the big data services setup by Magpie.

These issues are due to a variety of factors, including:

1) In "traditional" big data clusters, system administrators control which
   users are allowed on the cluster and which are not, limiting the
   exposure of data stored there. In the Magpie model, a "big data cluster"
   is instantiated within a larger multi-user HPC cluster. The Magpie user
   cannot control what other users have access to the HPC cluster. This
   population of HPC users could access the data of the Magpie user without
   the Magpie user's knowledge.

2) In "traditional" big data clusters, important daemons are owned/executed
   by a special user (e.g. hdfs, yarn, etc.). This may limit the type of
   exposure a nefarious/rogue process can have on the system. When running
   in an HPC environment with Magpie, the processes are run under the
   user's ownership. Since users are typically not root, they have no way
   to change the ownership of the process to a "special" user.

3) Some big data software has Kerberos or similar security functions built
   in. However, it is beyond the scope of most HPC users to get a proper
   Kerberos configuration of Hadoop, HDFS, etc. from their site staff
   before running their job.

4) Some big data software simply doesn't have any security built in at all.

A few examples of security issues are listed below:

Hadoop HDFS - The Hadoop Namenode is generally available on an open and
public port. While HDFS has been configured with a good default umask and
ACLs, other users on the system can override this by setting the
HADOOP_USER_NAME environment variable.

Hadoop YARN - Similar to Hadoop HDFS, good default configurations have been
setup. However, they can be overridden with the HADOOP_USER_NAME
environment variable. This allows users to potentially run jobs as another
user on the cluster. This in turn can open up all of a user's data to
others within the system.

Spark - Spark shared secret keys have been configured as a sanity
configuration. However, since the shared secret may be easy to determine,
it may allow a user to run jobs as another user on the cluster. This in
turn can open up all of a user's data to others within the system.

Web UIs - Generally speaking, most web UIs will be viewable by other users
on the cluster if firewall rules (or similar) are not setup on your cluster
by default.

Contributions
-------------

Feel free to send me patches for new environment variables, new
adjustments, new optimization possibilities, alternate defaults that you
feel are better, etc. Any patches you submit for fixes will be appreciated.
I am by no means a bash expert ... in fact I'm quite bad at it.

Other Projects
--------------

We welcome additions of other projects into Magpie. Here's a somewhat
general guide to including other projects in Magpie. This is very high
level; please see the internal implementation for details.
Hopefully it's somewhat obvious.

1) Add appropriate "templates" into submission-scripts/script-templates/
   for the new project so the project can be setup. You can copy templates
   from other projects to begin. Although there can be variations depending
   on the project's purpose, you'll most likely want to add:

     magpie-XXX
     magpie-magpie-customizations-job-XXX
     magpie-magpie-customizations-testall-XXX

   files for project XXX.

   Then update submission-scripts/script-templates/Makefile to add your
   project into the primary job submission files. Generate additional
   submission scripts for the new project if you desire.

   After this you can run make and ensure your new project has been added
   correctly into the job submission scripts.

2) Add appropriate input checks to 'magpie-check-inputs'.

3) Add an appropriate "setup" file to magpie/setup/.

4) Add an appropriate "run" file to magpie/run/.

5) Update 'magpie-setup-projects' and 'magpie-run' appropriately for new
   calls.

6) If necessary, create new directories and setup master/worker files in
   'magpie-setup-core'.

7) If necessary, the following libraries could warrant updates:

     magpie/lib/magpie-lib-node-identification - to identify master/worker nodes
     magpie/lib/magpie-lib-paths - set various path defaults
     magpie/lib/magpie-lib-defaults - set various defaults

8) Add any necessary patches to patches/.

9) Add new tests into Magpie's testsuite:

   - Add a new test-generate-XXX.sh file to generate new tests.
   - Update test-generate.sh appropriately for new test generation.
   - Add a new test-submit-XXX file to submit new tests.
   - Update test-validate.sh to validate that jobs succeeded.
   - Update test-download-projects.sh to download & patch projects if
     necessary.

10) (Optional) Add download options for the project in
    misc/magpie-download-and-setup.sh.

Other Schedulers/Resource Managers
----------------------------------

While Slurm, Moab+Slurm, Moab+Torque, and LSF+mpirun are the currently
supported schedulers/resource managers, there's no reason to believe that
other schedulers/resource managers couldn't be supported. I'd gladly
welcome patches to support them.

To support another scheduler or resource manager, you'll want to make your
equivalent scheduler/resource manager header, similar to
submission-scripts/script-templates/magpie-config-sbatch-srun. You may also
need to create a new job running variant, such as
submission-scripts/script-templates/magpie-run-job-srun. Then add an
appropriate new section to submission-scripts/script-templates/Makefile and
a new directory for these new submission scripts in submission-scripts/.

If a new MAGPIE_SUBMISSION_TYPE is needed, you'll want to update
magpie/exports/magpie-exports-submission-type and add appropriate input
checks in magpie-check-inputs.

I'd be glad to accept patches back for other schedulers/resource managers.
Please send me a pull request.

Author
------

Albert Chu
[email protected]

Feel free to contact me about Magpie, however please consider posting
support questions to Github's issue tracker so everyone can see the
questions & solutions to your problem.

Credit
------

Credit must be given to Kevin Regimbal @ PNNL. Initial experiments were
done using heavily modified versions of scripts Kevin developed for running
Hadoop w/ Slurm & Lustre. A number of the ideas from Kevin's scripts
continue in spirit in these scripts.

Special thanks to David Buttler, who came up with the clever name for this
project.
Thanks
------

Thanks to the following for their contributions:

Felix-Antoine Fortin ([email protected]) - Msub-Torque-Pdsh support & other
misc patches

Brian Panneton ([email protected]) - LSF support, Phoenix, Kafka and
Zeppelin support, & a number of misc patches

Adam Childs ([email protected]) - Hive/Tez support