
Apache Spark @ DESC

This repository provides:

  • Instructions for setting up a DESC Python environment with Apache Spark at NERSC (batch/interactive and JupyterLab).
  • Basic tutorials to get started with Apache Spark.
  • Links to Apache Spark developments within DESC (DC2 data access, 3x2pt, ...).

To come:

  • Stack environment + Apache Spark.
  • Bootcamp for DESC members.

Apache Spark

Apache Spark is a cluster computing framework, that is, a set of tools for performing computations across a network of many machines. Spark started in 2009 as a research project and has since enjoyed huge success in industry. It is based on the MapReduce cluster computing paradigm popularized by the Hadoop framework, and provides implicit data parallelism and fault tolerance.

Spark exposes its functionality through Scala, Python, Java, and R APIs (Scala being the native one). For DESC work we advocate the Python API (called pyspark), since Python is already the standard language of DESC tools. But feel free to get your hands on Scala, it's worth it.
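
To give a flavour of the MapReduce paradigm in pyspark, here is a minimal word-count sketch; the application name and the toy input are placeholders, and the snippet assumes only a working pyspark installation:

from pyspark.sql import SparkSession

# Entry point: a SparkSession drives the whole application.
spark = SparkSession.builder.appName("wordcount-demo").getOrCreate()
sc = spark.sparkContext

# Classic MapReduce: count word occurrences in parallel.
lines = sc.parallelize(["dark energy", "dark matter", "weak lensing"])
counts = (lines.flatMap(lambda line: line.split())   # map: split lines into words
               .map(lambda word: (word, 1))          # map: emit (word, 1) pairs
               .reduceByKey(lambda a, b: a + b))     # reduce: sum counts per word

print(counts.collect())  # e.g. [('dark', 2), ('energy', 1), ...]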

Working at NERSC (JupyterLab)

We provide a kernel for working with Apache Spark in your DESC environment (see LSSTDESC/nersc). You just need to select the desc-pyspark kernel in the JupyterLab interface. Note that this kernel is installed automatically when you run the kernel setup script:

source /global/common/software/lsst/common/miniconda/kernels/setup.sh
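
Once the desc-pyspark kernel is selected, you work with Spark from notebook cells in the usual way. Whether the kernel pre-creates a SparkSession depends on its configuration, so the minimal sketch below assumes only a working pyspark installation:

from pyspark.sql import SparkSession

# Retrieve the session started by the kernel if there is one,
# otherwise create a new local session.
spark = SparkSession.builder.getOrCreate()

print(spark.version)                            # Spark version behind the kernel
print(spark.sparkContext.defaultParallelism)    # number of cores available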

Working at NERSC (interactive mode)

The JupyterLab mode above is limited to 4 cores and 8 GB of memory in total. For more demanding work you can use the NERSC interactive queues (or the batch queues, see the next section). Each node then provides 32 cores and 100 GB of memory. Note that the memory available for Spark's cache is about 60% of the total.

To ease pyspark usage we provide some scripts located in the /scripts directory:

  • On a Cori interactive node, first run: source scripts/spark-interactive.sh NODES TIME, where NODES is the number of requested nodes and TIME the session time (in minutes).
  • Once logged in: source scripts/init_spark.sh
  • You can then launch the pyspark shell with: scripts/run_pyspark

This drops you into an IPython shell, from which you can run interactive commands and/or standard Python scripts with the usual %run magic command.
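
For example, a small standalone script (here a hypothetical analysis.py) can be launched from that shell with %run analysis.py; using getOrCreate() lets the script pick up the session already started by the pyspark shell:

# analysis.py -- hypothetical example script for the pyspark shell.
from pyspark.sql import SparkSession

# getOrCreate() returns the SparkSession the shell already started.
spark = SparkSession.builder.getOrCreate()

# Build a toy DataFrame and run a simple aggregation on the cluster.
df = spark.createDataFrame(
    [(1, 21.3), (2, 22.1), (3, 20.9)],
    ["objectId", "mag"],
)
df.groupBy().avg("mag").show()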

The pyspark installation is a customized one that includes Anaconda plus a few extra packages (such as healpy). If you need other packages, submit an issue.
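
Since healpy is available on the executors, you can for instance assign HEALPix pixel indices to a catalogue inside a user-defined function. A minimal sketch, with an arbitrary nside and toy coordinates:

import healpy as hp
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import LongType

spark = SparkSession.builder.getOrCreate()

# UDF returning the HEALPix pixel index (nside=64, ring scheme)
# for RA/Dec given in degrees.
@udf(returnType=LongType())
def ang2pix(ra, dec):
    return int(hp.ang2pix(64, ra, dec, lonlat=True))

df = spark.createDataFrame([(10.0, -30.0), (150.2, 2.5)], ["ra", "dec"])
df.withColumn("ipix", ang2pix("ra", "dec")).show()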

Working at NERSC (batch mode)

NERSC provides support for running Spark at scale. Note that as of Spark 2.3.0, Spark runs inside Shifter at NERSC. Complete information is available in the NERSC documentation. Example batch scripts will follow soon.

Going beyond

We started the AstroLab Software project as a platform for big-data developments in astronomy, with a focus on Apache Spark. It hosts several projects of interest for LSST, as well as many that go well beyond it.

If you want to generate your own DESC Python + Apache Spark kernel, follow these steps:

# Clone the repo
git clone https://github.com/astrolabsoftware/spark-kernel-nersc.git
cd spark-kernel-nersc

# Create the kernel - it will be stored under
# $HOME/.local/share/jupyter/kernels/
python desc-kernel.py \
  -kernelname desc-pyspark-custom \
  -pyspark_args "--master local[4] \
  --driver-memory 32g --executor-memory 32g \
  --packages com.github.astrolabsoftware:spark-fits_2.11:0.7.1"

Then select the desc-pyspark-custom kernel in the JupyterLab interface. More information can be found in the spark-kernel-nersc repository.
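
Since the pyspark_args above load the spark-fits connector via --packages, a kernel built this way can read FITS files directly into DataFrames. A minimal sketch (the catalogue path is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read HDU 1 of a FITS catalogue into a distributed DataFrame
# (replace the path with one of your own files).
df = spark.read.format("fits") \
         .option("hdu", 1) \
         .load("/path/to/catalog.fits")

df.printSchema()
print(df.count())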
