A reboot of the HGI's IaC project. This project was created to address one simple, initial objective: the lifecycle management of a Spark cluster.

The old code was no longer effective: the team was not confident with the codebase or the build process, and the infrastructure generated by the code was missing a number of must-have features for today's infrastructures. We chose a fresh start on the IaC rather than refactoring legacy code. This lets us choose simple and effective objectives, outline better requirements, and design around operability from the very beginning.
- terraform 0.11, executable anywhere in your PATH
- packer 1.4, executable anywhere in your PATH
- docker, distribution installed
- Ensure that the following packages are installed (installable via apt, as sketched after this list):
- build-essential
- cmake
- g++
- libatlas3-base
- liblz4-dev
- libnetlib-java
- libopenblas-base
- make
- openjdk-8-jdk
- python3
- python3-dev
- python3-pip
- r-base
- r-recommended
- scala
- Ensure that the Python requirements in requirements.txt are installed (see the setup sketch after this list)
- Follow the setup runbook
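As a convenience, here is a minimal setup sketch for a Debian/Ubuntu host, assuming requirements.txt sits in the repository root (tool versions and package names are taken from the lists above):

```bash
# Check that the required tools are available on your PATH
terraform version    # expects a 0.11.x release
packer version       # expects a 1.4.x release
docker --version

# Install the required system packages (Debian/Ubuntu)
sudo apt-get install -y \
    build-essential cmake g++ libatlas3-base liblz4-dev \
    libnetlib-java libopenblas-base make openjdk-8-jdk \
    python3 python3-dev python3-pip r-base r-recommended scala

# Install the Python requirements
pip3 install -r requirements.txt
```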
invoke.sh is a shell script made to wrap pyinvoke's quite extensive list of tasks and collections, and to make its usage even easier. To understand how to use invoke.sh, you can run:

```bash
bash invoke.sh --help
```

To get an idea of what the tasks are and what they do, please have a look at the tasks documentation. For a quick list of example usages, please refer to the users or ops runbooks.
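The wrapper's --help output is the authoritative reference; assuming invoke.sh forwards its arguments to pyinvoke, the available tasks can also be listed with pyinvoke's standard --list flag:

```bash
bash invoke.sh --list
```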
Open your hail-master Jupyter URL http://<IP_OR_NAME>/jupyter/ in a web browser, create a notebook, then initialise Hail in it:

```python
import os

import hail
import pyspark

# Use the temporary directory shipped with the Hail installation
tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')

# Create a SparkContext and attach Hail to it
sc = pyspark.SparkContext()
hail.init(sc=sc, tmp_dir=tmp_dir)
```
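To check that the initialisation worked, one option is a quick smoke test using Hail's built-in balding_nichols_model dataset generator (a sketch; the dataset dimensions below are arbitrary):

```python
# Generate a small random genotype matrix: 3 populations, 50 samples, 100 variants
mt = hail.balding_nichols_model(3, 50, 100)

# Count (variants, samples), forcing a distributed computation on the cluster
print(mt.count())
```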
(TODO: include a .ssh/config snippet to allow for an easier ssh run)
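In the meantime, a minimal ~/.ssh/config sketch along these lines might look like the following (the hail-master host alias is a placeholder; the options mirror the ssh command below):

```
Host hail-master
    HostName <IP_OR_NAME>
    User ubuntu
    StrictHostKeyChecking no
    UserKnownHostsFile /dev/null
```

With this in place, the command below reduces to ssh hail-master.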
ssh into your hail-master node:

```bash
$ ssh -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null ubuntu@<IP_OR_NAME>
```
Once you've logged in, become the application user (i.e. hgi, for now):

```bash
$ sudo --login --user=hgi --group=hgi
```
The --login option will create a login shell with a number of pre-configured environment variables and commands, including a pre-configured alias for pyspark, so you should not need to remember any options. Once you've started pyspark, you can initialise Hail like this:
```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/ '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.4.3
      /_/

Using Python version 3.7.3 (default, Mar 27 2019 22:11:17)
SparkSession available as 'spark'.
>>> import os
>>> import hail
>>> tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
>>> hail.init(sc=sc, tmp_dir=tmp_dir)
```
Hail initialisation in a non-interactive pyspark session is the same as in the Jupyter notebooks:
```python
import os

import hail
import pyspark

# Same initialisation as in the notebook: temporary directory from the
# Hail installation, a fresh SparkContext, then Hail attached to it
tmp_dir = os.path.join(os.environ['HAIL_HOME'], 'tmp')
sc = pyspark.SparkContext()
hail.init(sc=sc, tmp_dir=tmp_dir)
```
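For a fully non-interactive run, the same snippet can be saved to a file and handed to spark-submit; the script name below is only an example:

```bash
spark-submit my_hail_job.py
```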
- Read the CONTRIBUTING.md file
- Read the LICENSE.md file