This repository includes all required steps to launch a MIS instance on AWS using EMR. The official MIS instance provides several state-of-the-art imputation reference-panels and is available at https://imputationserver.sph.umich.edu/.
Import: Please always check if you succesfully terminated your AWS EMR cluster by using the Amazon Console. This repository only includes steps to launch a new EMR cluster.
- AWS Command Line Interface (https://aws.amazon.com/cli/) (Ubuntu 18:
pip install awscli
) - run
aws configure
to set your credentials- Set AWS Access Key ID, AWS Secret Access Key, Default region name, Default output format
- AWS Secret Access Key is not stored on AWS, only available when creating a new Access Key ("Create Access Key")
- Create a KeyPair for your default region. (See "KeyName" when starting a cluster)
- Access to a S3 bucket including all required files and reference panels (see below)
The following folder structure must be available on a S3 bucket to launch a MIS instance. In this example we are using our public bucket s3://michigan-imputation-aws-public
that is synchronized with the EC2 EMR cluster during setup.
├── apps.yaml ├── bootstrap.sh ├── configuration │ ├── cloudgene-aws │ ├── cloudgene.conf │ ├── config │ │ └── settings.yaml │ └── pages │ ├── contact.stache │ ├── home.stache │ └── images └── reference-panels
apps.yaml
includes all currently installed apps and reference panels. By default, the Michigan Imputation Server app (imputationserver.zip
) and the HapMap2 reference panel are installed.
url: https://github.com/genepi/imputationserver/releases/download/v1.4.0/imputationserver.zip
---
url: s3://michigan-imputation-aws-public/reference-panels/hapmap2
The bootstrap.sh
script installs all required software packages:
- Installs Cloudgene (https://github.com/genepi/cloudgene)
- Installs Docker (Cloudgene use it to execute R and RMarkdown)
- Sync with
s3://michigan-imputation-aws-public/configuration
to customize Cloudgene (e.g. pages, help, ...) - Installs all applications listed in apps.yaml
- Starts
cloudgene-aws
in background, which is located on the S3 bucket (cloudgene-aws
waits until YARN service is started and starts Cloudgene)
instance-groups.json
contains hardware specifications for all nodes (e.g. number workers, instance type, size of EBS volume)
emr-config.json
contains all YARN specific parameters (e.g. task timeout, memory settings, ...). Local file paths need to start with file://
.
This file includes all required Cloudgene configuration. For a productive setup, we recommend to set an external database (H2 by default) and an S3 location to export final imputation results.
externalWorkspace: type: s3 location: database: database: password: driver: mysql port: 3306 host: user:
clusters/small
describes a small cluster with 1 master, 2 workers, all of them are m4.large instances with 128 GB EBS volumeclusters/spot
describes the same setup but adding 6 spot instances (m4.2xlarge) as TASKS with a bid price of 0.2.
The following command starts a cluster with provided instance groups and yarn config from folder spot
and a bootstrap action that installs Cloudgene, Michigan Imputation Server and HapMap2. Ensure that you have access to bucket s3://michigan-imputation-aws-public
and set your key. If the bucket is private, it needs to be in the same region as the EMR cluster.
When changing the location of the S3 bucket, please also adapt the s3 bucket location in bootstrap.sh
and apps.yaml
.
aws emr create-cluster \
--name Imputationserver \
--applications Name=Hadoop Name=Ganglia \
--release-label emr-5.29.0 \
--use-default-roles \
--ec2-attributes KeyName=<key-name> \
--instance-groups file://clusters/spot/instance-groups.json \
--configuration file://clusters/spot/emr-config.json \
--bootstrap-actions Path=s3://michigan-imputation-aws-public/bootstrap.sh,Args=[]
After submitting, you get a cluster id j-XXXXXXXXXX
. Use this id to check the state of the cluster:
aws emr describe-cluster --cluster-id j-XXXXXXXXXX
You can also use the AWS Console under EMR
to see the current status.
Cluster is ready after several minutes and you should see "State": "WAITING"
:
"Cluster": {
"Status": {
"Timeline": {
"ReadyDateTime": 1563957041.697,
"CreationDateTime": 1563956816.758
},
"State": "WAITING",
"StateChangeReason": {
"Message": "Cluster ready to run steps."
}
},
...
}
The DNS-Name of the Cloudgene instance can be found in property MasterPublicDnsName
(e.g. ec2-13-59-108-176.us-east-2.compute.amazonaws.com) Cloudgene runs on port 8082 (e.g. http://ec2-13-59-108-176.us-east-2.compute.amazonaws.com:8082).
Attention: If you start the cluster for the first time, you need to open port 8082. The security group of the master node can be found in property EmrManagedMasterSecurityGroup
. Open Amazon Web Console, click on EC2 -> Security Group and configure inboud traffic.
- Sets temp directory in
job.config
to /mnt (mounted EBS volume). - cloudgene-aws is currently run with sudo permission (required by Docker)