This repository contains playbooks and documentation to deploy stacks of virtual machines working together. Most of these stacks are virtual Linux HPC clusters, which can be used as collaborative, analytical sandboxes. All production clusters were named after robots that appear in the animated sitcom Futurama. Test/development clusters were named after other robots.
The main ingredients for (deploying) these clusters:
- Ansible playbooks for system configuration management.
- OpenStack for virtualization. (Note that deploying OpenStack itself is not part of the configs/code in this repo.)
- Pulp to create freezes of Linux distros.
- CentOS 7 as OS for the virtual machines.
- Slurm as workload/resource manager to orchestrate jobs.
The master and develop branches of this repo are protected; updates can only be merged into these branches using reviewed pull requests.
Once in a while we create releases, which are versioned using the format YY.MM.v where:
- YY is the year of release
- MM is the month of release
- v is the sequence number of the release within that month and year. Hence it is not the day of the month.

E.g. 19.01.1 is the first release in January 2019.
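The versioning scheme can also be checked mechanically. The following is a minimal sketch and not part of this repo: the function names is_release_tag and release_tag_for_today are hypothetical, and the pattern assumes the release counter starts at 1.

```shell
# Hypothetical helpers (not part of this repo) illustrating the YY.MM.v format.

# Succeeds if the argument looks like a release tag:
# two-digit year, month 01-12 and a release counter starting at 1.
is_release_tag() {
    printf '%s\n' "$1" | grep -Eq '^[0-9]{2}\.(0[1-9]|1[0-2])\.[1-9][0-9]*$'
}

# Prints the tag for the n-th release ($1) of the current month.
release_tag_for_today() {
    printf '%s.%s\n' "$(date '+%y.%m')" "$1"
}

is_release_tag '19.01.1' && echo '19.01.1 is a valid release tag'
is_release_tag '19.13.1' || echo '19.13.1 is not a valid release tag'
```

For example, release_tag_for_today 2 prints the tag for the second release of the current month.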
We follow the Python PEP8 naming conventions for variable names, function names, etc.
This repo currently contains code and configs for the following clusters:
- Talos: Development cluster hosted by the Center for Information Technology (CIT) at the University of Groningen.
- Gearshift: UMCG Research IT production cluster hosted by the Center for Information Technology (CIT) at the University of Groningen.
- Hyperchicken: Development cluster hosted by The European Bioinformatics Institute (EMBL-EBI) in the Embassy Cloud.
- Fender: Solve-RD production cluster hosted by The European Bioinformatics Institute (EMBL-EBI) in the Embassy Cloud.
Deployment and functional administration of all clusters is a joint effort of the Genomics Coordination Center (GCC) and the Center for Information Technology (CIT) from the University Medical Center and University of Groningen, in collaboration with the ELIXIR compute platform, EXCELERATE, EU-Solve-RD, European Joint Project for Rare disease and CORBEL projects.
The clusters are composed of the following types of machines:
- Jumphost: security-hardened machines for SSH access.
- User Interface (UI): machines for job management by regular users.
- Deploy Admin Interface (DAI): machines for deployment of bioinformatics software and reference datasets without root access.
- Sys Admin Interface (SAI): machines for maintenance / management tasks that require root access.
- Compute Node (CN): machines that crunch jobs submitted by users on a UI.
The clusters use the following types of storage systems / folders:
Filesystem/Folder | Shared/Local | Backups | Mounted on | Purpose/Features |
---|---|---|---|---|
/home/${home}/ | Shared | Yes | UIs, DAIs, SAIs, CNs | Only for personal preferences: small data == tiny quota. |
/groups/${group}/prm[0-9]/ | Shared | Yes | UIs, DAIs | permanent storage folders: for rawdata or final results that need to be stored for the mid/long term. |
/groups/${group}/tmp[0-9]/ | Shared | No | UIs, DAIs, CNs | temporary storage folders: for staged rawdata and intermediate results on compute nodes that only need to be stored for the short term. |
/groups/${group}/scr[0-9]/ | Local | No | Some UIs | scratch storage folders: same as tmp, but local storage as opposed to shared storage. Optional and hence not available on all UIs. |
/local/${slurm_job_id} | Local | No | CNs | Local storage on compute nodes only available during job execution. Hence folders are automatically created when a job starts and deleted when it finishes. |
/mnt/${complete_filesystem} | Shared | Mixed | SAIs | Complete file systems, which may contain various home , prm , tmp or scr dirs. |
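
As an illustration of the /local/${slurm_job_id} row in the table above, a job script could derive its node-local folder from the SLURM_JOB_ID variable that Slurm sets inside a running job. This is a minimal sketch; the function name job_local_dir is hypothetical and not part of this repo.

```shell
# Hypothetical helper for use inside a job script (not part of this repo).
# SLURM_JOB_ID is set by Slurm for a running job; outside a job the
# parameter expansion below aborts with an error message.
job_local_dir() {
    printf '/local/%s\n' "${SLURM_JOB_ID:?must be run inside a Slurm job}"
}

# Usage inside a job script:
#   scratch_dir="$(job_local_dir)"
#   cp input.dat "${scratch_dir}/"
```

Because the folder is deleted when the job finishes, results must be copied to a tmp or prm folder before the job ends.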
Some other stacks of related machines are:
- docs_library: web servers hosting documentation.
- ...: iRODS machines
Deploying a fully functional stack of virtual machines from scratch involves the following steps:
1. Configure physical machines
   - Off topic for this repo.
2. Deploy OpenStack virtualization layer on physical machines to create an OpenStack cluster.
   - Off topic for this repo.
   - For the Shikra cloud, which hosts the Talos and Gearshift HPC clusters, we use the Ansible playbooks from the hpc-cloud repository to create the OpenStack cluster.
   - For other HPC clusters we use OpenStack clouds from other service providers as is.
3. Create, start and configure virtual machines on an OpenStack cluster to create a Slurm HPC cluster.
   - This repo.
4. Deploy bioinformatics software and reference datasets.
   - Off topic for this repo.
   - We use the Ansible playbook from the ansible-pipelines repository to deploy Lua + Lmod + EasyBuild. The latter is then used to install bioinformatics tools.
Details for phase 3. Create, start and configure virtual machines on an OpenStack cluster to create a Slurm HPC cluster.
mkdir -p ${HOME}/git/
cd ${HOME}/git/
git clone https://github.com/rug-cit-hpc/league-of-robots.git
cd league-of-robots
#
# Create Python virtual environment (once)
#
python3 -m venv openstacksdk.venv
#
# Activate virtual environment.
#
source openstacksdk.venv/bin/activate
#
# Install OpenStack SDK (once) and other python packages.
#
pip3 install --upgrade pip
pip3 install wheel
pip3 install openstacksdk
pip3 install ruamel.yaml
#
# Optional: install Ansible with pip.
# You may skip this step if you already installed Ansible by other means.
# E.g. with HomeBrew on macOS, with yum or dnf on Linux, etc.
#
pip3 install ansible
#
# Optional: install Mitogen with pip.
# Mitogen provides an optional strategy plugin that makes playbooks a lot (up to 7 times!) faster.
# See https://mitogen.networkgenomics.com/ansible_detailed.html
#
pip3 install mitogen
ansible-galaxy install -r requirements.yml
Note: the default location where these dependencies will get installed with the above command is ${HOME}/.ansible/.
The vault password is used to encrypt/decrypt the secrets.yml file per stack_name, which will be created in the next step if you do not already have one.
In addition, a second vault password is used for various files in group_vars/all/, which contain settings that are the same for all stacks.
If you have multiple stacks with their own vault password, you will have multiple vault password files.
The pattern .vault* is part of .gitignore, so if you put the vault password files in the .vault/ subdir, they will not accidentally get committed to the repo.
- To generate a new Ansible vault password and put it in .vault/vault_pass.txt.[stack_name|all], use the following oneliner:
LC_ALL=C tr -cd '[:alnum:]' < /dev/urandom | fold -w60 | head -n1 > .vault/vault_pass.txt.[stack_name|all]
- Or, to use an existing Ansible vault password, create .vault/vault_pass.txt.[stack_name|all] and use a text editor to add the password.
- Make sure the .vault/ subdir and its content are private:
chmod -R go-rwx .vault/
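
The generation and permission steps above can be combined so that the password file is never readable by others, not even briefly. This is a minimal sketch under that assumption; the function name new_vault_password_file is hypothetical and not part of this repo.

```shell
# Hypothetical helper (not part of this repo): create a vault password file
# that is private from the moment it is created.
new_vault_password_file() {
    vault_file="$1"
    (
        # Restrict permissions of newly created files and dirs to the owner.
        # The subshell keeps the umask change local to this function.
        umask 077
        mkdir -p "$(dirname "${vault_file}")"
        # Same oneliner as above: 60 random alphanumeric characters.
        LC_ALL=C tr -cd '[:alnum:]' < /dev/urandom | fold -w60 | head -n1 > "${vault_file}"
    )
}

# Usage:
#   new_vault_password_file .vault/vault_pass.txt.[stack_name|all]
```

Running chmod -R go-rwx .vault/ afterwards is still a good idea when the .vault/ subdir already existed with wider permissions.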
To create a new stack you will need group_vars and a static inventory for that stack:
- See the static_inventories/*.yml files for existing stacks for examples. Create a new static_inventories/[stack_name].yml.
- Create a group_vars/[stack_name]/ folder with a vars.yml. You'll find an example vars.yml file in group_vars/template/.
To generate a new secrets.yml with new random passwords for the various daemons/components and encrypt this new secrets.yml file:
#
# Activate Python virtual env created in step 0.
#
source openstacksdk.venv/bin/activate
#
# Configure this repo for a specific cluster.
# This will set required ENVIRONMENT variables including
# ANSIBLE_VAULT_IDENTITY_LIST='all@.vault/vault_pass.txt.all, [stack_name]@.vault/vault_pass.txt.[stack_name]'
#
. ./lor-init
lor-config [stack_prefix]
#
# Create new secrets.yml file based on a template and encrypt it with the vault password.
#
./generate_secrets.py group_vars/template/secrets.yml group_vars/[stack_name]/secrets.yml
ansible-vault encrypt --encrypt-vault-id [stack_name] group_vars/[stack_name]/secrets.yml
The encrypted secrets.yml can now safely be committed.
The .vault/vault_pass.txt.[stack_name] file is excluded from the repo using the .vault* pattern in .gitignore.
To use an existing encrypted group_vars/[stack_name]/secrets.yml:
- Add a .vault/vault_pass.txt.[stack_name] file to this repo and use a text editor to add the vault password to this file.
We use an SSH public-private key pair to sign the host keys of all the machines in a cluster.
This way users only need the public key of the CA in their ~/.ssh/known_hosts file and will not get bothered by messages like this:
The authenticity of host '....' can't be established.
ED25519 key fingerprint is ....
Are you sure you want to continue connecting (yes/no)?
- The default filename of the CA private key is [stack_name]-ca. A different CA key file must be specified using the ssh_host_signer_ca_private_key variable defined in group_vars/[stack_name]/vars.yml.
- The filename of the corresponding CA public key must be the same as the one of the private key suffixed with .pub.
- The password required to decrypt the CA private key must be specified using the ssh_host_signer_ca_private_key_pass variable defined in group_vars/[stack_name]/secrets.yml, which must be encrypted with ansible-vault.
- Each user must add the content of the CA public key to their ~/.ssh/known_hosts like this:
@cert-authority [names of the hosts for which the cert is valid] [content of the CA public key]
E.g.:
@cert-authority reception*,*talos,*tl-* ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDWNAF....VMZpZ5b9+5GA3O8w== UMCG HPC Development CA
- Example to create a new CA key pair with the ed25519 algorithm and encrypt the private key after that:
ssh-keygen -t ed25519 -a 101 -f ssh-host-ca/[stack_name]-ca -C "CA key for [stack_name]"
ansible-vault encrypt --encrypt-vault-id [stack_name] ssh-host-ca/[stack_name]-ca
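
The known_hosts entry described above can be assembled from the CA public key file instead of being typed by hand. This is a minimal sketch; the function name cert_authority_line is hypothetical and the host patterns in the usage example are only an illustration.

```shell
# Hypothetical helper (not part of this repo): print a known_hosts line
# that marks a key as a trusted host certificate authority.
#   $1 = comma-separated host name patterns for which the CA is valid
#   $2 = path to the CA public key file
cert_authority_line() {
    printf '@cert-authority %s %s\n' "$1" "$(cat "$2")"
}

# Usage (patterns and path are placeholders):
#   cert_authority_line 'reception*,*talos,*tl-*' ssh-host-ca/[stack_name]-ca.pub >> ~/.ssh/known_hosts
```

Only the public key (the .pub file) belongs in known_hosts; the private key stays encrypted in the repo.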
- Make sure you are a member of the docker group. Otherwise you will get this error:
ERRO[0000] failed to dial gRPC: cannot connect to the Docker daemon. Is 'docker daemon' running on this host?: dial unix /var/run/docker.sock: connect: permission denied
context canceled
- Execute:
cd promtools
./build.sh
Execute:
mkdir -p files/[stack_name]
dd if=/dev/urandom bs=1 count=1024 > files/[stack_name]/munge.key
ansible-vault encrypt --encrypt-vault-id [stack_name] files/[stack_name]/munge.key
The encrypted files/[stack_name]/munge.key can now be committed safely.
If in group_vars/[stack_name]/vars.yml you configured:
- create_ldap: yes
  This cluster will create and run its own LDAP server. You will need to create a self-signed TLS certificate for the LDAP server.
- create_ldap: no
  This cluster will use an external LDAP, which was configured and hosted elsewhere, and this step can be skipped.
Execute:
openssl req -x509 -nodes -days 1825 -newkey rsa:4096 -keyout files/[stack_name]/ldap.key -out files/[stack_name]/ldap.crt
openssl dhparam -out files/[stack_name]/dhparam.pem 4096
ansible-vault encrypt --encrypt-vault-id [stack_name] files/[stack_name]/ldap.key
ansible-vault encrypt --encrypt-vault-id [stack_name] files/[stack_name]/ldap.crt
ansible-vault encrypt --encrypt-vault-id [stack_name] files/[stack_name]/dhparam.pem
The encrypted files in files/[stack_name]/ can now be committed safely.
There are two playbooks:
- deploy-os_servers.yml:
  - Creates virtual resources in OpenStack: networks, subnets, routers, volumes and finally the virtual machines.
  - Interacts with the OpenstackSDK / API on localhost.
  - Uses a static inventory from static_inventories/*.yml parsed with our custom inventory plugin inventory_plugins/yaml_with_jumphost.py
- cluster.yml:
  - Configures the virtual machines created with the deploy-os_servers.yml playbook.
  - Has no dependency on the OpenstackSDK / API.
  - Uses a static inventory from static_inventories/*.yml parsed with our custom inventory plugin inventory_plugins/yaml_with_jumphost.py
- Log in to the OpenStack web interface -> Identity -> Application Credentials -> click the Create Application Credential button.
This will result in a popup window: specify Name, Expiration Date, Expiration Time, leave the rest empty / use defaults and click the Create Application Credential button.
In the new popup window click the Download openrc file button and save the generated *-openrc.sh file in the root of the repo.
- Configure environment and run playbook:
#
# Activate Python virtual env created in step 0.
#
source openstacksdk.venv/bin/activate
#
# Initialize the OpenstackSDK
#
source ./[Application_Credential_Name]-openrc.sh
#
# Configure this repo for deployment of a specific stack.
#
source ./lor-init
lor-config [stack_prefix]
ansible-playbook deploy-os_servers.yml
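
Before running the playbook it can help to verify that the openrc file was actually sourced in the current shell. The following is a minimal sketch, not part of this repo: the function name check_openrc is hypothetical, and the list of variables is an assumption about what an application credential openrc file exports, so adjust it to match your own file.

```shell
# Hypothetical pre-flight check (not part of this repo).
# Assumption: the downloaded *-openrc.sh exports the variables listed below.
check_openrc() {
    missing=0
    for name in OS_AUTH_URL OS_APPLICATION_CREDENTIAL_ID OS_APPLICATION_CREDENTIAL_SECRET; do
        # Look up the value of the variable whose name is stored in ${name}.
        eval "value=\"\${${name}:-}\""
        if [ -z "${value}" ]; then
            echo "Missing environment variable: ${name}" >&2
            missing=1
        fi
    done
    return "${missing}"
}

# Usage:
#   check_openrc && ansible-playbook deploy-os_servers.yml
```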
Without local admin accounts we'll need to use
- Either a root account for direct login
- Or a default user account for the image used to create the VMs. This account must be able to sudo su to become the root user.

In our case the CentOS cloud image comes with a default centos user.
Note that:
- Direct login as root will be disabled by the playbook for security reasons, so you will need a local admin account to become root using sudo.
- An admin account must be local, so it does not depend on an external account management server like an LDAP.
- An admin account must have a home dir not in /home, because we will mount home dirs for regular users from a shared storage system over a network and admin accounts must not depend on a ~/.ssh/authorized_keys from an external storage system.
- The default centos account will become useless after the first steps of the playbook have been deployed, because its home dir with ~/.ssh/authorized_keys is located in /home, which will vanish when we mount homes from shared storage. Changing the location of the default centos account is not trivial and can result in a situation where you lock yourself out.
Therefore the first step is to create additional local admin accounts:
- whose home dir is not located in /home and
- who are allowed to sudo su to the root user.
Without signed host keys, SSH host key checking must be disabled for this first step. The next step is to deploy the signed host keys. Once these first two steps have been deployed, the rest of the steps can be deployed with a local admin account and SSH host key checking enabled, which is the default.
In order to reach machines behind the jumphost you will need to configure your SSH client.
The templates for the documentation are located in this repo at:
roles/online_docs/templates/mkdocs/docs/
Deployed docs can currently be found at:
http://docs.gcc.rug.nl/
Once configured correctly you should be able to do a multi-hop SSH via a jumphost to a destination server using aliases like this:
- For login with the same account on both jumphost and destination:
ssh user@jumphost+destination
- For login with a different account on the jumphost:
export JUMPHOST_USER='user_on_jumphost'
ssh user_on_destination@jumphost+destination
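
To make the jumphost+destination naming convention concrete, the sketch below splits such an alias into its two parts. It only illustrates the convention: the function name split_jumphost_alias is hypothetical, and the actual multi-hop behaviour comes from the SSH client configuration described in the end user documentation.

```shell
# Hypothetical helper (not part of this repo): split an alias of the form
# jumphost+destination into its two parts.
split_jumphost_alias() {
    case "$1" in
        *+*) ;;
        *)
            echo "Expected jumphost+destination, got: $1" >&2
            return 1
            ;;
    esac
    # Everything before the first '+' is the jumphost, the rest is the destination.
    printf 'jumphost=%s destination=%s\n' "${1%%+*}" "${1#*+}"
}

split_jumphost_alias 'reception+talos'
# prints: jumphost=reception destination=talos
```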
- Configure the dynamic inventory and jumphost for the Talos test cluster:
export AI_PROXY='reception'
export ANSIBLE_INVENTORY='static_inventories/talos_cluster.yml'
export ANSIBLE_VAULT_IDENTITY_LIST='all@.vault/vault_pass.txt.all, talos_cluster@.vault/vault_pass.txt.talos_cluster'
This can also be accomplished with less typing by sourcing an initialisation file, which provides the lor-config function to configure these environment variables for a specific cluster/site:
. ./lor-init
lor-config tl
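
The vault identity list follows a fixed pattern: one label@path entry for the settings shared by all stacks plus one for the stack itself. The sketch below builds that value for a given stack name. It is an illustration only; the function name vault_identity_list is hypothetical and the real lor-config function may do this differently.

```shell
# Hypothetical helper (not part of this repo; lor-config may differ):
# build the value of ANSIBLE_VAULT_IDENTITY_LIST for the stack named in $1.
vault_identity_list() {
    printf 'all@.vault/vault_pass.txt.all, %s@.vault/vault_pass.txt.%s\n' "$1" "$1"
}

vault_identity_list 'talos_cluster'
# prints: all@.vault/vault_pass.txt.all, talos_cluster@.vault/vault_pass.txt.talos_cluster

# Usage:
#   export ANSIBLE_VAULT_IDENTITY_LIST="$(vault_identity_list talos_cluster)"
```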
- Firstly, create the jumphost, which is required to access the other machines.
  - Create local admin accounts.
  - Deploy the signed host keys.
  - Configure other stuff on the jumphost, which contains amongst others the settings required to access the other machines behind the jumphost.

export ANSIBLE_HOST_KEY_CHECKING=False
ansible-playbook -u centos -l 'jumphost' single_role_playbooks/admin_users.yml
ansible-playbook -u [admin_account] -l 'jumphost' single_role_playbooks/ssh_host_signer.yml
export ANSIBLE_HOST_KEY_CHECKING=True
ansible-playbook -u [admin_account] -l 'jumphost' cluster.yml
- Secondly, deploy the rest of the machines in the same order.
For creation of the local admin accounts you must (temporarily) set JUMPHOST_USER for the jumphost to your local admin account, because the centos user will no longer be able to log in to the jumphost.
export ANSIBLE_HOST_KEY_CHECKING=False
export JUMPHOST_USER=[admin_account] # Requires SSH client config as per end user documentation: see above.
ansible-playbook -u centos -l 'repo,cluster' single_role_playbooks/admin_users.yml
ansible-playbook -u root -l 'docs' single_role_playbooks/admin_users.yml
unset JUMPHOST_USER
ansible-playbook -u [admin_account] -l 'repo,cluster,docs' single_role_playbooks/ssh_host_signer.yml
export ANSIBLE_HOST_KEY_CHECKING=True
ansible-playbook -u [admin_account] -l 'repo,cluster,docs' cluster.yml
- (Re-)deploying only a specific role - e.g. slurm_management - on the previously deployed test cluster Talos
ansible-playbook -u [admin_account] single_role_playbooks/slurm_management.yml
See the end user documentation, which was generated with the online_docs role, for instructions on how to submit a job to test the cluster.