To use the computing and data services of CSD3, please first sign up via this online application form (Raven login required).
Notes:
- "Service Level" choose "Non-paying (SL3) only"
- "Compute Platforms" tick "Peta4-KNL" and "Wilkes2-GPU"
- "dedicated nodes" tick "none"
- SL2 is the paid service level; see the charges table below for what SL2 resources cost
You can use this portal or email Stuart for help. They are very nice and helpful!
On Linux/macOS/UNIX systems, open a terminal window.
On Windows, please download and use PuTTY or Visual Studio Code.
There are several login nodes, depending on the cluster you want to use:
- icelake: `login-icelake.hpc.cam.ac.uk`
(1) To access the Peta4-Skylake (CPU cluster) nodes, type `ssh <username>@login-cpu.hpc.cam.ac.uk`
- Each Peta4-Skylake node has 32 CPU cores (2.6GHz), with 6GB per CPU (192GB total RAM) or 12GB per CPU (384GB total RAM).
- Within the "slurm_submit" file, use `#SBATCH -p skylake` to access the 6GB-per-CPU nodes, and `#SBATCH -p skylake-himem` to access the 12GB-per-CPU nodes.
- Another partition on the CPU cluster is called cclake. Within the "slurm_submit" file, use `#SBATCH -p cclake` to access it.
- If you want to access both skylake and cclake, use `#SBATCH -p cclake,skylake`
- On Peta4-Skylake, SL1 and SL2 users are limited to 1280 cores in use at any one time (maximum walltime of 36 hours per job), and SL3 users are limited to 320 cores (maximum walltime of 12 hours per job).
(2) To access the Peta4-KNL (KNL cluster) nodes, type `ssh <username>@login-knl.hpc.cam.ac.uk`
- Each Peta4-KNL node contains 256 logical CPUs (1.30GHz).
- The memory mode of the allocated KNL nodes can be specified with the `#SBATCH -C` option.
- On Peta4-KNL, SL1 and SL2 users are limited to 128 nodes in use at any one time (maximum walltime of 36 hours per job), and SL3 users are limited to 64 nodes (maximum walltime of 12 hours per job).
(3) To access the Wilkes2-GPU (GPU cluster) nodes, type `ssh <username>@login-gpu.hpc.cam.ac.uk`
- Each Wilkes2-GPU node contains 4 NVIDIA P100 GPUs.
- On Wilkes2-GPU, SL1 and SL2 users are limited to 64 GPUs in use at any one time (maximum walltime of 36 hours per job), and SL3 users are limited to 32 GPUs (maximum walltime of 12 hours per job).
Replace `<username>` with your CRSid; your password is your Raven password.
SL = Service Level.
For more info, see: https://docs.hpc.cam.ac.uk/hpc/user-guide/connecting.html
Charges for each type of cluster can be found here:
Cluster | Unit charged | Price per unit hour |
---|---|---|
Peta4-Skylake | CPU core hours | £0.010 |
Peta4-KNL | KNL node hours | £0.140 |
Wilkes2-GPU | GPU hours | £0.200 |
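As a worked example using the node sizes above: a Peta4-Skylake job that uses all 32 cores of one node for 10 hours is charged 32 × 10 = 320 CPU core hours, i.e. 320 × £0.010 = £3.20.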
A KNL node is more expensive than a Skylake node, so it is better to use Skylake nodes for small-scale jobs that only require a few CPUs.
KNL nodes are good for large-scale jobs that may require several hundred CPUs.
You will be asked to verify the host key fingerprints the first time you log in.
Please check the details here.
Loading a module establishes the environment required to find the related include and library files at compile-time and run-time.
Command | Description |
---|---|
`module avail` or `module av` | List the modules installed on the cluster |
`module av r-` | List the modules whose names start with `r-` |
`module list` | List the modules that are currently loaded |
`module load <module_name>` | Load a module |
`module unload <module_name>` | Unload a module |
`module whatis` | Show available modules with a brief description |
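A typical session combining these commands might look like the sketch below (the R module version is taken from the R setup section later on this page):

```bash
module av r-            # list modules whose names start with "r-"
module load R/4.0.3     # load a specific R version
module list             # confirm it is loaded
module unload R/4.0.3   # unload it when finished
```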
SLURM
Command | Description |
---|---|
`sinfo` | Information about a given partition |
`scontrol` | Information about a given node |
`squeue` | Show global cluster information |
`scontrol show job nnnn` | Examine the job with job ID nnnn |
`scontrol show node nodename` | Examine the node named nodename |
`sbatch` | Submit an executable script to the queueing system |
`sintr` | Submit an interactive job to the queueing system |
`srun` | Run a command either as a new job or within an existing job |
`scancel` | Delete a job |
`mybalance` | Show current balance of core hour credits |
Here is a cheatsheet of job-management commands:
- `sinfo -p skylake` — check info and available resources for the skylake partition
- `sinfo -p skylake -l` — as above, long format
- `sinfo -p skylake -Nel` — as above, detailed node-oriented info
- `sinfo -p skylake -O nodelist,memory,cpus` — get memory and number of CPUs
- `sinfo -p skylake -O nodehost,memory,cpus` — as above, one line per node
- `sinfo -a` — get info about all partitions
- `scontrol show nodes cpu-e-1146` — get detailed information about the node cpu-e-1146
The command `sbatch` is used to submit jobs. For example, after creating a SLURM script "slurm_submit", submit the job to the CSD3 cluster with the command:
sbatch slurm_submit
The section below explains SLURM scripts and the CPU/KNL/GPU SLURM templates you can find in your home directory on the CSD3 cluster.
Cambridge CSD3 cluster uses the SLURM submission system. In normal use of SLURM, one creates a batch job which is a shell script containing the set of commands to run, plus the resource requirements for the job which are coded as specially formatted shell comments at the top of the script. The batch job script is then submitted to SLURM with the sbatch command.
Templates of SLURM submission shell scripts can be found in your home directory `/home/<username>`. For example:

- `slurm_submit.peta4-skylake` is for running CPU jobs
- `slurm_submit.peta4-knl` is for running KNL jobs
- `slurm_submit.wilkes2` is for running GPU jobs
Within each SLURM template, lines beginning with `#SBATCH` are directives to the batch system; the rest of each directive specifies arguments to the `sbatch` command. SLURM stops reading directives at the first executable line (i.e. the first non-blank line that does not begin with `#`).
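As a minimal illustration of this structure (the job name and account are placeholders; the real templates in your home directory are more complete):

```bash
#!/bin/bash
#SBATCH -J example_job         # job name (placeholder)
#SBATCH -A MYPROJECT-SL3-CPU   # project to charge (placeholder; check yours with mybalance)
#SBATCH -p cclake              # partition to submit to
#SBATCH --nodes=1              # number of nodes
#SBATCH --time=00:10:00        # wallclock limit

# First executable line: SLURM stops reading directives here.
echo "Running on $(hostname)"
```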
Here is a detailed SLURM template for running CPU jobs on the skylake or skylake-himem partition, where detailed annotations are given on lines starting with the symbol `#!`. Here is a detailed SLURM template for the cclake partition, which is similar to the above one.
Here is a simplified SLURM template for running cclake CPU jobs, with the annotations excluded for simplicity. cclake has a shorter queuing time, so it is currently better to submit to cclake. cclake is very similar to skylake; the differences can be found here.
Here is a SLURM script template for running KNL jobs (pending).
Long jobs can run with wall times (i.e. real execution times) of up to 7 days.
Long job QoS is not given by default. To use long jobs, please contact support via the portal or email [email protected], describing the details of the jobs and explaining why long jobs are necessary.
Long jobs need to use -long variants of the usual partitions (skylake-long, knl-long, pascal-long).
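For example, the directives of a long job might look like this sketch (only the partition and time limit differ from a normal job):

```bash
#SBATCH -p skylake-long     # the -long variant of the partition
#SBATCH --time=7-00:00:00   # up to 7 days of wall time
```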
Array jobs allow the submission and management of multiple similar jobs. For example, 10 jobs can be submitted using a single SLURM script. Detailed info on job arrays can be found here.
Here is a SLURM template for submitting array jobs to the `cclake` partition. Then, within the R script, add the commands below:
task_id <- Sys.getenv("SLURM_ARRAY_TASK_ID")
if (task_id == "") { stop("Need arguments!") }  # Sys.getenv() returns "" when the variable is unset
task_id <- as.integer(task_id)                  # convert to an integer index
where `task_id` can be used as the index of each job.
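On the SLURM side, a minimal sketch of the array-specific lines (the R script name is a placeholder):

```bash
#SBATCH --array=1-10   # submit 10 similar jobs; each gets its own SLURM_ARRAY_TASK_ID

Rscript my_script.R    # the R script reads SLURM_ARRAY_TASK_ID as shown above
```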
Launching a job requires both mandatory parameters and accessory (optional) ones.
Command | Description |
---|---|
-A | Project to be charged (use `mybalance` to see which one you should use) |
-p | Partition to use (e.g. skylake, skylake-himem, ...) |
--nodes | Number of nodes requested |
--cpus-per-task | Controls the number of CPUs allocated per task |
--time | Wallclock time required for the job |
--mem | Total memory requested |
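Putting the table together, the directive block of a submission script might look like this sketch (the account name is a placeholder; check yours with `mybalance`):

```bash
#SBATCH -A MYPROJECT-SL3-CPU   # -A: project to be charged (placeholder)
#SBATCH -p cclake,skylake      # -p: partition(s) to use
#SBATCH --nodes=1              # number of nodes requested
#SBATCH --cpus-per-task=4      # CPUs allocated per task
#SBATCH --time=02:00:00        # wallclock time required
#SBATCH --mem=16G              # total memory requested
```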
To get more info on SLURM, see the SLURM CPU Management User and Administrator Guide.
On Windows, you can use WinSCP to transfer data and code between your local disk and the CSD3 cluster.
WinSCP setup instructions can be found here.
To use R on the cluster, first load the required modules:

module load pkg-config-0.29.2-gcc-6.2.0-we4glmw
module load R/4.0.3
module load gcc/9
Other versions of R or other packages can be loaded if necessary.
R packages are installed using the following steps:

- Load the correct version of R within the terminal (e.g., PuTTY)
- Check that the target version of R has been loaded by running `module list`
- Run R interactively by calling `R` within the terminal
- Install R packages using `install.packages`

More info can be found at this page.
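In practice, the steps above look roughly like this (the package name is just an example):

```bash
module load R/4.0.3   # step 1: load the correct R version
module list           # step 2: confirm it is loaded
R                     # step 3: start R interactively
# step 4, at the R prompt:
#   install.packages("ggplot2")   # example package; installs into your user library
```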
NB: If you want to run an R script on an `icelake` node and you need to install a package for it, you need to log into the `icelake` partition in your terminal (via PuTTY, etc.), load an R version that is compatible with icelake (e.g. `R/4.1.0-icelake`), and install the necessary packages using this version.
The Simple Features `sf` package, useful for using and exploring spatial data and methods in R, requires several dependencies that need to be loaded within the terminal prior to package installation. These modules are:
geos-3.6.2-gcc-5.4.0-4cvhomr
gdal-3.4.1-gcc-5.4.0-h4wkspp
gcc/9
R/4.0.3
pkg-config-0.29.2-gcc-6.2.0-we4glmw
Once these modules have been loaded on the cluster, R can be run interactively as in the procedure described above.
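Concretely, the load sequence before installing `sf` might look like this (module names copied from the list above):

```bash
module load geos-3.6.2-gcc-5.4.0-4cvhomr
module load gdal-3.4.1-gcc-5.4.0-h4wkspp
module load gcc/9
module load R/4.0.3
module load pkg-config-0.29.2-gcc-6.2.0-we4glmw
R   # then run install.packages("sf") at the R prompt
```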
- Download the example R script and SLURM script.
- Put these two files in a folder under your HPC directory `/rds/rds-hs743-arbodynamic`.
- In the R script, change the working directory to the folder containing these two files.
- In your terminal (e.g., PuTTY), submit the job using the command `sbatch slurm_submit.peta4-cclake`.
module load beagle-lib-2.1.2-gcc-4.8.5-ti5kq5r
This is the most recent version of Beagle available on the cluster, and so far it has been sufficient. If you want a more recent Beagle version, it needs to be installed from source code. There are instructions here.
Install BEAST in your working folder; when you use BEAST interactively, you run it from the `bin` sub-directory of the BEAST installation. Navigate to the directory where you want to install BEAST and:
- Download and unpack BEAST:
wget 'https://github.com/beast-dev/beast-mcmc/releases/download/v1.10.4/BEASTv1.10.4.tgz'
tar -zxvf BEASTv1.10.4.tgz
cd BEASTv1.10.4/bin
- Check if BEAST and Beagle are cooperating:
beast -beagle_info
- Run BEAST interactively from the `./bin` subdirectory by calling `beast` with its options, e.g.:
  `~/yourdir/BEASTv1.10.4/bin/beast -overwrite ~/myfiles/file1.xml`
More info on Beagle options for BEAST is on this page.
You can submit BEAST jobs to the CPU or GPU clusters, depending on the size of your data. However, due to an incompatibility, we cannot use the `cclake` CPU partition. The main difference from other example SLURM scripts is how BEAST and its options are called, as in the sketch below.
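A sketch of the BEAST-specific part of such a script (paths and file names are placeholders):

```bash
module load beagle-lib-2.1.2-gcc-4.8.5-ti5kq5r   # Beagle module from above
~/yourdir/BEASTv1.10.4/bin/beast -overwrite ~/myfiles/file1.xml
```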
- Here is a detailed SLURM template for running CPU jobs on the skylake or skylake-himem partition, in which detailed annotations are given on lines starting with the symbol `#!`.
- There are also sample SLURM templates for running CPU jobs that regularly save states from the BEAST run into checkpoint files (so they can be resumed later), and for resuming such BEAST runs from previously saved checkpoint files. Documentation on these options on the BEAST community website can be found here.
- Here is a detailed SLURM template for running GPU jobs on the ampere partition, in which detailed annotations are given on lines starting with the symbol `#!`.
- There are also sample SLURM templates for running GPU jobs on the ampere partition that regularly save states from the BEAST run into checkpoint files, and for resuming such runs from previously saved checkpoint files. Documentation on these options on the BEAST community website can be found here.
The following acknowledgement can be used in papers:
This study was performed using resources provided by the Cambridge Service for Data Driven Discovery (CSD3) operated by the University of Cambridge Research Computing Service (www.csd3.cam.ac.uk), provided by Dell EMC and Intel using Tier-2 funding from the Engineering and Physical Sciences Research Council (capital grant EP/T022159/1), and DiRAC funding from the Science and Technology Facilities Council (www.dirac.ac.uk).