Skip to content

Commit

Permalink
[GL-476] Docker Image Best Practices - Previously [GL-1379] Add AppSe…
Browse files Browse the repository at this point in the history
…c trivy scans on PR (broadinstitute#214)

* Add AppSec GitHub Trivy Action

* [dockers/broad] Set default ARG values in Dockerfiles

* [trivy] Add scans for prebuilt images

* [trivy] Minor cleanup of trivy.yml

* [trivy] Fix prebuilt-scan comment

* Cleanup GoTC Dockerfile

* updating illumina_iaap_autocall dockerfile to pass trivy scans

* Working on multi sample dockerfile

* Remove unused multi-sample array docker directory

* Remove multisample from actions

* Change VerifyBamID to 'blesses' base image

* Change and push GoTC, Illumina, BamID

* Update all GoTC to v2.5.7

* Update all GoTC to v2.5.7

* Update pipeline_version and changelog for WDLs using GoTC

* Add missing changelogs

* Update IlluminaGenotypingArray runtime and dependent WDLs

* Update BamProcessing runtimeand dependent WDLs

* Fix mismatched changelogs

* Rollback commits into single versions

* Make note about updated BWA version

* [GL-1567] dsde toolbox action scan (broadinstitute#321)

* Add prebuilt scan for dsde-toolbox and picard-cloud

* Add stable semantic tag for dsde-toolbox and bump picard-cloud scan to 2.25.5

* [GL-1590] Clean up broad image scripts (broadinstitute#334)

* Refactor build script for picard private

* Small cleanup

* Add help flag and change exit code

* VerifyBamID docker script

* Zcall script clean up

* Convert to single layer images....may change to multi stage builds later

* Finish Illumina Iaap build script and overall final cleanup

* Comment clean up

* Change back to Unix epoch timestamp

* Fix picard private to run from Dockerfile

* Uncomment push and small format

* Move artifactory url to Dockerfile for picard private

* Remove references to dockerhub

* [GL-204] Docker Images best practices (broadinstitute#337)

* Refactor build script for picard private

* Small cleanup

* Add help flag and change exit code

* VerifyBamID docker script

* Zcall script clean up

* Convert to single layer images....may change to multi stage builds later

* Finish Illumina Iaap build script and overall final cleanup

* Comment clean up

* Change back to Unix epoch timestamp

* Fix picard private to run from Dockerfile

* Uncomment push and small format

* Move artifactory url to Dockerfile for picard private

* Add tini

* Small formatting change

* More merge conflict

* Switch to official python3-alpine image for zcall

* Fix missing tini install in verify_bam_id

* [GL-1591] Update Broad Images to Python 3 (broadinstitute#332)

* Add prebuilt scan for dsde-toolbox and picard-cloud

* Add stable semantic tag for dsde-toolbox and bump picard-cloud scan to 2.25.5

* update broad images to python3

* verfied python3 for broad images, removing cache to reduce image size

* vscode doesnt pick up changes unless you save them, thats fun

Co-authored-by: Wes Dingman <[email protected]>

* Remove unneeded python3 install

* [GL-1575] Samtools specific docker images (broadinstitute#340)

* Create Samtools docker image and replace GoTC tasks that were only using Samtools

* Include missing WDL changes

* Add samtools image to trivy scan

* Fix help text

* Update Dockerfile

* Fix comments in build script

* Convert samtools to alpine instead

* Include missing WDL changes

* small formatting changes

* Swith line continuation syntax

* Fix build script typo

* Fix changelogs

* Add Utilities note to changelog

* [GL-1598] Build and use GATK-specific docker images instead of GOTC (broadinstitute#347)

* Create Samtools docker image and replace GoTC tasks that were only using Samtools

* Include missing WDL changes

* Add samtools image to trivy scan

* Fix help text

* Update Dockerfile

* Fix comments in build script

* Convert samtools to alpine instead

* Include missing WDL changes

* small formatting changes

* Swith line continuation syntax

* Fix build script typo

* adding gatk-specific docker image

* cleaned up new gatk Dockerfile and build script based on new best practices

* adding gatk to trivy action, updating changelogs

* updating alpine image, refactoring to use temp directory and remove unused source code

* refactoring gatk35 to gatk3

* adding bash dependency to gatk dockerfile

* aliasing python, adding py3-pip

* updating ~ to root for python alias

* updating gatk4 release path for download, fixing python

* updating date of last commit in changelogs, alphabetizing tools installed

Co-authored-by: Wes Dingman <[email protected]>

* [GL-1604]Standardized formatting across existing images (broadinstitute#350)

* Standardized formatting across existing images

* Apply suggestions from code review

Co-authored-by: cheyenne-gold <[email protected]>

Co-authored-by: cheyenne-gold <[email protected]>

* [GL- 1599] Build and use SAMTOOLS + PICARD + BWA-specific docker images instead of GOTC (broadinstitute#351)

* Create Samtools docker image and replace GoTC tasks that were only using Samtools

* Include missing WDL changes

* Add samtools image to trivy scan

* Fix help text

* Update Dockerfile

* Fix comments in build script

* Convert samtools to alpine instead

* Include missing WDL changes

* small formatting changes

* Swith line continuation syntax

* Fix build script typo

* adding gatk-specific docker image

* cleaned up new gatk Dockerfile and build script based on new best practices

* adding gatk to trivy action, updating changelogs

* updating alpine image, refactoring to use temp directory and remove unused source code

* refactoring gatk35 to gatk3

* adding bash dependency to gatk dockerfile

* aliasing python, adding py3-pip

* updating ~ to root for python alias

* updating gatk4 release path for download, fixing python

* updating date of last commit in changelogs, alphabetizing tools installed

* creating samtools_picard_bwa-specific image, updating Alignment.wdl runtime from gotc

* updating trivy scan, updating changelogs

* fixing bwa executable on image

* adding java to installed dependencies

* adding temp directory to install process

Co-authored-by: Wes Dingman <[email protected]>

* fixing spacing

Co-authored-by: Wes Dingman <[email protected]>

Co-authored-by: Wes Dingman <[email protected]>

* Update Qc.wdl

fixing syntax removed while fixing merge conflicts

* [GL-1592] Docker README (broadinstitute#359)

* Samtools README

* Add missing files

* Add more missing files

* Formatting

* Add samtools usage README

* Add zcall README and update image tag to include version

* WIP illumina

* fixed broken iaap docker image

* Working on README's

* Add missing files

* Add bash to illumina image

* Update changelogs for affected pipelines

* Move GoTC image to deprecated location

* Add missing files

* Remove GoTC from trivy scans

* revert zcall to python2 because of breaking syntax error

* Add Illumina and Samtools_Picard_BWA note

* Add PR fixes

* Update dockers/broad/gatk/README.md

Co-authored-by: ekiernan <[email protected]>

Co-authored-by: cgold <[email protected]>
Co-authored-by: ekiernan <[email protected]>

* [GL-1593] Docker style guide (broadinstitute#373)

* styleguide initial commit

* WIP

* WIP

* Style guide v1

* Fix small formatting

* Spelling

* Update dockers/README.md

Co-authored-by: ekiernan <[email protected]>

* Add troubleshooting section

* merge conflicts

* Incorporate liz comments

* Incorporate PR comments and fix somre spacing issues

Co-authored-by: ekiernan <[email protected]>

* fixing iaap docker version (broadinstitute#380)

* Add quay url to docker build scripts (broadinstitute#377)

* updating images impacted by apk-tools critical vulnerability

* testing arrays-picard-private trivy scan

* applying fix to illumina and zcall

* updating changelogs and pushing docker images to the cloud

* fixing pipeline version error

* Fix picard private docker image (broadinstitute#418)

* Fix picard private docker image

* Update changelogs

* output diff from compare GVCFs

* Increase mem to VerifyGermlineSingleSample

* Increase memory capacity for CompareGvcfs and make diff --speed-large-files

* Fix memory units

* Hardcoding memory and removing speed-large-files

* Add some debuggings logs

* Fix check for grep diff lines

* fix typo

* fix memory again

* Add some more logging statements and writes

* Add some more logging statements and writes

* Add monitoring_script

* fix parameter expansion

* Write output to stdout

* Make outputs optional

* Split out commands to not have pipes while running diff

* Add additional disk space

* Increase disk space and dont fail on return code 1

* Revert compareGVCFs task

* Imputation changelog

* Imputation release #

* Change imputation date

Co-authored-by: Wes Dingman <[email protected]>
Co-authored-by: Wes Dingman <[email protected]>
Co-authored-by: cgold <[email protected]>
Co-authored-by: cheyenne-gold <[email protected]>
Co-authored-by: ekiernan <[email protected]>
Co-authored-by: jessicaway <[email protected]>
  • Loading branch information
7 people authored Oct 6, 2021
1 parent 0aaf4b6 commit f2b29e0
Show file tree
Hide file tree
Showing 79 changed files with 1,627 additions and 731 deletions.
39 changes: 39 additions & 0 deletions .github/workflows/trivy.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
name: dsp-appsec-trivy
on: [pull_request]

jobs:
build-and-scan:
name: DSP AppSec Trivy check
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
context:
- dockers/broad/arrays_picard_private
- dockers/broad/gatk
- dockers/broad/illumina_iaap_autocall
- dockers/broad/verify_bam_id
- dockers/broad/samtools
- dockers/broad/samtools_picard_bwa
- dockers/broad/zcall
steps:
- uses: actions/checkout@v2

- uses: broadinstitute/dsp-appsec-trivy-action@v1
with:
context: ${{ matrix.context }}


prebuilt-scan:
name: DSP AppSec Trivy check
runs-on: ubuntu-latest
strategy:
fail-fast: false
matrix:
image:
- us.gcr.io/broad-gotc-prod/dsde-toolbox:stable_06-10-2021
- us.gcr.io/broad-gotc-prod/picard-cloud:2.25.5
steps:
- uses: broadinstitute/dsp-appsec-trivy-action@v1
with:
image: ${{ matrix.image }}
227 changes: 227 additions & 0 deletions dockers/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,227 @@
# Docker Style Guide

This style guide provides formatting guidelines and best practices for writing Dockerfiles in WARP.

## :book: Table of Contents

* [Overview](#overview)
* [Goals](#goals)
* [Small images](#small)
* [Alpine base](#alpine)
* [Minimal RUN steps](#minimal-run)
* [Publicly accessible](#publicly)
* [Image scanning](#scanning)
* [Semantic tagging](#semantic)
* [Proper process reaping](#process)
* [Build Scripts and README](#build)
* [Formatting](#formatting)
* [Troubleshooting](#trouble)
## <a name="overview"></a> Overview

WARP maintains a collection of docker images which are used as execution environments for various cloud-optimized data processing pipelines. Many of these image require specific sets of tools and dependencies to run and can be thought of as _custom_ images rather than traditional application images.
Building and maintaining these images can be challenging; this document provides a set of guidelines to assist in developing docker images for WARP.

## <a name="goals"></a> Goals

The following are some goals/guidelines we want to strive for when writing our Dockerfiles.

### <a name="small"></a> Small images

Building a smaller image offers advantages such as faster upload and download times along with reduced storage costs and minimized attack vector. Two of the easiest ways to minimize the size of your image is to use a small base image and to reduce the number of layers in your image.

#### <a name="alpine"></a> Alpine base

The easiest way to have a small image is to use an [Alpine](https://alpinelinux.org/) base image. Alpine linux, compared to Debian, RHEL etc., is designed specifically for security and resource efficiency, and lends itself perfectly to be used as a building block for Docker images.

Along with being a small base, Alpine also has built in deletion of package index and provides [tini](https://github.com/krallin/tini) natively through APK.

There are some instances where a Debian base image is unavoidable, specifically in the case where dependencies don't exists in APK. It is suggested that you only go to a Debian base as a last resort.


##### :eyes: Example
```dockerfile

#OKAY, NOT GREAT - uses debian
FROM python:debian

RUN set eux; \
apt-get-update; \
apt-get install -y \
curl \
bash \
; \
# Must clean up cache manually with Debian
apt-get clean && rm -rf /var/lib/apt/list/*

# GOOD - uses alpine
FROM alpine:3.9

RUN set eux; \
apk add --no-cache \
curl \
bash \
```

#### <a name="minimal-run"></a> Minimal RUN steps

Having minimal `RUN`steps (ideally one) is another highly effective way to reduce the size of your image. Each instruction in a Dockerfile creates a [layer](https://docs.docker.com/storage/storagedriver/) and these layers are what add up to build the final image.
When you use multple `RUN` steps it creates additional unnecessary layers and bloats your image.

An alternative to having a single `RUN` step is to use [multi-stage builds](https://docs.docker.com/develop/develop-images/multistage-build/) which are effective when the application your are containerizing is just a statically linked binary.
Just to note, many of the images maintained in WARP require a handful of system-level dependencies and custom packages so multi-stages builds are typically not used.

##### :eyes: Example
```dockerfile

# BAD - uses multiple RUN steps
RUN set eux
RUN apk add --no-cache curl bash wget
RUN wget https://www.somezipfile.com/zip
RUN unzip zip

# GOOD - uses single RUN step
RUN set eux; \
apk add --no-cache \
curl \
bash \
; \
wget https://www.somezipfile.com/zip; \
unzip zip
```

### <a name="publicly"></a> Publicly accessible

The pipelines that we maintain in WARP are designed for public use, ideally we would like our docker images to be publicly available as well. This would mean the following conditions must be true.

* Anybody can pull our images
* Anybody can build our images

For anybody to be able to pull our images they must be hosted on a public container registry, we host all of our images in publics repos on GCR (our 'official' location) and Quay (for discoverability).

* GCR - `us.gcr.io/broad-gotc-prod`
* Quay - `quay.io/broadinstitute/broad-gotc-prod`

For anybody to be able to build our images all of the functionality should be encapsulated in the Dockerfile. Any custom software packages, dependencies etc. have to be downloaded from public links within the Dockerfile, this obviously means that we should not be copying files from within the Broad network infrastucture into our images.

### <a name="scanning"></a> Image scanning


All of the images that we build are scanned for critical vulnerabilities on every pull request. For this we use a github-action that leverages [trivy](https://github.com/aquasecurity/trivy) for scanning. If you build a new image please add it to the action [here](../.github/workflows/trivy.yml).

### <a name="semantic"></a> Semantic tagging


We recommend against using rolling tags like `master` or `latest` when building images. Rolling tags make it hard to track down versions of images since the underlying image hash and content could be different across the same tags. Instead we ask that you use a semantic tag that follows the convention below:

##### `us.gcr.io/broad-gotc-prod/samtools:<image-version>-<samtools-version>-<unix-timestamp>`

This example is for an image we use containing `samtools`. The 'image-version' in this case is the traditional `major.minor.patch` version of the image being built, which is updated when changes to the image (underlying OS, system level packages, etc.) unrelated to `samtools` are made. The 'samtools-version' here correlates with the specific version of `samtools` being used, having this information in the tag makes it easy for users to identify and not have to track down. Lastly, a unix timestamp in included to avoid any potential issues with Cromwell image caching.

### <a name="process"></a> Proper process reaping


Classic init systems like systemd are used to reap orphaned, zombie processes. Typically these orphaned processes are reattached to the process at PID 1 which will reap them when they die. In a container this responsibility falls to process at PID 1 which is by default `/bin/sh`...this obviously will not handle process reaping. Because of this you run the risk of expending excess memory or resources within your container. A simple solution to this is to use `tini` in all of our images, a lengthy explanation of what this package does can be found [here](https://github.com/krallin/tini/issues/8).

Luckily `tini` is available natively through APK so all you have to do is install it and set it as the default entrypoint!

##### :eyes: Example
```dockerfile

FROM alpine:3.9

RUN set eux;
apk add --no-cache \
tini

ENTRYPOINT ["/sbin/tini" , "--"]
```

## <a name="build"></a> Build Scripts and README

To make life easier when building and pushing our images we like to have an easy-to-use `docker_build.sh` that sits next to each Dockerfile. These scripts should have configurable inputs for the version of tools (samtools, picard, zcall, etc.) being used in the image. Additionally, we like to keep a record of the versions built and being used by writing the images to the accompanying `docker_versions.tsv`, this should be done automatically by your build script.

For first time users of these images it is helpful to have a README which gives a high-level overview of the image.

See the examples for samtools([docker_build](./broad/samtools/docker_build.sh), [docker_versions](./broad/samtools/docker_versions.tsv), [README](./broad/samtools/README.md))

## Formatting

Formatting our Dockerfiles consistenty helps improve readability and eases maintenance headaches down the road. The following are a couple of tenants that we follow when writing our Dockerfiles:

* ARGS, ENV, LABEL in that order
* Always add versions of tools in the LABEL
* Single RUN steps
* Alphabetize package install
* Clean up package index cache
* Use ; instead of && for line continuation
* Logically seperate steps within RUN
* Four spaces per tab indent
* Short comments to describe each step
* tini is always default entrypoint

The following is a good example for our `verify_bam_id` image. This Dockerfile shows how to install packages, install tini and clean up cached index files for a debian base image.

##### :eyes: Example
```dockerfile
# Have to use debian based image, many of the installed packages here are not available in Alpine
FROM us.gcr.io/broad-dsp-gcr-public/base/python:debian

ARG GIT_HASH=c1cba76e979904eb69c31520a0d7f5be63c72253

ENV TERM=xterm-256color \
BAMID_URL=https://github.com/Griffan/VerifyBamID/archive \
TINI_VERSION=v0.19.0

LABEL MAINTAINER="Broad Institute DSDE <[email protected]>" \
GIT_HASH=${GIT_HASH}

WORKDIR /usr/gitc

# Install dependencies
RUN set eux; \
apt-get update; \
apt-get install -y \
autoconf \
cmake \
g++ \
gcc \
git \
libbz2-dev \
libcurl4-openssl-dev \
libhts-dev \
libssl-dev \
unzip \
wget \
zlib1g-dev \
; \
# Install BamID
wget ${BAMID_URL}/${GIT_HASH}.zip; \
unzip ${GIT_HASH}.zip; \
\
cd VerifyBamID-${GIT_HASH}; \
mkdir build; \
cd build; \
CC=$(which gcc) CXX=$(which g++) cmake ..; \
\
cmake; \
make; \
make test; \
\
cd ../../; \
mv VerifyBamID-${GIT_HASH}/bin/VerifyBamID .; \
rm -rf ${GIT_HASH}.zip VerifyBamID-${GIT_HASH} \
; \
# Install tini
wget https://github.com/krallin/tini/releases/download/$TINI_VERSION/tini -O /sbin/tini; \
chmod +x /sbin/tini \
; \
# Clean up cached files
apt-get clean && rm -rf /var/lib/apt/lists/*

# Set tini as default entrypoint
ENTRYPOINT [ "/sbin/tini", "--" ]
```

## <a link="trouble"></a> Troubleshooting

If you have any questions or would like some more guidance on writing Dockerfiles please reach out at [[email protected]]([email protected]).
23 changes: 0 additions & 23 deletions dockers/broad/VerifyBamId/Dockerfile

This file was deleted.

10 changes: 0 additions & 10 deletions dockers/broad/VerifyBamId/build_docker.sh

This file was deleted.

41 changes: 41 additions & 0 deletions dockers/broad/arrays_picard_private/Dockerfile
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
FROM frolvlad/alpine-glibc:glibc-2.33

ARG PICARD_PRIVATE_VERSION=257537c72dae29257b09bacf413505eed295ac32

ENV TERM=xterm-256color \
NO_VAULT=true \
ARTIFACTORY_URL=https://broadinstitute.jfrog.io/artifactory/libs-snapshot-local/org/broadinstitute/picard-private

LABEL MAINTAINER="Broad Institute DSDE <[email protected]>" \
PICARD_PRIVATE_VERSION=${PICARD_PRIVATE_VERSION}

WORKDIR /usr/gitc

# Install dependencies
RUN set -eux; \
apk --no-cache upgrade; \
apk add --no-cache \
bash \
curl \
findutils \
jq \
openjdk8-jre \
python3 \
tini \
unzip \
wget \
; \
# Download picard private jar
curl -L ${ARTIFACTORY_URL}/${PICARD_PRIVATE_VERSION}/jars/picard-private-all-${PICARD_PRIVATE_VERSION}.jar > picard-private.jar \
; \
# Download the gsutil install script
wget https://dl.google.com/dl/cloudsdk/channels/rapid/install_google_cloud_sdk.bash -O - | bash \
; \
# Set up Vault
curl -L https://releases.hashicorp.com/vault/1.0.2/vault_1.0.2_linux_amd64.zip > temp.zip; \
unzip temp.zip; \
rm temp.zip; \
mv vault /usr/local/bin/

# Set tini as default entry point
ENTRYPOINT ["/sbin/tini", "--"]
Loading

0 comments on commit f2b29e0

Please sign in to comment.