The scripts in this repository use AWS Lambda functions to align 640 million reads with bwa in under 3 minutes, a process that takes 20 hours when running the optimised software on a single thread. The entire UMI RNA-seq pipeline, converting fastq files to transcript counts, takes under 20 minutes on an m4.4xlarge instance. This is 100x faster than the 30 hours needed for the original unoptimised pipeline and 12x faster than an optimised pipeline run with 16 threads.
The repo contains the scripts used to execute the pipeline. Additional files that need to be uploaded to the S3 bucket are here:
https://drive.google.com/open?id=1nj6IoltH77i_Ikd04ey-jBcT_QRlpsk4
This includes the executables and human reference files.
The RNA-seq pipeline itself is described in "Holistic optimization of an RNA-seq workflow for multi-threaded environments": https://academic.oup.com/bioinformatics/advance-article-abstract/doi/10.1093/bioinformatics/btz169/5374759
The executables used in the Lambda workflow are bwa (https://github.com/lh3/bwa) and three binaries, umisplit, umimerge_filter, and umi_merge, from https://github.com/lhhunghimself/LINCS_RNAseq_cpp
umisplit demultiplexes the reads and separates them into smaller files (the maximum file size can be set by the user). It needs to be compiled on the EC2 instance that will run it.
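The idea can be sketched in a few lines of Python; this is purely illustrative, since the real umisplit is a C++ binary with its own options, and the read-count cap below is a made-up stand-in for its actual size parameter.

```python
# Illustrative sketch only: the real umisplit is a C++ binary with its own
# command-line options.  This just shows the idea of routing reads to
# per-barcode files and starting a new file once a (hypothetical) cap is hit.
MAX_READS_PER_FILE = 500_000  # stand-in for the user-set maximum size

def split_reads(reads, out_dir):
    """`reads` yields (barcode, fastq_record_text) pairs from the demultiplexer."""
    handles, counts, parts = {}, {}, {}
    for barcode, record in reads:
        if barcode not in handles or counts[barcode] >= MAX_READS_PER_FILE:
            if barcode in handles:
                handles[barcode].close()
            parts[barcode] = parts.get(barcode, -1) + 1
            counts[barcode] = 0
            handles[barcode] = open(f"{out_dir}/{barcode}.{parts[barcode]}.fastq", "w")
        handles[barcode].write(record)
        counts[barcode] += 1
    for handle in handles.values():
        handle.close()
```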
umimerge_filter converts a SAM-formatted alignment into a hash value based on the gene that the read is aligned to and the position of the alignment. It runs on the Lambda functions and should be compiled on an instance running Amazon Linux (the OS used by the Lambda functions).
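Conceptually, each aligned read is reduced to a compact key derived from the gene it hits and the alignment position, so downstream deduplication only needs these keys plus the barcode. The Python sketch below illustrates the idea; the real umimerge_filter is a C++ binary and its exact encoding may differ.

```python
# Conceptual sketch of collapsing a SAM record to a (gene, position) key.
# Assumes a standard SAM line: reference name in column 3, 1-based leftmost
# position in column 4.  The real umimerge_filter (C++) may encode this
# differently.
def sam_line_to_key(sam_line):
    fields = sam_line.rstrip("\n").split("\t")
    reference, position = fields[2], int(fields[3])
    return hash((reference, position))
```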
This deduplicates the reads with identical barcodes that map to the same position and produces the final set of transcript counts. This needs to be compiled on the EC2 instance that will run umimerge-parallel.
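The deduplication amounts to counting each unique combination of barcode and alignment key once per gene; the rough Python sketch below shows that logic under hypothetical input names (the real binary is C++).

```python
from collections import defaultdict

# Rough sketch of UMI deduplication: reads carrying the same barcode that map
# to the same position are counted once.  `records` is assumed to yield
# (gene, barcode, alignment_key) tuples produced by the filtering step;
# these names are hypothetical.
def count_transcripts(records):
    seen = set()
    counts = defaultdict(int)
    for gene, barcode, alignment_key in records:
        key = (gene, barcode, alignment_key)
        if key not in seen:
            seen.add(key)
            counts[gene] += 1
    return counts
```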
This is the master shell script that runs the pipeline on an EC2 instance and launches the Lambda functions. It times and launches the other component scripts, which are described next.
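In outline, the orchestration is: launch each step, overlap splitting with uploading, and time the whole run. The Python sketch below shows that pattern only; the actual master script is bash, and the component script names here are stand-ins.

```python
import subprocess
import time

# Sketch of the orchestration pattern; the real master script is bash and
# these script names are stand-ins for the components described below.
start = time.time()

subprocess.run(["./downloadFastqFiles.sh"], check=True)    # fetch inputs from S3
splitter = subprocess.Popen(["./runUmisplit.sh"])          # demultiplex in the background
subprocess.run(["./runUploadSplitFiles.sh"], check=True)   # upload split files as they finish
splitter.wait()
subprocess.run(["./launchLambdas.sh"], check=True)         # fan out bwa alignments to Lambda
subprocess.run(["./runUmimerge.sh"], check=True)           # dedup and produce transcript counts

print(f"pipeline finished in {time.time() - start:.1f} s")
```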
This script uses multiple threads to download the fastq files from S3 to the EC2 instance.
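A multi-threaded download can be sketched with boto3 and a thread pool as below; the actual script is a shell script, and the bucket and key names here are placeholders.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import boto3

# Sketch of a multi-threaded S3 download; the real script is a shell script,
# and the bucket/key names passed in are placeholders.
s3 = boto3.client("s3")

def download(bucket, key, dest_dir):
    local_path = os.path.join(dest_dir, os.path.basename(key))
    s3.download_file(bucket, key, local_path)
    return local_path

def download_fastqs(bucket, keys, dest_dir, threads=16):
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(lambda k: download(bucket, k, dest_dir), keys))
```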
This script is a simple wrapper around the umisplit executable on EC2. It is meant to be launched as a background process so that the upload of files can proceed as soon as a split file has been generated.
These two scripts upload the split files to S3. runUploadSplitFiles.sh looks for complete files generated by umisplit and also checks whether umisplit has finished executing; umisplit signals that a file is ready for transfer, or that splitting is finished, by writing special marker files, which runUploadSplitFiles.sh looks for. If umisplit is finished, runUploadSplitFiles.sh calls uploadSplitFiles.sh to upload the remaining split files to S3 and then terminates. Otherwise it calls uploadSplitFiles.sh to upload the existing complete files, sleeps for 1 second, and checks again whether umisplit has finished.
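In outline, the polling logic looks like the Python sketch below; the real scripts are bash, and the marker-file name here is hypothetical.

```python
import glob
import subprocess
import time

# Sketch of the polling loop: upload whatever split files are complete, and
# keep looping until umisplit signals (via a special file) that it is done.
# The marker path is hypothetical; the real scripts are bash.
DONE_MARKER = "split_output/umisplit.done"

def upload_complete_files():
    # stands in for uploadSplitFiles.sh, which pushes complete split files to S3
    subprocess.run(["./uploadSplitFiles.sh"], check=True)

def watch_and_upload():
    while True:
        if glob.glob(DONE_MARKER):
            upload_complete_files()  # pick up anything written after the last pass
            break
        upload_complete_files()
        time.sleep(1)
```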
This script launches the Lambda functions, assigning one small demultiplexed fastq file to each function. It monitors the alignment output files to determine which Lambda functions have finished processing; when all files have been aligned, the script exits.
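With boto3 the launch step would look roughly like the sketch below; the function name and payload fields are placeholders, and the actual script may use the AWS CLI instead.

```python
import json
import boto3

# Sketch of fanning out one Lambda invocation per split fastq file.
# The function name and payload fields are placeholders.
lam = boto3.client("lambda")

def launch_alignments(split_keys, bucket):
    for key in split_keys:
        lam.invoke(
            FunctionName="bwa-align",          # hypothetical function name
            InvocationType="Event",            # asynchronous fire-and-forget
            Payload=json.dumps({"bucket": bucket, "key": key}),
        )
```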
This script removes all the files generated by the pipeline. It is used between repeated timing runs of runPipeline.sh.
This Python script is executed by each of the Lambda functions. A JSON payload tells the function which file should be aligned.
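A stripped-down version of such a handler might look like the sketch below; the payload fields, paths, and bwa invocation are illustrative rather than the repository's exact code.

```python
import os
import subprocess
import boto3

s3 = boto3.client("s3")

# Illustrative Lambda handler: download the split fastq named in the payload,
# align it with bwa, and push the result back to S3.  The payload fields,
# the locations of the bwa binary and indexed reference, and the output key
# are placeholders, not the repository's exact code.
def lambda_handler(event, context):
    bucket, key = event["bucket"], event["key"]
    local_fastq = os.path.join("/tmp", os.path.basename(key))
    s3.download_file(bucket, key, local_fastq)

    sam_path = local_fastq + ".sam"
    with open(sam_path, "w") as sam:
        subprocess.run(["/tmp/bwa", "mem", "/tmp/reference.fa", local_fastq],
                       stdout=sam, check=True)

    out_key = "aligned/" + os.path.basename(sam_path)
    s3.upload_file(sam_path, bucket, out_key)
    return {"aligned": out_key}
```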