forked from a-slide/pycoQC

Python 3 package for Jupyter Notebook, computing metrics and generating plots from Oxford Nanopore Albacore report


pycoQC 1.1a1 package documentation


pycoQC is a Python 3 package for Jupyter Notebook that computes metrics and generates simple QC plots from the sequencing summary report generated by the Oxford Nanopore Technologies Albacore basecaller.


pycoQC is a very simple quality control package for Nanopore data written in pure Python 3, meant to be used directly in a Jupyter notebook (4.0.0+). As opposed to more exhaustive QC programs for Nanopore data, pycoQC is very fast because it relies entirely on the sequencing_summary.txt file generated by the ONT Albacore Sequencing Pipeline Software (1.2.1+) during base calling. Consequently, pycoQC only provides read-level metrics (and not base-level metrics). The package supports 1D and 1D2 runs analysed with Albacore.

pycoQC requires the following fields in the sequencing_summary.txt file:

  • 1D run => read_id, run_id, channel, start_time, sequence_length_template, mean_qscore_template
  • 1D2 run => read_id, run_id, channel, start_time, sequence_length_2d, mean_qscore_2d

In addition it will try to get the following optional fields if they are available:

  • num_events, calibration_strand_genome_template, passes_filtering
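The required and optional fields listed above can be checked before loading a file. The following is a minimal sketch, not pycoQC's own code: it assumes the summary file is tab-separated with a single header line, and the helper name detect_run_type is hypothetical.

```python
import csv
import io

# Required fields per run type, as listed above
REQUIRED_FIELDS = {
    "1D": {"read_id", "run_id", "channel", "start_time",
           "sequence_length_template", "mean_qscore_template"},
    "1D2": {"read_id", "run_id", "channel", "start_time",
            "sequence_length_2d", "mean_qscore_2d"},
}
OPTIONAL_FIELDS = {"num_events", "calibration_strand_genome_template", "passes_filtering"}


def detect_run_type(handle):
    """Return (run_type, optional_fields_found) from the header of an open file.

    Hypothetical helper: assumes a tab-separated file with one header line.
    Raises ValueError if neither set of required fields is complete.
    """
    header = set(next(csv.reader(handle, delimiter="\t")))
    for run_type, required in REQUIRED_FIELDS.items():
        if required <= header:
            return run_type, sorted(OPTIONAL_FIELDS & header)
    raise ValueError("Header lacks the fields required for a 1D or 1D2 run")


# Usage with a small in-memory header (made-up example, not a real file)
example = io.StringIO(
    "read_id\trun_id\tchannel\tstart_time\t"
    "sequence_length_template\tmean_qscore_template\tnum_events\n"
)
print(detect_run_type(example))
```

With a real file, the same function could be called on open("sequencing_summary.txt", newline="").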

Installation

Ideally, before installation, create a clean python3 virtual environment to deploy the package, using virtualenvwrapper for example (see http://www.simononsoftware.com/virtualenv-tutorial-part-2/).

Required packages:

  • numpy>=1.13.0

  • pandas>=0.20.0

  • matplotlib>=2.0.0

  • seaborn>=0.7.0

  • notebook>=4.0.0
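To see which of the required packages are already available in the current environment, a quick check along the following lines may help. This is only a sketch: the helper name missing_packages is hypothetical, and it tests whether each package can be found, not whether the installed version meets the minimum listed above.

```python
import importlib.util

# Import names of the required packages listed above
REQUIRED_PACKAGES = ["numpy", "pandas", "matplotlib", "seaborn", "notebook"]


def missing_packages(names=REQUIRED_PACKAGES):
    """Return the names that cannot be found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


print("Missing packages:", missing_packages() or "none")
```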

Option 1: Direct installation with pip from github (recommended)

Install the package with pip3. All the required dependencies will be automatically installed.

pip3 install git+https://github.com/a-slide/pycoQC.git

To update the package:

pip3 install git+https://github.com/a-slide/pycoQC.git --upgrade

Option 2: Clone the repository and install locally in develop mode

With this option, the package will be locally installed in “editable” or “develop” mode. This allows the package to be both installed and editable in project form. This is the recommended option if you wish to contribute to the development of the package. As with the previous option, the required dependencies will be automatically installed.

git clone https://github.com/a-slide/pycoQC.git

cd pycoQC

chmod u+x setup.py

pip3 install -e ./

With this option you can also run the testing notebook located in the source directory pycoQC/test_pycoQC.ipynb

Option 3: Local installation without pip (not recommended)

This option is also suitable if you are interested in further developing the package, but it requires a little more hands-on work.

  • Clone the repository locally

git clone https://github.com/a-slide/pycoQC.git

  • Add the package directory (./pycoQC/pycoQC) to your Python 3 module search path, for example through the PYTHONPATH environment variable (how to do this depends on your OS and on whether you want the change to be permanent or not)

  • Install the dependencies (numpy, pandas, matplotlib, seaborn and notebook)

pip3 install numpy pandas matplotlib seaborn notebook

Usage

The package is meant to be used in a jupyter notebook 4.0.0 +

Running jupyter in a virtualenv (optional)

If you installed the package in a virtual environment with virtualenvwrapper, jupyter can run the virtualenv as a kernel as explained here http://help.pythonanywhere.com/pages/IPythonNotebookVirtualenvs

Notebook setup

Launch the notebook in a terminal

jupyter notebook

If it does not automatically launch your web browser, manually open the following URL: http://localhost:8888/tree

From Jupyter home page you can navigate to the directory you want to work in. Then, create a new Python3 Notebook.

In the notebook, import matplotlib and use the jupyter magic command to enable direct plotting in the current Notebook.

Using the svg format as a backend for matplotlib will generate beautiful vector plots, but is CPU/memory hungry, particularly for the 2D scatter plot

import matplotlib.pyplot as pl
%matplotlib inline
%config InlineBackend.figure_format = 'svg'

One can also tweak the pandas output to enlarge the dataframes for the tabular data generated by pycoQC

import pandas as pd
pd.options.display.max_colwidth = 200

Default pylab parameters can be defined at the beginning of the notebook as well (see http://matplotlib.org/users/customizing.html for more options)

pl.rcParams['figure.figsize'] = 20,7
pl.rcParams['font.family'] = 'sans-serif'
pl.rcParams['font.sans-serif'] = ['DejaVu Sans']

General package information

pycoQC is a simple class that is initialized with a sequencing_summary file generated by Albacore 1.2.1 +.

The instantiated object can subsequently be called with various methods that generate tables and plots.

Each function has specific options that are comprehensively detailed in the test notebook provided with the package, or directly on GitHub: Test_notebook

Most of the plotting functions return a matplotlib fig, ax tuple. This allows users to further customize the plotting areas thanks to the numerous set methods associated with the object (for instance Axes.set_axis_off, Axes.set_xlim, Axes.set_xscale...). Extensive information is available in the Matplotlib API documentation: http://matplotlib.org/api/axes_api.html.

All the plotting functions can take a matplotlib "style" option. To list all available styles in your environment, use:

print(pl.style.available)
['seaborn-talk', 'dark_background', 'seaborn-white', 'seaborn', 'seaborn-dark', 'seaborn-whitegrid', 'fivethirtyeight', 'seaborn-notebook', 'seaborn-darkgrid', 'seaborn-dark-palette', 'seaborn-bright', 'Solarize_Light2', 'seaborn-muted', 'seaborn-colorblind', 'grayscale', 'fast', 'seaborn-paper', 'seaborn-pastel', '_classic_test', 'seaborn-poster', 'seaborn-ticks', 'bmh', 'seaborn-deep', 'classic', 'ggplot']

Import the package

from pycoQC.pycoQC import pycoQC

One can also import the jprint and jhelp functions from pycoQC to improve on the default print and help functions in Jupyter.

from pycoQC.pycoQC_fun import jhelp, jprint

jhelp can be used to provide a full description of the pycoQC functions using the full option.

jhelp(pycoQC.reads_qual_bins, full=True)

reads_qual_bins (self, bins=[-1, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40])

Count the number of reads per interval of sequence quality and return a dataframe

  • bins: LIST [Default [-1,0,2,4,6,8,10,12,14,16,18,20,40]]

Limits of the intervals as a list

Or alternatively one can also use the jupyter magic "?"

?pycoQC.reads_qual_bins

Sample test files previously generated by Albacore are provided with the package. They can be listed using the following function:

df = pycoQC.example_data_files()
display(df)
                path                                                description
1D_DNA_1.2.1    /home/aleg/Programming/Python3/pycoQC/pycoQC/d...   Sequencing summary file generated by a 1D_DNA ...
1D_RNA_2.0.1    /home/aleg/Programming/Python3/pycoQC/pycoQC/d...   Sequencing summary file generated by a 1D_RNA ...
1D2_DNA_1.2.1   /home/aleg/Programming/Python3/pycoQC/pycoQC/d...   Sequencing summary file generated by a 1D2_DNA...

I recommend using one of these files to test pycoQC, but you can obviously use your own files instead.

Initialize pycoQC

jhelp (pycoQC.__init__)

__init__ (self, seq_summary_file, run_type='', runid_list=[], filter_zero_len=False, filter_fail=False, filter_calibration=False, verbose=False, **kwargs)

Parse Albacore sequencing_summary.txt file and clean-up the data

Basic initialization

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt", verbose=True)

Importing data

 50000 reads found in initial file

Verify and rearrange fields

 1D Run type

Order run IDs by start time

 Processing reads with Run_ID ad3de3b63de71c4c6d5ea4470a82782cf51210d9

 Processing reads with Run_ID 7082b6727942b3939a023beaf03ef24cec1722e5

Reindex and sort

 50000 Total valid reads found

Initialization with runids reordering

If several runids are present in the file, pycoQC will order the runids based on their order in the file, which does not always correspond to the sequencing order. Unfortunately there is no way to know the right order based on the information contained in the sequencing_summary.txt file alone. However, if you know the order, you can specify it at initialisation (or even exclude specific runids).

runid_list = ["7082b6727942b3939a023beaf03ef24cec1722e5", "ad3de3b63de71c4c6d5ea4470a82782cf51210d9"]
p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt", runid_list=runid_list, verbose=True)

Importing data

 50000 reads found in initial file

Verify and rearrange fields

 1D Run type

Order run IDs by start time

 Processing reads with Run_ID 7082b6727942b3939a023beaf03ef24cec1722e5

 Processing reads with Run_ID ad3de3b63de71c4c6d5ea4470a82782cf51210d9

Reindex and sort

 50000 Total valid reads found
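The ordering behaviour described in this section can be pictured with a short sketch. It is an illustration under assumptions, not pycoQC's implementation: the function name order_run_ids is hypothetical, the default order is taken to be the order of first appearance in the file, and any runid left out of the user-supplied list is simply excluded.

```python
def order_run_ids(run_ids, runid_list=None):
    """Return the run IDs to process, in processing order.

    run_ids: the run_id value of every read, in file order.
    runid_list: optional user-defined order; IDs left out are excluded.
    """
    # dict.fromkeys keeps only the first occurrence of each ID, in file order
    found = list(dict.fromkeys(run_ids))
    if not runid_list:
        return found
    unknown = [rid for rid in runid_list if rid not in found]
    if unknown:
        raise ValueError("Run IDs not found in file: {}".format(unknown))
    return list(runid_list)


# Made-up run IDs for illustration
reads = ["run_A", "run_A", "run_B", "run_A"]
print(order_run_ids(reads))                                # order of first appearance
print(order_run_ids(reads, runid_list=["run_B", "run_A"])) # user-defined order
print(order_run_ids(reads, runid_list=["run_B"]))          # run_A excluded
```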

Initialization with read filtering

Some reads are not "basecallable" and consequently have a length of zero. These reads can be filtered out with the option filter_zero_len.

Starting from Albacore 2.0, ONT introduced additional fields in the sequencing_summary.txt to flag sequences that did not meet the quality requirements and sequences aligned on the internal control. These sequences can be filtered out with the options filter_calibration and filter_fail.

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_RNA_Albacore_2.0.1.txt", filter_calibration=True, filter_fail=True, filter_zero_len=True, verbose=True)

Importing data

 50000 reads found in initial file

Verify and rearrange fields

 1D Run type

Filter out failed reads

 45462 reads discarded

Filter out reads corresponding to the calibration strand

 125 reads discarded

Filter out zero length reads

 0 reads discarded

Order run IDs by start time

 Processing reads with Run_ID 3a0ea63a73db0f9fb611b9da3a37045d249a9be0

 Processing reads with Run_ID 2f4d52a34ec56518aa0d051dc4484c2b454abc6a

 Processing reads with Run_ID e7d9b3c6bb26250ffaf1f8be9d2d1ae0105204b9

 Processing reads with Run_ID f6d788dc15a52f5bbb736aa82c5dee7b9c50d63f

 Processing reads with Run_ID 5db3f3d44b7ce2c468a7d786060fe39e59282240

 Processing reads with Run_ID a175388e5c1ed0e6a78791f120de1c9efcb46b43

 Processing reads with Run_ID b4013533403ec7bbe89d2e9e4021d06c69fe6cf5

 Processing reads with Run_ID 135e6b0c7d4223d4047216f10bede4ca5a84eb28

 Processing reads with Run_ID 7e95428dd57055c0665696cce1bffc73fd5b5d29

 Processing reads with Run_ID aa23fdac499ddcbe80b86a240ee2e803f39d62ea

Reindex and sort

 4413 Total valid reads found
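The three filters can be thought of as successive row selections on the summary table. The sketch below is a hedged illustration rather than pycoQC's code. It assumes that passes_filtering holds the text "True" or "False", and that reads not matching the calibration strand carry one of the placeholder values "filtered_out" or "no_match" in calibration_strand_genome_template. Both value conventions are assumptions to verify against your own file, and the function name filter_reads and its length_field parameter are hypothetical.

```python
def filter_reads(reads, filter_zero_len=False, filter_fail=False,
                 filter_calibration=False,
                 length_field="sequence_length_template"):
    """Return the reads (dicts keyed by field name) that survive the filters.

    A missing optional field is treated as "nothing to filter on".
    """
    kept = []
    for read in reads:
        if filter_zero_len and int(read[length_field]) == 0:
            continue
        # Assumption: passes_filtering holds the text "True" or "False"
        if filter_fail and read.get("passes_filtering", "True") != "True":
            continue
        # Assumption: non-calibration reads carry one of these placeholder values
        calibration = read.get("calibration_strand_genome_template", "no_match")
        if filter_calibration and calibration not in ("filtered_out", "no_match"):
            continue
        kept.append(read)
    return kept


# Made-up rows for illustration
rows = [
    {"read_id": "r1", "sequence_length_template": "0", "passes_filtering": "False",
     "calibration_strand_genome_template": "filtered_out"},
    {"read_id": "r2", "sequence_length_template": "500", "passes_filtering": "True",
     "calibration_strand_genome_template": "no_match"},
    {"read_id": "r3", "sequence_length_template": "3600", "passes_filtering": "True",
     "calibration_strand_genome_template": "control_sequence"},
]
kept = filter_reads(rows, filter_zero_len=True, filter_fail=True, filter_calibration=True)
print([read["read_id"] for read in kept])
```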

Generate an overview of the data

jhelp(pycoQC.overview)

overview (self, cmap='Set3', plot_style='ggplot')

Generate a quick overview of the data (tables + plots)

## You don't need to initialize pycoQC every time. But for this tutorial I will do it to show the output obtained with different example files
p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt")
g = p.overview (cmap='Set3', plot_style='ggplot')

Overall counts

Count
Reads 5.000000e+04
Bases 4.598551e+08
Events 8.422545e+08
Active Channels 5.070000e+02
Run Duration (h) 4.779043e+01
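For readers who want to sanity-check the overall counts by hand, they can be approximated from the raw fields. The sketch below rests on stated assumptions and is not pycoQC's own computation: start_time is taken to be in seconds, the run duration is taken as the span between the earliest and latest read start, and the function name overall_counts is hypothetical. The event count is left out for brevity.

```python
def overall_counts(reads, length_field="sequence_length_template"):
    """Summarise reads (dicts keyed by field name) into a few overall counts."""
    lengths = [int(read[length_field]) for read in reads]
    starts = [float(read["start_time"]) for read in reads]
    return {
        "Reads": len(reads),
        "Bases": sum(lengths),
        "Active Channels": len({read["channel"] for read in reads}),
        # Assumption: start_time is expressed in seconds
        "Run Duration (h)": (max(starts) - min(starts)) / 3600,
    }


# Made-up rows for illustration
rows = [
    {"channel": "1", "start_time": "0", "sequence_length_template": "100"},
    {"channel": "2", "start_time": "7200", "sequence_length_template": "250"},
    {"channel": "1", "start_time": "3600", "sequence_length_template": "50"},
]
print(overall_counts(rows))
```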


Read count per Run ID

reads
ad3de3b63de71c4c6d5ea4470a82782cf51210d9 49603
7082b6727942b3939a023beaf03ef24cec1722e5 397


Distribution of quality scores and read lengths

Quality score distribution Read length distribution
count 50000.000000 50000.000000
mean 11.018961 9197.102300
std 2.093471 12475.543239
min 2.784000 5.000000
10% 7.720000 744.000000
25% 9.546000 2067.000000
50% 11.552000 3516.000000
75% 12.692000 10581.250000
90% 13.316000 28132.200000
max 15.255000 49902.000000


Distributions per run ID (two plots)

Analyse the mean read quality distribution

pycoQC can report the distribution of mean read quality scores as a DataFrame or as a kernel density distribution plot.

reads_qual_bins

jhelp(pycoQC.reads_qual_bins)

reads_qual_bins (self, bins=[-1, 0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40])

Count the number of reads per interval of sequence quality and return a dataframe

## Again, you don't need to initialize pycoQC every time. But if you missed it before, for this tutorial I will do it to show the output obtained with different example files
p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D2_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
p.reads_qual_bins( bins=[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 40])
Count
Sequence quality ranges
(0, 2] 0
(2, 4] 1
(4, 6] 36
(6, 8] 104
(8, 10] 385
(10, 12] 2420
(12, 14] 3606
(14, 16] 2100
(16, 18] 1112
(18, 20] 209
(20, 40] 2
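The intervals in this table are open on the left and closed on the right, so a value equal to a bin limit is counted in the interval that ends at that limit. The following standalone sketch reproduces that counting rule with the standard library. It is an illustration, not the code used by pycoQC, and the function name count_per_interval is hypothetical.

```python
from bisect import bisect_left


def count_per_interval(values, bins):
    """Count values per right-closed interval (bins[i], bins[i + 1]].

    bins must be sorted in increasing order. Values outside
    (bins[0], bins[-1]] are not counted.
    """
    counts = {(low, high): 0 for low, high in zip(bins, bins[1:])}
    for value in values:
        # Index of the first bin limit that is >= value
        i = bisect_left(bins, value)
        if 0 < i < len(bins):
            counts[(bins[i - 1], bins[i])] += 1
    return counts


# Made-up quality scores for illustration
qualities = [0, 1.5, 2, 2.1, 11.5, 40, 41]
for (low, high), count in count_per_interval(qualities, [0, 2, 4, 12, 40]).items():
    print("({}, {}] {}".format(low, high, count))
```

Under this rule a value of exactly 0 is not counted when the first limit is 0, which is presumably why the default bins start at -1.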

reads_qual_distribution

jhelp(pycoQC.reads_qual_distribution)

reads_qual_distribution (self, figsize=[30, 7], color='orangered', alpha=0.5, bandwith=0.1, sample=100000, min_qual=0, max_qual=None, min_freq=0, max_freq=None, plot_style='ggplot', **kwargs)

Plot the univariate kernel density estimate of mean read quality

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D2_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
g = p.reads_qual_distribution (figsize=[15, 4], color='dodgerblue', alpha=0.5, bandwith=0.5, sample=100000, min_qual=4, max_qual=20, plot_style='ggplot', )


Analyse the read length distribution

Similarly pycoQC can also compute the read length distribution as a Dataframe or as a kernel density distribution plot

reads_len_bins

jhelp(pycoQC.reads_len_bins)

reads_len_bins (self, bins=[-1, 0, 25, 50, 100, 500, 1000, 5000, 10000, 100000, 10000000])

Count the number of reads per interval of sequence length and return a dataframe

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
p.reads_len_bins(bins=[0, 10, 25, 50, 100, 500, 1000, 5000, 10000, 100000, 10000000])
Count
Sequence lenght ranges
(0, 10] 27
(10, 25] 50
(25, 50] 65
(50, 100] 152
(100, 500] 2172
(500, 1000] 4705
(1000, 5000] 25188
(5000, 10000] 4705
(10000, 100000] 12936
(100000, 10000000] 0

reads_len_distribution

jhelp(pycoQC.reads_len_distribution)

reads_len_distribution (self, figsize=[30, 7], color='orangered', alpha=0.5, bandwith=None, sample=100000, min_len=0, max_len=None, min_freq=0, max_freq=None, xlog=False, ylog=False, plot_style='ggplot', **kwargs)

Plot the univariate kernel density estimate of read length in base pairs

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_RNA_Albacore_2.0.1.txt", filter_zero_len=True, filter_calibration=True, filter_fail=True)
fig, ax = p.reads_len_distribution(figsize=[15,4], color='green', alpha=0.5, min_len=0, max_len=600, plot_style='ggplot')


p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
fig, ax = p.reads_len_distribution(figsize=[15,4], color='dodgerblue', alpha=0.5, min_len=500, xlog=True, plot_style='seaborn-white')


Generate a 2D distribution of read length and mean quality score

jhelp(pycoQC.reads_len_quality)

reads_len_quality (self, figsize=12, kde=True, scatter=True, margin_plot=True, kde_cmap='copper', scatter_color='orangered', margin_plot_color='orangered', kde_alpha=1, scatter_alpha=0.01, margin_plot_alpha=0.5, sample=100000, kde_levels=10, kde_shade=False, min_len=None, max_len=None, min_qual=None, max_qual=None, plot_style='ggplot', **kwargs)

Draw a bivariate plot of read length vs mean read quality with marginal univariate plots.

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_RNA_Albacore_2.0.1.txt", filter_calibration=True, filter_fail=True, filter_zero_len=True)
g = p.reads_len_quality (figsize=10, kde=True, scatter=True, margin_plot=True, kde_levels=15, min_len=0, max_len=600, min_qual=6.5, max_qual=11, scatter_alpha=0.1)


Analyse the reads/bases/events output over the time of the run

jhelp(pycoQC.output_over_time)

output_over_time (self, level='reads', figsize=[30, 7], runid_lines=True, color='orangered', alpha=0.5, bin_size=240, bin_smothing=3, cumulative=False, sample=100000, plot_style='ggplot', **kwargs)

Plot the output over the time of the experiment at read, base or event level

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
g = p.output_over_time(level='bases', figsize=[15, 4], bin_size=240, bin_smothing=5)


p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D2_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
g = p.output_over_time(level='bases', figsize=[15, 4], color='orangered', cumulative=True)

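Conceptually, an output-over-time plot sums the output of the reads that start within each time window, and the cumulative option accumulates those sums. The sketch below illustrates the idea only. The function name output_per_window and the parameter window_seconds are hypothetical and do not claim to match the meaning of pycoQC's bin_size or bin_smothing options, and start times are assumed to be in seconds.

```python
from itertools import accumulate


def output_per_window(start_times, weights=None, window_seconds=3600,
                      cumulative=False):
    """Sum output per fixed-width time window.

    start_times: read start times, assumed to be in seconds.
    weights: output of each read, e.g. read lengths for base-level output.
             Defaults to 1 per read, which gives read-level output.
    """
    if weights is None:
        weights = [1] * len(start_times)
    origin = min(start_times)
    n_windows = int((max(start_times) - origin) // window_seconds) + 1
    totals = [0] * n_windows
    for start, weight in zip(start_times, weights):
        totals[int((start - origin) // window_seconds)] += weight
    return list(accumulate(totals)) if cumulative else totals


# Made-up start times (seconds) and read lengths for illustration
starts = [0, 10, 3700, 7300, 7400]
lengths = [100, 200, 50, 10, 40]
print(output_per_window(starts))                            # reads per hour
print(output_per_window(starts, lengths))                   # bases per hour
print(output_per_window(starts, lengths, cumulative=True))  # cumulative bases
```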

Analyse the evolution of the mean read quality over the time of the run

jhelp(pycoQC.quality_over_time)

quality_over_time (self, runid_lines=True, figsize=[30, 7], color='orangered', alpha=0.25, win_size=0.25, plot_style='ggplot', **kwargs)

Plot the evolution of the mean read quality over the time of the experiment

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_DNA_Albacore_1.2.1.txt", filter_zero_len=True)
g = p.quality_over_time(figsize=[15, 4], win_size=0.5)


p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_RNA_Albacore_2.0.1.txt", filter_zero_len=True)
g = p.quality_over_time(runid_lines=True, figsize=[15, 4], color='dodgerblue', win_size=0.1, plot_style='seaborn-white')

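One simple way to follow quality over time is a trailing moving average: for each read, average the mean quality of all reads that started within a fixed window before it. The sketch below shows that technique under assumptions. The function name rolling_mean_quality and the parameter window_hours are hypothetical, start times are assumed to be in seconds, and pycoQC's win_size option may be defined differently.

```python
def rolling_mean_quality(start_times, qualities, window_hours=0.25):
    """Return (start_time, trailing mean quality) pairs sorted by start time.

    The mean at each read covers reads whose start time lies within
    window_hours before it, the read itself included.
    """
    pairs = sorted(zip(start_times, qualities))
    window = window_hours * 3600  # assumption: start times are in seconds
    means = []
    left = 0
    running = 0.0
    for right, (time, quality) in enumerate(pairs):
        running += quality
        # Drop reads that started before the beginning of the window
        while pairs[left][0] < time - window:
            running -= pairs[left][1]
            left += 1
        means.append((time, running / (right - left + 1)))
    return means


# Made-up start times (seconds) and mean quality scores for illustration
for time, mean in rolling_mean_quality([0, 600, 1200, 4000], [10, 12, 8, 6]):
    print(time, mean)
```

A wider window gives a smoother curve at the cost of reacting more slowly to changes during the run.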

Overview of the activity of flowcell channels

jhelp(pycoQC.channels_activity)

channels_activity (self, level='reads', figsize=[24, 12], cmap='OrRd', alpha=1, robust=True, annot=True, fmt='d', cbar=False, plot_style='seaborn-white', **kwargs)

Plot the activity of channels at read, base or event level, based on the seaborn heatmap function. The layout does not represent the physical layout of the flowcell.

p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_RNA_Albacore_2.0.1.txt", filter_zero_len=True, filter_fail=True)
g = p.channels_activity(level='reads', figsize=[12,6])


p = pycoQC("/home/aleg/Programming/Python3/pycoQC/pycoQC/data/sequencing_summary_1D_RNA_Albacore_2.0.1.txt", filter_zero_len=True, filter_fail=True)
g = p.channels_activity(level='events', cmap="viridis_r", cbar=True, annot=False, figsize=[15,6])

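The data behind such a heatmap is simply a count per channel arranged in a rectangular grid. The sketch below shows one way to build that grid at read level. It is an illustration, not pycoQC's code: the function name channel_grid is hypothetical, channels are assumed to be numbered from 1 to 512, and the 16 x 32 arrangement is arbitrary and, as noted above, does not represent the physical layout of the flowcell.

```python
from collections import Counter


def channel_grid(channels, n_channels=512, n_cols=32):
    """Return read counts per channel as a list of rows of n_cols values.

    channels: the channel value of every read (text or integer).
    Assumption: channels are numbered from 1 to n_channels.
    """
    counts = Counter(int(channel) for channel in channels)
    flat = [counts.get(channel, 0) for channel in range(1, n_channels + 1)]
    return [flat[i:i + n_cols] for i in range(0, n_channels, n_cols)]


# Made-up channel values for illustration
grid = channel_grid(["1", "1", "33", "512"])
print(len(grid), "rows of", len(grid[0]), "channels")
print("Reads in channel 1:", grid[0][0])
```

A grid of this shape could then be passed to a heatmap function for display.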

Note to power-users and developers

Please be aware that pycoQC is an experimental package that is still under development. It was tested under Linux Ubuntu 16.04 and in an HPC environment running under Red Hat Enterprise 7.1.

You are welcome to contribute by requesting additional functionalities, reporting bugs, or forking the repository and submitting patches or pull requests.

Thank you

Contributors

Jon Sanders Github

Acknowledgments

Thanks to Kim Judge for providing a few example sequencing summary files.
