Update genome_assembly_qc.rst
JonEilers committed Dec 21, 2023
1 parent 0997fb3 commit 580f029
Showing 1 changed file with 1 addition and 3 deletions.
4 changes: 1 addition & 3 deletions docs/source/genome_assembly_qc.rst
@@ -25,11 +25,9 @@ It's important to distinguish between the lack of rigorous methodology and the c
Summary Statistics
-------------------

A standard metric for genome contiguity is the N50 value. N50 is a tricky beast to understand, and I've seen more blogs and descriptions get it wrong than right. Thankfully, `Wikipedia <https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50>`_ gets it right. Without getting into the details, the thing that matters when considering the N50 is that most of the contigs or scaffolds in an assembly, counted individually, will be shorter than the N50 value. If you have an N50 of 9kb, then most of the sequences in the assembly will be scaffolds or contigs shorter than 9kb. Having an N50 of 9kb consequently means gene prediction will likely capture the bulk of the genes, but a large number of genes will be fragmented, especially long ones such as `titin <https://en.wikipedia.org/wiki/Titin>`_. While there are a number of tools for acquiring this metric, probably my favorite way to visualize it is the snail plot produced via `Blobtoolkit <https://www.g3journal.org/content/10/4/1361>`_. `Here <https://blobtoolkit.genomehubs.org/>`_ is a link to their website. See the link below for an example.

The N50 metric is a commonly referenced standard for assessing genome contiguity, but it is often misconstrued as an "average" contig length. `Wikipedia <https://en.wikipedia.org/wiki/N50,_L50,_and_related_statistics#N50>`_ gets it right, but unfortunately the explanation may be a little confusing. To describe N50 succinctly without delving into the details, it is essential to understand that most of the contigs (sequences) in an assembly, counted individually, are typically shorter than the N50 value. The N50 is the length of the shortest sequence in the smallest set of longest sequences that together contain at least 50% (half) of the total assembly length. In other words, at least half of the assembly length is contained in contigs/scaffolds that are as long as or longer than the N50 value, and at most half is contained in sequences shorter than it.

For example, an N50 value of 9kb means that at least half of the assembled genome is in sequences that are 9kb or longer. This metric matters for gene prediction, as a higher N50 often indicates better contiguity, which makes it easier to identify complete genes. However, genes that are longer and more complex, like `titin <https://en.wikipedia.org/wiki/Titin>`_, will likely be fragmented in an assembly with an N50 of this size.
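
The definition can be made concrete with a small calculation. The following is a minimal sketch, not taken from any particular assembly tool. The function name ``n50`` and the toy contig lengths are illustrative assumptions. It sorts the lengths from longest to shortest and returns the length at which the running total first reaches half of the assembly.

```python
def n50(lengths):
    """Return the N50 of a collection of sequence lengths.

    Hypothetical helper for illustration: sort lengths from longest to
    shortest and return the first length at which the running total
    reaches at least half of the summed length.
    """
    if not lengths:
        raise ValueError("need at least one sequence length")
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point halves.
        if 2 * running >= total:
            return length


# Toy assembly (lengths in kb); these values are made up for illustration.
contig_lengths = [2, 3, 4, 5, 6, 7, 8, 9, 10]
value = n50(contig_lengths)
shorter = sum(1 for length in contig_lengths if length < value)
print(f"N50 = {value} kb; {shorter} of {len(contig_lengths)} contigs are shorter")
```

In this toy assembly the three longest contigs (10, 9 and 8 kb) hold half of the 54 kb total, so the N50 is 8 kb, even though six of the nine contigs are shorter than that. This is why N50 should not be read as an average contig length.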

An essential aspect to consider in genome assembly evaluation is the distinction between contig N50 and scaffold N50 metrics. Contigs are contiguous sequences of DNA, while scaffolds represent a higher-order assembly, comprising multiple contigs linked by gaps. These gaps, often represented by a series of 'N' characters in the sequence, indicate unknown or unsequenced regions between contigs. The assembly tool determines the order and orientation of these contigs but leaves the gaps to signify areas of uncertainty.
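
To illustrate the difference, the sketch below splits scaffolds into contigs at runs of ``N`` and compares the two N50 values. It is a rough illustration under stated assumptions: the helper names, the ``min_gap`` parameter, and the toy sequences are all hypothetical, and real tools differ in how long a run of ``N`` must be before it counts as a gap.

```python
import re


def n50(lengths):
    """Return the N50 of a collection of sequence lengths (illustrative helper)."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    raise ValueError("need at least one sequence length")


def split_into_contigs(scaffold, min_gap=1):
    """Split a scaffold into contigs at runs of at least `min_gap` N characters.

    `min_gap` is an assumed, adjustable threshold; it is not a standard value.
    """
    pattern = "[Nn]{%d,}" % min_gap
    return [piece for piece in re.split(pattern, scaffold) if piece]


# Toy scaffolds, made up for illustration: the first holds two contigs
# joined by a gap of five Ns, the second is a single ungapped contig.
scaffolds = ["ACGTACGTAC" + "N" * 5 + "ACGTAC", "ACGT"]

contigs = [c for s in scaffolds for c in split_into_contigs(s)]
scaffold_n50 = n50([len(s) for s in scaffolds])
contig_n50 = n50([len(c) for c in contigs])
print("scaffold N50:", scaffold_n50)
print("contig N50:", contig_n50)
```

Here the scaffold N50 (21) is roughly double the contig N50 (10) for the same underlying sequence, so it helps to check which of the two an assembly report is quoting. Note that scaffold lengths in this sketch include the gap characters.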
