yup
JonEilers committed Jan 9, 2024
1 parent 4df861a commit c7df1cc
Showing 2 changed files with 18 additions and 18 deletions.
26 changes: 13 additions & 13 deletions docs/source/genome_annotation.rst
@@ -24,9 +24,9 @@ In the context of gene prediction and annotation, TEs and REs present both chall

The evolutionary significance of TEs and REs is now increasingly acknowledged. These elements are instrumental in driving genomic and evolutionary processes, contributing to structural genomic variation critical for understanding genome regulation and evolution. They are involved in creating new genes, regulating existing ones, and contributing to genetic innovation and adaptation. This evolving perspective underscores the necessity of including the annotation and study of TEs and REs as a vital component of genomic research, on par with the study of protein-coding regions.

-* :doc:`An Introduction to Repetitive Elements and Transposable Elements </annotation/repetitive_elements>`
-* :doc:`Using TeTools to Identify and Mask Repetitive Elements and Transposons </annotation/tetools>`
-* :doc:`Manual Annotation and Curation of Transposable Elements </annotation/manual_te_annotation>`
+* :doc:`An Introduction to Repetitive Elements and Transposable Elements <>`
+* :doc:`Using TeTools to Identify and Mask Repetitive Elements and Transposons <>`
+* :doc:`Manual Annotation and Curation of Transposable Elements <>`

Expression Data Mapping and Protein Database Alignment
------------------------------------------------------
@@ -37,8 +37,8 @@ In addition to RNA-Seq data, leveraging protein databases is also necessary. The

Moreover, these protein databases often include manually curated gene models. This manual curation involves thorough verification and correction of gene structures based on available evidence, enhancing the accuracy of gene predictions. By integrating RNA-Seq data with information from protein databases, researchers can achieve a more complete and precise understanding of the genomic landscape, even in areas where gene expression data is limited or absent.

-* :doc:`Mapping Gene Expression Data to the Genome Assembly </annotation/rna-seq_mapping>`
-* :doc:`Aligning Protein Databases to the Genome Assembly </annotation/protein_database_alignment>`
+* :doc:`Mapping Gene Expression Data to the Genome Assembly <>`
+* :doc:`Aligning Protein Databases to the Genome Assembly <>`

Gene Model Prediction
---------------------
@@ -47,10 +47,10 @@ Gene prediction, the process of identifying regions of a genome that encode gene

Despite these challenges, automatic methods have become the mainstream approach for initial gene prediction due to the vast size of most genomes and the enormous number of genes they contain. These computational tools provide a foundation for gene annotation but often fall short of the accuracy achieved by manual curation. Manual curation, considered the gold standard, involves a thorough examination of gene expression and protein alignment data to assess and refine gene models. This process is time-consuming and labor-intensive, requiring expert knowledge and a detailed analysis of genomic data. As a result, few projects undertake comprehensive manual curation. However, in recent years, there has been a significant push to manually curate the complete gene sets of specific organisms and particular gene subsets, such as olfactory genes in mice and humans.

-* :doc:`Gene Prediction using Braker <annotation/braker_gene_prediction>`
-* :doc:`Gene Prediction using Maker and Augustus <annotation/maker_gene_prediction>`
-* :doc:`Combining Evidence using EVidenceModeler or TSEBRA <annotation/combining_evidence>`
-* :doc:`Visualizing and Editing Gene Models <annotation/manual_curation>`
+* :doc:`Gene Prediction using Braker <>`
+* :doc:`Gene Prediction using Maker and Augustus <>`
+* :doc:`Combining Evidence using EVidenceModeler or TSEBRA <>`
+* :doc:`Visualizing and Editing Gene Models <>`

Functional Annotation and Analysis
----------------------------------
@@ -65,10 +65,10 @@ In practice, a hierarchical approach is often employed in functional annotation.

The field of functional annotation, while rich in tools and methods, lacks a unified standard, leading to a variety of approaches with varying levels of accuracy and reliability. Researchers must judiciously choose tools and methods, balancing the need for comprehensive annotation with the risk of error. This careful and considered approach to functional annotation is vital for the accurate interpretation of genomic data, ultimately advancing our understanding of biological systems and their implications in health and disease. The continuous development and refinement of annotation tools, coupled with rigorous validation practices, are essential to ensure the reliability and utility of genomic annotations in scientific research.

-* :doc:`Functional Annotation using Gene Ontology <annotation/gene_ontology>`
-* :doc:`Protein Domain Annotation using InterProScan and EggNOG-mapper <annotation/protein_domain_annotation>`
-* :doc:`Ortholog search using Blast and NCBI <annotation/annotation_via_ortholog>`
-* :doc:`Evaluating functional annotations <annotation/functional_evaluation>`
+* :doc:`Functional Annotation using Gene Ontology <>`
+* :doc:`Protein Domain Annotation using InterProScan and EggNOG-mapper <>`
+* :doc:`Ortholog search using Blast and NCBI <>`
+* :doc:`Evaluating functional annotations <>`


Non-Coding RNA
10 changes: 5 additions & 5 deletions docs/source/genome_assembly_qc.rst
@@ -28,7 +28,7 @@ An essential aspect to consider in genome assembly evaluation is the distinction

In summary statistics of genome assemblies, both contig N50 and scaffold N50 values are usually presented. However, the contig N50 metric is particularly crucial, as it provides a more direct measure of the assembly's quality in terms of gene model predictions. A higher contig N50 means that more of the assembly is contained in long, gap-free contigs and, consequently, that genes are more likely to be captured in full. In contrast, scaffold N50, while useful for understanding the broader structure of the genome, may overestimate assembly quality because scaffolds include gap regions. Therefore, contig N50 serves as a more reliable indicator of the potential accuracy and completeness of gene predictions in a given genome assembly.

-* :doc:`Summary Statistics Via a Variety of Tools <assembly_qc/summary-stats>`
+* :doc:`Summary Statistics Via a Variety of Tools <>`
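
To make the contig versus scaffold N50 distinction concrete, here is a minimal sketch, not taken from any particular tool. It assumes that gaps are written as runs of ``N`` characters; the helper names ``n50`` and ``split_into_contigs`` and the toy sequences are hypothetical.

```python
import re


def n50(lengths):
    """Return the N50: the length L such that sequences of length >= L
    together cover at least half of the total assembled bases."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("need at least one sequence length")
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def split_into_contigs(scaffold):
    """Split a scaffold into contigs at runs of N (gap) characters."""
    return [piece for piece in re.split(r"[Nn]+", scaffold) if piece]


# Toy assembly (made-up data): one gapped scaffold and one short, gap-free one.
scaffolds = ["ACGT" * 5 + "N" * 10 + "ACGT" * 2, "ACGTACGT"]
contigs = [c for s in scaffolds for c in split_into_contigs(s)]

scaffold_n50 = n50(len(s) for s in scaffolds)
contig_n50 = n50(len(c) for c in contigs)
print(f"scaffold N50 = {scaffold_n50}, contig N50 = {contig_n50}")
```

In this toy case the gapped scaffold is 38 characters long, 10 of which are gap, so the scaffold N50 (38) is noticeably larger than the contig N50 (20) even though no additional sequence was actually assembled.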

Checking for Complete Conserved Genes
-------------------------------------
@@ -37,7 +37,7 @@ Checking `BUSCO <https://busco.ezlab.org/>`_ (Benchmarking Universal Single-Copy

However, it's important to recognize that BUSCO scores are not representative of a 'best case scenario' but should be considered a targeted, yet somewhat random, sample of the assembly's gene content. This means that while BUSCO scores provide valuable insights into assembly quality, they do not necessarily account for all possible structural gene variations or complexities. BUSCO scores should therefore be interpreted within the broader context of the genome's overall characteristics and other quality metrics. That said, low BUSCO scores are `highly indicative of problems with the genome assembly <https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13364>`_.

-* :doc:`Assembly Quality Assessment using BUSCO Analysis <assembly_qc/assembly_busco>`
+* :doc:`Assembly Quality Assessment using BUSCO Analysis <>`

Assembly Contamination and Quality
----------------------------------
@@ -49,8 +49,8 @@ Additionally, k-mer analysis can be used to filter out potential contamination i

Recent years have also seen the development of quality value (QV) scores, whose definition varies depending on the tool used. `Merqury <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-020-02134-9>`_ calculates a QV score which represents a "log-scaled probability of error per a base in the assembly". In the case of `Inspector <https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02527-4>`_, QV is "calculated based on the identified structural and small-scale errors scaled by the total base pairs of the assemblies". Although the two QV scores are calculated differently, they do correlate with each other.

-* :doc:`Assembly Contamination and Quality Analysis <assembly_qc/contamination>`
-* :doc:`Calculating Genome Assembly Quality Value Scores <assembly_qc/genome_quality>`
+* :doc:`Assembly Contamination and Quality Analysis <>`
+* :doc:`Calculating Genome Assembly Quality Value Scores <>`
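
As a rough illustration of what a "log-scaled probability of error" means, the sketch below converts between a per-base error rate and a QV. It assumes the usual Phred convention, QV = -10 * log10(error rate); how each tool estimates the error rate in the first place differs and is not shown here. The function names are hypothetical.

```python
import math


def qv_from_error_rate(error_rate):
    """Convert a per-base error probability to a Phred-scaled QV
    (assumed convention: QV = -10 * log10(error_rate))."""
    if not 0 < error_rate <= 1:
        raise ValueError("error_rate must be in the interval (0, 1]")
    return -10 * math.log10(error_rate)


def error_rate_from_qv(qv):
    """Invert the Phred scaling: per-base error probability for a given QV."""
    return 10 ** (-qv / 10)


# Translate a few QV values into expected errors per megabase.
for qv in (20, 30, 40, 50):
    errors_per_mb = error_rate_from_qv(qv) * 1_000_000
    print(f"QV {qv}: about {errors_per_mb:g} expected errors per Mb")
```

Under this convention every 10-point increase in QV corresponds to a tenfold drop in the error rate, so QV 30 is one expected error per 1,000 bases and QV 40 is one per 10,000 bases.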


Once you have an assembly that is as good as it'll get, it might be possible to squeeze a little more out of your data using gap closing and polishing tools. However, just like with read trimming, either gap closing or polishing can leave you with an assembly that is worse than what you started with, and overzealous use of either makes this more likely. This is a huge problem when these assemblies are then uploaded to NCBI and used as reference genomes for other projects. Most researchers do not have the skill, knowledge, or time to check that an assembly, or the genes derived from it, are trustworthy, potentially resulting in a lot of frustration and wasted time and money. So proceed with caution.
@@ -63,5 +63,5 @@ Polishing
----------------------------------
Polishing is probably one of the most overlooked and underappreciated steps of genome assembly. As a result, genomes and gene models that appear to be high quality are published with numerous errors. Do not skip this step. Also, keep a list of gene models to check manually for errors, as this will be more revealing than the summary statistics output by the polishing tools. Polishing removes insertions, deletions, and adapter contamination that may have crept into the genome assembly. Examples of what this looks like can be found in the paper `Chasing perfection: validation and polishing strategies for telomere-to-telomere genome assemblies <https://www.nature.com/articles/s41592-022-01440-3>`_. Polishing can be accomplished using either long read or short read data. Short read data has much higher accuracy and as such can correct short sequences. However, long reads can fix longer, structural errors. Assembly polishing is typically an iterative process requiring anywhere from three to six rounds. However, this number is subjective, and the correct number of polishing rounds should be based on manual inspection of genes and sequence alignments. It should also be noted that under- or over-polishing can significantly impact assembly quality. Under-polishing, intuitively, leaves correctable errors in the assembly. On the other hand, some polishing tools such as Racon are notorious for over-polishing: the more rounds are run, the more likely the tool is to start introducing errors into the assembly.

-* :doc:`Genome assembly polishing <assembly_qc/polishing>`
+* :doc:`Genome assembly polishing <>`
