Skip to content

Latest commit

 

History

History
54 lines (38 loc) · 1.93 KB

gtf.rst

File metadata and controls

54 lines (38 loc) · 1.93 KB

GTF files

From the GTF definition:

The following feature types are required: 'CDS', 'start_codon', 'stop_codon'. The features '5UTR', '3UTR', 'inter', 'inter_CNS', 'intron_CNS' and 'exon' are optional. All other features will be ignored.

So genes and transcript are not explicitly defined. The transcript extent, is, after all, implied by all exons with a single transcript_id, and the extent of a gene is implied by all exons with a single gene_id. However, this can be tedious to calculate by hand.

:mod:`gffutils` infers the gene and transcript extents when the file is imported into a database, and adds new "derived" features for each gene and transcript. That way, a gene can be easily accessed by its ID, just like for GFF files.

However, not all files meet the official GTF specifications where each feature has transcript_id to indicate its parent feature and gene_id to indicate its "grandparent" feature. To accommodate this, :mod:`gffutils` provides some extra options for the :func:`gffutils.create_db` function:

transcript_key and gene_key

.. seealso::

    Example that shows the use of `transcript_key`:

    * :ref:`jgi_gff2.txt`

These kwargs are used to extract the parent and grandparent feature respectively.

By default, transcript_key="transcript_id" and gene_key="gene_id". But if your particular data file does not conform to this, then they can be changed.

subfeature

.. seealso::

    Examples that show the use of `subfeature`:

    * :ref:`wormbase_gff2_alt.txt`

Genes and transcripts are inferred from their component "exon" features. If your particular data file does not conform to the GTF standard, you can use the subfeature kwarg to change this. By default, subfeature="exon", but see the example :ref:`wormbase_gff2_alt.txt` for an instance of where this needs to be changed.