GitHub - ridgelab/plasmidCharacterization

Branches Tags

Name		Name	Last commit message	Last commit date
Latest commit History 46 Commits
data		data
doc		doc
scripts		scripts
.gitignore		.gitignore
README		README

Repository files navigation

README
======

This repository contains the scripts from Card, et al. (2019), in press.
If you use this repository, or ideas from it, please both cite our paper in
your work and include the link to this repository. The paper provides some
background and motivation for creating this repository. An overview of the
process employed is recorded with detailed methods in
docs/BioinformaticsMethods.pdf.

It is designed to work by adding GenBank files into the data/original_gb
directory. You will also need to add a Fasta file with of Incompatibility Group
sequences into the directory data/original_incompatibility_groups with the name
incompatibility.fasta. The Fasta file should work as long as it is correctly
formatted. Single- or multi-line sequence entires are fine. No blank lines.
Note that incompatibility.fasta already exists in this directory. This is
because the version we used (the one you'll find in the directory) is no longer
available online. Ours was downloaded on 1 March 2018 from the PlasmidFinder
website. We advise replacing it with either the current version from the
PlasmidFinder website or with a custom file or your own. The
GenBank files should be concatenations of one or more GenBank files. As some
accessions will fit into more than one grouping, it is okay for the GenBank
record for a single accession to occur in more than one file. That record must,
however, be the same in each case if you want things to work right.

All scripts are designed to be run from the main directory (i.e., the parent
directory of data and scripts).

Some information is collected about the GenBank files such as the source
country and sequencing methods. This information is extracted rather simply
when compared to the search method for the primary search strategy described
in the next paragraph.

If you wish to have different key terms, you will need to edit the regular
expressions found in scripts/identifyPlasmidMatches.py. Depending on what you
are searching for, you may need a different search heirarchy altogether.
Presently, Beta-lactamase Special is a subset of Beta-lactamase. It is, in
turn, a subset of Antimicrobial Resistance. Thus, a find in Beta-lactamase
Special will incrememnt the count for itself and its parents. Similarly, DNA
Maintenance Special is a subset of DNA Maintenance. Otherwise, the search
strategy is as follows:
If the CDS search region does not match an ignored term, search it for
each of the following term groupings, stopping as soon as a match is
identified:

-Antimicrobial Resistance (with its subgroups)
-Toxin/Antitoxin System
-DNA Maintenance (with its subgroup)
-Mobile Genetic Elements
-Hypothetical Genes

If no match is found, assign it to an "Other" category.

If you wish to change the percent identity or coverage cutoffs for assigning
incompatibility groups, you may change the appropriate AWK and Python scripts.

You will need to alter the path to the NCBI BLAST+ Suite in scripts/blast.sh,
scripts/blastPlasmid.sh, scripts/02-makeIncompatibilityBlastDB.sh, and
scripts/17-makePlasmidBlastDB.sh.

GNU sed is used in some scripts. If your system used BSD sed by default, you'll
need to either make GNU sed your default or alter the scripts to use GNU sed.
Likewise, the same thing will need to be done for AWK and find. CAM (Miller, et
al. 2019, PeerJ #6984) is needed to create the Newick tree. It can be downloaded on GitHub
at https://github.com/ridgelab/cam. We assume the CAM script makeNewick.py is
in your PATH. Modify scripts/26-createDistTree.sh if needed to specify the path
to the script.