-
Notifications
You must be signed in to change notification settings - Fork 0
ridgelab/plasmidCharacterization
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
README ====== This repository contains the scripts from Card, et al. (2019), in press. If you use this repository, or ideas from it, please both cite our paper in your work and include the link to this repository. The paper provides some background and motivation for creating this repository. An overview of the process employed is recorded with detailed methods in docs/BioinformaticsMethods.pdf. It is designed to work by adding GenBank files into the data/original_gb directory. You will also need to add a Fasta file with of Incompatibility Group sequences into the directory data/original_incompatibility_groups with the name incompatibility.fasta. The Fasta file should work as long as it is correctly formatted. Single- or multi-line sequence entires are fine. No blank lines. Note that incompatibility.fasta already exists in this directory. This is because the version we used (the one you'll find in the directory) is no longer available online. Ours was downloaded on 1 March 2018 from the PlasmidFinder website. We advise replacing it with either the current version from the PlasmidFinder website or with a custom file or your own. The GenBank files should be concatenations of one or more GenBank files. As some accessions will fit into more than one grouping, it is okay for the GenBank record for a single accession to occur in more than one file. That record must, however, be the same in each case if you want things to work right. All scripts are designed to be run from the main directory (i.e., the parent directory of data and scripts). Some information is collected about the GenBank files such as the source country and sequencing methods. This information is extracted rather simply when compared to the search method for the primary search strategy described in the next paragraph. If you wish to have different key terms, you will need to edit the regular expressions found in scripts/identifyPlasmidMatches.py. Depending on what you are searching for, you may need a different search heirarchy altogether. Presently, Beta-lactamase Special is a subset of Beta-lactamase. It is, in turn, a subset of Antimicrobial Resistance. Thus, a find in Beta-lactamase Special will incrememnt the count for itself and its parents. Similarly, DNA Maintenance Special is a subset of DNA Maintenance. Otherwise, the search strategy is as follows: If the CDS search region does not match an ignored term, search it for each of the following term groupings, stopping as soon as a match is identified: -Antimicrobial Resistance (with its subgroups) -Toxin/Antitoxin System -DNA Maintenance (with its subgroup) -Mobile Genetic Elements -Hypothetical Genes If no match is found, assign it to an "Other" category. If you wish to change the percent identity or coverage cutoffs for assigning incompatibility groups, you may change the appropriate AWK and Python scripts. You will need to alter the path to the NCBI BLAST+ Suite in scripts/blast.sh, scripts/blastPlasmid.sh, scripts/02-makeIncompatibilityBlastDB.sh, and scripts/17-makePlasmidBlastDB.sh. GNU sed is used in some scripts. If your system used BSD sed by default, you'll need to either make GNU sed your default or alter the scripts to use GNU sed. Likewise, the same thing will need to be done for AWK and find. CAM (Miller, et al. 2019, PeerJ #6984) is needed to create the Newick tree. It can be downloaded on GitHub at https://github.com/ridgelab/cam. We assume the CAM script makeNewick.py is in your PATH. Modify scripts/26-createDistTree.sh if needed to specify the path to the script.
About
No description, website, or topics provided.
Resources
Stars
Watchers
Forks
Packages 0
No packages published