This repository contains the code and data for extracting Gateway sequence sites from AddGene and SnapGene plasmids, and using them to:
- Extract consensus sequences from the Gateway sites.
- Create a web application to visualize the Gateway plasmids, which can be accessed at https://gatewaymine.netlify.app/.
---
config:
layout: elk
---
flowchart LR
SnapGene ==> Plasmids[~14k<br>plasmids]
AddGene ==> Plasmids
Plasmids ==> Sites[extracted<br>att sites]
Sites ==> CombinatorialSites[combinatorial<br>att sites]
Plasmids ==> SequenceFeatures[extracted<br>sequence features]
CombinatorialSites ==> Alignments[aligned<br>att sites]
CombinatorialSites ==> GatewayMine[GatewayMine]
SequenceFeatures ==> GatewayMine
Alignments ==> Consensus[consensus<br>sequences]
The scripts in this repository were used to download plasmids from AddGene and to copy the Gateway plasmids from the SnapGene collection (present in the SnapGene installation folder), creating a collection of Gateway plasmids (attached as a release artifact).
Additional information about AddGene plasmids was also mined, such as whether they are part of kits, and related publication links.
The plasmid files were read, and if their annotation contained Gateway sequence sites (identified by a label attXn, where X is the att site and n is the version), the sequence of those sites was extracted. The file results/plasmid_site_dict.json contains a dictionary of all the plasmids in the collection, with the Gateway sequence sites they contain.
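As a rough illustration of the label convention described above, the sketch below parses feature labels of the form attXn. The regular expression, the assumption that X is one of B, P, L or R, and the `parse_att_label` helper are made up for this example; they are not taken from the repository's scripts, and real annotations may use other spellings.

```python
import re
from typing import Optional, Tuple

# Assumed label format: "att" + site type (B, P, L or R) + version number.
ATT_LABEL = re.compile(r"^att([BPLR])(\d+)$")


def parse_att_label(label: str) -> Optional[Tuple[str, int]]:
    """Return (site type, version) for labels like 'attB1', else None."""
    match = ATT_LABEL.match(label.strip())
    if match is None:
        return None
    return match.group(1), int(match.group(2))


# Hypothetical feature labels, as they might appear in a plasmid annotation.
labels = ["attB1", "attR2", "AmpR", "attL4"]
parsed = {label: parse_att_label(label) for label in labels}
```

Labels that do not follow the convention (such as `AmpR` here) map to `None` and would simply be skipped.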
This file was then used to generate a collection of all versions of each Gateway sequence site (e.g. all the versions of attP1, attB1, etc.), contained in the file results/att_sites.json. Because sites can be recombined with each other (e.g. attP1 + attB1 -> attL1), the sites found in plasmids were recombined in all possible combinations to yield an even larger collection of sites. This is contained in the file results/att_sites_combinatorial.json. These two files were used for alignment in the next step.
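The combinatorial step can be pictured as swapping the arms that flank a shared overlap. The following is a minimal sketch under strong assumptions: both sites are given in the same orientation, the overlap is an exact substring of each, and the sequences are toy strings rather than real att sites. The function names are hypothetical, and the sketch does not model which product corresponds to which site type.

```python
from itertools import product
from typing import Iterable, List, Tuple


def recombine(site_a: str, site_b: str, overlap: str) -> Tuple[str, str]:
    """Swap the arms flanking a shared overlap between two sites."""
    pos_a = site_a.find(overlap)
    pos_b = site_b.find(overlap)
    if pos_a < 0 or pos_b < 0:
        raise ValueError("overlap not found in both sites")
    left_a, right_a = site_a[:pos_a], site_a[pos_a + len(overlap):]
    left_b, right_b = site_b[:pos_b], site_b[pos_b + len(overlap):]
    # Each product keeps the left arm of one site and the right arm of the other.
    return left_a + overlap + right_b, left_b + overlap + right_a


def all_recombinations(sites_a: Iterable[str], sites_b: Iterable[str], overlap: str) -> List[str]:
    """Recombine every pair of sites and return the unique products, sorted."""
    products = set()
    for a, b in product(sites_a, sites_b):
        products.update(recombine(a, b, overlap))
    return sorted(products)


# Toy sequences (not real att sites) sharing the overlap "GTACAAA".
toy_b = ["AAAGTACAAACCC"]
toy_p = ["TTTGTACAAAGGG", "TTAGTACAAAGGA"]
combined = all_recombinations(toy_b, toy_p, "GTACAAA")
```

Collecting the products in a set mirrors the idea of building a larger, de-duplicated collection of sites from every possible pairing.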
In addition, the mined plasmid information was used to create a summary dataset, contained in the file results/plasmid_features.json. This file contains a dictionary of all the plasmids in the collection, listing:
- Their name
- The Gateway sequence sites they contain
- The sequence features they contain
- Whether they were extracted from SnapGene or AddGene
- If they are from AddGene:
  - Their AddGene ID
  - Kit to which they belong (if any)
  - Publication links (if any)
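To give a feel for how such a summary could be queried, here is a small sketch. The key names (`name`, `att_sites`, `source`), the plasmid names, and the `plasmids_with_sites` helper are all hypothetical; the actual schema of results/plasmid_features.json may differ.

```python
from typing import Dict, Iterable, List

# Hypothetical records: the key names and the plasmids themselves are
# illustrative assumptions, not the real schema of the results file.
plasmids: Dict[str, dict] = {
    "example_1": {"name": "pExampleEntry", "att_sites": ["attL1", "attL2"], "source": "addgene"},
    "example_2": {"name": "pExampleDest", "att_sites": ["attR1", "attR2"], "source": "snapgene"},
    "example_3": {"name": "pExampleDonor", "att_sites": ["attP1", "attP2"], "source": "addgene"},
}


def plasmids_with_sites(records: Dict[str, dict], required: Iterable[str]) -> List[str]:
    """Return the names of plasmids that contain every required att site."""
    wanted = set(required)
    return sorted(
        record["name"]
        for record in records.values()
        if wanted <= set(record["att_sites"])
    )


entry_like = plasmids_with_sites(plasmids, ["attL1", "attL2"])
```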
This dataset was made queryable as a web application, whose source code is in the web_app folder.
To generate the consensus sequences, the Gateway sequence sites were aligned using Clustal Omega. The alignment files are contained in the folders alignments (generated only with the sites found in plasmids) and alignments_combinatorial (also using the combinatorial sites).
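For orientation, a Clustal Omega call from Python might look like the sketch below. The `-i`, `-o`, `--outfmt` and `--force` options are standard clustalo flags, but the output format, the file names, and the helper functions are assumptions for this example and may not match what the repository's scripts actually do.

```python
import subprocess
from pathlib import Path
from typing import List


def clustalo_command(fasta_in: Path, aln_out: Path, binary: str = "./clustalo") -> List[str]:
    """Build the argument list for a Clustal Omega run."""
    return [
        binary,
        "-i", str(fasta_in),   # unaligned sequences in FASTA format
        "-o", str(aln_out),    # where to write the alignment
        "--outfmt=clu",        # Clustal output format (an assumption here)
        "--force",             # overwrite the output file if it exists
    ]


def run_alignment(fasta_in: Path, aln_out: Path, binary: str = "./clustalo") -> None:
    """Run Clustal Omega and raise if it exits with an error."""
    subprocess.run(clustalo_command(fasta_in, aln_out, binary), check=True)


# Hypothetical file names; nothing is executed until run_alignment is called.
command = clustalo_command(Path("attB1.fasta"), Path("attB1.aln"))
```

Separating command construction from execution keeps the sketch inspectable without the binary installed.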
From an alignment, consensus sequences were generated by removing flanking positions containing spacers (-) and/or all four ACGT nucleotides, and then using ambiguous nucleotides to represent the remaining positions, as in the dummy example below:
seq1       TGCTAATA
seq2       -GCTCCTT
seq3       TGCTCCCG
seq4       TGCTGACC
consensus   GCTVMY
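The dummy example above can be reproduced with a short sketch. It trims flanking columns that contain a spacer or all four nucleotides, then maps each remaining column to its IUPAC ambiguity code. This is an illustration, not the repository's implementation; in particular, writing internal spacer columns as N is an assumption of this sketch.

```python
from typing import FrozenSet, List

# IUPAC ambiguity codes keyed by the set of nucleotides they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C", frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S", frozenset("AT"): "W",
    frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D", frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def uninformative(column: FrozenSet[str]) -> bool:
    """A column is uninformative if it has a spacer or all four nucleotides."""
    return "-" in column or column == frozenset("ACGT")


def consensus(aligned: List[str]) -> str:
    """Trim uninformative flanks, then encode each column with IUPAC codes."""
    columns = [frozenset(col) for col in zip(*(seq.upper() for seq in aligned))]
    start, end = 0, len(columns)
    while start < end and uninformative(columns[start]):
        start += 1
    while end > start and uninformative(columns[end - 1]):
        end -= 1
    # Assumption: internal columns with spacers or unknown symbols become N.
    return "".join(IUPAC.get(col, "N") for col in columns[start:end])


result = consensus(["TGCTAATA", "-GCTCCTT", "TGCTCCCG", "TGCTGACC"])
# result == "GCTVMY"
```

The first column is dropped because seq2 has a spacer there, and the last because it contains all four nucleotides.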
The consensus sequences are contained in the files results/consensus_sites.tsv and results/consensus_sites_combinatorial.tsv, which were created from only the sites found in plasmids and from those plus the combinatorial sites, respectively.
In addition, even more permissive consensus sequences were generated. What gives specificity to the recombination of a pair of sites is the overlap sequence, which is conserved in all attB, P, L and R sites with the same number (e.g. all attX1 sites contain twtGTACAAAaaa as the overlap sequence). An alignment of all sites of a given type (e.g. all attB sites), excluding the overlap sequence, was used to generate more permissive consensus sequences. Those consensus sequences are in the files mentioned above, but with merged_ prefixed to the site name.
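Consensus strings such as the overlap quoted above mix fixed bases with ambiguity codes (w stands for A or T). One way to use them for searching is to translate them into a regular expression, as in this sketch. Treating lowercase and uppercase letters identically is an assumption made here, and the query sequence is a made-up string.

```python
import re

# Regular-expression fragment for each IUPAC nucleotide code.
IUPAC_REGEX = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]",
    "B": "[CGT]", "D": "[AGT]", "H": "[ACT]", "V": "[ACG]",
    "N": "[ACGT]",
}


def iupac_to_regex(consensus: str) -> re.Pattern:
    """Compile an IUPAC consensus into a pattern over uppercase DNA."""
    # Assumption: letter case in the consensus carries no meaning for matching.
    return re.compile("".join(IUPAC_REGEX[base] for base in consensus.upper()))


overlap = iupac_to_regex("twtGTACAAAaaa")
hit = overlap.search("CCTATGTACAAAAAAGG")   # made-up sequence containing the motif
miss = overlap.search("CCTATGTACAAGAAAGG")  # single mismatch inside the motif
```

Here `hit` is a match object starting at index 2, while `miss` is `None`.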
The analysis is run with Python, using Poetry to manage dependencies.
# Install dependencies
poetry install
# Activate virtual environment
poetry shell
clustalo is used for alignment, and can be downloaded as a binary from the Clustal Omega website. Once downloaded, rename it to clustalo and place it in the root folder of the repository. If you want to provide an alternative path, you can do so with script arguments (see script docs).
The pipeline is described in the following diagram:
---
config:
layout: elk
---
flowchart TD
subgraph DataMining["Data Mining"]
snapgene_application{{SnapGene Folder}} ==> snapgene_script
addgene{{AddGene}} ==> get_addgene_kits_info
addgene ==> get_all_gateway_plasmids
addgene ==> get_addgene_article_refs
addgene ==> get_other_plasmids
snapgene_script[get_snapgene_files.py] ==> snapgene_plasmids([data/snapgene_plasmids/*.dna])
get_addgene_kits_info[get_addgene_kits_info.py] ==> addgene_kits([data/addgene_kit_plasmids.json])
get_all_gateway_plasmids[get_all_gateway_plasmids.py] ==> all_gateway_plasmids([data/all_gateway_plasmids.tsv])
addgene_kits ==> get_addgene_kit_plasmids[get_addgene_kit_plasmids.py]
get_addgene_kit_plasmids ==> addgene_plasmids([data/addgene_plasmids/*.dna])
all_gateway_plasmids ==> get_other_plasmids[get_other_plasmids.py]
get_other_plasmids ==> addgene_plasmids
get_addgene_article_refs[get_addgene_article_refs.py] ==> addgene_article_refs([data/addgene_article_refs.tsv])
addgene_plasmids ==> plasmid_collection[(plasmid_collection)]
snapgene_plasmids ==> plasmid_collection
end
subgraph Formatting
all_gateway_plasmids ==> make_plasmid_summary[make_plasmid_summary.py]
plasmid_collection ==> make_plasmid_summary
addgene_article_refs ==> make_plasmid_summary
addgene_kits ==> make_plasmid_summary
make_plasmid_summary ==> plasmid_summary([results/plasmid_summary.json])
plasmid_collection ==> make_plasmid_site_dict
make_plasmid_site_dict[make_plasmid_site_dict.py] ==> plasmid_site_dict([results/plasmid_site_dict.json])
plasmid_site_dict ==> make_feature_dict
plasmid_summary ==> make_feature_dict
make_feature_dict[make_feature_dict.py] ==> feature_dict[(results/feature_dict.json)]
plasmid_site_dict ==> make_unique_sites
make_unique_sites[make_unique_sites.py] ==> att_sites([results/att_sites.json])
att_sites ==> make_combinatorial_att_sites
make_combinatorial_att_sites[make_combinatorial_att_sites.py] ==> att_sites_combinatorial([results/att_sites_combinatorial.json])
make_combinatorial_att_sites ==> combinatorial_att_sites_only([results/att_sites_combinatorial_only.json])
end
subgraph AlignmentAndConsensus["Alignment and Consensus"]
att_sites ==> make_alignments
att_sites_combinatorial ==> make_alignments
make_alignments[make_alignments.py] ==> alignments([results/alignment/*])
make_alignments ==> alignments_combinatorial([results/alignment_combinatorial/*])
alignments ==> make_consensus_sites
alignments_combinatorial ==> make_consensus_sites
make_consensus_sites[make_consensus_sites.py] ==> consensus_sites[(results/consensus_sites.tsv)]
make_consensus_sites ==> consensus_sites_combinatorial[(results/consensus_sites_combinatorial.tsv)]
alignments ==> make_logos
alignments_combinatorial ==> make_logos
make_logos[make_logos.py] ==> logos([results/alignment/*.svg])
make_logos ==> logos_combinatorial([results/alignment_combinatorial/*.svg])
end
feature_dict ==> GateWayMine{{GatewayMine}}
get_addgene_kits_info:::Sky
get_addgene_kit_plasmids:::Sky
get_addgene_article_refs:::Sky
get_all_gateway_plasmids:::Sky
get_other_plasmids:::Sky
snapgene_script:::Sky
make_plasmid_summary:::Sky
make_feature_dict:::Sky
make_unique_sites:::Sky
make_combinatorial_att_sites:::Sky
make_alignments:::Sky
make_consensus_sites:::Sky
make_plasmid_site_dict:::Sky
make_logos:::Sky
att_sites:::Lavender
att_sites_combinatorial:::Lavender
combinatorial_att_sites_only:::Lavender
plasmid_site_dict:::Lavender
feature_dict:::Pine
consensus_sites:::Pine
consensus_sites_combinatorial:::Pine
plasmid_summary:::Lavender
alignments:::Lavender
alignments_combinatorial:::Lavender
addgene_article_refs:::Lavender
addgene_kits:::Lavender
all_gateway_plasmids:::Lavender
logos:::Lavender
logos_combinatorial:::Lavender
GateWayMine:::Peacock
classDef Sky stroke-width:1px, stroke-dasharray:none, stroke:#374D7C, fill:#E2EBFF, color:#374D7C
classDef Pine stroke-width:1px, stroke-dasharray:none, stroke:#254336, fill:#27654A, color:#FFFFFF
classDef Lavender stroke-width:1px, stroke-dasharray:none, stroke:#8E6C9E, fill:#8E6C9E, color:#FFFFFF
classDef Peacock stroke-width:1px, stroke-dasharray:none, stroke:#006666, fill:#006666, color:#FFFFFF
style DataMining color:#000000,fill:#E2EBFF90
style Formatting color:#000000,fill:#D1FFE290
style AlignmentAndConsensus color:#000000,fill:#FFE5F790
To run locally, first download the plasmid collection from the latest release of this repository, and place the folders addgene_plasmids and snapgene_plasmids in the data folder.
If you want to re-download the plasmids (to reproduce the data mining), you can do so with the bash script run_data_mining.sh; see the documentation of the scripts it calls.
playwright is used for scraping. If it is used for the first time, you will be prompted to run playwright install.
For the formatting pipeline, see the documentation of the scripts called in run_formatting.sh.
For the alignment and consensus pipeline, see the documentation of the scripts called in run_alignments_and_consensus.sh.
The web application is a simple React application built with Vite. It was generated with yarn create vite (see docs), so the yarn package manager is required. The directory structure is standard, and is documented in the Vite docs.
# Enable yarn 3
corepack enable
# Install dependencies
yarn install
# Run dev server
yarn dev
# Build for production
yarn build
The only extra configuration is copying the plasmid_features.json file to the public folder when building or serving locally, so that it can be requested by the frontend application; see the config at web_app/vite.config.js.
NOTE: Make sure the plasmid is not already there!
To add new plasmids from AddGene, add a row to the file data/all_gateway_plasmids.tsv with the following columns:
- plasmid_id: The plasmid ID from AddGene
- plasmid_name: The name of the plasmid
- reference: (optional) The reference article ID for the plasmid (this is the number at the end of the article URL on the AddGene page). For example, if the publication links to https://www.addgene.org/browse/article/7274/, the reference is 7274.
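Adding the row can also be scripted. The sketch below appends a row only if the plasmid ID is not already listed, and extracts the reference number from an article URL. It assumes the file is tab-separated, starts with a header row naming the three columns above, and ends with a newline; the helper names are hypothetical and not part of the repository.

```python
import csv
import re
from pathlib import Path
from typing import Optional

COLUMNS = ["plasmid_id", "plasmid_name", "reference"]


def article_ref_from_url(url: str) -> Optional[str]:
    """Return the number at the end of an AddGene article URL, if present."""
    match = re.search(r"/browse/article/(\d+)/?$", url)
    return match.group(1) if match else None


def add_plasmid(tsv_path: Path, plasmid_id: str, plasmid_name: str, reference: str = "") -> bool:
    """Append a row unless plasmid_id is already listed. Return True if added."""
    # Assumption: the first line of the file is a header with the COLUMNS names.
    with tsv_path.open(newline="") as handle:
        existing = {row["plasmid_id"] for row in csv.DictReader(handle, delimiter="\t")}
    if plasmid_id in existing:
        return False
    with tsv_path.open("a", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t", lineterminator="\n")
        writer.writerow(
            {"plasmid_id": plasmid_id, "plasmid_name": plasmid_name, "reference": reference}
        )
    return True


reference = article_ref_from_url("https://www.addgene.org/browse/article/7274/")
```

A call such as `add_plasmid(Path("data/all_gateway_plasmids.tsv"), "12345", "pExample", reference)` (with a made-up ID and name) returns `False` and leaves the file untouched when the ID is already present.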
Once you have done this:
# Get publication links (if not already present)
python get_addgene_article_refs.py
# Download the new plasmid
python get_other_plasmids.py
# Run the formatting pipeline
bash run_formatting.sh
# Run the alignment and consensus pipeline
bash run_alignments_and_consensus.sh
# Re-build the web app
This could be extended to support other plasmid sources, but for now it only supports the SnapGene and AddGene plasmid collections. Feel free to submit an issue to discuss it!