Skip to content

Suporting code for manuscript: "Beyond to the stable: role of the insertion sequences as epidemiological descriptors in Corynebacterium striatum"

Notifications You must be signed in to change notification settings

Leytoncito/IS_Prokka_Panaroo

Repository files navigation

Supporting code for manuscript: "Beyond to the stable: role of the insertion sequences as epidemiological descriptors in Corynebacterium striatum"

1. Assemblies recovery of C. striatum genomes.

NCBI provides command-line tools that allow you to download genomes from databases, including GenBank. For this study we used the following:

./datasets download genome taxon "Corynebacterium striatum" --exclude-gff3 --exclude-protein --exclude-rna --assembly-source genbank

Then, to sort the files.

unzip *.zip

for f in ./ncbi_dataset/data/*; do
   if [ -d "$f" ]; then
      echo "$f"
      for i in ./${f%}/*.fna; do
          mv $i ${f%}.fna
      done
  fi
 done

The metadata was retrieved from NCBI pathogens, then with R code, we enter the information for each previously downloaded genome.

for f in *.fna; do echo ${f%%.fna}; done > genomas.txt
R
t1<-read.table("genomas.txt", header = FALSE, sep = "\t")
t2<-read.csv("isolates.csv", sep = ",")
t2$Assembly=as.factor(t2$Assembly)
t1$V1=as.factor(t1$V1)
t3<-merge(t1, t2, by, by.x = "V1", by.y = "Assembly", all.x=TRUE, sort=FALSE)
write.csv(t3, "metadatos.csv")

2. Study of insertion sequences using Prokka and Panaroo.

Our aproach cosidered to use Prokka for the annotation.

for file in *.fna; do
    ../../benjamin_leyton/prokka/bin/prokka $file --cpus 0 --kingdom Bacteria --proteins --prefix ${file%%.fna} --locustag ${file%%.fna} --centre Bioren --compliant --addgenes --outdir ${file%%.fna};
done

To rebuild the pangenome we use Panaro.

panaroo-qc -t 8 --graph_type all -i *.gff --ref_db ../../benjamin_leyton/refseq.genomes.k21s1000.msh -o panaroo-qc # for QC

nohup panaroo -i ./gff/*.gff -o ./panaroo/ --clean-mode moderate -a core -- core_threshold 0.95 -t 8 -f 0.6 --refind_prop_match 0.5 --search_radius 6000 > panaroo.log &

We then got a subset of all transposases related with IS of Panroo's results with a R code:

library(xlsx)
panaroo_striatum <- read.csv(file = "gene_presence_absence_roary.csv")
filas <- grep("transposase", panaroo_striatum$Annotation, ignore.case = TRUE)
IS_striatum <- panaroo_diphtheriae[c(filas_diph),]
write.xlsx(IS_striatum, "IS_diphtheriae.xlsx")

Finally, to create the matrix of presence and abundance of IS (transposases), it is enough to "count" the number of loci in the Exel's cells. A Excel formula like this will work:

=SI(ESBLANCO(H93);"";LARGO(H93)-LARGO(SUSTITUIR(H93;" ";""))+1)

=IF(ISBLANK(H93);"";LARGE(H93)-LARGE(SUBSTITUTE(H93;""; ""))+1)

3. Classify the studied genomes into lineages

SNP-based alignment was performed using Parsnp. Pasnp makes a global alignment of the core genome and identifies the SNPs

parsnp -d ./genomas/fna/ -r ./referencia/GCA_002803965.1.fna -c -p 8 -o ./parsnp/

then we transform the output of parsnp to a conventional alignment output.

harvesttools -i parsnp.ggr -M parsnp.aln

to build a tree based on the core alignment:

iqtree -s parsnp.aln -m GTR -pre arbol -bb 1000 -nt auto

To lineages detection RhierBAPS ( hierarchical Bayesian Analysis of Population Structure) was used.

library(ggtree)
library(phytools)
library(ape)
library(xlsx)

set.seed(1234)
fasta.file.name <- "parsnp.aln"
snp.matrix <- load_fasta(fasta.file.name)
hb.results <- hierBAPS(snp.matrix, max.depth = 2, n.pops = 20, quiet = TRUE)
#hb.results  <- hierBAPS ( snp.matrix , max.depth  =  2 , n.pops  =  20 , n.extra.rounds  =  Inf , 
     #quiet  =  TRUE )
head(hb.results$partition.df)
newick.file.name  <- arbol.tree
iqtree  <-  phytools :: read.newick ( newick.file.name )
pdf("rhierbaps_results_circular.pdf")
gg  <- ggtree ( iqtree , layout  =  " circular " )
gg  <-  gg % < + % hb.results $ partición.df 
gg  <-  gg  + geom_tippoint (aes ( color  =  factor ( `level 1` )))
gg
dev.off()
write.xlsx(hb.results$partition.df, "levels.xlsx")

This generates two outputs, 1) a colorful tree with hierarchical levels and 2) an excel table with the hierarchical level for each genome studied.

4. Recombinacion/Mutacion.

Recombination / mutation analyzes were carried out using Gubbins and ClonalFrame.

run_gubbins parsnp.aln

ClonalFrameML arbol.tree parsnp.aln output_CFML

5. Constrained PCoA analysis

To find and explore the relationships of the characteristics of C. striatum with ISs, CPCoA was used.

We define the lineages predicted by RhierBAPS as a restrictive variable. Our restricted PCoA analysis was validated using the permanova test. Permanova (Permutacional multivariable analysis of variance) is a non-parametric test based on dissimilarities. Permanova has the null hypothesis that groups do not differ in spread or position in multivariable space. In this study we did a permanova test with 10,000 permutations.

You can download the Rmd and metadata here and run it in Rstudio.

6. Random Forest

As we mentioned in the manuscript Random Forest has the advantage of incorporating a vast and diverse number of data to predict characteristics. We define the response variable as the number of ARGs, lineages and country. Then we defend insertion sequences as predictor variables. We rank the variables according to their importance in the classification and then we graph the 5% of the most important variables for the classification of characteristics.

We use the kappa concordance test to measure the performance of the classification, and Cross validation to evaluate the Random Forest model. You can see the code here

About

Suporting code for manuscript: "Beyond to the stable: role of the insertion sequences as epidemiological descriptors in Corynebacterium striatum"

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published