Skip to content

Latest commit

 

History

History
 
 

simulate_pangenomes

Incompleteness simulations

Workflow for running incompletness simulations

Step 1: remove sequences from assemblies

Create a list of files to remove sequences from using the command ls -d -1 $PWD/* > input.txt, where each line is the full path to an individual assembly

python remove_sequence.py --input input.txt --dist completeness_distribution.txt --prop-complete 0 --avg-breaks 1 --outdir augmented_genomes

Step 2: Run Celebrimbor

Alter config.yaml to run Celebrimbor using your desired settings. This can also be done from the command line as below, where each of the config parameters can be set manually.

snakemake --cores 16 --config cgt_breaks=0.05,0.05 cgt_error=0.05 output_dir=/path/to/output/dir genome_fasta=/path/to/assembly/directory clustering_method=mmseqs2 

Run this command on data both pre- and post-sequence removal.

Step 3a: Compare pre- and post-sequence removal results (mmseqs2)

For Celebrimbor run with mmseqs2, first get a 1:1 mapping between gene sequences pre- and post-sequence removal (assuming run in pre_sim and post_sim directories).

cat pre_sim/mmseqs_all_seqs.fasta post_sim/mmseqs_all_seqs.fasta > mmseqs_all_seqs_concat.fasta
mmseqs easy-cluster mmseqs_all_seqs_concat.fasta mmseqs_comp tmp --min-seq-id 1.0 --seq-id-mode 1 --threads 16 >mmseqs2.log 2>&1
python match_genes.py --input mmseqs_comp_cluster.tsv --output matched_genes

Next match up gene clusters pre- and post-sequence removal, with naming convention x, y and z for the rare, core and error thresholds respectively.

python analyse_celebrimbor_simulation.py --matched-genes matched_genes.tsv --cgt-pre pre_sim/cgt_output_params_x,y_err_z.txt --cgt-post post_sim/cgt_output_params_z,y_err_z.txt --checkm-pre pre_sim/checkm_out.tsv --checkm-post post_sim/checkm_out.tsv --pan-pre pre_sim/pangenome_summary_params_z,y_err_z.tsv --pan-post post_sim/pangenome_summary_params_z,y_err_z.tsv --mmseqs-pre pre_sim/mmseqs/mmseqs_cluster.sorted.tsv --mmseqs-post post_sim/mmseqs/mmseqs_cluster.sorted.tsv --outpref cgt_comp_params_z,y_err_z

Finally, place cgt_comp_params_z,y_err_z.tsv files in a single directory and analyse the results using mmseqs2_performance.R.

Step 3a: Compare pre- and post-sequence removal results (Panaroo)

For Celebrimbor run with Panaroo, use the pangenome_summary.tsv and cgt_output.txt, renaming to panaroo_sim_<stringency>_params_x,y.tsv and cgt_sim_panaroo_<stringency>_params_x,y_err_z.txt for x, y and z for the rare, core and error thresholds respectively. The <stringency> should be one of strict, moderate or sensitive, depending on the settings using in config.yaml for panaroo_stringency option.

Finally, place panaroo_sim_<stringency>_params_x,y.tsv and cgt_sim_panaroo_<stringency>_params_x,y_err_z.txt files together in a single directory and analyse the results using panaroo_performance.R.

Step 4: Compare pre- and post-sequence removal results (combined)

Take the mmseqs_core.csv and mmseqs_rare.csv generated by mmseqs2_performance.R, and the panaroo_core.csv and panaroo_rare.csv generated by panaroo_performance.R, and analyse them using combined_performance. This will generate Figure 1D in the Celebrimbor publication.