Create a list of files to remove sequences from using the command ls -d -1 $PWD/* > input.txt
, where each line is the full path to an individual assembly
python remove_sequence.py --input input.txt --dist completeness_distribution.txt --prop-complete 0 --avg-breaks 1 --outdir augmented_genomes
Alter config.yaml
to run Celebrimbor using your desired settings. This can also be done from the command line as below, where each of the config parameters can be set manually.
snakemake --cores 16 --config cgt_breaks=0.05,0.05 cgt_error=0.05 output_dir=/path/to/output/dir genome_fasta=/path/to/assembly/directory clustering_method=mmseqs2
Run this command on data both pre- and post-sequence removal.
For Celebrimbor run with mmseqs2, first get a 1:1 mapping between gene sequences pre- and post-sequence removal (assuming run in pre_sim
and post_sim
directories).
cat pre_sim/mmseqs_all_seqs.fasta post_sim/mmseqs_all_seqs.fasta > mmseqs_all_seqs_concat.fasta
mmseqs easy-cluster mmseqs_all_seqs_concat.fasta mmseqs_comp tmp --min-seq-id 1.0 --seq-id-mode 1 --threads 16 >mmseqs2.log 2>&1
python match_genes.py --input mmseqs_comp_cluster.tsv --output matched_genes
Next match up gene clusters pre- and post-sequence removal, with naming convention x, y and z for the rare, core and error thresholds respectively.
python analyse_celebrimbor_simulation.py --matched-genes matched_genes.tsv --cgt-pre pre_sim/cgt_output_params_x,y_err_z.txt --cgt-post post_sim/cgt_output_params_z,y_err_z.txt --checkm-pre pre_sim/checkm_out.tsv --checkm-post post_sim/checkm_out.tsv --pan-pre pre_sim/pangenome_summary_params_z,y_err_z.tsv --pan-post post_sim/pangenome_summary_params_z,y_err_z.tsv --mmseqs-pre pre_sim/mmseqs/mmseqs_cluster.sorted.tsv --mmseqs-post post_sim/mmseqs/mmseqs_cluster.sorted.tsv --outpref cgt_comp_params_z,y_err_z
Finally, place cgt_comp_params_z,y_err_z.tsv
files in a single directory and analyse the results using mmseqs2_performance.R
.
For Celebrimbor run with Panaroo, use the pangenome_summary.tsv
and cgt_output.txt
, renaming to panaroo_sim_<stringency>_params_x,y.tsv
and cgt_sim_panaroo_<stringency>_params_x,y_err_z.txt
for x, y and z for the rare, core and error thresholds respectively. The <stringency>
should be one of strict
, moderate
or sensitive
, depending on the settings using in config.yaml for panaroo_stringency
option.
Finally, place panaroo_sim_<stringency>_params_x,y.tsv
and cgt_sim_panaroo_<stringency>_params_x,y_err_z.txt
files together in a single directory and analyse the results using panaroo_performance.R
.
Take the mmseqs_core.csv
and mmseqs_rare.csv
generated by mmseqs2_performance.R
, and the panaroo_core.csv
and panaroo_rare.csv
generated by panaroo_performance.R
, and analyse them using combined_performance
. This will generate Figure 1D in the Celebrimbor publication.