This protocol describes good practices on how to look for archaic segments using state-of-the-art tools. In particular, it focuses on optimising the use of IBDmix in different demographic scenarios and explores its use on ancient data IBDmix Github.
First, we need a demographic model that represents important characteristics of our dataset. We can use demes Demes Github to visualise the demographic model of interest as follow:
demesdraw tubes --log-time modelA.yaml modelA_nlog.svg
The name of the model must be specified in the modelA.yaml and will be used for downstream analyses.
Pre-install the latest version of Python and R. We run the simulations using the Snakemake workflow management system. Please, download Snakemake following the instructions here: Snakemake Install.
- Slendr: Slendr Install. Remember to run setup_env() before running the pipeline
- R packages:
install.packages(c("dplyr", "slendr","ggplot2", "admixr","purrr","readr","optparse", "reticulate"))
Alternatively, you can use the provided environmental.yaml file to create a new environment for this project
- Clone this repository (git clone XXXX)
- Create a conda environment (conda env create --name XXX --file environment.yaml). Modify the prefix at the end of this YAML file (and the name of the environment if interested)
- Update the path to the config file in the snakemake rules (configfile: path/to/config.yaml)
Check the toy config.yaml file. You will need to specify:
- name of the model ('modelA')
- path to environmental.yaml file ('path/to/environmental.yaml')
- your output directory ('path/2/outputs')
Step 2, uses the Snakefile named "rules/00.gen_simuls_step1.smk" which generates the simulation outputs. The pipeline includes a few jobs to generate the simulations for which we chose the package "slendr" Slendr Manual. In our example, we use the simulator "msprime" but similarly you can use "slim". We will also be specified a seed to do replicates.
You can run the Snakefile as follows:
snakemake --snakefile rules/00.gen_simuls_step1.smk --cores 50 --use-conda
We make use of the --use-conda because we created an environment for slendr.
This will generate several files in some directories under your output-dir:
- {model}{seed}{time}ky/results
- {model}{seed}{time}ky/logs
- {model}{seed}{time}ky/model
It runs 3 scripts:
- scripts/00.introgression.R: generates the simulations and outputs vcf file, the tree sequence data and the simulation model.
- scripts/00.detect_tracts.py: iterates through every genomic locus and outputs the introgressed tracts.
- scripts/00.calculate_f4_ratio.r: calculates the f4 ratio. This step is useful to have a glance at the data and double-check the simulations' output.
We modified the Snakemake pipeline from IBDmix Github to run IBDmix on several archaic genomes for all simulations.
snakemake --snakefile rules/01.simulations_arc_detection.smk --cores xx
We calculate the performance at the base-pair and window levels. Difference measures are calculated to better calibrate the software.
snakemake --snakefile rules/02.eval_performance.smk --cores xx