Population structure and local ancestry inference for a population of interest (we applied it to the Faroese) using a window-based haplotype clustering approach on phased data. The neural network (NN) can be applied to whole-genome data. More details about the NN are available in the Haplonet GitHub repository.
Software requirements:
- Snakemake
- Conda
- bcftools
- Haplonet
- R (version >= 4.0)
- Python (>= 3.6)
If you have questions about Snakemake or conda environments, please read the guide in guide_snakemake_conda_env.md.
Input files:
- phased (and imputed) data (VCF format)
- tab-delimited file with the sample IDs (same order as VCF)
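Because the sample ID file must follow the same order as the VCF, it can help to verify this before running the pipeline. The following is a minimal sketch, not part of the pipeline itself. It assumes the sample ID sits in the first column of the tab-delimited file and that the file has no header row; the function names are hypothetical.

```python
import gzip
from itertools import zip_longest


def vcf_sample_ids(vcf_path):
    """Return the sample IDs from the #CHROM header line of a VCF.

    Works for plain-text and gzip/bgzip-compressed files.
    """
    opener = gzip.open if str(vcf_path).endswith(".gz") else open
    with opener(vcf_path, "rt") as handle:
        for line in handle:
            if line.startswith("#CHROM"):
                # The first 9 columns are fixed VCF fields; samples follow.
                return line.rstrip("\n").split("\t")[9:]
            if not line.startswith("#"):
                break  # reached the records without seeing a header line
    raise ValueError("no #CHROM header line found")


def read_sample_ids(path):
    """Return the first column of each non-empty line (assumed: no header)."""
    with open(path) as handle:
        return [line.split("\t")[0].strip() for line in handle if line.strip()]


def first_mismatch(vcf_ids, file_ids):
    """Return None if both lists agree, else (index, vcf_id, file_id)."""
    for index, (vcf_id, file_id) in enumerate(zip_longest(vcf_ids, file_ids)):
        if vcf_id != file_id:
            return index, vcf_id, file_id
    return None
```

Calling `first_mismatch(vcf_sample_ids(vcf), read_sample_ids(ids))` with your own paths returns `None` when the order matches, and otherwise the position of the first disagreement.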
Workflow:
- Pre-step: combine datasets (optional)
- Pre-processing: variant- and individual-level
- Haplonet
- Training (NN)
- Admixture (training + supervised)
- Fatash
- PCA
- Plotting
- Pre-step: since we aimed to paint the Faroese chromosomes using ancient samples, we merged both datasets using rules/prep_merged_vcf.smk. You can skip this step if all samples are already in the same VCF file.
- Pre-processing: filter the VCF with bcftools to remove missing data and keep only bi-allelic sites with MAF > 0.05:
bcftools view -m 2 -M 2 -v snps -i 'MAF > 0.05'
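To make the filtering criteria concrete, here is a small sketch of the per-site decision described above: no missing genotypes, bi-allelic, MAF > 0.05. It only illustrates the criteria on a list of GT strings and does not reproduce bcftools internals; the function names and the default threshold argument are hypothetical.

```python
def allele_counts(genotypes):
    """Count alleles across GT strings such as '0|1' (phased) or '1/1'."""
    counts = {}
    for gt in genotypes:
        for allele in gt.replace("/", "|").split("|"):
            counts[allele] = counts.get(allele, 0) + 1
    return counts


def site_passes(genotypes, min_maf=0.05):
    """Return True if the site has no missing alleles, only REF/ALT alleles
    '0' and '1', and a minor allele frequency strictly above min_maf."""
    counts = allele_counts(genotypes)
    if "." in counts:
        return False  # at least one missing allele
    if not set(counts) <= {"0", "1"}:
        return False  # a third allele was observed, so not bi-allelic
    total = sum(counts.values())
    if total == 0:
        return False  # no genotypes supplied
    alt_freq = counts.get("1", 0) / total
    maf = min(alt_freq, 1.0 - alt_freq)
    return maf > min_maf
```

For example, `site_passes(["0|1", "0|0", "1|1", "0|0"])` is true (3 of 8 alleles are ALT), while a site with a single ALT allele among 42 alleles falls below the threshold.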
- Run Haplonet to infer population structure and local ancestry using the Snakemake files provided. First, run step1_training_popst:
snakemake --snakefile rules/haplonet_main.smk -j10
We run 10 seeds (higher values of K require more seeds). For the chosen K, check whether the log-likelihoods have converged; if they have not, run up to 100 seeds. Then use the admixture results with the highest log-likelihood for further analyses. The Snakemake file step2_best_plotting does this for you:
snakemake --snakefile rules/plotting_haplonet.smk -j10
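The seed selection described above can be sketched as follows. This is an illustration only, not the code in step2_best_plotting. It assumes you have already collected the final log-likelihood of each seed into a dictionary. The convergence rule (the best few runs lie within a small tolerance of each other) is an assumed heuristic, since the text does not define one, and `top_n` and `tol` are hypothetical parameters.

```python
def best_seed(loglikes):
    """Return the seed whose run reached the highest log-likelihood.

    loglikes maps seed -> final log-likelihood for one value of K.
    """
    if not loglikes:
        raise ValueError("no runs supplied")
    return max(loglikes, key=loglikes.get)


def has_converged(loglikes, top_n=3, tol=1.0):
    """Assumed heuristic: the top_n log-likelihoods differ by at most tol."""
    ranked = sorted(loglikes.values(), reverse=True)
    if len(ranked) < top_n:
        return False  # too few runs to judge
    return ranked[0] - ranked[top_n - 1] <= tol


# Made-up numbers for illustration only, not results from any dataset.
example = {1: -1000.5, 2: -1000.2, 3: -1003.0, 4: -1000.9}
chosen = best_seed(example)        # seed 2 has the highest value
converged = has_converged(example)  # top three runs lie within 1.0
```

If `has_converged` returns false for the chosen K, that is the signal to rerun with more seeds before picking the best run.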
Finally, run step3_fastash to obtain the chromosome painting:
snakemake --snakefile rules/fatash.smk -j10
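If you prefer to launch the three Snakemake steps from one script, a minimal sketch is shown below. It simply wraps the commands listed above. The helper names are hypothetical, and the optional `-n` flag is Snakemake's dry-run option, which lists the planned jobs without executing them.

```python
import subprocess

# Snakefiles in the order used in this README.
SNAKEFILES = [
    "rules/haplonet_main.smk",
    "rules/plotting_haplonet.smk",
    "rules/fatash.smk",
]


def build_commands(jobs=10, dry_run=False):
    """Return one snakemake command (as an argument list) per step."""
    commands = []
    for snakefile in SNAKEFILES:
        cmd = ["snakemake", "--snakefile", snakefile, f"-j{jobs}"]
        if dry_run:
            cmd.append("-n")  # only print what would be run
        commands.append(cmd)
    return commands


def run_all(jobs=10, dry_run=False):
    """Run the steps in order, stopping at the first failure."""
    for cmd in build_commands(jobs, dry_run):
        subprocess.run(cmd, check=True)
```

In practice you may want to stop after the first step to check convergence of the log-likelihoods before continuing, so running `run_all(dry_run=True)` first is a cautious starting point.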