Skip to content

Latest commit

 

History

History
448 lines (307 loc) · 24.2 KB

MANUAL.md

File metadata and controls

448 lines (307 loc) · 24.2 KB

vcf2gwas manual

About

This manual is meant to clarify the most important options and possibilities when running vcf2gwas.
Note: When running vcf2gwas via docker, replace vcf2gwas in every command with docker run -v /path/to/current-working-directory/:/vcf2gwas/ fvogt257/vcf2gwas.
To obtain example files for running vcf2gwas, run:

vcf2gwas -v test

Besides testing the installation, this will copy the example VCF file example.vcf.gz and phenotype file example.csv to your current working directory.
The VCF file example.vcf.gz contains pre-filtered SNP information of about 60 accessions of A. thaliana from the 1001 Genomes project.
The phenotype file example.csv is made up of the avrRpm1 phenotype from the AraPheno database for the same accessions.

Contents

GEMMA models

This sections demonstrates the usage of vcf2gwas when running the different analysis models of GEMMA.

To read more in detail about the different models GEMMA is able to run, please refer to their manual.

Running a linear model analysis

To run a linear model analysis with default settings, select a VCF file with the -v/--vcf and a phenotype file with the -pf/--pfile option, respectively. To select a specific phenotype from the phenotype file, utilize the -p/--pheno option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lm

This will use the first phenotype column in the phenotype file for analysis.

Running a linear mixed model analysis

GEMMA offers univariate and multivariate linear mixed model analysis.

Univariate linear mixed model

To run a linear mixed model analysis with default settings, type:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm

This again uses the first phenotype column in the phenotype file to carry out the analysis.
To analyse multiple phenotypes (for example the 1st, 2nd and 4th) independently, run:

vcf2gwas -v [filename] -pf [filename] -p 1 -p 2 -p 4 -lmm

This will carry out the univariate linear mixed model analysis for the 1st, 2nd and 4th phenotype column in the phenotype file.

Multivariate linear mixed model

To analyse these phenotypes jointly, employ the -m/--multi option:

vcf2gwas -v [filename] -pf [filename] -p 1 -p 2 -p 4 -lmm --multi

Running a bayesian sparse linear mixed model analysis

Run a bayesian sparse linear mixed model analysis with any phenotype in the phenotype file:

vcf2gwas -v [filename] -pf [filename] -p [int] -bslmm

To change to amount of burn-in steps, sampling steps and maximum value for 'gamma' from their default values (100,000, 1,000,000 and 300, respectively), use the -w/--burn, -s/--sampling and -smax/--snpmax options, respectively.

File related options

vcf2gwas is capable of analyzing multiple phenotypes from one or multiple phenotypes at once. Depending on the available cores and memory of the machine running the analysis and the number of phenotypes to be analyzed, the phenotype file will be split up during the analysis to ensure maximum efficiency by analyzing phenotypes in parallel.

Selecting multiple phenotype files

vcf2gwas is able to take in multiple phenotype files by employing the -pf/--pfile option for each file:

vcf2gwas -v [filename] -pf [filename1] -pf [filename2] -p [int] -lmm

If multiple phenotypes are specified, these phenotypes will be analyzed in every phenotype file. This is advantageous if one wants to analyze the same phenotypes with a different subset of the individuals in the VCF file.

Analyzing multiple phenotypes

To analyze multiple phenotypes in one run you can either specify multiple phenotypes (by either typing the phenotype names or specifying the phenotype column position)

vcf2gwas -v [filename] -pf [filename] -p [name1] -p [name2] -p [name3] -lmm

vcf2gwas -v [filename] -pf [filename] -p [num1] -p [num2] -p [num3] -lmm

or select all phenotypes in the phenotype file at once utilizing the -ap/--allphenotypes option:

vcf2gwas -v [filename] -pf [filename] -ap -lmm

Transforming phenotype values

vcf2gwas offers the option to transform the phenotype values in the phenotype file(s). The selected metric (default: 'wisconsin') is applied across rows.
To transform the phenotypes, employ the -t/--transform option and to change the metric to one of the other supported metrics, add it as an argument to the option:

vcf2gwas -v [filename] -pf [filename1] -p [int] -lmm -t hellinger

Now, vcf2gwas, transforms the phenotypes according to the hellinger metric.
The following metrics are available:

  • total: Divides each observation by row sum
  • max: Divides each observation by row max
  • normalize: Chord transformation, also euclidean normalization, making the length of each row 1
  • range: Converts the range of the data to 0 and 1
  • standardize: Standardizes each observation (i.e. z-score)
  • hellinger: Square-root of the total transformation
  • log: Returns ln(x+1)
  • logp1: Returns ln(x) + 1, if x > 0. Otherwise returns 0
  • pa: Converts data to binary absence (0) presence (1) data
  • wisconsin: First divides an observation by the max of the column, then the sum of the row. That is, it applies ‘max’ down columns then ‘total’ across rows

These functions are taken from the ecopy package.

Note: If desired, one can also use the dimensionality reduction options in conjunction with the transformation. vcf2gwas will first transform the phenotypes and then reduce the dimensionality of the transformed phenotypes according to the chosen method and use these results as phenotypes.

Adding covariates

GEMMA supports adding covariates to the linear model and the linear mixed model analysis.
To extract principal components from the VCF file for subsequent use as covariates in the analysis, use -cf/--cfile and -c/--covar with the 'PCA' argument and the amount of PCs desired for the analysis, respectively:

vcf2gwas -v [filename] -pf [filename] -p [num1] -cf PCA -c 2 -lmm

Now, 2 principal components will be extracted from the VCF file and used for the linear mixed model analysis.

To use principal components or UMAP embeddings of the phenotypic data as covariates for the analysis, see Using dimensionality reduction of phenotypes for analysis.

Alternatively, a covariate file formatted in the same way as the phenotype file can be added manually. Selecting covariates from a covariate file follows the same scheme as selecting phenotypes by using -cf/--cfile and -c/--covar:

vcf2gwas -v [filename] -pf [filename] -p [num1] -cf [filename] -c 1 -lmm

Here, the 1st covariate column of the covariate file will be considered in the analysis of the selected phenotype. Similarly to the phenotype options, multiple covariates can be selected (either by name or column number) as well as all at once using -ac/--allcovariates.
Note: Even if no covariates were added to the analysis, GEMMA will always use the intercept as a covariate.

Comparing results to specific genes

Comparing the GWAS results to specific genes of interest can be a tedious task. To facilitate this process, vcf2gwas comes with GFF gene files already built-in for the most common species. Comparison to a species can be selected by employing the -gf/--genefile option and the abbreviation for the species:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -gf [abbreviation]

Below are all supported species, their abbreviations as well as the used references and their sources:

Species Abbreviation Scientific name Reference Source
Anopheles AG anopheles gambiae AgamP4.51 link
Arabidopsis AT arabidopsis thaliana TAIR10.51 link
C. elegans CE caenorhabditis elegans WBcel235 link
Fruit fly DM drosophila melanogaster BDGP6.32 link
Zebrafish DR danio rerio GRCz11 link
Chicken GG gallus gallus GRCg6a link
Human HS homo sapiens GRCh38.p13 link
Mouse MM mus musculus GRCm39 link
Rice OS oryza sativa IRGSP-1.0.51 link
Rat RN rattus norvegicus Rnor_6.0 link
Yeast SC saccharomyces cerevisiae R64-1-1 link
Tomato SL solanum lycopersicum SL3.0.51 link
Grape VV vitis vinifera 12X.51 link
Maize ZM zea mays Zm-B73-REFERENCE-NAM-5.0.51 link

Futhermore, vcf2gwas supports adding a 'gene-file' (either a GFF3 formatted .gff or comma-separated .csv file) containing the position as well as additional information of genes to the analysis by using the -gf/--genefile option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -gf [filename]

One can also add multiple 'gene-files' or use the built-in files in combination with 'gene-files' to efficiently compare the results of the analysis with different subsets of genes.

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -gf [abbreviation] -gf [filename1] -gf [filename2]

If the file is in the .csv format, the file needs at least three columns containing information about chromosome, gene start position and gene stop position. These columns have to be named 'chr', 'start' and 'stop'.

vcf2gwas recognizes chromosomes in the following formats (here the first chromosome): Chr1, chr1, 1.
If the chromosomes in the VCF file are of a different format, it is necessary that the chromosome information in the gene file is formatted in the same way, otherwise vcf2gwas won't recognize the information correctly.

vcf2gwas will summarize the n best SNPs (specified with -ts/--topsnp) of every analyzed phenotype and compare them to the genes in the file by calculating the distance between each SNP and gene upstream as well as downstream. These results can be filtered by saving only those SNPs with a distance to a gene lower than a specific threshold (set with -gt/--genethresh).

Note: Since for each SNP only the gene with the closest start/end upstream and downstream is shown, this feature only serves to give a hint of possibly associated genes. Closer inspection by the user is strongly recommended.

Adding a relatedness matrix

Although vcf2gwas will by default calculate a relatedness matrix depending on the chosen model, one may want to add a different one instead to the analysis. This is possible by employing the -k/--relmatrix option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -k [filename]

To use vcf2gwas to just calculate a relatedness matrix from the VCF file, run the -gk option:

vcf2gwas -v [filename] -gk 

To calculate the relatedness matrix and perform its eigen-decomposition in the same run, use the -eigen option:

vcf2gwas -v [filename] -eigen

Of course the -eigen option can also be used when supplying your own relatedness matrix with the -k/--relmatrix option.

Analyzing a subset of chromosomes

By default vcf2gwas analyzes all chromosomes available in the VCF file. If one wants to analyze only a subset of the chromosomes, employ the -chr/--chromosome option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -chr [num1] -chr [num2]

Here, only the two specified chromosomes chromosomes will be used for the analysis, speeding up the whole process.

Changing the output directory

By default, vcf2gwas will save the output in the current working directory. To change to a unique output directory, use the -o/--output option to specify a path:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -o dir/example/

Miscellaneous options

In this section other useful options of vcf2gwas will be elucidated.

Limiting memory and core usage

By default vcf2gwas will use half of the available memory and all logical cores minus one. It can be important to limit usage of these resources especially when running the analysis on a machine shared with others.
To set the memory (in MB) and core usage employ the -M/--memory and -T/--threads option, respectively:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -M 8000 -T 6

Now, vcf2gwas uses 8 GB of memory and 6 cores to carry out the analysis.
It is recommended to not set the memory to less than 1 GB.

Using dimensionality reduction of phenotypes for analysis

When analyzing many phenotypes it can be escpecially beneficial to reduce the phenotypic dimensions. This allows the user to analyze or account for underlying structures in their phenotypic data by either using the output of the dimensionality reduction as phenotypes for GEMMA or adding them as covariates to the analysis.
vcf2gwas offers two often-used methods to reduce the dimensions: principal component analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP).
When using the ouput as phenotypes, both methods can be used either separately or simultaneously in the analysis.

PCA

To perform PCA on the phenotype data and use the principal components as phenotypes for the analysis, use the -P/--PCA option.
By default, vcf2gwas will reduce the phenotype dimensionality to 2 PCs. To change this value to any value between 2 and 10, append the value to the option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -P 3

Now, vcf2gwas will reduce the phenotype dimensionality to 3 instead of 2. vcf2gwas will also plot the variance explained of the principal components, so that the user can estimate the amount of sufficient PCs.

UMAP

To perform UMAP reduction on the phenotype data and use the embeddings as phenotypes for the analysis, use the -U/--UMAP option.
By default, vcf2gwas will reduce the phenotype dimensionality to 2 embeddings. To change this value to any value between 1 and 5, append the value to the option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3

Now, vcf2gwas will reduce the phenotype dimensionality to 3 instead of 2.

By default, vcf2gwas uses euclidean metric for UMAP calculations. The metric can be changed with the -um/--umapmetric option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3 -um manhattan

The manhattan metric is now used to calculate the UMAP embeddings.
The following is a list of the available metrics:

  • euclidean
  • manhattan
  • braycurtis
  • cosine
  • hamming
  • jaccard
  • hellinger

Using PCs or UMAP embeddings as covariates

Another useful option is -asc/--ascovariates. It allows the user to utilize a variable amount principal components or UMAP embeddings of their phenotypic data as covariates in the analysis of said phenotypes:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3 -asc

Here, the phenotypic dimensionality will be reduced to 3 UMAP embeddings which will be subsequently used as covariates for the analysis.
Note: This option takes only effect in conjuction with either the -U/--UMAP or -P/--PCA option, but not both simultaneously. Furthermore, only one phenotype file can be added to the analysis to take advantage of this option.

Using PCA of genotypes instead of standard relatedness matrix

vcf2gwas uses GEMMAs standard method of kinship calculation for the linear mixed model, which produces a relatedness matrix. Instead of using this standard method, the relatedness matrix can optionally be calculated via PCA by utilizing the -KC/--kcpca option.
The SNP data from the VCF file will be pruned by linkage disequilibrium with a default r-squared threshold of 0.5. To change the threshold, append the value to the option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -KC 0.8

Filter out SNPs

By default, vcf2gwas will filter out SNPs with a minor allele frequency > 0.01. To change this threshold use the -q/--minaf option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -q 0.02

Change manhattan plot

The manhattan plot which will be produced from the GEMMA output, features by default a significant value line. The significance level is determined by Bonferroni-correcting the standard significance level of 0.05 with the amount of SNPs used by GEMMA in the analysis. All SNPs above that line will be labeled.
To change the threshold line manually, use the -sv/--sigval option.

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -sv 7

The line will now be drawn at -log10(1e-7).
To disable the line and not label any SNPs, change the value to 0.

To remove the SNP lables completely, utilize the -nl/--nolabel option.

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -nl

Note: This can be beneficial to reduce the overall runtime when the analysis results in many significant SNPs.

To deactivate the manhattan and QQ-plots completely, employ the -np/--noplot option.

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -np

Note: This can be beneficial when using custom vcf files missing e.g. variant positions, or to reduce overall runtime.

Change font size in all plots

Since the resulting plots may be used in various contexts, the font size of the plots produced by vcf2gwas can be changed by using the -fs/--fontsize option.

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -fs 20

The font sizes of the plots will now be changed from the default value of 26 pt to 20 pt.

Keep temporary files

During the analysis, various temporary files like subsetted and filtered VCF and .csv files are produced. By default they are removed once they are no longer needed but if one wants to retain these files, employ the -r/--retain option:

vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -r

Output

The following part shows some of the output plots and summaries produced by the analysis of the example files using the linear mixed model and standard options.

Quality Control

vcf2gwas will produce quality control plots (saved in the QC directory) for the phenotype data as well as the genotype data:

  • For every analyzed phenotype, its distribution will be plotted.
  • For every chromosome of the genotype data, the raw variant density and, if available, DP (depth of coverage), MQ (mapping quality) and QD (Qual normalized by Depth) will be plotted.

To reduce the runtime, quality control can be deactivated with the option -nq/--noqc.

Plots

Manhattan-plot labeling significant SNPs with a standard significant value threshold of -log10(1e-6): Manhattan-plot

QQ-plot comparing the expected and observed probability distributions: QQ-plot

Summaries

Amongst other things, vcf2gwas will sort the SNPs of every analyzed phenotype, save the specified amount of top SNPs for each phenotype, summarize these SNPs of all phenotypes to check if certain SNPs occur more than once and optionally compare these SNPs to the genes supplied by the gene file. Below is the output shown of such a gene comparison when supplying a gene file containing information about NLR genes in A. thaliana:

Upstream gene Upstream gene Upstream gene Upstream gene Downstream gene Downstream gene Downstream gene Downstream gene
SNP ID Chr Phenotypes ID Comment Name Distance SNP pos Distance Name Comment ID
3:2237364 3 avrRpm AT3G07040.1 NB-ARC domain-containing disease resistance protein RPM1 8340 2237364
3:2237394 3 avrRpm AT3G07040.1 NB-ARC domain-containing disease resistance protein RPM1 8370 2237394
3:2237446 3 avrRpm AT3G07040.1 NB-ARC domain-containing disease resistance protein RPM1 8422 2237446
3:2237452 3 avrRpm AT3G07040.1 NB-ARC domain-containing disease resistance protein RPM1 8428 2237452
3:2289171 3 avrRpm AT3G07040.1 NB-ARC domain-containing disease resistance protein RPM1 60147 2289171
2:975138 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 84618 975138 28330 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
2:975234 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 84714 975234 28234 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
2:975320 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 84800 975320 28148 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
2:975405 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 84885 975405 28063 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
2:975411 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 84891 975411 28057 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
2:975490 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 84970 975490 27978 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
2:975590 2 avrRpm AT2G03030.1 Toll-Interleukin-Resistance (TIR) domain family protein 85070 975590 27878 Toll-Interleukin-Resistance (TIR) domain family protein AT2G03300.1
1:11273854 1 avrRpm 11273854 14598 RAC1 Disease resistance protein (TIR-NBS-LRR class) family AT1G31540.2
1:11273813 1 avrRpm 11273813 14639 RAC1 Disease resistance protein (TIR-NBS-LRR class) family AT1G31540.2
3:2165688 3 avrRpm 2165688 60264 RPM1 NB-ARC domain-containing disease resistance protein AT3G07040.1