This manual is meant to clarify the most important options and possibilities when running vcf2gwas.
Note: When running vcf2gwas via docker, replace vcf2gwas
in every command with docker run -v /path/to/current-working-directory/:/vcf2gwas/ fvogt257/vcf2gwas
.
To obtain example files for running vcf2gwas, run:
vcf2gwas -v test
Besides testing the installation, this will copy the example VCF file example.vcf.gz
and phenotype file example.csv
to your current working directory.
The VCF file example.vcf.gz
contains pre-filtered SNP information of about 60 accessions of A. thaliana from the 1001 Genomes project.
The phenotype file example.csv
is made up of the avrRpm1
phenotype from the AraPheno database for the same accessions.
This sections demonstrates the usage of vcf2gwas when running the different analysis models of GEMMA.
To read more in detail about the different models GEMMA is able to run, please refer to their manual.
To run a linear model analysis with default settings, select a VCF
file with the -v/--vcf
and a phenotype file with the -pf/--pfile
option, respectively. To select a specific phenotype from the phenotype file, utilize the -p/--pheno
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lm
This will use the first phenotype column in the phenotype file for analysis.
GEMMA offers univariate and multivariate linear mixed model analysis.
To run a linear mixed model analysis with default settings, type:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm
This again uses the first phenotype column in the phenotype file to carry out the analysis.
To analyse multiple phenotypes (for example the 1st, 2nd and 4th) independently, run:
vcf2gwas -v [filename] -pf [filename] -p 1 -p 2 -p 4 -lmm
This will carry out the univariate linear mixed model analysis for the 1st, 2nd and 4th phenotype column in the phenotype file.
To analyse these phenotypes jointly, employ the -m/--multi
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -p 2 -p 4 -lmm --multi
Run a bayesian sparse linear mixed model analysis with any phenotype in the phenotype file:
vcf2gwas -v [filename] -pf [filename] -p [int] -bslmm
To change to amount of burn-in steps, sampling steps and maximum value for 'gamma' from their default values (100,000, 1,000,000 and 300, respectively), use the -w/--burn
, -s/--sampling
and -smax/--snpmax
options, respectively.
vcf2gwas is capable of analyzing multiple phenotypes from one or multiple phenotypes at once. Depending on the available cores and memory of the machine running the analysis and the number of phenotypes to be analyzed, the phenotype file will be split up during the analysis to ensure maximum efficiency by analyzing phenotypes in parallel.
vcf2gwas is able to take in multiple phenotype files by employing the -pf/--pfile
option for each file:
vcf2gwas -v [filename] -pf [filename1] -pf [filename2] -p [int] -lmm
If multiple phenotypes are specified, these phenotypes will be analyzed in every phenotype file.
This is advantageous if one wants to analyze the same phenotypes with a different subset of the individuals in the VCF
file.
To analyze multiple phenotypes in one run you can either specify multiple phenotypes (by either typing the phenotype names or specifying the phenotype column position)
vcf2gwas -v [filename] -pf [filename] -p [name1] -p [name2] -p [name3] -lmm
vcf2gwas -v [filename] -pf [filename] -p [num1] -p [num2] -p [num3] -lmm
or select all phenotypes in the phenotype file at once utilizing the -ap/--allphenotypes
option:
vcf2gwas -v [filename] -pf [filename] -ap -lmm
vcf2gwas offers the option to transform the phenotype values in the phenotype file(s). The selected metric (default: 'wisconsin') is applied across rows.
To transform the phenotypes, employ the -t/--transform
option and to change the metric to one of the other supported metrics, add it as an argument to the option:
vcf2gwas -v [filename] -pf [filename1] -p [int] -lmm -t hellinger
Now, vcf2gwas, transforms the phenotypes according to the hellinger metric.
The following metrics are available:
- total: Divides each observation by row sum
- max: Divides each observation by row max
- normalize: Chord transformation, also euclidean normalization, making the length of each row 1
- range: Converts the range of the data to 0 and 1
- standardize: Standardizes each observation (i.e. z-score)
- hellinger: Square-root of the total transformation
- log: Returns ln(x+1)
- logp1: Returns ln(x) + 1, if x > 0. Otherwise returns 0
- pa: Converts data to binary absence (0) presence (1) data
- wisconsin: First divides an observation by the max of the column, then the sum of the row. That is, it applies ‘max’ down columns then ‘total’ across rows
These functions are taken from the ecopy package.
Note: If desired, one can also use the dimensionality reduction options in conjunction with the transformation. vcf2gwas will first transform the phenotypes and then reduce the dimensionality of the transformed phenotypes according to the chosen method and use these results as phenotypes.
GEMMA supports adding covariates to the linear model and the linear mixed model analysis.
To extract principal components from the VCF
file for subsequent use as covariates in the analysis, use -cf/--cfile
and -c/--covar
with the 'PCA' argument and the amount of PCs desired for the analysis, respectively:
vcf2gwas -v [filename] -pf [filename] -p [num1] -cf PCA -c 2 -lmm
Now, 2 principal components will be extracted from the VCF
file and used for the linear mixed model analysis.
To use principal components or UMAP embeddings of the phenotypic data as covariates for the analysis, see Using dimensionality reduction of phenotypes for analysis.
Alternatively, a covariate file formatted in the same way as the phenotype file can be added manually.
Selecting covariates from a covariate file follows the same scheme as selecting phenotypes by using -cf/--cfile
and -c/--covar
:
vcf2gwas -v [filename] -pf [filename] -p [num1] -cf [filename] -c 1 -lmm
Here, the 1st covariate column of the covariate file will be considered in the analysis of the selected phenotype.
Similarly to the phenotype options, multiple covariates can be selected (either by name or column number) as well as all at once using -ac/--allcovariates
.
Note: Even if no covariates were added to the analysis, GEMMA will always use the intercept as a covariate.
Comparing the GWAS results to specific genes of interest can be a tedious task. To facilitate this process, vcf2gwas comes with GFF gene files already built-in for the most common species. Comparison to a species can be selected by employing the -gf/--genefile
option and the abbreviation for the species:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -gf [abbreviation]
Below are all supported species, their abbreviations as well as the used references and their sources:
Species | Abbreviation | Scientific name | Reference | Source |
---|---|---|---|---|
Anopheles | AG | anopheles gambiae | AgamP4.51 | link |
Arabidopsis | AT | arabidopsis thaliana | TAIR10.51 | link |
C. elegans | CE | caenorhabditis elegans | WBcel235 | link |
Fruit fly | DM | drosophila melanogaster | BDGP6.32 | link |
Zebrafish | DR | danio rerio | GRCz11 | link |
Chicken | GG | gallus gallus | GRCg6a | link |
Human | HS | homo sapiens | GRCh38.p13 | link |
Mouse | MM | mus musculus | GRCm39 | link |
Rice | OS | oryza sativa | IRGSP-1.0.51 | link |
Rat | RN | rattus norvegicus | Rnor_6.0 | link |
Yeast | SC | saccharomyces cerevisiae | R64-1-1 | link |
Tomato | SL | solanum lycopersicum | SL3.0.51 | link |
Grape | VV | vitis vinifera | 12X.51 | link |
Maize | ZM | zea mays | Zm-B73-REFERENCE-NAM-5.0.51 | link |
Futhermore, vcf2gwas supports adding a 'gene-file' (either a GFF3 formatted .gff
or comma-separated .csv
file) containing the position as well as additional information of genes to the analysis by using the -gf/--genefile
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -gf [filename]
One can also add multiple 'gene-files' or use the built-in files in combination with 'gene-files' to efficiently compare the results of the analysis with different subsets of genes.
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -gf [abbreviation] -gf [filename1] -gf [filename2]
If the file is in the .csv
format, the file needs at least three columns containing information about chromosome, gene start position and gene stop position. These columns have to be named 'chr', 'start' and 'stop'.
vcf2gwas recognizes chromosomes in the following formats (here the first chromosome): Chr1
, chr1
, 1
.
If the chromosomes in the VCF
file are of a different format, it is necessary that the chromosome information in the gene file is formatted in the same way, otherwise vcf2gwas won't recognize the information correctly.
vcf2gwas will summarize the n best SNPs (specified with -ts/--topsnp
) of every analyzed phenotype and compare them to the genes in the file by calculating the distance between each SNP and gene upstream as well as downstream. These results can be filtered by saving only those SNPs with a distance to a gene lower than a specific threshold (set with -gt/--genethresh
).
Note: Since for each SNP only the gene with the closest start/end upstream and downstream is shown, this feature only serves to give a hint of possibly associated genes. Closer inspection by the user is strongly recommended.
Although vcf2gwas will by default calculate a relatedness matrix depending on the chosen model, one may want to add a different one instead to the analysis. This is possible by employing the -k/--relmatrix
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -k [filename]
To use vcf2gwas to just calculate a relatedness matrix from the VCF file, run the -gk
option:
vcf2gwas -v [filename] -gk
To calculate the relatedness matrix and perform its eigen-decomposition in the same run, use the -eigen
option:
vcf2gwas -v [filename] -eigen
Of course the -eigen
option can also be used when supplying your own relatedness matrix with the -k/--relmatrix
option.
By default vcf2gwas analyzes all chromosomes available in the VCF
file. If one wants to analyze only a subset of the chromosomes, employ the -chr/--chromosome
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -chr [num1] -chr [num2]
Here, only the two specified chromosomes chromosomes will be used for the analysis, speeding up the whole process.
By default, vcf2gwas will save the output in the current working directory. To change to a unique output directory, use the -o/--output
option to specify a path:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -o dir/example/
In this section other useful options of vcf2gwas will be elucidated.
By default vcf2gwas will use half of the available memory and all logical cores minus one. It can be important to limit usage of these resources especially when running the analysis on a machine shared with others.
To set the memory (in MB) and core usage employ the -M/--memory
and -T/--threads
option, respectively:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -M 8000 -T 6
Now, vcf2gwas uses 8 GB of memory and 6 cores to carry out the analysis.
It is recommended to not set the memory to less than 1 GB.
When analyzing many phenotypes it can be escpecially beneficial to reduce the phenotypic dimensions. This allows the user to analyze or account for underlying structures in their phenotypic data by either using the output of the dimensionality reduction as phenotypes for GEMMA or adding them as covariates to the analysis.
vcf2gwas offers two often-used methods to reduce the dimensions: principal component analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP).
When using the ouput as phenotypes, both methods can be used either separately or simultaneously in the analysis.
To perform PCA on the phenotype data and use the principal components as phenotypes for the analysis, use the -P/--PCA
option.
By default, vcf2gwas will reduce the phenotype dimensionality to 2 PCs. To change this value to any value between 2 and 10, append the value to the option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -P 3
Now, vcf2gwas will reduce the phenotype dimensionality to 3 instead of 2. vcf2gwas will also plot the variance explained of the principal components, so that the user can estimate the amount of sufficient PCs.
To perform UMAP reduction on the phenotype data and use the embeddings as phenotypes for the analysis, use the -U/--UMAP
option.
By default, vcf2gwas will reduce the phenotype dimensionality to 2 embeddings. To change this value to any value between 1 and 5, append the value to the option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3
Now, vcf2gwas will reduce the phenotype dimensionality to 3 instead of 2.
By default, vcf2gwas uses euclidean metric for UMAP calculations. The metric can be changed with the -um/--umapmetric
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3 -um manhattan
The manhattan metric is now used to calculate the UMAP embeddings.
The following is a list of the available metrics:
- euclidean
- manhattan
- braycurtis
- cosine
- hamming
- jaccard
- hellinger
Another useful option is -asc/--ascovariates
. It allows the user to utilize a variable amount principal components or UMAP embeddings of their phenotypic data as covariates in the analysis of said phenotypes:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -U 3 -asc
Here, the phenotypic dimensionality will be reduced to 3 UMAP embeddings which will be subsequently used as covariates for the analysis.
Note: This option takes only effect in conjuction with either the -U/--UMAP
or -P/--PCA
option, but not both simultaneously.
Furthermore, only one phenotype file can be added to the analysis to take advantage of this option.
vcf2gwas uses GEMMAs standard method of kinship calculation for the linear mixed model, which produces a relatedness matrix. Instead of using this standard method, the relatedness matrix can optionally be calculated via PCA by utilizing the -KC/--kcpca
option.
The SNP data from the VCF file will be pruned by linkage disequilibrium with a default r-squared threshold of 0.5. To change the threshold, append the value to the option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -KC 0.8
By default, vcf2gwas will filter out SNPs with a minor allele frequency > 0.01. To change this threshold use the -q/--minaf
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -q 0.02
The manhattan plot which will be produced from the GEMMA output, features by default a significant value line. The significance level is determined by Bonferroni-correcting the standard significance level of 0.05 with the amount of SNPs used by GEMMA in the analysis. All SNPs above that line will be labeled.
To change the threshold line manually, use the -sv/--sigval
option.
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -sv 7
The line will now be drawn at -log10(1e-7).
To disable the line and not label any SNPs, change the value to 0.
To remove the SNP lables completely, utilize the -nl/--nolabel
option.
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -nl
Note: This can be beneficial to reduce the overall runtime when the analysis results in many significant SNPs.
To deactivate the manhattan and QQ-plots completely, employ the -np/--noplot
option.
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -np
Note: This can be beneficial when using custom vcf files missing e.g. variant positions, or to reduce overall runtime.
Since the resulting plots may be used in various contexts, the font size of the plots produced by vcf2gwas can be changed by using the -fs/--fontsize
option.
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -fs 20
The font sizes of the plots will now be changed from the default value of 26 pt to 20 pt.
During the analysis, various temporary files like subsetted and filtered VCF and .csv
files are produced. By default they are removed once they are no longer needed but if one wants to retain these files, employ the -r/--retain
option:
vcf2gwas -v [filename] -pf [filename] -p 1 -lmm -r
The following part shows some of the output plots and summaries produced by the analysis of the example files using the linear mixed model and standard options.
vcf2gwas will produce quality control plots (saved in the QC
directory) for the phenotype data as well as the genotype data:
- For every analyzed phenotype, its distribution will be plotted.
- For every chromosome of the genotype data, the raw variant density and, if available, DP (depth of coverage), MQ (mapping quality) and QD (Qual normalized by Depth) will be plotted.
To reduce the runtime, quality control can be deactivated with the option -nq/--noqc
.
Manhattan-plot labeling significant SNPs with a standard significant value threshold of -log10(1e-6):
QQ-plot comparing the expected and observed probability distributions:
Amongst other things, vcf2gwas will sort the SNPs of every analyzed phenotype, save the specified amount of top SNPs for each phenotype, summarize these SNPs of all phenotypes to check if certain SNPs occur more than once and optionally compare these SNPs to the genes supplied by the gene file. Below is the output shown of such a gene comparison when supplying a gene file containing information about NLR genes in A. thaliana:
Upstream gene | Upstream gene | Upstream gene | Upstream gene | Downstream gene | Downstream gene | Downstream gene | Downstream gene | ||||
---|---|---|---|---|---|---|---|---|---|---|---|
SNP ID | Chr | Phenotypes | ID | Comment | Name | Distance | SNP pos | Distance | Name | Comment | ID |
3:2237364 | 3 | avrRpm | AT3G07040.1 | NB-ARC domain-containing disease resistance protein | RPM1 | 8340 | 2237364 | ||||
3:2237394 | 3 | avrRpm | AT3G07040.1 | NB-ARC domain-containing disease resistance protein | RPM1 | 8370 | 2237394 | ||||
3:2237446 | 3 | avrRpm | AT3G07040.1 | NB-ARC domain-containing disease resistance protein | RPM1 | 8422 | 2237446 | ||||
3:2237452 | 3 | avrRpm | AT3G07040.1 | NB-ARC domain-containing disease resistance protein | RPM1 | 8428 | 2237452 | ||||
3:2289171 | 3 | avrRpm | AT3G07040.1 | NB-ARC domain-containing disease resistance protein | RPM1 | 60147 | 2289171 | ||||
2:975138 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 84618 | 975138 | 28330 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
2:975234 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 84714 | 975234 | 28234 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
2:975320 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 84800 | 975320 | 28148 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
2:975405 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 84885 | 975405 | 28063 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
2:975411 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 84891 | 975411 | 28057 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
2:975490 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 84970 | 975490 | 27978 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
2:975590 | 2 | avrRpm | AT2G03030.1 | Toll-Interleukin-Resistance (TIR) domain family protein | 85070 | 975590 | 27878 | Toll-Interleukin-Resistance (TIR) domain family protein | AT2G03300.1 | ||
1:11273854 | 1 | avrRpm | 11273854 | 14598 | RAC1 | Disease resistance protein (TIR-NBS-LRR class) family | AT1G31540.2 | ||||
1:11273813 | 1 | avrRpm | 11273813 | 14639 | RAC1 | Disease resistance protein (TIR-NBS-LRR class) family | AT1G31540.2 | ||||
3:2165688 | 3 | avrRpm | 2165688 | 60264 | RPM1 | NB-ARC domain-containing disease resistance protein | AT3G07040.1 |