HDL-L is the local version of high-definition likelihood (HDL) (https://github.com/zhenin/HDL). It is specifically tailored for local heritability and genetic correlation analysis using GWAS summary statistics. As a specialized tool, HDL-L addresses the limitations of global correlation estimates by focusing on localized genetic signals which can differ in direction and magnitude across genomic loci.
HDL-L is a full likelihood-based method, ensuring precision and computational efficiency. Notably, HDL-L achieves an approximate fifty-fold increase in computational speed compared to LAVA (Local Analysis of [co]Variant Association), facilitating more rapid analyses without sacrificing accuracy. This efficiency is critical for large-scale genomic studies where traditional methods may falter due to computational demands.
HDL-L not only enhances our ability to identify shared genetic pathways but also deepens our understanding of the biological mechanisms that underpin these relationships. By integrating both local and global correlation estimates, HDL-L provides a comprehensive view of pleiotropic genetic architectures, offering insights that are essential for both theoretical research and practical applications in human genomics.
Here is the preprint version of our paper: https://www.researchsquare.com/article/rs-4568593/v1
HDL-L requires only a standard computer with enough RAM to support the in-memory operations.
OS Requirements: this package is supported on macOS and Linux.
R Requirements: the package depends on the following R packages:
- dplyr
- data.table
- tidyverse
HDL-L can be easily installed from GitHub using the remotes package. If you don't already have remotes installed, the following commands will handle the installation for you:
if (!requireNamespace("remotes", quietly = TRUE))
  install.packages("remotes")
remotes::install_github("YuyingLi-X/HDL-L")
library(HDLL)  # Note: the installed package name omits the hyphen
This installation process only takes a few minutes.
You will need the reference panel and linkage disequilibrium (LD) data for each region pertaining to the European ancestry population from the UK Biobank. Access this data at Zenodo.
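Once downloaded, decompress the archives and note the directory paths. A minimal R sketch, assuming the archives are named LD.path.zip and bim.path.zip as in the argument descriptions below (the local paths are placeholders):

# Decompress the reference panel archives downloaded from Zenodo
# (the archive locations and target directories below are placeholders)
unzip("/Path/to/downloads/LD.path.zip", exdir = "/Path/to/LD/LD.path/")
unzip("/Path/to/downloads/bim.path.zip", exdir = "/Path/to/bim/bim.path/")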
For a detailed, step-by-step tutorial on how to conduct your analysis using HDL-L, please refer to our comprehensive guide available below. This demo only takes a few minutes.
# Load required libraries
library(dplyr)
library(data.table)
library(tidyverse)
# Load example GWAS summary statistics for basal metabolic rate
data(gwas1.example)
# Load example GWAS summary statistics for standing height
data(gwas2.example)
# Specify the paths to the LD and bim files
LD.path <- "/Path/to/LD/LD.path/"
bim.path <- "/Path/to/bim/bim.path/"
# Run HDL-L analysis
res.HDL <- HDL.L(
  gwas1.example, gwas2.example,
  Trait1name = "Basal metabolic rate",
  Trait2name = "Standing height",
  LD.path = LD.path,
  bim.path = bim.path,
  chr = 1,
  piece = 3,
  N0 = 0
)
# View the results
print(res.HDL)
Here we provide a step-by-step tutorial for HDL-L, with a real data example at the end. Before you begin your analysis, make sure you have downloaded the reference panel and LD data:
As with HDL, we have prepared a pre-computed reference panel and LD data for each region, based on the European-ancestry population of the UK Biobank. You can download them from Zenodo.
In the "LD.path", it includes LD files, eigenvectors, and eigen matrixes for all local regions, end by "_LDSVD.rda"
In the "bim.path", it includes bim files for local regions, which helps to clean the summary statistics data and check if there are multiallelic or duplicated SNPs
In addition to the existing HDL-L reference panel, we have developed a new LD reference panel specifically tailored for cis-protein quantitative trait loci (cis-pQTL) analysis. This new panel was constructed using data from the Olink Proteomics proximity extension assay (PEA), focusing on approximately 3,000 proteins in the Olink® Explore panel. It can be downloaded from Zenodo.
If you want to build a reference panel for your own predefined loci, you can use compute_ld_eigen.R.
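The script handles the full workflow; as a rough illustration of the underlying step, the sketch below eigen-decomposes a local LD (correlation) matrix computed from a genotype matrix (the genotype object and locus are simulated placeholders, not part of compute_ld_eigen.R):

# geno: an n x m genotype matrix for the SNPs in one predefined locus (simulated here)
geno <- matrix(rbinom(1000 * 50, 2, 0.3), nrow = 1000)
# LD matrix as the SNP-by-SNP correlation matrix
LD <- cor(geno)
# Eigen-decomposition: the eigenvalues and eigenvectors used by the likelihood
LD.eigen <- eigen(LD, symmetric = TRUE)
head(LD.eigen$values)   # eigenvalues
dim(LD.eigen$vectors)   # eigenvectors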
To analyze your data using HDL-L, it is crucial to format your summary statistics correctly. Below are the required columns that your input data file must include:
- SNP: SNP ID
- A1: Effect allele
- A2: Reference allele
- N: Sample size
- Z: Z-score

If Z is not available, you may instead provide:

- b: Estimate of marginal effect in GWAS
- se: Standard error of the estimates of marginal effects in GWAS

If the GWAS is based on logistic regression, b should be the logarithm of the OR (odds ratio) and se should be the standard error of log(OR).
The summary statistics should look like (b and se can be absent in this example since Z is available):
## SNP A1 A2 N b se Z
## 1 rs3131962 G A 205475 0.001004 0.004590 0.2187
## 2 rs12562034 A G 205475 0.005382 0.005011 1.0740
## 3 rs11240779 A G 205475 0.002259 0.003691 0.6119
## 4 rs57181708 G A 205475 0.005401 0.005114 1.0562
## 5 rs4422948 G A 205475 0.005368 0.003604 1.4893
## 6 rs4970383 A C 205475 0.004685 0.003582 1.3080
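If your raw summary statistics use different column names, a minimal dplyr sketch for renaming them and deriving Z from b and se (the raw column names below are hypothetical):

library(dplyr)
# raw.gwas: your raw summary statistics (the column names on the right are hypothetical)
gwas.formatted <- raw.gwas %>%
  transmute(SNP = rsid,
            A1  = effect_allele,
            A2  = other_allele,
            N   = sample_size,
            Z   = beta / standard_error)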
You can use HDL.L.data.wrangling.R to wrangle data from the Neale Lab round 2 UK Biobank GWAS with the following command:
Rscript /Path/to/HDL/HDL.L.data.wrangling.R \
gwas.file=/Path/to/gwas/data/datafile \
LD.path=/Path/to/LD.path/ \
GWAS.type=UKB.Neale \
output.file=/Path/to/gwas/gwas1 \
log.file=/Path/to/log/gwas1
If the GWAS comes from another source, you need to tell HDL.L.data.wrangling.R explicitly how to interpret the variable names in the GWAS. Otherwise, the syntax is the same as in the previous section. For example, if your GWAS looks like this:
## rsid alt ref tstat n_complete_samples beta se
## 1 rs3131962 G A 0.2187 205475 0.001004 0.004590
## 2 rs12562034 A G 1.0740 205475 0.005382 0.005011
## 3 rs11240779 A G 0.6119 205475 0.002259 0.003691
## 4 rs57181708 G A 1.0562 205475 0.005401 0.005114
## 5 rs4422948 G A 1.4893 205475 0.005368 0.003604
## 6 rs4970383 A C 1.3080 205475 0.004685 0.003582
You should run HDL.L.data.wrangling.R with one of the following commands:
Rscript /Path/to/HDL/HDL.L.data.wrangling.R \
gwas.file=/Path/to/gwas/data/datafile \
LD.path=/Path/to/LD.path/ \
SNP=rsid A1=alt A2=ref N=n_complete_samples Z=tstat \
output.file=/Path/to/gwas/gwas1 \
log.file=/Path/to/log/gwas1
or
Rscript /Path/to/HDL/HDL.L.data.wrangling.R \
gwas.file=/Path/to/gwas/data/datafile \
LD.path=/Path/to/LD.path/ \
SNP=rsid A1=alt A2=ref N=n_complete_samples b=beta se=se \
output.file=/Path/to/gwas/gwas1 \
log.file=/Path/to/log/gwas1
You can run HDL.L on the entire genome or on specific regions. It supports parallel processing, so you can configure the run to use multiple cores. On a single core, analyzing all local regions typically takes about 1.5 hours; allocating additional cores reduces the overall computation time accordingly. Detailed guidance is given in the example below.
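As a rough sketch of how regions could be processed in parallel from within R using parallel::mclapply (the region table is a made-up placeholder and the GWAS objects are reused from the quick-start example above; in practice HDL.L.run.R handles this for you via its cores argument):

library(parallel)
# regions: (chr, piece) pairs to analyze; the real region definitions come with the LD reference panel
regions <- data.frame(chr = c(1, 1, 2), piece = c(1, 2, 1))
res.list <- mclapply(seq_len(nrow(regions)), function(i) {
  HDL.L(gwas1.example, gwas2.example,
        Trait1name = "Basal metabolic rate",
        Trait2name = "Standing height",
        LD.path = LD.path, bim.path = bim.path,
        chr = regions$chr[i], piece = regions$piece[i], N0 = 0)
}, mc.cores = 4)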
The HDL.L function takes several arguments as outlined below:
- gwas1.df: The first formatted summary statistics data. The data frame should include the following columns: SNP (SNP ID), A1 (effect allele), A2 (reference allele), N (sample size), and Z (Z-score). Alternatively, if Z is not available, you may include b (estimate of marginal effect in GWAS) and se (standard error of the estimates of marginal effects in GWAS).
- gwas2.df: The second formatted summary statistics data, with the same columns as gwas1.df.
- Trait1name: The trait name for gwas1.df.
- Trait2name: The trait name for gwas2.df.
- LD.path: Path to the directory where the decompressed LD.path.zip file is stored.
- bim.path: Path to the directory where the decompressed bim.path.zip file is stored.
- Nref: Sample size of the reference sample where LD is computed. If the default UK Biobank reference sample is used, Nref = 335272.
- N0: Number of individuals included in both cohorts.
- output.file: Where the log and results should be written. If you do not specify a file, the log is printed to the console.
- eigen.cut: Which eigenvalues and eigenvectors in each LD score matrix should be used for HDL-L. The default value is 0.95; users may specify any numeric value between 0 and 1.
- intercept.output: Logical, FALSE by default. Determines whether the intercept terms are included in the result estimates.df.
- fill.missing.N: If NULL (default), SNPs with missing N are removed. You can specify "median", "min", or "max" so that missing N values are filled accordingly; for example, "median" fills missing N with the median N of the SNPs with available N (see the sketch after this list).
- lim: Tolerance limit, default lim = exp(-18).
- chr: The chromosome to which the region belongs.
- piece: The piece of the genome to which the region belongs. The whole genome is divided into 2,476 smaller, semi-independent blocks, each defined by LD computed with PLINK. The SNP information for each local region is included in two data sets: "UKB_snp_counter_imputed.RData" and "UKB_snp_list_imputed_vector.RData".
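To make the fill.missing.N behavior concrete, an equivalent manual fill with the "median" option would look roughly like this (gwas.df is a placeholder data frame with the columns described above):

# Fill missing sample sizes with the median of the available ones
# (mirrors fill.missing.N = "median"; "min" and "max" work analogously)
gwas.df$N[is.na(gwas.df$N)] <- median(gwas.df$N, na.rm = TRUE)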
We provide a real data example from UKBB:
cd /Path/to/Your/directory/
wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/21001_irnt.gwas.imputed_v3.female.tsv.bgz -O 21001_irnt.gwas.imputed_v3.female.tsv.bgz
wget https://broad-ukb-sumstats-us-east-1.s3.amazonaws.com/round2/additive-tsvs/I9_CHD.gwas.imputed_v3.male.tsv.bgz -O I9_CHD.gwas.imputed_v3.male.tsv.bgz
Rscript /Path/to/HDL/HDL.L.data.wrangling.R \
gwas.file=/Path/to/gwas/21001_irnt.gwas.imputed_v3.female.tsv.bgz \
LD.path=/Path/to/LD.path/ \
GWAS.type=UKB.Neale \
output.file=/Path/to/gwas/gwas1 \
log.file=/Path/to/log/gwas1
Rscript /Path/to/HDL/HDL.L.data.wrangling.R \
gwas.file=/Path/to/gwas/I9_CHD.gwas.imputed_v3.male.tsv.bgz \
LD.path=/Path/to/LD.path/ \
GWAS.type=UKB.Neale \
output.file=/Path/to/gwas/gwas2 \
log.file=/Path/to/log/gwas2
Rscript /Path/to/HDL/HDL.L.run.R \
gwas1.df=/Path/to/gwas/gwas1.hdl.rds \
gwas2.df=/Path/to/gwas/gwas2.hdl.rds \
Trait1name="gwas1" \
Trait2name="gwas2" \
LD.path=/Path/to/LD.path/ \
bim.path=/Path/to/bim.path/ \
N0=0 \
output.file=/Path/to/output/test.raw.gwas.Rout \
type="WG" \
cores=1 \
save.path=/Path/to/save/result/
If you only want to run HDL-L on a specific region, you can use the following command:
Rscript /Path/to/HDL/HDL.L.run.R \
gwas1.df=/Path/to/gwas/gwas1.hdl.rds \
gwas2.df=/Path/to/gwas/gwas2.hdl.rds \
Trait1name="gwas1" \
Trait2name="gwas2" \
LD.path=/Path/to/LD.path/ \
bim.path=/Path/to/bim.path/ \
N0=0 \
output.file=/Path/to/output/test.raw.gwas.Rout \
chr=1 \
piece=147 \
cores=1 \
save.path=/Path/to/save/result/
The results of HDL.L are saved under save.path as an .rda file; here is an example of the output log:
Attaching package: ‘data.table’
The following objects are masked from ‘package:dplyr’:
between, first, last
Loading GWAS1 ...
Loading GWAS2 ...
Processing chromosome 1 region 147
Analysis starts on Mon Apr 22 00:58:23 2024
0 SNPs were removed in GWAS 1 due to missing N or missing test statistic.
0 SNPs were removed in GWAS 2 due to missing N or missing test statistic.
329 out of 329 (100%) SNPs in reference panel are available in GWAS 1.
329 out of 329 (100%) SNPs in reference panel are available in GWAS 2.
Point estimates:
Heritability of phenotype 1: 8.25e-05
Heritability of phenotype 2: 2e-04
Genetic Covariance: 7.40e-05
Genetic Correlation: 0.6396
Heritability of phenotype 1: 8.25e-05 ,P:0.0893
Heritability of phenotype 2: 2e-04 , P:0.00e+00
Genetic Correlation: 0.6396 (-0.4085,1)
P: 0.2480275
Analysis finished at Mon Apr 22 00:58:25 2024
Saving results ...
Finished!
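A minimal sketch for loading the saved results back into R afterwards (the file name and the object names inside the .rda file are assumptions; check the contents of your save.path):

# Load the saved results into a fresh environment and list what is inside
res.env <- new.env()
load("/Path/to/save/result/result_chr1_piece147.rda", envir = res.env)  # file name is a placeholder
ls(res.env)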
If you use the HDL-L software, please cite:
- Li, Y., Pawitan, Y., & Shen, X. An enhanced framework for local genetic correlation analysis. (2024)
- Ning, Z., Pawitan, Y. & Shen, X. High-definition likelihood inference of genetic correlations across human complex traits. Nat Genet (2020).
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
- Report bugs by opening a new issue on this GitHub page.
- Send email to the authors: Yuying Li or Xia Shen.
This repository will be integrated into the HDL software in the future.