The goal of ensemblQueryR is to seamlessly integrate querying of Ensembl databases into your R workflow. It does this by formatting and submitting user queries to the Ensembl API. In its current iteration, the package contains functions for the three Ensembl Linkage Disequilibrium (LD) 'endpoints': (1) query LD in a window around one SNP, (2) query LD for a pair of query SNPs, and (3) query LD for SNPs at a specified genomic locus.
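To make "formatting user queries" concrete, here is a minimal sketch of what request URLs for the three LD endpoints might look like. The helper names below are hypothetical and are not part of ensemblQueryR. The URL layout follows my reading of the public Ensembl REST documentation, the species is assumed to be human, and the package itself may build its requests differently. No request is sent; the functions only assemble strings.

```r
# Hypothetical helpers (not ensemblQueryR functions): build Ensembl REST
# URLs for the three LD endpoints without sending any request.
base_url <- "https://rest.ensembl.org"

# 1. LD in a window around one SNP
build_ld_window_url <- function(rsid, pop, r2 = 0.8, d_prime = 0.8, window_size = 500) {
  sprintf("%s/ld/human/%s/%s?r2=%s;d_prime=%s;window_size=%s",
          base_url, rsid, pop, r2, d_prime, window_size)
}

# 2. LD between a pair of SNPs
build_ld_pair_url <- function(rsid1, rsid2, pop) {
  sprintf("%s/ld/human/pairwise/%s/%s?population_name=%s",
          base_url, rsid1, rsid2, pop)
}

# 3. LD for SNPs within a genomic region, written as chr:start..end
build_ld_region_url <- function(chr, start, end, pop) {
  sprintf("%s/ld/human/region/%s:%s..%s/%s",
          base_url, chr, start, end, pop)
}

window_url <- build_ld_window_url("rs3851179", "1000GENOMES:phase_3:EUR")
pair_url   <- build_ld_pair_url("rs6792369", "rs1042779", "1000GENOMES:phase_3:EUR")
region_url <- build_ld_region_url("6", "25837556", "25843455", "1000GENOMES:phase_3:EUR")
print(c(window_url, pair_url, region_url))
```

In practice the package functions take care of this step, so there should be no need to build URLs by hand.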
You can install the development version of ensemblQueryR like so:
# load remotes package
library(remotes)
remotes::install_github("ainefairbrother/ensemblQueryR")
All functions in this package take the pop argument, which defines the population for which to retrieve LD statistics. To get a list of options for this argument, run the ensemblQueryGetPops() function.
# load libraries
library(ensemblQueryR)
library(magrittr)
ensemblQueryR::ensemblQueryGetPops()
For one query rsID, get all rsIDs in LD using ensemblQueryLDwithSNPwindow:
ensemblQueryR::ensemblQueryLDwithSNPwindow(
  rsid="rs3851179",
  r2=0.8,
  d.prime=0.8,
  window.size=500,
  pop="1000GENOMES:phase_3:EUR"
)
For a vector of query rsIDs, use ensemblQueryLDwithSNPwindowList to get all rsIDs in LD, provided the query contains fewer than 1000 rsIDs. This restriction comes from Ensembl's 1000-query limit. See the next example for queries of more than 1000 rsIDs.
rsid.vec <- c("rs7153434","rs1963154","rs12672022","rs3852802","rs12324408","rs56346870")
# run query on rsid.vec
ensemblQueryR::ensemblQueryLDwithSNPwindowList(
  rsid.vec,
  r2=0.8,
  d.prime=0.8,
  window.size=500,
  pop="1000GENOMES:phase_3:EUR"
)
There is a separate function for large queries (>1000 rsIDs) because of Ensembl's API query size limit. This function, ensemblQueryLDwithSNPwindowDataframe, takes a data.frame as input and gets all rsIDs in LD with each query rsID listed in a column called rsid.
# example input data
in.table <- data.frame(rsid=rep(c("rs7153434","rs1963154","rs12672022","rs3852802","rs12324408","rs56346870"), 500))
# run query on in.table
ensemblQueryR::ensemblQueryLDwithSNPwindowDataframe(
  in.table=in.table,
  r2=0.8,
  d.prime=0.8,
  window.size=500,
  pop="1000GENOMES:phase_3:EUR"
)
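The 1000-query limit mentioned above is the kind of constraint usually handled by batching. As an illustration only, the sketch below splits a long vector of rsIDs into batches of at most 1000 using base R. The helper name chunk_rsids is hypothetical, and whether ensemblQueryR batches its requests this way internally is an assumption, not something the package documentation states.

```r
# Hypothetical helper (not an ensemblQueryR function): split a vector of
# rsIDs into consecutive batches of at most `max_size` elements.
chunk_rsids <- function(rsids, max_size = 1000) {
  # ceiling(i / max_size) assigns positions 1..1000 to batch 1,
  # 1001..2000 to batch 2, and so on
  split(rsids, ceiling(seq_along(rsids) / max_size))
}

# same toy input as the data.frame example: 6 rsIDs repeated 500 times
rsids <- rep(c("rs7153434", "rs1963154", "rs12672022",
               "rs3852802", "rs12324408", "rs56346870"), 500)

batches <- chunk_rsids(rsids)

# number of batches and the size of each one
print(length(batches))
print(lengths(batches))
```

Each element of batches could then be passed to a function that accepts a vector of rsIDs, with the results combined afterwards.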
# query LD between a pair of SNPs
ensemblQueryLDwithSNPpair(
  rsid1="rs6792369",
  rsid2="rs1042779",
  pop="1000GENOMES:phase_3:EUR"
)
# query LD for many SNP pairs supplied as a data.frame
ensemblQueryLDwithSNPpairDataframe(
  in.table=data.frame(rsid1=rep("rs6792369", 10), rsid2=rep("rs1042779", 10)),
  pop="1000GENOMES:phase_3:EUR",
  keep.original.table.row.n=FALSE,
  parallelise=FALSE
)
# query LD for SNPs within a specified genomic region
ensemblQueryLDwithSNPregion(
  chr="6",
  start="25837556",
  end="25843455",
  pop="1000GENOMES:phase_3:EUR"
)
Please note that this code is still under development and may contain bugs or errors. It is not recommended for use in production environments. Use at your own risk. I am working on improving the code, addressing any issues, and expanding the package's capabilities, so please check back for updates.