tidyGWAS

Interested in trying tidyGWAS out? Check out the Get started page.

Genome-wide summary statistics are becoming a staple in many genetics and genomics analysis pipelines. The filters each pipeline recommends often differ, so each pipeline needs its own step where summary statistics are “munged”.

tidyGWAS aims to provide a standardized format before any pipeline-specific munging is done. With that in mind, tidyGWAS is conservative about removing rows, and by default keeps both indels and multi-allelic variants.

tidyGWAS does the following:

1. Detection of duplicated rows (based on RSID_REF_ALT or CHR_POS_REF_ALT)
2. Standardized column names
3. Automatic updating of merged RSIDs
4. Detection and optional removal of deletions/insertions (“indels”)
5. Detection of non-rsID values in the RSID column, and automatic parsing of the common CHR:POS or CHR:POS:REF:ALT format
6. Standardization of CHR values (e.g. “23” -> “X”, “chr1” -> “1”)
7. Validation of standard GWAS columns: B, SE, P, N, FREQ, Z, CaseN, ControlN, A1, A2
   1. Extremely small p-values are by default converted to 2.225074e-308 (.Machine$double.xmin, the smallest positive normalized double in R)
8. Imputation of missing columns: RSID from CHR:POS, or CHR:POS from RSID, and any of B, SE, P, Z, N and EAF if missing and possible to derive
9. Validation of CHR:POS:RSID by matching with dbSNP 155
10. Cleaned sumstats are provided with coordinates on both GRCh37 and GRCh38, with TRUE/FALSE flags for indels and for variants that are multi-allelic in the dataset
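
To make a few of the steps above concrete, here is a minimal base-R sketch of what such checks can look like. It is not tidyGWAS code: the helper names (standardize_chr, floor_pvalues, parse_chr_pos, flag_duplicates, z_from_beta, p_from_z) are hypothetical, and the Z and P formulas are the usual Wald-test relations, assumed here rather than taken from the package.

```r
# Illustrative helpers only; none of these are tidyGWAS functions.

# Strip a leading "chr" and map "23" to "X".
standardize_chr <- function(chr) {
  out <- sub("^chr", "", as.character(chr), ignore.case = TRUE)
  out[which(out == "23")] <- "X"
  out
}

# Raise p-values in [0, .Machine$double.xmin) to .Machine$double.xmin.
floor_pvalues <- function(p) {
  too_small <- which(p >= 0 & p < .Machine$double.xmin)
  p[too_small] <- .Machine$double.xmin
  p
}

# Split "CHR:POS" or "CHR:POS:REF:ALT" strings into columns.
# Anything else (for example an rsID) becomes a row of NAs.
parse_chr_pos <- function(id) {
  id <- as.character(id)
  parts <- strsplit(id, ":", fixed = TRUE)
  ok <- lengths(parts) %in% c(2L, 4L) & !grepl("^rs", id)
  pick <- function(i) {
    vapply(seq_along(parts), function(k) {
      if (ok[k] && length(parts[[k]]) >= i) parts[[k]][i] else NA_character_
    }, character(1))
  }
  data.frame(
    CHR = standardize_chr(pick(1)),
    POS = suppressWarnings(as.integer(pick(2))),
    REF = pick(3),
    ALT = pick(4),
    stringsAsFactors = FALSE
  )
}

# Flag every row whose CHR_POS_REF_ALT key occurs more than once.
flag_duplicates <- function(df) {
  key <- paste(df$CHR, df$POS, df$REF, df$ALT, sep = "_")
  duplicated(key) | duplicated(key, fromLast = TRUE)
}

# Standard relations (an assumption about what "imputation" could use):
# Z = B / SE, and a two-sided P from Z.
z_from_beta <- function(b, se) b / se
p_from_z <- function(z) 2 * stats::pnorm(-abs(z))
```

For example, standardize_chr(c("chr1", "23", "X")) returns "1", "X", "X", and parse_chr_pos("23:5000:A:G") yields one row with CHR "X" and POS 5000. Note that flag_duplicates only flags rows, in keeping with the conservative approach to removing data described above.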

From working with standardized GWAS formats, we’ve found that having both GRCh37 and GRCh38 coordinates, together with standardized column names, significantly speeds up downstream analysis.

The computationally intensive step, aligning summary statistics with dbSNP 155 (> 940 million rows per build, about 1.8 billion rows across GRCh37 and GRCh38), is implemented with the Apache Arrow R package. For summary statistics with ~7 million rows, the full function runs in under 3 minutes using less than 16 GB of memory on a MacBook Pro M2.
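
As a rough illustration of why Arrow helps here, the sketch below joins a small table of summary statistics against an on-disk Parquet dataset without first loading the reference into memory. It assumes the arrow and dplyr packages are installed. The toy reference table, its column names, and the toy_dbsnp directory are made up for this example and say nothing about how tidyGWAS actually stores or queries dbSNP.

```r
library(arrow)
library(dplyr)

# Toy stand-in for a dbSNP reference (hypothetical layout).
ref <- data.frame(
  CHR  = c("1", "1", "X"),
  POS  = c(1000L, 2000L, 5000L),
  RSID = c("rs1", "rs2", "rs3"),
  stringsAsFactors = FALSE
)
ref_dir <- file.path(tempdir(), "toy_dbsnp")
write_dataset(ref, ref_dir, format = "parquet")

# Toy summary statistics; the last row has no match in the reference.
sumstats <- data.frame(
  CHR = c("1", "X", "2"),
  POS = c(2000L, 5000L, 7000L),
  B   = c(0.1, -0.2, 0.05),
  stringsAsFactors = FALSE
)

# open_dataset() only scans file metadata; the join is evaluated by Arrow
# and nothing is pulled into R until collect().
matched <- open_dataset(ref_dir) |>
  inner_join(arrow_table(sumstats), by = c("CHR", "POS")) |>
  collect()
```

Here matched holds the two rows that have a counterpart in the reference, now carrying an RSID column. Row order after an Arrow join is not guaranteed, so sort explicitly if order matters.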
