Hail is a framework for scalable genetic data analysis. It is pre-alpha software under active development, written mostly in Scala, and uses Apache Spark and other Apache Hadoop projects. If you are interested in getting involved in Hail development, email [email protected].
- Building
- Representation
- Hail's expression language
- Importing VCF files
- Importing sites-only text files
- Splitting Multiallelic Variants
- Renaming Samples
- Annotating Variants
- Annotating Samples
- Annotating Global
- Imputing Sex
- Quality Control
- PCA
- Annotating with the Variant Effect Predictor
- Computing aggregate statistics on intervals
- Filtering
- Querying using SQL
- Linear regression
- Mendel errors
- Exporting to TSV
- Exporting to VCF
- Exporting to Plink
- Persist
If you use Hail for published work, please cite both the software:
and the forthcoming manuscript describing Hail (if possible):
- Cotton Seed, Alex Bloemendal, Jonathan M Bloom, Jacqueline I Goldstein, Daniel King, Timothy Poterba. Hail: An Open-Source Framework for Scalable Genetic Data Analysis. In preparation.
or the following paper, which includes a brief introduction to Hail in its online methods:
- Andrea Ganna, Giulio Genovese, Daniel P Howrigan, Andrea Byrnes, Mitja Kurki, Seyedeh M Zekavat, Christopher W Whelan, Robert E Handsaker, Mart Kals, Alex Bloemendal, Jonathan M Bloom, Jacqueline I Goldstein, Timothy Poterba, Cotton Seed, Michel G Nivard, Pradeep Natarajan, Reedik Magi, Diane Gage, Elise B Robinson, Andres Metspalu, Veikko Salomaa, Jaana Suvisaari, Shaun M Purcell, Pamela Sklar, Sekar Kathiresan, Mark J Daly, Steven A McCarroll, Patrick F Sullivan, Aarno Palotie, Tonu Esko, Christina Hultman, Benjamin M Neale. Ultra-rare disruptive and damaging mutations influence educational attainment in the general population. doi: http://dx.doi.org/10.1101/050195.
Here is a rough list of features currently planned or under development:
- generalized query language
- better interoperability with other Hadoop projects
- kinship estimation from the genetic relationship matrix (GRM)
- linear mixed models (LMM)
- burden tests and the sequence kernel association test (SKAT)
- logistic regression
- posterior probabilities (PP)
- LD pruning
- transmission disequilibrium test (TDT)
- Kaitlin Samocha's de novo caller