update readme
plantimals committed Aug 25, 2019
1 parent 62e0040 commit 8461b1e
Showing 1 changed file with 19 additions and 5 deletions.
24 changes: 19 additions & 5 deletions readme.md
@@ -1,12 +1,26 @@
# 2vcf - convert raw 23andme or ancestry.com data to VCF
## 2vcf

[VCF](https://samtools.github.io/hts-specs/VCFv4.3.pdf) is a widely adopted format for storing detailed data about genetic variation. Services like [23andme](https://www.23andme.com/) and [ancestry.com](https://www.ancestry.com/) offer to genotype customers at fewer than a million well-characterized sites in the human genome. It is possible to obtain the raw data collected by these services, but the raw data are provided in a minimal format, which is not trivial to enrich and transform into VCF. _2vcf_ converts the raw output from 23andme or ancestry.com into a gzipped VCF file. The output VCF is populated with [human variant data](https://www.ncbi.nlm.nih.gov/variation/docs/human_variation_vcf/), which includes all alternate alleles, annotations, etc.
in order to improve individual sovereignty over genetic/genomic information, facilitate a deeper understanding of biology and computation, and promote shared meaning, openb.io provides `2vcf` under the [MIT license](https://mit-license.org). `2vcf` will convert raw genotype data exports from [23andme](https://www.23andme.com) or [Ancestry.com](https://www.ancestry.com) into [VCF format](https://samtools.github.io/hts-specs/VCFv4.2.pdf).

In order to build 2vcf, the [golang](https://golang.org/) build tool is required. On OS X, use [homebrew](https://brew.sh/) to install it: `brew install go`.
`2vcf` produces a VCF that contains annotations from dbSNP [build 151](https://github.com/ncbi/dbsnp/tree/master/Build%20Announcements/151) on `GRCh37.p13`. these annotations include allele frequencies from various sources including [1000 Genomes](https://www.internationalgenome.org) and [ExAC](http://exac.broadinstitute.org/), [RefSeq](https://www.ncbi.nlm.nih.gov/refseq/) gene annotations, and functional class of the variant.

Build 2vcf by checking out the [source repo](https://github.com/plantimals/2vcf), entering the directory (`cd 2vcf`), and running `make`. Build for Windows by using `make windows`.
the source VCF for dbSNP build 151 weighs in at around 15GB. the sites assayed by personal genomics companies are but a tiny fraction of the totality of dbSNP sites, so I make available a reference version of the dbSNP VCF which has been filtered down to those sites likely to be contained in your raw data export from 23andme or Ancestry.com. for more details on which sites are included and why, see this writeup on the sources for `2vcf reference v2.0`.
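
To make the idea of a filtered reference concrete, here is a minimal sketch of one way to reduce a gzipped VCF to a chosen set of sites. It is an illustration only and not necessarily how `2vcf reference v2.0` was built. The function name `filter_vcf_by_ids` and the file names are hypothetical, and the sketch assumes a plain text file with one rsID per line plus standard `gunzip` and `awk`.

```shell
# filter_vcf_by_ids IDS_FILE VCF_GZ   (hypothetical helper, for illustration)
# Prints the VCF header lines plus every record whose ID column (3rd field)
# appears in IDS_FILE, which is assumed to hold one rsID per line.
filter_vcf_by_ids() {
  ids_file=$1
  vcf_gz=$2
  gunzip -c "$vcf_gz" | awk -F '\t' -v ids="$ids_file" '
    BEGIN {
      # Load the wanted rsIDs into a lookup table before reading the VCF.
      while ((getline line < ids) > 0) keep[line] = 1
      close(ids)
    }
    /^#/ { print; next }   # meta-information and column header pass through
    $3 in keep             # data records: print only when the ID is wanted
  '
}
```

For example, `filter_vcf_by_ids ids.txt dbsnp.vcf.gz | gzip > filtered.vcf.gz` would write a smaller gzipped VCF. Loading the ID list in the `BEGIN` block keeps the VCF itself streaming through the pipe, so the full 15GB file never has to be decompressed to disk.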

Convert your raw data by running the utility: `./2vcf --input-file my-raw-data.zip --output-file my-personal-genotypes.vcf.gz`. Running the utility from another location also works, but remember to specify the path to the reference data: `--vcf-ref /home/me/git/2vcf/reference.vcf.gz`.
## usage

1. download the appropriate binary for your architecture from the [most recent github release](https://github.com/plantimals/2vcf/releases/tag/v0.4.0). un-tar the contents after downloading.

2. download the [reference vcf](http://openb.io/2vcf/2vcf-v2.0.vcf.gz).

3. download your raw genotype data from [23andme](https://customercare.23andme.com/hc/en-us/articles/212196868-Accessing-and-Downloading-Your-Raw-Data) or [Ancestry](https://support.ancestry.com/s/article/Downloading-AncestryDNA-Raw-Data).

4. now run the `2vcf` binary with the appropriate options:

```
./2vcf conv 23andme --ref path/to/2vcf-v2.0.vcf.gz \
--input path/to/my/raw/genotypes.zip \
--output my-personal-annotated.vcf.gz
```
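
After the conversion finishes, a quick sanity check on the output can help confirm that records were actually written. The sketch below counts the data records in a gzipped VCF. The helper name `count_vcf_records` is hypothetical and not part of `2vcf`; it assumes only standard `gunzip` and `awk`.

```shell
# count_vcf_records VCF_GZ   (hypothetical helper, not part of 2vcf)
# Prints the number of data records, i.e. lines that do not start with '#'.
count_vcf_records() {
  gunzip -c "$1" | awk '!/^#/ { n++ } END { print n + 0 }'
}
```

Running `count_vcf_records my-personal-annotated.vcf.gz` prints a single number. A count of zero suggests that the input, the reference, or the chosen subcommand did not match.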

Please report any errors or difficulties with the utility.
