basic data input and structure vignette
delomast committed Jun 23, 2020
1 parent df28df7 commit b806959
Showing 4 changed files with 32 additions and 4 deletions.
1 change: 1 addition & 0 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -18,3 +18,4 @@ Suggests:
knitr,
rmarkdown
VignetteBuilder: knitr
LazyData: true
11 changes: 11 additions & 0 deletions README.md
@@ -9,3 +9,14 @@ Current options: </br>
* "single-sided" grandparentage: a trio of grandchild + both maternal grandparents (or both paternal grandparents) </br>
* single parentage: a pair of parent + offspring </br>

Install with:
```
devtools::install_github("delomast/gRandma")
```

To install and view the vignette:
```
devtools::install_github("delomast/gRandma", build_vignettes = TRUE)
browseVignettes("gRandma")
```

2 changes: 1 addition & 1 deletion test.R
@@ -30,7 +30,7 @@ for(i in seq(3,ncol(data_mh_snp) - 1, 2)){
to_remove <- c(to_remove, to_remove + 1)
data_mh_snp <- data_mh_snp[,-to_remove]
# add data to package
usethis::use_data(data_mh_snp, overwrite = TRUE)
usethis::use_data(data_mh_snp, overwrite = TRUE, internal = FALSE)

# add vignette
usethis::use_vignette("Load_in_data_and_gmaData_structure")
22 changes: 19 additions & 3 deletions vignettes/Load_in_data_and_gmaData_structure.Rmd
@@ -44,15 +44,15 @@ For most things, you will just take this object and use it as input for other fu

gmaData objects are lists with nine entries:

* `baseline`: A recoded, one-column per call version of the baseline. Genotypes are represented by an integer (starting at 0) and missing genotypes are `NA`.
* `mixture`: A recoded version of the input mixture (if one was input).
* `baseline`: A recoded, one-column per call version of the baseline (potential grandparents/parents). Genotypes are represented by an integer (starting at 0) and missing genotypes are `NA`.
* `mixture`: A recoded version of the input mixture (potential descendants, if one was input).
* `unsampledPops`: experimental - ignore and do not use for now
* `genotypeErrorRates`: a list of matrices giving the genotyping error model. The row label indicates the true genotype and the column indicates the observed genotypes. The values are the probability of observing each genotype given the true genotype.
* `genotypeKeys`: A list of dataframes defining each of the genotypes as represented by integers
* `alleleKeys`: A list of dataframes defining each of the alleles as represented by integers
* `baselineParams`: A list with an entry for each baseline population. Each population is represented by another list of numeric vectors - one for each locus. These vectors are the parameters of a Dirichlet posterior for estimates of allele frequencies given a Dirichlet prior with 1/n for all parameters where n is the number of alleles at that locus and ignoring genotyping error. The allele frequencies used by gRandma for a given baseline population and locus are these values normalized to sum to 1.
* `unsampledPopsParams`: experimental - ignore and do not use for now
* `missingParams`: A list of numeric vectors, one for each locus. These have the parameters of a Beta posterior for estimates of the probability a genotype is missing given a Beta(.5,.5) prior.
* `missingParams`: A list of numeric vectors, one for each locus. These have the parameters of a Beta posterior for estimates of the probability a genotype is missing given a Beta(.5,.5) prior. The probabilities of missing genotypes used by gRandma for a given locus are these values normalized to sum to 1.
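
The normalization described for `baselineParams` and `missingParams` can be sketched in base R as follows. This is only an illustration under stated assumptions: the counts are made up, the names `alleleCounts`, `nMissing`, and `nObserved` are hypothetical and not part of gRandma, and the order of the two Beta parameters (missing first) is assumed rather than taken from the package.

```r
# Hypothetical allele counts at one locus in one baseline population
# (illustrative values, not taken from data_mh_snp)
alleleCounts <- c(A = 12, C = 7, G = 1)
n <- length(alleleCounts)

# Dirichlet posterior parameters: a prior of 1/n per allele plus the observed counts
dirichletParams <- alleleCounts + 1 / n

# Normalizing the parameters to sum to 1 gives the allele frequencies
alleleFreqs <- dirichletParams / sum(dirichletParams)

# Hypothetical counts of missing and observed genotypes at the same locus
nMissing <- 3
nObserved <- 97

# Beta posterior parameters given a Beta(.5, .5) prior
# (assumption: the first parameter corresponds to missing genotypes)
betaParams <- c(missing = nMissing + 0.5, observed = nObserved + 0.5)

# Normalizing gives the probability that a genotype is missing or observed
missingProbs <- betaParams / sum(betaParams)

print(alleleFreqs)
print(missingProbs)
```

Because the prior adds a small positive amount to every parameter, no allele ends up with a frequency of exactly zero, even if it was never observed in a given population.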

```{r}
head(gData_1$baseline)
@@ -65,4 +65,20 @@ gData_1$missingParams[1:5]
```


Now let's pretend we have separate mixture and baseline populations. These should be two separate dataframes with identical columns, except that the mixture dataframe should not have the column representing populations. We can split our example data into two:
```{r}
mixtureData <- data_mh_snp[data_mh_snp$Pop == "Pop_4",]
mixtureData <- mixtureData[,-1] # remove Pop column from mixture
baselineData <- data_mh_snp[data_mh_snp$Pop != "Pop_4",]
mixtureData[1:5,1:5]
baselineData[1:5,1:5]
```
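
Before passing the two dataframes on, it can help to confirm that they differ only by the population column. The check below is a minimal sketch on a made-up dataframe; `toyData` and its column names are hypothetical stand-ins for `data_mh_snp`, and the only assumption carried over from the text is that the population column is named `Pop` and comes first.

```r
# Hypothetical stand-in for data_mh_snp: population, individual, one two-column locus
toyData <- data.frame(
  Pop = c("Pop_1", "Pop_1", "Pop_4", "Pop_4"),
  Ind = c("ind1", "ind2", "ind3", "ind4"),
  Locus1.A1 = c("A", "C", "A", NA),
  Locus1.A2 = c("A", "A", "C", NA),
  stringsAsFactors = FALSE
)

# Split the same way as above: Pop_4 becomes the mixture, without the Pop column
toyMixture <- toyData[toyData$Pop == "Pop_4", -1]
toyBaseline <- toyData[toyData$Pop != "Pop_4", ]

# The only column present in the baseline but not the mixture should be Pop
extraCols <- setdiff(colnames(toyBaseline), colnames(toyMixture))

# The remaining columns should match in both name and order
sameOrder <- identical(colnames(toyBaseline)[-1], colnames(toyMixture))

stopifnot(identical(extraCols, "Pop"), sameOrder)
```

If either condition fails, `stopifnot` raises an error, which catches a mismatch in the columns before the data is used.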

And now we can create a gmaData object with both a baseline and a mixture.
```{r}
gData_2 <- createGmaInput(baseline = baselineData, mixture = mixtureData, perAlleleError = .005, dropoutProb = .005,
markerType = "microhaps")
gData_2
```
