README.Rmd

---
output:
  md_document:
    variant: "gfm"
    standalone: true
    toc: false
    toc_depth: 2
  html_document:
    toc: false
    toc_depth: 2
    self_contained: true
---

# HLA allele frequencies in tab-delimited format

<a href="https://zenodo.org/badge/latestdoi/614108789"><img src="https://zenodo.org/badge/614108789.svg" alt="DOI"></a>

Kamil Slowikowski

`r format(Sys.Date())`

```{r,include=FALSE}
options(width=100)
library(data.table)
library(dplyr)
library(glue)
library(readr)
library(magrittr)
library(ggplot2)
library(ggstance)
devtools::source_gist("c83e078bf8c81b035e32c3fc0cf04ee8", filename = 'render_toc.R')
file_size <- function(x) glue("{fs::file_size(x)}B")
d <- fread("afnd.tsv")
n_hla <- d %>% filter(group == "hla") %>% count(gene) %>% nrow
n_kir <- d %>% filter(group == "kir") %>% count(gene) %>% nrow
n_mic <- d %>% filter(group == "mic") %>% count(gene) %>% nrow
n_cyt <- d %>% filter(group == "cyt") %>% count(gene) %>% nrow
```


**Table of Contents**

```{r toc, echo=FALSE} 
render_toc("README.Rmd", toc_depth = 2)
```


## Introduction

Here, we share a single file [afnd.tsv](afnd.tsv) (`r file_size("afnd.tsv")`) in tab-delimited format with all allele frequencies for `r n_hla` HLA genes, `r n_kir` KIR genes, `r n_mic` MIC genes, and `r n_cyt` cytokine genes from [Allele Frequency Net Database](http://allelefrequencies.net) (AFND).

The script [allelefrequencies.py](allelefrequencies.py) automatically downloads allele frequencies from the website.

[What is the Allele Frequency Net Database?](http://www.allelefrequencies.net/faqs.asp)

> The Allele Frequency Net Database (AFND) is a public database which contains
> frequency information of several immune genes such as Human Leukocyte
> Antigens (HLA), Killer-cell Immunoglobulin-like Receptors (KIR), Major
> histocompatibility complex class I chain-related (MIC) genes, and a number of
> cytokine gene polymorphisms.

The [afnd.tsv](afnd.tsv) file looks like this:

```{r}
d <- fread("afnd.tsv")
head(d)
```

Definitions:

- `alleles_over_2n` (Alleles / 2n)
  Allele Frequency: total number of copies of
  the allele in the population sample in three decimal format.

- `indivs_over_n` (100 \* Individuals / n)
  Percentage of individuals who have the allele or gene.

- `n` (Individuals)
  Number of individuals sampled from the population.


## Examples

Here are a few examples of how we can use R to analyze these data.

View the largest and smallest populations available in the data:

```{r}
d %>%
  mutate(n = parse_number(n)) %>%
  select(population, n) %>%
  unique() %>%
  arrange(-n)
```

Count the number of alleles for each gene:

```{r}
d %>%
  count(group, gene, allele) %>%
  count(group, gene) %>%
  arrange(-n) %>%
  head(15)
```

Sum the allele frequencies for each gene in each population. This allows us to
see which populations have a set of allele frequencies that adds up to 100
percent:

```{r}
d %>%
  mutate(alleles_over_2n = parse_number(alleles_over_2n)) %>%
  filter(alleles_over_2n > 0) %>%
  group_by(group, gene, population) %>%
  summarize(sum = sum(alleles_over_2n)) %>%
  count(sum == 1)
```

```{r, include = FALSE}
theme_set(
  theme_bw(base_size = 14) +
  theme(
    plot.caption.position = "plot"
  )  
)
```

Plot the frequency of a specific allele in populations with more than 1000
sampled individuals:

```{r, dpi = 300, fig.width = 9, fig.height = 7}

my_allele <- "DQB1*02:01"
my_d <- d %>% filter(allele == my_allele) %>%
  mutate(
    n = parse_number(n),
    alleles_over_2n = parse_number(alleles_over_2n)
  ) %>%
  filter(n > 1000) %>%
  arrange(-alleles_over_2n)

ggplot(my_d) +
  aes(x = alleles_over_2n, y = reorder(population, alleles_over_2n)) +
  scale_y_discrete(position = "right") +
  geom_colh() +
  labs(
    x = "Allele Frequency (Alleles / 2N)",
    y = NULL,
    title =  glue("Frequency of {my_allele} across populations"),
    caption = "Data from AFND http://allelefrequencies.net"
  )
```


## Citation

If you use this data, please cite the latest manuscript about **Allele Frequency
Net Database**:

- Gonzalez-Galarza FF, McCabe A, Santos EJMD, Jones J, Takeshita L,
  Ortega-Rivera ND, et al. [Allele frequency net database (AFND) 2020 update:
  gold-standard data classification, open access genotype data and new query
  tools.](https://pubmed.ncbi.nlm.nih.gov/31722398) Nucleic Acids Res. 2020;48:
  D783–D788. doi:10.1093/nar/gkz1029

```
@ARTICLE{Gonzalez-Galarza2020,
  title    = "{Allele frequency net database (AFND) 2020 update: gold-standard
              data classification, open access genotype data and new query
              tools}",
  author   = "Gonzalez-Galarza, Faviel F and McCabe, Antony and Santos, Eduardo
              J Melo Dos and Jones, James and Takeshita, Louise and
              Ortega-Rivera, Nestor D and Cid-Pavon, Glenda M Del and
              Ramsbottom, Kerry and Ghattaoraya, Gurpreet and Alfirevic, Ana
              and Middleton, Derek and Jones, Andrew R",
  journal  = "Nucleic acids research",
  volume   =  48,
  number   = "D1",
  pages    = "D783--D788",
  month    =  jan,
  year     =  2020,
  language = "en",
  issn     = "0305-1048, 1362-4962",
  pmid     = "31722398",
  doi      = "10.1093/nar/gkz1029",
  pmc      = "PMC7145554"
}
```

## Related work

Here are all of the resources I could find that have information about HLA
allele frequencies in different populations.

### CIWD version 3.0.0

- Hurley CK, Kempenich J, Wadsworth K, Sauter J, Hofmann JA, Schefzyk D, et al.
  [Common, intermediate and well-documented HLA alleles in world populations:
  CIWD version 3.0.0.](https://www.ncbi.nlm.nih.gov/pubmed/31970929) Hladnikia.
  2020;95: 516–531. doi:10.1111/tan.13811

The authors provide xlsx files on this website:

- https://www.ihiw18.org/component-immunogenetics/download-common-and-well-documented-alleles-3-0

But the frequency information is binned into categories:

- C, common
- I, intermediate
- WD, well-documented
- NA, not applicable

There is a tool called [HLA-Net](https://hla-net.eu/tools/cwd-viewer/results/)
that provides a visualization of the CIWD data.

### IEDB Tools

http://tools.iedb.org/population/download

At the IEDB Tools page, we can find a tool called **Population Coverage**. The
authors have downloaded the HLA frequency information from AFND and saved it in
a Python pickle file.

### dbMHC

https://www.ncbi.nlm.nih.gov/gv/mhc

The dbMHC database and website appears to be discontinued. But an archive of
old files is still available via FTP.


## Acknowledgments

Thanks to David A. Wells for sharing [scrapeAF][1], which inspired me to work
on this project.

[1]: https://github.com/DAWells/scrapeAF