---
output:
  md_document:
    variant: "gfm"
    standalone: true
    toc: false
    toc_depth: 2
  html_document:
    toc: false
    toc_depth: 2
    self_contained: true
---
# HLA allele frequencies in tab-delimited format
<a href="https://zenodo.org/badge/latestdoi/614108789"><img src="https://zenodo.org/badge/614108789.svg" alt="DOI"></a>
Kamil Slowikowski
`r format(Sys.Date())`
```{r,include=FALSE}
options(width=100)
library(data.table)
library(dplyr)
library(glue)
library(readr)
library(magrittr)
library(ggplot2)
library(ggstance)
devtools::source_gist("c83e078bf8c81b035e32c3fc0cf04ee8", filename = 'render_toc.R')
file_size <- function(x) glue("{fs::file_size(x)}B")
d <- fread("afnd.tsv")
n_hla <- d %>% filter(group == "hla") %>% count(gene) %>% nrow
n_kir <- d %>% filter(group == "kir") %>% count(gene) %>% nrow
n_mic <- d %>% filter(group == "mic") %>% count(gene) %>% nrow
n_cyt <- d %>% filter(group == "cyt") %>% count(gene) %>% nrow
```
**Table of Contents**
```{r toc, echo=FALSE}
render_toc("README.Rmd", toc_depth = 2)
```
## Introduction
Here, we share a single file [afnd.tsv](afnd.tsv) (`r file_size("afnd.tsv")`) in tab-delimited format with all allele frequencies for `r n_hla` HLA genes, `r n_kir` KIR genes, `r n_mic` MIC genes, and `r n_cyt` cytokine genes from the [Allele Frequency Net Database](http://allelefrequencies.net) (AFND).
The script [allelefrequencies.py](allelefrequencies.py) automatically downloads allele frequencies from the website.
[What is the Allele Frequency Net Database?](http://www.allelefrequencies.net/faqs.asp)
> The Allele Frequency Net Database (AFND) is a public database which contains
> frequency information of several immune genes such as Human Leukocyte
> Antigens (HLA), Killer-cell Immunoglobulin-like Receptors (KIR), Major
> histocompatibility complex class I chain-related (MIC) genes, and a number of
> cytokine gene polymorphisms.
The [afnd.tsv](afnd.tsv) file looks like this:
```{r}
d <- fread("afnd.tsv")
head(d)
```
Definitions:
- `alleles_over_2n` (Alleles / 2n)
  Allele frequency: the number of copies of the allele in the population sample divided by the total number of chromosomes sampled (2n), reported to three decimal places.
- `indivs_over_n` (100 \* Individuals / n)
  Percentage of individuals who have the allele or gene.
- `n` (Individuals)
  Number of individuals sampled from the population.
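As a worked example of these definitions, the sketch below computes both values from raw counts. The counts are made up for illustration and do not come from AFND.

```r
# Hypothetical population sample (illustrative numbers only).
n <- 200              # individuals sampled, so 2n = 400 chromosomes
allele_copies <- 30   # copies of the allele observed among the 2n chromosomes
carriers <- 28        # individuals carrying at least one copy

# Alleles / 2n, rounded to three decimal places
alleles_over_2n <- round(allele_copies / (2 * n), 3)

# 100 * Individuals / n
indivs_over_n <- 100 * carriers / n

alleles_over_2n  # 0.075
indivs_over_n    # 14
```

The two values differ because a homozygous individual contributes two copies of the allele but counts only once as a carrier.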
## Examples
Here are a few examples of how we can use R to analyze these data.
View the largest and smallest populations available in the data:
```{r}
d %>%
  mutate(n = parse_number(n)) %>%
  select(population, n) %>%
  unique() %>%
  arrange(-n)
```
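The `parse_number()` call suggests that `n` is read as text rather than as a number, perhaps because some values contain thousands separators. That is an assumption about the file. A minimal sketch of what `readr::parse_number()` does with such strings, using made-up values:

```r
library(readr)

# Hypothetical sample sizes stored as text (not taken from afnd.tsv).
raw_n <- c("1,234", "87", "3,456,066")

# parse_number() treats "," as a grouping mark by default and
# returns a numeric vector.
parsed <- parse_number(raw_n)

parsed        # 1234 87 3456066
max(parsed)   # 3456066
```

Sorting the text values directly would order them alphabetically, which is why the conversion comes before `arrange()`.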
Count the number of alleles for each gene:
```{r}
d %>%
  count(group, gene, allele) %>%
  count(group, gene) %>%
  arrange(-n) %>%
  head(15)
```
Sum the allele frequencies for each gene in each population. This allows us to
see which populations have a set of allele frequencies that adds up to 1 (that
is, 100 percent):
```{r}
d %>%
  mutate(alleles_over_2n = parse_number(alleles_over_2n)) %>%
  filter(alleles_over_2n > 0) %>%
  group_by(group, gene, population) %>%
  summarize(sum = sum(alleles_over_2n)) %>%
  # near() compares with a small tolerance, avoiding floating-point surprises
  count(sums_to_one = near(sum, 1))
```
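Sums of decimal fractions are rarely exact in floating-point arithmetic, and published frequencies are themselves rounded, so a check against 1 is more robust with a tolerance. The sketch below illustrates the idea in base R. The helper `sums_to_one()` and the frequencies are hypothetical and not part of this repository.

```r
# Exact equality can fail for decimal fractions stored as doubles.
0.1 + 0.2 == 0.3                    # FALSE
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE

# Hypothetical helper: does a set of frequencies sum to 1 within `tol`?
sums_to_one <- function(freqs, tol = 1e-8) {
  abs(sum(freqs) - 1) < tol
}

exact   <- sums_to_one(c(0.25, 0.25, 0.5))                  # TRUE
rounded <- sums_to_one(c(0.333, 0.333, 0.333))              # FALSE, sum is 0.999
relaxed <- sums_to_one(c(0.333, 0.333, 0.333), tol = 0.005) # TRUE
```

A tolerance close to the rounding precision of the reported frequencies may be more useful than a very small one, but the right value depends on the data.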
```{r, include = FALSE}
theme_set(
  theme_bw(base_size = 14) +
    theme(
      plot.caption.position = "plot"
    )
)
```
Plot the frequency of a specific allele in populations with more than 1000
sampled individuals:
```{r, dpi = 300, fig.width = 9, fig.height = 7}
my_allele <- "DQB1*02:01"
my_d <- d %>%
  filter(allele == my_allele) %>%
  mutate(
    n = parse_number(n),
    alleles_over_2n = parse_number(alleles_over_2n)
  ) %>%
  filter(n > 1000) %>%
  arrange(-alleles_over_2n)
ggplot(my_d) +
  aes(x = alleles_over_2n, y = reorder(population, alleles_over_2n)) +
  scale_y_discrete(position = "right") +
  geom_colh() +
  labs(
    x = "Allele Frequency (Alleles / 2N)",
    y = NULL,
    title = glue("Frequency of {my_allele} across populations"),
    caption = "Data from AFND http://allelefrequencies.net"
  )
```
## Citation
If you use these data, please cite the latest manuscript about the **Allele
Frequency Net Database**:
- Gonzalez-Galarza FF, McCabe A, Santos EJMD, Jones J, Takeshita L,
Ortega-Rivera ND, et al. [Allele frequency net database (AFND) 2020 update:
gold-standard data classification, open access genotype data and new query
tools.](https://pubmed.ncbi.nlm.nih.gov/31722398) Nucleic Acids Res. 2020;48:
D783–D788. doi:10.1093/nar/gkz1029
```
@ARTICLE{Gonzalez-Galarza2020,
title = "{Allele frequency net database (AFND) 2020 update: gold-standard
data classification, open access genotype data and new query
tools}",
author = "Gonzalez-Galarza, Faviel F and McCabe, Antony and Santos, Eduardo
J Melo Dos and Jones, James and Takeshita, Louise and
Ortega-Rivera, Nestor D and Cid-Pavon, Glenda M Del and
Ramsbottom, Kerry and Ghattaoraya, Gurpreet and Alfirevic, Ana
and Middleton, Derek and Jones, Andrew R",
journal = "Nucleic acids research",
volume = 48,
number = "D1",
pages = "D783--D788",
month = jan,
year = 2020,
language = "en",
issn = "0305-1048, 1362-4962",
pmid = "31722398",
doi = "10.1093/nar/gkz1029",
pmc = "PMC7145554"
}
```
## Related work
Here are all of the resources I could find that have information about HLA
allele frequencies in different populations.
### CIWD version 3.0.0
- Hurley CK, Kempenich J, Wadsworth K, Sauter J, Hofmann JA, Schefzyk D, et al.
[Common, intermediate and well-documented HLA alleles in world populations:
CIWD version 3.0.0.](https://www.ncbi.nlm.nih.gov/pubmed/31970929) HLA.
2020;95: 516–531. doi:10.1111/tan.13811
The authors provide xlsx files on this website:
- https://www.ihiw18.org/component-immunogenetics/download-common-and-well-documented-alleles-3-0
But the frequency information is binned into categories:
- C, common
- I, intermediate
- WD, well-documented
- NA, not applicable
There is a tool called [HLA-Net](https://hla-net.eu/tools/cwd-viewer/results/)
that provides a visualization of the CIWD data.
### IEDB Tools
http://tools.iedb.org/population/download
On the IEDB Tools page, we can find a tool called **Population Coverage**. The
authors have downloaded the HLA frequency information from AFND and saved it in
a Python pickle file.
### dbMHC
https://www.ncbi.nlm.nih.gov/gv/mhc
The dbMHC database and website appear to be discontinued, but an archive of
old files is still available via FTP.
## Acknowledgments
Thanks to David A. Wells for sharing [scrapeAF][1], which inspired me to work
on this project.
[1]: https://github.com/DAWells/scrapeAF