Skip to content

Commit

Permalink
New README
Browse files Browse the repository at this point in the history
  • Loading branch information
HectorRDB committed Jul 8, 2018
1 parent cce979c commit a410ad1
Show file tree
Hide file tree
Showing 12 changed files with 144 additions and 146 deletions.
67 changes: 61 additions & 6 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,6 +3,16 @@ Understanding RNA-seq analysis through an example: voting behavior at the French
Hector Roux de Bézieux
May 2018

- [Introduction](#introduction)
- [1 Context](#context)
- [2 Loading the data and building an ExpressionSet object](#loading-the-data-and-building-an-expressionset-object)
- [3 Data Cleaning](#data-cleaning)
- [3.1 Filtering the cells (delegate)](#filtering-the-cells-delegate)
- [3.2 Filtering the genes (votes)](#filtering-the-genes-votes)
- [Normalization and correcting for zero inflation](#normalization-and-correcting-for-zero-inflation)
- [Recovering / discovering biological types](#recovering-discovering-biological-types)
- [Thanks](#thanks)

Introduction
============

Expand Down Expand Up @@ -49,7 +59,7 @@ Assay <- ExpressionSet(assayData = voting_record,
3.1 Filtering the cells (delegate)
----------------------------------

In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough, or because the delegate was really not voting much. A special case is of course the chairman of the Assembly who do not take part in the votes.
In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough.

We therefore filter delegates whose number of votes is more than 2 SD below the mean (so equal or less to 40).

Expand All @@ -60,21 +70,46 @@ Nb_Votes <- data.frame(Nb_Votes = colSums(!is.na(exprs(Assay))),
ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 40)) +
stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 20)) +
labs(x = "Number of votes") +
guides(fill = guide_legend(title = "Less than 40 votes"))
guides(fill = guide_legend(title = "Could vote\nless than\n40 times"))
```

<img src="Single-cell_files/figure-markdown_github/filtering cells-1.png" style="display: block; margin: auto;" />

``` r
Assay <- Assay[, colSums(!is.na(exprs(Assay))) > 40]
Assay <- Assay[, Nb_Votes$Nb_Votes > 40]
```

We then simplify the data by replacing NAs by 0 (as abstentions). We can then check whether some delegates never voted. Again, we remove people whose number of votes is more than 2 SD below the mean number of votes (so delegates who voted less than 38 votes).

``` r
E <- exprs(Assay)
E[is.na(E)] <- 0
pd <- phenoData(Assay)
fd <- featureData(Assay)
colnames(E) <- rownames(pd) <- pd@data$identifiant
rownames(E) <- rownames(fd) <- fd@data$Number
Assay <- ExpressionSet(assayData = E, phenoData = pd, featureData = fd)

Nb_Votes <- data.frame(Nb_Votes = colSums(exprs(Assay) != 0),
delegate = rownames(phenoData(Assay)))
ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 38)) +
stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 19)) +
labs(x = "Number of votes") +
guides(fill = guide_legend(title = "Actually voted\nless than\n38 times"))
```

<img src="Single-cell_files/figure-markdown_github/change NA to zeros-1.png" style="display: block; margin: auto;" />

``` r
Assay <- Assay[, Nb_Votes$Nb_Votes > 38]
```

3.2 Filtering the genes (votes)
-------------------------------

In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually intersted in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least *i* counts in *j* cells" are usually quite efficient. The aim is too be quite broad.
In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually interested in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least *i* counts in *j* cells" are usually quite efficient. The aim is too be quite broad.

In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwehlming majority of delegates voted "yes", or voted "no. We therefore filter out votes with a low standard deviation (&lt; 0.5). This filters out 6.1% of the votes. Note that in scRNA-seq settings, we would usually filter at least 50% of the reads (while keeping more than 90% of the reads).
In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwhelming majority of delegates voted "yes", or voted "no. We therefore filter out votes with a low standard deviation (&lt; 0.5). This filters out 6.1% of the votes. Note that in scRNA-seq settings, we would usually filter at least 50% of the reads (while keeping more than 90% of the reads).

``` r
Vote_repartion <- data.frame(sd = rowSds((exprs(Assay)), na.rm = T),
Expand All @@ -87,9 +122,29 @@ ggplot(Vote_repartion, aes(x = sd, fill = sd < 0.5)) + geom_histogram() +
<img src="Single-cell_files/figure-markdown_github/filtering genes-1.png" style="display: block; margin: auto;" />

``` r
Assay <- Assay[rowSds((exprs(Assay))) >= 0.5, ]
Assay <- Assay[rowSds((exprs(Assay)), na.rm = T) >= 0.5, ]
```

Normalization and correcting for zero inflation
===============================================

Recovering / discovering biological types
=========================================

In most scRNA-seq settings, we have no information about the biological type. We therefore want to cluster the cells and then use marker genes to identify known-cell types. Alternatively, we can use clustering to discover new cell types and identify marker genes for those new types.

In our example, let us imagine for a moment that we do not know the political group to which each delegate belong. Can we recover those groups? This will be done in two steps: discover stable clusters of delegates (cells) then identify vote (genes) that help distinguish those clusters. \#\# Clustering

``` r
pca <- prcomp(t(exprs(Assay)), center = T, scale = T)
ggplot(data.frame(x = pca$x[,1], y = pca$x[,2], col = phenoData(Assay)@data$group),
aes(x = x, fill = col)) +
facet_wrap(col~.) +
geom_density()
```

<img src="Single-cell_files/figure-markdown_github/clustering-1.png" style="display: block; margin: auto;" />

Thanks
======

Expand Down
66 changes: 53 additions & 13 deletions Single-cell.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -2,10 +2,9 @@
title: "Understanding RNA-seq analysis through an example: voting behavior at the French parliament"
author: "Hector Roux de Bézieux"
date: "May 2018"
output:
github_document: default
html_document: default
pdf_document: default
output:
github_document:
toc: true
---

```{r packages, include=F}
Expand Down Expand Up @@ -60,36 +59,77 @@ Assay <- ExpressionSet(assayData = voting_record,
# 3 Data Cleaning
## 3.1 Filtering the cells (delegate)

In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough, or because the delegate was really not voting much. A special case is of course the chairman of the Assembly who do not take part in the votes.
In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough.

We therefore filter delegates whose number of votes is more than 2 SD below the mean (so equal or less to 40).

```{r filtering cells, fig.height=3}
```{r filtering cells}
Nb_Votes <- data.frame(Nb_Votes = colSums(!is.na(exprs(Assay))),
delegate = rownames(phenoData(Assay)))
ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 40)) +
stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 20)) +
labs(x = "Number of votes") +
guides(fill = guide_legend(title = "Less than 40 votes"))
Assay <- Assay[, colSums(!is.na(exprs(Assay))) > 40]
guides(fill = guide_legend(title = "Could vote\nless than\n40 times"))
Assay <- Assay[, Nb_Votes$Nb_Votes > 40]
```

We then simplify the data by replacing NAs by 0 (as abstentions). We can then check whether some delegates voted only rarely. Again, we remove people whose number of votes is more than 2 SD below the mean number of votes (so delegates who voted less than 38 votes). Among those, we of course find the president of the parliament, who does not take part in the votes.

```{r change NA to zeros}
E <- exprs(Assay)
E[is.na(E)] <- 0
pd <- phenoData(Assay)
fd <- featureData(Assay)
colnames(E) <- rownames(pd) <- pd@data$identifiant
rownames(E) <- rownames(fd) <- fd@data$Number
Assay <- ExpressionSet(assayData = E, phenoData = pd, featureData = fd)
Nb_Votes <- data.frame(Nb_Votes = colSums(exprs(Assay) != 0),
delegate = rownames(phenoData(Assay)))
ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 38)) +
stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 19)) +
labs(x = "Number of votes") +
guides(fill = guide_legend(title = "Actually voted\nless than\n38 times"))
Assay <- Assay[, Nb_Votes$Nb_Votes > 38]
```

## 3.2 Filtering the genes (votes)

In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually intersted in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least $i$ counts in $j$ cells" are usually quite efficient. The aim is too be quite broad.
In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually interested in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least $i$ counts in $j$ cells" are usually quite efficient. The aim is too be quite broad.

In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwehlming majority of delegates voted "yes", or voted "no. We therefore filter out votes with a low standard deviation (< 0.5). This filters out `r round(100 *39/644, digits = 1)`\% of the votes. Note that in scRNA-seq settings, we would usually filter at least 50\% of the reads (while keeping more than 90\% of the reads).
In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwhelming majority of delegates did not vote

```{r filtering genes}
Vote_repartion <- data.frame(sd = rowSds((exprs(Assay)), na.rm = T),
Vote_repartion <- data.frame(Nb_votes = rowSums(exprs(Assay) != 0),
votes = rownames(featureData(Assay)))
ggplot(Vote_repartion, aes(x = sd, fill = sd < 0.5)) + geom_histogram() +
ggplot(Vote_repartion, aes(x = Nb_votes, fill = Nb_Votes < 200)) +
geom_histogram() +
labs(x = "Standard deviation of the voting repartition") +
guides(fill = guide_legend(title = "Consensus vote"))
Assay <- Assay[rowSds((exprs(Assay))) >= 0.5, ]
Assay <- Assay[rowSds((exprs(Assay)), na.rm = T) >= 0.5, ]
```
```{r clean, echo=F}
rm(meta, Nb_Votes, Vote_repartion, votes, voting_record, E, fd, pd)
```

# Normalization and correcting for zero inflation

# Recovering / discovering biological types

In most scRNA-seq settings, we have no information about the biological type. We therefore want to cluster the cells and then use marker genes to identify known-cell types. Alternatively, we can use clustering to discover new cell types and identify marker genes for those new types.

In our example, let us imagine for a moment that we do not know the political group to which each delegate belong. Can we recover those groups? This will be done in two steps: discover stable clusters of delegates (cells) then identify vote (genes) that help distinguish those clusters.
## Clustering
```{r clustering}
pca <- prcomp(t(exprs(Assay)), center = T, scale = T)
ggplot(data.frame(x = pca$x[,1], y = pca$x[,2], col = phenoData(Assay)@data$group),
aes(x = x, fill = col)) +
facet_wrap(col~.) +
geom_density()
```

# Thanks

Special thanks to Vincent Viers for the initial inspiration of this project.
Expand Down
Loading

0 comments on commit a410ad1

Please sign in to comment.