New README

HectorRDB · Jul 8, 2018 · a410ad1 · a410ad1
1 parent cce979c
commit a410ad1
Show file tree

Hide file tree

Showing 12 changed files with 144 additions and 146 deletions.
diff --git a/README.md b/README.md
@@ -3,6 +3,16 @@ Understanding RNA-seq analysis through an example: voting behavior at the French
 Hector Roux de Bézieux
 May 2018
 
+-   [Introduction](#introduction)
+-   [1 Context](#context)
+-   [2 Loading the data and building an ExpressionSet object](#loading-the-data-and-building-an-expressionset-object)
+-   [3 Data Cleaning](#data-cleaning)
+    -   [3.1 Filtering the cells (delegate)](#filtering-the-cells-delegate)
+    -   [3.2 Filtering the genes (votes)](#filtering-the-genes-votes)
+-   [Normalization and correcting for zero inflation](#normalization-and-correcting-for-zero-inflation)
+-   [Recovering / discovering biological types](#recovering-discovering-biological-types)
+-   [Thanks](#thanks)
+
 Introduction
 ============
 
@@ -49,7 +59,7 @@ Assay <- ExpressionSet(assayData = voting_record,
 3.1 Filtering the cells (delegate)
 ----------------------------------
 
-In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough, or because the delegate was really not voting much. A special case is of course the chairman of the Assembly who do not take part in the votes.
+In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough.
 
 We therefore filter delegates whose number of votes is more than 2 SD below the mean (so equal or less to 40).
 
@@ -60,21 +70,46 @@ Nb_Votes <- data.frame(Nb_Votes = colSums(!is.na(exprs(Assay))),
 ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 40)) +
                  stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 20)) +
                  labs(x = "Number of votes") +
-                 guides(fill = guide_legend(title = "Less than 40 votes"))
+            guides(fill = guide_legend(title = "Could vote\nless than\n40 times"))
 ```
 
 <img src="Single-cell_files/figure-markdown_github/filtering cells-1.png"  style="display: block; margin: auto;" />
 
 ``` r
-Assay <- Assay[, colSums(!is.na(exprs(Assay))) > 40]
+Assay <- Assay[, Nb_Votes$Nb_Votes > 40]
+```
+
+We then simplify the data by replacing NAs by 0 (as abstentions). We can then check whether some delegates never voted. Again, we remove people whose number of votes is more than 2 SD below the mean number of votes (so delegates who voted less than 38 votes).
+
+``` r
+E <- exprs(Assay)
+E[is.na(E)] <- 0
+pd <- phenoData(Assay)
+fd <- featureData(Assay)
+colnames(E) <- rownames(pd) <- pd@data$identifiant
+rownames(E) <- rownames(fd) <- fd@data$Number
+Assay <- ExpressionSet(assayData = E, phenoData = pd, featureData = fd)
+
+Nb_Votes <- data.frame(Nb_Votes = colSums(exprs(Assay) != 0),
+                       delegate = rownames(phenoData(Assay)))
+ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 38)) +
+                 stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 19)) +
+                 labs(x = "Number of votes") +
+            guides(fill = guide_legend(title = "Actually voted\nless than\n38 times"))
+```
+
+<img src="Single-cell_files/figure-markdown_github/change NA to zeros-1.png"  style="display: block; margin: auto;" />
+
+``` r
+Assay <- Assay[, Nb_Votes$Nb_Votes > 38]
 ```
 
 3.2 Filtering the genes (votes)
 -------------------------------
 
-In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually intersted in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least *i* counts in *j* cells" are usually quite efficient. The aim is too be quite broad.
+In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually interested in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least *i* counts in *j* cells" are usually quite efficient. The aim is too be quite broad.
 
-In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwehlming majority of delegates voted "yes", or voted "no. We therefore filter out votes with a low standard deviation (&lt; 0.5). This filters out 6.1% of the votes. Note that in scRNA-seq settings, we would usually filter at least 50% of the reads (while keeping more than 90% of the reads).
+In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwhelming majority of delegates voted "yes", or voted "no. We therefore filter out votes with a low standard deviation (&lt; 0.5). This filters out 6.1% of the votes. Note that in scRNA-seq settings, we would usually filter at least 50% of the reads (while keeping more than 90% of the reads).
 
 ``` r
 Vote_repartion <- data.frame(sd = rowSds((exprs(Assay)), na.rm = T),
@@ -87,9 +122,29 @@ ggplot(Vote_repartion, aes(x = sd, fill = sd < 0.5)) + geom_histogram() +
 <img src="Single-cell_files/figure-markdown_github/filtering genes-1.png"  style="display: block; margin: auto;" />
 
 ``` r
-Assay <- Assay[rowSds((exprs(Assay))) >= 0.5, ]
+Assay <- Assay[rowSds((exprs(Assay)), na.rm = T) >= 0.5, ]
 ```
 
+Normalization and correcting for zero inflation
+===============================================
+
+Recovering / discovering biological types
+=========================================
+
+In most scRNA-seq settings, we have no information about the biological type. We therefore want to cluster the cells and then use marker genes to identify known-cell types. Alternatively, we can use clustering to discover new cell types and identify marker genes for those new types.
+
+In our example, let us imagine for a moment that we do not know the political group to which each delegate belong. Can we recover those groups? This will be done in two steps: discover stable clusters of delegates (cells) then identify vote (genes) that help distinguish those clusters. \#\# Clustering
+
+``` r
+pca <- prcomp(t(exprs(Assay)), center = T, scale = T)
+ggplot(data.frame(x = pca$x[,1], y = pca$x[,2], col = phenoData(Assay)@data$group),
+       aes(x = x, fill = col)) +
+  facet_wrap(col~.) + 
+  geom_density()
+```
+
+<img src="Single-cell_files/figure-markdown_github/clustering-1.png"  style="display: block; margin: auto;" />
+
 Thanks
 ======
 

diff --git a/Single-cell.Rmd b/Single-cell.Rmd
@@ -2,10 +2,9 @@
 title: "Understanding RNA-seq analysis through an example: voting behavior at the French parliament"
 author: "Hector Roux de Bézieux"
 date: "May 2018"
-output:
-  github_document: default
-  html_document: default
-  pdf_document: default
+output: 
+  github_document: 
+    toc: true
 ---
 
 ```{r packages, include=F}
@@ -60,36 +59,77 @@ Assay <- ExpressionSet(assayData = voting_record,
 # 3 Data Cleaning
 ## 3.1 Filtering the cells (delegate)
 
-In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough, or because the delegate was really not voting much. A special case is of course the chairman of the Assembly who do not take part in the votes.
+In scRNA-seq settings, we want to filter the cells with too few total reads, or reads of too poor quality (that do not match to the genome of reference). Here, we instead filter the delegates with too few votes. This may happen either because a delegate had to leave office before the end of the session, or was appointed to the government, and then its replacement is not around long enough.
 
 We therefore filter delegates whose number of votes is more than 2 SD below the mean (so equal or less to 40).
 
-```{r filtering cells, fig.height=3}
+```{r filtering cells}
 Nb_Votes <- data.frame(Nb_Votes = colSums(!is.na(exprs(Assay))),
                        delegate = rownames(phenoData(Assay)))
 
 ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 40)) +
                  stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 20)) +
                  labs(x = "Number of votes") +
-                 guides(fill = guide_legend(title = "Less than 40 votes"))
-Assay <- Assay[, colSums(!is.na(exprs(Assay))) > 40]
+            guides(fill = guide_legend(title = "Could vote\nless than\n40 times"))
+Assay <- Assay[, Nb_Votes$Nb_Votes > 40]
+```
+
+We then simplify the data by replacing NAs by 0 (as abstentions). We can then check whether some delegates voted only rarely. Again, we remove people whose number of votes is more than 2 SD below the mean number of votes (so delegates who voted less than 38 votes). Among those, we of course find the president of the parliament, who does not take part in the votes. 
+
+```{r change NA to zeros}
+E <- exprs(Assay)
+E[is.na(E)] <- 0
+pd <- phenoData(Assay)
+fd <- featureData(Assay)
+colnames(E) <- rownames(pd) <- pd@data$identifiant
+rownames(E) <- rownames(fd) <- fd@data$Number
+Assay <- ExpressionSet(assayData = E, phenoData = pd, featureData = fd)
+
+Nb_Votes <- data.frame(Nb_Votes = colSums(exprs(Assay) != 0),
+                       delegate = rownames(phenoData(Assay)))
+ggplot(Nb_Votes, aes(x = Nb_Votes, fill = Nb_Votes <= 38)) +
+                 stat_bin(breaks = seq(0, max(Nb_Votes$Nb_Votes), 19)) +
+                 labs(x = "Number of votes") +
+            guides(fill = guide_legend(title = "Actually voted\nless than\n38 times"))
+Assay <- Assay[, Nb_Votes$Nb_Votes > 38]
 ```
 
 ## 3.2 Filtering the genes (votes)
 
-In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually intersted in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least $i$ counts in $j$ cells" are usually quite efficient. The aim is too be quite broad. 
+In scRNA-seq settings, we want to filter out genes that are expressed in a small number of cells. This is done for two reasons. The first is computational. Most datasets would have dozens of thousands of genes, over dozens or hundreds of cells. Reducing the number of genes to keep speeds up calculations. Secondly, we are usually interested in a specific tissue or process. Most genes are not being expressed in the samples of interest and just add noise to the analysis. Hence filtration. Simple heuristics in the form of "at least $i$ counts in $j$ cells" are usually quite efficient. The aim is too be quite broad. 
 
-In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwehlming majority of delegates voted "yes", or voted "no. We therefore filter out votes with a low standard deviation (< 0.5). This filters out `r round(100 *39/644, digits = 1)`\% of the votes. Note that in scRNA-seq settings, we would usually filter at least 50\% of the reads (while keeping more than 90\% of the reads).
+In our context, we only have 644 votes so the computational goal does not really hold. But we can filter out votes where an overwhelming majority of delegates did not vote
 
 ```{r filtering genes}
-Vote_repartion <- data.frame(sd = rowSds((exprs(Assay)), na.rm = T),
+Vote_repartion <- data.frame(Nb_votes = rowSums(exprs(Assay) != 0),
                              votes = rownames(featureData(Assay)))
-ggplot(Vote_repartion, aes(x = sd, fill = sd < 0.5)) + geom_histogram() +
+ggplot(Vote_repartion, aes(x = Nb_votes, fill = Nb_Votes < 200)) +
+  geom_histogram() +
   labs(x = "Standard deviation of the voting repartition") +
   guides(fill = guide_legend(title = "Consensus vote"))
 
-Assay <- Assay[rowSds((exprs(Assay))) >= 0.5, ]
+Assay <- Assay[rowSds((exprs(Assay)), na.rm = T) >= 0.5, ]
+```
+```{r clean, echo=F}
+rm(meta, Nb_Votes, Vote_repartion, votes, voting_record, E, fd, pd)
 ```
+
+# Normalization and correcting for zero inflation
+
+# Recovering / discovering biological types
+
+In most scRNA-seq settings, we have no information about the biological type. We therefore want to cluster the cells and then use marker genes to identify known-cell types. Alternatively, we can use clustering to discover new cell types and identify marker genes for those new types.
+
+In our example, let us imagine for a moment that we do not know the political group to which each delegate belong. Can we recover those groups? This will be done in two steps: discover stable clusters of delegates (cells) then identify vote (genes) that help distinguish those clusters.
+## Clustering
+```{r clustering}
+pca <- prcomp(t(exprs(Assay)), center = T, scale = T)
+ggplot(data.frame(x = pca$x[,1], y = pca$x[,2], col = phenoData(Assay)@data$group),
+       aes(x = x, fill = col)) +
+  facet_wrap(col~.) + 
+  geom_density()
+```
+
 # Thanks
 
 Special thanks to Vincent Viers for the initial inspiration of this project.