Adds a set of scripts for easily generating in silico mixture

for scRNA-seq data analysis. PiperOrigin-RevId: 308575210
mmrech · Apr 27, 2020 · 54939f7 · 54939f7
1 parent 082c6ce
commit 54939f7
Show file tree

Hide file tree

Showing 7 changed files with 469 additions and 0 deletions.
diff --git a/.travis.yml b/.travis.yml
@@ -16,6 +16,7 @@ env:
     - PROJECT="bam"
     - PROJECT="bisimulation_aaai2020"
     - PROJECT="bitempered_loss"
+    - PROJECT="cell_mixer"
     - PROJECT="cfq"
     - PROJECT="cnn_quantization"
     - PROJECT="dataset_analysis"

diff --git a/cell_mixer/README.md b/cell_mixer/README.md
@@ -0,0 +1,132 @@
+# Cell Mixer
+
+## Requirements
+
+For this package to work you will need to install the following python3
+packages:
+
+```bash
+pip3 install pandas anndata absl-py
+```
+
+You will also need to install the following R/Biocinductor packages
+
+```R
+install.packages("devtools")
+install.packages("argparse")
+install.packages("Seurat")
+install.packages("purrr")
+devtools::install_github(repo = "hhoeflin/hdf5r")
+devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")
+
+if (!requireNamespace("BiocManager", quietly = TRUE))
+    install.packages("BiocManager")
+
+BiocManager::install("SingleCellExperiment")
+BiocManager::install("scater")
+BiocManager::install("scran")
+BiocManager::install("DropletUtils")
+```
+
+## Use
+
+### Downloading the data
+
+The current data comes from
+[Zheng et al](https://www.nature.com/articles/ncomms14049) and was downloaded
+from the
+[10X website](https://support.10xgenomics.com/single-cell-gene-expression/datasets).
+
+To download the data, simply run `fetch_data.sh` and specify the path to which
+you want to save the data to.
+
+```bash
+export DATA_PATH=/tmp/data
+bash fetch_data.sh $DATA_PATH
+```
+
+Note that the raw data only needs to be fetched once.
+
+### Generating a mixture
+
+The script to generate the mixtures is `cell_mixer.R`, it allows for standard QC
+steps:
+
+-   `--qc_counts_mad_lower`: Removing cells with low read count, filtered based
+    on the number of MADs under the median read counts.
+-   `--qc_feature_count_mas_lower`: Removing cells with few genes expressed,
+    filtered based on the number of MADs under the median number of genes
+    expressed.
+-   `--qc_mito_mad_upper`: Removing cells with high number of mitochondrial
+    reads (which is the case for dead cells), based on the number of MADs above
+    the median number of mitochondrial RNA counts.
+
+It also allows the select the quantity of cells of various types in the
+following table:
+
+Cell type                            | Number of cells | Flag
+------------------------------------ | --------------: | ------------------
+CD19+ B cells                        | 10085           | --b\_cells
+CD8+/CD45RA+ Naive Cytotoxic T Cells | 11953           | --naive\_cytotoxic
+CD14+ monocytes                      | 2612            | --cd14\_monocytes
+CD4+/CD25+ Regulatory T Cells        | 10263           | --regulatory\_t
+CD56+ natural killer cells           | 8385            | --cd56\_nk
+CD4+ helper T cells                  | 11213           | --cd4\_t\_helper
+CD4+/CD45RO+ Memory T Cells          | 10224           | --memory\_t
+CD4+/CD45RA+/CD25- Naive T cells     | 10479           | --naive\_t
+
+The seed for the subsampling is set by default to 1234 in order to reproduce the
+data sets from
+[Duo et al](https://github.com/csoneson/DuoClustering2018/blob/2b510422c8b799e508b5bbdde93bd8465db2d148/inst/scripts/import_QC_Zhengmix4eq.Rmd#L38),
+but it can be changed to generate multiple mixtures with similar cells (for
+studying the stability of a result under similar setups).
+
+The cell type identity will be written in the `label` cell attribute.
+
+### Supported formats
+
+This repository can currently generate data in the following formats:
+
+-   [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)
+-   [Seurat](https://satijalab.org/seurat/)
+-   [LoomR/Loompy](https://github.com/mojaveazure/loomR)
+-   csv
+-   [AnnData](https://github.com/theislab/anndata)
+
+The first four can be done directly with `cell_mixer.R` by specifying the
+`--format` flag.
+
+AnnData has to be generate by first generating the data in csv, then by running
+the `convert.py` script.
+
+```bash
+Rsript cell_mixer.R \
+--data_path=$DATA_PATH \
+--name=mixture \
+--format=csv \
+--b_cells=3000 \
+--naive_t=3000
+python3 converter.py \
+--input_csv=mixture \
+--format=anndata
+```
+
+## Adding new data
+
+In order to add new cell types you can send a Pull Request, the files you will
+need to change are:
+
+-   `fetch_data.sh`: to download the data
+-   `cell_mixer.R`: add the appropriate flag, read the data, add the label, and
+    add it to the mixtures. All the locations to modify have a comment to locate
+    them.
+
+## Adding in new formats
+
+The R formats have to be added in the `cell_mixer.R` script, internally it uses
+SingleCellExperiment which is the most commonly used format.
+
+The python formats have to be added in `converter.py`.
+
+If you want more formats to be supported please open an issue or send a pull
+request.
diff --git a/cell_mixer/cell_mixer.R b/cell_mixer/cell_mixer.R
@@ -0,0 +1,198 @@
+suppressPackageStartupMessages({
+  library(SingleCellExperiment)
+  library(scater)
+  library(scran)
+  library(DropletUtils)
+  library(argparse)
+  library(loomR)
+  library(Seurat)
+  library(purrr)
+})
+
+# create parser object
+parser <- ArgumentParser()
+
+# specify our desired options
+# by default ArgumentParser will add an help option
+parser$add_argument("--data_path",
+  type = "character", required = TRUE,
+  help = "Location of the matrices, where the script tries to read them."
+)
+parser$add_argument("--format",
+  type = "character", required = TRUE,
+  help = "Format to write the data in, must take a value in: 'SingleCellExperiment', 'Seurat', 'LoomR', and 'csv'."
+)
+parser$add_argument("--name",
+  type = "character", required = TRUE,
+  help = "Name of the dataset you want to create."
+)
+parser$add_argument("--seed",
+  type = "integer", default = 1234,
+  help = "Seed to use for the subsampling."
+)
+
+# QC related arguments.
+parser$add_argument("--qc_count_mad_lower",
+  type = "integer", default = 3,
+  help = "Threshold for filtering cells based on the number of reads. Use a negative value if you don't want to filter"
+)
+parser$add_argument("--qc_feature_count_mad_lower",
+  type = "integer", default = 3,
+  help = "Threshold for filtering cells based on the number of expressed genes. Use a negative value if you don't want to filter"
+)
+parser$add_argument("--qc_mito_mad_upper",
+  type = "integer", default = 3,
+  help = "Threshold for filtering cells based on the number of mitochondrial RNA reads. Use a negative value if you don't want to filter"
+)
+
+# Cell mix
+parser$add_argument("--b_cells",
+  type = "integer", default = 0,
+  help = "Number of CD19+ B cells in the mixture."
+)
+parser$add_argument("--naive_cytotoxic",
+  type = "integer", default = 0,
+  help = "Number of CD8+/CD45RA+ Naive Cytotoxic T Cells in the mixture,"
+)
+parser$add_argument("--cd14_monocytes",
+  type = "integer", default = 0,
+  help = "Number of CD14+ monocytes in the mixture,"
+)
+parser$add_argument("--regulatory_t",
+  type = "integer", default = 0,
+  help = "Number of CD4+/CD25+ Regulatory T Cells in the mixture,"
+)
+parser$add_argument("--cd56_nk",
+  type = "integer", default = 0,
+  help = "Number of CD56+ natural killer cells in the mixture,"
+)
+parser$add_argument("--cd4_t_helper",
+  type = "integer", default = 0,
+  help = "Number of CD4+ helper T cells in the mixture,"
+)
+parser$add_argument("--memory_t",
+  type = "integer", default = 0,
+  help = "Number of CD4+/CD45RO+ Memory T Cells in the mixture,"
+)
+parser$add_argument("--naive_t",
+  type = "integer", default = 0,
+  help = "Number of CD4+/CD45RA+/CD25- Naive T cells in the mixture,"
+)
+# Add new data here.
+
+# Get command line options, if help option encountered print help and exit,
+# otherwise if options not found on command line then set defaults,
+args <- parser$parse_args()
+
+generate_zheng <- function(dfs, qc_count_mad_lower, qc_feature_count_mad_lower, qc_mito_mad_upper, seed = 1234) {
+  set.seed(seed)
+  all <- map(dfs, function(arg) {
+    arg$df[, sample(colnames(arg$df), arg$count, replace = FALSE)]
+  })
+  sce <- cbind(
+    all$b_cells,
+    all$naive_cytotoxic,
+    all$cd14_monocytes,
+    all$regulatory_t,
+    all$cd4_t_helper,
+    all$cd56_nk,
+    all$memory_t,
+    all$naive_t
+    # Add new data here.
+  )
+
+  isSpike(sce, "ERCC") <- grepl("^ERCC", rowData(sce)$Symbol)
+  sce <- calculateQCMetrics(sce, feature_controls = list(Mito = grepl("^MT-", rowData(sce)$Symbol)))
+
+  libsize.drop <- rep(FALSE, length(colnames(sce)))
+  if (qc_count_mad_lower >= 0) {
+    libsize.drop <- isOutlier(sce$total_counts, nmads = qc_count_mad_lower, type = "lower", log = TRUE)
+  }
+  feature.drop <- rep(FALSE, length(colnames(sce)))
+  if (qc_feature_count_mad_lower >= 0) {
+    feature.drop <- isOutlier(sce$total_features_by_counts, nmads = qc_feature_count_mad_lower, type = "lower", log = TRUE)
+  }
+  mito.drop <- rep(FALSE, length(colnames(sce)))
+  if (qc_mito_mad_upper >= 0) {
+    mito.drop <- isOutlier(sce$pct_counts_Mito, nmads = qc_mito_mad_upper, type = "higher")
+  }
+
+  keep <- !(libsize.drop | feature.drop | mito.drop)
+  table(keep)
+  sce <- sce[, keep]
+
+
+  num.cells <- nexprs(sce, byrow = TRUE)
+  to.keep <- num.cells > 0
+  table(!to.keep)
+  sce <- sce[to.keep, ]
+  sce
+}
+
+
+# Reading the data.
+b_cells <- DropletUtils::read10xCounts(paste(args$data_path, "b_cells/filtered_matrices_mex/hg19", sep = "/"))
+b_cells$label <- "b_cells"
+memory_t <- DropletUtils::read10xCounts(paste(args$data_path, "memory_t/filtered_matrices_mex/hg19", sep = "/"))
+memory_t$label <- "memory_t"
+naive_t <- DropletUtils::read10xCounts(paste(args$data_path, "naive_t/filtered_matrices_mex/hg19", sep = "/"))
+naive_t$label <- "naive_t"
+cd56_nk <- DropletUtils::read10xCounts(paste(args$data_path, "cd56_nk/filtered_matrices_mex/hg19", sep = "/"))
+cd56_nk$label <- "cd56_nk"
+cd14_monocytes <- DropletUtils::read10xCounts(paste(args$data_path, "cd14_monocytes/filtered_matrices_mex/hg19", sep = "/"))
+cd14_monocytes$label <- "cd14_monocytes"
+cd4_t_helper <- DropletUtils::read10xCounts(paste(args$data_path, "cd4_t_helper/filtered_matrices_mex/hg19", sep = "/"))
+cd4_t_helper$label <- "cd4_t_helper"
+regulatory_t <- DropletUtils::read10xCounts(paste(args$data_path, "regulatory_t/filtered_matrices_mex/hg19", sep = "/"))
+regulatory_t$label <- "regulatory_t"
+naive_cytotoxic <- DropletUtils::read10xCounts(paste(args$data_path, "naive_cytotoxic/filtered_matrices_mex/hg19", sep = "/"))
+naive_cytotoxic$label <- "naive_cytotoxic"
+# Add new data here.
+
+# Naming the Cells.
+colnames(b_cells) <- paste0("b_cells", seq_len(ncol(b_cells)))
+colnames(naive_cytotoxic) <- paste0("naive_cytotoxic", seq_len(ncol(naive_cytotoxic)))
+colnames(cd14_monocytes) <- paste0("cd14_monocytes", seq_len(ncol(cd14_monocytes)))
+colnames(regulatory_t) <- paste0("regulatory_t", seq_len(ncol(regulatory_t)))
+colnames(cd4_t_helper) <- paste0("cd4_t_helper", seq_len(ncol(cd4_t_helper)))
+colnames(cd56_nk) <- paste0("cd56_nk", seq_len(ncol(cd56_nk)))
+colnames(memory_t) <- paste0("memory_t", seq_len(ncol(memory_t)))
+colnames(naive_t) <- paste0("naive_t", seq_len(ncol(naive_t)))
+# Add new data here.
+
+
+sce_args <- list(
+  b_cells = list(df = b_cells, count = args$b_cells),
+  naive_cytotoxic = list(df = naive_cytotoxic, count = args$naive_cytotoxic),
+  cd14_monocytes = list(df = cd14_monocytes, count = args$cd14_monocytes), # careful, only 2.6k
+  regulatory_t = list(df = regulatory_t, count = args$regulatory_t),
+  cd4_t_helper = list(df = cd4_t_helper, count = args$cd4_t_helper),
+  cd56_nk = list(df = cd56_nk, count = args$cd56_nk),
+  memory_t = list(df = memory_t, count = args$memory_t),
+  naive_t = list(df = naive_t, count = args$naive_t)
+  # Add new data here.
+)
+sce <- generate_zheng(sce_args,
+  qc_count_mad_lower = args$qc_count_mad_lower,
+  qc_feature_count_mad_lower = args$qc_feature_count_mad_lower,
+  qc_mito_mad_upper = args$qc_mito_mad_upper,
+  seed = args$seed
+)
+
+if (args$format == "SingleCellExperiment") {
+  saveRDS(sce, paste(args$name, "rds", sep = "."))
+}
+if (args$format == "Seurat") {
+  dat <- as.Seurat(sce, counts = "counts", data = "counts")
+  saveRDS(dat, paste(args$name, "rds", sep = "."))
+}
+if (args$format == "LoomR") {
+  dat <- as.Seurat(sce, counts = "counts", data = "counts")
+  loom <- as.loom(dat, filename = paste(args$name, "rds", sep = "."), verbose = TRUE, overwrite = TRUE)
+  loom$close_all()
+}
+if (args$format == "csv") {
+  write.csv(as.matrix(counts(sce)), paste(args$name, "counts.csv", sep = "."))
+  write.csv(colData(sce), paste(args$name, "metadata.csv", sep = "."))
+  write.csv(rowData(sce), paste(args$name, "featuredata.csv", sep = "."))
+}
diff --git a/cell_mixer/converter.py b/cell_mixer/converter.py
@@ -0,0 +1,48 @@
+# coding=utf-8
+# Copyright 2020 The Google Research Authors.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+# Lint as: python3
+"""Converts data from csv to an scRNA-seq python format."""
+
+from absl import app
+from absl import flags
+
+import anndata
+import pandas as pd
+
+FLAGS = flags.FLAGS
+
+flags.DEFINE_string('input_csv_prefix', None,
+                    'Name used during the csv generation.')
+flags.DEFINE_enum('format', None, ['anndata'], 'Format to convert the data to.')
+
+
+def csv_to_h5ad(csv_prefix):
+  count = pd.read_csv(f'{csv_prefix}.counts.csv', index_col=0)
+  metadata = pd.read_csv(f'{csv_prefix}.metadata.csv', index_col=0)
+  featuredata = pd.read_csv(f'{csv_prefix}.featuredata.csv', index_col=0)
+  adata = anndata.AnnData(X=count.transpose(), obs=metadata, var=featuredata)
+  adata.write(f'{csv_prefix}.h5ad')
+
+
+def main(unused_argv):
+  if FLAGS.format == 'anndata':
+    csv_to_h5ad(FLAGS.input_csv_prefix)
+
+
+if __name__ == '__main__':
+  flags.mark_flag_as_required('input_csv_prefix')
+  flags.mark_flag_as_required('format')
+  app.run(main)