Skip to content

Commit

Permalink
Adds a set of scripts for easily generating in silico mixture
Browse files Browse the repository at this point in the history
for scRNA-seq data analysis.

PiperOrigin-RevId: 308575210
  • Loading branch information
gamazeps authored and copybara-github committed Apr 27, 2020
1 parent 082c6ce commit 54939f7
Show file tree
Hide file tree
Showing 7 changed files with 469 additions and 0 deletions.
1 change: 1 addition & 0 deletions .travis.yml
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@ env:
- PROJECT="bam"
- PROJECT="bisimulation_aaai2020"
- PROJECT="bitempered_loss"
- PROJECT="cell_mixer"
- PROJECT="cfq"
- PROJECT="cnn_quantization"
- PROJECT="dataset_analysis"
Expand Down
132 changes: 132 additions & 0 deletions cell_mixer/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
# Cell Mixer

## Requirements

For this package to work you will need to install the following python3
packages:

```bash
pip3 install pandas anndata absl-py
```

You will also need to install the following R/Biocinductor packages

```R
install.packages("devtools")
install.packages("argparse")
install.packages("Seurat")
install.packages("purrr")
devtools::install_github(repo = "hhoeflin/hdf5r")
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop")

if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("SingleCellExperiment")
BiocManager::install("scater")
BiocManager::install("scran")
BiocManager::install("DropletUtils")
```

## Use

### Downloading the data

The current data comes from
[Zheng et al](https://www.nature.com/articles/ncomms14049) and was downloaded
from the
[10X website](https://support.10xgenomics.com/single-cell-gene-expression/datasets).

To download the data, simply run `fetch_data.sh` and specify the path to which
you want to save the data to.

```bash
export DATA_PATH=/tmp/data
bash fetch_data.sh $DATA_PATH
```

Note that the raw data only needs to be fetched once.

### Generating a mixture

The script to generate the mixtures is `cell_mixer.R`, it allows for standard QC
steps:

- `--qc_counts_mad_lower`: Removing cells with low read count, filtered based
on the number of MADs under the median read counts.
- `--qc_feature_count_mas_lower`: Removing cells with few genes expressed,
filtered based on the number of MADs under the median number of genes
expressed.
- `--qc_mito_mad_upper`: Removing cells with high number of mitochondrial
reads (which is the case for dead cells), based on the number of MADs above
the median number of mitochondrial RNA counts.

It also allows the select the quantity of cells of various types in the
following table:

Cell type | Number of cells | Flag
------------------------------------ | --------------: | ------------------
CD19+ B cells | 10085 | --b\_cells
CD8+/CD45RA+ Naive Cytotoxic T Cells | 11953 | --naive\_cytotoxic
CD14+ monocytes | 2612 | --cd14\_monocytes
CD4+/CD25+ Regulatory T Cells | 10263 | --regulatory\_t
CD56+ natural killer cells | 8385 | --cd56\_nk
CD4+ helper T cells | 11213 | --cd4\_t\_helper
CD4+/CD45RO+ Memory T Cells | 10224 | --memory\_t
CD4+/CD45RA+/CD25- Naive T cells | 10479 | --naive\_t

The seed for the subsampling is set by default to 1234 in order to reproduce the
data sets from
[Duo et al](https://github.com/csoneson/DuoClustering2018/blob/2b510422c8b799e508b5bbdde93bd8465db2d148/inst/scripts/import_QC_Zhengmix4eq.Rmd#L38),
but it can be changed to generate multiple mixtures with similar cells (for
studying the stability of a result under similar setups).

The cell type identity will be written in the `label` cell attribute.

### Supported formats

This repository can currently generate data in the following formats:

- [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html)
- [Seurat](https://satijalab.org/seurat/)
- [LoomR/Loompy](https://github.com/mojaveazure/loomR)
- csv
- [AnnData](https://github.com/theislab/anndata)

The first four can be done directly with `cell_mixer.R` by specifying the
`--format` flag.

AnnData has to be generate by first generating the data in csv, then by running
the `convert.py` script.

```bash
Rsript cell_mixer.R \
--data_path=$DATA_PATH \
--name=mixture \
--format=csv \
--b_cells=3000 \
--naive_t=3000
python3 converter.py \
--input_csv=mixture \
--format=anndata
```

## Adding new data

In order to add new cell types you can send a Pull Request, the files you will
need to change are:

- `fetch_data.sh`: to download the data
- `cell_mixer.R`: add the appropriate flag, read the data, add the label, and
add it to the mixtures. All the locations to modify have a comment to locate
them.

## Adding in new formats

The R formats have to be added in the `cell_mixer.R` script, internally it uses
SingleCellExperiment which is the most commonly used format.

The python formats have to be added in `converter.py`.

If you want more formats to be supported please open an issue or send a pull
request.
198 changes: 198 additions & 0 deletions cell_mixer/cell_mixer.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,198 @@
suppressPackageStartupMessages({
library(SingleCellExperiment)
library(scater)
library(scran)
library(DropletUtils)
library(argparse)
library(loomR)
library(Seurat)
library(purrr)
})

# create parser object
parser <- ArgumentParser()

# specify our desired options
# by default ArgumentParser will add an help option
parser$add_argument("--data_path",
type = "character", required = TRUE,
help = "Location of the matrices, where the script tries to read them."
)
parser$add_argument("--format",
type = "character", required = TRUE,
help = "Format to write the data in, must take a value in: 'SingleCellExperiment', 'Seurat', 'LoomR', and 'csv'."
)
parser$add_argument("--name",
type = "character", required = TRUE,
help = "Name of the dataset you want to create."
)
parser$add_argument("--seed",
type = "integer", default = 1234,
help = "Seed to use for the subsampling."
)

# QC related arguments.
parser$add_argument("--qc_count_mad_lower",
type = "integer", default = 3,
help = "Threshold for filtering cells based on the number of reads. Use a negative value if you don't want to filter"
)
parser$add_argument("--qc_feature_count_mad_lower",
type = "integer", default = 3,
help = "Threshold for filtering cells based on the number of expressed genes. Use a negative value if you don't want to filter"
)
parser$add_argument("--qc_mito_mad_upper",
type = "integer", default = 3,
help = "Threshold for filtering cells based on the number of mitochondrial RNA reads. Use a negative value if you don't want to filter"
)

# Cell mix
parser$add_argument("--b_cells",
type = "integer", default = 0,
help = "Number of CD19+ B cells in the mixture."
)
parser$add_argument("--naive_cytotoxic",
type = "integer", default = 0,
help = "Number of CD8+/CD45RA+ Naive Cytotoxic T Cells in the mixture,"
)
parser$add_argument("--cd14_monocytes",
type = "integer", default = 0,
help = "Number of CD14+ monocytes in the mixture,"
)
parser$add_argument("--regulatory_t",
type = "integer", default = 0,
help = "Number of CD4+/CD25+ Regulatory T Cells in the mixture,"
)
parser$add_argument("--cd56_nk",
type = "integer", default = 0,
help = "Number of CD56+ natural killer cells in the mixture,"
)
parser$add_argument("--cd4_t_helper",
type = "integer", default = 0,
help = "Number of CD4+ helper T cells in the mixture,"
)
parser$add_argument("--memory_t",
type = "integer", default = 0,
help = "Number of CD4+/CD45RO+ Memory T Cells in the mixture,"
)
parser$add_argument("--naive_t",
type = "integer", default = 0,
help = "Number of CD4+/CD45RA+/CD25- Naive T cells in the mixture,"
)
# Add new data here.

# Get command line options, if help option encountered print help and exit,
# otherwise if options not found on command line then set defaults,
args <- parser$parse_args()

generate_zheng <- function(dfs, qc_count_mad_lower, qc_feature_count_mad_lower, qc_mito_mad_upper, seed = 1234) {
set.seed(seed)
all <- map(dfs, function(arg) {
arg$df[, sample(colnames(arg$df), arg$count, replace = FALSE)]
})
sce <- cbind(
all$b_cells,
all$naive_cytotoxic,
all$cd14_monocytes,
all$regulatory_t,
all$cd4_t_helper,
all$cd56_nk,
all$memory_t,
all$naive_t
# Add new data here.
)

isSpike(sce, "ERCC") <- grepl("^ERCC", rowData(sce)$Symbol)
sce <- calculateQCMetrics(sce, feature_controls = list(Mito = grepl("^MT-", rowData(sce)$Symbol)))

libsize.drop <- rep(FALSE, length(colnames(sce)))
if (qc_count_mad_lower >= 0) {
libsize.drop <- isOutlier(sce$total_counts, nmads = qc_count_mad_lower, type = "lower", log = TRUE)
}
feature.drop <- rep(FALSE, length(colnames(sce)))
if (qc_feature_count_mad_lower >= 0) {
feature.drop <- isOutlier(sce$total_features_by_counts, nmads = qc_feature_count_mad_lower, type = "lower", log = TRUE)
}
mito.drop <- rep(FALSE, length(colnames(sce)))
if (qc_mito_mad_upper >= 0) {
mito.drop <- isOutlier(sce$pct_counts_Mito, nmads = qc_mito_mad_upper, type = "higher")
}

keep <- !(libsize.drop | feature.drop | mito.drop)
table(keep)
sce <- sce[, keep]


num.cells <- nexprs(sce, byrow = TRUE)
to.keep <- num.cells > 0
table(!to.keep)
sce <- sce[to.keep, ]
sce
}


# Reading the data.
b_cells <- DropletUtils::read10xCounts(paste(args$data_path, "b_cells/filtered_matrices_mex/hg19", sep = "/"))
b_cells$label <- "b_cells"
memory_t <- DropletUtils::read10xCounts(paste(args$data_path, "memory_t/filtered_matrices_mex/hg19", sep = "/"))
memory_t$label <- "memory_t"
naive_t <- DropletUtils::read10xCounts(paste(args$data_path, "naive_t/filtered_matrices_mex/hg19", sep = "/"))
naive_t$label <- "naive_t"
cd56_nk <- DropletUtils::read10xCounts(paste(args$data_path, "cd56_nk/filtered_matrices_mex/hg19", sep = "/"))
cd56_nk$label <- "cd56_nk"
cd14_monocytes <- DropletUtils::read10xCounts(paste(args$data_path, "cd14_monocytes/filtered_matrices_mex/hg19", sep = "/"))
cd14_monocytes$label <- "cd14_monocytes"
cd4_t_helper <- DropletUtils::read10xCounts(paste(args$data_path, "cd4_t_helper/filtered_matrices_mex/hg19", sep = "/"))
cd4_t_helper$label <- "cd4_t_helper"
regulatory_t <- DropletUtils::read10xCounts(paste(args$data_path, "regulatory_t/filtered_matrices_mex/hg19", sep = "/"))
regulatory_t$label <- "regulatory_t"
naive_cytotoxic <- DropletUtils::read10xCounts(paste(args$data_path, "naive_cytotoxic/filtered_matrices_mex/hg19", sep = "/"))
naive_cytotoxic$label <- "naive_cytotoxic"
# Add new data here.

# Naming the Cells.
colnames(b_cells) <- paste0("b_cells", seq_len(ncol(b_cells)))
colnames(naive_cytotoxic) <- paste0("naive_cytotoxic", seq_len(ncol(naive_cytotoxic)))
colnames(cd14_monocytes) <- paste0("cd14_monocytes", seq_len(ncol(cd14_monocytes)))
colnames(regulatory_t) <- paste0("regulatory_t", seq_len(ncol(regulatory_t)))
colnames(cd4_t_helper) <- paste0("cd4_t_helper", seq_len(ncol(cd4_t_helper)))
colnames(cd56_nk) <- paste0("cd56_nk", seq_len(ncol(cd56_nk)))
colnames(memory_t) <- paste0("memory_t", seq_len(ncol(memory_t)))
colnames(naive_t) <- paste0("naive_t", seq_len(ncol(naive_t)))
# Add new data here.


sce_args <- list(
b_cells = list(df = b_cells, count = args$b_cells),
naive_cytotoxic = list(df = naive_cytotoxic, count = args$naive_cytotoxic),
cd14_monocytes = list(df = cd14_monocytes, count = args$cd14_monocytes), # careful, only 2.6k
regulatory_t = list(df = regulatory_t, count = args$regulatory_t),
cd4_t_helper = list(df = cd4_t_helper, count = args$cd4_t_helper),
cd56_nk = list(df = cd56_nk, count = args$cd56_nk),
memory_t = list(df = memory_t, count = args$memory_t),
naive_t = list(df = naive_t, count = args$naive_t)
# Add new data here.
)
sce <- generate_zheng(sce_args,
qc_count_mad_lower = args$qc_count_mad_lower,
qc_feature_count_mad_lower = args$qc_feature_count_mad_lower,
qc_mito_mad_upper = args$qc_mito_mad_upper,
seed = args$seed
)

if (args$format == "SingleCellExperiment") {
saveRDS(sce, paste(args$name, "rds", sep = "."))
}
if (args$format == "Seurat") {
dat <- as.Seurat(sce, counts = "counts", data = "counts")
saveRDS(dat, paste(args$name, "rds", sep = "."))
}
if (args$format == "LoomR") {
dat <- as.Seurat(sce, counts = "counts", data = "counts")
loom <- as.loom(dat, filename = paste(args$name, "rds", sep = "."), verbose = TRUE, overwrite = TRUE)
loom$close_all()
}
if (args$format == "csv") {
write.csv(as.matrix(counts(sce)), paste(args$name, "counts.csv", sep = "."))
write.csv(colData(sce), paste(args$name, "metadata.csv", sep = "."))
write.csv(rowData(sce), paste(args$name, "featuredata.csv", sep = "."))
}
48 changes: 48 additions & 0 deletions cell_mixer/converter.py
Original file line number Diff line number Diff line change
@@ -0,0 +1,48 @@
# coding=utf-8
# Copyright 2020 The Google Research Authors.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# Lint as: python3
"""Converts data from csv to an scRNA-seq python format."""

from absl import app
from absl import flags

import anndata
import pandas as pd

FLAGS = flags.FLAGS

flags.DEFINE_string('input_csv_prefix', None,
'Name used during the csv generation.')
flags.DEFINE_enum('format', None, ['anndata'], 'Format to convert the data to.')


def csv_to_h5ad(csv_prefix):
count = pd.read_csv(f'{csv_prefix}.counts.csv', index_col=0)
metadata = pd.read_csv(f'{csv_prefix}.metadata.csv', index_col=0)
featuredata = pd.read_csv(f'{csv_prefix}.featuredata.csv', index_col=0)
adata = anndata.AnnData(X=count.transpose(), obs=metadata, var=featuredata)
adata.write(f'{csv_prefix}.h5ad')


def main(unused_argv):
if FLAGS.format == 'anndata':
csv_to_h5ad(FLAGS.input_csv_prefix)


if __name__ == '__main__':
flags.mark_flag_as_required('input_csv_prefix')
flags.mark_flag_as_required('format')
app.run(main)
Loading

0 comments on commit 54939f7

Please sign in to comment.