forked from google-research/google-research
-
Notifications
You must be signed in to change notification settings - Fork 0
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Adds a set of scripts for easily generating in silico mixture
for scRNA-seq data analysis. PiperOrigin-RevId: 308575210
- Loading branch information
1 parent
082c6ce
commit 54939f7
Showing
7 changed files
with
469 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,132 @@ | ||
# Cell Mixer | ||
|
||
## Requirements | ||
|
||
For this package to work you will need to install the following python3 | ||
packages: | ||
|
||
```bash | ||
pip3 install pandas anndata absl-py | ||
``` | ||
|
||
You will also need to install the following R/Biocinductor packages | ||
|
||
```R | ||
install.packages("devtools") | ||
install.packages("argparse") | ||
install.packages("Seurat") | ||
install.packages("purrr") | ||
devtools::install_github(repo = "hhoeflin/hdf5r") | ||
devtools::install_github(repo = "mojaveazure/loomR", ref = "develop") | ||
|
||
if (!requireNamespace("BiocManager", quietly = TRUE)) | ||
install.packages("BiocManager") | ||
|
||
BiocManager::install("SingleCellExperiment") | ||
BiocManager::install("scater") | ||
BiocManager::install("scran") | ||
BiocManager::install("DropletUtils") | ||
``` | ||
|
||
## Use | ||
|
||
### Downloading the data | ||
|
||
The current data comes from | ||
[Zheng et al](https://www.nature.com/articles/ncomms14049) and was downloaded | ||
from the | ||
[10X website](https://support.10xgenomics.com/single-cell-gene-expression/datasets). | ||
|
||
To download the data, simply run `fetch_data.sh` and specify the path to which | ||
you want to save the data to. | ||
|
||
```bash | ||
export DATA_PATH=/tmp/data | ||
bash fetch_data.sh $DATA_PATH | ||
``` | ||
|
||
Note that the raw data only needs to be fetched once. | ||
|
||
### Generating a mixture | ||
|
||
The script to generate the mixtures is `cell_mixer.R`, it allows for standard QC | ||
steps: | ||
|
||
- `--qc_counts_mad_lower`: Removing cells with low read count, filtered based | ||
on the number of MADs under the median read counts. | ||
- `--qc_feature_count_mas_lower`: Removing cells with few genes expressed, | ||
filtered based on the number of MADs under the median number of genes | ||
expressed. | ||
- `--qc_mito_mad_upper`: Removing cells with high number of mitochondrial | ||
reads (which is the case for dead cells), based on the number of MADs above | ||
the median number of mitochondrial RNA counts. | ||
|
||
It also allows the select the quantity of cells of various types in the | ||
following table: | ||
|
||
Cell type | Number of cells | Flag | ||
------------------------------------ | --------------: | ------------------ | ||
CD19+ B cells | 10085 | --b\_cells | ||
CD8+/CD45RA+ Naive Cytotoxic T Cells | 11953 | --naive\_cytotoxic | ||
CD14+ monocytes | 2612 | --cd14\_monocytes | ||
CD4+/CD25+ Regulatory T Cells | 10263 | --regulatory\_t | ||
CD56+ natural killer cells | 8385 | --cd56\_nk | ||
CD4+ helper T cells | 11213 | --cd4\_t\_helper | ||
CD4+/CD45RO+ Memory T Cells | 10224 | --memory\_t | ||
CD4+/CD45RA+/CD25- Naive T cells | 10479 | --naive\_t | ||
|
||
The seed for the subsampling is set by default to 1234 in order to reproduce the | ||
data sets from | ||
[Duo et al](https://github.com/csoneson/DuoClustering2018/blob/2b510422c8b799e508b5bbdde93bd8465db2d148/inst/scripts/import_QC_Zhengmix4eq.Rmd#L38), | ||
but it can be changed to generate multiple mixtures with similar cells (for | ||
studying the stability of a result under similar setups). | ||
|
||
The cell type identity will be written in the `label` cell attribute. | ||
|
||
### Supported formats | ||
|
||
This repository can currently generate data in the following formats: | ||
|
||
- [SingleCellExperiment](https://bioconductor.org/packages/release/bioc/html/SingleCellExperiment.html) | ||
- [Seurat](https://satijalab.org/seurat/) | ||
- [LoomR/Loompy](https://github.com/mojaveazure/loomR) | ||
- csv | ||
- [AnnData](https://github.com/theislab/anndata) | ||
|
||
The first four can be done directly with `cell_mixer.R` by specifying the | ||
`--format` flag. | ||
|
||
AnnData has to be generate by first generating the data in csv, then by running | ||
the `convert.py` script. | ||
|
||
```bash | ||
Rsript cell_mixer.R \ | ||
--data_path=$DATA_PATH \ | ||
--name=mixture \ | ||
--format=csv \ | ||
--b_cells=3000 \ | ||
--naive_t=3000 | ||
python3 converter.py \ | ||
--input_csv=mixture \ | ||
--format=anndata | ||
``` | ||
|
||
## Adding new data | ||
|
||
In order to add new cell types you can send a Pull Request, the files you will | ||
need to change are: | ||
|
||
- `fetch_data.sh`: to download the data | ||
- `cell_mixer.R`: add the appropriate flag, read the data, add the label, and | ||
add it to the mixtures. All the locations to modify have a comment to locate | ||
them. | ||
|
||
## Adding in new formats | ||
|
||
The R formats have to be added in the `cell_mixer.R` script, internally it uses | ||
SingleCellExperiment which is the most commonly used format. | ||
|
||
The python formats have to be added in `converter.py`. | ||
|
||
If you want more formats to be supported please open an issue or send a pull | ||
request. |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,198 @@ | ||
suppressPackageStartupMessages({ | ||
library(SingleCellExperiment) | ||
library(scater) | ||
library(scran) | ||
library(DropletUtils) | ||
library(argparse) | ||
library(loomR) | ||
library(Seurat) | ||
library(purrr) | ||
}) | ||
|
||
# create parser object | ||
parser <- ArgumentParser() | ||
|
||
# specify our desired options | ||
# by default ArgumentParser will add an help option | ||
parser$add_argument("--data_path", | ||
type = "character", required = TRUE, | ||
help = "Location of the matrices, where the script tries to read them." | ||
) | ||
parser$add_argument("--format", | ||
type = "character", required = TRUE, | ||
help = "Format to write the data in, must take a value in: 'SingleCellExperiment', 'Seurat', 'LoomR', and 'csv'." | ||
) | ||
parser$add_argument("--name", | ||
type = "character", required = TRUE, | ||
help = "Name of the dataset you want to create." | ||
) | ||
parser$add_argument("--seed", | ||
type = "integer", default = 1234, | ||
help = "Seed to use for the subsampling." | ||
) | ||
|
||
# QC related arguments. | ||
parser$add_argument("--qc_count_mad_lower", | ||
type = "integer", default = 3, | ||
help = "Threshold for filtering cells based on the number of reads. Use a negative value if you don't want to filter" | ||
) | ||
parser$add_argument("--qc_feature_count_mad_lower", | ||
type = "integer", default = 3, | ||
help = "Threshold for filtering cells based on the number of expressed genes. Use a negative value if you don't want to filter" | ||
) | ||
parser$add_argument("--qc_mito_mad_upper", | ||
type = "integer", default = 3, | ||
help = "Threshold for filtering cells based on the number of mitochondrial RNA reads. Use a negative value if you don't want to filter" | ||
) | ||
|
||
# Cell mix | ||
parser$add_argument("--b_cells", | ||
type = "integer", default = 0, | ||
help = "Number of CD19+ B cells in the mixture." | ||
) | ||
parser$add_argument("--naive_cytotoxic", | ||
type = "integer", default = 0, | ||
help = "Number of CD8+/CD45RA+ Naive Cytotoxic T Cells in the mixture," | ||
) | ||
parser$add_argument("--cd14_monocytes", | ||
type = "integer", default = 0, | ||
help = "Number of CD14+ monocytes in the mixture," | ||
) | ||
parser$add_argument("--regulatory_t", | ||
type = "integer", default = 0, | ||
help = "Number of CD4+/CD25+ Regulatory T Cells in the mixture," | ||
) | ||
parser$add_argument("--cd56_nk", | ||
type = "integer", default = 0, | ||
help = "Number of CD56+ natural killer cells in the mixture," | ||
) | ||
parser$add_argument("--cd4_t_helper", | ||
type = "integer", default = 0, | ||
help = "Number of CD4+ helper T cells in the mixture," | ||
) | ||
parser$add_argument("--memory_t", | ||
type = "integer", default = 0, | ||
help = "Number of CD4+/CD45RO+ Memory T Cells in the mixture," | ||
) | ||
parser$add_argument("--naive_t", | ||
type = "integer", default = 0, | ||
help = "Number of CD4+/CD45RA+/CD25- Naive T cells in the mixture," | ||
) | ||
# Add new data here. | ||
|
||
# Get command line options, if help option encountered print help and exit, | ||
# otherwise if options not found on command line then set defaults, | ||
args <- parser$parse_args() | ||
|
||
generate_zheng <- function(dfs, qc_count_mad_lower, qc_feature_count_mad_lower, qc_mito_mad_upper, seed = 1234) { | ||
set.seed(seed) | ||
all <- map(dfs, function(arg) { | ||
arg$df[, sample(colnames(arg$df), arg$count, replace = FALSE)] | ||
}) | ||
sce <- cbind( | ||
all$b_cells, | ||
all$naive_cytotoxic, | ||
all$cd14_monocytes, | ||
all$regulatory_t, | ||
all$cd4_t_helper, | ||
all$cd56_nk, | ||
all$memory_t, | ||
all$naive_t | ||
# Add new data here. | ||
) | ||
|
||
isSpike(sce, "ERCC") <- grepl("^ERCC", rowData(sce)$Symbol) | ||
sce <- calculateQCMetrics(sce, feature_controls = list(Mito = grepl("^MT-", rowData(sce)$Symbol))) | ||
|
||
libsize.drop <- rep(FALSE, length(colnames(sce))) | ||
if (qc_count_mad_lower >= 0) { | ||
libsize.drop <- isOutlier(sce$total_counts, nmads = qc_count_mad_lower, type = "lower", log = TRUE) | ||
} | ||
feature.drop <- rep(FALSE, length(colnames(sce))) | ||
if (qc_feature_count_mad_lower >= 0) { | ||
feature.drop <- isOutlier(sce$total_features_by_counts, nmads = qc_feature_count_mad_lower, type = "lower", log = TRUE) | ||
} | ||
mito.drop <- rep(FALSE, length(colnames(sce))) | ||
if (qc_mito_mad_upper >= 0) { | ||
mito.drop <- isOutlier(sce$pct_counts_Mito, nmads = qc_mito_mad_upper, type = "higher") | ||
} | ||
|
||
keep <- !(libsize.drop | feature.drop | mito.drop) | ||
table(keep) | ||
sce <- sce[, keep] | ||
|
||
|
||
num.cells <- nexprs(sce, byrow = TRUE) | ||
to.keep <- num.cells > 0 | ||
table(!to.keep) | ||
sce <- sce[to.keep, ] | ||
sce | ||
} | ||
|
||
|
||
# Reading the data. | ||
b_cells <- DropletUtils::read10xCounts(paste(args$data_path, "b_cells/filtered_matrices_mex/hg19", sep = "/")) | ||
b_cells$label <- "b_cells" | ||
memory_t <- DropletUtils::read10xCounts(paste(args$data_path, "memory_t/filtered_matrices_mex/hg19", sep = "/")) | ||
memory_t$label <- "memory_t" | ||
naive_t <- DropletUtils::read10xCounts(paste(args$data_path, "naive_t/filtered_matrices_mex/hg19", sep = "/")) | ||
naive_t$label <- "naive_t" | ||
cd56_nk <- DropletUtils::read10xCounts(paste(args$data_path, "cd56_nk/filtered_matrices_mex/hg19", sep = "/")) | ||
cd56_nk$label <- "cd56_nk" | ||
cd14_monocytes <- DropletUtils::read10xCounts(paste(args$data_path, "cd14_monocytes/filtered_matrices_mex/hg19", sep = "/")) | ||
cd14_monocytes$label <- "cd14_monocytes" | ||
cd4_t_helper <- DropletUtils::read10xCounts(paste(args$data_path, "cd4_t_helper/filtered_matrices_mex/hg19", sep = "/")) | ||
cd4_t_helper$label <- "cd4_t_helper" | ||
regulatory_t <- DropletUtils::read10xCounts(paste(args$data_path, "regulatory_t/filtered_matrices_mex/hg19", sep = "/")) | ||
regulatory_t$label <- "regulatory_t" | ||
naive_cytotoxic <- DropletUtils::read10xCounts(paste(args$data_path, "naive_cytotoxic/filtered_matrices_mex/hg19", sep = "/")) | ||
naive_cytotoxic$label <- "naive_cytotoxic" | ||
# Add new data here. | ||
|
||
# Naming the Cells. | ||
colnames(b_cells) <- paste0("b_cells", seq_len(ncol(b_cells))) | ||
colnames(naive_cytotoxic) <- paste0("naive_cytotoxic", seq_len(ncol(naive_cytotoxic))) | ||
colnames(cd14_monocytes) <- paste0("cd14_monocytes", seq_len(ncol(cd14_monocytes))) | ||
colnames(regulatory_t) <- paste0("regulatory_t", seq_len(ncol(regulatory_t))) | ||
colnames(cd4_t_helper) <- paste0("cd4_t_helper", seq_len(ncol(cd4_t_helper))) | ||
colnames(cd56_nk) <- paste0("cd56_nk", seq_len(ncol(cd56_nk))) | ||
colnames(memory_t) <- paste0("memory_t", seq_len(ncol(memory_t))) | ||
colnames(naive_t) <- paste0("naive_t", seq_len(ncol(naive_t))) | ||
# Add new data here. | ||
|
||
|
||
sce_args <- list( | ||
b_cells = list(df = b_cells, count = args$b_cells), | ||
naive_cytotoxic = list(df = naive_cytotoxic, count = args$naive_cytotoxic), | ||
cd14_monocytes = list(df = cd14_monocytes, count = args$cd14_monocytes), # careful, only 2.6k | ||
regulatory_t = list(df = regulatory_t, count = args$regulatory_t), | ||
cd4_t_helper = list(df = cd4_t_helper, count = args$cd4_t_helper), | ||
cd56_nk = list(df = cd56_nk, count = args$cd56_nk), | ||
memory_t = list(df = memory_t, count = args$memory_t), | ||
naive_t = list(df = naive_t, count = args$naive_t) | ||
# Add new data here. | ||
) | ||
sce <- generate_zheng(sce_args, | ||
qc_count_mad_lower = args$qc_count_mad_lower, | ||
qc_feature_count_mad_lower = args$qc_feature_count_mad_lower, | ||
qc_mito_mad_upper = args$qc_mito_mad_upper, | ||
seed = args$seed | ||
) | ||
|
||
if (args$format == "SingleCellExperiment") { | ||
saveRDS(sce, paste(args$name, "rds", sep = ".")) | ||
} | ||
if (args$format == "Seurat") { | ||
dat <- as.Seurat(sce, counts = "counts", data = "counts") | ||
saveRDS(dat, paste(args$name, "rds", sep = ".")) | ||
} | ||
if (args$format == "LoomR") { | ||
dat <- as.Seurat(sce, counts = "counts", data = "counts") | ||
loom <- as.loom(dat, filename = paste(args$name, "rds", sep = "."), verbose = TRUE, overwrite = TRUE) | ||
loom$close_all() | ||
} | ||
if (args$format == "csv") { | ||
write.csv(as.matrix(counts(sce)), paste(args$name, "counts.csv", sep = ".")) | ||
write.csv(colData(sce), paste(args$name, "metadata.csv", sep = ".")) | ||
write.csv(rowData(sce), paste(args$name, "featuredata.csv", sep = ".")) | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
# coding=utf-8 | ||
# Copyright 2020 The Google Research Authors. | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
|
||
# Lint as: python3 | ||
"""Converts data from csv to an scRNA-seq python format.""" | ||
|
||
from absl import app | ||
from absl import flags | ||
|
||
import anndata | ||
import pandas as pd | ||
|
||
FLAGS = flags.FLAGS | ||
|
||
flags.DEFINE_string('input_csv_prefix', None, | ||
'Name used during the csv generation.') | ||
flags.DEFINE_enum('format', None, ['anndata'], 'Format to convert the data to.') | ||
|
||
|
||
def csv_to_h5ad(csv_prefix): | ||
count = pd.read_csv(f'{csv_prefix}.counts.csv', index_col=0) | ||
metadata = pd.read_csv(f'{csv_prefix}.metadata.csv', index_col=0) | ||
featuredata = pd.read_csv(f'{csv_prefix}.featuredata.csv', index_col=0) | ||
adata = anndata.AnnData(X=count.transpose(), obs=metadata, var=featuredata) | ||
adata.write(f'{csv_prefix}.h5ad') | ||
|
||
|
||
def main(unused_argv): | ||
if FLAGS.format == 'anndata': | ||
csv_to_h5ad(FLAGS.input_csv_prefix) | ||
|
||
|
||
if __name__ == '__main__': | ||
flags.mark_flag_as_required('input_csv_prefix') | ||
flags.mark_flag_as_required('format') | ||
app.run(main) |
Oops, something went wrong.