- The scCello pretraining dataset is processed from the CellxGene Census LTS release 2023-07-25. We select all primary data from non-cancer human cells sequenced with 10x protocols. See App. B (Data Preprocessing Details) of the paper for details.
- token_vocabulary/token_dictionary.pkl: We use Geneformer's gene vocabulary. The vocabulary has 25424 gene Ensembl IDs, plus 3 special tokens "pad", "mask" and "cls" (total vocab size 25427).
- Matching Ensembl IDs with gene names using biomart:
  - Matched names: 25137 genes (token_vocabulary/vocab_id2name.csv)
  - Unmatched names: 291 genes (token_vocabulary/vocab_ids_notFoundName.csv)
- token_vocabulary/gene_median_dictionary.pkl: Non-zero median expression value of each detected gene across all cells, used for Geneformer-like gene-wise normalization.
- new_pretrain/general_CLid2cellname.pkl: Associates textual cell types used in pre-training with their cell type lineage ID (CLID).
- new_pretrain/pretrain_frac100_clid2name.pkl: Maps CLID to cell type label indices used in pre-training.
- new_pretrain/pretrain_frac100_cell_type_idmap.pkl: Associates textual cell types used in pre-training with their cell type label indices. Note that this file is not consistent with new_pretrain/general_CLid2cellname.pkl and new_pretrain/pretrain_frac100_clid2name.pkl. Its dict keys are used for the correct dict mapping, which can be obtained from get_prestored_data in sccello/src/utils/data_loading.py.
- cell_taxonomy/cl.owl: Cell ontology graph obtained from Cell Ontology.
- cell_taxonomy/celltype_relationship.json: A simpler version of the cell ontology that adopts a tree structure for subclass relationships, obtained from the authors of Cell Taxonomy. Note that we use this file only to associate textual cell types with their cell type lineage ID (CLID), since we are using the graph version of the cell ontology.
- marker_gene/cellmarkers.tsv: All cell types with their marker genes obtained from Cell Marker and PanglaoDB.
- marker_gene/celllabel2typesquality.csv: Aligns cell labels provided in downstream datasets to cell types.
- Adapted from DeepCDR repo.
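
To illustrate how token_vocabulary/token_dictionary.pkl and token_vocabulary/gene_median_dictionary.pkl might be used together, here is a minimal sketch of Geneformer-like rank encoding: counts are normalized per cell, divided by each gene's non-zero median, and the expressed genes are then ordered by the resulting value. It is a simplified illustration, not the scCello implementation. The function name `rank_encode`, the `max_len` parameter, the toy gene IDs, and the assumption that both pickles are plain dicts keyed by Ensembl ID are all hypothetical; small in-memory dicts stand in for the real files.

```python
import numpy as np

# Hypothetical stand-ins for the two pickled dictionaries (assumed layout:
# Ensembl ID -> token index, and Ensembl ID -> non-zero median expression).
token_dictionary = {"pad": 0, "mask": 1, "cls": 2,
                    "ENSG_A": 3, "ENSG_B": 4, "ENSG_C": 5}
gene_medians = {"ENSG_A": 10.0, "ENSG_B": 1.0, "ENSG_C": 2.0}


def rank_encode(counts, gene_ids, token_dictionary, gene_medians, max_len=2048):
    """Return token indices of expressed genes, highest normalized value first."""
    counts = np.asarray(counts, dtype=float)
    total = counts.sum()
    # Keep genes that are expressed and known to both dictionaries.
    keep = [i for i, g in enumerate(gene_ids)
            if counts[i] > 0 and g in token_dictionary and g in gene_medians]
    if not keep or total == 0:
        return []
    # Per-cell normalization followed by gene-wise median scaling.
    norm = np.array([counts[i] / total / gene_medians[gene_ids[i]] for i in keep])
    # Stable sort in descending order of normalized expression.
    order = np.argsort(-norm, kind="stable")
    return [token_dictionary[gene_ids[keep[j]]] for j in order[:max_len]]


# Toy cell: ENSG_C is not expressed, ENSG_X is outside the vocabulary.
gene_ids = ["ENSG_A", "ENSG_B", "ENSG_C", "ENSG_X"]
tokens = rank_encode([10, 2, 0, 5], gene_ids, token_dictionary, gene_medians)
print(tokens)  # [4, 3]
```

In this toy cell ENSG_B ranks ahead of ENSG_A even though its raw count is lower, because its median across cells is much smaller. That is the effect median scaling is meant to have: genes that are highly expressed everywhere are pushed down the ranking.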