Getting Started | Features | Installation | References
Efficient tools for analyzing and simulating large-scale single-cell data, aimed at understanding dynamic biological processes from snapshots of the transcriptome or proteome. The draft by Wolf, Angerer & Theis (2017) explains the conceptual ideas behind the package. Any comments are appreciated!
Download or clone the repository (green button at the top of the page) and cd into its root directory. With Python 3.5 or 3.6 (preferably Miniconda) installed, type
pip install -e .
Aside from enabling import scanpy.api as sc anywhere on your system, you can also work with the top-level command scanpy on the command line (more info on installation here).
Then go through the use cases compiled in scanpy_usage, in particular the recent additions:
- 17-05-05 | link | We reproduce parts of the recent Guided Clustering tutorial of Seurat (Macosko et al., Cell 2015).
- 17-05-03 | link | Analyzing 68 000 cells from Zheng et al. (Nat. Comms. 2017), we find that Scanpy is about 5 to 10 times faster and more memory efficient than the Cell Ranger R kit for secondary analysis. For large-scale data, this becomes crucial for interactive analysis.
- 17-05-01 | link | Diffusion Pseudotime analysis resolves developmental processes in the data of Moignard et al., Nat. Biotechn. (2015), reproducing results of Haghverdi et al., Nat. Meth. (2016). Note also that DPT has recently been discussed very favorably by the authors of Monocle.
We first give an overview of the top-level user functions of scanpy.api, followed by a few words on Scanpy's basic features and more details. For usage of the command-line interface, which is very similar to usage of the API, see this introductory example.
Scanpy user functions are grouped into the following modules:
- sc.tools - Machine learning and statistics tools. Abbreviation: sc.tl.
- sc.preprocessing - Preprocessing. Abbreviation: sc.pp.
- sc.plotting - Plotting. Abbreviation: sc.pl.
- sc.settings - Settings. Abbreviation: sc.sett.
- pp.* - Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization.
- tl.pca - PCA (Pedregosa et al., 2011).
- tl.diffmap - Diffusion Maps (Coifman et al., 2005; Haghverdi et al., 2015; Wolf et al., 2017).
- tl.tsne - t-SNE (Maaten & Hinton, 2008; Amir et al., 2013; Pedregosa et al., 2011).
- tl.spring - Force-directed graph drawing (Wikipedia; Weinreb et al., 2016).
- tl.dpt - Infer progression of cells, identify branching subgroups (Haghverdi et al., 2016; Wolf et al., 2017).
- tl.dbscan - Cluster cells into subgroups (Ester et al., 1996; Pedregosa et al., 2011).
- tl.diffrank - Rank genes according to differential expression (Wolf et al., 2017).
- tl.sim - Simulate dynamic gene expression data (Wittmann et al., 2009; Wolf et al., 2017).
The typical workflow consists of successive calls to data analysis tools of the form
sc.tl.tool(adata, **params)
where adata is an AnnData object and params is a dictionary of optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate in place and return None. If you want a copy of the AnnData object instead, pass the copy argument:
adata_copy = sc.tl.tool(adata, copy=True, **params)
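The in-place versus copy convention can be illustrated with a minimal, self-contained sketch. ToyAnnData and toy_tool are hypothetical stand-ins introduced only for this illustration; they are not part of Scanpy, and the "annotation" computed here (scaled row sums) is arbitrary.

```python
import copy as copy_module


class ToyAnnData:
    """Hypothetical stand-in for AnnData: a matrix plus a dict of annotations."""

    def __init__(self, X):
        self.X = X
        self.add = {}


def toy_tool(adata, copy=False, **params):
    """Hypothetical tool that follows the in-place / copy convention."""
    # Work on a deep copy only if the caller asks for one.
    target = copy_module.deepcopy(adata) if copy else adata
    # "Annotate" the data: store the row sums, scaled by an optional parameter.
    scale = params.get('scale', 1.0)
    target.add['row_sums'] = [scale * sum(row) for row in target.X]
    # Return the copy if one was made; otherwise return None (in-place change).
    return target if copy else None


adata = ToyAnnData([[1, 2], [3, 4]])
result = toy_tool(adata)                            # modifies adata, returns None
adata_copy = toy_tool(adata, copy=True, scale=2.0)  # adata itself is not touched
```

Returning None from the in-place call makes it hard to confuse the two modes: a caller who expects a new object gets an immediate error rather than a silently shared one.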
Instantiate AnnData via
adata = sc.AnnData(X[, smp][, var][, add])
The instance adata stores X as adata.X, sample annotation as adata.smp, variable annotation as adata.var and additional unstructured annotation as adata.add. While adata.X is array-like and adata.add is a conventional dictionary, adata.smp and adata.var are instances of a dictionary-like class, which is based on a NumPy array and requires its values to be iterables with n or d entries, respectively. Values can be retrieved and appended via adata.smp['foo_key'] and adata.var['bar_key']. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R's ExpressionSet (Huber et al., 2015).
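To make the layout concrete, here is a small sketch of a container with the same four parts and the same length requirements (n entries for sample annotation, d for variable annotation). ToyAnnData and subset_vars are hypothetical names used for illustration only; the real class is array-based and supports slicing syntax, which this sketch replaces with a plain boolean mask.

```python
import numpy as np


class ToyAnnData:
    """Hypothetical sketch of the layout described above; not Scanpy's AnnData."""

    def __init__(self, X, smp=None, var=None, add=None):
        self.X = np.asarray(X, dtype=float)   # n samples x d variables
        n, d = self.X.shape
        self.smp = dict(smp or {})            # per-sample annotation, n entries each
        self.var = dict(var or {})            # per-variable annotation, d entries each
        self.add = dict(add or {})            # unstructured annotation
        for key, values in self.smp.items():
            if len(values) != n:
                raise ValueError('smp[%r] needs %d entries' % (key, n))
        for key, values in self.var.items():
            if len(values) != d:
                raise ValueError('var[%r] needs %d entries' % (key, d))

    def subset_vars(self, keep):
        """Return a new object restricted to the columns flagged in the mask `keep`."""
        keep = np.asarray(keep, dtype=bool)
        var = {key: np.asarray(values)[keep] for key, values in self.var.items()}
        return ToyAnnData(self.X[:, keep], smp=self.smp, var=var, add=self.add)


toy = ToyAnnData(
    X=[[1, 0, 3], [0, 2, 1]],
    smp={'cell_type': ['A', 'B']},
    var={'gene_name': ['g1', 'g2', 'g3']},
)
sub = toy.subset_vars([True, False, True])   # keep the first and third gene
```

Subsetting variables has to touch X and every entry of var together, which is the main reason to keep annotation and matrix in one object.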
Instead of invoking the explicit constructor, one usually calls
adata = sc.read(filename)
to initialize an AnnData object, and sc.write(filename, adata) to write it back to a file. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Scanpy is smart about file storage and extensions: instead of providing a full filename, you can provide a file key. By default, Scanpy then writes to ./write/filekey.h5, an HDF5 file; the directory and the format are configurable by setting sc.sett.writedir and sc.sett.file_format_data.
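One plausible way to resolve a file key to a path is sketched below. This is an assumption made for illustration, not Scanpy's actual logic: resolve_write_path and KNOWN_EXTENSIONS are hypothetical names, and the defaults simply mirror the ./write/ directory and h5 format mentioned above.

```python
import os

# Extensions named in the text above; anything else is treated as a file key.
KNOWN_EXTENSIONS = ('h5', 'xlsx', 'mtx', 'txt', 'csv')


def resolve_write_path(filename_or_key, writedir='./write/', file_format='h5'):
    """Hypothetical helper: map a file key or a filename to the path to write to."""
    ext = os.path.splitext(filename_or_key)[1].lstrip('.')
    if ext in KNOWN_EXTENSIONS:
        # A full filename with a recognized extension is used unchanged.
        return filename_or_key
    # Otherwise treat the argument as a key and build the default location.
    return os.path.join(writedir, filename_or_key + '.' + file_format)


default_path = resolve_write_path('paul15')
explicit_path = resolve_write_path('data/counts.csv')
```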
For each tool, there is an associated plotting function
sc.pl.tool(adata)
that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). To save all plots to default locations instead of displaying them interactively, set sc.sett.savefigs = True.
By default, figures are saved as png to ./figs/. Set sc.sett.file_format_figs and sc.sett.figdir if you want to change this. Scanpy's plotting module is similar in spirit to Seaborn: an extension of matplotlib that handles certain frequent visualization tasks with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy because Scanpy's plotting functions usually return a matplotlib Axes object.
Show all builtin example data using sc.show_exdata() and all builtin example use cases via sc.show_examples(). Load annotated and preprocessed data using an example key, here 'paul15', via
adata = sc.get_example('paul15')
The key 'paul15' can also be used within sc.read('paul15') to read the data, and within sc.write('paul15', adata) to write the current state of the AnnData object to disk.
tl.pca - Computes the PCA representation X_pca of the data, the principal components and the variance decomposition. Uses the implementation of the scikit-learn package (Pedregosa et al., 2011).
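As a rough illustration of the three outputs named above (representation, principal components, variance decomposition), here is a NumPy-only sketch based on the singular value decomposition. Scanpy itself delegates to scikit-learn; pca_sketch is a hypothetical function written for this example, and the random input data is made up.

```python
import numpy as np


def pca_sketch(X, n_comps=2):
    """Return projected data, principal axes and explained-variance ratios."""
    X = np.asarray(X, dtype=float)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    # Coordinates of each sample along the leading axes.
    X_pca = U[:, :n_comps] * S[:n_comps]
    # Variance captured by each axis, as a fraction of the total variance.
    variances = S ** 2 / (X.shape[0] - 1)
    variance_ratio = variances[:n_comps] / variances.sum()
    return X_pca, Vt[:n_comps], variance_ratio


rng = np.random.RandomState(0)
# 100 samples, 5 variables with very different spreads.
X = rng.normal(size=(100, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
X_pca, components, variance_ratio = pca_sketch(X, n_comps=2)
```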
tl.tsne - Computes the t-SNE representation X_tsne of the data.
The algorithm was introduced by Maaten & Hinton (2008) and proposed for single-cell data by Amir et al. (2013). By default, Scanpy uses the implementation of the scikit-learn package (Pedregosa et al., 2011). You can achieve a huge speedup if you install the Multicore-TSNE package by Ulyanov (2016), which will be automatically detected by Scanpy.
tl.diffmap - Computes the diffusion maps representation X_diffmap of the data.
Diffusion maps (Coifman et al., 2005) were proposed for visualizing single-cell data by Haghverdi et al. (2015). The tool uses the adapted Gaussian kernel suggested by Haghverdi et al. (2016). The Scanpy implementation is due to Wolf et al. (2017).
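The basic construction behind diffusion maps can be sketched in a few lines of NumPy. This sketch makes simplifying assumptions: it uses a plain Gaussian kernel with a single fixed width sigma, not the adapted kernel of Haghverdi et al. (2016), and diffmap_sketch is a hypothetical function, not the Scanpy implementation.

```python
import numpy as np


def diffmap_sketch(X, sigma=1.0, n_comps=3):
    """Leading eigenvalues and eigenvectors of a Gaussian-kernel Markov matrix."""
    X = np.asarray(X, dtype=float)
    # Pairwise squared Euclidean distances, clipped to guard against round-off.
    sq_norms = (X ** 2).sum(axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    sq_dists = np.maximum(sq_dists, 0.0)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    # Symmetric normalization D^-1/2 K D^-1/2; it has the same eigenvalues as
    # the row-normalized transition matrix D^-1 K but allows the use of eigh.
    d = K.sum(axis=1)
    T_sym = K / np.sqrt(np.outer(d, d))
    evals, evecs = np.linalg.eigh(T_sym)
    order = np.argsort(evals)[::-1][:n_comps]
    evals, evecs = evals[order], evecs[:, order]
    # Rescale to right eigenvectors of D^-1 K; the first one is constant.
    X_diffmap = evecs / np.sqrt(d)[:, None]
    return evals, X_diffmap


rng = np.random.RandomState(0)
X = rng.normal(size=(50, 3))
evals, X_diffmap = diffmap_sketch(X, sigma=1.0, n_comps=3)
```

The first eigenvalue equals 1 and its eigenvector is constant, so it carries no information; the following components are the ones used as coordinates.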
Beta version.
tl.spring - Force-directed graph drawing is a long-established class of algorithms for visualizing graphs, see Wikipedia. It has been suggested for visualizing single-cell data by Weinreb et al. (2016).
Here, the Fruchterman & Reingold (1991) algorithm is used. The implementation uses elements of the NetworkX implementation (Hagberg et al., 2008).
tl.dpt - Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis was introduced by Haghverdi et al. (2016) and implemented for Scanpy by Wolf et al. (2017).
The functionality of diffmap and dpt is comparable to that of the R package destiny of Angerer et al. (2015).
Examples: See one of the early examples [notebook/command line] dealing with the data of Moignard et al., Nat. Biotechn. (2015).
tl.dbscan - Cluster cells using DBSCAN (Ester et al., 1996), in the implementation of scikit-learn (Pedregosa et al., 2011).
This is a very simple clustering method. A better one, in the same framework as DPT and Diffusion Maps, will come soon.
tl.diffrank - Rank genes by differential expression.
tl.sim - Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by Wittmann et al. (2009). The Scanpy implementation is due to Wolf et al. (2017).
The tool is comparable to the Matlab tool Odefy of Krumsiek et al. (2010).
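The idea of turning a boolean network into a stochastic differential equation can be sketched with a toy model. Everything below is an assumption chosen for illustration: a two-gene network with the rules "A = NOT B" and "B = NOT A", a Hill function as the continuous replacement for NOT, Euler-Maruyama integration, and made-up parameter values. It is not the model or the code used by Scanpy.

```python
import math
import random


def hill_repression(x, k=0.5, n=4):
    """Continuous stand-in for boolean NOT: near 1 for small x, near 0 for large x."""
    return k ** n / (k ** n + x ** n)


def simulate_toggle(n_steps=500, dt=0.05, noise=0.05, seed=0):
    """Euler-Maruyama integration of a hypothetical two-gene mutual-repression model."""
    rng = random.Random(seed)
    a, b = 0.6, 0.4   # initial expression levels
    trajectory = [(a, b)]
    for _ in range(n_steps):
        # Drift: each gene relaxes toward the Hill-transformed boolean rule.
        drift_a = hill_repression(b) - a
        drift_b = hill_repression(a) - b
        # Diffusion: Gaussian increments scaled by sqrt(dt).
        a_new = a + dt * drift_a + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        b_new = b + dt * drift_b + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Keep expression levels within [0, 1].
        a = min(max(a_new, 0.0), 1.0)
        b = min(max(b_new, 0.0), 1.0)
        trajectory.append((a, b))
    return trajectory


trajectory = simulate_toggle()
```

Each point of the trajectory is one simulated expression state, so a set of such trajectories plays the role of the snapshot data that the other tools analyze.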
If you do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda (see below).
Then, download or clone the repository (green button at the top of the page) and cd into its root directory. To install with symbolic links (so that the installation stays up to date with your cloned version after you update with git pull), call
pip install -e .
and work with the top-level command scanpy or
import scanpy as sc
in any directory.
After downloading Miniconda, in a unix shell (Linux, Mac), run
cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh
and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.
The package is registered in the Python Package Index, but versioning has not started yet. In the future, installation will also be possible without reference to GitHub via pip install scanpy.
Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology 31, 545.
Angerer et al. (2015), destiny - diffusion maps for large-scale single-cell data in R, Bioinformatics 32, 1241.
Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 7426.
Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231.
Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics 31, 2989.
Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods 13, 845.
Krumsiek et al. (2010), Odefy - From discrete to continuous models, BMC Bioinformatics 11, 233.
Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE 6, e22649.
Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR 9, 2579.
Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology 33, 269.
Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR 12, 2825.
Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell 163, 1663.
Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology 3, 98.