Getting Started | Features | Installation | References

Scanpy - Single-Cell Analysis in Python

Scanpy provides efficient tools for analyzing and simulating large-scale single-cell data, with the aim of understanding dynamic biological processes from snapshots of the transcriptome or proteome. The draft by Wolf, Angerer & Theis (2017) explains the conceptual ideas behind the package. Any comments are appreciated!

Getting started

Download or clone the repository - green button on top of the page - and cd into its root directory. With Python 3.5 or 3.6 (preferably Miniconda) installed, type

pip install -e .

This enables import scanpy.api as sc anywhere on your system and also installs the top-level command scanpy for use on the command line (see Installation below for details).

Then go through the use cases compiled in scanpy_usage.

Features

We first give an overview of the top-level user functions of scanpy.api, followed by a few words on Scanpy's basic features and more details. The command-line interface is used in much the same way as the API; see this introductory example.

Overview

Scanpy user functions are grouped into the following modules

Preprocessing

  • pp.* - Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization.

Visualization

Branching trajectories and pseudotime, clustering, differential expression

Simulation

Basic features

The typical workflow consists of subsequent calls of data analysis tools of the form

sc.tl.tool(adata, **params)

where adata is an AnnData object and params is a dictionary of optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n gene expression measurements of dimension d. By default, Scanpy tools operate in place and return None. To work on a copy of the AnnData object instead, pass copy=True

adata_copy = sc.tl.tool(adata, copy=True, **params)
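The in-place versus copy convention can be illustrated without Scanpy itself. The following is a minimal sketch: toy_tool and the key 'toy_result' are hypothetical names, and a plain dictionary stands in for the AnnData object.

```python
import copy as copy_module


def toy_tool(adata, copy=False, **params):
    """Hypothetical tool: annotate `adata` in place, or a copy if copy=True."""
    target = copy_module.deepcopy(adata) if copy else adata
    # Store a result under an illustrative key; real tools choose their own.
    target["toy_result"] = params.get("scale", 1.0) * len(target["X"])
    return target if copy else None


adata = {"X": [[1.0, 2.0], [3.0, 4.0]]}  # stand-in for an AnnData object

result = toy_tool(adata, scale=2.0)                  # in place, returns None
adata_copy = toy_tool(adata, copy=True, scale=3.0)   # original is untouched
```

After these calls, result is None and adata carries the annotation of the first call only, while adata_copy is an independent object with the annotation of the second call.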

AnnData objects

Instantiate AnnData via

adata = sc.AnnData(X[, smp][, var][, add])

The instance adata stores X as adata.X, sample annotation as adata.smp, variable annotation as adata.var and additional unstructured annotation as adata.add. While adata.X is array-like and adata.add is a conventional dictionary, adata.smp and adata.var are instances of a dictionary-like class that is based on a NumPy array and requires its values to be iterables with n or d entries, respectively. Values can be retrieved and appended via adata.smp['foo_key'] and adata.var['bar_key']. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like pandas DataFrames, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R's ExpressionSet (Huber et al., 2015).
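To make the layout concrete, here is a simplified pure-Python sketch of such a container. ToyAnnData and select_vars are hypothetical names and not part of Scanpy; the sketch only mirrors the rule that sample annotation needs n entries and variable annotation needs d entries, and it uses a method instead of slicing syntax.

```python
class ToyAnnData:
    """Hypothetical, simplified stand-in for AnnData (not Scanpy's class)."""

    def __init__(self, X, smp=None, var=None, add=None):
        self.X = [list(row) for row in X]
        n = len(self.X)
        d = len(self.X[0]) if n else 0
        self.smp = dict(smp or {})
        self.var = dict(var or {})
        self.add = dict(add or {})
        # Per-sample annotation needs n entries, per-variable annotation d.
        for key, values in self.smp.items():
            if len(values) != n:
                raise ValueError(f"smp[{key!r}] needs {n} entries")
        for key, values in self.var.items():
            if len(values) != d:
                raise ValueError(f"var[{key!r}] needs {d} entries")

    def select_vars(self, names):
        """Return a new object restricted to the named variables (columns)."""
        cols = [self.var["names"].index(name) for name in names]
        X = [[row[j] for j in cols] for row in self.X]
        var = {k: [v[j] for j in cols] for k, v in self.var.items()}
        return ToyAnnData(X, self.smp, var, self.add)


adata = ToyAnnData(
    X=[[1, 0, 3], [0, 2, 5]],
    smp={"names": ["cell_a", "cell_b"]},
    var={"names": ["Gata1", "Cebpa", "Spi1"]},
)
subset = adata.select_vars(["Spi1", "Gata1"])
```

Here subset keeps both cells but only the two requested genes, in the requested order, and its variable annotation is reduced accordingly.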

Reading and writing data files and AnnData objects

Instead of invoking the explicit constructor, one usually calls

adata = sc.read(filename)

to initialize an AnnData object, and sc.write(filename, adata) to write it back to a file. Reading supports files with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a full filename, you can also provide a file key. By default, Scanpy then writes to ./write/filekey.h5, an HDF5 file; both the directory and the format are configurable via sc.sett.writedir and sc.sett.file_format_data.
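How a file key might be turned into a path can be sketched as follows. This is an illustration under assumptions, not Scanpy's actual logic: resolve_filename and KNOWN_EXTENSIONS are hypothetical, and the defaults simply mirror ./write/ and h5 from the text above.

```python
from pathlib import Path

# Extensions named in the text; the real list is longer ("and others").
KNOWN_EXTENSIONS = {".h5", ".xlsx", ".mtx", ".txt", ".csv"}


def resolve_filename(filename_or_key, writedir="./write/", file_format="h5"):
    """Hypothetical helper: map a file key to a path under `writedir`."""
    path = Path(filename_or_key)
    if path.suffix in KNOWN_EXTENSIONS:
        return path  # already a full filename, use it unchanged
    return Path(writedir) / f"{filename_or_key}.{file_format}"


key_path = resolve_filename("paul15")             # key -> default location
full_path = resolve_filename("data/counts.csv")   # full filename is kept
```

With this scheme, a bare key such as 'paul15' always lands in the configured write directory, whereas a name with a recognized extension is taken literally.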

Plotting

For each tool, there is an associated plotting function

sc.pl.tool(adata)

that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). To save all plots to default locations instead of displaying them interactively, set sc.sett.savefigs = True. By default, figures are saved as png to ./figs/. Change sc.sett.file_format_figs and sc.sett.figdir to alter this. Like Seaborn, Scanpy's plotting module is an extension of matplotlib that handles frequent visualization tasks with one-line commands. Detailed configuration is done via matplotlib functions, which is easy because Scanpy's plotting functions usually return a matplotlib.axes.Axes object.

Builtin examples

Show all builtin example data using sc.show_exdata() and all builtin example use cases via sc.show_examples(). Load annotated and preprocessed data using an example key, here 'paul15', via

adata = sc.get_example('paul15')

The key 'paul15' can also be used within sc.read('paul15') and sc.write('paul15', adata) to write the current state of the AnnData object to disk.

Visualization

pca

[source] Computes the PCA representation X_pca of data, principal components and variance decomposition. Uses the implementation of the scikit-learn package (Pedregosa et al., 2011).

tsne

[source] Computes the tSNE representation X_tsne of data.

The algorithm was introduced by Maaten & Hinton (2008) and proposed for single-cell data by Amir et al. (2013). By default, Scanpy uses the implementation of the scikit-learn package (Pedregosa et al., 2011). You can achieve a large speedup by installing the Multicore-TSNE package by Ulyanov (2016), which Scanpy detects automatically.

diffmap

[source] Computes the diffusion maps representation X_diffmap of data.

Diffusion maps (Coifman et al., 2005) were proposed for visualizing single-cell data by Haghverdi et al. (2015). The tool uses the adapted Gaussian kernel suggested by Haghverdi et al. (2016). The Scanpy implementation is due to Wolf et al. (2017).
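The starting point of a diffusion map is a Markov transition matrix built from a kernel on pairwise distances; the map coordinates are then obtained from its eigenvectors. The sketch below shows only that first step and assumes a plain Gaussian kernel with one fixed width sigma, not the adapted kernel of Haghverdi et al. (2016). The function name transition_matrix is hypothetical.

```python
import math


def transition_matrix(points, sigma=1.0):
    """Row-normalized Gaussian kernel over all pairs of points."""
    n = len(points)
    kernel = [
        [math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
         for j in range(n)]
        for i in range(n)
    ]
    # Dividing each row by its sum turns similarities into probabilities.
    return [[k / sum(row) for k in row] for row in kernel]


cells = [(0.0, 0.0), (0.1, 0.0), (3.0, 0.0)]  # two close cells, one far away
T = transition_matrix(cells)
```

Each row of T sums to one, and a random walk starting at the first cell is far more likely to step to its close neighbor than to the distant cell.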

spring

Beta version.

[source] Force-directed graph drawing is a long-established algorithm for visualizing graphs, see Wikipedia. It has been suggested for visualizing single-cell data by Weinreb et al. (2016).

Here, the Fruchterman & Reingold (1991) algorithm is used. The implementation uses elements of the NetworkX implementation (Hagberg et al., 2008).

Discrete clustering of subgroups and continuous progression through subgroups

dpt

[source] Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by Haghverdi et al. (2016) and implemented for Scanpy by Wolf et al. (2017).

The functionality of diffmap and dpt compares to the R package destiny of Angerer et al. (2015).

Examples: See one of the early examples [notebook/command line] dealing with data of Moignard et al., Nat. Biotechn. (2015).

dbscan

[source] Cluster cells using DBSCAN (Ester et al., 1996), in the implementation of scikit-learn (Pedregosa et al., 2011).

This is a very simple clustering method. A better one - in the same framework as DPT and Diffusion Maps - will come soon.
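For readers unfamiliar with DBSCAN, the algorithm itself fits in a few lines. The following is a pure-Python sketch for illustration only; Scanpy delegates to scikit-learn, and the parameter names eps and min_samples are borrowed from that implementation. A point counts as its own neighbor here.

```python
import math

NOISE = -1


def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE (-1)."""
    def neighbours(i):
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = NOISE           # may later become a border point
            continue
        cluster += 1                    # i is a core point: start a cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster     # border point reached from a core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
# labels == [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, and the isolated point is marked as noise.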

Differential expression

diffrank

[source] Rank genes by differential expression.
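As an illustration of what ranking genes by differential expression can mean, the sketch below scores each gene by the difference of group means divided by a Welch-type standard error. This scoring is an assumption made for the example; the text above does not state which statistic diffrank uses, and rank_genes is a hypothetical name.

```python
from statistics import mean, variance


def rank_genes(group_a, group_b, gene_names):
    """Sort genes by the absolute value of a Welch-type score.

    The score is an assumption for illustration, not diffrank's statistic.
    """
    scores = {}
    for j, gene in enumerate(gene_names):
        a = [cell[j] for cell in group_a]
        b = [cell[j] for cell in group_b]
        spread = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
        scores[gene] = (mean(a) - mean(b)) / spread if spread > 0 else 0.0
    return sorted(gene_names, key=lambda g: abs(scores[g]), reverse=True)


genes = ["Gata1", "Cebpa", "Spi1"]
group_a = [[5.0, 1.0, 2.0], [6.0, 1.2, 2.1], [5.5, 0.9, 1.9]]   # cells x genes
group_b = [[1.0, 1.1, 3.0], [1.5, 1.0, 3.2], [0.8, 1.05, 2.8]]
ranking = rank_genes(group_a, group_b, genes)
# ranking == ["Gata1", "Spi1", "Cebpa"]
```

Genes whose means differ strongly relative to their within-group variation come first; Cebpa, which is nearly identical in both groups, comes last.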

Simulation

sim

[source] Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by Wittmann et al. (2009). The Scanpy implementation is due to Wolf et al. (2017).

The tool compares to the Matlab tool Odefy of Krumsiek et al. (2010).
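The general idea of sampling from a stochastic differential equation derived from Boolean rules can be sketched with a toy two-gene toggle switch. Everything specific here is an assumption for illustration: the network, the Hill-type function that smooths the Boolean NOT, the parameter values and the Euler-Maruyama integration scheme. It is not the model or the code used by sim.

```python
import random


def hill(x, k=0.5, n=4):
    """Smooth, sigmoidal stand-in for a Boolean 'on' input."""
    return x ** n / (x ** n + k ** n)


def simulate(steps=2000, dt=0.01, noise=0.05, seed=0):
    """Euler-Maruyama integration of a toy two-gene toggle switch."""
    rng = random.Random(seed)
    x, y = 0.6, 0.4                      # initial expression levels
    path = [(x, y)]
    for _ in range(steps):
        # Boolean rules "x = NOT y" and "y = NOT x", made continuous.
        dx = (1.0 - hill(y)) - x
        dy = (1.0 - hill(x)) - y
        x += dx * dt + noise * dt ** 0.5 * rng.gauss(0.0, 1.0)
        y += dy * dt + noise * dt ** 0.5 * rng.gauss(0.0, 1.0)
        # Clip to [0, 1] so that expression levels stay in a valid range.
        x, y = min(max(x, 0.0), 1.0), min(max(y, 0.0), 1.0)
        path.append((x, y))
    return path


path = simulate(steps=200)
```

Each entry of path is one simulated expression state, so the list as a whole plays the role of a snapshot data set with known dynamics. Fixing the seed makes a run reproducible.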

Installation

If you do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda (see below).

Then, download or clone the repository - green button on top of the page - and cd into its root directory. To install with symbolic links (stay up to date with your cloned version after you update with git pull) call

pip install -e .

and work with the top-level command scanpy or

import scanpy.api as sc

in any directory.

Installing Miniconda

After downloading Miniconda, in a unix shell (Linux, Mac), run

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

PyPI

The package is registered in the Python Package Index, but versioning has not started yet. In the future, installation will also be possible without reference to GitHub via pip install scanpy.

References

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology 31, 545.

Angerer et al. (2015), destiny - diffusion maps for large-scale single-cell data in R, Bioinformatics 32, 1241.

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 7426.

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231.

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics 31, 2989.

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods 13, 845.

Krumsiek et al. (2010), Odefy - From discrete to continuous models, BMC Bioinformatics 11, 233.

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE 6, e22649.

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR 9, 2579.

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology 33, 269.

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR 12, 2825.

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell 163, 1663.

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology 3, 98.