
Single-Cell Analysis in Python. Scales to very high cell numbers.

yanyanzou0721/scanpy

Getting Started | Features | Installation | References

Scanpy - Single-Cell Analysis in Python

Efficient tools for analyzing and simulating large-scale single-cell data that aim at an understanding of dynamic biological processes from snapshots of transcriptome or proteome. The draft Wolf, Angerer & Theis (2017) explains conceptual ideas of the package. Any comments are appreciated!

Getting started

Download or clone the repository - green button on top of the page - and cd into its root directory. With Python 3.5 or 3.6 (preferably Miniconda) installed, type

pip install -e .

Aside from enabling import scanpy.api as sc anywhere on your system, you can also work with the top-level command scanpy on the command-line (more info on installation here).

Then go through the use cases compiled in scanpy_usage.

Features

We first give an overview of the top-level user functions of scanpy.api, followed by a few words on Scanpy's basic features and more details. For usage of the command-line interface, which parallels usage of the API, see this introductory example.

Overview

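Scanpy user functions are grouped into the following modules:

Preprocessing

  • pp.* - Filtering of highly-variable genes, batch-effect correction, per-cell (UMI) normalization.

Visualization

  • tl.pca - PCA (Pedregosa et al., 2011).
  • tl.tsne - tSNE (Maaten & Hinton, 2008; Amir et al., 2013).
  • tl.diffmap - Diffusion maps (Coifman et al., 2005; Haghverdi et al., 2015).
  • tl.spring - Force-directed graph drawing (Weinreb et al., 2016).

Branching trajectories and pseudotime, clustering, differential expression

  • tl.dpt - Diffusion Pseudotime analysis (Haghverdi et al., 2016).
  • tl.dbscan - Clustering with DBSCAN (Ester et al., 1996).
  • tl.diffrank - Ranking of genes by differential expression.

Simulation

  • tl.sim - Simulation of stochastic differential equation models (Wittmann et al., 2009).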

Basic features

The typical workflow consists of subsequent calls of data analysis tools of the form

sc.tl.tool(adata, **params)

where adata is an AnnData object and params is a dictionary that stores optional parameters. Each of these calls adds annotation to an expression matrix X, which stores n d-dimensional gene expression measurements. By default, Scanpy tools operate in place and return None. If you want to work on a copy of the AnnData object instead, pass the copy argument:

adata_copy = sc.tl.tool(adata, copy=True, **params)
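
The in-place/copy convention can be illustrated without Scanpy itself. The following is only a sketch: ToyAnnData and toy_tool are hypothetical stand-ins invented for this example, not part of the Scanpy API, and the "tool" merely stores a scaled row sum per sample.

```python
from copy import deepcopy


class ToyAnnData:
    """Hypothetical stand-in for AnnData: a data matrix plus dict-like annotation."""

    def __init__(self, X):
        self.X = X      # n samples x d variables, stored as a list of rows
        self.smp = {}   # per-sample annotation


def toy_tool(adata, copy=False, scale=1.0):
    """Annotate every sample with its scaled row sum (illustrative only)."""
    # Work on a deep copy only if the caller asked for one.
    target = deepcopy(adata) if copy else adata
    target.smp['row_sum'] = [scale * sum(row) for row in target.X]
    # Mirror the convention described above: None when operating in place.
    return target if copy else None


adata = ToyAnnData([[1.0, 2.0], [3.0, 4.0]])
result = toy_tool(adata)                            # annotates adata in place
adata_copy = toy_tool(adata, copy=True, scale=2.0)  # leaves adata untouched
```

After these calls, result is None, adata carries the unscaled annotation and adata_copy carries the scaled one.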

Reading and writing data files and AnnData objects

One usually calls

adata = sc.read(filename)

to initialize an AnnData object, possibly adds further annotation, e.g., by

import numpy as np
annotation = np.genfromtxt(filename_annotation, dtype=str)  # read every column as str
adata.smp['cell_groups'] = annotation[:, 2]                # categorical annotation of type str
adata.smp['time'] = annotation[:, 3].astype(float)         # numerical annotation of type float

and uses

sc.write(filename, adata)

to save adata to a file. Reading supports filenames with the extensions h5, xlsx, mtx, txt, csv and others. Writing supports h5, csv and txt. Instead of providing a filename, you can provide a filekey, i.e., any string that does not end in a valid file extension. By default, Scanpy then writes to ./write/filekey.h5, an HDF5 file; both the directory and the format are configurable by setting sc.settings.writedir and sc.settings.file_format_data.
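
The distinction between a filename and a filekey can be sketched as follows. This is not Scanpy's code: resolve_write_path is a hypothetical helper, and the list of extensions only contains the ones named above (the actual list may be longer).

```python
import os

# Extensions named in the text above; "and others" are not covered here.
KNOWN_EXTENSIONS = ('h5', 'xlsx', 'mtx', 'txt', 'csv')


def resolve_write_path(filename_or_filekey, writedir='./write/', file_format='h5'):
    """Return the path that would be written to (hypothetical helper)."""
    extension = os.path.splitext(filename_or_filekey)[1].lstrip('.')
    if extension in KNOWN_EXTENSIONS:
        # A proper filename is used as is.
        return filename_or_filekey
    # Anything else is treated as a filekey and placed in the write directory.
    return os.path.join(writedir, filename_or_filekey + '.' + file_format)


path_from_filename = resolve_write_path('data/results.csv')
path_from_filekey = resolve_write_path('my_analysis')
```

Here 'my_analysis' is an arbitrary filekey chosen for illustration; the defaults of writedir and file_format mimic the default location described above.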

AnnData objects

An AnnData instance stores an array-like data matrix as adata.X, dict-like sample annotation as adata.smp, dict-like variable annotation as adata.var and additional unstructured dict-like annotation as adata.add. While adata.add is a conventional dictionary, adata.smp and adata.var are instances of a low-level Pandas dataframe-like class. Values can be retrieved and appended via adata.smp[key] and adata.var[key]. Sample and variable names can be accessed via adata.smp_names and adata.var_names, respectively. AnnData objects can be sliced like Pandas dataframes, for example, adata = adata[:, list_of_gene_names]. The AnnData class is similar to R's ExpressionSet (Huber et al., 2015); the latter, however, is not implemented for sparse data.
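
To make the layout concrete, here is a minimal pure-Python sketch of such a container. MiniAnnData is a hypothetical class written for this example; it uses plain lists and dicts rather than the array and dataframe-like classes of the real AnnData, and it supports only slicing by variable names.

```python
class MiniAnnData:
    """Hypothetical, simplified stand-in for AnnData (illustration only)."""

    def __init__(self, X, smp_names, var_names):
        self.X = [list(row) for row in X]   # samples x variables
        self.smp_names = list(smp_names)
        self.var_names = list(var_names)
        self.smp = {}   # sample annotation: key -> one value per sample
        self.var = {}   # variable annotation: key -> one value per variable
        self.add = {}   # unstructured annotation

    def __getitem__(self, index):
        """Support adata[:, list_of_gene_names] only."""
        rows, names = index
        if rows != slice(None):
            raise NotImplementedError('this sketch only slices variables')
        cols = [self.var_names.index(name) for name in names]
        sub = MiniAnnData([[row[j] for j in cols] for row in self.X],
                          self.smp_names, names)
        # Sample annotation is kept, variable annotation is subset accordingly.
        sub.smp = {key: list(values) for key, values in self.smp.items()}
        sub.var = {key: [values[j] for j in cols] for key, values in self.var.items()}
        sub.add = dict(self.add)
        return sub


adata = MiniAnnData([[1, 2, 3], [4, 5, 6]],
                    smp_names=['cell_0', 'cell_1'],
                    var_names=['gene_a', 'gene_b', 'gene_c'])
adata.smp['time'] = [0.0, 1.5]
adata.var['is_marker'] = [True, False, True]
subset = adata[:, ['gene_c', 'gene_a']]
```

The gene and cell names are made up. The point is that slicing along variables carries the matching columns of X and the matching entries of var along, while smp stays aligned with the samples.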

Plotting

For each tool, there is an associated plotting function

sc.pl.tool(adata)

that retrieves and plots the elements of adata that were previously written by sc.tl.tool(adata). To save all plots to default locations instead of displaying figures interactively, set sc.settings.savefigs = True. By default, figures are saved to ./figs/. Set sc.settings.file_format_figs and sc.settings.figdir if you want to change this. Scanpy's plotting module follows an approach similar to Seaborn's: it extends matplotlib to enable certain complicated visualizations with one-line commands. Detailed configuration has to be done via matplotlib functions, which is easy, as Scanpy's plotting functions take and return matplotlib Axes objects.

Visualization

pca

[source] Computes the PCA representation X_pca of data, principal components and variance decomposition. Uses the implementation of the scikit-learn package (Pedregosa et al., 2011).

tsne

[source] Computes the tSNE representation X_tsne of data.

The algorithm was introduced by Maaten & Hinton (2008) and proposed for single-cell data by Amir et al. (2013). By default, Scanpy uses the implementation of the scikit-learn package (Pedregosa et al., 2011). You can achieve a large speedup by installing the Multicore-TSNE package by Ulyanov (2016), which Scanpy detects automatically.

diffmap

[source] Computes the diffusion maps representation X_diffmap of data.

Diffusion maps (Coifman et al., 2005) were proposed for visualizing single-cell data by Haghverdi et al. (2015). The tool uses the adapted Gaussian kernel suggested by Haghverdi et al. (2016). The Scanpy implementation is due to Wolf et al. (2017).

spring

Beta version.

[source] Force-directed graph drawing is a long-established algorithm for visualizing graphs, see Wikipedia. It has been suggested for visualizing single-cell data by Weinreb et al. (2016).

Here, the Fruchterman & Reingold (1991) algorithm is used. The implementation uses elements of the NetworkX implementation (Hagberg et al., 2008).
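
For readers unfamiliar with the algorithm, the following is a compact pure-Python sketch of Fruchterman-Reingold iterations. It is not the Scanpy or NetworkX code; the function name, the parameter defaults and the linear cooling schedule are choices made for this illustration.

```python
import math
import random


def fruchterman_reingold(n_nodes, edges, n_iter=50, seed=0):
    """Return 2D positions for the nodes of a graph (illustrative sketch)."""
    rng = random.Random(seed)
    pos = [[rng.random(), rng.random()] for _ in range(n_nodes)]
    k = math.sqrt(1.0 / n_nodes)          # ideal distance in a unit-area frame
    temperature = 0.1                     # maximal displacement per iteration
    cooling = temperature / (n_iter + 1)  # linear cooling, stays positive
    for _ in range(n_iter):
        disp = [[0.0, 0.0] for _ in range(n_nodes)]
        # Repulsion between all pairs of nodes: magnitude k^2 / distance.
        for i in range(n_nodes):
            for j in range(i + 1, n_nodes):
                dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
                dist = max(math.hypot(dx, dy), 1e-9)
                force = k * k / dist
                disp[i][0] += dx / dist * force
                disp[i][1] += dy / dist * force
                disp[j][0] -= dx / dist * force
                disp[j][1] -= dy / dist * force
        # Attraction along edges: magnitude distance^2 / k.
        for i, j in edges:
            dx, dy = pos[i][0] - pos[j][0], pos[i][1] - pos[j][1]
            dist = max(math.hypot(dx, dy), 1e-9)
            force = dist * dist / k
            disp[i][0] -= dx / dist * force
            disp[i][1] -= dy / dist * force
            disp[j][0] += dx / dist * force
            disp[j][1] += dy / dist * force
        # Move every node along its net force, capped by the temperature.
        for i in range(n_nodes):
            length = max(math.hypot(disp[i][0], disp[i][1]), 1e-9)
            step = min(length, temperature)
            pos[i][0] += disp[i][0] / length * step
            pos[i][1] += disp[i][1] / length * step
        temperature -= cooling
    return pos


# A chain of four nodes: 0 - 1 - 2 - 3.
layout = fruchterman_reingold(4, [(0, 1), (1, 2), (2, 3)])
```

For single-cell data, the nodes would be cells and the edges would come from a neighborhood graph; constructing that graph is outside the scope of this sketch.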

Discrete clustering of subgroups and continuous progression through subgroups

dpt

[source] Reconstruct the progression of a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis has been introduced by Haghverdi et al. (2016) and implemented for Scanpy by Wolf et al. (2017).

The functionality of diffmap and dpt is comparable to that of the R package destiny of Angerer et al. (2015), but the tools run faster and scale to much higher cell numbers.

Examples: See one of the early examples [notebook/command line] dealing with data of Moignard et al., Nat. Biotechn. (2015).

dbscan

[source] Cluster cells using DBSCAN (Ester et al., 1996), in the implementation of scikit-learn (Pedregosa et al., 2011).

This is a very simple clustering method. A better one - in the same framework as DPT and Diffusion Maps - will come soon.
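
To show how simple the method is, here is a small pure-Python sketch of the DBSCAN idea. It is a brute-force illustration, not the scikit-learn implementation that Scanpy uses; the parameter names eps and min_samples follow scikit-learn's naming, and the example points are made up.

```python
import math


def dbscan(points, eps, min_samples):
    """Return one cluster label per point; -1 marks noise (illustrative sketch)."""
    n = len(points)

    def neighbors(i):
        # Brute-force neighborhood query; the point itself is included.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    labels = [None] * n
    cluster = -1
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_samples:
            labels[i] = -1            # provisionally noise
            continue
        cluster += 1                  # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster   # former noise becomes a border point
            if labels[j] is not None:
                continue              # already assigned, do not expand again
            labels[j] = cluster
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)   # j is a core point: keep growing
    return labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
```

With these toy points, the two tight groups form two clusters and the isolated point is labeled as noise.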

Differential expression

diffrank

[source] Rank genes by differential expression.
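
The text above does not say which statistic diffrank uses. As an illustration of the general idea only, the sketch below ranks genes by the absolute value of a Welch-type score between two groups of cells; the function name, the choice of statistic and the toy data are all assumptions of this example.

```python
import math
import statistics


def rank_genes(X, groups, group_a, group_b, var_names):
    """Rank genes by a Welch-type score between two groups (illustrative sketch)."""
    scores = []
    for j, name in enumerate(var_names):
        a = [row[j] for row, g in zip(X, groups) if g == group_a]
        b = [row[j] for row, g in zip(X, groups) if g == group_b]
        diff = statistics.mean(a) - statistics.mean(b)
        stderr = math.sqrt(statistics.variance(a) / len(a)
                           + statistics.variance(b) / len(b))
        # The small constant guards against division by zero.
        scores.append((diff / (stderr + 1e-9), name))
    # Genes with the largest absolute score come first.
    return [name for score, name in sorted(scores, key=lambda item: -abs(item[0]))]


# Toy data: six cells (rows), three genes (columns), two groups of cells.
X = [[5.0, 1.0, 2.0],
     [6.0, 1.1, 2.5],
     [5.5, 0.9, 1.5],
     [1.0, 1.0, 2.2],
     [0.5, 1.2, 2.8],
     [1.5, 0.8, 1.8]]
groups = ['a', 'a', 'a', 'b', 'b', 'b']
ranking = rank_genes(X, groups, 'a', 'b', ['gene_0', 'gene_1', 'gene_2'])
```

In this toy data, gene_0 differs strongly between the groups and therefore ranks first, whereas gene_1 has nearly identical means in both groups and ranks last.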

Simulation

sim

[source] Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by Wittmann et al. (2009). The Scanpy implementation is due to Wolf et al. (2017).

The tool is comparable to the Matlab tool Odefy of Krumsiek et al. (2010).
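
As a rough illustration of the idea, the sketch below simulates a two-gene toggle switch (the Boolean rules "A is on if B is off" and "B is on if A is off") with the Euler-Maruyama scheme. Replacing the Boolean rules by Hill functions, the linear decay term, and all parameter values are assumptions of this example and are not taken from the Scanpy implementation.

```python
import math
import random


def hill_repression(x, threshold=0.5, exponent=4):
    """Smooth version of the Boolean rule NOT x: near 1 for small x, near 0 for large x."""
    return threshold ** exponent / (threshold ** exponent + x ** exponent)


def simulate_toggle_switch(n_steps=500, dt=0.02, noise=0.05, seed=0):
    """Euler-Maruyama integration of a noisy two-gene toggle switch (sketch)."""
    rng = random.Random(seed)
    a, b = 0.6, 0.4   # arbitrary initial expression levels
    path = [(a, b)]
    for _ in range(n_steps):
        # Drift: production given by the smoothed rule minus linear decay.
        drift_a = hill_repression(b) - a
        drift_b = hill_repression(a) - b
        # Euler-Maruyama step: drift * dt plus Gaussian noise scaled by sqrt(dt).
        a += drift_a * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        b += drift_b * dt + noise * math.sqrt(dt) * rng.gauss(0.0, 1.0)
        # Keep expression levels non-negative.
        a, b = max(a, 0.0), max(b, 0.0)
        path.append((a, b))
    return path


path = simulate_toggle_switch()
```

Each entry of path is one simulated expression state; running the simulation with different seeds yields different realizations of the same model.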

Installation

If you do not have a current Python distribution (Python 3.5 or 3.6), download and install Miniconda (see below).

Then, download or clone the repository - green button on top of the page - and cd into its root directory. To install with symbolic links, so that the installation stays up to date with your cloned version after you update it with git pull, call

pip install -e .

and work with the top-level command scanpy or

import scanpy.api as sc

in any directory.

Installing Miniconda

After downloading Miniconda, in a unix shell (Linux, Mac), run

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Then either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. The whole process takes just a couple of minutes.

PyPI

The package is registered in the Python Package Index, but versioning has not started yet. In the future, installation will also be possible without reference to GitHub via pip install scanpy.

References

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology 31, 545.

Angerer et al. (2015), destiny - diffusion maps for large-scale single-cell data in R, Bioinformatics 32, 1241.

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 7426.

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231.

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics 31, 2989.

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods 13, 845.

Krumsiek et al. (2010), Odefy - From discrete to continuous models, BMC Bioinformatics 11, 233.

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE 6, e22649.

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR 9, 2579.

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology 33, 269.

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR 12, 2825.

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell 163, 1663.

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology 3, 98.
