charliex210/scanpy

 
 


Quick Start | Examples | Tools | Installation | References

Scanpy - Single-Cell Analysis in Python

Tools for analyzing and simulating single-cell data, aimed at understanding dynamic biological processes from snapshots of the transcriptome or proteome. Please cite the original references and implementations.

The draft Wolf, Angerer & Theis (2017) explains conceptual ideas and usage as a library. Potential coauthors who would like to work on software and manuscript are welcome! Any comments are appreciated!

Quick Start

Download or clone the repository - green button on top of the page - and cd into its root directory. Type pip install -e . and you can immediately work with the top-level command scanpy in any directory (more info here).

[notebook] Early mesoderm cells in mouse differentiate through three subsequent stages (PS, NP, HF) and then branch into erythrocytes (4SG) and endothelial cells (4SFG).

scanpy moignard15 pca
scanpy moignard15 tsne
scanpy moignard15 diffmap

Coloring samples/cells by gene expression works analogously:

scanpy moignard15 pca -p smp HbbbH1
scanpy moignard15 tsne -p smp HbbbH1
scanpy moignard15 diffmap -p smp HbbbH1

Diffusion Pseudotime (DPT) analysis reveals differentiation and branching. It detects the trunk of progenitor cells (dpt group 0) and the branches of endothelial cells (dpt group 1/2) and erythrocytes (dpt group 3). The inferred pseudotime traces how far each cell has progressed in the differentiation process. By default, this is plotted using Diffusion Maps. With the -p option, you can specify another basis, for example tSNE.

scanpy moignard15 dpt -p smp exp_groups legendloc "upper left"
scanpy moignard15 dpt -p smp exp_groups legendloc none basis tsne

DPT orders cells by dpt groups, and within each group, by pseudotime. Groups are ordered by average pseudotime within the group. With this, we reproduced most of Fig. 1 from Haghverdi et al. (2016).
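The ordering rule described above can be sketched in plain Python. This is a minimal illustration and not Scanpy's implementation; the function name `order_cells` and the toy group labels and pseudotime values are made up for the example.

```python
from collections import defaultdict

def order_cells(groups, pseudotime):
    """Return cell indices ordered by group (groups sorted by their mean
    pseudotime) and, within each group, by pseudotime."""
    members = defaultdict(list)
    for idx, group in enumerate(groups):
        members[group].append(idx)
    # the average pseudotime of each group determines the order of the groups
    mean_time = {g: sum(pseudotime[i] for i in idxs) / len(idxs)
                 for g, idxs in members.items()}
    ordered = []
    for g in sorted(members, key=mean_time.get):
        ordered.extend(sorted(members[g], key=lambda i: pseudotime[i]))
    return ordered

# hypothetical toy data: five cells in three dpt groups
groups = [0, 1, 0, 2, 1]
pseudotime = [0.1, 0.8, 0.05, 0.5, 0.6]
order = order_cells(groups, pseudotime)
print(order)  # [2, 0, 3, 4, 1]
```

Group 2 comes before group 1 here because its average pseudotime (0.5) is lower than that of group 1 (0.7).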

Let us rank genes according to differential expression between groups of cells.

scanpy moignard15 diffrank -o smp dpt_groups names 0,2,3

In contrast to a DPT analysis, a standard clustering in tSNE coordinates blurs the continuous nature of the data. Also, a seemingly close correspondence between clusters and experimental groups is not confirmed by the top-ranked genes.

scanpy moignard15 dbscan -p smp exp_groups
scanpy moignard15 diffrank -o smp dbscan_groups names 2,3
scanpy moignard15 diffrank -o smp exp_groups names PS,4SG

If you want to use the results externally, read the resulting hdf5 file (inspect its content using h5ls write/moignard15.h5). If you prefer reading and writing csv files, use the option --fileformat csv; note that this is much slower.

More examples and help

For more examples, read this, or display them on the command line (example data and example use cases, respectively).

scanpy exdata
scanpy examples

Get general help, help on tool parameters and help on plotting the results of a tool.

scanpy --help
scanpy dpt --help
scanpy dpt -p help

Work on your own examples

To work on your own example, make a copy and edit the following notebook. If you want to call user examples from the command line, create a file scanpy_whatevername.py in your current working directory, e.g., by downloading and renaming scanpy_user_template.py and changing the function myexample() to your needs. Consider copying and pasting from scanpy/examples/builtin.py. Call your example using scanpy myexample pca. For the previous example (moignard15), you would define the following:

import numpy as np
import scanpy as sc

def moignard15():
    filename = 'data/moignard15/nbt.3154-S3.xlsx'
    adata = sc.read(filename, sheet='dCt_values.txt')
    # filter out genes: the 4th column (Eif2b1), the 31st (Mrpl19), the 36th
    # (Polr2a) and the 45th (last, UBC), as done by Haghverdi et al. (2016)
    genes = np.array([g not in [4, 31, 36, 45] for g in range(adata.X.shape[1])])
    adata = adata[:, genes] # filter adata
    # choose root cell as in Haghverdi et al. (2016)
    adata['iroot'] = iroot = 532 # note that in Matlab/R, counting starts at 1
    adata['xroot'] = adata.X[iroot]
    # annotate with Moignard et al. (2015) experimental cell groups
    groups_names = ['HF', 'NP', 'PS', '4SG', '4SFG']
    # annotate each sample/cell
    adata.smp['groups'] = [
        next(gname for gname in groups_names if sname.startswith(gname))
        for sname in adata.smp_names]
    # fix the order and colors of names in "groups"
    adata['groups_names'] = groups_names
    adata['groups_colors'] = ['#D7A83E', '#7AAE5D', '#497ABC', '#AF353A', '#765099']
    return adata
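The group annotation in the function above assigns each sample the first group name that is a prefix of the sample name. The standalone sketch below shows that step in isolation; the helper name `annotate_groups` and the sample names are hypothetical and only assume that sample names start with their group label.

```python
def annotate_groups(sample_names, group_names):
    """Assign each sample the first group name that prefixes its name."""
    return [next(g for g in group_names if s.startswith(g))
            for s in sample_names]

group_names = ['HF', 'NP', 'PS', '4SG', '4SFG']
# hypothetical sample names following a "<group>_<id>" pattern
sample_names = ['HF_001', '4SFG_002', 'PS_003', '4SG_004']
labels = annotate_groups(sample_names, group_names)
print(labels)  # ['HF', '4SFG', 'PS', '4SG']
```

None of the five group names is a prefix of another, so the order of `group_names` does not affect the result. A sample name that matches no group makes `next` raise `StopIteration`; passing a default, e.g. `next(..., None)`, would avoid that.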

Also, it'd be awesome if you add your example to examples and scanpy/examples/builtin.py together with a link to the public data. Simply make a pull request for this. If you have questions or prefer sending your script by email, contact Alex.

If you want to use your own tool, put your script into scanpy/tools, update scanpy/tools/__init__.py and use a wrapper like scripts/diffmap.py, which can be called directly.

./scripts/diffmap.py moignard15

Tools

Visualization

pca

[source] Uses the implementation of the scikit-learn package (Pedregosa et al., 2011) if it is installed.

tsne

[source] The algorithm was introduced by Maaten & Hinton (2008) and proposed for single-cell data by Amir et al. (2013). Uses the implementation of the scikit-learn package (Pedregosa et al., 2011) if it is installed.

diffmap

[source] This implements diffusion maps (Coifman et al., 2005), which were proposed for visualizing single-cell data by Haghverdi et al. (2015). It uses the kernel suggested by Haghverdi et al. (2016). The Scanpy implementation is due to Wolf et al. (2017).

Discrete clustering of subgroups and continuous progression through subgroups

dpt

[source] Reconstruct progression in a biological process from snapshot data and detect branching subgroups. Diffusion Pseudotime analysis was introduced by Haghverdi et al. (2016) and implemented for Scanpy by Wolf et al. (2017).

The functionality of diffmap and dpt is comparable to that of the R package destiny of Angerer et al. (2015).

dbscan

[source] Cluster cells using DBSCAN, originally proposed by Ester et al. (1996), in the implementation of scikit-learn (Pedregosa et al., 2011).
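To illustrate what the algorithm does, here is a minimal pure-Python sketch of DBSCAN. It is not the code Scanpy runs (the tool relies on scikit-learn, as stated above), and the function name, parameters and toy points are chosen only for this example. A point counts as its own neighbor, so `min_samples` includes the point itself.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_samples):
    """Label each point with a cluster index, or NOISE (-1)."""
    labels = [None] * len(points)

    def neighbors(i):
        # brute-force neighborhood query, includes the point itself
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_samples:
            labels[i] = NOISE  # may later be relabeled as a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbors(j)
            if len(j_nbrs) >= min_samples:
                queue.extend(j_nbrs)  # j is a core point: keep expanding
    return labels

# hypothetical 2D coordinates, e.g. cells in a tSNE embedding
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
labels = dbscan(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point at (50, 50) has no neighbors within `eps` and is therefore labeled as noise.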

Differential expression

diffrank

[source] Rank genes by differential expression.
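The text does not state which statistic diffrank uses. As a generic illustration of ranking genes between two groups of cells, the sketch below uses a Welch-style score (difference of group means divided by its standard error) as a stand-in. The function name, gene names and expression values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def rank_genes(expr_a, expr_b, gene_names):
    """Rank genes by a Welch-style score between two groups of cells.

    expr_a, expr_b: lists of cells, each cell a list of per-gene values
    (at least two cells per group, as required by the sample variance).
    Returns (gene, score) pairs sorted by decreasing absolute score.
    """
    scored = []
    for j, gene in enumerate(gene_names):
        a = [cell[j] for cell in expr_a]
        b = [cell[j] for cell in expr_b]
        # standard error of the difference between the two group means
        se = sqrt(variance(a) / len(a) + variance(b) / len(b))
        score = (mean(a) - mean(b)) / se if se > 0 else 0.0
        scored.append((gene, score))
    return sorted(scored, key=lambda pair: abs(pair[1]), reverse=True)

gene_names = ['g1', 'g2', 'g3']  # hypothetical gene names
group_a = [[5.0, 1.0, 2.0], [6.0, 1.2, 2.1], [5.5, 0.9, 1.9]]
group_b = [[1.0, 1.1, 2.5], [1.5, 1.0, 2.7], [0.5, 1.0, 2.3]]
ranking = rank_genes(group_a, group_b, gene_names)
print([gene for gene, _ in ranking])  # ['g1', 'g3', 'g2']
```

A positive score means higher expression in the first group; here g1 is higher in group A, g3 is higher in group B, and g2 barely differs.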

Simulation

sim

[source] Sample from a stochastic differential equation model built from literature-curated boolean gene regulatory networks, as suggested by Wittmann et al. (2009). The Scanpy implementation is due to Wolf et al. (2017).

The tool is comparable to the Matlab tool Odefy of Krumsiek et al. (2010).
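The general idea can be sketched as follows: replace Boolean rules by smooth functions, add noise, and integrate with the Euler-Maruyama scheme. This is a toy example under stated assumptions, not the sim tool and not the exact transformation of Wittmann et al. (2009). The two-gene network (A = not B, B = not A), the Hill parameters and the noise level are invented for illustration.

```python
import random
from math import sqrt

def hill(x, k=0.5, n=4):
    """Smooth, Hill-type replacement for a Boolean 'on' state."""
    x = max(x, 0.0)  # guard against small negative excursions from noise
    return x**n / (x**n + k**n)

def simulate_toggle(steps=1000, dt=0.01, noise=0.05, seed=0):
    """Euler-Maruyama integration of a two-gene mutual-repression model.

    Boolean rules assumed for illustration: A = not B, B = not A.
    """
    rng = random.Random(seed)
    a, b = 0.6, 0.4
    traj = [(a, b)]
    for _ in range(steps):
        # drift: relax towards the continuous version of the Boolean rule
        da = (1.0 - hill(b)) - a
        db = (1.0 - hill(a)) - b
        # diffusion: Gaussian increments scaled by sqrt(dt)
        a += da * dt + noise * sqrt(dt) * rng.gauss(0.0, 1.0)
        b += db * dt + noise * sqrt(dt) * rng.gauss(0.0, 1.0)
        traj.append((a, b))
    return traj

trajectory = simulate_toggle()
print(len(trajectory), trajectory[-1])
```

Each entry of `trajectory` is one simulated snapshot of the two expression levels; fixing `seed` makes a run reproducible.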

Installation

Download or clone the repository - green button on top of the page - and cd into its root directory. To install with symbolic links (stay up to date with your cloned version after you update with git pull) call

pip install -e .

and work with the top-level command scanpy or

import scanpy as sc

in any directory.

You can also call the wrapper python scripts/scanpy.py from within the root of the repository or from within scripts, which works without installation.

Packages you might need (all default in Anaconda) can be easily installed using Miniconda. Then run conda install scipy matplotlib h5py scikit-learn pandas xlrd enum34. Scanpy is written in Python 3 and compatible with Python 2.

Package Management via Miniconda

After downloading Miniconda, in a unix shell (Linux, Mac), run

cd DOWNLOAD_DIR
chmod +x Miniconda3-latest-VERSION.sh
./Miniconda3-latest-VERSION.sh

and accept all suggestions. Either open a new terminal or run source ~/.bashrc on Linux or source ~/.bash_profile on Mac. Then run conda install scipy matplotlib h5py pandas xlrd scikit-learn enum34. The whole process takes about 5 min.

PyPI

The package is registered in the Python Package Index, but versioning has not started yet. In the future, installation will be possible without reference to GitHub via pip install scanpy.

References

Amir et al. (2013), viSNE enables visualization of high dimensional single-cell data and reveals phenotypic heterogeneity of leukemia, Nature Biotechnology 31, 545.

Angerer et al. (2015), destiny - diffusion maps for large-scale single-cell data in R, Bioinformatics 32, 1241.

Coifman et al. (2005), Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps, PNAS 102, 7426.

Ester et al. (1996), A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise, Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, Portland, OR, pp. 226-231.

Haghverdi et al. (2015), Diffusion maps for high-dimensional single-cell analysis of differentiation data, Bioinformatics 31, 2989.

Haghverdi et al. (2016), Diffusion pseudotime robustly reconstructs branching cellular lineages, Nature Methods 13, 845.

Krumsiek et al. (2010), Odefy - From discrete to continuous models, BMC Bioinformatics 11, 233.

Krumsiek et al. (2011), Hierarchical Differentiation of Myeloid Progenitors Is Encoded in the Transcription Factor Network, PLoS ONE 6, e22649.

Maaten & Hinton (2008), Visualizing data using t-SNE, JMLR 9, 2579.

Moignard et al. (2015), Decoding the regulatory network of early blood development from single-cell gene expression measurements, Nature Biotechnology 33, 269.

Paul et al. (2015), Transcriptional Heterogeneity and Lineage Commitment in Myeloid Progenitors, Cell 163, 1663.

Pedregosa et al. (2011), Scikit-learn: Machine Learning in Python, JMLR 12, 2825.

Wittmann et al. (2009), Transforming Boolean models to continuous models: methodology and application to T-cell receptor signaling, BMC Systems Biology 3, 98.
