enrichment is a Python library, part of the xstats toolkit, which you can use to perform two types of enrichment analysis:
- you can compare how enriched a subset of objects is in some annotation when compared with a general population. E.g., how significant is it to find 12 women in a group of 20 people, knowing that half of the world's population is female? The significance is evaluated using Fisher's exact test.
- you can evaluate how enriched the top of a ranked list of objects is in some annotation. There is no need to apply a cut-off to decide what the top of the list is; the significance of this enrichment is evaluated using methods from [Eden2007a] and [Eden2007b].
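As a rough illustration of the first analysis, the sketch below computes the one-sided hypergeometric tail probabilities that underlie a one-sided Fisher's exact test, in plain Python 3.8+. It does not call the library. The function names and the finite population of 1000 people (500 of them women) are assumptions made for the example, and both tails include the observed count, which may differ from the convention the library uses.

```python
from math import comb  # requires Python 3.8+


def hypergeom_pmf(k, n, K, N):
    """P(X = k) when n objects are drawn without replacement from a
    population of N objects, K of which carry the annotation."""
    return comb(K, k) * comb(N - K, n - k) / comb(N, n)


def hypergeom_tails(k, n, K, N):
    """Return (P(X <= k), P(X >= k)); both tails include k itself."""
    lowest = max(0, n - (N - K))   # fewest annotated objects possible
    highest = min(n, K)            # most annotated objects possible
    left = sum(hypergeom_pmf(i, n, K, N) for i in range(lowest, k + 1))
    right = sum(hypergeom_pmf(i, n, K, N) for i in range(k, highest + 1))
    return left, right


# Hypothetical finite population: 1000 people, 500 of them women.
# Observation: 12 women in a group of 20 people.
left, right = hypergeom_tails(12, 20, 500, 1000)
print("left-tailed:", left)
print("right-tailed:", right)
```

With these assumed numbers the right-tailed probability comes out at roughly one in four, so finding 12 women among 20 people would not be considered a significant enrichment.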
A typical use of such a library is in bioinformatics, to perform gene set enrichment analysis. Given a set of genes for which a property (such as the expression level) is measured, enrichment can evaluate how enriched the subset of all genes with an expression level above a threshold is in some functional annotations. It can also evaluate how enriched the top of a list of genes, ranked by decreasing expression level, is in some functional annotations.
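The ranked-list analysis can be pictured with the minimum hypergeometric (mHG) score described in [Eden2007a]: for every possible cut-off n, compute the upper-tail hypergeometric probability of seeing at least as many annotated objects in the top n as were observed, then keep the smallest value. The sketch below is a minimal plain Python 3.8+ version of that score only. The function names are hypothetical, it is not the library's implementation, and it does not reproduce the p-value computation from the cited references (the raw score is not itself a p-value, since it is a minimum over many cut-offs).

```python
from math import comb  # requires Python 3.8+


def hypergeom_upper_tail(b, n, B, N):
    """P(X >= b) when n objects are drawn from N, B of them annotated."""
    highest = min(n, B)
    favourable = sum(comb(B, i) * comb(N - B, n - i)
                     for i in range(b, highest + 1))
    return favourable / comb(N, n)


def mhg_score(occurrences):
    """Return (score, cutoff): the smallest upper-tail probability over
    all cut-offs, and the cut-off (number of top entries) reaching it."""
    N = len(occurrences)
    B = sum(1 for hit in occurrences if hit)
    best_score, best_cutoff = 1.0, 0
    b = 0  # annotated objects seen so far in the top n entries
    for n, hit in enumerate(occurrences, start=1):
        if hit:
            b += 1
        score = hypergeom_upper_tail(b, n, B, N)
        if score < best_score:
            best_score, best_cutoff = score, n
    return best_score, best_cutoff


# Small illustrative vector: the first 5 ranked objects are annotated,
# the next 5 are not, and the remaining 40 alternate.
occurrences = [True] * 5 + [False] * 5 + [True, False] * 20
score, cutoff = mhg_score(occurrences)
print("mHG score:", score)
print("cut-off:", cutoff)
```

The cut-off is found by the scan itself, which is why no threshold has to be chosen beforehand. This naive version recomputes each tail from scratch and is only meant for short lists.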
[Eden2007a] Eden E, Lipson D, Yogev S and Yakhini Z. Motif discovery in ranked lists of DNA sequences. PLoS Computational Biology, 2007 Mar 23;3(3):e39.
[Eden2007b] Eden E. Discovering motifs in ranked lists of DNA sequences. Research thesis, 2007 Jan.
- Contact: Aurelien Mazurie <[email protected]>
- Keywords: Python, Enrichment analysis, Statistics, Bioinformatics, Fisher's exact test, GOrilla, mHG
- Download the latest version of the library from http://github/ajmazurie/xstats.enrichment/downloads
- Unzip the downloaded file, and cd into the resulting directory
- Run python setup.py install. Alternatively, you can package the library by typing python setup.py bdist, which creates a file dist/xstats.enrichment-xxx.tar.gz, where 'xxx' stands for the version number and the name of your platform. You can then install the library with easy_install dist/xstats.enrichment-xxx.tar.gz (see the setuptools documentation)
From then on, you only have to import xstats.enrichment to use the library:
    import xstats.enrichment

    # Analysis 1: how significant is it to have 10 objects out of 500
    # that share a given annotation, knowing that 120 out of the 1800
    # objects in the general population have this annotation?
    l, r, t = xstats.enrichment.evaluate_subset(10, 500, 120, 1800)

    # the left-tailed probability is the probability of having fewer
    # than 10 objects out of 500 with this annotation:
    print "left-tailed:", l  # 5.25e-8

    # the right-tailed probability is the probability of having more
    # than 10 objects out of 500 with this annotation:
    print "right-tailed:", r  # 0.99

    # the two-tailed probability is the probability of observing 10
    # objects out of 500 with this annotation, plus the probabilities
    # of observing even less likely proportions:
    print "two-tailed:", t  # 7.78e-8

    # as a result, we demonstrate that finding 10 objects with this
    # annotation is unexpected, as shown by the two-tailed p-value
    # (significant at an error rate of 5%). To know exactly in which
    # way the finding is unexpected, just look at the left- and right-
    # tailed p-values. In this example the left-tailed p-value is very
    # low, while the right-tailed p-value is almost 1. It means that
    # finding 10 objects is unexpectedly low with regard to the general
    # population.

    # conversely, finding 100 objects in the selection of 500 with
    # this property is unexpectedly high:
    l, r, t = xstats.enrichment.evaluate_subset(100, 500, 120, 1800)

    # the right-tailed p-value is very low, while the left-tailed is 1
    print "left- and right-tailed:", l, r  # 1, 1.32e-39

    # Analysis 2: in a ranked list of 1000 objects, half of which
    # share a given annotation, how significant is it to find 20
    # objects with this annotation at the top of the list?

    # we build an occurrence vector which, for each object in the list,
    # contains either True or False to represent whether this object has
    # the annotation considered.

    # we start with a homogeneous distribution:
    occurrences = [True, False] * 500

    # as expected, there is no significant enrichment at the top of the list:
    print xstats.enrichment.evaluate_list(occurrences)  # 0.99

    # we now build a second occurrence vector, in which the first 20 objects
    # all have the annotation:
    occurrences = [True] * 20 + [False] * 20 + [True, False] * 480

    # this time we find a significant enrichment at the top of the list,
    # which in this case is determined to be the first 20 entries:
    p_value, pivot = xstats.enrichment.evaluate_list(occurrences, with_pivot = True)
    print "p-value:", p_value  # 3.67e-5
    print "pivot:", pivot  # 20
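In practice the counts and the occurrence vector used above have to be derived from your own data. The sketch below shows one way to do that for the gene expression scenario, in plain Python 3. The helper names, gene names, expression values and threshold are all hypothetical, and the library itself is not called; the tuple order simply follows the arguments of evaluate_subset as used in the example above.

```python
def subset_counts(subset, population, annotated):
    """Return (annotated in subset, subset size,
    annotated in population, population size)."""
    subset, population, annotated = set(subset), set(population), set(annotated)
    return (len(subset & annotated), len(subset),
            len(population & annotated), len(population))


def occurrence_vector(ranked, annotated):
    """Return one boolean per ranked object: True if it is annotated."""
    annotated = set(annotated)
    return [obj in annotated for obj in ranked]


# Hypothetical data: expression level per gene, and the genes that
# carry the functional annotation of interest.
expression = {"geneA": 9.1, "geneB": 7.4, "geneC": 5.0,
              "geneD": 2.2, "geneE": 0.8}
annotated = {"geneA", "geneB", "geneE"}

# Genes ranked by decreasing expression level.
ranked = sorted(expression, key=expression.get, reverse=True)

# Input for the ranked-list analysis.
occurrences = occurrence_vector(ranked, annotated)

# Input for the subset analysis: genes above an arbitrary threshold.
subset = [gene for gene in ranked if expression[gene] > 4.0]
counts = subset_counts(subset, ranked, annotated)

print(occurrences)
print(counts)
```

The resulting values could then be handed to the library, for instance as evaluate_subset(*counts) and evaluate_list(occurrences).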