Permutation test for identification of statistically significant clusters in a phylogenetic tree

Yasas1994/Clustering-significance-test

Clustering-significance-test

A permutation test for identifying statistically significant clusters in a phylogenetic tree

The general idea of a permutation test is to compare a statistic computed on the observed data to a null distribution of that statistic, generated by randomly permuting the data. If the observed statistic lies far enough in the tail of the null distribution, we can reject the null hypothesis and conclude that the observed pattern is statistically significant.

In the context of a phylogenetic tree, we can use a permutation test to test the statistical significance of clustering of different metadata annotations. Here's a general outline of how the test could be performed:

Define the null hypothesis: The null hypothesis is that the observed clustering of metadata annotations is no different from what we would expect by chance.

Define the test statistic: The test statistic measures how different the observed clustering is from what we would expect by chance. One possible test statistic is the purity of clusters in the observed tree. The purity of a clustering measures the extent to which each cluster contains a single class, so a clustering in which every cluster contains only one class has a purity score of 1.

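$$ \mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_j | c_i \cap t_j | $$

where $N$ is the total number of leaves, $k$ is the number of clusters, $c_i$ is the set of leaves in cluster $i$, and $t_j$ is the set of leaves annotated with class $j$.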
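The purity formula can be sketched in a few lines of Python. This is an illustrative example for a flat assignment of leaves to clusters, not the code used in this repository; the function name `purity` and the toy labels are assumptions made for the example.

```python
from collections import Counter


def purity(cluster_ids, class_labels):
    """Purity = (1/N) * sum over clusters of the size of the majority class.

    cluster_ids[i] is the cluster of leaf i, class_labels[i] its annotation.
    """
    if len(cluster_ids) != len(class_labels):
        raise ValueError("cluster_ids and class_labels must have equal length")
    if not class_labels:
        raise ValueError("at least one leaf is required")

    # Count how often each class occurs inside each cluster.
    counts_per_cluster = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts_per_cluster.setdefault(cluster, Counter())[label] += 1

    # For each cluster keep the count of its most common class, then normalise.
    majority_total = sum(max(c.values()) for c in counts_per_cluster.values())
    return majority_total / len(class_labels)


# Hypothetical toy data: cluster 0 is mixed, cluster 1 is pure.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["Bandavirus", "Bandavirus", "Phlebovirus",
          "Phlebovirus", "Phlebovirus", "Phlebovirus"]
score = purity(clusters, labels)  # (2 + 3) / 6
```

In the toy data, cluster 0 contributes its two majority-class leaves and cluster 1 contributes all three, so the purity is 5/6.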

Generate null distributions: To generate null distributions, we randomly permute the metadata labels across the leaves of the phylogenetic tree many times (e.g., 1000 times) while keeping the tree itself fixed. Each permutation changes the clustering of metadata annotations, and we can compute the test statistic for each branch of the tree by recursive traversal.
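As a rough sketch of the permutation step, the example below shuffles class labels over a fixed, flat cluster assignment and records the purity of each shuffle. It simplifies the per-branch recursive traversal described above to a single clustering, and the names `purity`, `null_purities`, and the seed parameter are assumptions for illustration, not the repository's API.

```python
import random
from collections import Counter


def purity(cluster_ids, class_labels):
    """Fraction of leaves that belong to the majority class of their cluster."""
    counts = {}
    for cluster, label in zip(cluster_ids, class_labels):
        counts.setdefault(cluster, Counter())[label] += 1
    return sum(max(c.values()) for c in counts.values()) / len(class_labels)


def null_purities(cluster_ids, class_labels, n_permutations=1000, seed=0):
    """Purity scores obtained after randomly reassigning labels to leaves."""
    rng = random.Random(seed)      # seeded so the sketch is reproducible
    shuffled = list(class_labels)  # copy; the caller's labels stay untouched
    scores = []
    for _ in range(n_permutations):
        rng.shuffle(shuffled)      # permute labels, cluster membership is fixed
        scores.append(purity(cluster_ids, shuffled))
    return scores


# Hypothetical toy data: two clusters of three leaves, two classes.
clusters = [0, 0, 0, 1, 1, 1]
labels = ["A", "A", "A", "B", "B", "B"]
null_scores = null_purities(clusters, labels, n_permutations=1000)
```

Because only the labels move, the class frequencies and the cluster sizes are identical in every permutation; only their alignment changes.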

Calculate p-values: The p-value is the proportion of permutations that have a test statistic that is at least as extreme as the observed test statistic. If the p-value is small (e.g., less than 0.05), we can reject the null hypothesis and conclude that the observed clustering is statistically significant.
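The p-value calculation can be illustrated as follows. This sketch assumes a one-sided test in which a higher purity counts as more extreme; the function name `permutation_p_value` and the example numbers are hypothetical.

```python
def permutation_p_value(observed, null_scores):
    """Proportion of permuted statistics at least as large as the observed one."""
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    # One-sided: count permutations whose purity is >= the observed purity.
    n_extreme = sum(1 for score in null_scores if score >= observed)
    return n_extreme / len(null_scores)


# Hypothetical values: 2 of the 4 permuted scores reach the observed 0.9.
p_value = permutation_p_value(0.9, [0.5, 0.6, 0.9, 1.0])
is_significant = p_value < 0.05
```

With a plain proportion, the p-value is 0 when no permutation reaches the observed value, so its resolution is limited by the number of permutations (1/1000 for 1000 permutations).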

Interpret the results: If the null hypothesis is rejected, we can conclude that the metadata annotations cluster on the tree more strongly than expected by chance.

How to run?

python pemutation_test.py -m Phenuiviridae.tsv -t Phenuiviridae.tree -o ./test -i1 3 -i2 2 -p 0.05 -r 1000

Results

  1. clusters.csv
┌─────────┬────────────────────────┬────────────┬─────────────┬───────┐
│ cluster ┆ leaf_lab               ┆ 0          ┆ 1           ┆ 2     │
│ ---     ┆ ---                    ┆ ---        ┆ ---         ┆ ---   │
│ i64     ┆ str                    ┆ str        ┆ str         ┆ i64   │
╞═════════╪════════════════════════╪════════════╪═════════════╪═══════╡
│ 0       ┆ Bandavirus_QNR55439.1  ┆ QNR55439.1 ┆ Bandavirus  ┆ 56763 │
│ 0       ┆ Phlebovirus_AEA29884.1 ┆ AEA29884.1 ┆ Phlebovirus ┆ 2346  │
│ 0       ┆ Phlebovirus_ADZ95575.1 ┆ ADZ95575.1 ┆ Phlebovirus ┆ 9855  │
│ 0       ┆ Bandavirus_QEL09442.1  ┆ QEL09442.1 ┆ Bandavirus  ┆ 9     │
│ 0       ┆ Bandavirus_USH08263.1  ┆ USH08263.1 ┆ Bandavirus  ┆ 6587  │
│ …       ┆ …                      ┆ …          ┆ …           ┆ …     │
│ 515     ┆ Bandavirus_AHE38316.1  ┆ AHE38316.1 ┆ Bandavirus  ┆ 654   │
│ 515     ┆ Bandavirus_USH07876.1  ┆ USH07876.1 ┆ Bandavirus  ┆ 45437 │
│ 515     ┆ Bandavirus_QNR55479.2  ┆ QNR55479.2 ┆ Bandavirus  ┆ 6587  │
│ 515     ┆ Bandavirus_USH08354.1  ┆ USH08354.1 ┆ Bandavirus  ┆ 1771  │
│ 515     ┆ Bandavirus_QNR55459.1  ┆ QNR55459.1 ┆ Bandavirus  ┆ 654   │
└─────────┴────────────────────────┴────────────┴─────────────┴───────┘

  2. global_significance.png
    shows how the observed cluster purity compares with the null distribution obtained by shuffling the labels

  3. significant_clusters.csv

┌──────────┬────────────────┬─────────┬──────────────┐
│ clusters ┆ is_significant ┆ p_value ┆ cluster_size │
│ ---      ┆ ---            ┆ ---     ┆ ---          │
│ i64      ┆ bool           ┆ f64     ┆ i64          │
╞══════════╪════════════════╪═════════╪══════════════╡
│ 23       ┆ true           ┆ 0.044   ┆ 32           │
│ 24       ┆ true           ┆ 0.01    ┆ 33           │
│ 25       ┆ true           ┆ 0.03    ┆ 37           │
│ 313      ┆ true           ┆ 0.02    ┆ 14           │
│ 314      ┆ true           ┆ 0.02    ┆ 17           │
│ 315      ┆ true           ┆ 0.03    ┆ 21           │
└──────────┴────────────────┴─────────┴──────────────┘

Dependencies

pandas (v1.0.0 or later)
numpy (v1.18.0 or later)
treeswift (v1.0.0 or later)
queue (included in Python standard library)
tqdm (v4.0.0 or later)
seaborn (v0.10.0 or later)
matplotlib (v3.0.0 or later)

Note: These are minimum version requirements and newer versions of the packages may work as well.
