GitHub - Yasas1994/Clustering-significance-test: Permutation test for identification of statistically significant clusters in a phylogenetic tree

Clustering-significance-test

A permutation test for identifying statistically significant clusters in a phylogenetic tree

The general idea of a permutation test is to compare the observed data to a null distribution generated by randomly permuting the data. If the observed data is significantly different from the null distribution, we can reject the null hypothesis and conclude that the observed pattern is statistically significant.

In the context of a phylogenetic tree, we can use a permutation test to test the statistical significance of clustering of different metadata annotations. Here's a general outline of how the test could be performed:

Define the null hypothesis: The null hypothesis is that the observed clustering of metadata annotations is no different from what we would expect by chance.

Define the test statistic: The test statistic measures how different the observed clustering is from what we would expect by chance. One possible test statistic is the purity of clusters in the observed tree. The purity of a cluster measures the extent to which it contains a single class, so a cluster whose members all belong to one class has a purity score of 1.

$$ \mathrm{Purity} = \frac{1}{N} \sum_{i=1}^{k} \max_j | c_i \cap t_j | $$

Here N is the total number of leaves, k is the number of clusters, c_i is the set of leaves in cluster i, and t_j is the set of leaves belonging to class j.
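As a rough illustration of the purity formula, the sketch below computes purity for clusters given as plain lists of class labels. The function name `purity` and the toy genus labels are assumptions made for this example; they are not taken from `permutation_test.py`.

```python
from collections import Counter


def purity(clusters):
    """Purity of a clustering given as a list of clusters.

    Each cluster is a list of class labels, one per leaf.
    Hypothetical helper for illustration only.
    """
    n_total = sum(len(cluster) for cluster in clusters)
    if n_total == 0:
        raise ValueError("no labelled leaves")
    # For each non-empty cluster, count the members of its most common class.
    majority = sum(
        Counter(cluster).most_common(1)[0][1] for cluster in clusters if cluster
    )
    return majority / n_total


pure = [["Bandavirus"] * 3, ["Phlebovirus"] * 2]
mixed = [
    ["Bandavirus", "Phlebovirus", "Bandavirus"],
    ["Phlebovirus", "Bandavirus"],
]
print(purity(pure))   # 1.0
print(purity(mixed))  # 0.6
```

In the mixed case the majority counts are 2 and 1 out of 5 leaves, which gives 3/5.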

Generate null distributions: To generate null distributions, we randomly permute the metadata labels across the leaves of the phylogenetic tree many times (e.g., 1000 times) while keeping the tree itself unchanged. Each permutation changes the clustering of metadata annotations, and we can compute the test statistic for each branch of the tree by recursive traversal.
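The label shuffling and recursive traversal described above can be sketched on a toy tree. This is a minimal sketch under stated assumptions: the tree is a nested tuple rather than a treeswift object, and `clade_purities`, `permuted_labels`, `TREE`, and `LABELS` are hypothetical names, so the actual script may organise this step differently.

```python
import random
from collections import Counter

# Toy tree as nested tuples; strings are leaf names. This is an assumed
# stand-in for a real phylogenetic tree object.
TREE = ((("a", "b"), "c"), ("d", ("e", "f")))
LABELS = {"a": "X", "b": "X", "c": "X", "d": "Y", "e": "Y", "f": "Y"}


def clade_purities(node, labels, out, path="root"):
    """Recursively gather leaf labels and record the purity of each internal node."""
    if isinstance(node, str):  # leaf: return its label
        return [labels[node]]
    found = []
    for i, child in enumerate(node):
        found += clade_purities(child, labels, out, f"{path}.{i}")
    out[path] = Counter(found).most_common(1)[0][1] / len(found)
    return found


def permuted_labels(labels, rng):
    """Shuffle labels among the leaves; the tree topology is untouched."""
    names = list(labels)
    values = list(labels.values())
    rng.shuffle(values)
    return dict(zip(names, values))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

observed = {}
clade_purities(TREE, LABELS, observed)

# One null distribution per internal node, built from 1000 shuffles.
null = {clade: [] for clade in observed}
for _ in range(1000):
    scores = {}
    clade_purities(TREE, permuted_labels(LABELS, rng), scores)
    for clade, score in scores.items():
        null[clade].append(score)
```

Each internal node ends up with its own list of permuted purity scores, which can then be compared with that node's observed purity.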

Calculate p-values: The p-value is the proportion of permutations whose test statistic is at least as extreme as the observed test statistic. If the p-value is small (e.g., less than 0.05), we can reject the null hypothesis and conclude that the observed clustering is statistically significant.
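The p-value definition above amounts to a simple count. The sketch below assumes that "at least as extreme" means a permuted purity greater than or equal to the observed one; `permutation_p_value` and the example scores are made up for illustration and do not come from the repository.

```python
def permutation_p_value(observed, null_scores):
    """One-sided p-value: share of permuted scores >= the observed score.

    Assumption: higher purity counts as more extreme. Some implementations
    use (count + 1) / (n + 1) instead so that the p-value is never exactly 0.
    """
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    extreme = sum(score >= observed for score in null_scores)
    return extreme / len(null_scores)


# Ten made-up permuted purity scores and an observed purity of 0.9.
null_scores = [0.5, 0.6, 0.6, 0.7, 0.8, 0.8, 0.9, 1.0, 0.6, 0.7]
p = permutation_p_value(0.9, null_scores)
print(p)         # 0.2
print(p < 0.05)  # False
```

Two of the ten permuted scores reach 0.9 or more, so this toy cluster would not be called significant at the 0.05 level.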

Interpret the results: If the null hypothesis is rejected, the metadata annotations group together on the tree more tightly than random label placement would produce. Otherwise, the observed clustering is consistent with chance.

How to run?

python permutation_test.py -m Phenuiviridae.tsv -t Phenuiviridae.tree -o ./test -i1 3 -i2 2 -p 0.05 -r 1000

Results

clusters.csv

┌─────────┬────────────────────────┬────────────┬─────────────┬───────┐
│ cluster ┆ leaf_lab               ┆ 0          ┆ 1           ┆ 2     │
│ ---     ┆ ---                    ┆ ---        ┆ ---         ┆ ---   │
│ i64     ┆ str                    ┆ str        ┆ str         ┆ i64   │
╞═════════╪════════════════════════╪════════════╪═════════════╪═══════╡
│ 0       ┆ Bandavirus_QNR55439.1  ┆ QNR55439.1 ┆ Bandavirus  ┆ 56763 │
│ 0       ┆ Phlebovirus_AEA29884.1 ┆ AEA29884.1 ┆ Phlebovirus ┆ 2346  │
│ 0       ┆ Phlebovirus_ADZ95575.1 ┆ ADZ95575.1 ┆ Phlebovirus ┆ 9855  │
│ 0       ┆ Bandavirus_QEL09442.1  ┆ QEL09442.1 ┆ Bandavirus  ┆ 9     │
│ 0       ┆ Bandavirus_USH08263.1  ┆ USH08263.1 ┆ Bandavirus  ┆ 6587  │
│ …       ┆ …                      ┆ …          ┆ …           ┆ …     │
│ 515     ┆ Bandavirus_AHE38316.1  ┆ AHE38316.1 ┆ Bandavirus  ┆ 654   │
│ 515     ┆ Bandavirus_USH07876.1  ┆ USH07876.1 ┆ Bandavirus  ┆ 45437 │
│ 515     ┆ Bandavirus_QNR55479.2  ┆ QNR55479.2 ┆ Bandavirus  ┆ 6587  │
│ 515     ┆ Bandavirus_USH08354.1  ┆ USH08354.1 ┆ Bandavirus  ┆ 1771  │
│ 515     ┆ Bandavirus_QNR55459.1  ┆ QNR55459.1 ┆ Bandavirus  ┆ 654   │
└─────────┴────────────────────────┴────────────┴─────────────┴───────┘

global_significance.png

Shows how the observed cluster purity compares with the null distribution obtained after shuffling the labels.

significant_clusters.csv

┌──────────┬────────────────┬─────────┬──────────────┐
│ clusters ┆ is_significant ┆ p_value ┆ cluster_size │
│ ---      ┆ ---            ┆ ---     ┆ ---          │
│ i64      ┆ bool           ┆ f64     ┆ i64          │
╞══════════╪════════════════╪═════════╪══════════════╡
│ 23       ┆ true           ┆ 0.044   ┆ 32           │
│ 24       ┆ true           ┆ 0.01    ┆ 33           │
│ 25       ┆ true           ┆ 0.03    ┆ 37           │
│ 313      ┆ true           ┆ 0.02    ┆ 14           │
│ 314      ┆ true           ┆ 0.02    ┆ 17           │
│ 315      ┆ true           ┆ 0.03    ┆ 21           │
└──────────┴────────────────┴─────────┴──────────────┘
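For downstream filtering, a results file like this could be read with the standard `csv` module. The sketch assumes the file is comma-separated with the column names shown in the table header; the in-memory `sample` stands in for the real file and reuses three rows from the table above.

```python
import csv
import io

# Stand-in for open("significant_clusters.csv"); the comma-separated layout
# is an assumption, and the rows are copied from the table above.
sample = io.StringIO(
    "clusters,is_significant,p_value,cluster_size\n"
    "23,true,0.044,32\n"
    "24,true,0.01,33\n"
    "313,true,0.02,14\n"
)

rows = [
    {
        "cluster": int(r["clusters"]),
        "p_value": float(r["p_value"]),
        "size": int(r["cluster_size"]),
    }
    for r in csv.DictReader(sample)
]

# Keep clusters at a stricter threshold and list the largest first.
strict = sorted(
    (r for r in rows if r["p_value"] <= 0.02), key=lambda r: -r["size"]
)
for r in strict:
    print(r["cluster"], r["p_value"], r["size"])
```

With these three rows, clusters 24 and 313 pass the stricter 0.02 threshold.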

Dependencies

pandas (v1.0.0 or later)
numpy (v1.18.0 or later)
treeswift (v1.0.0 or later)
queue (included in Python standard library)
tqdm (v4.0.0 or later)
seaborn (v0.10.0 or later)
matplotlib (v3.0.0 or later)

Note: These are minimum version requirements and newer versions of the packages may work as well.
