
Compare two Clusterings interactively #61

Open
enjalot opened this issue Sep 20, 2024 · 1 comment
enjalot commented Sep 20, 2024

It would be great to have a page dedicated to comparing the results of two clusterings of the same data.

This StackExchange post has many useful pointers for potential techniques to enable, as well as some of the challenges to consider.

enjalot added the web label Sep 20, 2024
enjalot added this to the Research Features milestone Sep 20, 2024

enjalot commented Oct 15, 2024

I see that MTEB evaluates the clustering capability of embedding models using V-measure (comparing a k-means clustering against ground-truth labels): https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/AbsTaskClusteringFast.py

V-measure is a metric that evaluates the quality of clustering by comparing the cluster assignments to the true labels. It's the harmonic mean of two other metrics: homogeneity and completeness.

Homogeneity: Measures whether each cluster contains only members of a single class.
Completeness: Measures whether all members of a given class are assigned to the same cluster.

The V-measure ranges from 0 to 1, where 1 indicates perfect clustering.
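As a concrete sketch of the metric described above, scikit-learn exposes homogeneity, completeness, and V-measure directly (the labels here are made-up toy data, not anything from this project):

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Ground-truth class labels and cluster assignments for six points.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_ids = [0, 0, 1, 1, 2, 2]

h, c, v = homogeneity_completeness_v_measure(true_labels, cluster_ids)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")

# V-measure is invariant to relabeling: a clustering that matches the
# classes up to a permutation of cluster ids scores 1.0 on all three.
h, c, v = homogeneity_completeness_v_measure([0, 0, 1, 1], [1, 1, 0, 0])
print(f"perfect case: v_measure={v:.3f}")
```

Note that V-measure (unlike ARI/AMI) is not adjusted for chance, which is one reason the additional metrics below are worth tracking alongside it.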

Some ideas from Claude for comparing clusterings with differing numbers of clusters:

Calculate V-measure:
We can still calculate the V-measure between the HDBSCAN clusters and the true labels. The interpretation would be slightly different:

If HDBSCAN finds fewer clusters than true labels, a high V-measure would indicate that the embeddings are grouping semantically similar categories together.
If HDBSCAN finds more clusters than true labels, a high V-measure would suggest that the embeddings are capturing fine-grained semantic distinctions within categories.

Additional metrics:
We could introduce additional metrics to complement the V-measure:

Adjusted Rand Index (ARI) or Adjusted Mutual Information (AMI), which are also suitable for comparing clusterings with different numbers of clusters
Silhouette score to measure how well-separated the HDBSCAN clusters are
A measure of how close the number of HDBSCAN clusters is to the number of true labels
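The additional metrics listed above could be sketched as follows with scikit-learn; the synthetic embedding matrix and the two k-means runs are placeholders standing in for real embeddings and an HDBSCAN clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    adjusted_mutual_info_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
# Toy stand-in for an embedding matrix: 200 points in 8 dimensions.
X = rng.normal(size=(200, 8))

# Two clusterings of the same data with different cluster counts,
# mimicking e.g. an HDBSCAN result vs. a labeled taxonomy.
a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
b = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(a, b)        # chance-corrected pairwise agreement
ami = adjusted_mutual_info_score(a, b) # chance-corrected shared information
sil = silhouette_score(X, a)           # separation of clustering `a` in embedding space

# Simple closeness-of-count measure (ratio of smaller to larger count).
n_a, n_b = len(set(a)), len(set(b))
count_ratio = min(n_a, n_b) / max(n_a, n_b)

print(f"ARI={ari:.3f} AMI={ami:.3f} silhouette={sil:.3f} count_ratio={count_ratio:.2f}")
```

Because ARI and AMI are chance-adjusted and symmetric in their two arguments, they work as-is when neither clustering is a privileged "ground truth" — which fits the interactive compare-two-clusterings use case in this issue.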
