
Compare two Clusterings interactively #61

Open
enjalot opened this issue Sep 20, 2024 · 1 comment
enjalot commented Sep 20, 2024

It would be great to have a page dedicated to comparing the results of two clusterings of the same data.

This StackExchange post has many useful pointers for potential techniques to enable, as well as some of the challenges to consider.

enjalot added the web label Sep 20, 2024
enjalot added this to the Research Features milestone Sep 20, 2024

enjalot commented Oct 15, 2024

I see that MTEB evaluates the clustering capability of embedding models using V-measure (comparing a k-means clustering against ground-truth labels): https://github.com/embeddings-benchmark/mteb/blob/main/mteb/abstasks/AbsTaskClusteringFast.py

V-measure is a metric that evaluates the quality of clustering by comparing the cluster assignments to the true labels. It's the harmonic mean of two other metrics: homogeneity and completeness.

Homogeneity: Measures whether each cluster contains only members of a single class.
Completeness: Measures whether all members of a given class are assigned to the same cluster.

The V-measure ranges from 0 to 1, where 1 indicates perfect clustering.
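As a concrete sketch of the metric described above, scikit-learn exposes homogeneity, completeness, and V-measure directly (the labels here are made-up toy data, not anything from this project):

```python
from sklearn.metrics import homogeneity_completeness_v_measure

# Ground-truth class labels and cluster assignments for six points.
true_labels = [0, 0, 0, 1, 1, 1]
cluster_ids = [0, 0, 1, 1, 2, 2]

h, c, v = homogeneity_completeness_v_measure(true_labels, cluster_ids)
print(f"homogeneity={h:.3f} completeness={c:.3f} v_measure={v:.3f}")

# V-measure is invariant to relabeling: a clustering that matches the
# classes up to a permutation of cluster ids scores 1.0 on all three.
h, c, v = homogeneity_completeness_v_measure([0, 0, 1, 1], [1, 1, 0, 0])
print(f"perfect case: v_measure={v:.3f}")
```

Note that V-measure (unlike ARI/AMI) is not adjusted for chance, which is one reason the additional metrics below are worth tracking alongside it.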

Some ideas from Claude for comparing clusterings with differing numbers of clusters:

Calculate V-measure:
We can still calculate the V-measure between the HDBSCAN clusters and the true labels. The interpretation would be slightly different:

If HDBSCAN finds fewer clusters than true labels, a high V-measure would indicate that the embeddings are grouping semantically similar categories together.
If HDBSCAN finds more clusters than true labels, a high V-measure would suggest that the embeddings are capturing fine-grained semantic distinctions within categories.

Additional metrics:
We could introduce additional metrics to complement the V-measure:

Adjusted Rand Index (ARI) or Adjusted Mutual Information (AMI), which are also suitable for comparing clusterings with different numbers of clusters
Silhouette score to measure how well-separated the HDBSCAN clusters are
A measure of how close the number of HDBSCAN clusters is to the number of true labels
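The additional metrics listed above could be sketched as follows with scikit-learn; the synthetic embedding matrix and the two k-means runs are placeholders standing in for real embeddings and an HDBSCAN clustering:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    adjusted_rand_score,
    adjusted_mutual_info_score,
    silhouette_score,
)

rng = np.random.default_rng(42)
# Toy stand-in for an embedding matrix: 200 points in 8 dimensions.
X = rng.normal(size=(200, 8))

# Two clusterings of the same data with different cluster counts,
# mimicking e.g. an HDBSCAN result vs. a labeled taxonomy.
a = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
b = KMeans(n_clusters=7, n_init=10, random_state=0).fit_predict(X)

ari = adjusted_rand_score(a, b)        # chance-corrected pairwise agreement
ami = adjusted_mutual_info_score(a, b) # chance-corrected shared information
sil = silhouette_score(X, a)           # separation of clustering `a` in embedding space

# Simple closeness-of-count measure (ratio of smaller to larger count).
n_a, n_b = len(set(a)), len(set(b))
count_ratio = min(n_a, n_b) / max(n_a, n_b)

print(f"ARI={ari:.3f} AMI={ami:.3f} silhouette={sil:.3f} count_ratio={count_ratio:.2f}")
```

Because ARI and AMI are chance-adjusted and symmetric in their two arguments, they work as-is when neither clustering is a privileged "ground truth" — which fits the interactive compare-two-clusterings use case in this issue.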
