Go Kmeans

This is a simple implementation of Elkan's Kmeans algorithm in Go. The library also contains the Kmeans++, Lloyd's Kmeans, and Simple Random Sampling algorithms.

Installing

$ go get github.com/arjunsk/kmeans

Usage

package main

import (
	"fmt"
	"github.com/arjunsk/kmeans"
)

func main() {
	vectors := [][]float64{
		{1, 2, 3, 4},
		{0, 3, 4, 1},
		{0, 9, 3, 1},
		{0, 8, 4, 4},
		{130, 200, 343, 224},
		{100, 200, 300, 400},
		{300, 400, 200, 110},
	}

	clusterer, err := kmeans.NewCluster(kmeans.ELKAN, vectors, 2)
	if err != nil {
		panic(err)
	}

	clusters, err := clusterer.Cluster()
	if err != nil {
		panic(err)
	}

	for _, cluster := range clusters {
		fmt.Println(cluster.Center())
	}
	// Output:
	// [1 2 3 4]
	// [130 200 343 224]
}

FAQ

Why not Kmeans++ initialization in Elkan's?

By default, Elkan's Kmeans uses random initialization instead of Kmeans++ initialization.

Based on the excerpt below from a FAISS discussion, the computational overhead of Kmeans++ initialization was observed not to be worth it for large-scale use cases.

Scikit-learn uses k-means++ initialization by default (you can also use random points), which is good in the specific corner case you consider. It should actually give you a perfect result even without any iteration with high probability, because the kind of evaluation you consider is exactly what k-means++ has been designed to handle better. We have not implemented it in Faiss because, with our former Yael library, which implements both k-means++ and regular random initialization, we observed that the overhead computational cost was not worth the (negligible) saving in all large-scale settings we have considered.
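For intuition, here is a minimal sketch of the D² ("distance squared") sampling that Kmeans++ seeding performs. It is independent of this library's internals; the package and function names (seed, kmeansPlusPlusSeed, squaredDistance) are illustrative, not part of this library's API.

package seed

import (
	"math"
	"math/rand"
)

// squaredDistance returns the squared Euclidean distance between two vectors.
func squaredDistance(a, b []float64) float64 {
	var sum float64
	for i := range a {
		d := a[i] - b[i]
		sum += d * d
	}
	return sum
}

// kmeansPlusPlusSeed picks k initial centers: the first uniformly at random,
// each subsequent one with probability proportional to its squared distance
// from the nearest center chosen so far.
func kmeansPlusPlusSeed(vectors [][]float64, k int, rng *rand.Rand) [][]float64 {
	centers := make([][]float64, 0, k)
	centers = append(centers, vectors[rng.Intn(len(vectors))])

	dists := make([]float64, len(vectors))
	for len(centers) < k {
		var total float64
		for i, v := range vectors {
			// Squared distance to the closest already-chosen center.
			best := math.Inf(1)
			for _, c := range centers {
				if d := squaredDistance(v, c); d < best {
					best = d
				}
			}
			dists[i] = best
			total += best
		}
		// Draw the next center proportionally to dists[i].
		next := len(vectors) - 1 // fallback guards against rounding error
		r := rng.Float64() * total
		for i, d := range dists {
			r -= d
			if r < 0 {
				next = i
				break
			}
		}
		centers = append(centers, vectors[next])
	}
	return centers
}

Each seeding round makes a full pass over the data, which is exactly the overhead the FAISS comment argues is not worth the (negligible) quality gain at large scale.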

When should you consider sub-sampling?

As the FAISS documentation notes, when the number of vectors is large, it is recommended to use sub-sampling.

When applying the k-means algorithm to cluster n points into k centroids, there are several cases:

  • n < k: this raises an exception with an assertion because we cannot do anything meaningful
  • n < min_points_per_centroid * k: this produces the warning above. It means that usually there are too few points to reliably estimate the centroids. This may still be ok if the dataset to index is as small as the training set.
  • n < max_points_per_centroid * k: comfort zone
  • n > max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled.

The parameters {min,max}_points_per_centroids (39 and 256 by default) belong to the ClusteringParameters structure.
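Below is a minimal sketch of that sub-sampling rule, assuming simple random sampling (which this library also implements as a standalone algorithm). The constant 256 is FAISS's default max_points_per_centroid; the package and function names are illustrative, not this library's API.

package seed

import "math/rand"

const maxPointsPerCentroid = 256 // FAISS's default

// subSample randomly down-samples the training vectors to at most
// maxPointsPerCentroid*k points, per the FAISS rule quoted above.
func subSample(vectors [][]float64, k int, rng *rand.Rand) [][]float64 {
	limit := maxPointsPerCentroid * k
	if len(vectors) <= limit {
		return vectors
	}
	// rng.Perm gives a random permutation; taking its prefix selects
	// `limit` distinct indices without replacement.
	sample := make([][]float64, limit)
	for i, j := range rng.Perm(len(vectors))[:limit] {
		sample[i] = vectors[j]
	}
	return sample
}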

What could be your sample size?

The number of partitions (i.e., K) cannot exceed half the number of records.

Returns the total number of records if it is < 1000. Otherwise, returns 1% of the total number of records or 2x the number of partitions, whichever is larger. Never returns a number > Integer.MAX_VALUE.

The 2x factor could instead be based on the dimension of the vectors (geo-coordinates, in the quoted case). For example, if the vectors are 1000-dimensional, the sample size could be Max(1% * total vectors, 1000 x k).

  • Based on FAISS, the sample size could be max_points_per_centroid * k if n > max_points_per_centroid * k.
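As a worked example, here is a hedged sketch of the heuristic quoted above (everything if under 1000 records, otherwise the larger of 1% of the records and 2x K, capped at Integer.MAX_VALUE). The function name sampleSize is illustrative only.

package seed

import "math"

// sampleSize mirrors the quoted heuristic: small datasets are used in
// full; larger ones are sampled at max(1% of records, 2*k), capped at
// the 32-bit integer maximum mentioned in the quote.
func sampleSize(totalRecords, k int) int {
	if totalRecords < 1000 {
		return totalRecords
	}
	n := totalRecords / 100 // 1% of the records
	if 2*k > n {
		n = 2 * k
	}
	if n > math.MaxInt32 {
		n = math.MaxInt32
	}
	return n
}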

What should be the ideal K?

Based on the recommendations from PGVector IVF INDEX, the ideal K should be:

Choose an appropriate number of K - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows
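That rule of thumb translates directly into code; the sketch below is illustrative, and idealK is not part of this library.

package seed

import "math"

// idealK follows the PGVector guidance quoted above: rows/1000 for up
// to 1M rows, sqrt(rows) beyond that.
func idealK(rows int) int {
	if rows <= 1_000_000 {
		return rows / 1000
	}
	return int(math.Sqrt(float64(rows)))
}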
