This is a simple implementation of Elkan's k-means algorithm in Go. The library also contains k-means++, Lloyd's k-means, and simple random sampling algorithms.
```sh
$ go get github.com/arjunsk/kmeans
```
```go
package main

import (
	"fmt"

	"github.com/arjunsk/kmeans"
)

func main() {
	vectors := [][]float64{
		{1, 2, 3, 4},
		{0, 3, 4, 1},
		{0, 9, 3, 1},
		{0, 8, 4, 4},
		{130, 200, 343, 224},
		{100, 200, 300, 400},
		{300, 400, 200, 110},
	}

	clusterer, err := kmeans.NewCluster(kmeans.ELKAN, vectors, 2)
	if err != nil {
		panic(err)
	}

	clusters, err := clusterer.Cluster()
	if err != nil {
		panic(err)
	}

	for _, cluster := range clusters {
		fmt.Println(cluster.Center())
	}
	// Output:
	// [1 2 3 4]
	// [130 200 343 224]
}
```
Read More
By default, Elkan's k-means uses random initialization rather than k-means++ initialization.
Based on the excerpt below from a FAISS discussion, the computational overhead of k-means++ was not considered worth the (negligible) savings in large-scale settings.
Scikit-learn uses k-means++ initialization by default (you can also use random points), which is good in the specific corner case you consider. It should actually give you a perfect result even without any iteration with high probability, because the kind of evaluation you consider is exactly what k-means++ has been designed to better handle. We have not implemented it in Faiss, because with our former Yael library, which implements both k-means++ and regular random initialization, we observed that the overhead computational cost was not worth the (negligible) saving in all large-scale settings we have considered.
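For reference, here is a minimal sketch of the two initialization strategies, written against plain slices rather than this library's API (the function names are only for illustration):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randomInit picks k distinct input vectors uniformly at random as initial centroids.
func randomInit(vectors [][]float64, k int) [][]float64 {
	centroids := make([][]float64, 0, k)
	for _, i := range rand.Perm(len(vectors))[:k] {
		centroids = append(centroids, vectors[i])
	}
	return centroids
}

// kmeansPlusPlusInit picks the first centroid at random, then picks each subsequent
// centroid with probability proportional to its squared distance from the nearest
// centroid chosen so far. This extra pass over the data per centroid is the overhead
// the FAISS comment above refers to.
func kmeansPlusPlusInit(vectors [][]float64, k int) [][]float64 {
	centroids := [][]float64{vectors[rand.Intn(len(vectors))]}
	for len(centroids) < k {
		dists := make([]float64, len(vectors))
		total := 0.0
		for i, v := range vectors {
			d := math.Inf(1)
			for _, c := range centroids {
				d = math.Min(d, sqDist(v, c))
			}
			dists[i] = d
			total += d
		}
		// Weighted sampling proportional to dists.
		r, next := rand.Float64()*total, len(vectors)-1
		for i, d := range dists {
			r -= d
			if r <= 0 {
				next = i
				break
			}
		}
		centroids = append(centroids, vectors[next])
	}
	return centroids
}

func sqDist(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

func main() {
	vectors := [][]float64{{1, 2}, {1, 3}, {100, 200}, {110, 190}}
	fmt.Println(randomInit(vectors, 2))
	fmt.Println(kmeansPlusPlusInit(vectors, 2))
}
```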
As mentioned here, when the number of vectors is large, it is recommended to use sub-sampling.
When applying the k-means algorithm to cluster n points into k centroids, there are several cases:
- n < k: this raises an exception with an assertion because we cannot do anything meaningful
- n < min_points_per_centroid * k: this produces the warning above. It means that usually there are too few points to reliably estimate the centroids. This may still be ok if the dataset to index is as small as the training set.
- n < max_points_per_centroid * k: comfort zone
- n > max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled.
The parameters {min,max}_points_per_centroids (39 and 256 by default) belong to the ClusteringParameters structure.
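A minimal sketch of that sub-sampling rule, assuming the FAISS defaults quoted above (the subSample helper is hypothetical and not part of this library):

```go
package main

import (
	"fmt"
	"math/rand"
)

// FAISS-like defaults for {min,max}_points_per_centroid.
const (
	minPointsPerCentroid = 39
	maxPointsPerCentroid = 256
)

// subSample caps the training set at maxPointsPerCentroid*k vectors, mirroring the
// behaviour described above. Below minPointsPerCentroid*k it only warns.
func subSample(vectors [][]float64, k int) [][]float64 {
	n := len(vectors)
	switch {
	case n < k:
		panic("cannot cluster fewer points than centroids")
	case n < minPointsPerCentroid*k:
		fmt.Println("warning: too few points to reliably estimate the centroids")
		return vectors
	case n <= maxPointsPerCentroid*k:
		return vectors // comfort zone
	}
	// Too many points: train on a random sample of maxPointsPerCentroid*k vectors.
	sample := make([][]float64, 0, maxPointsPerCentroid*k)
	for _, i := range rand.Perm(n)[:maxPointsPerCentroid*k] {
		sample = append(sample, vectors[i])
	}
	return sample
}

func main() {
	vectors := make([][]float64, 10_000)
	for i := range vectors {
		vectors[i] = []float64{rand.Float64(), rand.Float64()}
	}
	fmt.Println(len(subSample(vectors, 4))) // 1024 = 256 * 4
}
```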
- Apache Sedona uses the following sampling rule.
Number of partitions (i.e. K) cannot exceed half the number of records.
Returns total number of records if it is < 1000. Otherwise, returns 1% of the total number of records or 2x number of partitions whichever is larger. Never returns a number > Integer.MAX_VALUE.
The 2x factor could be based on the dimension of the vectors (here, geo-coordinates). For example, if the vectors are 1000-dimensional, the sample size could be max(1% of total vectors, 1000 x k). A sketch of this rule follows the list below.
- Based on FAISS, the sample size could be max_points_per_centroid * k if n > max_points_per_centroid * k.
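A minimal sketch of the Sedona-style sample-size rule quoted above (the sedonaSampleSize helper is hypothetical; the dimension-based variant is noted in a comment):

```go
package main

import (
	"fmt"
	"math"
)

// sedonaSampleSize mirrors the rule quoted above: return every record when there are
// fewer than 1000; otherwise take the larger of 1% of the records or 2*k, capped at
// math.MaxInt32. K itself must not exceed half the number of records.
func sedonaSampleSize(totalRecords, k int) int {
	if k > totalRecords/2 {
		panic("number of partitions (K) cannot exceed half the number of records")
	}
	if totalRecords < 1000 {
		return totalRecords
	}
	// For high-dimensional vectors, the 2*k term could instead scale with the
	// dimension, e.g. dim*k, as suggested above.
	sample := int(math.Max(0.01*float64(totalRecords), float64(2*k)))
	if sample > math.MaxInt32 {
		return math.MaxInt32
	}
	return sample
}

func main() {
	fmt.Println(sedonaSampleSize(500, 10))       // 500: fewer than 1000 records
	fmt.Println(sedonaSampleSize(1_000_000, 8))  // 10000: 1% of the records
	fmt.Println(sedonaSampleSize(100_000, 5000)) // 10000: 2*k is larger than 1%
}
```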
Based on the recommendations from the PGVector IVF index documentation, the ideal K is chosen as follows:
Choose an appropriate number of K - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows
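A minimal sketch of that rule of thumb (the idealK helper is hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// idealK follows the PGVector IVF guidance quoted above: rows/1000 for up to
// one million rows, sqrt(rows) beyond that, and never fewer than one cluster.
func idealK(rows int) int {
	var k int
	if rows <= 1_000_000 {
		k = rows / 1000
	} else {
		k = int(math.Sqrt(float64(rows)))
	}
	if k < 1 {
		k = 1
	}
	return k
}

func main() {
	fmt.Println(idealK(7))         // 1
	fmt.Println(idealK(500_000))   // 500
	fmt.Println(idealK(4_000_000)) // 2000
}
```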