This is a simple implementation of Elkan's k-means algorithm in Go. The library also contains k-means++, Lloyd's k-means, and simple random sampling algorithms.
```sh
$ go get github.com/arjunsk/kmeans
```
```go
package main

import (
	"fmt"

	"github.com/arjunsk/kmeans"
)

func main() {
	vectors := [][]float64{
		{1, 2, 3, 4},
		{0, 3, 4, 1},
		{0, 9, 3, 1},
		{0, 8, 4, 4},
		{130, 200, 343, 224},
		{100, 200, 300, 400},
		{300, 400, 200, 110},
	}

	clusterer, err := kmeans.NewCluster(kmeans.ELKAN, vectors, 2)
	if err != nil {
		panic(err)
	}

	clusters, err := clusterer.Cluster()
	if err != nil {
		panic(err)
	}

	for _, cluster := range clusters {
		fmt.Println(cluster.Center())
	}
	// Output:
	// [1 2 3 4]
	// [130 200 343 224]
}
```
Read More
By default, Elkan's k-means uses random initialization rather than k-means++ initialization.
Based on the excerpt below from a FAISS discussion, the computational overhead of k-means++ was not considered worth the (negligible) savings in large-scale settings.
Scikit-learn uses k-means++ initialization by default (you can also use random points), which is good in the specific corner case you consider. It should actually give you a perfect result even without any iteration with high probability, because the kind of evaluation you consider is exactly what k-means++ has been designed to better handle. We have not implemented it in Faiss, because with our former Yael library, which implements both k-means++ and regular random initialization, we observed that the overhead computational cost was not worth the (negligible) saving in all large-scale settings we have considered.
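For reference, here is a minimal sketch of the two initialization strategies, written against plain slices rather than this library's API (the function names are only for illustration):

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
)

// randomInit picks k distinct input vectors uniformly at random as initial centroids.
func randomInit(vectors [][]float64, k int) [][]float64 {
	centroids := make([][]float64, 0, k)
	for _, i := range rand.Perm(len(vectors))[:k] {
		centroids = append(centroids, vectors[i])
	}
	return centroids
}

// kmeansPlusPlusInit picks the first centroid at random, then picks each subsequent
// centroid with probability proportional to its squared distance from the nearest
// centroid chosen so far. This extra pass over the data per centroid is the overhead
// the FAISS comment above refers to.
func kmeansPlusPlusInit(vectors [][]float64, k int) [][]float64 {
	centroids := [][]float64{vectors[rand.Intn(len(vectors))]}
	for len(centroids) < k {
		dists := make([]float64, len(vectors))
		total := 0.0
		for i, v := range vectors {
			d := math.Inf(1)
			for _, c := range centroids {
				d = math.Min(d, sqDist(v, c))
			}
			dists[i] = d
			total += d
		}
		// Weighted sampling proportional to dists.
		r, next := rand.Float64()*total, len(vectors)-1
		for i, d := range dists {
			r -= d
			if r <= 0 {
				next = i
				break
			}
		}
		centroids = append(centroids, vectors[next])
	}
	return centroids
}

func sqDist(a, b []float64) float64 {
	s := 0.0
	for i := range a {
		d := a[i] - b[i]
		s += d * d
	}
	return s
}

func main() {
	vectors := [][]float64{{1, 2}, {1, 3}, {100, 200}, {110, 190}}
	fmt.Println(randomInit(vectors, 2))
	fmt.Println(kmeansPlusPlusInit(vectors, 2))
}
```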
As mentioned here, when the number of vectors is large, it is recommended to use sub-sampling.
When applying the k-means algorithm to cluster n points into k centroids, there are several cases:
- n < k: this raises an exception with an assertion because we cannot do anything meaningful
- n < min_points_per_centroid * k: this produces the warning above. It means that usually there are too few points to reliably estimate the centroids. This may still be ok if the dataset to index is as small as the training set.
- n < max_points_per_centroid * k: comfort zone
- n > max_points_per_centroid * k: there are too many points, making k-means unnecessarily slow. Then the training set is sampled.
The parameters {min,max}_points_per_centroids (39 and 256 by default) belong to the ClusteringParameters structure.
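A minimal sketch of that sub-sampling rule, assuming the FAISS defaults quoted above (the subSample helper is hypothetical and not part of this library):

```go
package main

import (
	"fmt"
	"math/rand"
)

// FAISS-like defaults for {min,max}_points_per_centroid.
const (
	minPointsPerCentroid = 39
	maxPointsPerCentroid = 256
)

// subSample caps the training set at maxPointsPerCentroid*k vectors, mirroring the
// behaviour described above. Below minPointsPerCentroid*k it only warns.
func subSample(vectors [][]float64, k int) [][]float64 {
	n := len(vectors)
	switch {
	case n < k:
		panic("cannot cluster fewer points than centroids")
	case n < minPointsPerCentroid*k:
		fmt.Println("warning: too few points to reliably estimate the centroids")
		return vectors
	case n <= maxPointsPerCentroid*k:
		return vectors // comfort zone
	}
	// Too many points: train on a random sample of maxPointsPerCentroid*k vectors.
	sample := make([][]float64, 0, maxPointsPerCentroid*k)
	for _, i := range rand.Perm(n)[:maxPointsPerCentroid*k] {
		sample = append(sample, vectors[i])
	}
	return sample
}

func main() {
	vectors := make([][]float64, 10_000)
	for i := range vectors {
		vectors[i] = []float64{rand.Float64(), rand.Float64()}
	}
	fmt.Println(len(subSample(vectors, 4))) // 1024 = 256 * 4
}
```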
- Apache Sedona uses the following sampling rule.
Number of partitions (i.e. K) cannot exceed half the number of records.
Returns total number of records if it is < 1000. Otherwise, returns 1% of the total number of records or 2x number of partitions whichever is larger. Never returns a number > Integer.MAX_VALUE.
The 2x factor could be based on the dimension of the vectors (here, geo-coordinates). For example, if the vectors are 1000-dimensional, the sample size could be max(1% of total vectors, 1000 x k). A sketch of this rule follows the list below.
- Based on FAISS, the sample size could be max_points_per_centroid * k if n > max_points_per_centroid * k.
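A minimal sketch of the Sedona-style sample-size rule quoted above (the sedonaSampleSize helper is hypothetical; the dimension-based variant is noted in a comment):

```go
package main

import (
	"fmt"
	"math"
)

// sedonaSampleSize mirrors the rule quoted above: return every record when there are
// fewer than 1000; otherwise take the larger of 1% of the records or 2*k, capped at
// math.MaxInt32. K itself must not exceed half the number of records.
func sedonaSampleSize(totalRecords, k int) int {
	if k > totalRecords/2 {
		panic("number of partitions (K) cannot exceed half the number of records")
	}
	if totalRecords < 1000 {
		return totalRecords
	}
	// For high-dimensional vectors, the 2*k term could instead scale with the
	// dimension, e.g. dim*k, as suggested above.
	sample := int(math.Max(0.01*float64(totalRecords), float64(2*k)))
	if sample > math.MaxInt32 {
		return math.MaxInt32
	}
	return sample
}

func main() {
	fmt.Println(sedonaSampleSize(500, 10))       // 500: fewer than 1000 records
	fmt.Println(sedonaSampleSize(1_000_000, 8))  // 10000: 1% of the records
	fmt.Println(sedonaSampleSize(100_000, 5000)) // 10000: 2*k is larger than 1%
}
```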
Based on the recommendations from the PGVector IVF index documentation, the ideal K is chosen as follows:
Choose an appropriate number of K - a good place to start is rows / 1000 for up to 1M rows and sqrt(rows) for over 1M rows
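A minimal sketch of that rule of thumb (the idealK helper is hypothetical):

```go
package main

import (
	"fmt"
	"math"
)

// idealK follows the PGVector IVF guidance quoted above: rows/1000 for up to
// one million rows, sqrt(rows) beyond that, and never fewer than one cluster.
func idealK(rows int) int {
	var k int
	if rows <= 1_000_000 {
		k = rows / 1000
	} else {
		k = int(math.Sqrt(float64(rows)))
	}
	if k < 1 {
		k = 1
	}
	return k
}

func main() {
	fmt.Println(idealK(7))         // 1
	fmt.Println(idealK(500_000))   // 500
	fmt.Println(idealK(4_000_000)) // 2000
}
```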