Last Update Date: 2020-3-3
The goal of this project is to implement the K-means algorithm and use the k-center algorithm for center initialization.
-
Dataset
data/bgau.txt
is synthetic 2-d data with N=5000 vectors and k=15 Gaussian clusters with different degree of cluster overlap.- P. Fränti and O. Virmajoki, "Iterative shrinking method for clustering problems", Pattern Recognition, 39 (5), 761-765.
- Each line has the following format: x y , which represent the x- and y-coordinates of a point.
-
Project Output
- A visualization of each cluster
-
Project Environment
- python3
K-means clustering algorithm is an iterative clustering analysis algorithm. It steps are to randomly select K objects as the initial clustering center, then calculate the distance between each object and each seed clustering center, and assign each object to the nearest clustering.
- The advantages of K-means
- Simple principle, coonvenient implementation and fast convergence
- The clustering effect is better
- The interpretability of the model is strong
- Only cluster number K is needed for parameter adjustment
- The disadvantage of K-means
- It often ends with local optimization and is sensitive to noise and isolated points. Moreover, this method is not suitable for finding clusters with non convex shape or with large difference in size
- The number of clustering centers K needs to be given in advance, but in practice, the selection of K is very difficult to estimate. In many cases, it's not known in advance how many categories a given dataset should be divied into before it's most appropriate
- K-means needs to determine the initial clustering centers artificially. Different initial clustering centers may lead to completely different clustering results.
K-means algorithm depends on the selection of inital solution. We introduce K-center clustering algorithm. The K-center method uses the center points to define the prototype, in which the center points are the most representative points of a group of points, and uses the distance between the spatial point groups to find the typical points as the seed objects of clustering.
- Step 1: Create a single cluster of all nodes.
- Step 2: Suppose in the current iteration, there are x existing clusters, and a distance d is the current maximum distance between any node and its hub. We find such a node v that the distance between v and its hub is d.
- Step 3: Continue at iteration x, Create a new cluster with v as the only node. Then for each point, if we find that the dist(point, its hub) > dist(point, v), that is, the point is closer to the new cluster hub than it is to its old hub, we move the point from its old cluster to the new cluster.
- Step 4: Iterate the Step 2 and 3 for K-1 times, and we finally get K clusters with corresponding hub nodes as output.
- In the first step, we need to assume the first center point randomly. In order to reduce the uncertainty brought by random hypothesis, we introduce the concept of point density. We will take point P as the center of a circle and the number of points within the radius R as the density of P. And then we select the point with the highest density as the initial first center point.
- Compactness: The lower the value is, the closer the in-cluster-distance is, but the effect between clusters is not considered.
- Separation: The higher the value is, the farther the between-cluster-distance is, but the intra cluster effect is not considered.
- Davies Bouldin Index (Recommended): The smaller the DBI is, the smaller the in-cluster-distance and the larger the between-cluster-distance are. But it's not suitable for circular distribution.
- Dunn Validity Index: The higher the DVI is, the larger the between-cluster-distance and the smaller the in-cluster-distance are. But it's not suitable for circular distribution either.
- Sum of Squared Error, SEE: According to the change trend of SEE with different K, K corresponding to eblow points is selected as the optimal cluster number.
- Silhouette Coefficient: The value is between [-1,1], the closer to 1, the better cohesion and separation.
from entry import k_means_cluster
k_means_cluster(True, 15, 'init_with_kcenter')
k_means_cluster(False, 15, 'init_with_random')
k_means_cluster(using_kcenter, picture_name)
Function
The function is used to read the dataset, and use k-means algorithm to cluster the dataset, and output it in the form of image.
Props:
using_kcenter
decide whether to use k-center algorithm to initialize centers of the dataset.True
for use, whileFalse
for not, defaultTrue
.
k
decide the K.
picture_name
decide the name of the output image. Defaultpicture
.evaluate
Class
It's a class to calculate some measure standards of clustering accuracy.
It includes all the above evaluate criteria.
from Evaluate.evaluate import evaluate
# Need to input the K, the point sets and the centers of points.
eva = evaluate(k, point_sets, centers)
eva.compactness()
eva.separation()
eva.davies_bouldin_index()
eva.dunn_validity_index()
eva.cost_function() # return the SEE
eva.silhouette_coefficient(point, i) # single point silhouette coefficient
eva.silhouette_score() # return the total silhouette score.
Author | RongRongJi |
---|---|
Contact | homepage |