Python implementations of the k-modes and k-prototypes clustering algorithms. The implementations rely on NumPy for most of the numerical work.
k-modes is used for clustering categorical data. It defines clusters based on the number of matching categories between data points. (This is in contrast to the better-known k-means algorithm, which clusters numerical data based on Euclidean distance.) The k-prototypes algorithm combines k-modes and k-means and can cluster mixed numerical and categorical data.
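To make the idea of "matching categories" concrete, here is a minimal NumPy sketch of the two ingredients k-modes needs: a dissimilarity that counts mismatched attributes, and a cluster "mode" made of the most frequent category per column. The function names `matching_dissimilarity` and `column_modes` are made up for this illustration and are not necessarily how the package implements it.

```python
import numpy as np

def matching_dissimilarity(a, b):
    """Count the attributes on which two categorical records differ."""
    return int(np.sum(np.asarray(a) != np.asarray(b)))

def column_modes(points):
    """Return the most frequent category of each column (the cluster's mode).

    Ties are broken by taking the first value in sorted order, which is an
    arbitrary choice made for this sketch.
    """
    points = np.asarray(points)
    modes = []
    for col in points.T:
        values, counts = np.unique(col, return_counts=True)
        modes.append(values[np.argmax(counts)])
    return np.array(modes)

x = np.array(["red", "small", "round"])
y = np.array(["red", "large", "round"])
d_xy = matching_dissimilarity(x, y)  # the records differ only in the second attribute

cluster = np.array([
    ["red", "small", "round"],
    ["red", "large", "round"],
    ["blue", "small", "round"],
])
mode = column_modes(cluster)  # per-column majority category
```

Roughly speaking, k-modes alternates between assigning each point to the mode it mismatches least and recomputing the modes, much like k-means alternates between assignments and means.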
Implemented are:
- k-modes [HUANG97] [HUANG98]
- k-modes with initialization based on density [CAO09]
- k-prototypes [HUANG97]
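For k-prototypes, the dissimilarity between a point and a cluster prototype is usually described as a squared Euclidean distance over the numerical attributes plus a weighted count of mismatches over the categorical ones. The sketch below illustrates that combination under stated assumptions: the function name `mixed_dissimilarity` and the `gamma` weight parameter are illustrative and do not claim to mirror the package's internals.

```python
import numpy as np

def mixed_dissimilarity(x_num, x_cat, proto_num, proto_cat, gamma=1.0):
    """Squared Euclidean distance on numerical attributes plus
    gamma times the number of mismatched categorical attributes.

    gamma (an assumed name) balances the two parts: a larger value
    gives the categorical attributes more influence.
    """
    diff = np.asarray(x_num, dtype=float) - np.asarray(proto_num, dtype=float)
    numeric_part = float(np.sum(diff ** 2))
    categorical_part = int(np.sum(np.asarray(x_cat) != np.asarray(proto_cat)))
    return numeric_part + gamma * categorical_part

# One numerical attribute differs by 1.0 and one categorical attribute mismatches.
d = mixed_dissimilarity([1.0, 2.0], ["a", "b"], [0.0, 2.0], ["a", "c"], gamma=0.5)
```

With `gamma=0` this reduces to the k-means distance on the numerical part. A sensible value of the weight depends on the scale of the numerical attributes, so it is worth experimenting with.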
The code is modeled after the k-means module in scikit-learn and exposes the same familiar interface.
Usage examples of both k-modes ('soybean.py') and k-prototypes ('stocks.py') are included.
I would love to have more people play around with this and give me feedback on my implementation.
Enjoy!
git clone https://github.com/nicodv/kmodes.git
cd kmodes
python setup.py install
import numpy as np
from kmodes import kmodes
# random categorical data
data = np.random.choice(20, (100, 10))
km = kmodes.KModes(n_clusters=4, init='Huang', n_init=5, verbose=1)
clusters = km.fit_predict(data)
[HUANG97] Huang, Z.: Clustering large data sets with mixed numeric and categorical values, Proceedings of the First Pacific Asia Knowledge Discovery and Data Mining Conference, Singapore, pp. 21-34, 1997.
[HUANG98] Huang, Z.: Extensions to the k-modes algorithm for clustering large data sets with categorical values, Data Mining and Knowledge Discovery 2(3), pp. 283-304, 1998.
[CAO09] Cao, F., Liang, J., Bai, L.: A new initialization method for categorical data clustering, Expert Systems with Applications 36(7), pp. 10223-10228, 2009.