PUB-923: Fix K-Means to pass golden R test.
Problem: within_cluster_variances correctly held per-cluster variances, but
the R wrapper maps this value to withinss, which is the within-cluster sum of
squares. To keep the R side consistent, we now compute the sum of squares
instead of the variances; the field name is left unchanged.

Also re-enable returning total_within_SS from R.
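
To make the distinction concrete, here is a minimal Java sketch (not the H2O implementation) contrasting the per-cluster sum of squares, which R's kmeans reports as withinss, with the per-cluster mean squared error that results from dividing by the cluster row count. The class WithinClusterStats, its method names, and the sample data are hypothetical and chosen only for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of hex.KMeans2.
final class WithinClusterStats {

    // Sum of squared distances from each point to its assigned center,
    // accumulated per cluster (what R calls withinss).
    static double[] withinSumOfSquares(double[][] points, int[] assignment, double[][] centers) {
        double[] ss = new double[centers.length];
        for (int r = 0; r < points.length; r++) {
            int c = assignment[r];
            for (int d = 0; d < points[r].length; d++) {
                double diff = points[r][d] - centers[c][d];
                ss[c] += diff * diff;
            }
        }
        return ss;
    }

    // Per-cluster mean squared error: sum of squares divided by cluster size.
    // Empty clusters are left at 0.0 to avoid dividing by zero.
    static double[] meanSquaredError(double[] ss, long[] sizes) {
        double[] mse = new double[ss.length];
        for (int c = 0; c < ss.length; c++) {
            if (sizes[c] > 0) mse[c] = ss[c] / sizes[c];
        }
        return mse;
    }

    // Total within-cluster sum of squares (what R calls tot.withinss).
    static double total(double[] ss) {
        double sum = 0;
        for (double v : ss) sum += v;
        return sum;
    }

    public static void main(String[] args) {
        double[][] points = {{0}, {2}, {10}, {12}, {14}};
        int[] assignment = {0, 0, 1, 1, 1};
        double[][] centers = {{1}, {12}};
        long[] sizes = {2, 3};

        double[] ss = withinSumOfSquares(points, assignment, centers);
        System.out.println("withinss     = " + Arrays.toString(ss));
        System.out.println("tot.withinss = " + total(ss));
        System.out.println("per-cluster MSE = " + Arrays.toString(meanSquaredError(ss, sizes)));
    }
}
```

For the sample data the per-cluster sums of squares are 2 and 8, so the total is 10, whereas the divided values (1 and 8/3) would not match what the R side expects under the name withinss.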
arnocandel committed Aug 8, 2014
1 parent 5fa5c25 commit 11835ec
Showing 2 changed files with 11 additions and 6 deletions.
4 changes: 2 additions & 2 deletions R/h2o-package/R/Algorithms.R
@@ -333,8 +333,8 @@ h2o.kmeans <- function(data, centers, cols = '', key = "", iter.max = 10, normal
 result$centers = t(matrix(unlist(res$centers), ncol = res$parameters$k))
 dimnames(result$centers) = list(seq(1,res$parameters$k), feat)
 #result$totss <- res$total_SS
-result$withinss <- res$within_cluster_variances
-#result$tot.withinss <- res$total_within_SS
+result$withinss <- res$within_cluster_variances ## FIXME: sum of squares != variances (bad name of the latter)
+result$tot.withinss <- res$total_within_SS
 #result$betweenss <- res$between_cluster_SS
 result$size <- res$size
 result$iter <- res$iterations
13 changes: 9 additions & 4 deletions src/main/java/hex/KMeans2.java
@@ -185,9 +185,11 @@ public KMeans2() {
 double ssq = 0; // sum squared error
 for( int i=0; i<k; i++ ) {
 ssq += model.within_cluster_variances[i]; // sum squared error all clusters
-model.within_cluster_variances[i] /= task._rows[i]; // mse per-cluster
+// model.within_cluster_variances[i] /= task._rows[i]; // mse per-cluster
 }
-model.total_within_SS = ssq/fr.numRows(); // mse total
+// model.total_within_SS = ssq/fr.numRows(); // mse total
+model.total_within_SS = ssq; //total within sum of squares
+
 model.update(self()); // Update model in K/V store

 // Compute change in clusters centers
@@ -289,7 +291,7 @@ public static Response redirect(Request req, Key model) {
 rows[i][0] = model.within_cluster_variances[i];
 columnHTMLlong(sb, "Cluster Size", model.size);
 DocGen.HTML.section(sb, "Cluster Variances: ");
-table(sb, "Clusters", new String[]{"Within Cluster Variances"}, rows);
+table(sb, "Clusters", new String[]{"Within Cluster Sum of Squares"}, rows);
 // columnHTML(sb, "Between Cluster Variances", model.between_cluster_variances);
 sb.append("<br />");
 DocGen.HTML.section(sb, "Overall Totals: ");
@@ -406,7 +408,10 @@ public static class KMeans2Model extends Model implements Progress {
 public int iterations;

 @API(help = "Within cluster sum of squares per cluster")
-public double[] within_cluster_variances;
+public double[] within_cluster_variances; //Warning: See note below
+//Note: The R wrapper interprets this as withinss (sum of squares), so that's what we compute here, and NOT the variances.
+//FIXME: => wrong name, should be within_cluster_sum_of_squares, but leaving to be backward-compatible with REST API
+

 // @API(help = "Between Cluster square distances per cluster")
 // public double[] between_cluster_variances;
