Skip to content

Commit

Permalink
[SPARK-1212] Adding sparse data support and update KMeans
Browse files Browse the repository at this point in the history
Continue our discussions from https://github.com/apache/incubator-spark/pull/575

This PR is WIP because it depends on a SNAPSHOT version of breeze.

Per previous discussions and benchmarks, I switched to breeze for linear algebra operations. @dlwh and I made some improvements to breeze to keep its performance comparable to the bare-bone implementation, including norm computation and squared distance. This is why this PR needs to depend on a SNAPSHOT version of breeze.

@fommil , please find the notice of using netlib-core in `NOTICE`. This is following Apache's instructions on appropriate labeling.

I'm going to update this PR to include:

1. Fast distance computation: using `\|a\|_2^2 + \|b\|_2^2 - 2 a^T b` when it doesn't introduce too much numerical error. The squared norms are pre-computed. Otherwise, computing the distance between the center (dense) and a point (possibly sparse) always takes O(n) time.

2. Some numbers about the performance.

3. A released version of breeze. @dlwh, a minor release of breeze will help this PR get merged early. Do you mind sharing breeze's release plan? Thanks!

Author: Xiangrui Meng <[email protected]>

Closes apache#117 from mengxr/sparse-kmeans and squashes the following commits:

67b368d [Xiangrui Meng] fix SparseVector.toArray
5eda0de [Xiangrui Meng] update NOTICE
67abe31 [Xiangrui Meng] move ArrayRDDs to mllib.rdd
1da1033 [Xiangrui Meng] remove dependency on commons-math3 and compute EPSILON directly
9bb1b31 [Xiangrui Meng] optimize SparseVector.toArray
226d2cd [Xiangrui Meng] update Java friendly methods in Vectors
238ba34 [Xiangrui Meng] add VectorRDDs with a converter from RDD[Array[Double]]
b28ba2f [Xiangrui Meng] add toArray to Vector
e69b10c [Xiangrui Meng] remove examples/JavaKMeans.java, which is replaced by mllib/examples/JavaKMeans.java
72bde33 [Xiangrui Meng] clean up code for distance computation
712cb88 [Xiangrui Meng] make Vectors.sparse Java friendly
27858e4 [Xiangrui Meng] update breeze version to 0.7
07c3cf2 [Xiangrui Meng] change Mahout to breeze in doc use a simple lower bound to avoid unnecessary distance computation
6f5cdde [Xiangrui Meng] fix a bug in filtering finished runs
42512f2 [Xiangrui Meng] Merge branch 'master' into sparse-kmeans
d6e6c07 [Xiangrui Meng] add predict(RDD[Vector]) to KMeansModel
42b4e50 [Xiangrui Meng] line feed at the end
a4ace73 [Xiangrui Meng] Merge branch 'fast-dist' into sparse-kmeans
3ed1a24 [Xiangrui Meng] add doc to BreezeVectorWithSquaredNorm
0107e19 [Xiangrui Meng] update NOTICE
87bc755 [Xiangrui Meng] tuned the KMeans code: changed some for loops to while, use view to avoid copying arrays
0ff8046 [Xiangrui Meng] update KMeans to use fastSquaredDistance
f355411 [Xiangrui Meng] add BreezeVectorWithSquaredNorm case class
ab74f67 [Xiangrui Meng] add fastSquaredDistance for KMeans
4e7d5ca [Xiangrui Meng] minor style update
07ffaf2 [Xiangrui Meng] add dense/sparse vector data models and conversions to/from breeze vectors use breeze to implement KMeans in order to support both dense and sparse data
  • Loading branch information
mengxr authored and mateiz committed Mar 24, 2014
1 parent 8265dc7 commit 80c2968
Show file tree
Hide file tree
Showing 20 changed files with 930 additions and 397 deletions.
9 changes: 9 additions & 0 deletions NOTICE
Original file line number Diff line number Diff line change
Expand Up @@ -3,3 +3,12 @@ Copyright 2014 The Apache Software Foundation.

This product includes software developed at
The Apache Software Foundation (http://www.apache.org/).

In addition, this product includes:

- JUnit (http://www.junit.org) is a testing framework for Java. We included it
under the terms of the Eclipse Public License v1.0.

- JTransforms (https://sites.google.com/site/piotrwendykier/software/jtransforms)
provides fast transforms in Java. It is tri-licensed, and we included it under
the terms of the Mozilla Public License v1.1.
138 changes: 0 additions & 138 deletions examples/src/main/java/org/apache/spark/examples/JavaKMeans.java

This file was deleted.

Original file line number Diff line number Diff line change
Expand Up @@ -17,32 +17,33 @@

package org.apache.spark.mllib.examples;

import java.util.regex.Pattern;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.api.java.function.Function;

import org.apache.spark.mllib.clustering.KMeans;
import org.apache.spark.mllib.clustering.KMeansModel;

import java.util.Arrays;
import java.util.regex.Pattern;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;

/**
* Example using MLLib KMeans from Java.
*/
public final class JavaKMeans {

static class ParsePoint implements Function<String, double[]> {
private static class ParsePoint implements Function<String, Vector> {
private static final Pattern SPACE = Pattern.compile(" ");

@Override
public double[] call(String line) {
public Vector call(String line) {
String[] tok = SPACE.split(line);
double[] point = new double[tok.length];
for (int i = 0; i < tok.length; ++i) {
point[i] = Double.parseDouble(tok[i]);
}
return point;
return Vectors.dense(point);
}
}

Expand All @@ -65,15 +66,15 @@ public static void main(String[] args) {

JavaSparkContext sc = new JavaSparkContext(args[0], "JavaKMeans",
System.getenv("SPARK_HOME"), JavaSparkContext.jarOfClass(JavaKMeans.class));
JavaRDD<String> lines = sc.textFile(args[1]);
JavaRDD<String> lines = sc.textFile(inputFile);

JavaRDD<double[]> points = lines.map(new ParsePoint());
JavaRDD<Vector> points = lines.map(new ParsePoint());

KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs);
KMeansModel model = KMeans.train(points.rdd(), k, iterations, runs, KMeans.K_MEANS_PARALLEL());

System.out.println("Cluster centers:");
for (double[] center : model.clusterCenters()) {
System.out.println(" " + Arrays.toString(center));
for (Vector center : model.clusterCenters()) {
System.out.println(" " + center);
}
double cost = model.computeCost(points.rdd());
System.out.println("Cost: " + cost);
Expand Down
5 changes: 5 additions & 0 deletions mllib/pom.xml
Original file line number Diff line number Diff line change
Expand Up @@ -60,6 +60,11 @@
<artifactId>jblas</artifactId>
<version>1.2.3</version>
</dependency>
<dependency>
<groupId>org.scalanlp</groupId>
<artifactId>breeze_${scala.binary.version}</artifactId>
<version>0.7</version>
</dependency>
<dependency>
<groupId>org.scalatest</groupId>
<artifactId>scalatest_${scala.binary.version}</artifactId>
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -16,14 +16,16 @@
*/

package org.apache.spark.mllib.api.python

import java.nio.{ByteBuffer, ByteOrder}

import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.regression._
import org.apache.spark.mllib.classification._
import org.apache.spark.mllib.clustering._
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.recommendation._
import org.apache.spark.mllib.regression._
import org.apache.spark.rdd.RDD
import java.nio.ByteBuffer
import java.nio.ByteOrder

/**
* The Java stubs necessary for the Python mllib bindings.
Expand Down Expand Up @@ -205,10 +207,10 @@ class PythonMLLibAPI extends Serializable {
def trainKMeansModel(dataBytesJRDD: JavaRDD[Array[Byte]], k: Int,
maxIterations: Int, runs: Int, initializationMode: String):
java.util.List[java.lang.Object] = {
val data = dataBytesJRDD.rdd.map(xBytes => deserializeDoubleVector(xBytes))
val data = dataBytesJRDD.rdd.map(xBytes => Vectors.dense(deserializeDoubleVector(xBytes)))
val model = KMeans.train(data, k, maxIterations, runs, initializationMode)
val ret = new java.util.LinkedList[java.lang.Object]()
ret.add(serializeDoubleMatrix(model.clusterCenters))
ret.add(serializeDoubleMatrix(model.clusterCenters.map(_.toArray)))
ret
}

Expand Down
Loading

0 comments on commit 80c2968

Please sign in to comment.