# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the returned coefficient matrix contains the coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm.
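As a rough usage sketch of the API introduced here (illustrative only, not text from the patch; it assumes an existing `SparkContext` named `sc` and rows supplied as an `RDD[Array[Double]]`):

```scala
import org.apache.spark.mllib.linalg.PCA

// Hypothetical observations: each inner array is one row of X.
val rows = sc.makeRDD(Seq(
  Array(1.0, 2.0, 3.0),
  Array(2.0, 1.0, 0.5),
  Array(3.0, 4.0, 2.0),
  Array(4.0, 3.0, 1.5)))

// Coefficients of the top 2 principal components, returned as Array[Array[Double]].
val coeffs = new PCA().setK(2).compute(rows)
```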

## Testing
Tests included:
 * All principal components
 * Only top k principal components
 * Dense SVD tests
 * Dense/sparse matrix tests

The results are tested against MATLAB's pca: http://www.mathworks.com/help/stats/pca.html
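Comparisons against reference output only make sense up to floating-point error (and, for PCA, up to the sign of each component). A hedged sketch of the kind of element-wise tolerance check such tests can use — the patch's own `assertMatrixApproximatelyEquals` may differ — looks like this:

```scala
// Illustrative only; not the actual test helper added by this patch.
def assertMatrixApproximatelyEquals(a: Array[Array[Double]],
                                    b: Array[Array[Double]],
                                    tol: Double = 1e-8) {
  require(a.length == b.length, "row counts differ")
  for (i <- a.indices) {
    require(a(i).length == b(i).length, s"row $i lengths differ")
    for (j <- a(i).indices) {
      require(math.abs(a(i)(j) - b(i)(j)) <= tol,
        s"entry ($i,$j) differs: ${a(i)(j)} vs ${b(i)(j)}")
    }
  }
}
```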

## Documentation
Added to mllib-guide.md

## Example Usage
Added to examples directory under SparkPCA.scala

Author: Reza Zadeh <[email protected]>

Closes apache#88 from rezazadeh/sparkpca and squashes the following commits:

e298700 [Reza Zadeh] reformat using IDE
3f23271 [Reza Zadeh] documentation and cleanup
b025ab2 [Reza Zadeh] documentation
e2667d4 [Reza Zadeh] assertMatrixApproximatelyEquals
3787bb4 [Reza Zadeh] stylin
c6ecc1f [Reza Zadeh] docs
aa2bbcb [Reza Zadeh] rename sparseToTallSkinnyDense
56975b0 [Reza Zadeh] docs
2df9bde [Reza Zadeh] docs update
8fb0015 [Reza Zadeh] rcond documentation
dbf7797 [Reza Zadeh] correct argument number
a9f1f62 [Reza Zadeh] documentation
4ce6caa [Reza Zadeh] style changes
9a56a02 [Reza Zadeh] use rcond relative to larget svalue
120f796 [Reza Zadeh] housekeeping
156ff78 [Reza Zadeh] string comprehension
2e1cf43 [Reza Zadeh] rename rcond
ea223a6 [Reza Zadeh] many style changes
f4002d7 [Reza Zadeh] more docs
bd53c7a [Reza Zadeh] proper accumulator
a8b5ecf [Reza Zadeh] Don't use for loops
0dc7980 [Reza Zadeh] filter zeros in sparse
6115610 [Reza Zadeh] More documentation
36d51e8 [Reza Zadeh] use JBLAS for UVS^-1 computation
bc4599f [Reza Zadeh] configurable rcond
86f7515 [Reza Zadeh] compute per parition, use while
09726b3 [Reza Zadeh] more style changes
4195e69 [Reza Zadeh] private, accumulator
17002be [Reza Zadeh] style changes
4ba7471 [Reza Zadeh] style change
f4982e6 [Reza Zadeh] Use dense matrix in example
2828d28 [Reza Zadeh] optimizations: normalize once, use inplace ops
72c9fa1 [Reza Zadeh] rename DenseMatrix to TallSkinnyDenseMatrix, lean
f807be9 [Reza Zadeh] fix typo
2d7ccde [Reza Zadeh] Array interface for dense svd and pca
cd290fa [Reza Zadeh] provide RDD[Array[Double]] support
398d123 [Reza Zadeh] style change
55abbfa [Reza Zadeh] docs fix
ef29644 [Reza Zadeh] bad chnage undo
472566e [Reza Zadeh] all files from old pr
555168f [Reza Zadeh] initial files
rezazadeh authored and mateiz committed Mar 20, 2014
1 parent ffe272d commit 66a03e5
Showing 12 changed files with 819 additions and 114 deletions.
1 change: 1 addition & 0 deletions docs/mllib-guide.md
@@ -29,6 +29,7 @@ The following links provide a detailed explanation of the methods and usage examples
* Gradient Descent and Stochastic Gradient Descent
* <a href="mllib-linear-algebra.html">Linear Algebra</a>
* Singular Value Decomposition
* Principal Component Analysis

# Dependencies
MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
13 changes: 13 additions & 0 deletions docs/mllib-linear-algebra.md
@@ -59,3 +59,16 @@ val s = decomposed.S.data

println("singular values = " + s.toArray.mkString)
{% endhighlight %}


# Principal Component Analysis

Computes the top k principal component coefficients for the m-by-n data matrix X.
Rows of X correspond to observations and columns correspond to variables.
The coefficient matrix is n-by-k. Each column of the returned matrix contains coefficients
for one principal component, and the columns are in descending
order of component variance. This function centers the data and uses the
singular value decomposition (SVD) algorithm.

Input is expected as a TallSkinnyDenseMatrix (or an RDD[Array[Double]] of rows), and the coefficients are
returned as an n-by-k Array[Array[Double]]. See the examples directory under "SparkPCA.scala" for example usage.
51 changes: 51 additions & 0 deletions examples/src/main/scala/org/apache/spark/examples/mllib/SparkPCA.scala
@@ -0,0 +1,51 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.examples.mllib

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.PCA
import org.apache.spark.mllib.linalg.MatrixEntry
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.util._


/**
* Compute PCA of an example matrix.
*/
object SparkPCA {
  def main(args: Array[String]) {
    if (args.length != 3) {
      System.err.println("Usage: SparkPCA <master> m n")
      System.exit(1)
    }
    val sc = new SparkContext(args(0), "PCA",
      System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))

    val m = args(1).toInt
    val n = args(2).toInt

    // Make example matrix
    val data = Array.tabulate(m, n) { (a, b) =>
      (a + 2).toDouble * (b + 1) / (1 + a + b) }

    // recover top principal component
    val coeffs = new PCA().setK(1).compute(sc.makeRDD(data))

    println("top principal component = " + coeffs.map(_.mkString(", ")).mkString("; "))
  }
}
examples/src/main/scala/org/apache/spark/examples/mllib/SparkSVD.scala
@@ -38,7 +38,7 @@ object SparkSVD {
System.exit(1)
}
val sc = new SparkContext(args(0), "SVD",
System.getenv("SPARK_HOME"), Seq(System.getenv("SPARK_EXAMPLES_JAR")))
System.getenv("SPARK_HOME"), SparkContext.jarOfClass(this.getClass))

// Load and parse the data file
val data = sc.textFile(args(1)).map { line =>
@@ -49,7 +49,7 @@
val n = args(3).toInt

// recover largest singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)
val decomposed = new SVD().setK(1).compute(SparseMatrix(data, m, n))
val u = decomposed.U.data
val s = decomposed.S.data
val v = decomposed.V.data
26 changes: 26 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/linalg/MatrixRow.scala
@@ -0,0 +1,26 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.linalg

/**
* Class that represents a row of a dense matrix
*
* @param i row index (0 indexing used)
* @param data entries of the row
*/
case class MatrixRow(i: Int, data: Array[Double])
120 changes: 120 additions & 0 deletions mllib/src/main/scala/org/apache/spark/mllib/linalg/PCA.scala
@@ -0,0 +1,120 @@
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/

package org.apache.spark.mllib.linalg

import org.apache.spark.rdd.RDD


import org.jblas.DoubleMatrix


/**
* Class used to obtain principal components
*/
class PCA {
  private var k = 1

  /**
   * Set the number k of top principal components to return
   */
  def setK(k: Int): PCA = {
    this.k = k
    this
  }

  /**
   * Compute PCA using the currently set parameters
   */
  def compute(matrix: TallSkinnyDenseMatrix): Array[Array[Double]] = {
    computePCA(matrix)
  }

  /**
   * Compute PCA using the currently set parameters
   * See computePCA() for more details
   */
  def compute(matrix: RDD[Array[Double]]): Array[Array[Double]] = {
    computePCA(matrix)
  }

  /**
   * Computes the top k principal component coefficients for the m-by-n data matrix X.
   * Rows of X correspond to observations and columns correspond to variables.
   * The coefficient matrix is n-by-k. Each column of the returned matrix contains
   * coefficients for one principal component, and the columns are in descending
   * order of component variance.
   * This function centers the data and uses the
   * singular value decomposition (SVD) algorithm.
   *
   * @param matrix dense matrix to perform PCA on
   * @return An nxk matrix with principal components in columns. Columns are inner arrays
   */
  private def computePCA(matrix: TallSkinnyDenseMatrix): Array[Array[Double]] = {
    val m = matrix.m
    val n = matrix.n

    if (m <= 0 || n <= 0) {
      throw new IllegalArgumentException(s"Expecting a well-formed matrix: m=$m n=$n")
    }

    computePCA(matrix.rows.map(_.data))
  }

  /**
   * Computes the top k principal component coefficients for the m-by-n data matrix X.
   * Rows of X correspond to observations and columns correspond to variables.
   * The coefficient matrix is n-by-k. Each column of the returned matrix contains
   * coefficients for one principal component, and the columns are in descending
   * order of component variance.
   * This function centers the data and uses the
   * singular value decomposition (SVD) algorithm.
   *
   * @param matrix dense matrix to perform PCA on
   * @return An nxk matrix of principal components
   */
  private def computePCA(matrix: RDD[Array[Double]]): Array[Array[Double]] = {
    val n = matrix.first.size

    // compute column sums and row count, to get the column means
    val colSumsTemp = matrix.map((_, 1)).fold((Array.ofDim[Double](n), 0)) {
      (a, b) =>
        val am = new DoubleMatrix(a._1)
        val bm = new DoubleMatrix(b._1)
        am.addi(bm)
        (a._1, a._2 + b._2)
    }

    val m = colSumsTemp._2
    val colSums = colSumsTemp._1.map(x => x / m)

    // center the data by subtracting the column means from each row
    val data = matrix.map {
      x =>
        val row = Array.ofDim[Double](n)
        var i = 0
        while (i < n) {
          row(i) = x(i) - colSums(i)
          i += 1
        }
        row
    }

    // SVD of the centered matrix; the right singular vectors are the principal components
    val (u, s, v) = new SVD().setK(k).compute(data)
    v
  }
}
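For intuition about what `computePCA` returns, here is a local, non-distributed sketch of the same idea (not part of this patch; `LocalPCASketch` and its data are made up): center each column of X, take the SVD of the centered matrix with jblas, and read the principal directions off the columns of V.

```scala
import org.jblas.{DoubleMatrix, Singular}

object LocalPCASketch {
  def main(args: Array[String]) {
    val m = 4
    val n = 3
    val x = new DoubleMatrix(Array(
      Array(1.0, 2.0, 3.0),
      Array(2.0, 1.0, 0.5),
      Array(3.0, 4.0, 2.0),
      Array(4.0, 3.0, 1.5)))

    // Center each column by subtracting its mean (what computePCA does via colSums).
    val centered = x.dup()
    var j = 0
    while (j < n) {
      val mean = x.getColumn(j).sum() / m
      var i = 0
      while (i < m) {
        centered.put(i, j, x.get(i, j) - mean)
        i += 1
      }
      j += 1
    }

    // SVD of the centered matrix: X_c = U * diag(S) * V^T,
    // so the columns of V are the principal component coefficients.
    val Array(u, s, v) = Singular.fullSVD(centered)
    println("top principal component = " + v.getColumn(0).toArray.mkString(", "))
  }
}
```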

