--- layout: global title: MLlib - Linear Algebra --- * Table of contents {:toc} # Singular Value Decomposition Singular Value `Decomposition` for Tall and Skinny matrices. Given an `$m \times n$` matrix `$A$`, we can compute matrices `$U,S,V$` such that `\[ A = U \cdot S \cdot V^T \]` There is no restriction on m, but we require n^2 doubles to fit in memory locally on one machine. Further, n should be less than m. The decomposition is computed by first computing `$A^TA = V S^2 V^T$`, computing SVD locally on that (since `$n \times n$` is small), from which we recover `$S$` and `$V$`. Then we compute U via easy matrix multiplication as `$U = A \cdot V \cdot S^{-1}$`. Only singular vectors associated with largest k singular values are recovered. If there are k such values, then the dimensions of the return will be: * `$S$` is `$k \times k$` and diagonal, holding the singular values on diagonal. * `$U$` is `$m \times k$` and satisfies `$U^T U = \mathop{eye}(k)$`. * `$V$` is `$n \times k$` and satisfies `$V^T V = \mathop{eye}(k)$`. All input and output is expected in sparse matrix format, 0-indexed as tuples of the form ((i,j),value) all in SparseMatrix RDDs. Below is example usage. {% highlight scala %} import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg.SVD import org.apache.spark.mllib.linalg.SparseMatrix import org.apache.spark.mllib.linalg.MatrixEntry // Load and parse the data file val data = sc.textFile("mllib/data/als/test.data").map { line => val parts = line.split(',') MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble) } val m = 4 val n = 4 val k = 1 // recover largest singular vector val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), k) val = decomposed.S.data println("singular values = " + s.toArray.mkString) {% endhighlight %} # Principal Component Analysis Computes the top k principal component coefficients for the m-by-n data matrix X. Rows of X correspond to observations and columns correspond to variables. The coefficient matrix is n-by-k. Each column of the return matrix contains coefficients for one principal component, and the columns are in descending order of component variance. This function centers the data and uses the singular value decomposition (SVD) algorithm. All input and output is expected in DenseMatrix matrix format. See the examples directory under "SparkPCA.scala" for example usage.