---
layout: global
title: MLlib - Linear Algebra
---

* Table of contents
{:toc}


# Singular Value Decomposition
MLlib computes the singular value decomposition (SVD) of tall-and-skinny matrices.
Given an `$m \times n$` matrix `$A$`, we can compute matrices `$U,S,V$` such that

`\[
 A = U \cdot S \cdot V^T
 \]`

There is no restriction on `$m$`, but we require `$n^2$` doubles to
fit in memory on a single machine.
Further, `$n$` should be less than `$m$`.

The decomposition is computed by first forming the Gram matrix `$A^T A = V S^2 V^T$`,
computing its SVD locally (it is only `$n \times n$`, hence small),
and recovering `$S$` and `$V$` from that.
Then `$U$` is obtained by the matrix product
`$U = A \cdot V \cdot S^{-1}$`.
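Because the rows of `$A$` enter only through `$A^T A = \sum_i a_i a_i^T$`, where `$a_i$` is the `$i$`-th row of `$A$` written as a column vector, the Gram matrix can be accumulated in one distributed pass over the entries while only `$n \times n$` values are ever materialized. Below is a minimal sketch of that accumulation; it is illustrative only and not part of the MLlib API (the `Entry` class stands in for MLlib's `MatrixEntry`).

{% highlight scala %}
// Illustrative stand-in for MatrixEntry: a 0-indexed ((i, j), value) entry.
case class Entry(i: Int, j: Int, value: Double)

val entries = sc.parallelize(Seq(
  Entry(0, 0, 1.0), Entry(0, 1, 2.0),
  Entry(1, 0, 3.0), Entry(2, 1, 4.0)))

// Group the nonzeros of each row, emit that row's outer product with itself,
// and sum over rows: the result is the set of nonzero entries of A^T A.
val gram = entries
  .map(e => (e.i, (e.j, e.value)))
  .groupByKey()
  .flatMap { case (_, cols) =>
    for ((j1, v1) <- cols; (j2, v2) <- cols) yield ((j1, j2), v1 * v2)
  }
  .reduceByKey(_ + _)

gram.collect().foreach(println)
{% endhighlight %}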

Only the singular vectors associated with the largest `$k$` singular values
are recovered. If there are `$k$` such values, the dimensions of the returned
matrices are:

* `$S$` is `$k \times k$` and diagonal, holding the singular values on its diagonal.
* `$U$` is `$m \times k$` and satisfies `$U^T U = \mathop{eye}(k)$`.
* `$V$` is `$n \times k$` and satisfies `$V^T V = \mathop{eye}(k)$`.

All input and output are expected in sparse matrix format: 0-indexed entries of the
form `((i, j), value)`, held in `SparseMatrix` RDDs. Example usage is shown below.

{% highlight scala %}

import org.apache.spark.SparkContext
import org.apache.spark.mllib.linalg.SVD
import org.apache.spark.mllib.linalg.SparseMatrix
import org.apache.spark.mllib.linalg.MatrixEntry

// Load and parse the data file
val data = sc.textFile("mllib/data/als/test.data").map { line =>
  val parts = line.split(',')
  MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
}
val m = 4
val n = 4
val k = 1

// recover largest singular vector
val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), k)
val s = decomposed.S.data

println("singular values = " + s.toArray.mkString)
{% endhighlight %}
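
The returned decomposition also carries the singular vectors. A minimal follow-up sketch, assuming the result exposes `U` and `V` as `SparseMatrix` fields alongside `S` (each wrapping an RDD of `MatrixEntry`, as suggested by the dimensions listed above):

{% highlight scala %}
// Sketch only: assumes decomposed.U and decomposed.V mirror decomposed.S.
val u = decomposed.U.data   // entries of the m x k matrix U
val v = decomposed.V.data   // entries of the n x k matrix V

println("V = " + v.collect().mkString(", "))
{% endhighlight %}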


# Principal Component Analysis

MLlib computes the top `$k$` principal component coefficients for an `$m \times n$` data matrix `$X$`.
Rows of `$X$` correspond to observations and columns to variables.
The returned coefficient matrix is `$n \times k$`: each column contains the coefficients
for one principal component, and the columns are in descending order of component variance.
The computation centers the data and then uses the
singular value decomposition (SVD) algorithm.
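
In the notation of the SVD section, if `$\bar{X}$` denotes the column-centered data matrix with decomposition

`\[
 \bar{X} = U \cdot S \cdot V^T
 \]`

then the coefficient matrix consists of the first `$k$` columns of `$V$`, and the variance of each component is proportional to the square of the corresponding singular value.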

All input and output are expected in `DenseMatrix` format. See `SparkPCA.scala` in the
examples directory for example usage.