aboutsummaryrefslogtreecommitdiff
path: root/docs/mllib-dimensionality-reduction.md
diff options
context:
space:
mode:
authorReza Zadeh <rizlar@gmail.com>2014-08-24 17:35:54 -0700
committerXiangrui Meng <meng@databricks.com>2014-08-24 17:35:54 -0700
commitb1b20301b3a1b35564d61e58eb5964d5ad5e4d7d (patch)
tree4dde97e8f650b8ee274ff9853e1bc50b83de53e6 /docs/mllib-dimensionality-reduction.md
parent572952ae615895efaaabcd509d582262000c0852 (diff)
downloadspark-b1b20301b3a1b35564d61e58eb5964d5ad5e4d7d.tar.gz
spark-b1b20301b3a1b35564d61e58eb5964d5ad5e4d7d.tar.bz2
spark-b1b20301b3a1b35564d61e58eb5964d5ad5e4d7d.zip
[MLlib][SPARK-2997] Update SVD documentation to reflect roughly square
Update the documentation to reflect the fact we can handle roughly square matrices. Author: Reza Zadeh <rizlar@gmail.com> Closes #2070 from rezazadeh/svddocs and squashes the following commits: 826b8fe [Reza Zadeh] left singular vectors 3f34fc6 [Reza Zadeh] PCA is still TS 7ffa2aa [Reza Zadeh] better title aeaf39d [Reza Zadeh] More docs 788ed13 [Reza Zadeh] add computational cost explanation 6429c59 [Reza Zadeh] Add link to rowmatrix docs 1eeab8b [Reza Zadeh] Update SVD documentation to reflect roughly square
Diffstat (limited to 'docs/mllib-dimensionality-reduction.md')
-rw-r--r--docs/mllib-dimensionality-reduction.md29
1 files changed, 23 insertions, 6 deletions
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index 065d646496..9f2cf6d48e 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -11,7 +11,7 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
of reducing the number of variables under consideration.
It can be used to extract latent features from raw and noisy features
or compress data while maintaining the structure.
-MLlib provides support for dimensionality reduction on tall-and-skinny matrices.
+MLlib provides support for dimensionality reduction on the <a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
## Singular value decomposition (SVD)
@@ -39,8 +39,26 @@ If we keep the top $k$ singular values, then the dimensions of the resulting low
* `$\Sigma$`: `$k \times k$`,
* `$V$`: `$n \times k$`.
-MLlib provides SVD functionality to row-oriented matrices that have only a few columns,
-say, less than $1000$, but many rows, i.e., *tall-and-skinny* matrices.
+### Performance
+We assume $n$ is smaller than $m$. The singular values and the right singular vectors are derived
+from the eigenvalues and the eigenvectors of the Gramian matrix $A^T A$. The matrix
+storing the left singular vectors $U$, is computed via matrix multiplication as
+$U = A (V S^{-1})$, if requested by the user via the computeU parameter.
+The actual method to use is determined automatically based on the computational cost:
+
+* If $n$ is small ($n < 100$) or $k$ is large compared with $n$ ($k > n / 2$), we compute the Gramian matrix
+first and then compute its top eigenvalues and eigenvectors locally on the driver.
+This requires a single pass with $O(n^2)$ storage on each executor and on the driver, and
+$O(n^2 k)$ time on the driver.
+* Otherwise, we compute $(A^T A) v$ in a distributive way and send it to
+<a href="http://www.caam.rice.edu/software/ARPACK/">ARPACK</a> to
+compute $(A^T A)$'s top eigenvalues and eigenvectors on the driver node. This requires $O(k)$
+passes, $O(n)$ storage on each executor, and $O(n k)$ storage on the driver.
+
+### SVD Example
+
+MLlib provides SVD functionality to row-oriented matrices, provided in the
+<a href="mllib-basics.html#rowmatrix">RowMatrix</a> class.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -124,9 +142,8 @@ MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-The following code demonstrates how to compute principal components on a tall-and-skinny `RowMatrix`
+The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.
-The number of columns should be small, e.g, less than 1000.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Matrix
@@ -144,7 +161,7 @@ val projected: RowMatrix = mat.multiply(pc)
<div data-lang="java" markdown="1">
-The following code demonstrates how to compute principal components on a tall-and-skinny `RowMatrix`
+The following code demonstrates how to compute principal components on a `RowMatrix`
and use them to project the vectors into a low-dimensional space.
The number of columns should be small, e.g, less than 1000.