path: root/docs/mllib-dimensionality-reduction.md
author    Xiangrui Meng <meng@databricks.com>    2014-04-22 11:20:47 -0700
committer Patrick Wendell <pwendell@gmail.com>   2014-04-22 11:20:47 -0700
commit    26d35f3fd942761b0adecd1a720e1fa834db4de9 (patch)
tree      16e57e2ff01e7cd2d7a1a3c1f3bf98c9cf98a082 /docs/mllib-dimensionality-reduction.md
parent    bf9d49b6d1f668b49795c2d380ab7d64ec0029da (diff)
[SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng <meng@databricks.com>

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
Diffstat (limited to 'docs/mllib-dimensionality-reduction.md')
-rw-r--r--  docs/mllib-dimensionality-reduction.md  |  86
1 file changed, 86 insertions(+), 0 deletions(-)
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
new file mode 100644
index 0000000000..4e9ecf7c00
--- /dev/null
+++ b/docs/mllib-dimensionality-reduction.md
@@ -0,0 +1,86 @@
+---
+layout: global
+title: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
+---
+
+* Table of contents
+{:toc}
+
+[Dimensionality reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the process
+of reducing the number of variables under consideration.
+It can be used to extract latent features from raw and noisy features,
+or to compress data while preserving its structure.
+In this release, we provide preliminary support for dimensionality reduction on tall-and-skinny matrices.
+
+## Singular value decomposition (SVD)
+
+[Singular value decomposition (SVD)](http://en.wikipedia.org/wiki/Singular_value_decomposition)
+factorizes a matrix into three matrices: $U$, $\Sigma$, and $V$ such that
+
+`\[
+A = U \Sigma V^T,
+\]`
+
+where
+
+* $U$ is an orthonormal matrix whose columns are called left singular vectors,
+* $\Sigma$ is a diagonal matrix with non-negative diagonal entries in descending order,
+  whose diagonal entries are called singular values,
+* $V$ is an orthonormal matrix whose columns are called right singular vectors.
+
+For large matrices, we usually don't need the complete factorization but only the top singular
+values and their associated singular vectors. This can save storage and, more importantly, de-noise
+the data and recover the low-rank structure of the matrix.
+
+If we keep the top $k$ singular values, then the dimensions of the returned matrices will be:
+
+* `$U$`: `$m \times k$`,
+* `$\Sigma$`: `$k \times k$`,
+* `$V$`: `$n \times k$`.
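+
+Equivalently, keeping the top $k$ singular values yields the best rank-$k$ approximation of $A$
+in the least-squares sense (the Eckart-Young result, noted here for context):
+
+`\[
+A \approx U \Sigma V^T = \sum_{i=1}^k \sigma_i u_i v_i^T,
+\]`
+
+where $u_i$ and $v_i$ are the $i$-th left and right singular vectors and $\sigma_i$ is the $i$-th singular value.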
+
+In this release, we provide SVD computation for row-oriented matrices that have only a few columns,
+say, fewer than $1000$, but many rows, which we call *tall-and-skinny*.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+val mat: RowMatrix = ...
+
+// Compute the top 20 singular values and corresponding singular vectors.
+val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)
+val U: RowMatrix = svd.U // The U factor is a RowMatrix.
+val s: Vector = svd.s // The singular values are stored in a local dense vector.
+val V: Matrix = svd.V // The V factor is a local dense matrix.
+{% endhighlight %}
+</div>
The same code applies to `IndexedRowMatrix`.
The only difference is that the `U` factor becomes an `IndexedRowMatrix`.
+</div>
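+
+The construction of `mat` is elided above. As one possible sketch, assuming a Spark shell where
+`sc` is predefined (the data values here are purely illustrative):
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+// A small tall-and-skinny example: 4 rows, 3 columns.
+val rows = sc.parallelize(Seq(
+  Vectors.dense(1.0, 0.0, 0.0),
+  Vectors.dense(0.0, 2.0, 0.0),
+  Vectors.dense(0.0, 0.0, 3.0),
+  Vectors.dense(1.0, 1.0, 1.0)))
+val mat = new RowMatrix(rows)
+
+// Keep the top 2 singular values and their singular vectors.
+val svd = mat.computeSVD(2, computeU = true)
+{% endhighlight %}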
+
+## Principal component analysis (PCA)
+
+[Principal component analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) is a
+statistical method to find a rotation such that the first coordinate has the largest variance
+possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
+the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.
+
+In this release, we implement PCA for tall-and-skinny matrices stored in row-oriented format.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+The following code demonstrates how to compute principal components on a tall-and-skinny `RowMatrix`
+and use them to project the vectors into a low-dimensional space.
+The number of columns should be small, e.g., fewer than 1000.
+
+{% highlight scala %}
+val mat: RowMatrix = ...
+
+// Compute the top 10 principal components.
+val pc: Matrix = mat.computePrincipalComponents(10) // Principal components are stored in a local dense matrix.
+
+// Project the rows to the linear space spanned by the top 10 principal components.
+val projected: RowMatrix = mat.multiply(pc)
+{% endhighlight %}
+</div>
+</div>
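+
+As with `mat` above, the input matrix construction is elided. A self-contained sketch, again
+assuming a Spark shell where `sc` is predefined and using illustrative data values:
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+// 3 rows, 3 columns; the rows are nearly collinear, so most of the
+// variance is captured by the first principal component.
+val rows = sc.parallelize(Seq(
+  Vectors.dense(1.0, 2.0, 3.0),
+  Vectors.dense(2.0, 4.0, 6.1),
+  Vectors.dense(3.0, 6.0, 9.2)))
+val mat = new RowMatrix(rows)
+
+// Principal components are returned as a 3 x 2 local dense matrix.
+val pc = mat.computePrincipalComponents(2)
+
+// Project each row into the 2-dimensional principal subspace.
+val projected = mat.multiply(pc)
+{% endhighlight %}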