author    Xiangrui Meng <meng@databricks.com>  2014-04-22 11:20:47 -0700
committer Patrick Wendell <pwendell@gmail.com>  2014-04-22 11:20:47 -0700
commit    26d35f3fd942761b0adecd1a720e1fa834db4de9 (patch)
tree      16e57e2ff01e7cd2d7a1a3c1f3bf98c9cf98a082 /docs/mllib-dimensionality-reduction.md
parent    bf9d49b6d1f668b49795c2d380ab7d64ec0029da (diff)
[SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html
Table of contents:
* Basics
* Data types
* Summary statistics
* Classification and regression
* linear support vector machine (SVM)
* logistic regression
* linear least squares, Lasso, and ridge regression
* decision tree
* naive Bayes
* Collaborative Filtering
* alternating least squares (ALS)
* Clustering
* k-means
* Dimensionality reduction
* singular value decomposition (SVD)
* principal component analysis (PCA)
* Optimization
* stochastic gradient descent
* limited-memory BFGS (L-BFGS)
Author: Xiangrui Meng <meng@databricks.com>
Closes #422 from mengxr/mllib-doc and squashes the following commits:
944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
Diffstat (limited to 'docs/mllib-dimensionality-reduction.md')
-rw-r--r-- | docs/mllib-dimensionality-reduction.md | 86 |
1 files changed, 86 insertions, 0 deletions
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
new file mode 100644
index 0000000000..4e9ecf7c00
--- /dev/null
+++ b/docs/mllib-dimensionality-reduction.md
@@ -0,0 +1,86 @@
+---
+layout: global
+title: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
+---
+
+* Table of contents
+{:toc}
+
+[Dimensionality reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the process
+of reducing the number of variables under consideration.
+It is used to extract latent features from raw and noisy features,
+or to compress data while preserving its structure.
+In this release, we provide preliminary support for dimensionality reduction on tall-and-skinny matrices.
+
+## Singular value decomposition (SVD)
+
+[Singular value decomposition (SVD)](http://en.wikipedia.org/wiki/Singular_value_decomposition)
+factorizes a matrix into three matrices: $U$, $\Sigma$, and $V$, such that
+
+`\[
+A = U \Sigma V^T,
+\]`
+
+where
+
+* $U$ is an orthonormal matrix, whose columns are called left singular vectors,
+* $\Sigma$ is a diagonal matrix with non-negative diagonal entries in descending order,
+  whose diagonal entries are called singular values,
+* $V$ is an orthonormal matrix, whose columns are called right singular vectors.
+
+For large matrices, we usually don't need the complete factorization but only the top singular
+values and their associated singular vectors. This saves storage and, more importantly, de-noises
+the matrix and recovers its low-rank structure.
+
+If we keep the top $k$ singular values, then the dimensions of the returned matrices will be:
+
+* `$U$`: `$m \times k$`,
+* `$\Sigma$`: `$k \times k$`,
+* `$V$`: `$n \times k$`.
+
+In this release, we provide SVD computation for row-oriented matrices that have only a few columns,
+say, fewer than $1000$, but many rows, which we call *tall-and-skinny*.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.{Matrix, SingularValueDecomposition, Vector}
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+val mat: RowMatrix = ...
+
+// Compute the top 20 singular values and corresponding singular vectors.
+val svd: SingularValueDecomposition[RowMatrix, Matrix] = mat.computeSVD(20, computeU = true)
+val U: RowMatrix = svd.U // The U factor is a RowMatrix.
+val s: Vector = svd.s // The singular values are stored in a local dense vector.
+val V: Matrix = svd.V // The V factor is a local dense matrix.
+{% endhighlight %}
+</div>
+The same code applies to `IndexedRowMatrix`;
+the only difference is that the `U` factor becomes an `IndexedRowMatrix`.
+</div>
+
+## Principal component analysis (PCA)
+
+[Principal component analysis (PCA)](http://en.wikipedia.org/wiki/Principal_component_analysis) is a
+statistical method that finds a rotation such that the first coordinate has the largest variance
+possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
+the rotation matrix are called principal components. PCA is widely used in dimensionality reduction.
+
+In this release, we implement PCA for tall-and-skinny matrices stored in row-oriented format.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+The following code demonstrates how to compute principal components on a tall-and-skinny `RowMatrix`
+and use them to project the vectors into a low-dimensional space.
+The number of columns should be small, e.g., fewer than 1000.
+
+{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
+val mat: RowMatrix = ...
+
+// Compute the top 10 principal components.
+val pc: Matrix = mat.computePrincipalComponents(10) // Principal components are stored in a local dense matrix.
+
+// Project the rows to the linear space spanned by the top 10 principal components.
+val projected: RowMatrix = mat.multiply(pc)
+{% endhighlight %}
+</div>
+</div>
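The truncated-SVD dimensions that the new guide claims ($U$: $m \times k$, $\Sigma$: $k \times k$, $V$: $n \times k$) can be sanity-checked outside Spark. This is a plain NumPy sketch, not the MLlib API; the matrix `A` and the sizes `m`, `n`, `k` are illustrative assumptions:

```python
import numpy as np

m, n, k = 9, 4, 2  # tall-and-skinny: many rows, few columns
rng = np.random.default_rng(0)
A = rng.standard_normal((m, n))

# Thin SVD, then keep only the top-k singular triplets.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
Uk, sk, Vk = U[:, :k], s[:k], Vt[:k, :].T

# Dimensions match the guide: U is m x k, Sigma is k x k, V is n x k.
assert Uk.shape == (m, k)
assert np.diag(sk).shape == (k, k)
assert Vk.shape == (n, k)

# Singular values are non-negative and returned in descending order.
assert np.all(sk >= 0) and np.all(np.diff(s) <= 0)
```

The rank-`k` product `Uk @ np.diag(sk) @ Vk.T` is then the best rank-`k` approximation of `A` in the Frobenius norm, which is the de-noising property the guide refers to.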
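The PCA workflow in the guide (compute principal components, then multiply to project the rows) can likewise be sketched in plain NumPy. This is an illustration of the underlying math, not the `computePrincipalComponents`/`multiply` implementation; the data and sizes are made up:

```python
import numpy as np

m, n, k = 9, 4, 2
rng = np.random.default_rng(1)
A = rng.standard_normal((m, n))

# Principal components are the top-k eigenvectors of the column covariance matrix.
cov = np.cov(A, rowvar=False)           # n x n covariance of the columns
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order
pc = eigvecs[:, ::-1][:, :k]            # n x k, largest-variance directions first

# Project each row onto the span of the top-k principal components.
projected = A @ pc                      # m x k

assert pc.shape == (n, k)
assert projected.shape == (m, k)
```

This mirrors the Scala example's shapes: the principal-component matrix is small and local ($n \times k$), while the projected matrix keeps one (shorter) row per input row.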