author: Ameet Talwalkar <atalwalkar@gmail.com>, 2014-08-12 17:15:21 -0700
committer: Xiangrui Meng <meng@databricks.com>, 2014-08-12 17:15:21 -0700
commit: c235b83e2782cce0626ecc403c0a67e442be52c1
tree: 30b4ada17cba016cc2a8a7f01f09b7bcb78fbace /docs/mllib-dimensionality-reduction.md
parent: 882da57a1c8c075a87909d516b169b624941a6ec
SPARK-2830 [MLlib]: re-organize mllib documentation
As per discussions with Xiangrui, I've reorganized and edited the mllib documentation.

Author: Ameet Talwalkar <atalwalkar@gmail.com>

Closes #1908 from atalwalkar/master and squashes the following commits:

fe6938a [Ameet Talwalkar] made xiangruis suggested changes
840028b [Ameet Talwalkar] made xiangruis suggested changes
7ec366a [Ameet Talwalkar] reorganize and edit mllib documentation
Diffstat (limited to 'docs/mllib-dimensionality-reduction.md')
-rw-r--r-- docs/mllib-dimensionality-reduction.md | 44
1 files changed, 24 insertions, 20 deletions
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index 8e434998c1..065d646496 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -9,9 +9,9 @@ displayTitle: <a href="mllib-guide.html">MLlib</a> - Dimensionality Reduction
[Dimensionality reduction](http://en.wikipedia.org/wiki/Dimensionality_reduction) is the process
of reducing the number of variables under consideration.
-It is used to extract latent features from raw and noisy features,
+It can be used to extract latent features from raw and noisy features
or compress data while maintaining the structure.
-In this release, we provide preliminary support for dimensionality reduction on tall-and-skinny matrices.
+MLlib provides support for dimensionality reduction on tall-and-skinny matrices.
## Singular value decomposition (SVD)
@@ -30,17 +30,17 @@ where
* $V$ is an orthonormal matrix, whose columns are called right singular vectors.
For large matrices, usually we don't need the complete factorization but only the top singular
-values and its associated singular vectors. This can save storage, and more importantly, de-noise
+values and its associated singular vectors. This can save storage, de-noise
and recover the low-rank structure of the matrix.
-If we keep the top $k$ singular values, then the dimensions of the return will be:
+If we keep the top $k$ singular values, then the dimensions of the resulting low-rank matrix will be:
* `$U$`: `$m \times k$`,
* `$\Sigma$`: `$k \times k$`,
* `$V$`: `$n \times k$`.
-In this release, we provide SVD computation to row-oriented matrices that have only a few columns,
-say, less than $1000$, but many rows, which we call *tall-and-skinny*.
+MLlib provides SVD functionality to row-oriented matrices that have only a few columns,
+say, less than $1000$, but many rows, i.e., *tall-and-skinny* matrices.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -58,15 +58,10 @@ val s: Vector = svd.s // The singular values are stored in a local dense vector.
val V: Matrix = svd.V // The V factor is a local dense matrix.
{% endhighlight %}
-Same code applies to `IndexedRowMatrix`.
-The only difference that the `U` matrix becomes an `IndexedRowMatrix`.
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
</div>
<div data-lang="java" markdown="1">
-In order to run the following standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
-
{% highlight java %}
import java.util.LinkedList;
@@ -104,8 +99,16 @@ public class SVD {
}
}
{% endhighlight %}
-Same code applies to `IndexedRowMatrix`.
-The only difference that the `U` matrix becomes an `IndexedRowMatrix`.
+
+The same code applies to `IndexedRowMatrix` if `U` is defined as an
+`IndexedRowMatrix`.
+
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
+
</div>
</div>
@@ -116,7 +119,7 @@ statistical method to find a rotation such that the first coordinate has the lar
possible, and each succeeding coordinate in turn has the largest variance possible. The columns of
the rotation matrix are called principal components. PCA is used widely in dimensionality reduction.
-In this release, we implement PCA for tall-and-skinny matrices stored in row-oriented format.
+MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format.
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -180,9 +183,10 @@ public class PCA {
}
{% endhighlight %}
-In order to run the above standalone application using Spark framework make
-sure that you follow the instructions provided at section [Standalone
-Applications](quick-start.html) of the quick-start guide. What is more, you
-should include to your build file *spark-mllib* as a dependency.
+In order to run the above standalone application, follow the instructions
+provided in the [Standalone
+Applications](quick-start.html#standalone-applications) section of the Spark
+quick-start guide. Be sure to also include *spark-mllib* to your build file as
+a dependency.
</div>
</div>
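
Note on the SVD hunk above: the patched text says that keeping the top $k$ singular values "can save storage" and lists the factor shapes as $m \times k$, $k \times k$, and $n \times k$. The arithmetic behind that claim can be sketched as follows. This is a minimal illustration, not MLlib code: the class name, method names, and the sample values of $m$, $n$, and $k$ are assumptions chosen for the example. The singular values are counted as a length-$k$ vector, matching the Scala snippet's comment that they are stored in a local dense vector.

```java
// Illustrative only: class and method names are hypothetical, not part of MLlib.
public class TruncatedSvdStorage {

    // Number of entries needed to store the full m x n matrix densely.
    static long fullEntries(long m, long n) {
        return m * n;
    }

    // Number of entries needed for the rank-k factors: U (m x k), the k
    // singular values (kept as a vector rather than a k x k matrix),
    // and V (n x k).
    static long truncatedEntries(long m, long n, long k) {
        return m * k + k + n * k;
    }

    public static void main(String[] args) {
        long m = 1_000_000L; // many rows (assumed value)
        long n = 500L;       // few columns, below the ~1000 the doc mentions
        long k = 10L;        // singular values kept (assumed value)

        System.out.println("full:      " + fullEntries(m, n));
        System.out.println("truncated: " + truncatedEntries(m, n, k));
    }
}
```

With these assumed sizes the truncated factors need about 50 times fewer entries than the dense matrix. The saving grows as $k$ shrinks relative to $n$, which is why a tall-and-skinny input with a small $k$ is the favorable case.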