author     Dongjoon Hyun <dongjoon@apache.org>    2016-06-11 12:55:38 +0100
committer  Sean Owen <sowen@cloudera.com>         2016-06-11 12:55:38 +0100
commit     ad102af169c7344b30d3b84aa16452fcdc22542c (patch)
tree       3ddc38bba4e271d6e361c7a880d12c030a76a532
parent     3761330dd0151d7369d7fba4d4c344e9863990ef (diff)
[SPARK-15883][MLLIB][DOCS] Fix broken links in mllib documents
## What changes were proposed in this pull request?

This issue fixes all broken links in the Spark 2.0 preview MLlib documents. It also contains some editorial changes.

**Fix broken links**
* mllib-data-types.md
* mllib-decision-tree.md
* mllib-ensembles.md
* mllib-feature-extraction.md
* mllib-pmml-model-export.md
* mllib-statistics.md

**Fix malformed section header and Scala coding style**
* mllib-linear-methods.md

**Replace indirect forward links with direct ones**
* ml-classification-regression.md

## How was this patch tested?

Manual tests (with `cd docs; jekyll build`).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13608 from dongjoon-hyun/SPARK-15883.
-rw-r--r--  docs/ml-classification-regression.md   4
-rw-r--r--  docs/mllib-data-types.md               16
-rw-r--r--  docs/mllib-decision-tree.md             6
-rw-r--r--  docs/mllib-ensembles.md                 6
-rw-r--r--  docs/mllib-feature-extraction.md        2
-rw-r--r--  docs/mllib-linear-methods.md           10
-rw-r--r--  docs/mllib-pmml-model-export.md         2
-rw-r--r--  docs/mllib-statistics.md                8
8 files changed, 25 insertions, 29 deletions
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index 88457d4bb1..d7e5521cbc 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -815,7 +815,7 @@ The main differences between this API and the [original MLlib ensembles API](mll
## Random Forests
[Random forests](http://en.wikipedia.org/wiki/Random_forest)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
Random forests combine many decision trees in order to reduce the risk of overfitting.
The `spark.ml` implementation supports random forests for binary and multiclass classification and for regression,
using both continuous and categorical features.
@@ -896,7 +896,7 @@ All output columns are optional; to exclude an output column, set its correspond
## Gradient-Boosted Trees (GBTs)
[Gradient-Boosted Trees (GBTs)](http://en.wikipedia.org/wiki/Gradient_boosting)
-are ensembles of [decision trees](ml-decision-tree.html).
+are ensembles of [decision trees](ml-classification-regression.html#decision-trees).
GBTs iteratively train decision trees in order to minimize a loss function.
The `spark.ml` implementation supports GBTs for binary classification and for regression,
using both continuous and categorical features.
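The two hunks above retarget the decision-tree links in the `spark.ml` Random Forest and GBT sections. For reference, a minimal sketch (not part of this commit) of fitting a random forest classifier with the `spark.ml` API, assuming a DataFrame `training` with "label" and "features" columns:

{% highlight scala %}
import org.apache.spark.ml.classification.RandomForestClassifier

// Assumes `training` is a DataFrame with "label" and "features" columns.
val rf = new RandomForestClassifier()
  .setNumTrees(20)  // more trees reduce the risk of overfitting, at extra cost
  .setMaxDepth(5)
val model = rf.fit(training)
{% endhighlight %}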
diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md
index 2ffe0f1c2b..ef56aebbc3 100644
--- a/docs/mllib-data-types.md
+++ b/docs/mllib-data-types.md
@@ -33,7 +33,7 @@ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin
using the factory methods implemented in
[`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors.
-Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
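// A hedged continuation of the truncated example above (not part of this diff):
// create a dense vector (1.0, 0.0, 3.0) and its sparse equivalent (size, indices, values).
val dv: Vector = Vectors.dense(1.0, 0.0, 3.0)
val sv: Vector = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))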
@@ -199,7 +199,7 @@ After loading, the feature indices are converted to zero-based.
[`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training
examples stored in LIBSVM format.
-Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API.
+Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.regression.LabeledPoint
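import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

// A hedged continuation of the truncated example above (not part of this diff),
// assuming a SparkContext `sc`: load labeled points stored in LIBSVM format;
// the feature indices are converted to zero-based on load.
val examples: RDD[LabeledPoint] =
  MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")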
@@ -264,7 +264,7 @@ We recommend using the factory methods implemented
in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local
matrices. Remember, local matrices in MLlib are stored in column-major order.
-Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) for details on the API.
+Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) for details on the API.
{% highlight scala %}
import org.apache.spark.mllib.linalg.{Matrix, Matrices}
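// A hedged continuation of the truncated example above (not part of this diff):
// a 3x2 dense matrix ((1.0, 2.0), (3.0, 4.0), (5.0, 6.0)) stored in column-major order,
// and a 3x2 sparse matrix in CSC format (column pointers, row indices, non-zero values).
val dm: Matrix = Matrices.dense(3, 2, Array(1.0, 3.0, 5.0, 2.0, 4.0, 6.0))
val sm: Matrix = Matrices.sparse(3, 2, Array(0, 1, 3), Array(0, 2, 1), Array(9.0, 6.0, 8.0))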
@@ -331,7 +331,7 @@ sm = Matrices.sparse(3, 2, [0, 1, 3], [0, 2, 1], [9, 6, 8])
A distributed matrix has long-typed row and column indices and double-typed values, stored
distributively in one or more RDDs. It is very important to choose the right format to store large
and distributed matrices. Converting a distributed matrix to a different format may require a
-global shuffle, which is quite expensive. Three types of distributed matrices have been implemented
+global shuffle, which is quite expensive. Four types of distributed matrices have been implemented
so far.
The basic type is called `RowMatrix`. A `RowMatrix` is a row-oriented distributed
@@ -344,6 +344,8 @@ An `IndexedRowMatrix` is similar to a `RowMatrix` but with row indices,
which can be used for identifying rows and executing joins.
A `CoordinateMatrix` is a distributed matrix stored in [coordinate list (COO)](https://en.wikipedia.org/wiki/Sparse_matrix#Coordinate_list_.28COO.29) format,
backed by an RDD of its entries.
+A `BlockMatrix` is a distributed matrix backed by an RDD of `MatrixBlock`
+which is a tuple of `(Int, Int, Matrix)`.
***Note***
@@ -535,12 +537,6 @@ rowsRDD = mat.rows
# Convert to a RowMatrix by dropping the row indices.
rowMat = mat.toRowMatrix()
-
-# Convert to a CoordinateMatrix.
-coordinateMat = mat.toCoordinateMatrix()
-
-# Convert to a BlockMatrix.
-blockMat = mat.toBlockMatrix()
{% endhighlight %}
</div>
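The hunks above add `BlockMatrix` as the fourth distributed matrix type. A minimal sketch (not part of this commit) of building a `CoordinateMatrix` and converting it to a `BlockMatrix`, assuming a SparkContext `sc`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

// Entries are (row index, column index, value) triples.
val entries = sc.parallelize(Seq(
  MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0), MatrixEntry(2, 0, 3.0)))
val coordMat = new CoordinateMatrix(entries)

// A BlockMatrix is backed by an RDD of ((rowBlockIndex, colBlockIndex), sub-matrix) blocks.
val blockMat = coordMat.toBlockMatrix().cache()
blockMat.validate()  // checks that the block structure is set up correctly
{% endhighlight %}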
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 9af48357b3..11f5de1fc9 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -136,7 +136,7 @@ When tuning these parameters, be careful to validate on held-out test data to av
* **`maxDepth`**: Maximum depth of a tree. Deeper trees are more expressive (potentially allowing higher accuracy), but they are also more costly to train and are more likely to overfit.
-* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) since those are often trained deeper than individual trees.
+* **`minInstancesPerNode`**: For a node to be split further, each of its children must receive at least this number of training instances. This is commonly used with [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) since those are often trained deeper than individual trees.
* **`minInfoGain`**: For a node to be split further, the split must improve at least this much (in terms of information gain).
@@ -152,13 +152,13 @@ These parameters may be tuned. Be careful to validate on held-out test data whe
* The default value is conservatively chosen to be 256 MB to allow the decision algorithm to work in most scenarios. Increasing `maxMemoryInMB` can lead to faster training (if the memory is available) by allowing fewer passes over the data. However, there may be decreasing returns as `maxMemoryInMB` grows since the amount of communication on each iteration can be proportional to `maxMemoryInMB`.
* *Implementation details*: For faster processing, the decision tree algorithm collects statistics about groups of nodes to split (rather than one node at a time). The number of nodes which can be handled in one group is determined by the memory requirements (which vary per feature). The `maxMemoryInMB` parameter specifies the memory limit in terms of megabytes which each worker can use for these statistics.
-* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
+* **`subsamplingRate`**: Fraction of the training data used for learning the decision tree. This parameter is most relevant for training ensembles of trees (using [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees)), where it can be useful to subsample the original data. For training a single decision tree, this parameter is less useful since the number of training instances is generally not the main constraint.
* **`impurity`**: Impurity measure (discussed above) used to choose between candidate splits. This measure must match the `algo` parameter.
### Caching and checkpointing
-MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) when `numTrees` is set to be large.
+MLlib 1.2 adds several features for scaling up to larger (deeper) trees and tree ensembles. When `maxDepth` is set to be large, it can be useful to turn on node ID caching and checkpointing. These parameters are also useful for [RandomForest](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) when `numTrees` is set to be large.
* **`useNodeIdCache`**: If this is set to true, the algorithm will avoid passing the current model (tree or trees) to executors on each iteration.
* This can be useful with deep trees (speeding up computation on workers) and for large Random Forests (reducing communication on each iteration).
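The tuning parameters discussed above (`maxDepth`, `minInstancesPerNode`, `maxMemoryInMB`, `useNodeIdCache`) can be set on a `Strategy` when training a single tree. A minimal sketch (not part of this commit), assuming an `RDD[LabeledPoint]` named `trainingData`:

{% highlight scala %}
import org.apache.spark.mllib.tree.DecisionTree
import org.apache.spark.mllib.tree.configuration.Strategy

val strategy = Strategy.defaultStrategy("Classification")
strategy.maxDepth = 10
strategy.minInstancesPerNode = 5  // each child of a split must receive at least 5 instances
strategy.maxMemoryInMB = 512      // allow larger groups of nodes per pass, if memory permits
strategy.useNodeIdCache = true    // avoid shipping the growing model to executors each iteration

val model = new DecisionTree(strategy).run(trainingData)
{% endhighlight %}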
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index 2416b6fa0a..5543262a89 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -9,7 +9,7 @@ displayTitle: Ensembles - spark.mllib
An [ensemble method](http://en.wikipedia.org/wiki/Ensemble_learning)
is a learning algorithm which creates a model composed of a set of other base models.
-`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest).
+`spark.mllib` supports two major ensemble algorithms: [`GradientBoostedTrees`](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`RandomForest`](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$).
Both use [decision trees](mllib-decision-tree.html) as their base models.
## Gradient-Boosted Trees vs. Random Forests
@@ -96,7 +96,7 @@ The test error is calculated to measure the algorithm accuracy.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RandomForestClassificationExample.scala %}
</div>
@@ -127,7 +127,7 @@ The Mean Squared Error (MSE) is computed at the end to evaluate
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
+Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest$) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RandomForestRegressionExample.scala %}
</div>
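The examples referenced above use `RandomForest`'s static training methods. A minimal sketch (not part of this commit) of the classification variant, assuming an `RDD[LabeledPoint]` named `trainingData`:

{% highlight scala %}
import org.apache.spark.mllib.tree.RandomForest

val numClasses = 2
val categoricalFeaturesInfo = Map[Int, Int]()  // empty map: all features are continuous
val numTrees = 10
val featureSubsetStrategy = "auto"             // let the algorithm choose per-node feature subsets
val impurity = "gini"
val maxDepth = 4
val maxBins = 32
val seed = 12345

val model = RandomForest.trainClassifier(trainingData, numClasses, categoricalFeaturesInfo,
  numTrees, featureSubsetStrategy, impurity, maxDepth, maxBins, seed)
{% endhighlight %}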
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 4c027c84ec..67c033e9e4 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -333,7 +333,7 @@ Details you can read at [dimensionality reduction](mllib-dimensionality-reductio
The following code demonstrates how to compute principal components on a `Vector`
and use them to project the vectors into a low-dimensional space while keeping associated labels
-for calculation a [Linear Regression]((mllib-linear-methods.html))
+for calculating a [Linear Regression](mllib-linear-methods.html)
<div class="codetabs">
<div data-lang="scala" markdown="1">
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index 63665c49bc..17d781ac23 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -185,10 +185,10 @@ algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.
- setNumIterations(200).
- setRegParam(0.1).
- setUpdater(new L1Updater)
+svmAlg.optimizer
+ .setNumIterations(200)
+ .setRegParam(0.1)
+ .setUpdater(new L1Updater)
val modelL1 = svmAlg.run(training)
{% endhighlight %}
@@ -395,7 +395,7 @@ section of the Spark
quick-start guide. Be sure to also include *spark-mllib* in your build file as
a dependency.
-###Streaming linear regression
+### Streaming linear regression
When data arrive in a streaming fashion, it is useful to fit regression models online,
updating the parameters of the model as new data arrives. `spark.mllib` currently supports
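Streaming linear regression, introduced by the corrected header above, fits the model continuously as new batches arrive. A minimal sketch (not part of this commit), assuming `DStream[LabeledPoint]`s named `trainingStream` and `testStream`:

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.StreamingLinearRegressionWithSGD

val numFeatures = 3
val model = new StreamingLinearRegressionWithSGD()
  .setInitialWeights(Vectors.zeros(numFeatures))

// Update the model on each training batch, and emit (label, prediction) pairs on the test stream.
model.trainOn(trainingStream)
model.predictOnValues(testStream.map(lp => (lp.label, lp.features))).print()
{% endhighlight %}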
diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md
index 58ed5a0e9d..7f2347dc0b 100644
--- a/docs/mllib-pmml-model-export.md
+++ b/docs/mllib-pmml-model-export.md
@@ -47,7 +47,7 @@ To export a supported `model` (see table above) to PMML, simply call `model.toPM
As well as exporting the PMML model to a String (`model.toPMML` as in the example above), you can export the PMML model to other formats.
-Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API.
+Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) for details on the API.
Here is a complete example of building a KMeansModel and printing it out in PMML format:
{% include_example scala/org/apache/spark/examples/mllib/PMMLModelExportExample.scala %}
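The corrected link above points at the `Vectors` factory object used in the PMML export example. A minimal sketch (not part of this commit) of exporting a KMeans model to PMML, assuming a SparkContext `sc`:

{% highlight scala %}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.Vectors

val data = sc.parallelize(Seq(
  Vectors.dense(1.0, 1.0), Vectors.dense(9.0, 8.0), Vectors.dense(8.0, 9.0)))
val model = KMeans.train(data, 2, 20)  // k = 2 clusters, 20 iterations

// Export to a PMML String, or write it directly to a local path.
println(model.toPMML)
model.toPMML("/tmp/kmeans.xml")
{% endhighlight %}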
diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md
index 02b81f153b..329855e565 100644
--- a/docs/mllib-statistics.md
+++ b/docs/mllib-statistics.md
@@ -80,7 +80,7 @@ correlation methods are currently Pearson's and Spearman's correlation.
calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or
an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively.
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/CorrelationsExample.scala %}
</div>
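`Statistics.corr`, referenced above, computes either a scalar correlation between two series or a correlation matrix over an `RDD[Vector]`. A minimal sketch (not part of this commit) of the two-series form, assuming a SparkContext `sc`:

{% highlight scala %}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD

val seriesX: RDD[Double] = sc.parallelize(Array(1.0, 2.0, 3.0, 4.0))
val seriesY: RDD[Double] = sc.parallelize(Array(11.0, 22.0, 33.0, 44.0))

// Pearson correlation by default; pass "spearman" for Spearman's rank correlation.
val correlation: Double = Statistics.corr(seriesX, seriesY, "pearson")
{% endhighlight %}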
@@ -210,7 +210,7 @@ message.
run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run
and interpret the hypothesis tests.
-Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API.
+Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics$) for details on the API.
{% include_example scala/org/apache/spark/examples/mllib/HypothesisTestingKolmogorovSmirnovTestExample.scala %}
</div>
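A minimal sketch (not part of this commit) of the 1-sample, 2-sided Kolmogorov-Smirnov test described above, assuming a SparkContext `sc`:

{% highlight scala %}
import org.apache.spark.mllib.stat.Statistics

val data = sc.parallelize(Array(0.1, 0.15, 0.2, 0.3, 0.25))

// Test the sample against a normal distribution with mean 0.0 and standard deviation 1.0.
// The result contains the test statistic, the p-value, and the null-hypothesis summary.
val testResult = Statistics.kolmogorovSmirnovTest(data, "norm", 0.0, 1.0)
println(testResult)
{% endhighlight %}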
@@ -277,12 +277,12 @@ uniform, standard normal, or Poisson.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) provides factory
+[`RandomRDDs`](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) provides factory
methods to generate random double RDDs or vector RDDs.
The following example generates a random double RDD, whose values follow the standard normal
distribution `N(0, 1)`, and then maps it to `N(1, 4)`.
-Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) for details on the API.
+Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs$) for details on the API.
{% highlight scala %}
import org.apache.spark.SparkContext
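import org.apache.spark.mllib.random.RandomRDDs._

// A hedged continuation of the truncated example above (not part of this diff),
// assuming a SparkContext `sc`: generate one million i.i.d. values drawn from N(0, 1)
// in 10 partitions, then shift and scale them to follow N(1, 4).
val u = normalRDD(sc, 1000000L, 10)
val v = u.map(x => 1.0 + 2.0 * x)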