 docs/mllib-clustering.md               | 30
 docs/mllib-collaborative-filtering.md  |  6
 docs/mllib-data-types.md               | 47
 docs/mllib-decision-tree.md            | 22
 docs/mllib-dimensionality-reduction.md | 10
 docs/mllib-ensembles.md                | 44
 docs/mllib-evaluation-metrics.md       | 15
 docs/mllib-feature-extraction.md       | 47
 docs/mllib-frequent-pattern-mining.md  | 13
 docs/mllib-isotonic-regression.md      |  6
 docs/mllib-linear-methods.md           | 18
 docs/mllib-naive-bayes.md              |  6
 docs/mllib-optimization.md             |  4
 docs/mllib-pmml-model-export.md        |  2
 docs/mllib-statistics.md               | 34
15 files changed, 274 insertions, 30 deletions
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index c2711cf82d..8fbced6c87 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -4,10 +4,10 @@ title: Clustering - MLlib displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering --- -Clustering is an unsupervised learning problem whereby we aim to group subsets +[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is an unsupervised learning problem whereby we aim to group subsets of entities with one another based on some notion of similarity. Clustering is often used for exploratory analysis and/or as a component of a hierarchical -supervised learning pipeline (in which distinct classifiers or regression +[supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) pipeline (in which distinct classifiers or regression models are trained for each cluster). MLlib supports the following models: @@ -47,6 +47,8 @@ into two clusters. The number of desired clusters is passed to the algorithm. We Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the optimal *k* is usually one where there is an "elbow" in the WSSSE graph. +Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`KMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.clustering.{KMeans, KMeansModel} import org.apache.spark.mllib.linalg.Vectors @@ -77,6 +79,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a calling `.rdd()` on your `JavaRDD` object. A self-contained application example that is equivalent to the provided example in Scala is given below: +Refer to the [`KMeans` Java docs](api/java/org/apache/spark/mllib/clustering/KMeans.html) and [`KMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/KMeansModel.html) for details on the API. 
+ {% highlight java %} import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -132,6 +136,8 @@ data into two clusters. The number of desired clusters is passed to the algorith Within Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the optimal *k* is usually one where there is an "elbow" in the WSSSE graph. +Refer to the [`KMeans` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeans) and [`KMeansModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeansModel) for more details on the API. + {% highlight python %} from pyspark.mllib.clustering import KMeans, KMeansModel from numpy import array @@ -184,6 +190,8 @@ In the following example after loading and parsing data, we use a object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then output the parameters of the mixture model. +Refer to the [`GaussianMixture` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture) and [`GaussianMixtureModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixtureModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.clustering.GaussianMixture import org.apache.spark.mllib.clustering.GaussianMixtureModel @@ -216,6 +224,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a calling `.rdd()` on your `JavaRDD` object. A self-contained application example that is equivalent to the provided example in Scala is given below: +Refer to the [`GaussianMixture` Java docs](api/java/org/apache/spark/mllib/clustering/GaussianMixture.html) and [`GaussianMixtureModel` Java docs](api/java/org/apache/spark/mllib/clustering/GaussianMixtureModel.html) for details on the API. 
+ {% highlight java %} import org.apache.spark.api.java.*; import org.apache.spark.api.java.function.Function; @@ -268,6 +278,8 @@ In the following example after loading and parsing data, we use a object to cluster the data into two clusters. The number of desired clusters is passed to the algorithm. We then output the parameters of the mixture model. +Refer to the [`GaussianMixture` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture) and [`GaussianMixtureModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixtureModel) for more details on the API. + {% highlight python %} from pyspark.mllib.clustering import GaussianMixture from numpy import array @@ -324,6 +336,8 @@ Calling `PowerIterationClustering.run` returns a [`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel), which contains the computed clustering assignments. +Refer to the [`PowerIterationClustering` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering) and [`PowerIterationClusteringModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.clustering.{PowerIterationClustering, PowerIterationClusteringModel} import org.apache.spark.mllib.linalg.Vectors @@ -365,6 +379,8 @@ Calling `PowerIterationClustering.run` returns a [`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html) which contains the computed clustering assignments. +Refer to the [`PowerIterationClustering` Java docs](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html) and [`PowerIterationClusteringModel` Java docs](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html) for details on the API. 
+ {% highlight java %} import scala.Tuple2; import scala.Tuple3; @@ -411,6 +427,8 @@ Calling `PowerIterationClustering.run` returns a [`PowerIterationClusteringModel`](api/python/pyspark.mllib.html#pyspark.mllib.clustering.PowerIterationClustering), which contains the computed clustering assignments. +Refer to the [`PowerIterationClustering` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.PowerIterationClustering) and [`PowerIterationClusteringModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.PowerIterationClusteringModel) for more details on the API. + {% highlight python %} from __future__ import print_function from pyspark.mllib.clustering import PowerIterationClustering, PowerIterationClusteringModel @@ -571,6 +589,7 @@ to the algorithm. We then output the topics, represented as probability distribu <div class="codetabs"> <div data-lang="scala" markdown="1"> +Refer to the [`LDA` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) and [`DistributedLDAModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel) for details on the API. {% highlight scala %} import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel} @@ -602,6 +621,8 @@ val sameModel = DistributedLDAModel.load(sc, "myLDAModel") </div> <div data-lang="java" markdown="1"> +Refer to the [`LDA` Java docs](api/java/org/apache/spark/mllib/clustering/LDA.html) and [`DistributedLDAModel` Java docs](api/java/org/apache/spark/mllib/clustering/DistributedLDAModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; @@ -666,6 +687,8 @@ public class JavaLDAExample { </div> <div data-lang="python" markdown="1"> +Refer to the [`LDA` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA) and [`LDAModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDAModel) for more details on the API. 
+ {% highlight python %} from pyspark.mllib.clustering import LDA, LDAModel from pyspark.mllib.linalg import Vectors @@ -730,6 +753,7 @@ This example shows how to estimate clusters on streaming data. <div class="codetabs"> <div data-lang="scala" markdown="1"> +Refer to the [`StreamingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.StreamingKMeans) for details on the API. First we import the neccessary classes. @@ -780,6 +804,8 @@ ssc.awaitTermination() </div> <div data-lang="python" markdown="1"> +Refer to the [`StreamingKMeans` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.StreamingKMeans) for more details on the API. + First we import the neccessary classes. {% highlight python %} diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md index eedc23424a..b3fd51dca5 100644 --- a/docs/mllib-collaborative-filtering.md +++ b/docs/mllib-collaborative-filtering.md @@ -64,6 +64,8 @@ We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.rec method which assumes ratings are explicit. We evaluate the recommendation model by measuring the Mean Squared Error of rating prediction. +Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.recommendation.ALS import org.apache.spark.mllib.recommendation.MatrixFactorizationModel @@ -119,6 +121,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a calling `.rdd()` on your `JavaRDD` object. A self-contained application example that is equivalent to the provided example in Scala is given bellow: +Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for details on the API. + {% highlight java %} import scala.Tuple2; @@ -201,6 +205,8 @@ In the following example we load rating data. 
Each row consists of a user, a pro We use the default ALS.train() method which assumes ratings are explicit. We evaluate the recommendation by measuring the Mean Squared Error of rating prediction. +Refer to the [`ALS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.recommendation.ALS) for more details on the API. + {% highlight python %} from pyspark.mllib.recommendation import ALS, MatrixFactorizationModel, Rating diff --git a/docs/mllib-data-types.md b/docs/mllib-data-types.md index d8c7bdc63c..3c0c047967 100644 --- a/docs/mllib-data-types.md +++ b/docs/mllib-data-types.md @@ -33,6 +33,8 @@ implementations: [`DenseVector`](api/scala/index.html#org.apache.spark.mllib.lin using the factory methods implemented in [`Vectors`](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) to create local vectors. +Refer to the [`Vector` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.{Vector, Vectors} @@ -59,6 +61,8 @@ implementations: [`DenseVector`](api/java/org/apache/spark/mllib/linalg/DenseVec using the factory methods implemented in [`Vectors`](api/java/org/apache/spark/mllib/linalg/Vectors.html) to create local vectors. +Refer to the [`Vector` Java docs](api/java/org/apache/spark/mllib/linalg/Vector.html) and [`Vectors` Java docs](api/java/org/apache/spark/mllib/linalg/Vectors.html) for details on the API. + {% highlight java %} import org.apache.spark.mllib.linalg.Vector; import org.apache.spark.mllib.linalg.Vectors; @@ -86,6 +90,8 @@ and the following as sparse vectors: We recommend using NumPy arrays over lists for efficiency, and using the factory methods implemented in [`Vectors`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) to create sparse vectors. 
+Refer to the [`Vectors` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Vectors) for more details on the API. + {% highlight python %} import numpy as np import scipy.sparse as sps @@ -119,6 +125,8 @@ For multiclass classification, labels should be class indices starting from zero A labeled point is represented by the case class [`LabeledPoint`](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint). +Refer to the [`LabeledPoint` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.Vectors import org.apache.spark.mllib.regression.LabeledPoint @@ -136,6 +144,8 @@ val neg = LabeledPoint(0.0, Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))) A labeled point is represented by [`LabeledPoint`](api/java/org/apache/spark/mllib/regression/LabeledPoint.html). +Refer to the [`LabeledPoint` Java docs](api/java/org/apache/spark/mllib/regression/LabeledPoint.html) for details on the API. + {% highlight java %} import org.apache.spark.mllib.linalg.Vectors; import org.apache.spark.mllib.regression.LabeledPoint; @@ -153,6 +163,8 @@ LabeledPoint neg = new LabeledPoint(0.0, Vectors.sparse(3, new int[] {0, 2}, new A labeled point is represented by [`LabeledPoint`](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint). +Refer to the [`LabeledPoint` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LabeledPoint) for more details on the API. + {% highlight python %} from pyspark.mllib.linalg import SparseVector from pyspark.mllib.regression import LabeledPoint @@ -187,6 +199,8 @@ After loading, the feature indices are converted to zero-based. [`MLUtils.loadLibSVMFile`](api/scala/index.html#org.apache.spark.mllib.util.MLUtils$) reads training examples stored in LIBSVM format. +Refer to the [`MLUtils` Scala docs](api/scala/index.html#org.apache.spark.mllib.util.MLUtils) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.util.MLUtils @@ -200,6 +214,8 @@ val examples: RDD[LabeledPoint] = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_ [`MLUtils.loadLibSVMFile`](api/java/org/apache/spark/mllib/util/MLUtils.html) reads training examples stored in LIBSVM format. +Refer to the [`MLUtils` Java docs](api/java/org/apache/spark/mllib/util/MLUtils.html) for details on the API. + {% highlight java %} import org.apache.spark.mllib.regression.LabeledPoint; import org.apache.spark.mllib.util.MLUtils; @@ -214,6 +230,8 @@ JavaRDD<LabeledPoint> examples = [`MLUtils.loadLibSVMFile`](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) reads training examples stored in LIBSVM format. +Refer to the [`MLUtils` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.util.MLUtils) for more details on the API. + {% highlight python %} from pyspark.mllib.util import MLUtils @@ -246,6 +264,8 @@ We recommend using the factory methods implemented in [`Matrices`](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices$) to create local matrices. Remember, local matrices in MLlib are stored in column-major order. +Refer to the [`Matrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrix) and [`Matrices` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Matrices) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.{Matrix, Matrices} @@ -267,6 +287,8 @@ We recommend using the factory methods implemented in [`Matrices`](api/java/org/apache/spark/mllib/linalg/Matrices.html) to create local matrices. Remember, local matrices in MLlib are stored in column-major order. +Refer to the [`Matrix` Java docs](api/java/org/apache/spark/mllib/linalg/Matrix.html) and [`Matrices` Java docs](api/java/org/apache/spark/mllib/linalg/Matrices.html) for details on the API. 
+ {% highlight java %} import org.apache.spark.mllib.linalg.Matrix; import org.apache.spark.mllib.linalg.Matrices; @@ -289,6 +311,8 @@ We recommend using the factory methods implemented in [`Matrices`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices) to create local matrices. Remember, local matrices in MLlib are stored in column-major order. +Refer to the [`Matrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrix) and [`Matrices` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.Matrices) for more details on the API. + {% highlight python %} import org.apache.spark.mllib.linalg.{Matrix, Matrices} @@ -341,6 +365,7 @@ created from an `RDD[Vector]` instance. Then we can compute its column summary [QR decomposition](https://en.wikipedia.org/wiki/QR_decomposition) is of the form A = QR where Q is an orthogonal matrix and R is an upper triangular matrix. For [singular value decomposition (SVD)](https://en.wikipedia.org/wiki/Singular_value_decomposition) and [principal component analysis (PCA)](https://en.wikipedia.org/wiki/Principal_component_analysis), please refer to [Dimensionality reduction](mllib-dimensionality-reduction.html). +Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API. {% highlight scala %} import org.apache.spark.mllib.linalg.Vector @@ -364,6 +389,8 @@ val qrResult = mat.tallSkinnyQR(true) A [`RowMatrix`](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) can be created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics. +Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API. 
+ {% highlight java %} import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.Vector; @@ -387,6 +414,8 @@ QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true); A [`RowMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) can be created from an `RDD` of vectors. +Refer to the [`RowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.RowMatrix) for more details on the API. + {% highlight python %} from pyspark.mllib.linalg.distributed import RowMatrix @@ -423,6 +452,8 @@ can be created from an `RDD[IndexedRow]` instance, where wrapper over `(Long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping its row indices. +Refer to the [`IndexedRowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix, RowMatrix} @@ -448,6 +479,8 @@ can be created from an `JavaRDD<IndexedRow>` instance, where wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping its row indices. +Refer to the [`IndexedRowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/IndexedRowMatrix.html) for details on the API. + {% highlight java %} import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.distributed.IndexedRow; @@ -475,6 +508,8 @@ can be created from an `RDD` of `IndexedRow`s, where wrapper over `(long, vector)`. An `IndexedRowMatrix` can be converted to a `RowMatrix` by dropping its row indices. +Refer to the [`IndexedRowMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.IndexedRowMatrix) for more details on the API. + {% highlight python %} from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix @@ -529,6 +564,8 @@ wrapper over `(Long, Long, Double)`. 
A `CoordinateMatrix` can be converted to a with sparse rows by calling `toIndexedRowMatrix`. Other computations for `CoordinateMatrix` are not currently supported. +Refer to the [`CoordinateMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.CoordinateMatrix) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry} @@ -555,6 +592,8 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a with sparse rows by calling `toIndexedRowMatrix`. Other computations for `CoordinateMatrix` are not currently supported. +Refer to the [`CoordinateMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/CoordinateMatrix.html) for details on the API. + {% highlight java %} import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix; @@ -582,6 +621,8 @@ can be created from an `RDD` of `MatrixEntry` entries, where wrapper over `(long, long, float)`. A `CoordinateMatrix` can be converted to a `RowMatrix` by calling `toRowMatrix`, or to an `IndexedRowMatrix` with sparse rows by calling `toIndexedRowMatrix`. +Refer to the [`CoordinateMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.CoordinateMatrix) for more details on the API. + {% highlight python %} from pyspark.mllib.linalg.distributed import CoordinateMatrix, MatrixEntry @@ -631,6 +672,8 @@ most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix` creates blocks of size 1024 x 1024 by default. Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`. +Refer to the [`BlockMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.BlockMatrix) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.linalg.distributed.{BlockMatrix, CoordinateMatrix, MatrixEntry} @@ -656,6 +699,8 @@ most easily created from an `IndexedRowMatrix` or `CoordinateMatrix` by calling `toBlockMatrix` creates blocks of size 1024 x 1024 by default. Users may change the block size by supplying the values through `toBlockMatrix(rowsPerBlock, colsPerBlock)`. +Refer to the [`BlockMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/BlockMatrix.html) for details on the API. + {% highlight java %} import org.apache.spark.api.java.JavaRDD; import org.apache.spark.mllib.linalg.distributed.BlockMatrix; @@ -683,6 +728,8 @@ A [`BlockMatrix`](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed can be created from an `RDD` of sub-matrix blocks, where a sub-matrix block is a `((blockRowIndex, blockColIndex), sub-matrix)` tuple. +Refer to the [`BlockMatrix` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.linalg.distributed.BlockMatrix) for more details on the API. + {% highlight python %} from pyspark.mllib.linalg import Matrices from pyspark.mllib.linalg.distributed import BlockMatrix diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md index c1d0f8a6b1..f31c4f8893 100644 --- a/docs/mllib-decision-tree.md +++ b/docs/mllib-decision-tree.md @@ -191,7 +191,9 @@ maximum tree depth of 5. The test error is calculated to measure the algorithm a <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`DecisionTree` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel @@ -229,7 +231,9 @@ val sameModel = DecisionTreeModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`DecisionTree` Java docs](api/java/org/apache/spark/mllib/tree/DecisionTree.html) and [`DecisionTreeModel` Java docs](api/java/org/apache/spark/mllib/tree/model/DecisionTreeModel.html) for details on the API. + {% highlight java %} import java.util.HashMap; import scala.Tuple2; @@ -291,7 +295,8 @@ DecisionTreeModel sameModel = DecisionTreeModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`DecisionTree` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel) for more details on the API. {% highlight python %} from pyspark.mllib.regression import LabeledPoint @@ -335,7 +340,9 @@ depth of 5. The Mean Squared Error (MSE) is computed at the end to evaluate <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`DecisionTree` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.DecisionTreeModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.tree.DecisionTree import org.apache.spark.mllib.tree.model.DecisionTreeModel @@ -372,7 +379,9 @@ val sameModel = DecisionTreeModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`DecisionTree` Java docs](api/java/org/apache/spark/mllib/tree/DecisionTree.html) and [`DecisionTreeModel` Java docs](api/java/org/apache/spark/mllib/tree/model/DecisionTreeModel.html) for details on the API. + {% highlight java %} import java.util.HashMap; import scala.Tuple2; @@ -440,7 +449,8 @@ DecisionTreeModel sameModel = DecisionTreeModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`DecisionTree` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTree) and [`DecisionTreeModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.DecisionTreeModel) for more details on the API. {% highlight python %} from pyspark.mllib.regression import LabeledPoint diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md index 05f51168d8..ac3526908a 100644 --- a/docs/mllib-dimensionality-reduction.md +++ b/docs/mllib-dimensionality-reduction.md @@ -62,6 +62,8 @@ MLlib provides SVD functionality to row-oriented matrices, provided in the <div class="codetabs"> <div data-lang="scala" markdown="1"> +Refer to the [`SingularValueDecomposition` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.SingularValueDecomposition) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.Matrix import org.apache.spark.mllib.linalg.distributed.RowMatrix @@ -80,6 +82,8 @@ The same code applies to `IndexedRowMatrix` if `U` is defined as an `IndexedRowMatrix`. 
</div> <div data-lang="java" markdown="1"> +Refer to the [`SingularValueDecomposition` Java docs](api/java/org/apache/spark/mllib/linalg/SingularValueDecomposition.html) for details on the API. + {% highlight java %} import java.util.LinkedList; @@ -145,6 +149,8 @@ MLlib supports PCA for tall-and-skinny matrices stored in row-oriented format an The following code demonstrates how to compute principal components on a `RowMatrix` and use them to project the vectors into a low-dimensional space. +Refer to the [`RowMatrix` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.distributed.RowMatrix) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.linalg.Matrix import org.apache.spark.mllib.linalg.distributed.RowMatrix @@ -161,6 +167,8 @@ val projected: RowMatrix = mat.multiply(pc) The following code demonstrates how to compute principal components on source vectors and use them to project the vectors into a low-dimensional space while keeping associated labels: +Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.PCA) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.feature.PCA @@ -182,6 +190,8 @@ The following code demonstrates how to compute principal components on a `RowMat and use them to project the vectors into a low-dimensional space. The number of columns should be small, e.g, less than 1000. +Refer to the [`RowMatrix` Java docs](api/java/org/apache/spark/mllib/linalg/distributed/RowMatrix.html) for details on the API. + {% highlight java %} import java.util.LinkedList; diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md index 1e00b2083e..fc587298f7 100644 --- a/docs/mllib-ensembles.md +++ b/docs/mllib-ensembles.md @@ -95,7 +95,9 @@ The test error is calculated to measure the algorithm accuracy. 
<div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.tree.RandomForest import org.apache.spark.mllib.tree.model.RandomForestModel @@ -135,7 +137,9 @@ val sameModel = RandomForestModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`RandomForest` Java docs](api/java/org/apache/spark/mllib/tree/RandomForest.html) and [`RandomForestModel` Java docs](api/java/org/apache/spark/mllib/tree/model/RandomForestModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; import java.util.HashMap; @@ -200,7 +204,8 @@ RandomForestModel sameModel = RandomForestModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`RandomForest` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest) and [`RandomForest` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel) for more details on the API. {% highlight python %} from pyspark.mllib.tree import RandomForest, RandomForestModel @@ -246,7 +251,9 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`RandomForest` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.RandomForest) and [`RandomForestModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.RandomForestModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.tree.RandomForest import org.apache.spark.mllib.tree.model.RandomForestModel @@ -286,7 +293,9 @@ val sameModel = RandomForestModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`RandomForest` Java docs](api/java/org/apache/spark/mllib/tree/RandomForest.html) and [`RandomForestModel` Java docs](api/java/org/apache/spark/mllib/tree/model/RandomForestModel.html) for details on the API. + {% highlight java %} import java.util.HashMap; import scala.Tuple2; @@ -354,7 +363,8 @@ RandomForestModel sameModel = RandomForestModel.load(sc.sc(), "myModelPath"); {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`RandomForest` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForest) and [`RandomForest` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.RandomForestModel) for more details on the API. {% highlight python %} from pyspark.mllib.tree import RandomForest, RandomForestModel @@ -479,7 +489,9 @@ The test error is calculated to measure the algorithm accuracy. <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`GradientBoostedTrees` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.tree.GradientBoostedTrees import org.apache.spark.mllib.tree.configuration.BoostingStrategy @@ -518,7 +530,9 @@ val sameModel = GradientBoostedTreesModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`GradientBoostedTrees` Java docs](api/java/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`GradientBoostedTreesModel` Java docs](api/java/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; import java.util.HashMap; @@ -583,7 +597,8 @@ GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load(sc.sc(), "m {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`GradientBoostedTrees` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTreesModel) for more details on the API. {% highlight python %} from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel @@ -627,7 +642,9 @@ The Mean Squared Error (MSE) is computed at the end to evaluate <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`GradientBoostedTrees` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.tree.model.GradientBoostedTreesModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.tree.GradientBoostedTrees import org.apache.spark.mllib.tree.configuration.BoostingStrategy @@ -665,7 +682,9 @@ val sameModel = GradientBoostedTreesModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`GradientBoostedTrees` Java docs](api/java/org/apache/spark/mllib/tree/GradientBoostedTrees.html) and [`GradientBoostedTreesModel` Java docs](api/java/org/apache/spark/mllib/tree/model/GradientBoostedTreesModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; import java.util.HashMap; @@ -736,7 +755,8 @@ GradientBoostedTreesModel sameModel = GradientBoostedTreesModel.load(sc.sc(), "m {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`GradientBoostedTrees` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTrees) and [`GradientBoostedTreesModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.tree.GradientBoostedTreesModel) for more details on the API. {% highlight python %} from pyspark.mllib.tree import GradientBoostedTrees, GradientBoostedTreesModel diff --git a/docs/mllib-evaluation-metrics.md b/docs/mllib-evaluation-metrics.md index 7066d5c974..2270f7a34b 100644 --- a/docs/mllib-evaluation-metrics.md +++ b/docs/mllib-evaluation-metrics.md @@ -102,6 +102,7 @@ The following code snippets illustrate how to load a sample dataset, train a bin data, and evaluate the performance of the algorithm by several binary evaluation metrics. <div data-lang="scala" markdown="1"> +Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) and [`BinaryClassificationMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.BinaryClassificationMetrics) for details on the API. 
{% highlight scala %} import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS @@ -179,6 +180,7 @@ println("Area under ROC = " + auROC) </div> <div data-lang="java" markdown="1"> +Refer to the [`LogisticRegressionModel` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionModel.html) and [`LogisticRegressionWithLBFGS` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) for details on the API. {% highlight java %} import scala.Tuple2; @@ -276,6 +278,7 @@ public class BinaryClassification { </div> <div data-lang="python" markdown="1"> +Refer to the [`BinaryClassificationMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.BinaryClassificationMetrics) and [`LogisticRegressionWithLBFGS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS) for more details on the API. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS @@ -428,6 +431,7 @@ The following code snippets illustrate how to load a sample dataset, train a mul the data, and evaluate the performance of the algorithm by several multiclass classification evaluation metrics. <div data-lang="scala" markdown="1"> +Refer to the [`MulticlassMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.MulticlassMetrics) for details on the API. {% highlight scala %} import org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS @@ -501,6 +505,7 @@ println(s"Weighted false positive rate: ${metrics.weightedFalsePositiveRate}") </div> <div data-lang="java" markdown="1"> +Refer to the [`MulticlassMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/MulticlassMetrics.html) for details on the API. 
{% highlight java %} import scala.Tuple2; @@ -580,6 +585,7 @@ public class MulticlassClassification { </div> <div data-lang="python" markdown="1"> +Refer to the [`MulticlassMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MulticlassMetrics) for more details on the API. {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS @@ -758,6 +764,7 @@ True classes: <div class="codetabs"> <div data-lang="scala" markdown="1"> +Refer to the [`MultilabelMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.MultilabelMetrics) for details on the API. {% highlight scala %} import org.apache.spark.mllib.evaluation.MultilabelMetrics @@ -802,6 +809,7 @@ println(s"Subset accuracy = ${metrics.subsetAccuracy}") </div> <div data-lang="java" markdown="1"> +Refer to the [`MultilabelMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/MultilabelMetrics.html) for details on the API. {% highlight java %} import scala.Tuple2; @@ -864,6 +872,7 @@ public class MultilabelClassification { </div> <div data-lang="python" markdown="1"> +Refer to the [`MultilabelMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.MultilabelMetrics) for more details on the API. {% highlight python %} from pyspark.mllib.evaluation import MultilabelMetrics @@ -1016,6 +1025,7 @@ expanded world of non-positive weights are "the same as never having interacted <div class="codetabs"> <div data-lang="scala" markdown="1"> +Refer to the [`RegressionMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RegressionMetrics) and [`RankingMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RankingMetrics) for details on the API. 
{% highlight scala %} import org.apache.spark.mllib.evaluation.{RegressionMetrics, RankingMetrics} @@ -1095,6 +1105,7 @@ println(s"R-squared = ${regressionMetrics.r2}") </div> <div data-lang="java" markdown="1"> +Refer to the [`RegressionMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RegressionMetrics.html) and [`RankingMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RankingMetrics.html) for details on the API. {% highlight java %} import scala.Tuple2; @@ -1256,6 +1267,7 @@ public class Ranking { </div> <div data-lang="python" markdown="1"> +Refer to the [`RegressionMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RegressionMetrics) and [`RankingMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RankingMetrics) for more details on the API. {% highlight python %} from pyspark.mllib.recommendation import ALS, Rating @@ -1336,6 +1348,7 @@ The following code snippets illustrate how to load a sample dataset, train a lin and evaluate the performance of the algorithm by several regression metrics. <div data-lang="scala" markdown="1"> +Refer to the [`RegressionMetrics` Scala docs](api/scala/index.html#org.apache.spark.mllib.evaluation.RegressionMetrics) for details on the API. {% highlight scala %} import org.apache.spark.mllib.regression.LabeledPoint @@ -1379,6 +1392,7 @@ println(s"Explained variance = ${metrics.explainedVariance}") </div> <div data-lang="java" markdown="1"> +Refer to the [`RegressionMetrics` Java docs](api/java/org/apache/spark/mllib/evaluation/RegressionMetrics.html) for details on the API. {% highlight java %} import scala.Tuple2; @@ -1455,6 +1469,7 @@ public class LinearRegression { </div> <div data-lang="python" markdown="1"> +Refer to the [`RegressionMetrics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.evaluation.RegressionMetrics) for more details on the API. 
{% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md index 7e417ed5f3..5bee170c61 100644 --- a/docs/mllib-feature-extraction.md +++ b/docs/mllib-feature-extraction.md @@ -56,6 +56,9 @@ and [IDF](api/scala/index.html#org.apache.spark.mllib.feature.IDF). `HashingTF` takes an `RDD[Iterable[_]]` as the input. Each record could be an iterable of strings or other types. +Refer to the [`HashingTF` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.HashingTF) for details on the API. + + {% highlight scala %} import org.apache.spark.rdd.RDD import org.apache.spark.SparkContext @@ -103,6 +106,9 @@ and [IDF](api/python/pyspark.mllib.html#pyspark.mllib.feature.IDF). `HashingTF` takes an RDD of list as the input. Each record could be an iterable of strings or other types. + +Refer to the [`HashingTF` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.HashingTF) for details on the API. + {% highlight python %} from pyspark import SparkContext from pyspark.mllib.feature import HashingTF @@ -183,7 +189,9 @@ the [text8](http://mattmahoney.net/dc/text8.zip) data and extract it to your pre Here we assume the extracted file is `text8` and in same directory as you run the spark shell. <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`Word2Vec` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.Word2Vec) for details on the API. + {% highlight scala %} import org.apache.spark._ import org.apache.spark.rdd._ @@ -207,7 +215,9 @@ model.save(sc, "myModelPath") val sameModel = Word2VecModel.load(sc, "myModelPath") {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`Word2Vec` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.Word2Vec) for more details on the API. 
+ {% highlight python %} from pyspark import SparkContext from pyspark.mllib.feature import Word2Vec @@ -264,7 +274,9 @@ The example below demonstrates how to load a dataset in libsvm format, and stand so that the new features have unit standard deviation and/or zero mean. <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`StandardScaler` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) for details on the API. + {% highlight scala %} import org.apache.spark.SparkContext._ import org.apache.spark.mllib.feature.StandardScaler @@ -288,7 +300,9 @@ val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.t {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`StandardScaler` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.StandardScaler) for more details on the API. + {% highlight python %} from pyspark.mllib.util import MLUtils from pyspark.mllib.linalg import Vectors @@ -338,7 +352,9 @@ The example below demonstrates how to load a dataset in libsvm format, and norma with $L^2$ norm, and $L^\infty$ norm. <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`Normalizer` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.Normalizer) for details on the API. + {% highlight scala %} import org.apache.spark.SparkContext._ import org.apache.spark.mllib.feature.Normalizer @@ -358,7 +374,9 @@ val data2 = data.map(x => (x.label, normalizer2.transform(x.features))) {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`Normalizer` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.Normalizer) for more details on the API. 
+ {% highlight python %} from pyspark.mllib.util import MLUtils from pyspark.mllib.linalg import Vectors @@ -532,7 +550,10 @@ v_N The example below demonstrates how to transform vectors using a transforming vector value. <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> + +Refer to the [`ElementwiseProduct` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.ElementwiseProduct) for details on the API. + {% highlight scala %} import org.apache.spark.SparkContext._ import org.apache.spark.mllib.feature.ElementwiseProduct @@ -551,7 +572,9 @@ val transformedData2 = data.map(x => transformer.transform(x)) {% endhighlight %} </div> -<div data-lang="java"> +<div data-lang="java" markdown="1"> +Refer to the [`ElementwiseProduct` Java docs](api/java/org/apache/spark/mllib/feature/ElementwiseProduct.html) for details on the API. + {% highlight java %} import java.util.Arrays; import org.apache.spark.api.java.JavaRDD; @@ -580,7 +603,9 @@ JavaRDD<Vector> transformedData2 = data.map( {% endhighlight %} </div> -<div data-lang="python"> +<div data-lang="python" markdown="1"> +Refer to the [`ElementwiseProduct` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.feature.ElementwiseProduct) for more details on the API. + {% highlight python %} from pyspark import SparkContext from pyspark.mllib.linalg import Vectors @@ -617,7 +642,9 @@ and use them to project the vectors into a low-dimensional space while keeping a for calculation a [Linear Regression](mllib-linear-methods.html) <div class="codetabs"> -<div data-lang="scala"> +<div data-lang="scala" markdown="1"> +Refer to the [`PCA` Scala docs](api/scala/index.html#org.apache.spark.mllib.feature.PCA) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.regression.LinearRegressionWithSGD import org.apache.spark.mllib.regression.LabeledPoint diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md index 4d4f5cfdc5..f749eb4f2f 100644 --- a/docs/mllib-frequent-pattern-mining.md +++ b/docs/mllib-frequent-pattern-mining.md @@ -50,6 +50,7 @@ example illustrates how to mine frequent itemsets and association rules Rules](mllib-frequent-pattern-mining.html#association-rules) for details) from `transactions`. +Refer to the [`FPGrowth` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.FPGrowth) for details on the API. {% highlight scala %} import org.apache.spark.rdd.RDD @@ -92,6 +93,8 @@ example illustrates how to mine frequent itemsets and association rules Rules](mllib-frequent-pattern-mining.html#association-rules) for details) from `transactions`. +Refer to the [`FPGrowth` Java docs](api/java/org/apache/spark/mllib/fpm/FPGrowth.html) for details on the API. + {% highlight java %} import java.util.Arrays; import java.util.List; @@ -144,6 +147,8 @@ Calling `FPGrowth.train` with transactions returns an [`FPGrowthModel`](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowthModel) that stores the frequent itemsets with their frequencies. +Refer to the [`FPGrowth` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.fpm.FPGrowth) for more details on the API. + {% highlight python %} from pyspark.mllib.fpm import FPGrowth @@ -170,6 +175,8 @@ for fi in result: implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent. +Refer to the [`AssociationRules` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.AssociationRules) for details on the API. 
+ {% highlight scala %} import org.apache.spark.rdd.RDD import org.apache.spark.mllib.fpm.AssociationRules @@ -199,6 +206,8 @@ results.collect().foreach { rule => implements a parallel rule generation algorithm for constructing rules that have a single item as the consequent. +Refer to the [`AssociationRules` Java docs](api/java/org/apache/spark/mllib/fpm/AssociationRules.html) for details on the API. + {% highlight java %} import java.util.Arrays; @@ -267,6 +276,8 @@ Calling `PrefixSpan.run` returns a [`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) that stores the frequent sequences with their frequencies. +Refer to the [`PrefixSpan` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) and [`PrefixSpanModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.fpm.PrefixSpan @@ -296,6 +307,8 @@ Calling `PrefixSpan.run` returns a [`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) that stores the frequent sequences with their frequencies. +Refer to the [`PrefixSpan` Java docs](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) and [`PrefixSpanModel` Java docs](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html) for details on the API. + {% highlight java %} import java.util.Arrays; import java.util.List; diff --git a/docs/mllib-isotonic-regression.md b/docs/mllib-isotonic-regression.md index 6aa881f749..f91a697b31 100644 --- a/docs/mllib-isotonic-regression.md +++ b/docs/mllib-isotonic-regression.md @@ -59,6 +59,8 @@ i.e. 4710.28,500.00. The data are split to training and testing set. Model is created using the training set and a mean squared error is calculated from the predicted labels and real labels in the test set. 
+Refer to the [`IsotonicRegression` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.IsotonicRegressionModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.regression.{IsotonicRegression, IsotonicRegressionModel} @@ -101,6 +103,8 @@ i.e. 4710.28,500.00. The data are split to training and testing set. Model is created using the training set and a mean squared error is calculated from the predicted labels and real labels in the test set. +Refer to the [`IsotonicRegression` Java docs](api/java/org/apache/spark/mllib/regression/IsotonicRegression.html) and [`IsotonicRegressionModel` Java docs](api/java/org/apache/spark/mllib/regression/IsotonicRegressionModel.html) for details on the API. + {% highlight java %} import org.apache.spark.SparkConf; import org.apache.spark.api.java.JavaDoubleRDD; @@ -167,6 +171,8 @@ i.e. 4710.28,500.00. The data are split to training and testing set. Model is created using the training set and a mean squared error is calculated from the predicted labels and real labels in the test set. +Refer to the [`IsotonicRegression` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegression) and [`IsotonicRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.IsotonicRegressionModel) for more details on the API. + {% highlight python %} import math from pyspark.mllib.regression import IsotonicRegression, IsotonicRegressionModel diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md index e9b2d276cd..a3e1620c77 100644 --- a/docs/mllib-linear-methods.md +++ b/docs/mllib-linear-methods.md @@ -165,6 +165,8 @@ training algorithm on this training data using a static method in the algorithm object, and make predictions with the resulting model to compute the training error. 
+Refer to the [`SVMWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) and [`SVMModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.SVMModel) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.classification.{SVMModel, SVMWithSGD} import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics @@ -230,6 +232,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a calling `.rdd()` on your `JavaRDD` object. A self-contained application example that is equivalent to the provided example in Scala is given below: +Refer to the [`SVMWithSGD` Java docs](api/java/org/apache/spark/mllib/classification/SVMWithSGD.html) and [`SVMModel` Java docs](api/java/org/apache/spark/mllib/classification/SVMModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; @@ -316,6 +320,8 @@ a dependency. The following example shows how to load a sample dataset, build SVM model, and make predictions with the resulting model to compute the training error. +Refer to the [`SVMWithSGD` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.SVMWithSGD) and [`SVMModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.SVMModel) for more details on the API. + {% highlight python %} from pyspark.mllib.classification import SVMWithSGD, SVMModel from pyspark.mllib.regression import LabeledPoint @@ -395,6 +401,8 @@ test, and use to fit a logistic regression model. Then the model is evaluated against the test dataset and saved to disk. +Refer to the [`LogisticRegressionWithLBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithLBFGS) and [`LogisticRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.SparkContext import org.apache.spark.mllib.classification.{LogisticRegressionWithLBFGS, LogisticRegressionModel} @@ -441,6 +449,8 @@ test, and use to fit a logistic regression model. Then the model is evaluated against the test dataset and saved to disk. +Refer to the [`LogisticRegressionWithLBFGS` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionWithLBFGS.html) and [`LogisticRegressionModel` Java docs](api/java/org/apache/spark/mllib/classification/LogisticRegressionModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; @@ -501,6 +511,8 @@ and make predictions with the resulting model to compute the training error. Note that the Python API does not yet support multiclass classification and model save/load but will in the future. +Refer to the [`LogisticRegressionWithLBFGS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionWithLBFGS) and [`LogisticRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.LogisticRegressionModel) for more details on the API. + {% highlight python %} from pyspark.mllib.classification import LogisticRegressionWithLBFGS, LogisticRegressionModel from pyspark.mllib.regression import LabeledPoint @@ -558,6 +570,8 @@ The example then uses LinearRegressionWithSGD to build a simple linear model to values. We compute the mean squared error at the end to evaluate [goodness of fit](http://en.wikipedia.org/wiki/Goodness_of_fit). +Refer to the [`LinearRegressionWithSGD` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD) and [`LinearRegressionModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.regression.LabeledPoint import org.apache.spark.mllib.regression.LinearRegressionModel @@ -600,6 +614,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a calling `.rdd()` on your `JavaRDD` object. The corresponding Java example to the Scala snippet provided is presented below: +Refer to the [`LinearRegressionWithSGD` Java docs](api/java/org/apache/spark/mllib/regression/LinearRegressionWithSGD.html) and [`LinearRegressionModel` Java docs](api/java/org/apache/spark/mllib/regression/LinearRegressionModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; @@ -673,6 +689,8 @@ values. We compute the mean squared error at the end to evaluate Note that the Python API does not yet support model save/load but will in the future. +Refer to the [`LinearRegressionWithSGD` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LinearRegressionWithSGD) and [`LinearRegressionModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.regression.LinearRegressionModel) for more details on the API. + {% highlight python %} from pyspark.mllib.regression import LabeledPoint, LinearRegressionWithSGD, LinearRegressionModel diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md index e73bd30f3a..f4f6a10c82 100644 --- a/docs/mllib-naive-bayes.md +++ b/docs/mllib-naive-bayes.md @@ -38,6 +38,8 @@ smoothing parameter `lambda` as input, an optional model type parameter (default [NaiveBayesModel](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which can be used for evaluation and prediction. +Refer to the [`NaiveBayes` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.classification.NaiveBayesModel) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel} import org.apache.spark.mllib.linalg.Vectors @@ -73,6 +75,8 @@ optionally smoothing parameter `lambda` as input, and output a [NaiveBayesModel](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html), which can be used for evaluation and prediction. +Refer to the [`NaiveBayes` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayes.html) and [`NaiveBayesModel` Java docs](api/java/org/apache/spark/mllib/classification/NaiveBayesModel.html) for details on the API. + {% highlight java %} import scala.Tuple2; @@ -118,6 +122,8 @@ used for evaluation and prediction. Note that the Python API does not yet support model save/load but will in the future. +Refer to the [`NaiveBayes` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayes) and [`NaiveBayesModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.classification.NaiveBayesModel) for more details on the API. + {% highlight python %} from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel from pyspark.mllib.linalg import Vectors diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md index 6cabc1610a..a3bd130ba0 100644 --- a/docs/mllib-optimization.md +++ b/docs/mllib-optimization.md @@ -218,6 +218,8 @@ L-BFGS optimizer. <div class="codetabs"> <div data-lang="scala" markdown="1"> +Refer to the [`LBFGS` Scala docs](api/scala/index.html#org.apache.spark.mllib.optimization.LBFGS) and [`SquaredL2Updater` Scala docs](api/scala/index.html#org.apache.spark.mllib.optimization.SquaredL2Updater) for details on the API. 
+ {% highlight scala %} import org.apache.spark.SparkContext import org.apache.spark.mllib.evaluation.BinaryClassificationMetrics @@ -278,6 +280,8 @@ println("Area under ROC = " + auROC) </div> <div data-lang="java" markdown="1"> +Refer to the [`LBFGS` Java docs](api/java/org/apache/spark/mllib/optimization/LBFGS.html) and [`SquaredL2Updater` Java docs](api/java/org/apache/spark/mllib/optimization/SquaredL2Updater.html) for details on the API. + {% highlight java %} import java.util.Arrays; import java.util.Random; diff --git a/docs/mllib-pmml-model-export.md b/docs/mllib-pmml-model-export.md index 42ea2ca81f..615287125c 100644 --- a/docs/mllib-pmml-model-export.md +++ b/docs/mllib-pmml-model-export.md @@ -45,6 +45,8 @@ The table below outlines the MLlib models that can be exported to PMML and their <div data-lang="scala" markdown="1"> To export a supported `model` (see table above) to PMML, simply call `model.toPMML`. +Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`Vectors` Scala docs](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors) for details on the API. + Here a complete example of building a KMeansModel and print it out in PMML format: {% highlight scala %} import org.apache.spark.mllib.clustering.KMeans diff --git a/docs/mllib-statistics.md b/docs/mllib-statistics.md index 6acfc71d7b..2c7c9ed693 100644 --- a/docs/mllib-statistics.md +++ b/docs/mllib-statistics.md @@ -38,6 +38,8 @@ available in `Statistics`. which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count. +Refer to the [`MultivariateStatisticalSummary` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.MultivariateStatisticalSummary) for details on the API. 
+ {% highlight scala %} import org.apache.spark.mllib.linalg.Vector import org.apache.spark.mllib.stat.{MultivariateStatisticalSummary, Statistics} @@ -60,6 +62,8 @@ println(summary.numNonzeros) // number of nonzeros in each column which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count. +Refer to the [`MultivariateStatisticalSummary` Java docs](api/java/org/apache/spark/mllib/stat/MultivariateStatisticalSummary.html) for details on the API. + {% highlight java %} import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; @@ -86,6 +90,8 @@ System.out.println(summary.numNonzeros()); // number of nonzeros in each column which contains the column-wise max, min, mean, variance, and number of nonzeros, as well as the total count. +Refer to the [`MultivariateStatisticalSummary` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.MultivariateStatisticalSummary) for more details on the API. + {% highlight python %} from pyspark.mllib.stat import Statistics @@ -116,6 +122,8 @@ correlation methods are currently Pearson's and Spearman's correlation. calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively. +Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API. + {% highlight scala %} import org.apache.spark.SparkContext import org.apache.spark.mllib.linalg._ @@ -144,6 +152,8 @@ val correlMatrix: Matrix = Statistics.corr(data, "pearson") calculate correlations between series. Depending on the type of input, two `JavaDoubleRDD`s or a `JavaRDD<Vector>`, the output will be a `Double` or the correlation `Matrix` respectively. +Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API. 
+ {% highlight java %} import org.apache.spark.api.java.JavaDoubleRDD; import org.apache.spark.api.java.JavaSparkContext; @@ -173,6 +183,8 @@ Matrix correlMatrix = Statistics.corr(data.rdd(), "pearson"); calculate correlations between series. Depending on the type of input, two `RDD[Double]`s or an `RDD[Vector]`, the output will be a `Double` or the correlation `Matrix` respectively. +Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API. + {% highlight python %} from pyspark.mllib.stat import Statistics @@ -338,6 +350,8 @@ featureTestResults.foreach { result => run Pearson's chi-squared tests. The following example demonstrates how to run and interpret hypothesis tests. +Refer to the [`ChiSqTestResult` Java docs](api/java/org/apache/spark/mllib/stat/test/ChiSqTestResult.html) for details on the API. + {% highlight java %} import org.apache.spark.api.java.JavaRDD; import org.apache.spark.api.java.JavaSparkContext; @@ -385,6 +399,8 @@ for (ChiSqTestResult result : featureTestResults) { run Pearson's chi-squared tests. The following example demonstrates how to run and interpret hypothesis tests. +Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API. + {% highlight python %} from pyspark import SparkContext from pyspark.mllib.linalg import Vectors, Matrices @@ -437,6 +453,8 @@ message. run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run and interpret the hypothesis tests. +Refer to the [`Statistics` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.Statistics) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.stat.Statistics @@ -459,6 +477,8 @@ val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF) run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run and interpret the hypothesis tests. 
+Refer to the [`Statistics` Java docs](api/java/org/apache/spark/mllib/stat/Statistics.html) for details on the API. + {% highlight java %} import java.util.Arrays; @@ -483,6 +503,8 @@ System.out.println(testResult); run a 1-sample, 2-sided Kolmogorov-Smirnov test. The following example demonstrates how to run and interpret the hypothesis tests. +Refer to the [`Statistics` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.Statistics) for more details on the API. + {% highlight python %} from pyspark.mllib.stat import Statistics @@ -513,6 +535,8 @@ methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD, whose values follows the standard normal distribution `N(0, 1)`, and then map it to `N(1, 4)`. +Refer to the [`RandomRDDs` Scala docs](api/scala/index.html#org.apache.spark.mllib.random.RandomRDDs) for details on the API. + {% highlight scala %} import org.apache.spark.SparkContext import org.apache.spark.mllib.random.RandomRDDs._ @@ -533,6 +557,8 @@ methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD, whose values follows the standard normal distribution `N(0, 1)`, and then map it to `N(1, 4)`. +Refer to the [`RandomRDDs` Java docs](api/java/org/apache/spark/mllib/random/RandomRDDs) for details on the API. + {% highlight java %} import org.apache.spark.SparkContext; import org.apache.spark.api.JavaDoubleRDD; @@ -559,6 +585,8 @@ methods to generate random double RDDs or vector RDDs. The following example generates a random double RDD, whose values follows the standard normal distribution `N(0, 1)`, and then map it to `N(1, 4)`. +Refer to the [`RandomRDDs` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.random.RandomRDDs) for more details on the API. + {% highlight python %} from pyspark.mllib.random import RandomRDDs @@ -589,6 +617,8 @@ mean of PDFs of normal distributions centered around each of the samples. 
to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so. +Refer to the [`KernelDensity` Scala docs](api/scala/index.html#org.apache.spark.mllib.stat.KernelDensity) for details on the API. + {% highlight scala %} import org.apache.spark.mllib.stat.KernelDensity import org.apache.spark.rdd.RDD @@ -611,6 +641,8 @@ val densities = kd.estimate(Array(-1.0, 2.0, 5.0)) to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so. +Refer to the [`KernelDensity` Java docs](api/java/org/apache/spark/mllib/stat/KernelDensity.html) for details on the API. + {% highlight java %} import org.apache.spark.mllib.stat.KernelDensity; import org.apache.spark.rdd.RDD; @@ -633,6 +665,8 @@ double[] densities = kd.estimate(new double[] {-1.0, 2.0, 5.0}); to compute kernel density estimates from an RDD of samples. The following example demonstrates how to do so. +Refer to the [`KernelDensity` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.stat.KernelDensity) for more details on the API. + {% highlight python %} from pyspark.mllib.stat import KernelDensity |
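The context lines in the last hunks describe kernel density estimation as the mean of PDFs of normal distributions centered around each of the samples. A minimal plain-Python sketch of that definition follows. It is not Spark's `KernelDensity` implementation. The helper name `gaussian_kde`, the sample values, and the bandwidth are assumptions chosen for illustration; only the evaluation points `-1.0, 2.0, 5.0` come from the examples in the diff.

```python
import math


def gaussian_kde(samples, bandwidth, points):
    """Evaluate a Gaussian kernel density estimate at each of `points`.

    Hypothetical helper (not part of Spark): the estimate at x is the mean
    of normal PDFs with standard deviation `bandwidth`, one centered on
    each sample.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")

    # Normalizing constant of a normal PDF with standard deviation `bandwidth`.
    norm = 1.0 / (bandwidth * math.sqrt(2.0 * math.pi))

    densities = []
    for x in points:
        # Sum the unnormalized kernels centered on every sample.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        # Average over the samples and apply the normalizing constant.
        densities.append(norm * total / len(samples))
    return densities


# Illustrative data and bandwidth (assumed, not taken from the diff).
samples = [1.0, 1.0, 1.0, 2.0, 3.0, 4.0, 5.0, 5.0, 6.0, 7.0, 8.0, 9.0, 9.0]
densities = gaussian_kde(samples, bandwidth=3.0, points=[-1.0, 2.0, 5.0])
print(densities)
```

With a single sample and a bandwidth of 1, the estimate reduces to the standard normal PDF, which is a quick sanity check on the normalization.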