Diffstat (limited to 'docs/mllib-clustering.md')
-rw-r--r--  docs/mllib-clustering.md | 30
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index c2711cf82d..8fbced6c87 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -4,10 +4,10 @@ title: Clustering - MLlib
 displayTitle: <a href="mllib-guide.html">MLlib</a> - Clustering
 ---
 
-Clustering is an unsupervised learning problem whereby we aim to group subsets
+[Clustering](https://en.wikipedia.org/wiki/Cluster_analysis) is an unsupervised learning problem whereby we aim to group subsets
 of entities with one another based on some notion of similarity. Clustering is
 often used for exploratory analysis and/or as a component of a hierarchical
-supervised learning pipeline (in which distinct classifiers or regression
+[supervised learning](https://en.wikipedia.org/wiki/Supervised_learning) pipeline (in which distinct classifiers or regression
 models are trained for each cluster).
 
 MLlib supports the following models:
@@ -47,6 +47,8 @@ into two clusters. The number of desired clusters is passed to the algorithm. We
 Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In fact the
 optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
 
+Refer to the [`KMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) and [`KMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.KMeansModel) for details on the API.
+
 {% highlight scala %}
 import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
 import org.apache.spark.mllib.linalg.Vectors
@@ -77,6 +79,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a
 calling `.rdd()` on your `JavaRDD` object. A self-contained application example
 that is equivalent to the provided example in Scala is given below:
 
+Refer to the [`KMeans` Java docs](api/java/org/apache/spark/mllib/clustering/KMeans.html) and [`KMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/KMeansModel.html) for details on the API.
+
 {% highlight java %}
 import org.apache.spark.api.java.*;
 import org.apache.spark.api.java.function.Function;
@@ -132,6 +136,8 @@ data into two clusters. The number of desired clusters is passed to the algorith
 Within Set Sum of Squared Error (WSSSE). You can reduce this error measure by increasing *k*. In
 fact the optimal *k* is usually one where there is an "elbow" in the WSSSE graph.
 
+Refer to the [`KMeans` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeans) and [`KMeansModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.KMeansModel) for more details on the API.
+
 {% highlight python %}
 from pyspark.mllib.clustering import KMeans, KMeansModel
 from numpy import array
@@ -184,6 +190,8 @@ In the following example after loading and parsing data, we use a
 object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.
 
+Refer to the [`GaussianMixture` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixture) and [`GaussianMixtureModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.GaussianMixtureModel) for details on the API.
+
 {% highlight scala %}
 import org.apache.spark.mllib.clustering.GaussianMixture
 import org.apache.spark.mllib.clustering.GaussianMixtureModel
@@ -216,6 +224,8 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a
 calling `.rdd()` on your `JavaRDD` object. A self-contained application example
 that is equivalent to the provided example in Scala is given below:
 
+Refer to the [`GaussianMixture` Java docs](api/java/org/apache/spark/mllib/clustering/GaussianMixture.html) and [`GaussianMixtureModel` Java docs](api/java/org/apache/spark/mllib/clustering/GaussianMixtureModel.html) for details on the API.
+
 {% highlight java %}
 import org.apache.spark.api.java.*;
 import org.apache.spark.api.java.function.Function;
@@ -268,6 +278,8 @@ In the following example after loading and parsing data, we use a
 object to cluster the data into two clusters. The number of desired clusters is passed
 to the algorithm. We then output the parameters of the mixture model.
 
+Refer to the [`GaussianMixture` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixture) and [`GaussianMixtureModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.GaussianMixtureModel) for more details on the API.
+
 {% highlight python %}
 from pyspark.mllib.clustering import GaussianMixture
 from numpy import array
@@ -324,6 +336,8 @@ Calling `PowerIterationClustering.run` returns a
 [`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
 which contains the computed clustering assignments.
 
+Refer to the [`PowerIterationClustering` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering) and [`PowerIterationClusteringModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel) for details on the API.
+
 {% highlight scala %}
 import org.apache.spark.mllib.clustering.{PowerIterationClustering, PowerIterationClusteringModel}
 import org.apache.spark.mllib.linalg.Vectors
@@ -365,6 +379,8 @@ Calling `PowerIterationClustering.run` returns a
 [`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html)
 which contains the computed clustering assignments.
 
+Refer to the [`PowerIterationClustering` Java docs](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html) and [`PowerIterationClusteringModel` Java docs](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html) for details on the API.
+
 {% highlight java %}
 import scala.Tuple2;
 import scala.Tuple3;
@@ -411,6 +427,8 @@ Calling `PowerIterationClustering.run` returns a
 [`PowerIterationClusteringModel`](api/python/pyspark.mllib.html#pyspark.mllib.clustering.PowerIterationClustering),
 which contains the computed clustering assignments.
 
+Refer to the [`PowerIterationClustering` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.PowerIterationClustering) and [`PowerIterationClusteringModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.PowerIterationClusteringModel) for more details on the API.
+
 {% highlight python %}
 from __future__ import print_function
 from pyspark.mllib.clustering import PowerIterationClustering, PowerIterationClusteringModel
@@ -571,6 +589,7 @@ to the algorithm. We then output the topics, represented as probability distribu
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
+Refer to the [`LDA` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.LDA) and [`DistributedLDAModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.DistributedLDAModel) for details on the API.
 
 {% highlight scala %}
 import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel}
@@ -602,6 +621,8 @@ val sameModel = DistributedLDAModel.load(sc, "myLDAModel")
 </div>
 
 <div data-lang="java" markdown="1">
+Refer to the [`LDA` Java docs](api/java/org/apache/spark/mllib/clustering/LDA.html) and [`DistributedLDAModel` Java docs](api/java/org/apache/spark/mllib/clustering/DistributedLDAModel.html) for details on the API.
+
 {% highlight java %}
 
 import scala.Tuple2;
@@ -666,6 +687,8 @@ public class JavaLDAExample {
 </div>
 
 <div data-lang="python" markdown="1">
+Refer to the [`LDA` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDA) and [`LDAModel` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.LDAModel) for more details on the API.
+
 {% highlight python %}
 from pyspark.mllib.clustering import LDA, LDAModel
 from pyspark.mllib.linalg import Vectors
@@ -730,6 +753,7 @@ This example shows how to estimate clusters on streaming data.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
+Refer to the [`StreamingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.StreamingKMeans) for details on the API.
 
 First we import the neccessary classes.
 
@@ -780,6 +804,8 @@ ssc.awaitTermination()
 </div>
 
 <div data-lang="python" markdown="1">
+Refer to the [`StreamingKMeans` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.clustering.StreamingKMeans) for more details on the API.
+
 First we import the neccessary classes.
 
 {% highlight python %}