Diffstat (limited to 'docs/ml-clustering.md')
-rw-r--r-- | docs/ml-clustering.md | 71 |
1 files changed, 71 insertions, 0 deletions
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index a59f7e3005..440c455cd0 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -11,6 +11,77 @@ In this section, we introduce the pipeline API for [clustering in mllib](mllib-c
 * This will become a table of contents (this text will be scraped).
 {:toc}
 
+## K-means
+
+[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
+most commonly used clustering algorithms that clusters the data points into a
+predefined number of clusters. The MLlib implementation includes a parallelized
+variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
+called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
+
+`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
+
+### Input Columns
+
+<table class="table">
+  <thead>
+    <tr>
+      <th align="left">Param name</th>
+      <th align="left">Type(s)</th>
+      <th align="left">Default</th>
+      <th align="left">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>featuresCol</td>
+      <td>Vector</td>
+      <td>"features"</td>
+      <td>Feature vector</td>
+    </tr>
+  </tbody>
+</table>
+
+### Output Columns
+
+<table class="table">
+  <thead>
+    <tr>
+      <th align="left">Param name</th>
+      <th align="left">Type(s)</th>
+      <th align="left">Default</th>
+      <th align="left">Description</th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>predictionCol</td>
+      <td>Int</td>
+      <td>"prediction"</td>
+      <td>Predicted cluster center</td>
+    </tr>
+  </tbody>
+</table>
+
+### Example
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.
+
+{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.
+
+{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
+</div>
+
+</div>
+
+
 ## Latent Dirichlet allocation (LDA)
 
 `LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
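
The K-means section added in this diff describes an estimator that reads a feature vector and writes an integer cluster index. As a rough illustration of that mapping, here is a minimal, stdlib-only Java sketch of plain Lloyd iterations. It is not the MLlib implementation: the class `KMeansSketch`, its methods `fit` and `predict`, and the choice to seed the centers with the first `k` points are all assumptions made for this example, and it does not attempt the k-means++ or kmeans|| initialization that the doc mentions.

```java
import java.util.Arrays;

// Hypothetical helper class for illustration; not part of Spark.
class KMeansSketch {
    // Squared Euclidean distance between two equal-length vectors.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the nearest center: the analogue of the Int "prediction" column.
    static int predict(double[][] centers, double[] point) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (squaredDistance(centers[c], point) < squaredDistance(centers[best], point)) {
                best = c;
            }
        }
        return best;
    }

    // Plain Lloyd iterations. The first k points seed the centers, which is
    // an assumption chosen for determinism (not k-means++ or kmeans||).
    static double[][] fit(double[][] points, int k, int maxIter) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = Arrays.copyOf(points[c], dim);
        }
        for (int iter = 0; iter < maxIter; iter++) {
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            // Assignment step: attach every point to its nearest center.
            for (double[] p : points) {
                int c = predict(centers, p);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each non-empty cluster's center to its mean.
            boolean changed = false;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its previous center
                }
                for (int i = 0; i < dim; i++) {
                    double updated = sums[c][i] / counts[c];
                    if (updated != centers[c][i]) {
                        changed = true;
                    }
                    centers[c][i] = updated;
                }
            }
            if (!changed) {
                break; // centers are stable, so assignments will not change
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] features = {
            {0.0, 0.0}, {9.0, 9.0}, {0.2, 0.1}, {9.1, 8.9}, {0.1, 0.2}, {8.9, 9.1}
        };
        double[][] centers = fit(features, 2, 20);
        for (double[] f : features) {
            System.out.println(Arrays.toString(f) + " -> cluster " + predict(centers, f));
        }
    }
}
```

In this sketch `fit` plays the role the doc assigns to the `Estimator` (it produces the fitted centers) and `predict` plays the role of the resulting model (it turns one feature vector into one cluster index). The Spark examples referenced in the diff remain the place to look for the real API usage.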