---
layout: global
title: Clustering - spark.ml
displayTitle: Clustering - spark.ml
---
In this section, we introduce the pipeline API for [clustering in mllib](mllib-clustering.html).
**Table of Contents**
* This will become a table of contents (this text will be scraped).
{:toc}
## K-means
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms. It partitions the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [k-means||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.
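
To make the clustering step concrete, here is a minimal sketch of the basic k-means iteration (assign each point to its nearest center, then move each center to the mean of its points) in plain Java. It is not the MLlib implementation: it runs on a single machine, takes the initial centers as given instead of choosing them with k-means++ or k-means||, and the class and method names (`KMeansSketch`, `fit`, `closestCenter`) are hypothetical.

```java
import java.util.Arrays;

// Illustrative single-machine k-means; NOT how MLlib implements KMeans.
public class KMeansSketch {

    // Squared Euclidean distance between two points of equal dimension.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Index of the center closest to the given point.
    static int closestCenter(double[] point, double[][] centers) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centers.length; c++) {
            double dist = squaredDistance(point, centers[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }

    // Runs at most maxIter iterations. Updates `centers` in place and
    // returns the cluster index assigned to each point.
    static int[] fit(double[][] points, double[][] centers, int maxIter) {
        int k = centers.length;
        int dim = points[0].length;
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIter; iter++) {
            // Assignment step: attach every point to its nearest center.
            boolean changed = false;
            for (int p = 0; p < points.length; p++) {
                int c = closestCenter(points[p], centers);
                if (c != assignment[p]) {
                    assignment[p] = c;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point switched clusters
            }

            // Update step: move each center to the mean of its points.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (int p = 0; p < points.length; p++) {
                counts[assignment[p]]++;
                for (int d = 0; d < dim; d++) {
                    sums[assignment[p]][d] += points[p][d];
                }
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // leave empty clusters where they are
                    for (int d = 0; d < dim; d++) {
                        centers[c][d] = sums[c][d] / counts[c];
                    }
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // Toy data: two well-separated groups and hand-picked initial centers.
        double[][] points = {{0, 0}, {0, 1}, {1, 0}, {9, 9}, {9, 10}, {10, 9}};
        double[][] centers = {{0, 0}, {10, 10}};
        int[] clusters = fit(points, centers, 20);
        System.out.println(Arrays.toString(clusters)); // [0, 0, 0, 1, 1, 1]
        System.out.println(Arrays.deepToString(centers));
    }
}
```

The returned integer per point plays the same role as the value that `KMeansModel` writes to the prediction column: an index identifying the cluster, not the coordinates of its center.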
### Input Columns

| Param name | Type(s) | Default | Description |
|---|---|---|---|
| featuresCol | Vector | "features" | Feature vector |

### Output Columns

| Param name | Type(s) | Default | Description |
|---|---|---|---|
| predictionCol | Int | "prediction" | Index of the predicted cluster center |

### Example
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.
{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.
{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}
## Latent Dirichlet allocation (LDA)
`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.
Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.
{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}
Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.
{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}