---
layout: global
title: Clustering
displayTitle: Clustering
---

This page describes clustering algorithms in MLlib.
The [guide for clustering in the RDD-based API](mllib-clustering.html) also has relevant information
about these algorithms.

**Table of Contents**

* This will become a table of contents (this text will be scraped).
{:toc}

## K-means

[k-means](http://en.wikipedia.org/wiki/K-means_clustering) is one of the
most commonly used clustering algorithms that clusters the data points into a
predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).

`KMeans` is implemented as an `Estimator` and generates a `KMeansModel` as the base model.

### Input Columns

| Param name | Type(s) | Default | Description |
|------------|---------|---------|-------------|
| featuresCol | Vector | "features" | Feature vector |

### Output Columns

| Param name | Type(s) | Default | Description |
|------------|---------|---------|-------------|
| predictionCol | Int | "prediction" | Index of the predicted cluster |
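
To make the `prediction` column concrete, the following single-machine sketch runs plain Lloyd iterations in pure Python. It is not the MLlib implementation: it uses naive random initialization rather than k-means||, and the names `nearest` and `kmeans`, the sample points, and the default arguments are illustrative assumptions. The value returned by `nearest` plays the role of the integer written to `predictionCol`.

```python
import math
import random


def nearest(point, centers):
    """Return the index of the center closest to `point` (Euclidean distance)."""
    return min(range(len(centers)), key=lambda i: math.dist(point, centers[i]))


def kmeans(points, k, iterations=20, seed=1):
    """Plain Lloyd iterations; a hypothetical helper, not an MLlib API."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # naive random initialization, not k-means||
    for _ in range(iterations):
        # Assignment step: label each point with the index of its nearest center.
        labels = [nearest(p, centers) for p in points]
        # Update step: move each center to the mean of its assigned points.
        for i in range(k):
            members = [p for p, label in zip(points, labels) if label == i]
            if members:  # keep the old center if a cluster becomes empty
                centers[i] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, [nearest(p, centers) for p in points]


# Made-up two-dimensional feature vectors forming two well-separated groups.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.2), (9.0, 9.0), (9.1, 9.1), (9.2, 9.2)]
centers, predictions = kmeans(points, k=2)
```

Which group receives index 0 depends on the random initialization, so only the grouping itself, not the numeric label, is meaningful.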

**Examples**

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.KMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/KMeansExample.scala %}

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/KMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaKMeansExample.java %}

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans) for more details.

{% include_example python/ml/kmeans_example.py %}

Refer to the [R API docs](api/R/spark.kmeans.html) for more details.

{% include_example r/ml/kmeans.R %}

## Latent Dirichlet allocation (LDA)

`LDA` is implemented as an `Estimator` that supports both `EMLDAOptimizer` and `OnlineLDAOptimizer`,
and generates an `LDAModel` as the base model. Expert users may cast an `LDAModel` generated by
`EMLDAOptimizer` to a `DistributedLDAModel` if needed.

**Examples**

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.LDA) for more details.

{% include_example scala/org/apache/spark/examples/ml/LDAExample.scala %}

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/LDA.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaLDAExample.java %}

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.LDA) for more details.

{% include_example python/ml/lda_example.py %}

Refer to the [R API docs](api/R/spark.lda.html) for more details.

{% include_example r/ml/lda.R %}

## Bisecting k-means

Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a
divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one
moves down the hierarchy.

Bisecting k-means can often be much faster than regular k-means, but it will generally produce a different clustering.

`BisectingKMeans` is implemented as an `Estimator` and generates a `BisectingKMeansModel` as the base model.

**Examples**

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.BisectingKMeans) for more details.

{% include_example scala/org/apache/spark/examples/ml/BisectingKMeansExample.scala %}

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/BisectingKMeans.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaBisectingKMeansExample.java %}

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.BisectingKMeans) for more details.

{% include_example python/ml/bisecting_k_means_example.py %}

Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details.

{% include_example r/ml/bisectingKmeans.R %}
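
The divisive ("top-down") idea behind bisecting k-means can be illustrated with a small single-machine sketch in pure Python. This is not how `BisectingKMeans` is implemented in MLlib. The helper names (`mean`, `sse`, `bisect`, `bisecting_kmeans`), the rule of always splitting the cluster with the largest sum of squared errors, the deterministic seeding of each split, and the sample points are all assumptions made for illustration.

```python
import math


def mean(cluster):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(cluster) for c in zip(*cluster))


def sse(cluster):
    """Sum of squared distances from each point to the cluster mean."""
    center = mean(cluster)
    return sum(math.dist(p, center) ** 2 for p in cluster)


def bisect(cluster, iterations=10):
    """Split one cluster in two with a few 2-means iterations."""
    # Seeding assumed for this sketch: the first point and the point farthest from it.
    first = cluster[0]
    centers = [first, max(cluster, key=lambda p: math.dist(p, first))]
    for _ in range(iterations):
        halves = ([], [])
        for p in cluster:
            closer = 0 if math.dist(p, centers[0]) <= math.dist(p, centers[1]) else 1
            halves[closer].append(p)
        centers = [mean(halves[0]), mean(halves[1])]
    return list(halves)


def bisecting_kmeans(points, k):
    """Start from one cluster and split until there are k clusters."""
    clusters = [list(points)]  # all observations start in one cluster
    while len(clusters) < k:
        # Assumed splitting rule: bisect the cluster with the largest error.
        worst = max(clusters, key=sse)
        if sse(worst) == 0.0:
            break  # all remaining clusters are single locations; nothing to split
        clusters.remove(worst)
        clusters.extend(bisect(worst))
    return clusters


# Made-up points forming three groups along the x-axis.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 0.0), (11.0, 0.0), (50.0, 0.0), (51.0, 0.0)]
clusters = bisecting_kmeans(points, k=3)
```

With these points, the first split separates the two rightmost points from the rest, and the second split divides the remaining four points into two pairs.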

## Gaussian Mixture Model (GMM)

A [Gaussian Mixture Model](http://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model)
represents a composite distribution whereby points are drawn from one of *k* Gaussian sub-distributions,
each with its own probability. The `spark.ml` implementation uses the
[expectation-maximization](http://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm)
algorithm to induce the maximum-likelihood model given a set of samples.

`GaussianMixture` is implemented as an `Estimator` and generates a `GaussianMixtureModel` as the base
model.

### Input Columns

| Param name | Type(s) | Default | Description |
|------------|---------|---------|-------------|
| featuresCol | Vector | "features" | Feature vector |

### Output Columns

| Param name | Type(s) | Default | Description |
|------------|---------|---------|-------------|
| predictionCol | Int | "prediction" | Index of the predicted cluster |
| probabilityCol | Vector | "probability" | Probability of each cluster |
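
The relationship between the two output columns can be sketched with a one-dimensional expectation-maximization loop in pure Python. This is a simplification made for illustration and not the `spark.ml` implementation, which handles multivariate Gaussians. The function names, the starting means, the variance floor of `1e-6`, and the sample values are assumptions. The list returned by `responsibilities` corresponds to the `probability` vector, and the index of its largest entry corresponds to `prediction`.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def responsibilities(x, weights, means, variances):
    """E-step for one sample: posterior probability of each component."""
    joint = [w * gaussian_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
    total = sum(joint)
    return [j / total for j in joint]


def em_gmm(samples, initial_means, iterations=50):
    """Fit a 1-D Gaussian mixture with EM; a hypothetical helper, not an MLlib API."""
    k = len(initial_means)
    means = list(initial_means)
    weights = [1.0 / k] * k
    variances = [1.0] * k
    for _ in range(iterations):
        # E-step: soft-assign every sample to every component.
        resp = [responsibilities(x, weights, means, variances) for x in samples]
        # M-step: re-estimate each component from its weighted samples.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / len(samples)
            means[j] = sum(r[j] * x for r, x in zip(resp, samples)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, samples)) / nj
            variances[j] = max(spread, 1e-6)  # assumed floor to avoid a zero variance
    return weights, means, variances


# Made-up samples drawn around two locations.
samples = [0.0, 0.2, 0.4, 9.0, 9.2, 9.4]
weights, means, variances = em_gmm(samples, initial_means=[0.0, 9.0])

# Score a new point: the probability vector and the predicted cluster index.
probability = responsibilities(0.1, weights, means, variances)
prediction = probability.index(max(probability))
```

Unlike the hard assignment in k-means, every sample contributes to every component in proportion to its responsibility, which is why the model can report a probability for each cluster.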

**Examples**

Refer to the [Scala API docs](api/scala/index.html#org.apache.spark.ml.clustering.GaussianMixture) for more details.

{% include_example scala/org/apache/spark/examples/ml/GaussianMixtureExample.scala %}

Refer to the [Java API docs](api/java/org/apache/spark/ml/clustering/GaussianMixture.html) for more details.

{% include_example java/org/apache/spark/examples/ml/JavaGaussianMixtureExample.java %}

Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.GaussianMixture) for more details.

{% include_example python/ml/gaussian_mixture_example.py %}

Refer to the [R API docs](api/R/spark.gaussianMixture.html) for more details.

{% include_example r/ml/gaussianMixture.R %}