author | Yu ISHIKAWA <yuu.ishikawa@gmail.com> | 2015-12-16 10:55:42 -0800 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2015-12-16 10:55:42 -0800 |
commit | 7b6dc29d0ebbfb3bb941130f8542120b6bc3e234 (patch) | |
tree | 94970c4bfb67f129f2e580542276f623426e8625 /docs/mllib-clustering.md | |
parent | ad8c1f0b840284d05da737fb2cc5ebf8848f4490 (diff) | |
[SPARK-6518][MLLIB][EXAMPLE][DOC] Add example code and user guide for bisecting k-means
This PR includes only an example code in order to finish it quickly.
I'll send another PR for the docs soon.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9952 from yu-iskw/SPARK-6518.
Diffstat (limited to 'docs/mllib-clustering.md')
-rw-r--r-- | docs/mllib-clustering.md | 35 |
1 files changed, 35 insertions, 0 deletions
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 48d64cd402..93cd0c1c61 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -718,6 +718,41 @@ sameModel = LDAModel.load(sc, "myModelPath")
 
 </div>
 
+## Bisecting k-means
+
+Bisecting K-means can often be much faster than regular K-means, but it will generally produce a different clustering.
+
+Bisecting k-means is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering).
+Hierarchical clustering is one of the most commonly used method of cluster analysis which seeks to build a hierarchy of clusters.
+Strategies for hierarchical clustering generally fall into two types:
+
+- Agglomerative: This is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.
+- Divisive: This is a "top down" approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+
+Bisecting k-means algorithm is a kind of divisive algorithms.
+The implementation in MLlib has the following parameters:
+
+* *k*: the desired number of leaf clusters (default: 4). The actual number could be smaller if there are no divisible leaf clusters.
+* *maxIterations*: the max number of k-means iterations to split clusters (default: 20)
+* *minDivisibleClusterSize*: the minimum number of points (if >= 1.0) or the minimum proportion of points (if < 1.0) of a divisible cluster (default: 1)
+* *seed*: a random seed (default: hash value of the class name)
+
+**Examples**
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+Refer to the [`BisectingKMeans` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeans) and [`BisectingKMeansModel` Scala docs](api/scala/index.html#org.apache.spark.mllib.clustering.BisectingKMeansModel) for details on the API.
+
+{% include_example scala/org/apache/spark/examples/mllib/BisectingKMeansExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+Refer to the [`BisectingKMeans` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeans.html) and [`BisectingKMeansModel` Java docs](api/java/org/apache/spark/mllib/clustering/BisectingKMeansModel.html) for details on the API.
+
+{% include_example java/org/apache/spark/examples/mllib/JavaBisectingKMeansExample.java %}
+</div>
+</div>
+
 ## Streaming k-means
 
 When data arrive in a stream, we may want to estimate clusters dynamically,
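The parameter descriptions added in this diff can be illustrated with a small, standalone sketch. The code below is not the MLlib implementation and does not use the Spark API: it is a plain-Python toy on 1-D points, and the function names (`min_divisible_size`, `two_means`, `bisecting_kmeans`), the deterministic min/max initialization (in place of a *seed*), and the "split the largest divisible leaf first" rule are all assumptions made for illustration. It shows one reading of how *k*, *maxIterations*, and *minDivisibleClusterSize* might interact in a divisive ("top down") clustering.

```python
import math


def min_divisible_size(min_divisible_cluster_size, n_points):
    """Turn the threshold into a point count (assumed reading of the docs):
    values >= 1.0 are a count, values < 1.0 are a proportion of all points."""
    if min_divisible_cluster_size >= 1.0:
        return math.ceil(min_divisible_cluster_size)
    return math.ceil(min_divisible_cluster_size * n_points)


def two_means(points, max_iterations=20):
    """Split 1-D points in two with plain k-means (k = 2).
    Assumes at least two distinct values. Initial centers are the min and
    max point, a deterministic stand-in for seeded random initialization."""
    lo, hi = min(points), max(points)
    left, right = [], []
    for _ in range(max_iterations):
        # Assign each point to the nearer center (ties go to the left).
        left = [p for p in points if abs(p - lo) <= abs(p - hi)]
        right = [p for p in points if abs(p - lo) > abs(p - hi)]
        new_lo, new_hi = sum(left) / len(left), sum(right) / len(right)
        if (new_lo, new_hi) == (lo, hi):
            break  # centers stopped moving
        lo, hi = new_lo, new_hi
    return left, right


def bisecting_kmeans(points, k=4, max_iterations=20,
                     min_divisible_cluster_size=1.0):
    """Divisive clustering: start with one cluster, bisect until k leaves
    exist or no leaf is divisible (so fewer than k leaves may be returned)."""
    min_size = min_divisible_size(min_divisible_cluster_size, len(points))
    leaves = [sorted(points)]
    while len(leaves) < k:
        # A leaf is divisible if it is large enough and not all one value.
        divisible = [c for c in leaves if len(c) >= min_size and c[0] != c[-1]]
        if not divisible:
            break
        target = max(divisible, key=len)  # assumption: split largest first
        leaves.remove(target)
        leaves.extend(two_means(target, max_iterations))
    return sorted(leaves)


if __name__ == "__main__":
    data = [1.0, 1.1, 1.2, 5.0, 5.1, 9.0, 9.2, 9.4]
    for leaf in bisecting_kmeans(data, k=3):
        print(leaf)
```

With `min_divisible_cluster_size=0.5` on the same eight points, the threshold resolves to four points, so after the first two splits no leaf is large enough to divide and the sketch stops with three leaves even when `k=4` is requested. This mirrors the note on *k* that the actual number of leaf clusters could be smaller than requested.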