field | value | date |
---|---|---|
author | krishnakalyan3 <krishnakalyan3@gmail.com> | 2017-02-03 12:19:47 -0800 |
committer | Felix Cheung <felixcheung@apache.org> | 2017-02-03 12:19:47 -0800 |
commit | 48aafeda7db879491ed36fff89d59ca7ec3136fa (patch) | |
tree | c69dbd72d71e3012124b0633f64a7089815e6e44 /R | |
parent | 2f523fa0c930f55a42c4c070efb24f87df33e6c2 (diff) | |
[SPARK-19386][SPARKR][DOC] Bisecting k-means in SparkR documentation
## What changes were proposed in this pull request?
Update programming guide, example and vignette with Bisecting k-means.
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes #16767 from krishnakalyan3/bisecting-kmeans.
Diffstat (limited to 'R')
-rw-r--r-- | R/pkg/vignettes/sparkr-vignettes.Rmd | 14 |
1 files changed, 14 insertions, 0 deletions
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 36a78477dc..a7cac2f503 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -488,6 +488,8 @@ SparkR supports the following machine learning models and algorithms.
 
 #### Clustering
 
+* Bisecting $k$-means
+
 * Gaussian Mixture Model (GMM)
 
 * $k$-means Clustering
@@ -738,6 +740,18 @@ summary(rfModel)
 predictions <- predict(rfModel, df)
 ```
 
+#### Bisecting k-Means
+
+`spark.bisectingKmeans` is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+
+```{r}
+df <- createDataFrame(iris)
+model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+summary(kmeansModel)
+fitted <- predict(model, df)
+head(select(fitted, "Sepal_Length", "prediction"))
+```
+
 #### Gaussian Mixture Model
 
 `spark.gaussianMixture` fits multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
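
Note on the added chunk: it fits the model into `model` but then calls `summary(kmeansModel)`, a name the chunk itself never defines. This looks like a leftover from the k-means example, and `summary(model)` is presumably what was intended. A minimal standalone sketch of the chunk with that one change is shown below. It assumes the SparkR package is installed and can find a local Spark installation; the `sparkR.session()` setup, the `local[1]` master and the `result` variable are illustrative additions, not part of the commit.

```r
library(SparkR)

# Assumption: SparkR can locate a local Spark installation (e.g. via SPARK_HOME).
sparkR.session(master = "local[1]")

# createDataFrame() replaces "." in column names with "_",
# so iris's Sepal.Length becomes Sepal_Length.
df <- createDataFrame(iris)

# Fit bisecting k-means with four clusters, using the formula from the vignette chunk.
model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)

# Summarize the model fitted above (the committed chunk passes `kmeansModel` here).
summary(model)

# Append a "prediction" column holding the assigned cluster for each row.
fitted <- predict(model, df)

# head() collects the first rows into a local R data.frame.
result <- head(select(fitted, "Sepal_Length", "prediction"))
print(result)

sparkR.session.stop()
```

Inside the vignette itself the session setup and teardown would not be needed, since the document starts a session earlier on; only the `summary()` argument differs from the committed chunk.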