author krishnakalyan3 <krishnakalyan3@gmail.com> 2017-02-03 12:19:47 -0800
committer Felix Cheung <felixcheung@apache.org> 2017-02-03 12:19:47 -0800
commit 48aafeda7db879491ed36fff89d59ca7ec3136fa
tree c69dbd72d71e3012124b0633f64a7089815e6e44
parent 2f523fa0c930f55a42c4c070efb24f87df33e6c2
[SPARK-19386][SPARKR][DOC] Bisecting k-means in SparkR documentation
## What changes were proposed in this pull request?

Update programming guide, example and vignette with Bisecting k-means.

Author: krishnakalyan3 <krishnakalyan3@gmail.com>

Closes #16767 from krishnakalyan3/bisecting-kmeans.
-rw-r--r-- R/pkg/vignettes/sparkr-vignettes.Rmd 14
-rw-r--r-- docs/ml-clustering.md 7
-rw-r--r-- examples/src/main/r/ml/bisectingKmeans.R 42
3 files changed, 63 insertions, 0 deletions
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 36a78477dc..a7cac2f503 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -488,6 +488,8 @@ SparkR supports the following machine learning models and algorithms.
#### Clustering
+* Bisecting $k$-means
+
* Gaussian Mixture Model (GMM)
* $k$-means Clustering
@@ -738,6 +740,18 @@ summary(rfModel)
predictions <- predict(rfModel, df)
```
+#### Bisecting k-Means
+
+`spark.bisectingKmeans` is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.
+
+```{r}
+df <- createDataFrame(iris)
+model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+summary(model)
+fitted <- predict(model, df)
+head(select(fitted, "Sepal_Length", "prediction"))
+```
+
#### Gaussian Mixture Model
`spark.gaussianMixture` fits multivariate [Gaussian Mixture Model](https://en.wikipedia.org/wiki/Mixture_model#Multivariate_Gaussian_mixture_model) (GMM) against a `SparkDataFrame`. [Expectation-Maximization](https://en.wikipedia.org/wiki/Expectation%E2%80%93maximization_algorithm) (EM) is used to approximate the maximum likelihood estimator (MLE) of the model.
diff --git a/docs/ml-clustering.md b/docs/ml-clustering.md
index d8b6553c5b..1186fb73d0 100644
--- a/docs/ml-clustering.md
+++ b/docs/ml-clustering.md
@@ -167,6 +167,13 @@ Refer to the [Python API docs](api/python/pyspark.ml.html#pyspark.ml.clustering.
{% include_example python/ml/bisecting_k_means_example.py %}
</div>
+
+<div data-lang="r" markdown="1">
+
+Refer to the [R API docs](api/R/spark.bisectingKmeans.html) for more details.
+
+{% include_example r/ml/bisectingKmeans.R %}
+</div>
</div>
## Gaussian Mixture Model (GMM)
diff --git a/examples/src/main/r/ml/bisectingKmeans.R b/examples/src/main/r/ml/bisectingKmeans.R
new file mode 100644
index 0000000000..37aeb74fc7
--- /dev/null
+++ b/examples/src/main/r/ml/bisectingKmeans.R
@@ -0,0 +1,42 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+#
+
+# To run this example use
+# ./bin/spark-submit examples/src/main/r/ml/bisectingKmeans.R
+
+# Load SparkR library into your R session
+library(SparkR)
+
+# Initialize SparkSession
+sparkR.session(appName = "SparkR-ML-bisectingKmeans-example")
+
+# $example on$
+df <- createDataFrame(iris)
+
+# Fit bisecting k-means model with four centers
+model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+
+# Get the fitted result (cluster centers) from the bisecting k-means model
+fitted.model <- fitted(model, "centers")
+
+# Summary statistics of the fitted result
+head(summary(fitted.model))
+
+# Fitted values on training data
+fitted <- predict(model, df)
+head(select(fitted, "Sepal_Length", "prediction"))
+# $example off$
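
The vignette text added in this commit describes bisecting k-means as a divisive ("top-down") procedure: all observations start in one cluster, and splits are performed recursively. The following is a minimal base-R sketch of that idea, not the algorithm `spark.bisectingKmeans` actually runs. The helper name `bisect_sketch` and the rule "always split the largest cluster" are assumptions made for illustration, and it uses the local `iris` column names (`Sepal.Length`) rather than the `Sepal_Length` form that SparkR produces.

```r
# Illustrative sketch only: divisive clustering built on stats::kmeans().
# `bisect_sketch` is a hypothetical helper, not part of SparkR.
bisect_sketch <- function(x, k) {
  x <- as.matrix(x)
  # All observations start in a single cluster.
  labels <- rep(1L, nrow(x))
  while (length(unique(labels)) < k) {
    # Assumption: pick the currently largest cluster as the one to split.
    sizes <- table(labels)
    target <- as.integer(names(sizes)[which.max(sizes)])
    idx <- which(labels == target)
    # Bisect the chosen cluster with an ordinary 2-means run.
    halves <- kmeans(x[idx, , drop = FALSE], centers = 2, nstart = 5)$cluster
    # Rows in the second half get a new cluster id; the rest keep the old one.
    labels[idx[halves == 2L]] <- max(labels) + 1L
  }
  labels
}

set.seed(42)
clusters <- bisect_sketch(iris[, c("Sepal.Length", "Sepal.Width")], k = 4)
table(clusters)
```

Each pass through the loop adds exactly one cluster, so `k = 4` takes three bisections. Other choices of which cluster to split (for example, the one with the largest within-cluster sum of squares) would fit the same loop.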