[SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5

This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R API doc.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8085 from ericl/docs.
author: Eric Liang <ekl@databricks.com> 2015-08-11 21:26:03 -0700
committer: Xiangrui Meng <meng@databricks.com> 2015-08-11 21:26:03 -0700
commit: 74a293f4537c6982345166f8883538f81d850872 (patch)
tree: 0b4c7f29f22c5b72de91c5e4e3a53bb2466a0bc2 /docs/sparkr.md
parent: 3ef0f32928fc383ad3edd5ad167212aeb9eba6e1 (diff)
1 files changed, 36 insertions, 1 deletions
diff --git a/docs/sparkr.md b/docs/sparkr.md
index 4385a4eeac..7139d16b4a 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -11,7 +11,8 @@ title: SparkR (R on Spark)
 SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
 In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that
 supports operations like selection, filtering, aggregation etc. (similar to R data frames,
-[dplyr](https://github.com/hadley/dplyr)) but on large datasets.
+[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed
+machine learning using MLlib.
 
 # SparkR DataFrames
 
@@ -230,3 +231,37 @@ head(teenagers)
 
 {% endhighlight %}
 </div>
+
+# Machine Learning
+
+SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows how to build a gaussian GLM using SparkR.
+
+<div data-lang="r"  markdown="1">
+{% highlight r %}
+# Create the DataFrame
+df <- createDataFrame(sqlContext, iris)
+
+# Fit a linear model over the dataset.
+model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
+
+# Model coefficients are returned in a similar format to R's native glm().
+summary(model)
+##$coefficients
+##                    Estimate
+##(Intercept)        2.2513930
+##Sepal_Width        0.8035609
+##Species_versicolor 1.4587432
+##Species_virginica  1.9468169
+
+# Make predictions based on the model.
+predictions <- predict(model, newData = df)
+head(select(predictions, "Sepal_Length", "prediction"))
+##  Sepal_Length prediction
+##1          5.1   5.063856
+##2          4.9   4.662076
+##3          4.7   4.822788
+##4          4.6   4.742432
+##5          5.0   5.144212
+##6          5.4   5.385281
+{% endhighlight %}
+</div>
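
Note: the example added by this patch exercises only the '~' and '+' formula operators. A minimal sketch of the other two operators the new text lists, '.' and '-', might look like the following. It is not part of the patch. It assumes a running SparkR 1.5 session in which `sqlContext` already exists, and the choice of predictors is purely illustrative.

```r
# Assumption: `sqlContext` was created earlier in the session, as in the
# example from the patch above.
df <- createDataFrame(sqlContext, iris)

# '.' stands for every column other than the response, and '-' removes a
# term, so this fits Sepal_Length on all remaining columns except Species.
model <- glm(Sepal_Length ~ . - Species, data = df, family = "gaussian")

# Inspect the fitted coefficients (output intentionally not shown).
summary(model)
```

The formula is written with `Sepal_Length` and not `Sepal.Length` because, as in the patch's own example, the DataFrame columns carry underscores in place of dots.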