-rw-r--r--  R/pkg/R/generics.R  |  4
-rw-r--r--  R/pkg/R/mllib.R     |  8
-rw-r--r--  docs/sparkr.md      | 37
3 files changed, 42 insertions, 7 deletions
diff --git a/R/pkg/R/generics.R b/R/pkg/R/generics.R
index c43b947129..379a78b1d8 100644
--- a/R/pkg/R/generics.R
+++ b/R/pkg/R/generics.R
@@ -535,8 +535,8 @@ setGeneric("showDF", function(x,...) { standardGeneric("showDF") })
#' @export
setGeneric("summarize", function(x,...) { standardGeneric("summarize") })
-##' rdname summary
-##' @export
+#' @rdname summary
+#' @export
setGeneric("summary", function(x, ...) { standardGeneric("summary") })
# @rdname tojson
diff --git a/R/pkg/R/mllib.R b/R/pkg/R/mllib.R
index b524d1fd87..cea3d760d0 100644
--- a/R/pkg/R/mllib.R
+++ b/R/pkg/R/mllib.R
@@ -56,10 +56,10 @@ setMethod("glm", signature(formula = "formula", family = "ANY", data = "DataFram
#'
#' Makes predictions from a model produced by glm(), similarly to R's predict().
#'
-#' @param model A fitted MLlib model
+#' @param object A fitted MLlib model
#' @param newData DataFrame for testing
#' @return DataFrame containing predicted values
-#' @rdname glm
+#' @rdname predict
#' @export
#' @examples
#'\dontrun{
@@ -76,10 +76,10 @@ setMethod("predict", signature(object = "PipelineModel"),
#'
#' Returns the summary of a model produced by glm(), similarly to R's summary().
#'
-#' @param model A fitted MLlib model
+#' @param x A fitted MLlib model
#' @return a list with a 'coefficient' component, which is the matrix of coefficients. See
#' summary.glm for more information.
-#' @rdname glm
+#' @rdname summary
#' @export
#' @examples
#'\dontrun{
diff --git a/docs/sparkr.md b/docs/sparkr.md
index 4385a4eeac..7139d16b4a 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -11,7 +11,8 @@ title: SparkR (R on Spark)
SparkR is an R package that provides a light-weight frontend to use Apache Spark from R.
In Spark {{site.SPARK_VERSION}}, SparkR provides a distributed data frame implementation that
supports operations like selection, filtering, aggregation etc. (similar to R data frames,
-[dplyr](https://github.com/hadley/dplyr)) but on large datasets.
+[dplyr](https://github.com/hadley/dplyr)) but on large datasets. SparkR also supports distributed
+machine learning using MLlib.
# SparkR DataFrames
@@ -230,3 +231,37 @@ head(teenagers)
{% endhighlight %}
</div>
+
+# Machine Learning
+
+SparkR allows the fitting of generalized linear models over DataFrames using the [glm()](api/R/glm.html) function. Under the hood, SparkR uses MLlib to train a model of the specified family. Currently the gaussian and binomial families are supported. We support a subset of the available R formula operators for model fitting, including '~', '.', '+', and '-'. The example below shows how to build a gaussian GLM model using SparkR.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+# Create the DataFrame
+df <- createDataFrame(sqlContext, iris)
+
+# Fit a linear model over the dataset.
+model <- glm(Sepal_Length ~ Sepal_Width + Species, data = df, family = "gaussian")
+
+# Model coefficients are returned in a similar format to R's native glm().
+summary(model)
+##$coefficients
+## Estimate
+##(Intercept) 2.2513930
+##Sepal_Width 0.8035609
+##Species_versicolor 1.4587432
+##Species_virginica 1.9468169
+
+# Make predictions based on the model.
+predictions <- predict(model, newData = df)
+head(select(predictions, "Sepal_Length", "prediction"))
+## Sepal_Length prediction
+##1 5.1 5.063856
+##2 4.9 4.662076
+##3 4.7 4.822788
+##4 4.6 4.742432
+##5 5.0 5.144212
+##6 5.4 5.385281
+{% endhighlight %}
+</div>
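
For the binomial family mentioned above, a minimal sketch follows; the 0/1 label column `is_setosa` and the reuse of the iris data are illustrative assumptions, not part of the example above.

<div data-lang="r" markdown="1">
{% highlight r %}
# Sketch: derive a 0/1 label locally, then convert to a DataFrame.
# The column name "is_setosa" is hypothetical.
localIris <- iris
localIris$is_setosa <- as.numeric(localIris$Species == "setosa")
df2 <- createDataFrame(sqlContext, localIris)

# With family = "binomial", glm() fits a logistic regression through MLlib.
logitModel <- glm(is_setosa ~ Sepal_Length + Sepal_Width, data = df2, family = "binomial")

# Predictions follow the same pattern as the gaussian example above.
predictions2 <- predict(logitModel, newData = df2)
head(select(predictions2, "is_setosa", "prediction"))
{% endhighlight %}
</div>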