diff options
author | Kai Jiang <jiangkai@gmail.com> | 2016-06-22 12:50:36 -0700 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2016-06-22 12:50:36 -0700 |
commit | 43b04b7ecb313a2cee6121dd575de1f7dc785c11 (patch) | |
tree | 80a1ea03e546148d5b910a426564fc61b7c3b2e1 /docs/sparkr.md | |
parent | 6f915c9ec24003877d1ef675a59145699780a2ff (diff) | |
download | spark-43b04b7ecb313a2cee6121dd575de1f7dc785c11.tar.gz spark-43b04b7ecb313a2cee6121dd575de1f7dc785c11.tar.bz2 spark-43b04b7ecb313a2cee6121dd575de1f7dc785c11.zip |
[SPARK-15672][R][DOC] R programming guide update
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes #13660 from vectorijk/spark-15672-R-guide-update.
Diffstat (limited to 'docs/sparkr.md')
-rw-r--r-- | docs/sparkr.md | 77 |
1 files changed, 77 insertions, 0 deletions
diff --git a/docs/sparkr.md b/docs/sparkr.md index f0189012f3..9e74e4a96a 100644 --- a/docs/sparkr.md +++ b/docs/sparkr.md @@ -255,6 +255,83 @@ head(df) {% endhighlight %} </div> +### Applying User-Defined Function +In SparkR, we support several kinds of User-Defined Functions: + +#### Run a given function on a large dataset using `dapply` or `dapplyCollect` + +##### dapply +Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` +and should have only one parameter, to which a `data.frame` corresponds to each partition will be passed. The output of function +should be a `data.frame`. Schema specifies the row format of the resulting a `SparkDataFrame`. It must match the R function's output. +<div data-lang="r" markdown="1"> +{% highlight r %} + +# Convert waiting time from hours to seconds. +# Note that we can apply UDF to DataFrame. +schema <- structType(structField("eruptions", "double"), structField("waiting", "double"), + structField("waiting_secs", "double")) +df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema) +head(collect(df1)) +## eruptions waiting waiting_secs +##1 3.600 79 4740 +##2 1.800 54 3240 +##3 3.333 74 4440 +##4 2.283 62 3720 +##5 4.533 85 5100 +##6 2.883 55 3300 +{% endhighlight %} +</div> + +##### dapplyCollect +Like `dapply`, apply a function to each partition of a `SparkDataFrame` and collect the result back. The output of function +should be a `data.frame`. But, Schema is not required to be passed. Note that `dapplyCollect` only can be used if the +output of UDF run on all the partitions can fit in driver memory. +<div data-lang="r" markdown="1"> +{% highlight r %} + +# Convert waiting time from hours to seconds. +# Note that we can apply UDF to DataFrame and return a R's data.frame +ldf <- dapplyCollect( + df, + function(x) { + x <- cbind(x, "waiting_secs" = x$waiting * 60) + }) +head(ldf, 3) +## eruptions waiting waiting_secs +##1 3.600 79 4740 +##2 1.800 54 3240 +##3 3.333 74 4440 + +{% endhighlight %} +</div> + +#### Run local R functions distributed using `spark.lapply` + +##### spark.lapply +Similar to `lapply` in native R, `spark.lapply` runs a function over a list of elements and distributes the computations with Spark. +Applies a function in a manner that is similar to `doParallel` or `lapply` to elements of a list. The results of all the computations +should fit in a single machine. If that is not the case they can do something like `df <- createDataFrame(list)` and then use +`dapply` +<div data-lang="r" markdown="1"> +{% highlight r %} + +# Perform distributed training of multiple models with spark.lapply. Here, we pass +# a read-only list of arguments which specifies family the generalized linear model should be. +families <- c("gaussian", "poisson") +train <- function(family) { + model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family) + summary(model) +} +# Return a list of model's summaries +model.summaries <- spark.lapply(families, train) + +# Print the summary of each model +print(model.summaries) + +{% endhighlight %} +</div> + ## Running SQL Queries from SparkR A SparkDataFrame can also be registered as a temporary view in Spark SQL and that allows you to run SQL queries over its data. The `sql` function enables applications to run SQL queries programmatically and returns the result as a `SparkDataFrame`. |