-rw-r--r--  R/pkg/R/DataFrame.R |   2
-rw-r--r--  R/pkg/R/sparkR.R    |  16
-rw-r--r--  docs/sparkr.md      | 107
3 files changed, 62 insertions(+), 63 deletions(-)
diff --git a/R/pkg/R/DataFrame.R b/R/pkg/R/DataFrame.R
index 2e99aa026d..a4733313ed 100644
--- a/R/pkg/R/DataFrame.R
+++ b/R/pkg/R/DataFrame.R
@@ -35,7 +35,7 @@ setOldClass("structType")
#' @slot env An R environment that stores bookkeeping states of the SparkDataFrame
#' @slot sdf A Java object reference to the backing Scala DataFrame
#' @seealso \link{createDataFrame}, \link{read.json}, \link{table}
-#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkdataframe}
+#' @seealso \url{https://spark.apache.org/docs/latest/sparkr.html#sparkr-dataframes}
#' @export
#' @examples
#'\dontrun{
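A minimal usage sketch of the constructors referenced by the `@seealso` tags above, assuming SparkR is attached and a session is running; the JSON path is illustrative, not taken from the patch:

{% highlight r %}
library(SparkR)
sparkR.session()                      # start (or connect to) a Spark session

# createDataFrame: build a SparkDataFrame from a local R data.frame
df <- createDataFrame(faithful)
head(df)

# read.json: build a SparkDataFrame from a JSON file (path is illustrative)
# people <- read.json("path/to/people.json")

sparkR.session.stop()
{% endhighlight %}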
diff --git a/R/pkg/R/sparkR.R b/R/pkg/R/sparkR.R
index ff5297ffd5..524f7c4a26 100644
--- a/R/pkg/R/sparkR.R
+++ b/R/pkg/R/sparkR.R
@@ -28,14 +28,6 @@ connExists <- function(env) {
})
}
-#' @rdname sparkR.session.stop
-#' @name sparkR.stop
-#' @export
-#' @note sparkR.stop since 1.4.0
-sparkR.stop <- function() {
- sparkR.session.stop()
-}
-
#' Stop the Spark Session and Spark Context
#'
#' Stop the Spark Session and Spark Context.
@@ -90,6 +82,14 @@ sparkR.session.stop <- function() {
clearJobjs()
}
+#' @rdname sparkR.session.stop
+#' @name sparkR.stop
+#' @export
+#' @note sparkR.stop since 1.4.0
+sparkR.stop <- function() {
+ sparkR.session.stop()
+}
+
#' (Deprecated) Initialize a new Spark Context
#'
#' This function initializes a new SparkContext.
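For orientation, a hedged sketch of the session lifecycle these roxygen blocks document: `sparkR.session()` is the entry point since Spark 2.0, `sparkR.session.stop()` tears it down, and `sparkR.stop()` (relocated in the hunk above) remains a thin alias; the `appName` and `master` values are illustrative.

{% highlight r %}
library(SparkR)

# Preferred since Spark 2.0: starting a session also creates the SparkContext
sparkR.session(appName = "lifecycle-example", master = "local[*]")

# ... run SparkR code ...

# Stop the Spark session and the underlying Spark Context;
# sparkR.stop() is an alias kept for backward compatibility (since 1.4.0)
sparkR.session.stop()
{% endhighlight %}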
diff --git a/docs/sparkr.md b/docs/sparkr.md
index dfa5278ef8..4bbc362c52 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -322,8 +322,59 @@ head(ldf, 3)
Apply a function to each group of a `SparkDataFrame`. The function is to be applied to each group of the `SparkDataFrame` and should have only two parameters: grouping key and R `data.frame` corresponding to
that key. The groups are chosen from `SparkDataFrame`s column(s).
The output of function should be a `data.frame`. Schema specifies the row format of the resulting
-`SparkDataFrame`. It must represent R function's output schema on the basis of Spark data types. The column names of the returned `data.frame` are set by user. Below is the data type mapping between R
-and Spark.
+`SparkDataFrame`. It must represent the R function's output schema on the basis of Spark [data types](#data-type-mapping-between-r-and-spark). The column names of the returned `data.frame` are set by the user.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+# Determine six waiting times with the largest eruption time in minutes.
+schema <- structType(structField("waiting", "double"), structField("max_eruption", "double"))
+result <- gapply(
+ df,
+ "waiting",
+ function(key, x) {
+ y <- data.frame(key, max(x$eruptions))
+ },
+ schema)
+head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
+
+## waiting max_eruption
+##1 64 5.100
+##2 69 5.067
+##3 71 5.033
+##4 87 5.000
+##5 63 4.933
+##6 89 4.900
+{% endhighlight %}
+</div>
+
+##### gapplyCollect
+Like `gapply`, applies a function to each group of a `SparkDataFrame` and collects the result back to an R `data.frame`. The output of the function should be a `data.frame`, but no schema needs to be passed. Note that `gapplyCollect` can fail if the output of the UDF run on all the groups cannot be pulled to the driver and fit into driver memory.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+
+# Determine six waiting times with the largest eruption time in minutes.
+result <- gapplyCollect(
+ df,
+ "waiting",
+ function(key, x) {
+ y <- data.frame(key, max(x$eruptions))
+ colnames(y) <- c("waiting", "max_eruption")
+ y
+ })
+head(result[order(result$max_eruption, decreasing = TRUE), ])
+
+## waiting max_eruption
+##1 64 5.100
+##2 69 5.067
+##3 71 5.033
+##4 87 5.000
+##5 63 4.933
+##6 89 4.900
+
+{% endhighlight %}
+</div>
#### Data type mapping between R and Spark
<table class="table">
@@ -394,58 +445,6 @@ and Spark.
</tr>
</table>
-<div data-lang="r" markdown="1">
-{% highlight r %}
-
-# Determine six waiting times with the largest eruption time in minutes.
-schema <- structType(structField("waiting", "double"), structField("max_eruption", "double"))
-result <- gapply(
- df,
- "waiting",
- function(key, x) {
- y <- data.frame(key, max(x$eruptions))
- },
- schema)
-head(collect(arrange(result, "max_eruption", decreasing = TRUE)))
-
-## waiting max_eruption
-##1 64 5.100
-##2 69 5.067
-##3 71 5.033
-##4 87 5.000
-##5 63 4.933
-##6 89 4.900
-{% endhighlight %}
-</div>
-
-##### gapplyCollect
-Like `gapply`, applies a function to each partition of a `SparkDataFrame` and collect the result back to R data.frame. The output of the function should be a `data.frame`. But, the schema is not required to be passed. Note that `gapplyCollect` can fail if the output of UDF run on all the partition cannot be pulled to the driver and fit in driver memory.
-
-<div data-lang="r" markdown="1">
-{% highlight r %}
-
-# Determine six waiting times with the largest eruption time in minutes.
-result <- gapplyCollect(
- df,
- "waiting",
- function(key, x) {
- y <- data.frame(key, max(x$eruptions))
- colnames(y) <- c("waiting", "max_eruption")
- y
- })
-head(result[order(result$max_eruption, decreasing = TRUE), ])
-
-## waiting max_eruption
-##1 64 5.100
-##2 69 5.067
-##3 71 5.033
-##4 87 5.000
-##5 63 4.933
-##6 89 4.900
-
-{% endhighlight %}
-</div>
-
#### Run local R functions distributed using `spark.lapply`
##### spark.lapply