path: root/docs/sparkr.md
author     Felix Cheung <felixcheung_m@hotmail.com>    2016-07-13 15:09:23 -0700
committer  Shivaram Venkataraman <shivaram@cs.berkeley.edu>    2016-07-13 15:09:23 -0700
commit     fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3 (patch)
tree       6a99fcdae0d27149a93b7b2a48ffef5480579b20 /docs/sparkr.md
parent     b4baf086ca380a46d953f2710184ad9eee3a045e (diff)
download   spark-fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3.tar.gz
           spark-fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3.tar.bz2
           spark-fb2e8eeb0b1e56bea535165f7a3bec6558b3f4a3.zip
[SPARKR][DOCS][MINOR] R programming guide to include csv data source example
## What changes were proposed in this pull request?

Minor documentation update for code example, code style, and missed reference to "sparkR.init".

## How was this patch tested?

manual

shivaram

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #14178 from felixcheung/rcsvprogrammingguide.
Diffstat (limited to 'docs/sparkr.md')
-rw-r--r--   docs/sparkr.md   27
1 file changed, 18 insertions, 9 deletions
diff --git a/docs/sparkr.md b/docs/sparkr.md
index b4acb23040..9fda0ec0e6 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -111,19 +111,17 @@ head(df)
SparkR supports operating on a variety of data sources through the `SparkDataFrame` interface. This section describes the general methods for loading and saving data using Data Sources. You can check the Spark SQL programming guide for more [specific options](sql-programming-guide.html#manually-specifying-options) that are available for the built-in data sources.
The general method for creating SparkDataFrames from data sources is `read.df`. This method takes in the path for the file to load and the type of data source, and the currently active SparkSession will be used automatically. SparkR supports reading JSON, CSV and Parquet files natively and through [Spark Packages](http://spark-packages.org/) you can find data source connectors for popular file formats like [Avro](http://spark-packages.org/package/databricks/spark-avro). These packages can either be added by
-specifying `--packages` with `spark-submit` or `sparkR` commands, or if creating context through `init`
-you can specify the packages with the `packages` argument.
+specifying `--packages` with `spark-submit` or `sparkR` commands, or, when working in an interactive R shell or from RStudio, by passing the `sparkPackages` parameter when initializing the SparkSession.
<div data-lang="r" markdown="1">
{% highlight r %}
-sc <- sparkR.session(sparkPackages="com.databricks:spark-avro_2.11:3.0.0")
+sc <- sparkR.session(sparkPackages = "com.databricks:spark-avro_2.11:3.0.0")
{% endhighlight %}
</div>
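
Once the package has been added to the session, the same `read.df` call can name the external data source by its fully qualified name. The following is a minimal sketch, assuming a hypothetical `/path/to/users.avro` input file; the source name follows the spark-avro package's conventions.

<div data-lang="r" markdown="1">
{% highlight r %}
# Read an Avro file through the spark-avro connector added above.
# "/path/to/users.avro" is a placeholder path, not a file shipped with Spark.
avroDF <- read.df("/path/to/users.avro", "com.databricks.spark.avro")
head(avroDF)
{% endhighlight %}
</div>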
We can see how to use data sources with an example JSON input file. Note that the file used here is _not_ a typical JSON file. Each line in the file must contain a separate, self-contained valid JSON object. As a consequence, a regular multi-line JSON file will most often fail.
<div data-lang="r" markdown="1">
-
{% highlight r %}
people <- read.df("./examples/src/main/resources/people.json", "json")
head(people)
@@ -138,6 +136,18 @@ printSchema(people)
# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
+# Similarly, multiple files can be read with read.json
+people <- read.json(c("./examples/src/main/resources/people.json", "./examples/src/main/resources/people2.json"))
+
+{% endhighlight %}
+</div>
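+
+To make the one-object-per-line requirement concrete, the sketch below writes such a file with base R and reads it back; the `/tmp/people_lines.json` path and the two records are illustrative only.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+# Each line is a complete, self-contained JSON object (one record per line).
+writeLines(c('{"name":"Alice", "age":30}',
+             '{"name":"Bob", "age":25}'),
+           "/tmp/people_lines.json")
+linesDF <- read.df("/tmp/people_lines.json", "json")
+printSchema(linesDF)
+{% endhighlight %}
+</div>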
+
The data sources API natively supports CSV formatted input files. For more information, please refer to the SparkR [read.df](api/R/read.df.html) API documentation.
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
+
{% endhighlight %}
</div>
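
The `csvPath` variable above is expected to hold the location of a CSV file with a header row. As a hedged illustration, it could be set to a local placeholder path such as the one below before calling `read.df`.

<div data-lang="r" markdown="1">
{% highlight r %}
# "/path/to/people.csv" is a placeholder; point csvPath at any CSV file with a header.
csvPath <- "/path/to/people.csv"
df <- read.df(csvPath, "csv", header = "true", inferSchema = "true", na.strings = "NA")
head(df)
{% endhighlight %}
</div>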
@@ -146,7 +156,7 @@ to a Parquet file using `write.df`.
<div data-lang="r" markdown="1">
{% highlight r %}
-write.df(people, path="people.parquet", source="parquet", mode="overwrite")
+write.df(people, path = "people.parquet", source = "parquet", mode = "overwrite")
{% endhighlight %}
</div>
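
The saved output can be loaded back through the same data sources API. Below is a minimal sketch using SparkR's `read.parquet`, assuming the `people.parquet` path written above.

<div data-lang="r" markdown="1">
{% highlight r %}
# Read the Parquet output back into a SparkDataFrame.
peopleParquet <- read.parquet("people.parquet")
head(peopleParquet)
{% endhighlight %}
</div>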
@@ -264,14 +274,14 @@ In SparkR, we support several kinds of User-Defined Functions:
Apply a function to each partition of a `SparkDataFrame`. The function to be applied to each partition of the `SparkDataFrame` should have only one parameter, to which a `data.frame` corresponding to each partition will be passed. The output of the function
should be a `data.frame`. The schema specifies the row format of the resulting `SparkDataFrame` and must match the R function's output.
+
<div data-lang="r" markdown="1">
{% highlight r %}
-
# Convert waiting time from hours to seconds.
# Note that we can apply UDF to DataFrame.
schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
structField("waiting_secs", "double"))
-df1 <- dapply(df, function(x) {x <- cbind(x, x$waiting * 60)}, schema)
+df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
head(collect(df1))
## eruptions waiting waiting_secs
##1 3.600 79 4740
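# As a related sketch (assuming SparkR's dapplyCollect is available): apply the
# same transformation and collect the result as a local R data.frame, so no
# schema argument is needed.
ldf <- dapplyCollect(df, function(x) { cbind(x, waiting_secs = x$waiting * 60) })
head(ldf)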
@@ -313,9 +323,9 @@ Similar to `lapply` in native R, `spark.lapply` runs a function over a list of e
Applies a function, in a manner similar to `doParallel` or `lapply`, to the elements of a list. The results of all the computations
should fit on a single machine. If that is not the case, you can do something like `df <- createDataFrame(list)` and then use
`dapply`.
+
<div data-lang="r" markdown="1">
{% highlight r %}
-
# Perform distributed training of multiple models with spark.lapply. Here, we pass
# a read-only list of arguments which specifies family the generalized linear model should be.
families <- c("gaussian", "poisson")
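# A plausible continuation of this sketch, assuming base R's glm and the built-in
# iris dataset are available on the workers: fit one generalized linear model per
# family and return its summary to the driver.
train <- function(family) {
  model <- glm(Sepal.Length ~ Sepal.Width + Species, iris, family = family)
  summary(model)
}
# Returns a list of model summaries, one per element of `families`.
model.summaries <- spark.lapply(families, train)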
@@ -436,4 +446,3 @@ You can inspect the search path in R with [`search()`](https://stat.ethz.ch/R-ma
- The method `registerTempTable` has been deprecated to be replaced by `createOrReplaceTempView`.
- The method `dropTempTable` has been deprecated to be replaced by `dropTempView`.
- The `sc` SparkContext parameter is no longer required for these functions: `setJobGroup`, `clearJobGroup`, `cancelJobGroup`
-