author    Davies Liu <davies@databricks.com>  2015-05-23 00:00:30 -0700
committer Shivaram Venkataraman <shivaram@cs.berkeley.edu>  2015-05-23 00:01:40 -0700
commit    7af3818c6b2bf35bfa531ab7cc3a4a714385015e (patch)
tree      e7dcb33da71845eaed6808045725882d6ba07796 /docs
parent    4583cf4be17155c68178155acf6866d7cc8f7df0 (diff)
[SPARK-6806] [SPARKR] [DOCS] Fill in SparkR examples in programming guide
sqlCtx -> sqlContext

You can check the docs by:

```
$ cd docs
$ SKIP_SCALADOC=1 jekyll serve
```

cc shivaram

Author: Davies Liu <davies@databricks.com>

Closes #5442 from davies/r_docs and squashes the following commits:

7a12ec6 [Davies Liu] remove rdd in R docs
8496b26 [Davies Liu] remove the docs related to RDD
e23b9d6 [Davies Liu] delete R docs for RDD API
222e4ff [Davies Liu] Merge branch 'master' into r_docs
89684ce [Davies Liu] Merge branch 'r_docs' of github.com:davies/spark into r_docs
f0a10e1 [Davies Liu] address comments from @shivaram
f61de71 [Davies Liu] Update pairRDD.R
3ef7cf3 [Davies Liu] use + instead of function(a,b) a+b
2f10a77 [Davies Liu] address comments from @cafreeman
9c2a062 [Davies Liu] mention R api together with Python API
23f751a [Davies Liu] Fill in SparkR examples in programming guide
Diffstat (limited to 'docs')
-rw-r--r--  docs/_plugins/copy_api_dirs.rb |  68
-rw-r--r--  docs/api.md                    |   3
-rw-r--r--  docs/index.md                  |  23
-rw-r--r--  docs/programming-guide.md      |  21
-rw-r--r--  docs/quick-start.md            |  18
-rw-r--r--  docs/sql-programming-guide.md  | 373
6 files changed, 445 insertions, 61 deletions
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index 0ea3f8eab4..6073b3626c 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -18,50 +18,52 @@
require 'fileutils'
include FileUtils
-if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
- # Build Scaladoc for Java/Scala
+if not (ENV['SKIP_API'] == '1')
+ if not (ENV['SKIP_SCALADOC'] == '1')
+ # Build Scaladoc for Java/Scala
- puts "Moving to project root and building API docs."
- curr_dir = pwd
- cd("..")
+ puts "Moving to project root and building API docs."
+ curr_dir = pwd
+ cd("..")
- puts "Running 'build/sbt -Pkinesis-asl compile unidoc' from " + pwd + "; this may take a few minutes..."
- puts `build/sbt -Pkinesis-asl compile unidoc`
+ puts "Running 'build/sbt -Pkinesis-asl compile unidoc' from " + pwd + "; this may take a few minutes..."
+ puts `build/sbt -Pkinesis-asl compile unidoc`
- puts "Moving back into docs dir."
- cd("docs")
+ puts "Moving back into docs dir."
+ cd("docs")
- # Copy over the unified ScalaDoc for all projects to api/scala.
- # This directory will be copied over to _site when `jekyll` command is run.
- source = "../target/scala-2.10/unidoc"
- dest = "api/scala"
+ # Copy over the unified ScalaDoc for all projects to api/scala.
+ # This directory will be copied over to _site when `jekyll` command is run.
+ source = "../target/scala-2.10/unidoc"
+ dest = "api/scala"
- puts "Making directory " + dest
- mkdir_p dest
+ puts "Making directory " + dest
+ mkdir_p dest
- # From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't.
- puts "cp -r " + source + "/. " + dest
- cp_r(source + "/.", dest)
+ # From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't.
+ puts "cp -r " + source + "/. " + dest
+ cp_r(source + "/.", dest)
- # Append custom JavaScript
- js = File.readlines("./js/api-docs.js")
- js_file = dest + "/lib/template.js"
- File.open(js_file, 'a') { |f| f.write("\n" + js.join()) }
+ # Append custom JavaScript
+ js = File.readlines("./js/api-docs.js")
+ js_file = dest + "/lib/template.js"
+ File.open(js_file, 'a') { |f| f.write("\n" + js.join()) }
- # Append custom CSS
- css = File.readlines("./css/api-docs.css")
- css_file = dest + "/lib/template.css"
- File.open(css_file, 'a') { |f| f.write("\n" + css.join()) }
+ # Append custom CSS
+ css = File.readlines("./css/api-docs.css")
+ css_file = dest + "/lib/template.css"
+ File.open(css_file, 'a') { |f| f.write("\n" + css.join()) }
- # Copy over the unified JavaDoc for all projects to api/java.
- source = "../target/javaunidoc"
- dest = "api/java"
+ # Copy over the unified JavaDoc for all projects to api/java.
+ source = "../target/javaunidoc"
+ dest = "api/java"
- puts "Making directory " + dest
- mkdir_p dest
+ puts "Making directory " + dest
+ mkdir_p dest
- puts "cp -r " + source + "/. " + dest
- cp_r(source + "/.", dest)
+ puts "cp -r " + source + "/. " + dest
+ cp_r(source + "/.", dest)
+ end
# Build Sphinx docs for Python
diff --git a/docs/api.md b/docs/api.md
index 0346038333..45df77ac05 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -7,4 +7,5 @@ Here you can find API docs for Spark and its submodules.
- [Spark Scala API (Scaladoc)](api/scala/index.html)
- [Spark Java API (Javadoc)](api/java/index.html)
-- [Spark Python API (Epydoc)](api/python/index.html)
+- [Spark Python API (Sphinx)](api/python/index.html)
+- [Spark R API (Roxygen2)](api/R/index.html)
diff --git a/docs/index.md b/docs/index.md
index b5b016e347..5ef6d983c4 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -6,7 +6,7 @@ description: Apache Spark SPARK_VERSION_SHORT documentation homepage
---
Apache Spark is a fast and general-purpose cluster computing system.
-It provides high-level APIs in Java, Scala and Python,
+It provides high-level APIs in Java, Scala, Python and R,
and an optimized engine that supports general execution graphs.
It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
@@ -20,13 +20,13 @@ Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy
locally on one machine --- all you need is to have `java` installed on your system `PATH`,
or the `JAVA_HOME` environment variable pointing to a Java installation.
-Spark runs on Java 6+ and Python 2.6+. For the Scala API, Spark {{site.SPARK_VERSION}} uses
+Spark runs on Java 6+, Python 2.6+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} uses
Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version
({{site.SCALA_BINARY_VERSION}}.x).
# Running the Examples and Shell
-Spark comes with several sample programs. Scala, Java and Python examples are in the
+Spark comes with several sample programs. Scala, Java, Python and R examples are in the
`examples/src/main` directory. To run one of the Java or Scala sample programs, use
`bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this
invokes the more general
@@ -54,6 +54,15 @@ Example applications are also provided in Python. For example,
./bin/spark-submit examples/src/main/python/pi.py 10
+Since 1.4, Spark also provides an experimental R API (only the DataFrame API is included).
+To run Spark interactively in an R interpreter, use `bin/sparkR`:
+
+ ./bin/sparkR --master local[2]
+
+Example applications are also provided in R. For example,
+
+ ./bin/spark-submit examples/src/main/r/dataframe.R
+
# Launching on a Cluster
The Spark [cluster mode overview](cluster-overview.html) explains the key concepts in running on a cluster.
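To make the new R entry point concrete, here is a minimal, hedged sketch of an interactive `bin/sparkR` session; it assumes the shell pre-creates the `sc` SparkContext and uses only functions that appear later in this diff:

```r
# `sc` is created automatically by bin/sparkR.
sqlContext <- sparkRSQL.init(sc)

# Build a DataFrame from the sample JSON data shipped with Spark and inspect it.
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
printSchema(df)
showDF(df)
```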
@@ -71,7 +80,7 @@ options for deployment:
* [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
* [Spark Programming Guide](programming-guide.html): detailed overview of Spark
- in all supported languages (Scala, Java, Python)
+ in all supported languages (Scala, Java, Python, R)
* Modules built on Spark:
* [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
* [Spark SQL and DataFrames](sql-programming-guide.html): support for structured data and relational queries
@@ -83,7 +92,8 @@ options for deployment:
* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
* [Spark Java API (Javadoc)](api/java/index.html)
-* [Spark Python API (Epydoc)](api/python/index.html)
+* [Spark Python API (Sphinx)](api/python/index.html)
+* [Spark R API (Roxygen2)](api/R/index.html)
**Deployment Guides:**
@@ -124,4 +134,5 @@ options for deployment:
available online for free.
* [Code Examples](http://spark.apache.org/examples.html): more are also available in the `examples` subfolder of Spark ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
- [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python))
+ [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
+ [R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r))
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 07a4d29fe7..5d9df282ef 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -98,9 +98,9 @@ to your version of HDFS. Some common HDFS version tags are listed on the
[Prebuilt packages](http://spark.apache.org/downloads.html) are also available on the Spark homepage
for common HDFS versions.
-Finally, you need to import some Spark classes into your program. Add the following lines:
+Finally, you need to import some Spark classes into your program. Add the following line:
-{% highlight scala %}
+{% highlight python %}
from pyspark import SparkContext, SparkConf
{% endhighlight %}
@@ -478,7 +478,6 @@ the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main
for examples of using Cassandra / HBase ```InputFormat``` and ```OutputFormat``` with custom converters.
</div>
-
</div>
## RDD Operations
@@ -915,7 +914,8 @@ The following table lists some of the common transformations supported by Spark.
RDD API doc
([Scala](api/scala/index.html#org.apache.spark.rdd.RDD),
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
- [Python](api/python/pyspark.html#pyspark.RDD))
+ [Python](api/python/pyspark.html#pyspark.RDD),
+ [R](api/R/index.html))
and pair RDD functions doc
([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions),
[Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
@@ -1028,7 +1028,9 @@ The following table lists some of the common actions supported by Spark. Refer t
RDD API doc
([Scala](api/scala/index.html#org.apache.spark.rdd.RDD),
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
- [Python](api/python/pyspark.html#pyspark.RDD))
+ [Python](api/python/pyspark.html#pyspark.RDD),
+ [R](api/R/index.html))
+
and pair RDD functions doc
([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions),
[Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
@@ -1565,7 +1567,8 @@ You can see some [example Spark programs](http://spark.apache.org/examples.html)
In addition, Spark includes several samples in the `examples` directory
([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
- [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python)).
+ [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
+ [R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r)).
You can run Java and Scala examples by passing the class name to Spark's `bin/run-example` script; for instance:
./bin/run-example SparkPi
@@ -1574,6 +1577,10 @@ For Python examples, use `spark-submit` instead:
./bin/spark-submit examples/src/main/python/pi.py
+For R examples, use `spark-submit` instead:
+
+ ./bin/spark-submit examples/src/main/r/dataframe.R
+
For help on optimizing your programs, the [configuration](configuration.html) and
[tuning](tuning.html) guides provide information on best practices. They are especially important for
making sure that your data is stored in memory in an efficient format.
@@ -1581,4 +1588,4 @@ For help on deploying, the [cluster mode overview](cluster-overview.html) descri
in distributed operation and supported cluster managers.
Finally, full API documentation is available in
-[Scala](api/scala/#org.apache.spark.package), [Java](api/java/) and [Python](api/python/).
+[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/).
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 81143da865..bb39e4111f 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -184,10 +184,10 @@ scala> linesWithSpark.cache()
res7: spark.RDD[String] = spark.FilteredRDD@17e51082
scala> linesWithSpark.count()
-res8: Long = 15
+res8: Long = 19
scala> linesWithSpark.count()
-res9: Long = 15
+res9: Long = 19
{% endhighlight %}
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
@@ -202,10 +202,10 @@ a cluster, as described in the [programming guide](programming-guide.html#initia
>>> linesWithSpark.cache()
>>> linesWithSpark.count()
-15
+19
>>> linesWithSpark.count()
-15
+19
{% endhighlight %}
It may seem silly to use Spark to explore and cache a 100-line text file. The interesting part is
@@ -423,14 +423,14 @@ dependencies to `spark-submit` through its `--py-files` argument by packaging th
We can run this application using the `bin/spark-submit` script:
-{% highlight python %}
+{% highlight bash %}
# Use spark-submit to run your application
$ YOUR_SPARK_HOME/bin/spark-submit \
--master local[4] \
SimpleApp.py
...
Lines with a: 46, Lines with b: 23
-{% endhighlight python %}
+{% endhighlight %}
</div>
</div>
@@ -444,7 +444,8 @@ Congratulations on running your first Spark application!
* Finally, Spark includes several samples in the `examples` directory
([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
[Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
- [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python)).
+ [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
+ [R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r)).
You can run them as follows:
{% highlight bash %}
@@ -453,4 +454,7 @@ You can run them as follows:
# For Python examples, use spark-submit directly:
./bin/spark-submit examples/src/main/python/pi.py
+
+# For R examples, use spark-submit directly:
+./bin/spark-submit examples/src/main/r/dataframe.R
{% endhighlight %}
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 78b8e8ad51..5b41c0ee6e 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -16,9 +16,9 @@ Spark SQL is a Spark module for structured data processing. It provides a progra
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
-The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), and [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame).
+The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
-All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell` or the `pyspark` shell.
+All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
## Starting Point: `SQLContext`
@@ -65,6 +65,17 @@ sqlContext = SQLContext(sc)
{% endhighlight %}
</div>
+
+<div data-lang="r" markdown="1">
+
+The entry point into all relational functionality in Spark is the
+`SQLContext` class, or one of its descendants. To create a basic `SQLContext`, all you need is a SparkContext.
+
+{% highlight r %}
+sqlContext <- sparkRSQL.init(sc)
+{% endhighlight %}
+
+</div>
</div>
In addition to the basic `SQLContext`, you can also create a `HiveContext`, which provides a
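As a companion to the R snippet above, a minimal, hedged sketch of obtaining both contexts in a standalone SparkR script rather than the `sparkR` shell; `library(SparkR)` and `sparkR.init` are assumptions here, as they do not appear in this diff:

```r
library(SparkR)

# In a standalone script the SparkContext must be created explicitly;
# in the sparkR shell it is already available as `sc`.
sc <- sparkR.init(master = "local[2]", appName = "SparkR-SQL-example")
sqlContext <- sparkRSQL.init(sc)
```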
@@ -130,6 +141,19 @@ df.show()
{% endhighlight %}
</div>
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+sqlContext <- sparkRSQL.init(sc)
+
+df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
+
+# Displays the content of the DataFrame to stdout
+showDF(df)
+{% endhighlight %}
+
+</div>
+
</div>
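A hedged alternative path for the R tab: building a DataFrame from a local R data.frame instead of a JSON file. `createDataFrame` is assumed, as it does not appear in this diff; the sample values are illustrative.

```r
# Build a DataFrame from a local R data.frame (createDataFrame is assumed).
localDF <- data.frame(name = c("Michael", "Andy", "Justin"),
                      age  = c(29, 30, 19),
                      stringsAsFactors = FALSE)
df2 <- createDataFrame(sqlContext, localDF)
printSchema(df2)
showDF(df2)
```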
@@ -296,6 +320,57 @@ df.groupBy("age").count().show()
{% endhighlight %}
</div>
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+sqlContext <- sparkRSQL.init(sc)
+
+# Create the DataFrame
+df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
+
+# Show the content of the DataFrame
+showDF(df)
+## age name
+## null Michael
+## 30 Andy
+## 19 Justin
+
+# Print the schema in a tree format
+printSchema(df)
+## root
+## |-- age: long (nullable = true)
+## |-- name: string (nullable = true)
+
+# Select only the "name" column
+showDF(select(df, "name"))
+## name
+## Michael
+## Andy
+## Justin
+
+# Select everybody, but increment the age by 1
+showDF(select(df, df$name, df$age + 1))
+## name (age + 1)
+## Michael null
+## Andy 31
+## Justin 20
+
+# Select people older than 21
+showDF(where(df, df$age > 21))
+## age name
+## 30 Andy
+
+# Count people by age
+showDF(count(groupBy(df, "age")))
+## age count
+## null 1
+## 19 1
+## 30 1
+
+{% endhighlight %}
+
+</div>
+
</div>
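All of the calls above return distributed DataFrames. As a small, hedged complement that uses only functions already shown in this diff plus base R, a result can be pulled back into a local data.frame:

```r
# collect() materializes a (small) DataFrame as a local R data.frame.
localNames <- collect(select(df, "name"))
print(localNames)
```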
@@ -325,6 +400,14 @@ sqlContext = SQLContext(sc)
df = sqlContext.sql("SELECT * FROM table")
{% endhighlight %}
</div>
+
+<div data-lang="r" markdown="1">
+{% highlight r %}
+sqlContext <- sparkRSQL.init(sc)
+df <- sql(sqlContext, "SELECT * FROM table")
+{% endhighlight %}
+</div>
+
</div>
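The R query above presupposes a table named `table`. A hedged sketch of the usual preliminary step, reusing `registerTempTable` and the people.json DataFrame from earlier snippets in this diff (the table and column names are illustrative):

```r
# Register a DataFrame under a name so it can be referenced in SQL.
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
registerTempTable(df, "people")
adults <- sql(sqlContext, "SELECT name FROM people WHERE age >= 21")
showDF(adults)
```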
@@ -720,6 +803,15 @@ df.select("name", "favorite_color").save("namesAndFavColors.parquet")
{% endhighlight %}
</div>
+
+<div data-lang="r" markdown="1">
+
+{% highlight r %}
+df <- loadDF(sqlContext, "people.parquet")
+saveDF(select(df, "name", "age"), "namesAndAges.parquet")
+{% endhighlight %}
+
+</div>
</div>
### Manually Specifying Options
@@ -761,6 +853,16 @@ df.select("name", "age").save("namesAndAges.parquet", "parquet")
{% endhighlight %}
</div>
+<div data-lang="r" markdown="1">
+
+{% highlight r %}
+
+df <- loadDF(sqlContext, "people.json", "json")
+saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet")
+
+{% endhighlight %}
+
+</div>
</div>
### Save Modes
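For the R tab, a hedged sketch of passing a save mode explicitly; the fourth positional argument to `saveDF` is assumed to be the mode string, matching the "overwrite" usage in the partition-discovery example later in this diff:

```r
df <- loadDF(sqlContext, "examples/src/main/resources/people.json", "json")
# "overwrite" replaces any existing data at the target path.
saveDF(select(df, "name", "age"), "namesAndAges.parquet", "parquet", "overwrite")
```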
@@ -908,6 +1010,31 @@ for teenName in teenNames.collect():
</div>
+<div data-lang="r" markdown="1">
+
+{% highlight r %}
+# sqlContext from the previous example is used in this example.
+
+schemaPeople # The DataFrame from the previous example.
+
+# DataFrames can be saved as Parquet files, maintaining the schema information.
+saveAsParquetFile(schemaPeople, "people.parquet")
+
+# Read in the Parquet file created above. Parquet files are self-describing so the schema is preserved.
+# The result of loading a parquet file is also a DataFrame.
+parquetFile <- parquetFile(sqlContext, "people.parquet")
+
+# Parquet files can also be registered as tables and then used in SQL statements.
+registerTempTable(parquetFile, "parquetFile");
+teenagers <- sql(sqlContext, "SELECT name FROM parquetFile WHERE age >= 13 AND age <= 19")
+teenNames <- map(teenagers, function(p) { paste("Name:", p$name)})
+for (teenName in collect(teenNames)) {
+ cat(teenName, "\n")
+}
+{% endhighlight %}
+
+</div>
+
<div data-lang="sql" markdown="1">
{% highlight sql %}
@@ -1033,7 +1160,7 @@ df2 = sqlContext.createDataFrame(sc.parallelize(range(6, 11))
df2.save("data/test_table/key=2", "parquet")
# Read the partitioned table
-df3 = sqlContext.parquetFile("data/test_table")
+df3 = sqlContext.load("data/test_table", "parquet")
df3.printSchema()
# The final schema consists of all 3 columns in the Parquet files together
@@ -1047,6 +1174,33 @@ df3.printSchema()
</div>
+<div data-lang="r" markdown="1">
+
+{% highlight r %}
+# sqlContext from the previous example is used in this example.
+
+# Create a simple DataFrame, stored into a partition directory
+saveDF(df1, "data/test_table/key=1", "parquet", "overwrite")
+
+# Create another DataFrame in a new partition directory,
+# adding a new column and dropping an existing column
+saveDF(df2, "data/test_table/key=2", "parquet", "overwrite")
+
+# Read the partitioned table
+df3 <- loadDF(sqlContext, "data/test_table", "parquet")
+printSchema(df3)
+
+# The final schema consists of all 3 columns in the Parquet files together
+# with the partitioning column appearing in the partition directory paths.
+# root
+# |-- single: int (nullable = true)
+# |-- double: int (nullable = true)
+# |-- triple: int (nullable = true)
+# |-- key : int (nullable = true)
+{% endhighlight %}
+
+</div>
+
</div>
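The R snippet above reuses `df1` and `df2` without defining them (the Python tab creates them). A hedged sketch of one way they could be built in SparkR; `createDataFrame` and the column values are assumptions chosen to match the schema in the comments:

```r
# Hypothetical construction of df1 and df2 for the partitioned-table example.
df1 <- createDataFrame(sqlContext, data.frame(single = 1:5,  double = (1:5) * 2))
df2 <- createDataFrame(sqlContext, data.frame(single = 6:10, triple = (6:10) * 3))
```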
### Configuration
@@ -1238,6 +1392,40 @@ anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
{% endhighlight %}
</div>
+<div data-lang="r" markdown="1">
+Spark SQL can automatically infer the schema of a JSON dataset and load it as a DataFrame.
+This conversion can be done using the `jsonFile` method in a `SQLContext`:
+
+* `jsonFile` - loads data from a directory of JSON files where each line of the files is a JSON object.
+
+Note that the file that is offered as _jsonFile_ is not a typical JSON file. Each
+line must contain a separate, self-contained valid JSON object. As a consequence,
+a regular multi-line JSON file will most often fail.
+
+{% highlight r %}
+# sc is an existing SparkContext.
+sqlContext <- sparkRSQL.init(sc)
+
+# A JSON dataset is pointed to by path.
+# The path can be either a single text file or a directory storing text files.
+path <- "examples/src/main/resources/people.json"
+# Create a DataFrame from the file(s) pointed to by path
+people <- jsonFile(sqlContext, path)
+
+# The inferred schema can be visualized using the printSchema() method.
+printSchema(people)
+# root
+# |-- age: integer (nullable = true)
+# |-- name: string (nullable = true)
+
+# Register this DataFrame as a table.
+registerTempTable(people, "people")
+
+# SQL statements can be run by using the sql methods provided by `sqlContext`.
+teenagers <- sql(sqlContext, "SELECT name FROM people WHERE age >= 13 AND age <= 19")
+{% endhighlight %}
+</div>
+
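As a small, hedged follow-up to the snippet above (no new API beyond what this diff already uses), the query result can be brought back to the driver:

```r
# Materialize the teenagers DataFrame locally and print it.
print(collect(teenagers))
```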
<div data-lang="sql" markdown="1">
{% highlight sql %}
@@ -1314,10 +1502,7 @@ Row[] results = sqlContext.sql("FROM src SELECT key, value").collect();
<div data-lang="python" markdown="1">
When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
-adds support for finding tables in the MetaStore and writing queries using HiveQL. In addition to
-the `sql` method a `HiveContext` also provides an `hql` methods, which allows queries to be
-expressed in HiveQL.
-
+adds support for finding tables in the MetaStore and writing queries using HiveQL.
{% highlight python %}
# sc is an existing SparkContext.
from pyspark.sql import HiveContext
@@ -1332,6 +1517,24 @@ results = sqlContext.sql("FROM src SELECT key, value").collect()
{% endhighlight %}
</div>
+
+<div data-lang="r" markdown="1">
+
+When working with Hive one must construct a `HiveContext`, which inherits from `SQLContext`, and
+adds support for finding tables in the MetaStore and writing queries using HiveQL.
+{% highlight r %}
+# sc is an existing SparkContext.
+sqlContext <- sparkRHive.init(sc)
+
+hql(sqlContext, "CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
+hql(sqlContext, "LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE src")
+
+# Queries can be expressed in HiveQL.
+results <- collect(sql(sqlContext, "FROM src SELECT key, value"))
+
+{% endhighlight %}
+
+</div>
</div>
## JDBC To Other Databases
@@ -1430,6 +1633,16 @@ df = sqlContext.load(source="jdbc", url="jdbc:postgresql:dbserver", dbtable="sch
</div>
+<div data-lang="r" markdown="1">
+
+{% highlight r %}
+
+df <- loadDF(sqlContext, source="jdbc", url="jdbc:postgresql:dbserver", dbtable="schema.tablename")
+
+{% endhighlight %}
+
+</div>
+
<div data-lang="sql" markdown="1">
{% highlight sql %}
@@ -2354,5 +2567,151 @@ from pyspark.sql.types import *
</div>
+<div data-lang="r" markdown="1">
+
+<table class="table">
+<tr>
+ <th style="width:20%">Data type</th>
+ <th style="width:40%">Value type in R</th>
+ <th>API to access or create a data type</th></tr>
+<tr>
+ <td> <b>ByteType</b> </td>
+ <td>
+ integer <br />
+ <b>Note:</b> Numbers will be converted to 1-byte signed integer numbers at runtime.
+ Please make sure that numbers are within the range of -128 to 127.
+ </td>
+ <td>
+ "byte"
+ </td>
+</tr>
+<tr>
+ <td> <b>ShortType</b> </td>
+ <td>
+ integer <br />
+ <b>Note:</b> Numbers will be converted to 2-byte signed integer numbers at runtime.
+ Please make sure that numbers are within the range of -32768 to 32767.
+ </td>
+ <td>
+ "short"
+ </td>
+</tr>
+<tr>
+ <td> <b>IntegerType</b> </td>
+ <td> integer </td>
+ <td>
+ "integer"
+ </td>
+</tr>
+<tr>
+ <td> <b>LongType</b> </td>
+ <td>
+ integer <br />
+ <b>Note:</b> Numbers will be converted to 8-byte signed integer numbers at runtime.
+ Please make sure that numbers are within the range of
+ -9223372036854775808 to 9223372036854775807.
+ Otherwise, please convert data to decimal.Decimal and use DecimalType.
+ </td>
+ <td>
+ "long"
+ </td>
+</tr>
+<tr>
+ <td> <b>FloatType</b> </td>
+ <td>
+ numeric <br />
+ <b>Note:</b> Numbers will be converted to 4-byte single-precision floating
+ point numbers at runtime.
+ </td>
+ <td>
+ "float"
+ </td>
+</tr>
+<tr>
+ <td> <b>DoubleType</b> </td>
+ <td> numeric </td>
+ <td>
+ "double"
+ </td>
+</tr>
+<tr>
+ <td> <b>DecimalType</b> </td>
+ <td> Not supported </td>
+ <td>
+ Not supported
+ </td>
+</tr>
+<tr>
+ <td> <b>StringType</b> </td>
+ <td> character </td>
+ <td>
+ "string"
+ </td>
+</tr>
+<tr>
+ <td> <b>BinaryType</b> </td>
+ <td> raw </td>
+ <td>
+ "binary"
+ </td>
+</tr>
+<tr>
+ <td> <b>BooleanType</b> </td>
+ <td> logical </td>
+ <td>
+ "bool"
+ </td>
+</tr>
+<tr>
+ <td> <b>TimestampType</b> </td>
+ <td> POSIXct </td>
+ <td>
+ "timestamp"
+ </td>
+</tr>
+<tr>
+ <td> <b>DateType</b> </td>
+ <td> Date </td>
+ <td>
+ "date"
+ </td>
+</tr>
+<tr>
+ <td> <b>ArrayType</b> </td>
+ <td> vector or list </td>
+ <td>
+ list(type="array", elementType=<i>elementType</i>, containsNull=[<i>containsNull</i>])<br />
+ <b>Note:</b> The default value of <i>containsNull</i> is <i>True</i>.
+ </td>
+</tr>
+<tr>
+ <td> <b>MapType</b> </td>
+ <td> environment </td>
+ <td>
+ list(type="map", keyType=<i>keyType</i>, valueType=<i>valueType</i>, valueContainsNull=[<i>valueContainsNull</i>])<br />
+ <b>Note:</b> The default value of <i>valueContainsNull</i> is <i>True</i>.
+ </td>
+</tr>
+<tr>
+ <td> <b>StructType</b> </td>
+ <td> named list</td>
+ <td>
+ list(type="struct", fields=<i>fields</i>)<br />
+ <b>Note:</b> <i>fields</i> is a list of StructFields. Also, two fields with the same
+ name are not allowed.
+ </td>
+</tr>
+<tr>
+ <td> <b>StructField</b> </td>
+ <td> The value type in R of the data type of this field
+ (For example, integer for a StructField with the data type IntegerType) </td>
+ <td>
+ list(name=<i>name</i>, type=<i>dataType</i>, nullable=<i>nullable</i>)
+ </td>
+</tr>
+</table>
+
+</div>
+
</div>
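As a hedged sanity check on the R column of this table, the sketch below builds a DataFrame from a local data.frame and prints the inferred schema; `createDataFrame` is assumed (it does not appear in this diff), and the exact inferred type names may vary by Spark version.

```r
# Local R columns: numeric, character, logical.
localDF <- data.frame(height   = c(1.71, 1.82),
                      name     = c("Ann", "Bob"),
                      employed = c(TRUE, FALSE),
                      stringsAsFactors = FALSE)
typesDF <- createDataFrame(sqlContext, localDF)

# Expected roughly: double, string, boolean (per the mapping above).
printSchema(typesDF)
```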