author     Dongjoon Hyun <dongjoon@apache.org>    2016-04-24 22:10:27 -0700
committer  Shivaram Venkataraman <shivaram@cs.berkeley.edu>    2016-04-24 22:10:27 -0700
commit     6ab4d9e0c76b69b4d6d5f39037a77bdfb042be19 (patch)
tree       494b601ba783d7b025b805504bde8f3f92b7667b /docs/sparkr.md
parent     35319d326488b3bf9235dfcf9ac4533ce846f21f (diff)
[SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date
## What changes were proposed in this pull request?

This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.

- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency.
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions to remove warnings: `jsonFile` -> `read.json`, `parquetFile` -> `read.parquet`.
- Use up-to-date R-like functions: `loadDF` -> `read.df`, `saveDF` -> `write.df`, `saveAsParquetFile` -> `write.parquet`.
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12649 from dongjoon-hyun/SPARK-14883.
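For reference, here is a minimal SparkR sketch contrasting the deprecated calls with the replacements listed above. This is an illustration only, not taken from the patch: the `sqlContext` session object and the file paths are assumed.

{% highlight r %}
# Deprecated readers/writers (these now emit deprecation warnings):
# people <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
# pqt    <- parquetFile(sqlContext, "people.parquet")
# df     <- loadDF(sqlContext, "people.json", source = "json")
# saveDF(df, "people-out", source = "parquet")
# saveAsParquetFile(df, "people.parquet")

# Up-to-date equivalents used by the corrected docs:
people <- read.json(sqlContext, "examples/src/main/resources/people.json")
pqt    <- read.parquet(sqlContext, "people.parquet")
df     <- read.df(sqlContext, "examples/src/main/resources/people.json", source = "json")
write.df(df, path = "people-out", source = "parquet", mode = "overwrite")
write.parquet(df, "people.parquet")
{% endhighlight %}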
Diffstat (limited to 'docs/sparkr.md')
-rw-r--r--  docs/sparkr.md  11
1 file changed, 5 insertions(+), 6 deletions(-)
diff --git a/docs/sparkr.md b/docs/sparkr.md
index a0b4f93776..760534ae14 100644
--- a/docs/sparkr.md
+++ b/docs/sparkr.md
@@ -141,7 +141,7 @@ head(people)
# SparkR automatically infers the schema from the JSON file
printSchema(people)
# root
-# |-- age: integer (nullable = true)
+# |-- age: long (nullable = true)
# |-- name: string (nullable = true)
{% endhighlight %}
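The `integer` -> `long` fix in this hunk reflects Spark SQL's JSON schema inference, which maps whole-number JSON fields to `LongType`. A minimal sketch reproducing the corrected output, assuming a SparkR shell with `sqlContext` initialized and the `people.json` example file shipped with Spark:

{% highlight r %}
# JSON integers are inferred as long, so `age` prints as long, not integer
people <- read.json(sqlContext, "examples/src/main/resources/people.json")
printSchema(people)
# root
#  |-- age: long (nullable = true)
#  |-- name: string (nullable = true)
{% endhighlight %}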
@@ -195,7 +195,7 @@ df <- createDataFrame(sqlContext, faithful)
# Get basic information about the DataFrame
df
-## DataFrame[eruptions:double, waiting:double]
+## SparkDataFrame[eruptions:double, waiting:double]
# Select only the "eruptions" column
head(select(df, df$eruptions))
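The printed header changes here because SparkR's `DataFrame` class was renamed to `SparkDataFrame` during Spark 2.0 development, avoiding clashes with same-named classes in other R packages. A short sketch of the surrounding example, assuming a SparkR session with `sqlContext` (`faithful` is R's built-in Old Faithful dataset):

{% highlight r %}
# Convert a local R data.frame into a distributed SparkDataFrame
df <- createDataFrame(sqlContext, faithful)
df
## SparkDataFrame[eruptions:double, waiting:double]

# Fetch the first rows of a single column
head(select(df, df$eruptions))
{% endhighlight %}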
@@ -228,14 +228,13 @@ SparkR data frames support a number of commonly used functions to aggregate data
# We use the `n` operator to count the number of times each waiting time appears
head(summarize(groupBy(df, df$waiting), count = n(df$waiting)))
## waiting count
-##1 81 13
-##2 60 6
-##3 68 1
+##1 70 4
+##2 67 1
+##3 69 2
# We can also sort the output from the aggregation to get the most common waiting times
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(arrange(waiting_counts, desc(waiting_counts$count)))
-
## waiting count
##1 78 15
##2 83 14
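Pieced together, the corrected aggregation example reads as follows. This is a sketch assuming the `faithful` SparkDataFrame from the previous hunk; the sample outputs are the ones shown in this diff.

{% highlight r %}
# Count how many times each waiting time appears, using the `n` aggregate
waiting_counts <- summarize(groupBy(df, df$waiting), count = n(df$waiting))
head(waiting_counts)
## waiting count
##1 70 4
##2 67 1
##3 69 2

# Sort descending by count to surface the most common waiting times
head(arrange(waiting_counts, desc(waiting_counts$count)))
## waiting count
##1 78 15
##2 83 14
{% endhighlight %}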