author    Dongjoon Hyun <dongjoon@apache.org>  2016-04-24 22:10:27 -0700
committer Shivaram Venkataraman <shivaram@cs.berkeley.edu>  2016-04-24 22:10:27 -0700
commit    6ab4d9e0c76b69b4d6d5f39037a77bdfb042be19 (patch)
tree      494b601ba783d7b025b805504bde8f3f92b7667b /examples
parent    35319d326488b3bf9235dfcf9ac4533ce846f21f (diff)
[SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date
## What changes were proposed in this pull request?
This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
- Remove the incorrect usage of `map`; SparkR uses `lapply` instead where needed. However, `lapply` is still private for now, so a corrected example will be added later.
- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
- Fix datatypes in `sparkr.md`.
- Update a data result in `sparkr.md`.
- Replace deprecated functions with their current equivalents to remove warnings: `jsonFile` -> `read.json`, `parquetFile` -> `read.parquet`
- Use up-to-date R-like functions: `loadDF` -> `read.df`, `saveDF` -> `write.df`, `saveAsParquetFile` -> `write.parquet`
- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
- Other minor syntax fixes and a typo.
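The renames above can be summarized in a short sketch. This assumes a Spark 1.6-era SparkR session where `sc` and `sqlContext` were created via `sparkR.init()` and `sparkRSQL.init()`, and the file paths (`people.json`, `people.parquet`) are hypothetical placeholders:

```r
# jsonFile(sqlContext, "people.json")          # deprecated
people <- read.json(sqlContext, "people.json")

# saveAsParquetFile(people, "people.parquet")  # deprecated
write.parquet(people, "people.parquet")

# parquetFile(sqlContext, "people.parquet")    # deprecated
people2 <- read.parquet(sqlContext, "people.parquet")

# loadDF(sqlContext, ...) / saveDF(df, ...)    # deprecated
df <- read.df(sqlContext, "people.json", source = "json")
write.df(df, path = "people-copy.json", source = "json")
```

The new names mirror base R's `read.*`/`write.*` conventions, which is the "R-like" consistency this PR aims for.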
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #12649 from dongjoon-hyun/SPARK-14883.
Diffstat (limited to 'examples')
-rw-r--r--  examples/src/main/r/data-manipulation.R | 22
-rw-r--r--  examples/src/main/r/dataframe.R         |  2
2 files changed, 12 insertions, 12 deletions
```diff
diff --git a/examples/src/main/r/data-manipulation.R b/examples/src/main/r/data-manipulation.R
index aa2336e300..594bf49d60 100644
--- a/examples/src/main/r/data-manipulation.R
+++ b/examples/src/main/r/data-manipulation.R
@@ -30,7 +30,7 @@ args <- commandArgs(trailing = TRUE)
 if (length(args) != 1) {
   print("Usage: data-manipulation.R <path-to-flights.csv")
-  print("The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv ")
+  print("The data can be downloaded from: http://s3-us-west-2.amazonaws.com/sparkr-data/flights.csv")
   q("no")
 }
@@ -49,33 +49,33 @@ flights_df$date <- as.Date(flights_df$date)
 ## Filter flights whose destination is San Francisco and write to a local data frame
 SFO_df <- flights_df[flights_df$dest == "SFO", ]

-# Convert the local data frame into a SparkR DataFrame
+# Convert the local data frame into a SparkDataFrame
 SFO_DF <- createDataFrame(sqlContext, SFO_df)

-# Directly create a SparkR DataFrame from the source data
+# Directly create a SparkDataFrame from the source data
 flightsDF <- read.df(sqlContext, flightsCsvPath, source = "com.databricks.spark.csv", header = "true")

-# Print the schema of this Spark DataFrame
+# Print the schema of this SparkDataFrame
 printSchema(flightsDF)

-# Cache the DataFrame
+# Cache the SparkDataFrame
 cache(flightsDF)

-# Print the first 6 rows of the DataFrame
+# Print the first 6 rows of the SparkDataFrame
 showDF(flightsDF, numRows = 6) ## Or
 head(flightsDF)

-# Show the column names in the DataFrame
+# Show the column names in the SparkDataFrame
 columns(flightsDF)

-# Show the number of rows in the DataFrame
+# Show the number of rows in the SparkDataFrame
 count(flightsDF)

 # Select specific columns
 destDF <- select(flightsDF, "dest", "cancelled")

 # Using SQL to select columns of data
-# First, register the flights DataFrame as a table
+# First, register the flights SparkDataFrame as a table
 registerTempTable(flightsDF, "flightsTable")
 destDF <- sql(sqlContext, "SELECT dest, cancelled FROM flightsTable")
@@ -95,11 +95,11 @@ if("magrittr" %in% rownames(installed.packages())) {
   library(magrittr)

   # Group the flights by date and then find the average daily delay
-  # Write the result into a DataFrame
+  # Write the result into a SparkDataFrame
   groupBy(flightsDF, flightsDF$date) %>%
     summarize(avg(flightsDF$dep_delay), avg(flightsDF$arr_delay)) -> dailyDelayDF

-  # Print the computed data frame
+  # Print the computed SparkDataFrame
   head(dailyDelayDF)
 }
diff --git a/examples/src/main/r/dataframe.R b/examples/src/main/r/dataframe.R
index 62f60e57ee..436bac6aaf 100644
--- a/examples/src/main/r/dataframe.R
+++ b/examples/src/main/r/dataframe.R
@@ -24,7 +24,7 @@ sqlContext <- sparkRSQL.init(sc)
 # Create a simple local data.frame
 localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))

-# Convert local data frame to a SparkR DataFrame
+# Convert local data frame to a SparkDataFrame
 df <- createDataFrame(sqlContext, localDF)

 # Print its schema
```