path: root/R/pkg/inst
Commit message | Author | Date | Files | Lines
...
* [SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR (wm624@hotmail.com, 2016-08-17, 1 file, -0/+32)
  Changes: Add an Isotonic Regression wrapper in SparkR. Wrappers in R and Scala are added, along with unit tests and documentation.
  Tested: Manually tested with sudo ./R/run-tests.sh.
  Author: wm624@hotmail.com <wm624@hotmail.com>. Closes #14182 from wangmiao1981/isoR.
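  An illustrative sketch of how such a wrapper is typically used, assuming the function follows the usual SparkR ML naming (`spark.isoreg`) and an active SparkR session; the data and column names below are made up.

```r
library(SparkR)
sparkR.session()

# Toy training data; "label" and "feature" are hypothetical column names.
df <- createDataFrame(data.frame(label = c(1, 2, 3, 1, 6), feature = 1:5))

model <- spark.isoreg(df, label ~ feature, isotonic = TRUE)  # fit the isotonic regression
summary(model)            # boundaries and predictions of the fitted model
head(predict(model, df))  # predictions on a SparkDataFrame
```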
* [SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R CMD check (Felix Cheung, 2016-08-16, 12 files, -182/+182)
  Changes: Rename RDD functions for now to avoid CRAN check warnings. Some RDD functions share generics with DataFrame functions (hence the problem), so after the renames we need to add new generics, for now.
  Tested: unit tests.
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #14626 from felixcheung/rrddfunctions.
* [SPARK-16579][SPARKR] add install.spark function (Junyang Qian, 2016-08-10, 1 file, -2/+2)
  Changes: Add an install.spark function to the SparkR package, so a user can run `install.spark()` to install Spark to a local directory within R. `install.spark()` checks for an existing tar file in the cache folder and downloads only if it is not found; the mirror_url look-up order is user-provided, then the preferred mirror site from the Apache website, then a hardcoded backup option; Spark 2.0.0 is used. `sparkR.session()` can install Spark when it is not found in `SPARK_HOME`.
  Tested: Manual tests, running the check-cran.sh script added in #14173.
  Author: Junyang Qian <junyangq@databricks.com>. Closes #14258 from junyangq/SPARK-16579.
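  A minimal sketch of the workflow described above, assuming network access and an R session with the SparkR package attached:

```r
library(SparkR)

# Download and cache a Spark distribution under SparkR's local cache
# directory; later calls reuse the cached tarball instead of re-downloading.
install.spark()

# Starting a session will also trigger the install if SPARK_HOME is not set.
sparkR.session()
```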
* [SPARK-16710][SPARKR][ML] spark.glm should support weightCol (Yanbo Liang, 2016-08-10, 1 file, -0/+22)
  Changes: Training GLMs on weighted datasets is an important use case, but it was not supported by SparkR. In native R, users pass the `weights` argument to specify a weights vector; for `spark.glm`, we can pass in `weightCol`, which is consistent with MLlib.
  Tested: Unit test.
  Author: Yanbo Liang <ybliang8@gmail.com>. Closes #14346 from yanboliang/spark-16710.
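  A small sketch of a weighted fit with the new argument, assuming an active SparkR session; the column names are made up for illustration.

```r
library(SparkR)
sparkR.session()

# "w" holds per-row weights; "y" and "x" are hypothetical response/predictor columns.
df <- createDataFrame(data.frame(y = c(1, 0, 1, 1),
                                 x = c(2, 1, 3, 4),
                                 w = c(1, 2, 1, 3)))

model <- spark.glm(df, y ~ x, family = "binomial", weightCol = "w")
summary(model)
```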
* [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite (Shivaram Venkataraman, 2016-07-19, 3 files, -41/+5)
  Changes: This change moves the include-jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This lets us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for that jar.
  Tested: SparkR unit tests, SparkSubmitSuite, check-cran.sh.
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>. Closes #14243 from shivaram/sparkr-jar-move.
* [SPARK-16027][SPARKR] Fix R tests SparkSession init/stop (Felix Cheung, 2016-07-17, 14 files, -25/+48)
  Changes: Fix R SparkSession init/stop in tests, and warnings about reusing an existing SparkContext.
  Tested: unit tests (shivaram).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #14177 from felixcheung/rsessiontest.
* [SPARK-16538][SPARKR] Add more tests for namespace call to SparkSession functions (Felix Cheung, 2016-07-15, 1 file, -0/+7)
  Changes: More tests. Not critical for the Spark 2.0.0 RC; could go into Spark 2.0.1 or 2.1.0.
  Tested: unit tests (shivaram, dongjoon-hyun).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #14206 from felixcheung/rroutetests.
* [SPARK-16538][SPARKR] fix R call with namespace operator on SparkSession functions (Felix Cheung, 2016-07-14, 1 file, -1/+2)
  Changes: Fix function routing to work with and without the namespace operator, e.g. `SparkR::createDataFrame`.
  Tested: manual, unit tests (shivaram).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #14195 from felixcheung/rroutedefault.
* [SPARK-16509][SPARKR] Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy (Sun Rui, 2016-07-14, 1 file, -4/+4)
  Changes: Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass the CRAN package check.
  Tested: SparkR unit tests.
  Author: Sun Rui <sunrui2016@gmail.com>. Closes #14192 from sun-rui/SPARK-16509.
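  A sketch of the renamed window-spec builders in use, assuming an active SparkR session; the data frame and column names are made up.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(dept = c("a", "a", "b"),
                                 salary = c(100, 200, 150)))

# Build a window spec partitioned by dept and ordered by salary,
# then number rows within each partition.
ws <- orderBy(windowPartitionBy("dept"), "salary")
head(select(df, df$dept, df$salary, alias(over(row_number(), ws), "rn")))
```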
* [SPARKR][DOCS][MINOR] R programming guide to include csv data source example (Felix Cheung, 2016-07-13, 1 file, -1/+1)
  Changes: Minor documentation update for a code example, code style, and a missed reference to "sparkR.init".
  Tested: manual (shivaram).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #14178 from felixcheung/rcsvprogrammingguide.
* [SPARKR][MINOR] R examples and test updates (Felix Cheung, 2016-07-13, 2 files, -2/+2)
  Changes: Minor example updates.
  Tested: manual (shivaram).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #14171 from felixcheung/rexample.
* [SPARK-16429][SQL] Include `StringType` columns in `describe()` (Dongjoon Hyun, 2016-07-08, 1 file, -2/+2)
  Changes: Spark's `describe` already supports `StringType` when columns are named explicitly, e.g. `describe("age", "name")`, but no-argument `describe()` returned results for numeric columns only. This PR includes `StringType` columns in no-argument `describe()` as well, so that

```scala
spark.read.json("examples/src/main/resources/people.json").describe().show()
```

  now reports count/mean/stddev/min/max for both the numeric `age` column and the string `name` column (mean and stddev are null for strings), matching the explicit `describe("age", "name")` output.
  Tested: Pass the Jenkins with an updated test case.
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #14095 from dongjoon-hyun/SPARK-16429.
* [SPARK-16425][R] `describe()` should not fail with non-numeric columns (Dongjoon Hyun, 2016-07-07, 1 file, -2/+6)
  Changes: This PR prevents errors when `summary(df)` is called on a `SparkDataFrame` with non-numeric columns; the failure happens only in SparkR.

```r
df <- createDataFrame(faithful)
df <- withColumn(df, "boolean", df$waiting == 79)
summary(df)
```

  Before: the call fails with "org.apache.spark.sql.AnalysisException: cannot resolve 'avg(boolean)' due to data type mismatch: function average requires numeric types, not BooleanType". After: it returns `SparkDataFrame[summary:string, eruptions:string, waiting:string]`.
  Tested: Pass the Jenkins with an updated test case.
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #14096 from dongjoon-hyun/SPARK-16425.
* [SPARK-16310][SPARKR] R na.string-like default for csv source (Felix Cheung, 2016-07-07, 1 file, -6/+26)
  Changes: Apply "NA" as the default null string for the csv source in R, like the na.strings parameter of R's read.csv (https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html, na.strings = "NA"). A user passing a csv file with NA values should get the same behavior from SparkR `read.df(..., source = "csv")`.
  Tested: unit tests (shivaram).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13984 from felixcheung/rcsvnastring.
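  A sketch of reading csv with the R-like NA handling described above, assuming an active SparkR session; the file path is hypothetical.

```r
library(SparkR)
sparkR.session()

# Cells containing the string "NA" are read back as null, mirroring read.csv's
# na.strings default; the path below is a placeholder.
df <- read.df("/tmp/people.csv", source = "csv",
              header = "true", inferSchema = "true", na.strings = "NA")
head(df)
```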
* [SPARK-16233][R][TEST] ORC test should be enabled only when HiveContext is available (Dongjoon Hyun, 2016-07-01, 1 file, -0/+2)
  Changes: ORC tests should be enabled only when HiveContext is available.
  Tested: Manual. Running `R/run-tests.sh` against a build without Hive now reports the Hive-dependent tests as skipped, e.g. "read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped", and the suite passes.
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #14019 from dongjoon-hyun/SPARK-16233.
* [SPARK-16299][SPARKR] Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory (Sun Rui, 2016-07-01, 1 file, -1/+1)
  Changes: Capture errors from R workers in daemon.R to avoid deletion of the R session temporary directory. See the detailed description at https://issues.apache.org/jira/browse/SPARK-16299.
  Tested: SparkR unit tests.
  Author: Sun Rui <sunrui2016@gmail.com>. Closes #13975 from sun-rui/SPARK-16299.
* [SPARK-16012][SPARKR] Implement gapplyCollect which will apply an R function on each group similar to gapply and collect the result back to an R data.frame (Narine Kokhlikyan, 2016-07-01, 1 file, -5/+30)
  Changes: gapplyCollect() does gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided. This is similar to dapplyCollect().
  Tested: Added test cases for gapplyCollect similar to dapplyCollect.
  Author: Narine Kokhlikyan <narine@slice.com>. Closes #13760 from NarineK/gapplyCollect.
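  An illustrative sketch of grouping by a column and applying an R function per group without supplying an output schema, assuming an active SparkR session; the column names are made up.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(g = c("a", "a", "b"), v = c(1, 2, 10)))

# The function receives the grouping key and the group's rows as an R
# data.frame and returns an R data.frame; results are collected locally.
result <- gapplyCollect(df, "g", function(key, x) {
  data.frame(g = key[[1]], total = sum(x$v), stringsAsFactors = FALSE)
})
result
```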
* [SPARK-16289][SQL] Implement posexplode table generating function (Dongjoon Hyun, 2016-06-30, 1 file, -1/+1)
  Changes: This PR implements the `posexplode` table generating function. Previously, the master branch raised an exception for a `map` argument, which differs from Hive: `sql("select posexplode(map('a', 1, 'b', 2))").show` failed with "org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter". After this change, the same query returns (pos, key, value) rows: (0, a, 1) and (1, b, 2). For an `array` argument, e.g. `posexplode(array(1, 2, 3))`, behavior is unchanged and (pos, col) rows are returned.
  Tested: Pass the Jenkins tests with newly added test cases.
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13971 from dongjoon-hyun/SPARK-16289.
* [SPARKR] add csv tests (Felix Cheung, 2016-06-28, 1 file, -0/+18)
  Changes: Add unit tests for csv data in SparkR.
  Tested: unit tests.
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13904 from felixcheung/rcsv.
* [SPARK-16128][SQL] Allow setting length of characters to be truncated to, in Dataset.show function (Prashant Sharma, 2016-06-28, 1 file, -0/+8)
  Changes: Allowing truncation to a specific number of characters is convenient at times, especially when operating from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
  Tested: Existing tests, plus one new test in DataFrameSuite. For SparkR and PySpark, existing tests and manual testing.
  Author: Prashant Sharma <prashsh1@in.ibm.com>, Prashant Sharma <prashant@apache.org>. Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
* [SPARK-16184][SPARKR] conf API for SparkSession (Felix Cheung, 2016-06-26, 1 file, -6/+6)
  Changes: Add a `conf` method to get the Runtime Config from SparkSession.
  Tested: unit tests, manual tests. In the sparkR shell, `conf()` returns a named list of the current configuration (for example `$spark.master "local[*]"`, `$spark.app.name "SparkR"`, `$spark.sql.catalogImplementation "hive"`, `$spark.submit.deployMode "client"`, and so on), and `conf("spark.master")` returns just that entry.
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13885 from felixcheung/rconf.
* [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup (Felix Cheung, 2016-06-23, 1 file, -3/+7)
  Changes: Updated setJobGroup, cancelJobGroup, and clearJobGroup to no longer require sc/SparkContext as a parameter. Also updated the roxygen2 docs and the R programming guide on deprecations.
  Tested: unit tests.
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13838 from felixcheung/rjobgroup.
* [SPARK-16096][SPARKR] add union and deprecate unionAll (Felix Cheung, 2016-06-21, 2 files, -3/+7)
  Changes: Add union and deprecate unionAll; separate the roxygen2 doc for rbind (since their usage and parameter lists are quite different). `explode` is also deprecated, but its replacement seems to be a combination of calls, so it is not clear yet whether we should deprecate it in SparkR.
  Tested: unit tests, manual checks of the R doc.
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13805 from felixcheung/runion.
* [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib (Xiangrui Meng, 2016-06-21, 1 file, -2/+2)
  Changes: This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib; other changes from #13023 are left out to keep this PR minimal, and a follow-up PR will improve the documentation. Main changes: `spark.glm`: epsilon -> tol, maxit -> maxIter; `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"; `spark.naiveBayes`: laplace -> smoothing, default 1.0.
  Tested: Existing unit tests.
  Author: Xiangrui Meng <meng@databricks.com>. Closes #13801 from mengxr/SPARK-15177.1.
* [SPARK-15294][R] Add `pivot` to SparkR (Dongjoon Hyun, 2016-06-20, 1 file, -0/+25)
  Changes: This PR adds the `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295, mhnatiuk should be credited for the work he did.
  Tested: Pass the Jenkins tests (including a new test case).
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13786 from dongjoon-hyun/SPARK-15294.
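  A sketch of pivoting grouped data along the lines of the Scala/Python pivot API, assuming an active SparkR session and that aggregate helpers such as `sum` accept a column name on the pivoted grouped data; the data is made up.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(year = c(2015, 2015, 2016),
                                 course = c("R", "Scala", "R"),
                                 earnings = c(10, 20, 15)))

# Group by year, pivot distinct course values into columns, sum earnings.
collect(sum(pivot(groupBy(df, "year"), "course"), "earnings"))
```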
* [SPARK-16053][R] Add `spark_partition_id` in SparkR (Dongjoon Hyun, 2016-06-20, 1 file, -0/+1)
  Changes: This PR adds the `spark_partition_id` virtual column function in SparkR for API parity. As an example, on a partitioned parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`, `collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))` returns each `id` together with its `SPARK_PARTITION_ID()` (e.g. ids 3 and 4 in partition 0, 8 and 9 in partition 1, and so on).
  Tested: Pass the Jenkins tests (including a new test case).
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13768 from dongjoon-hyun/SPARK-16053.
* [SPARK-16028][SPARKR] spark.lapply can work with active context (Felix Cheung, 2016-06-20, 1 file, -3/+3)
  Changes: Update spark.lapply and setLogLevel to work with the active context, so they no longer take sc as a parameter.
  Tested: unit test (shivaram, thunterdb).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13752 from felixcheung/rlapply.
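  A minimal sketch of running a local R function over a list on Spark using the active session rather than a SparkContext handle; assumes SparkR is attached.

```r
library(SparkR)
sparkR.session()

# Each list element is processed by the function on the cluster;
# the results come back as an R list.
squares <- spark.lapply(1:5, function(i) i * i)
unlist(squares)

setLogLevel("WARN")  # likewise no longer takes sc
```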
* [SPARK-16051][R] Add `read.orc/write.orc` to SparkR (Dongjoon Hyun, 2016-06-20, 1 file, -0/+21)
  Changes: This issue adds `read.orc`/`write.orc` to SparkR for API parity.
  Tested: Pass the Jenkins tests (with new test cases).
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13763 from dongjoon-hyun/SPARK-16051.
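  A sketch of an ORC round trip with the new functions, assuming an active SparkR session on a Spark build with Hive support; the output path is a placeholder.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(faithful)
write.orc(df, "/tmp/faithful_orc")      # write the SparkDataFrame as ORC
orc_df <- read.orc("/tmp/faithful_orc") # read it back
head(orc_df)
```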
* [SPARK-16029][SPARKR] SparkR add dropTempView and deprecate dropTempTable (Felix Cheung, 2016-06-20, 1 file, -7/+7)
  Changes: Add dropTempView and deprecate dropTempTable.
  Tested: unit tests (shivaram, liancheng).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13753 from felixcheung/rdroptempview.
* [SPARK-16059][R] Add `monotonically_increasing_id` function in SparkR (Dongjoon Hyun, 2016-06-20, 1 file, -1/+1)
  Changes: This PR adds the `monotonically_increasing_id` column function in SparkR for API parity. After this PR, SparkR supports:

```r
df <- read.json("examples/src/main/resources/people.json")
collect(select(df, monotonically_increasing_id(), df$name, df$age))
#   monotonically_increasing_id()    name age
# 1                             0 Michael  NA
# 2                             1    Andy  30
# 3                             2  Justin  19
```

  Tested: Pass the Jenkins tests (with an added test case).
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13774 from dongjoon-hyun/SPARK-16059.
* [SPARK-15159][SPARKR] SparkR SparkSession API (Felix Cheung, 2016-06-17, 17 files, -67/+144)
  Changes: This PR introduces the new SparkSession API for SparkR: `sparkR.session.getOrCreate()` and `sparkR.session.stop()`. "getOrCreate" is a bit unusual in R, but it's important to name this clearly. In the SparkR implementation:
  - SparkSession is the main entry point (vs SparkContext, due to the limited functionality supported with SparkContext in SparkR).
  - SparkSession replaces SQLContext and HiveContext (both are wrappers around SparkSession, and because of API changes, supporting all three would be a lot more work).
  - Changes to SparkSession are mostly transparent to users due to SPARK-10903. Full backward compatibility is expected: users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), with a deprecation warning.
  - Mostly cosmetic changes to the parameter list, so users should be able to move to `sparkR.session.getOrCreate()` easily.
  - An advanced syntax with named parameters (varargs, "...") is supported; that should be closer to the builder syntax in Scala/Python, which unfortunately does not work in R because it would look like `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`.
  - Updating config on an existing SparkSession is supported; the behavior is the same as in Python, where config is applied to both SparkContext and SparkSession.
  - Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: the `catalog` object, `createOrReplaceTempView`.
  - Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`.
  - The `sparkR` shell is updated to use the SparkSession entry point (`sqlContext` is removed, just like with Scala/Python).
  - All tests are updated to use the SparkSession entry point.
  - A bug in `read.jdbc` is fixed.
  TODO: more tests were added here; separate PRs will update the roxygen2 doc coding examples and the SparkR programming guide.
  Tested: unit tests, manual tests (shivaram, sun-rui, rxin).
  Author: Felix Cheung <felixcheung_m@hotmail.com>, felixcheung <felixcheung_m@hotmail.com>. Closes #13635 from felixcheung/rsparksession.
* [SPARK-16005][R] Add `randomSplit` to SparkR (Dongjoon Hyun, 2016-06-17, 1 file, -0/+18)
  Changes: This PR adds `randomSplit` to SparkR for API parity.
  Tested: Pass the Jenkins tests (with a new test case).
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13721 from dongjoon-hyun/SPARK-16005.
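  A short sketch of splitting a SparkDataFrame by weights, assuming an active SparkR session.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(id = 1:100))

# Split into roughly 70%/30% pieces; the seed makes the split reproducible.
splits <- randomSplit(df, c(0.7, 0.3), seed = 42)
train <- splits[[1]]
test  <- splits[[2]]
c(count(train), count(test))
```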
* [SPARK-15925][SPARKR] R DataFrame add back registerTempTable, add tests (Felix Cheung, 2016-06-17, 1 file, -11/+19)
  Changes: Add registerTempTable back to the R DataFrame API, marked as deprecated, and add tests.
  Tested: unit tests (shivaram, liancheng).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13722 from felixcheung/rregistertemptable.
* [SPARK-15908][R] Add varargs-type dropDuplicates() function in SparkR (Dongjoon Hyun, 2016-06-16, 1 file, -0/+8)
  Changes: This PR adds a varargs-type `dropDuplicates` function to SparkR for API parity. Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
  Tested: Pass the Jenkins tests with new test cases.
  Author: Dongjoon Hyun <dongjoon@apache.org>. Closes #13684 from dongjoon-hyun/SPARK-15908.
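  A sketch of the varargs form, deduplicating on a subset of columns; assumes an active SparkR session, with made-up column names.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(a = c(1, 1, 2),
                                 b = c("x", "x", "y"),
                                 c = c(10, 20, 30)))

# Only columns a and b are considered when deciding which rows are duplicates.
nrow(dropDuplicates(df, "a", "b"))
```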
* [SPARK-12922][SPARKR][WIP] Implement gapply() on DataFrame in SparkR (Narine Kokhlikyan, 2016-06-15, 2 files, -48/+155)
  Changes: gapply() applies an R function to groups of a DataFrame, grouped by one or more columns, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API. Please let me know what you think and if you have any ideas to improve it.
  Tested: Unit tests: a primitive test with different column types, adding a boolean column, and computing an average by group.
  Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>, NarineK <narine.kokhlikyan@us.ibm.com>. Closes #12836 from NarineK/gapply2.
* [SPARK-15637][SPARK-15931][SPARKR] Fix R masked functions checks (Felix Cheung, 2016-06-15, 1 file, -9/+18)
  Changes: Because of the fix in SPARK-15684, this exclusion is no longer necessary.
  Tested: unit tests (shivaram).
  Author: Felix Cheung <felixcheung_m@hotmail.com>. Closes #13636 from felixcheung/rendswith.
* [SPARK-15925][SQL][SPARKR] Replaces registerTempTable with createOrReplaceTempView (Cheng Lian, 2016-06-13, 1 file, -7/+8)
  Changes: This PR replaces `registerTempTable` with `createOrReplaceTempView` as a follow-up task of #12945.
  Tested: Existing SparkR tests.
  Author: Cheng Lian <lian@databricks.com>. Closes #13644 from liancheng/spark-15925-temp-view-for-r.
* [SPARK-15684][SPARKR] Not mask startsWith and endsWith in R (wm624@hotmail.com, 2016-06-07, 1 file, -0/+7)
  Changes: R 3.3.0 added startsWith and endsWith; this PR makes the two work in SparkR by (1) removing the signatures from generics.R, (2) adding setMethod in column.R, and (3) adding unit tests.
  Tested: Manually tested through the SparkR shell for both column data and string data, which are added to the unit test file.
  Author: wm624@hotmail.com <wm624@hotmail.com>. Closes #13476 from wangmiao1981/start.
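  A sketch showing the SparkR Column methods coexisting with the base R 3.3.0 string functions, assuming an active SparkR session.

```r
library(SparkR)
sparkR.session()

df <- createDataFrame(data.frame(name = c("Andy", "Michael", "Justin"),
                                 stringsAsFactors = FALSE))

head(filter(df, startsWith(df$name, "A")))  # Column method, runs on Spark
startsWith("Andy", "A")                     # base R (>= 3.3.0) still works on character vectors
```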
* [SPARK-15637][SPARKR] fix R tests on R 3.2.2 (felixcheung, 2016-05-28, 1 file, -1/+1)
  Changes: Change the version check in R tests.
  Tested: R tests (shivaram).
  Author: felixcheung <felixcheung_m@hotmail.com>. Closes #13369 from felixcheung/rversioncheck.
* [SPARK-8603][SPARKR] Use shell() instead of system2() for SparkR on Windows (hyukjinkwon, 2016-05-26, 3 files, -7/+32)
  Changes: This PR corrects SparkR to use `shell()` instead of `system2()` on Windows. `system2(...)` on Windows does not process the Windows file separator `\`, while `shell(translate = TRUE, ...)` handles it, so the call is now chosen according to the OS. Existing tests failed on Windows because of this, for example "sparkJars tag in SparkContext" (test_includeJAR.R#34 and #36). This PR also fixes other tests that failed on Windows, "sparkJars sparkPackages as comma-separated strings" (test_context.R#128, #131, #134), which were caused by a quirk of `normalizePath()`: on Linux, if the path does not exist it just returns the input, but on Windows it prepends the current path:

```r
# On Linux
path <- normalizePath("aa")
print(path)
# [1] "aa"

# On Windows
path <- normalizePath("aa")
print(path)
# [1] "C:\\Users\\aa"
```

  Tested: Jenkins tests and manual testing on a Windows machine; the stdout of the testing is at https://gist.github.com/HyukjinKwon/4bf35184f3a30f3bce987a58ec2bbbab. Closes #7025.
  Author: hyukjinkwon <gurwls223@gmail.com>, Hyukjin Kwon <gurwls223@gmail.com>, Prakash PC <prakash.chinnu@gmail.com>. Closes #13165 from HyukjinKwon/pr/7025.
* [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton (felixcheung, 2016-05-26, 3 files, -203/+221)
  Changes: Eliminate the need to pass sqlContext to methods, since it is a singleton and we don't want to support multiple contexts in an R session. Changes are done in a backward-compatible way, with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean approach such that in the next release the deprecated signatures can be taken out easily and cleanly (just delete a few lines per method). Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R, and to avoid having to add 30 new exports.
  Author: felixcheung <felixcheung_m@hotmail.com>. Closes #9192 from felixcheung/rsqlcontext.
* [SPARK-15439][SPARKR] Failed to run unit test in SparkR (wm624@hotmail.com, 2016-05-25, 1 file, -1/+5)
  Changes: There are some failures when running the SparkR unit tests; this PR fixes two of them, in test_context.R and test_sparkSQL.R. The first is due to a different set of masked names, so the missing names are added to the expected arrays. The second is because a PR removed the logic from a previous fix for a missing subset method. A file-permission issue remains and is still being debugged: the SparkR shell can run the "pipeRDD() on RDDs" test case (`actual <- collect(pipeRDD(rdd, "more"))`) successfully, but the run-tests script complains: cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory.
  Tested: Manually.
  Author: wm624@hotmail.com <wm624@hotmail.com>. Closes #13284 from wangmiao1981/R.
* [SPARK-15397][SQL] fix string udf locate as hive (Daoyuan Wang, 2016-05-23, 1 file, -1/+1)
  Changes: In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1, and `locate("aa", "aaa", 2)` yields 2, while in Spark they yield 1, 2, and 0 respectively. This stems from a different understanding of the third parameter of the `locate` UDF: it is the starting index and starts from 1, so when 0 is passed the result should always be 0.
  Tested: Tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>. Closes #13186 from adrian-wang/locate.
* [SPARK-15202][SPARKR] add dapplyCollect() method for DataFrame in SparkR (Sun Rui, 2016-05-12, 1 file, -1/+20)
  Changes: dapplyCollect() applies an R function to each partition of a SparkDataFrame and collects the result back to R as a data.frame:

```r
dapplyCollect(df, function(ldf) { ... })
```

  Tested: SparkR unit tests.
  Author: Sun Rui <sunrui2016@gmail.com>. Closes #12989 from sun-rui/SPARK-15202.
* [MINOR][SPARKR] Update data-manipulation.R to use native csv reader (Yanbo Liang, 2016-05-09, 1 file, -4/+2)
  Changes: Since Spark now has a native csv reader, it is no longer necessary to use the third-party `spark-csv` package in `examples/src/main/r/data-manipulation.R`; all `spark-csv` usage is removed from SparkR. Also, running R applications through `sparkR` is not supported as of Spark 2.0, so the example is changed to run via `./bin/spark-submit`.
  Tested: Offline test.
  Author: Yanbo Liang <ybliang8@gmail.com>. Closes #13005 from yanboliang/r-df-examples.
* [SPARK-12479][SPARKR] sparkR collect on GroupedData throws R error "missing value where TRUE/FALSE needed" (Sun Rui, 2016-05-08, 1 file, -0/+4)
  Changes: This PR is a workaround for NA handling in hash code computation. It is submitted on behalf of paulomagalhaes, whose PR is https://github.com/apache/spark/pull/10436.
  Tested: SparkR unit tests.
  Author: Sun Rui <sunrui2016@gmail.com>, ray <ray@rays-MacBook-Air.local>. Closes #12976 from sun-rui/SPARK-12479.
* [SPARK-11395][SPARKR] Support over and window specification in SparkR (Sun Rui, 2016-05-05, 1 file, -0/+36)
  Changes: This PR (1) implements the WindowSpec S4 class, (2) implements Window.partitionBy() and Window.orderBy() as utility functions to create WindowSpec objects, and (3) implements over() for the Column class.
  Author: Sun Rui <rui.sun@intel.com>, Sun Rui <sunrui2016@gmail.com>. Closes #10094 from sun-rui/SPARK-11395.
* [SPARK-15110][SPARKR] Implement repartitionByColumn for SparkR DataFrames (NarineK, 2016-05-05, 1 file, -0/+36)
  Changes: Implement repartitioning by column on DataFrame. This will allow us to run R functions on each partition identified by column groups with the dapply() method.
  Tested: Unit tests.
  Author: NarineK <narine.kokhlikyan@us.ibm.com>. Closes #12887 from NarineK/repartitionByColumns.
* [SPARK-15091][SPARKR] Fix warnings and a failure in SparkR test cases with testthat version 1.0.1 (Sun Rui, 2016-05-03, 4 files, -10/+9)
  Changes: Fix warnings and a failure in SparkR test cases with testthat version 1.0.1.
  Tested: SparkR unit test cases.
  Author: Sun Rui <sunrui2016@gmail.com>. Closes #12867 from sun-rui/SPARK-15091.
* [SPARK-15030][ML][SPARKR] Support formula in spark.kmeans in SparkR (Yanbo Liang, 2016-04-30, 1 file, -6/+6)
  Changes: `RFormula` supports an empty response variable like `~ x + y`; support formula in `spark.kmeans` in SparkR; fix some outdated docs for SparkR.
  Tested: Unit tests.
  Author: Yanbo Liang <ybliang8@gmail.com>. Closes #12813 from yanboliang/spark-15030.