| Commit message (Collapse) | Author | Age | Files | Lines |
| |
SparkSubmitSuite
## What changes were proposed in this pull request?
This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
## How was this patch tested?
SparkR unit tests, SparkSubmitSuite, check-cran.sh
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14243 from shivaram/sparkr-jar-move.
| |
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-16055
When the `sparkPackages` argument is passed and we detect that we are in R script mode, we should print a warning saying that the `--packages` flag should be used with spark-submit.
## How was this patch tested?
Tested locally on my system.
Author: krishnakalyan3 <krishnakalyan3@gmail.com>
Closes #14179 from krishnakalyan3/spark-pkg.
| |
## What changes were proposed in this pull request?
Fix R SparkSession init/stop, and warnings of reusing existing Spark Context
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14177 from felixcheung/rsessiontest.
| |
## What changes were proposed in this pull request?
Add a check-cran.sh script that runs `R CMD check` as CRAN. Also fixes a number of issues pointed out by the check. These include
- Updating `DESCRIPTION` to be appropriate
- Adding a .Rbuildignore to ignore lintr, src-native, html that are non-standard files / dirs
- Adding aliases to all S4 methods in DataFrame, Column, GroupedData etc. This is required as stated in https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Documenting-S4-classes-and-methods
- Other minor fixes
## How was this patch tested?
SparkR unit tests, running the above mentioned script
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #14173 from shivaram/sparkr-cran-changes.
| |
functions
## What changes were proposed in this pull request?
More tests
I don't think this is critical for Spark 2.0.0 RC, maybe Spark 2.0.1 or 2.1.0.
## How was this patch tested?
unit tests
shivaram dongjoon-hyun
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14206 from felixcheung/rroutetests.
| |
functions
## What changes were proposed in this pull request?
Fix function routing to work with and without namespace operator `SparkR::createDataFrame`
## How was this patch tested?
manual, unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14195 from felixcheung/rroutedefault.
| |
windowPartitionBy and windowOrderBy.
## What changes were proposed in this pull request?
Rename window.partitionBy and window.orderBy to windowPartitionBy and windowOrderBy to pass CRAN package check.
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes #14192 from sun-rui/SPARK-16509.
| |
## What changes were proposed in this pull request?
Minor documentation update for code example, code style, and missed reference to "sparkR.init"
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14178 from felixcheung/rcsvprogrammingguide.
| |
## What changes were proposed in this pull request?
Minor example updates
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #14171 from felixcheung/rexample.
| |
## What changes were proposed in this pull request?
From SPARK-16140/PR #13921 - the issue is we left write.ml doc empty:
![image](https://cloud.githubusercontent.com/assets/8969467/16481934/856dd0ea-3e62-11e6-9474-e4d57d1ca001.png)
Here's what I meant as the fix:
![image](https://cloud.githubusercontent.com/assets/8969467/16481943/911f02ec-3e62-11e6-9d68-17363a9f5628.png)
![image](https://cloud.githubusercontent.com/assets/8969467/16481950/9bc057aa-3e62-11e6-8127-54870701c4b1.png)
I didn't realize there was already a JIRA on this. mengxr yanboliang
## How was this patch tested?
check doc generated.
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13993 from felixcheung/rmllibdoc.
| |
## What changes were proposed in this pull request?
* Update SparkR ML section to make them consistent with SparkR API docs.
* #13972 added labelling support for the ```include_example``` Jekyll plugin, so we can split the single ```ml.R``` example file into multiple line blocks with different labels and include them under different algorithms/models in the generated HTML page.
## How was this patch tested?
Docs-only update; manually checked the generated docs.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14011 from yanboliang/r-user-guide-update.
| |
## What changes were proposed in this pull request?
Currently, Spark `describe` supports `StringType` when columns are named explicitly. However, `describe()` without arguments returns a dataset that covers only the numeric columns. This PR makes `describe()` without arguments include `StringType` columns as well.
**Background**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe("age", "name").show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
```
**Before**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+
|summary| age|
+-------+------------------+
| count| 2|
| mean| 24.5|
| stddev|7.7781745930520225|
| min| 19|
| max| 30|
+-------+------------------+
```
**After**
```scala
scala> spark.read.json("examples/src/main/resources/people.json").describe().show()
+-------+------------------+-------+
|summary| age| name|
+-------+------------------+-------+
| count| 2| 3|
| mean| 24.5| null|
| stddev|7.7781745930520225| null|
| min| 19| Andy|
| max| 30|Michael|
+-------+------------------+-------+
```
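The intended semantics can be sketched outside Spark. The following plain-Python model is only an illustration (the `describe` helper and the in-memory `people` rows are assumptions, not Spark code): numeric columns get all five statistics, while string columns get `count`, `min`, and `max`, with `mean` and `stddev` left empty.

```python
import statistics

# Hypothetical in-memory stand-in for people.json (Michael has no age).
people = [
    {"name": "Michael", "age": None},
    {"name": "Andy", "age": 30},
    {"name": "Justin", "age": 19},
]

def describe(rows, columns):
    """Return {column: {statistic: value}}; null values are ignored."""
    summary = {}
    for col in columns:
        values = [row[col] for row in rows if row[col] is not None]
        numeric = bool(values) and all(isinstance(v, (int, float)) for v in values)
        summary[col] = {
            "count": len(values),
            "mean": statistics.mean(values) if numeric else None,
            # Sample standard deviation needs at least two values.
            "stddev": statistics.stdev(values) if numeric and len(values) > 1 else None,
            "min": min(values) if values else None,
            "max": max(values) if values else None,
        }
    return summary

summary = describe(people, ["age", "name"])
print(summary["age"])
print(summary["name"])
```

In this model a string column simply falls through the numeric check, which is the same shape as the **After** table: the `name` column appears, with nulls for `mean` and `stddev`.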
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14095 from dongjoon-hyun/SPARK-16429.
| |
## What changes were proposed in this pull request?
This PR prevents errors when `summary(df)` is called on a `SparkDataFrame` with non-numeric columns. The failure happens only in `SparkR`.
**Before**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
16/07/07 14:15:16 ERROR RBackendHandler: describe on 34 failed
Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
org.apache.spark.sql.AnalysisException: cannot resolve 'avg(`boolean`)' due to data type mismatch: function average requires numeric types, not BooleanType;
```
**After**
```r
> df <- createDataFrame(faithful)
> df <- withColumn(df, "boolean", df$waiting==79)
> summary(df)
SparkDataFrame[summary:string, eruptions:string, waiting:string]
```
## How was this patch tested?
Pass the Jenkins tests with an updated test case.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14096 from dongjoon-hyun/SPARK-16425.
| |
## What changes were proposed in this pull request?
Use "NA" as the default null string in R, matching the `na.strings` parameter of R's `read.csv`.
https://stat.ethz.ch/R-manual/R-devel/library/utils/html/read.table.html
na.strings = "NA"
A user reading a CSV file that contains NA values should get the same behavior from SparkR `read.df(..., source = "csv")`.
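As an illustration of the intended behavior, here is a minimal plain-Python sketch (not SparkR code; the `read_csv_with_na` helper and the sample data are hypothetical) in which any field equal to the NA string is read as a null:

```python
import csv
import io

def read_csv_with_na(text, na_string="NA"):
    """Parse CSV text with a header row, mapping fields equal to na_string to None."""
    reader = csv.DictReader(io.StringIO(text))
    return [
        {name: (None if value == na_string else value) for name, value in row.items()}
        for row in reader
    ]

sample = "name,age\nAndy,30\nMichael,NA\n"
rows = read_csv_with_na(sample)
print(rows)
```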
(couldn't open JIRA, will do that later)
## How was this patch tested?
unit tests
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13984 from felixcheung/rcsvnastring.
| |
available.
## What changes were proposed in this pull request?
ORC test should be enabled only when HiveContext is available.
## How was this patch tested?
Manual.
```
$ R/run-tests.sh
...
1. create DataFrame from RDD (test_sparkSQL.R#200) - Hive is not build with SparkSQL, skipped
2. test HiveContext (test_sparkSQL.R#1021) - Hive is not build with SparkSQL, skipped
3. read/write ORC files (test_sparkSQL.R#1728) - Hive is not build with SparkSQL, skipped
4. enableHiveSupport on SparkSession (test_sparkSQL.R#2448) - Hive is not build with SparkSQL, skipped
5. sparkJars tag in SparkContext (test_Windows.R#21) - This test is only for Windows, skipped
DONE ===========================================================================
Tests passed.
```
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14019 from dongjoon-hyun/SPARK-16233.
| |
deletion of R session temporary directory.
## What changes were proposed in this pull request?
Capture errors from R workers in daemon.R to avoid deletion of R session temporary directory. See detailed description at https://issues.apache.org/jira/browse/SPARK-16299
## How was this patch tested?
SparkR unit tests.
Author: Sun Rui <sunrui2016@gmail.com>
Closes #13975 from sun-rui/SPARK-16299.
| |
on each group similar to gapply and collect the result back to R data.frame
## What changes were proposed in this pull request?
gapplyCollect() runs gapply() on a SparkDataFrame and collects the result back to R. Compared to gapply() + collect(), gapplyCollect() offers a performance optimization as well as programming convenience, since no schema needs to be provided.
This is similar to dapplyCollect().
## How was this patch tested?
Added test cases for gapplyCollect similar to dapplyCollect
Author: Narine Kokhlikyan <narine@slice.com>
Closes #13760 from NarineK/gapplyCollect.
| |
## What changes were proposed in this pull request?
This PR implements the `posexplode` table-generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive's behavior.
**Before**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
```
**After**
```scala
scala> sql("select posexplode(map('a', 1, 'b', 2))").show
+---+---+-----+
|pos|key|value|
+---+---+-----+
| 0| a| 1|
| 1| b| 2|
+---+---+-----+
```
For an `array` argument, the behavior after this PR is the same as before.
```
scala> sql("select posexplode(array(1, 2, 3))").show
+---+---+
|pos|col|
+---+---+
| 0| 1|
| 1| 2|
| 2| 3|
+---+---+
```
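The row shapes in the two outputs above can be modelled in a few lines. This is a plain-Python sketch of the semantics only (the `posexplode` function here is a hypothetical stand-in, not the Spark implementation): a map yields `(pos, key, value)` rows and an array yields `(pos, col)` rows.

```python
from collections.abc import Mapping

def posexplode(value):
    """Return (pos, key, value) rows for a mapping, or (pos, item) rows for a sequence."""
    if isinstance(value, Mapping):
        return [(pos, key, val) for pos, (key, val) in enumerate(value.items())]
    return [(pos, item) for pos, item in enumerate(value)]

map_rows = posexplode({"a": 1, "b": 2})
array_rows = posexplode([1, 2, 3])
print(map_rows)    # [(0, 'a', 1), (1, 'b', 2)]
print(array_rows)  # [(0, 1), (1, 2), (2, 3)]
```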
## How was this patch tested?
Pass the Jenkins tests with newly added testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13971 from dongjoon-hyun/SPARK-16289.
| |
https://issues.apache.org/jira/browse/SPARK-16140
## What changes were proposed in this pull request?
Group the R doc of spark.kmeans, predict(KM), summary(KM), read/write.ml(KM) under Rd spark.kmeans. The example code was updated.
## How was this patch tested?
Tested on my local machine
`jekyll build` fails to build the API docs on my laptop, so I can only show the HTML I generated manually from the Rd files. No CSS is applied, but the doc content should be there.
![screenshotkmeans](https://cloud.githubusercontent.com/assets/3925641/16403203/c2c9ca1e-3ca7-11e6-9e29-f2164aee75fc.png)
Author: Xin Ren <iamshrek@126.com>
Closes #13921 from keypointt/SPARK-16140.
| |
## What changes were proposed in this pull request?
Fix wrong arguments description of ```survreg``` in SparkR.
## How was this patch tested?
```Arguments``` section of ```survreg``` doc before this PR (with wrong description for ```path``` and missing ```overwrite```):
![image](https://cloud.githubusercontent.com/assets/1962026/16447548/fe7a5ed4-3da1-11e6-8b96-b5bf2083b07e.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/16447617/368e0b18-3da2-11e6-8277-45640fb11859.png)
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #13970 from yanboliang/spark-16143-followup.
| |
## What changes were proposed in this pull request?
Add unit tests for CSV data in SparkR
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13904 from felixcheung/rcsv.
| |
## What changes were proposed in this pull request?
Update the comments in SparkR DataFrame.R:
SQLContext ==> SparkSession
## How was this patch tested?
N/A
Author: WeichenXu <WeichenXu123@outlook.com>
Closes #13946 from WeichenXu123/sparkR_comment_update_sparkSession.
| |
Dataset.show function.
## What changes were proposed in this pull request?
Allowing truncation to a specific number of characters is convenient at times, especially when working from the REPL. Sometimes those last few characters make all the difference, and showing everything brings in a whole lot of noise.
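A minimal sketch of what per-cell truncation to a given width could look like, in plain Python. The `truncate_cell` helper is hypothetical, and the ellipsis rule (append `...` only when the limit is at least 4 characters) is an assumption of this sketch rather than something stated in the patch description:

```python
def truncate_cell(text, truncate=20):
    """Shorten text to at most `truncate` characters; 0 or less disables truncation."""
    if truncate > 0 and len(text) > truncate:
        if truncate < 4:
            # Too narrow for an ellipsis: keep the leading characters only.
            return text[:truncate]
        return text[:truncate - 3] + "..."
    return text

print(truncate_cell("abcdefghij", 5))
print(truncate_cell("abcdefghij", 0))
```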
## How was this patch tested?
Existing tests. + 1 new test in DataFrameSuite.
For SparkR and pyspark, existing tests and manual testing.
Author: Prashant Sharma <prashsh1@in.ibm.com>
Author: Prashant Sharma <prashant@apache.org>
Closes #13839 from ScrapCodes/add_truncateTo_DF.show.
| |
## What changes were proposed in this pull request?
This PR groups `spark.survreg`, `summary(AFT)`, `predict(AFT)`, `write.ml(AFT)` for survival regression into a single Rd.
## How was this patch tested?
Manually checked generated HTML doc. See attached screenshots.
![screen shot 2016-06-27 at 10 28 20 am](https://cloud.githubusercontent.com/assets/15318264/16392008/a14cf472-3c5e-11e6-9ce5-490ed1a52249.png)
![screen shot 2016-06-27 at 10 28 35 am](https://cloud.githubusercontent.com/assets/15318264/16392009/a14e333c-3c5e-11e6-8bd7-c2e9ba71f8e2.png)
Author: Junyang Qian <junyangq@databricks.com>
Closes #13927 from junyangq/SPARK-16143.
| |
## What changes were proposed in this pull request?
Add `conf` method to get Runtime Config from SparkSession
## How was this patch tested?
unit tests, manual tests
This is how it works in sparkR shell:
```
SparkSession available as 'spark'.
> conf()
$hive.metastore.warehouse.dir
[1] "file:/opt/spark-2.0.0-bin-hadoop2.6/R/spark-warehouse"
$spark.app.id
[1] "local-1466749575523"
$spark.app.name
[1] "SparkR"
$spark.driver.host
[1] "10.0.2.1"
$spark.driver.port
[1] "45629"
$spark.executorEnv.LD_LIBRARY_PATH
[1] "$LD_LIBRARY_PATH:/usr/lib/R/lib:/usr/lib/x86_64-linux-gnu:/usr/lib/jvm/default-java/jre/lib/amd64/server"
$spark.executor.id
[1] "driver"
$spark.home
[1] "/opt/spark-2.0.0-bin-hadoop2.6"
$spark.master
[1] "local[*]"
$spark.sql.catalogImplementation
[1] "hive"
$spark.submit.deployMode
[1] "client"
> conf("spark.master")
$spark.master
[1] "local[*]"
```
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13885 from felixcheung/rconf.
| |
## What changes were proposed in this pull request?
This PR groups `spark.naiveBayes`, `summary(NB)`, `predict(NB)`, and `write.ml(NB)` into a single Rd.
## How was this patch tested?
Manually checked generated HTML doc. See attached screenshots.
![screen shot 2016-06-23 at 2 11 00 pm](https://cloud.githubusercontent.com/assets/829644/16320452/a5885e92-394c-11e6-994f-2ab5cddad86f.png)
![screen shot 2016-06-23 at 2 11 15 pm](https://cloud.githubusercontent.com/assets/829644/16320455/aad1f6d8-394c-11e6-8ef4-13bee989f52f.png)
Author: Xiangrui Meng <meng@databricks.com>
Closes #13877 from mengxr/SPARK-16142.
| |
## What changes were proposed in this pull request?
Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
Also updated roxygen2 doc and R programming guide on deprecations.
## How was this patch tested?
unit tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13838 from felixcheung/rjobgroup.
| |
## What changes were proposed in this pull request?
Guide for
- UDFs with dapply, dapplyCollect
- spark.lapply for running parallel R functions
## How was this patch tested?
build locally
<img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
Author: Kai Jiang <jiangkai@gmail.com>
Closes #13660 from vectorijk/spark-15672-R-guide-update.
| |
## What changes were proposed in this pull request?
This groups GLM methods (spark.glm, summary, print, predict and write.ml) in the documentation. The example code was updated.
## How was this patch tested?
N/A
![screen shot 2016-06-21 at 2 31 37 pm](https://cloud.githubusercontent.com/assets/15318264/16247077/f6eafc04-37bc-11e6-89a8-7898ff3e4078.png)
![screen shot 2016-06-21 at 2 31 45 pm](https://cloud.githubusercontent.com/assets/15318264/16247078/f6eb1c16-37bc-11e6-940a-2b595b10617c.png)
Author: Junyang Qian <junyangq@databricks.com>
Author: Junyang Qian <junyangq@Junyangs-MacBook-Pro.local>
Closes #13820 from junyangq/SPARK-16107.
| |
## What changes were proposed in this pull request?
Add `union` and deprecate `unionAll`; document `rbind` separately in roxygen2 (since their usage and parameter lists are quite different).
`explode` is also deprecated, but its replacement seems to be a combination of calls, so it is not yet clear whether we should deprecate it in SparkR.
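The deprecate-and-forward pattern described here can be sketched as follows. This is a plain-Python model with hypothetical function names (`union`, `union_all`), not the SparkR code: the old name keeps working, emits a deprecation warning, and delegates to the new one.

```python
import warnings

def union(left, right):
    """Concatenate two row lists, keeping duplicates."""
    return left + right

def union_all(left, right):
    """Deprecated alias that warns and forwards to union()."""
    warnings.warn("unionAll is deprecated, use union instead",
                  DeprecationWarning, stacklevel=2)
    return union(left, right)

# Record warnings so the deprecation message can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    combined = union_all([1, 2], [2, 3])

print(combined)
print([str(w.message) for w in caught])
```

Keeping the old entry point as a thin wrapper means existing scripts continue to run while users are nudged toward the new name.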
## How was this patch tested?
unit tests, manual checks for r doc
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13805 from felixcheung/runion.
| |
## What changes were proposed in this pull request?
Found these issues while reviewing for SPARK-16090
## How was this patch tested?
roxygen2 doc gen, checked output html
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13803 from felixcheung/rdocrd.
| |
with MLlib
## What changes were proposed in this pull request?
This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation.
Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested?
Existing unit tests.
Author: Xiangrui Meng <meng@databricks.com>
Closes #13801 from mengxr/SPARK-15177.1.
| |
DataFrame stats functions
## What changes were proposed in this pull request?
Doc only changes. Please see screenshots.
Before:
http://spark.apache.org/docs/latest/api/R/statfunctions.html
![image](https://cloud.githubusercontent.com/assets/8969467/15264110/cd458826-1924-11e6-85bd-8ee2e2e1a85f.png)
After
![image](https://cloud.githubusercontent.com/assets/8969467/16218452/b9e89f08-3732-11e6-969d-a3a1796e7ad0.png)
(please ignore the style differences - this is due to not having the css in my local copy)
This is still a bit weird. As discussed in SPARK-15237, I think the better approach is to separate out the DataFrame stats function instead of putting everything on one page. At least now it is clearer which description is on which function.
## How was this patch tested?
Build doc
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #13109 from felixcheung/rstatdoc.
| |
## What changes were proposed in this pull request?
I ran a full pass from A to Z and fixed the obvious duplications, improper grouping etc.
There are still more doc issues to be cleaned up.
## How was this patch tested?
manual tests
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13798 from felixcheung/rdocseealso.
| |
## What changes were proposed in this pull request?
This PR adds `pivot` function to SparkR for API parity. Since this PR is based on https://github.com/apache/spark/pull/13295 , mhnatiuk should be credited for the work he did.
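For readers unfamiliar with pivoting, the operation can be modelled in plain Python. This is a sketch under stated assumptions, not the SparkR API: the `pivot_sum` helper and the `sales` rows are hypothetical, and the aggregation is fixed to a sum for brevity.

```python
from collections import defaultdict

def pivot_sum(rows, group_col, pivot_col, value_col):
    """Group by group_col, turn distinct pivot_col values into columns, and sum value_col."""
    pivot_values = sorted({row[pivot_col] for row in rows})
    # Every group starts with one empty (None) cell per pivot value.
    table = defaultdict(lambda: dict.fromkeys(pivot_values))
    for row in rows:
        cells = table[row[group_col]]
        cells[row[pivot_col]] = (cells[row[pivot_col]] or 0) + row[value_col]
    return dict(table)

sales = [
    {"year": 2015, "course": "R", "earnings": 100},
    {"year": 2015, "course": "Scala", "earnings": 200},
    {"year": 2016, "course": "R", "earnings": 150},
    {"year": 2016, "course": "R", "earnings": 50},
]
result = pivot_sum(sales, "year", "course", "earnings")
print(result)
```

Groups with no rows for a given pivot value keep an empty cell, which corresponds to a null in the pivoted table.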
## How was this patch tested?
Pass the Jenkins tests (including new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13786 from dongjoon-hyun/SPARK-15294.
| |
## What changes were proposed in this pull request?
Removed unnecessary duplicated documentation in dapply and dapplyCollect.
This pull request creates separate R docs for dapply and dapplyCollect and links each one to the other.
## How was this patch tested?
Existing test cases.
Author: Narine Kokhlikyan <narine@slice.com>
Closes #13790 from NarineK/dapply-docs-fix.
| |
## What changes were proposed in this pull request?
This PR adds `since` tags to Roxygen documentation according to the previous documentation archive.
https://home.apache.org/~dongjoon/spark-2.0.0-docs/api/R/
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13734 from dongjoon-hyun/SPARK-14995.
| |
updates
## What changes were proposed in this pull request?
roxygen2 doc, programming guide, example updates
## How was this patch tested?
manual checks
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13751 from felixcheung/rsparksessiondoc.
| |
## What changes were proposed in this pull request?
This PR adds `spark_partition_id` virtual column function in SparkR for API parity.
The following example illustrates SparkR usage on a partitioned Parquet table created by `spark.range(10).write.mode("overwrite").parquet("/tmp/t1")`.
```r
> collect(select(read.parquet('/tmp/t1'), c('id', spark_partition_id())))
id SPARK_PARTITION_ID()
1 3 0
2 4 0
3 8 1
4 9 1
5 0 2
6 1 3
7 2 4
8 5 5
9 6 6
10 7 7
```
## How was this patch tested?
Pass the Jenkins tests (including new testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13768 from dongjoon-hyun/SPARK-16053.
| |
## What changes were proposed in this pull request?
fix code doc
## How was this patch tested?
manual
shivaram
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13782 from felixcheung/rcountdoc.
| |
## What changes were proposed in this pull request?
spark.lapply and setLogLevel
## How was this patch tested?
unit test
shivaram thunterdb
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13752 from felixcheung/rlapply.
| |
## What changes were proposed in this pull request?
This issue adds `read.orc/write.orc` to SparkR for API parity.
## How was this patch tested?
Pass the Jenkins tests (with new testcases).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13763 from dongjoon-hyun/SPARK-16051.
| |
## What changes were proposed in this pull request?
Add dropTempView and deprecate dropTempTable
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13753 from felixcheung/rdroptempview.
| |
## What changes were proposed in this pull request?
This PR adds `monotonically_increasing_id` column function in SparkR for API parity.
After this PR, SparkR supports the following.
```r
> df <- read.json("examples/src/main/resources/people.json")
> collect(select(df, monotonically_increasing_id(), df$name, df$age))
monotonically_increasing_id() name age
1 0 Michael NA
2 1 Andy 30
3 2 Justin 19
```
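As background on why the IDs above are consecutive: Spark's documentation describes the generated value as the partition ID in the upper 31 bits and the record number within the partition in the lower 33 bits, so a single-partition result counts 0, 1, 2. The plain-Python model below illustrates that layout (the `monotonically_increasing_ids` helper and the partition contents are hypothetical):

```python
def monotonically_increasing_ids(partitions):
    """Return one list of IDs per partition: (partition index << 33) + row offset."""
    return [
        [(index << 33) + offset for offset in range(len(rows))]
        for index, rows in enumerate(partitions)
    ]

single = monotonically_increasing_ids([["Michael", "Andy", "Justin"]])
double = monotonically_increasing_ids([["Michael", "Andy"], ["Justin"]])
print(single)  # [[0, 1, 2]]
print(double)  # [[0, 1], [8589934592]]
```

With more than one partition the IDs stay unique and increasing but are no longer consecutive, which is why the function name promises monotonicity rather than a dense sequence.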
## How was this patch tested?
Pass the Jenkins tests (with added testcase).
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13774 from dongjoon-hyun/SPARK-16059.
| |
## What changes were proposed in this pull request?
This PR introduces the new SparkSession API for SparkR.
`sparkR.session.getOrCreate()` and `sparkR.session.stop()`
"getOrCreate" is a bit unusual in R but it's important to name this clearly.
SparkR implementation should
- SparkSession is the main entrypoint (vs SparkContext; due to limited functionality supported with SparkContext in SparkR)
- SparkSession replaces SQLContext and HiveContext (both a wrapper around SparkSession, and because of API changes, supporting all 3 would be a lot more work)
- Changes to SparkSession are mostly transparent to users due to SPARK-10903
- Full backward compatibility is expected - users should be able to initialize everything just as in Spark 1.6.1 (`sparkR.init()`), but with a deprecation warning
- Mostly cosmetic changes to parameter list - users should be able to move to `sparkR.session.getOrCreate()` easily
- An advanced syntax with named parameters (aka varargs aka "...") is supported; that should be closer to the Builder syntax that is in Scala/Python (which unfortunately does not work in R because it will look like this: `enableHiveSupport(config(config(master(appName(builder(), "foo"), "local"), "first", "value"), "next", "value"))`)
- Updating config on an existing SparkSession is supported, the behavior is the same as Python, in which config is applied to both SparkContext and SparkSession
- Some SparkSession changes are not matched in SparkR, mostly because they would be breaking API changes: `catalog` object, `createOrReplaceTempView`
- Other SQLContext workarounds are replicated in SparkR, e.g. `tables`, `tableNames`
- `sparkR` shell is updated to use the SparkSession entrypoint (`sqlContext` is removed, just like with Scala/Python)
- All tests are updated to use the SparkSession entrypoint
- A bug in `read.jdbc` is fixed
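The named-parameter style and the get-or-create behavior listed above can be modelled in a few lines. This is a plain-Python sketch with hypothetical names (`get_or_create`, `_session`), not the SparkR API: the first call creates the session, and later calls return the same object while merging any new config into it.

```python
_session = None  # hypothetical module-level singleton

def get_or_create(master="local", app_name="SparkR", enable_hive_support=True, **config):
    """Create the session on the first call; later calls only merge in new config."""
    global _session
    if _session is None:
        _session = {
            "master": master,
            "appName": app_name,
            "enableHiveSupport": enable_hive_support,
            "config": {},
        }
    _session["config"].update(config)
    return _session

# One flat call with named parameters replaces a chain of nested builder calls.
first = get_or_create(app_name="foo", **{"first": "value"})
second = get_or_create(**{"next": "value"})
print(first is second)
print(second["config"])
```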
TODO
- [x] Add more tests
- [ ] Separate PR - update all roxygen2 doc coding example
- [ ] Separate PR - update SparkR programming guide
## How was this patch tested?
unit tests, manual tests
shivaram sun-rui rxin
Author: Felix Cheung <felixcheung_m@hotmail.com>
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #13635 from felixcheung/rsparksession.
| |
## What changes were proposed in this pull request?
This PR adds `randomSplit` to SparkR for API parity.
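As a rough illustration of what a weighted random split does, here is a plain-Python model. It is a sketch under stated assumptions (the `random_split` helper is hypothetical and is not how Spark samples partitions): weights are normalized to sum to 1, and each row is assigned to exactly one split by a seeded random draw.

```python
import random
from itertools import accumulate

def random_split(rows, weights, seed=None):
    """Assign each row to one split, with probability proportional to its weight."""
    total = sum(weights)
    # Cumulative upper bounds of each split's share of the unit interval.
    bounds = list(accumulate(weight / total for weight in weights))
    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(bounds):
            # The last split also absorbs any floating-point rounding gap.
            if draw < upper or index == len(bounds) - 1:
                splits[index].append(row)
                break
    return splits

train, test = random_split(list(range(10)), [0.8, 0.2], seed=42)
print(len(train), len(test))
```

Because the assignment is random, split sizes only approximate the requested proportions; fixing the seed makes the split reproducible.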
## How was this patch tested?
Pass the Jenkins tests (with new testcase.)
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13721 from dongjoon-hyun/SPARK-16005.
| |
## What changes were proposed in this pull request?
Add `registerTempTable` to DataFrame, marked as deprecated
## How was this patch tested?
unit tests
shivaram liancheng
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #13722 from felixcheung/rregistertemptable.
| |
## What changes were proposed in this pull request?
This PR adds varargs-type `dropDuplicates` function to SparkR for API parity.
Refer to https://issues.apache.org/jira/browse/SPARK-15807, too.
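A varargs-style signature means the column names are passed as separate arguments rather than as one vector. The plain-Python model below illustrates the idea (the `drop_duplicates` helper and the sample rows are hypothetical; this sketch keeps the first row of each group, which is an assumption of the model and not a guarantee of the Spark implementation):

```python
def drop_duplicates(rows, *cols):
    """Keep one row per distinct combination of cols (all columns if none are given)."""
    seen = set()
    kept = []
    for row in rows:
        names = cols or tuple(sorted(row))
        key = tuple(row[name] for name in names)
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

people = [
    {"name": "Andy", "age": 30},
    {"name": "Andy", "age": 30},
    {"name": "Andy", "age": 19},
]
print(drop_duplicates(people))
print(drop_duplicates(people, "name"))
print(drop_duplicates(people, "name", "age"))
```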
## How was this patch tested?
Pass the Jenkins tests with new testcases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13684 from dongjoon-hyun/SPARK-15908.
| |
changes
## What changes were proposed in this pull request?
R docs changes, including fixes for typos, formatting, and layout.
## How was this patch tested?
Test locally.
Author: Kai Jiang <jiangkai@gmail.com>
Closes #13394 from vectorijk/spark-15490.
| |
## What changes were proposed in this pull request?
gapply() applies an R function on groups grouped by one or more columns of a DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() in the Dataset API.
Please let me know what you think and whether you have any ideas to improve it.
Thank you!
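The group-then-apply behavior of gapply() described above can be modelled in plain Python. This is only a sketch under stated assumptions (the `gapply` and `average_waiting` functions and the sample rows are hypothetical, and everything runs locally rather than on distributed data): rows are grouped by the key columns, the user function is called once per group, and the per-group outputs are concatenated.

```python
from collections import defaultdict

def gapply(rows, key_cols, func):
    """Group rows by key_cols, call func(key, group) per group, and concatenate the results."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[col] for col in key_cols)].append(row)
    result = []
    for key, group in groups.items():
        result.extend(func(key, group))
    return result

def average_waiting(key, group):
    """Example user function: one output row holding the group's mean waiting time."""
    mean = sum(row["waiting"] for row in group) / len(group)
    return [{"eruptions": key[0], "avg_waiting": mean}]

data = [
    {"eruptions": 2, "waiting": 50},
    {"eruptions": 2, "waiting": 60},
    {"eruptions": 4, "waiting": 80},
]
averages = gapply(data, ["eruptions"], average_waiting)
print(averages)
```

The user function may return any number of rows per group, which is what makes the operation resemble a flat-map over groups.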
## How was this patch tested?
Unit tests.
1. Primitive test with different column types
2. Add a boolean column
3. Compute average by a group
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Author: NarineK <narine.kokhlikyan@us.ibm.com>
Closes #12836 from NarineK/gapply2.