author | Hossein <hossein@databricks.com> | 2016-10-12 10:32:38 -0700
---|---|---
committer | Felix Cheung <felixcheung@apache.org> | 2016-10-12 10:32:38 -0700
commit | 5cc503f4fe9737a4c7947a80eecac053780606df (patch) |
tree | 02cfea5ff7007d7375b17786880d55a6867eedb7 /.github |
parent | d5580ebaa086b9feb72d5428f24c5b60cd7da745 (diff) |
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?
If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling, which allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
I tested this on my MacBook. The following code works with this patch:
```R
intMax <- .Machine$integer.max               # 2^31 - 1
largeVec <- 1:intMax                         # ~2^31 integers, well over 2GB serialized
rdd <- SparkR:::parallelize(sc, largeVec, 2) # previously failed past the INT_MAX limit
```
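The file-based transfer described above depends on a record framing that `PythonRDD.readRDDFromFile` can consume. As a hedged sketch (not the patch's actual code), assuming each element is written as a 4-byte big-endian length followed by its serialized bytes, the round trip looks like this; `write_records`/`read_records` are hypothetical helper names:

```python
import os
import struct
import tempfile

def write_records(path, records):
    # Write each serialized record as a 4-byte big-endian length
    # prefix followed by the raw payload bytes.
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack(">i", len(rec)))
            f.write(rec)

def read_records(path):
    # Read back the same framing, stopping cleanly at end-of-file.
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (n,) = struct.unpack(">i", header)
            out.append(f.read(n))
    return out

path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, [b"alpha", b"beta"])
print(read_records(path))  # [b'alpha', b'beta']
```

Because the length prefix is per record rather than for the whole payload, the total file size is free to exceed `INT_MAX` bytes, which is the point of routing large data through a file.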
## How was this patch tested?
* [x] Unit tests
Author: Hossein <hossein@databricks.com>
Closes #15375 from falaki/SPARK-17790.
Diffstat (limited to '.github')
0 files changed, 0 insertions, 0 deletions