author    Hossein <hossein@databricks.com>  2016-10-12 10:32:38 -0700
committer Felix Cheung <felixcheung@apache.org>  2016-10-12 10:32:38 -0700
commit    5cc503f4fe9737a4c7947a80eecac053780606df
tree      02cfea5ff7007d7375b17786880d55a6867eedb7
parent    d5580ebaa086b9feb72d5428f24c5b60cd7da745
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?

If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer data to the JVM. The serialization protocol mimics Python pickling, which allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD. I tested this on my MacBook; the following code works with this patch:

```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```

## How was this patch tested?

* [x] Unit tests

Author: Hossein <hossein@databricks.com>

Closes #15375 from falaki/SPARK-17790.
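For context on the file-based transfer, here is a minimal, illustrative sketch in R; it is not the patch's actual SparkR internals. It spills serialized slices to a temp file as length-prefixed binary records, the general shape of data a `PythonRDD.readRDDFromFile`-style reader can consume. The helper name `writeSlicesToTempFile` and the 4-byte big-endian framing are assumptions for illustration:

```R
# Illustrative sketch only (not the patch's actual code): write R objects to a
# temp file as length-prefixed binary records, so arbitrarily large data can be
# handed to the JVM via the filesystem instead of an in-memory transfer capped
# at INT_MAX bytes. All helper names here are hypothetical.
writeSlicesToTempFile <- function(slices) {
  fileName <- tempfile(fileext = ".tmp")
  con <- file(fileName, open = "wb")
  for (slice in slices) {
    bytes <- serialize(slice, connection = NULL)              # slice -> raw vector
    writeBin(length(bytes), con, size = 4L, endian = "big")   # 4-byte length prefix
    writeBin(bytes, con)                                      # record payload
  }
  close(con)
  fileName
}

# Usage: split a vector into slices and spill them to disk; the JVM side can
# then read the records back one length-prefixed chunk at a time.
slices <- split(1:1000, cut(seq_along(1:1000), 2, labels = FALSE))
path <- writeSlicesToTempFile(slices)
file.size(path)  # serialized bytes now live on disk for the JVM to read
```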
Diffstat (limited to '.github')
0 files changed, 0 insertions, 0 deletions