author | Hossein <hossein@databricks.com> | 2016-10-12 10:32:38 -0700
---|---|---
committer | Felix Cheung <felixcheung@apache.org> | 2016-10-12 10:32:38 -0700
commit | 5cc503f4fe9737a4c7947a80eecac053780606df (patch) |
tree | 02cfea5ff7007d7375b17786880d55a6867eedb7 /.github |
parent | d5580ebaa086b9feb72d5428f24c5b60cd7da745 (diff) |
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?
If the R data structure being parallelized is larger than `INT_MAX`, we use files to transfer the data to the JVM. The serialization protocol mimics Python pickling, which allows us to simply call `PythonRDD.readRDDFromFile` to create the RDD.
I tested this on my MacBook. The following code works with this patch:
```R
intMax <- .Machine$integer.max               # 2^31 - 1
largeVec <- 1:intMax                         # ~2^31 integers, well over 2GB serialized
rdd <- SparkR:::parallelize(sc, largeVec, 2) # previously failed past the INT_MAX limit
```
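The file-based transfer described above depends on a record framing that `PythonRDD.readRDDFromFile` can consume. As a hedged sketch (not the patch's actual code), assuming each element is written as a 4-byte big-endian length followed by its serialized bytes, the round trip looks like this; `write_records`/`read_records` are hypothetical helper names:

```python
import os
import struct
import tempfile

def write_records(path, records):
    # Write each serialized record as a 4-byte big-endian length
    # prefix followed by the raw payload bytes.
    with open(path, "wb") as f:
        for rec in records:
            f.write(struct.pack(">i", len(rec)))
            f.write(rec)

def read_records(path):
    # Read back the same framing, stopping cleanly at end-of-file.
    out = []
    with open(path, "rb") as f:
        while True:
            header = f.read(4)
            if len(header) < 4:
                break
            (n,) = struct.unpack(">i", header)
            out.append(f.read(n))
    return out

path = os.path.join(tempfile.mkdtemp(), "records.bin")
write_records(path, [b"alpha", b"beta"])
print(read_records(path))  # [b'alpha', b'beta']
```

Because the length prefix is per record rather than for the whole payload, the total file size is free to exceed `INT_MAX` bytes, which is the point of routing large data through a file.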
## How was this patch tested?
* [x] Unit tests
Author: Hossein <hossein@databricks.com>
Closes #15375 from falaki/SPARK-17790.
Diffstat (limited to '.github')
0 files changed, 0 insertions, 0 deletions