author: zero323 <matthew.szymkiewicz@gmail.com> 2015-11-15 19:15:27 -0800
committer: Shivaram Venkataraman <shivaram@cs.berkeley.edu> 2015-11-15 19:15:27 -0800
commit: d7d9fa0b8750166f8b74f9bc321df26908683a8b (patch)
tree: cbd4e96432c4f54ae07b5417eb97a22db7875b9a /make-distribution.sh
parent: 72c1d68b4ab6acb3f85971e10947caabb4bd846d (diff)
[SPARK-11086][SPARKR] Use dropFactors column-wise instead of nested loop when createDataFrame
Use `dropFactors` column-wise instead of nested loop when `createDataFrame` from a `data.frame`
At the moment, SparkR's `createDataFrame` uses a nested loop to convert factors to characters when called on a local `data.frame`. It works, but it is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a data.frame of 1M rows x 2 columns).
A simple improvement is to apply `dropFactors` column-wise and then reshape the output list.
It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
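The idea above can be sketched in plain R. This is a minimal illustration, not the actual SparkR implementation: `dropFactors` is approximated here by a simple `as.character` conversion, and the variable names are hypothetical.

```r
# Illustrative sketch of the column-wise approach (assumed names, not SparkR code).
localDF <- data.frame(x = factor(c("a", "b", "c")), y = c(1, 2, 3))

# Column-wise: one vectorized conversion per column, instead of visiting
# every cell of every row in a nested loop.
cols <- lapply(localDF, function(col) {
  if (is.factor(col)) as.character(col) else col
})

# Reshape the list of columns back into a list of rows, the shape the
# per-row serialization step expects.
rows <- lapply(seq_len(nrow(localDF)), function(i) {
  lapply(cols, `[[`, i)
})
```

The win comes from replacing per-cell `is.factor` checks and conversions with a single vectorized `as.character` call per column, which is what makes the column-wise version roughly two orders of magnitude faster on wide-but-tall local data frames.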
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9099 from zero323/SPARK-11086.
Diffstat (limited to 'make-distribution.sh')
0 files changed, 0 insertions, 0 deletions