[SPARK-16008][ML] Remove unnecessary serialization in logistic regression - spark


author	sethah <seth.hendrickson16@gmail.com>	2016-06-17 09:58:49 -0700
committer	Xiangrui Meng <meng@databricks.com>	2016-06-17 09:58:49 -0700
commit	1f0a46958ef51a01560ada23665dccde89696e12 (patch)
tree	78b28323c95b745de4a697056b90e19e5069875c /common
parent	34d6c4cd113729fcc1d0bc1df8916d06b8854922 (diff)
download	spark-1f0a46958ef51a01560ada23665dccde89696e12.tar.gz spark-1f0a46958ef51a01560ada23665dccde89696e12.tar.bz2 spark-1f0a46958ef51a01560ada23665dccde89696e12.zip

[SPARK-16008][ML] Remove unnecessary serialization in logistic regression

JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?

`LogisticAggregator` stores references to two arrays of dimension `numFeatures`, which are unnecessarily serialized before the combine op. This makes the shuffle write ~3x larger than it should be (in MLlib, for instance, it is 3x smaller); for multiclass logistic regression the factor will be even higher. This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids the serialization.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). A data set of 2M rows and 10k features showed a >2x iteration speedup.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13729 from sethah/lr_improvement.
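The effect described above can be illustrated outside Spark. The following is a minimal sketch, not the code from this patch: it uses Python's `pickle` as a stand-in for JVM serialization, and the class names `FieldAggregator` and `ParamAggregator`, the array names, and the simplified update rule are all hypothetical. It only shows that an aggregator holding references to two extra length-`numFeatures` arrays serializes to roughly three times the size of one that receives them as `add` arguments.

```python
import pickle


class FieldAggregator:
    """Hypothetical 'before' shape: keeps the two per-feature arrays as fields,
    so they are serialized together with the aggregator state."""

    def __init__(self, coefficients, features_std):
        self.coefficients = coefficients
        self.features_std = features_std
        self.gradient_sum = [0.0] * len(coefficients)
        self.weight_sum = 0.0

    def add(self, features, weight=1.0):
        # Placeholder update, not the real logistic-loss gradient.
        for i, x in enumerate(features):
            if self.features_std[i] != 0.0:
                scaled = x / self.features_std[i]
                self.gradient_sum[i] += weight * self.coefficients[i] * scaled
        self.weight_sum += weight
        return self


class ParamAggregator:
    """Hypothetical 'after' shape: the arrays are passed to add(), so only the
    accumulated state travels with the aggregator."""

    def __init__(self, num_features):
        self.gradient_sum = [0.0] * num_features
        self.weight_sum = 0.0

    def add(self, features, coefficients, features_std, weight=1.0):
        # Same placeholder update as above, reading the arrays from arguments.
        for i, x in enumerate(features):
            if features_std[i] != 0.0:
                scaled = x / features_std[i]
                self.gradient_sum[i] += weight * coefficients[i] * scaled
        self.weight_sum += weight
        return self


num_features = 10_000
coefficients = [0.5] * num_features
features_std = [2.0] * num_features
row = [1.0] * num_features

field_agg = FieldAggregator(coefficients, features_std).add(row)
param_agg = ParamAggregator(num_features).add(row, coefficients, features_std)

# Size of what would be shipped if each aggregator were serialized as a whole.
field_bytes = len(pickle.dumps(field_agg))
param_bytes = len(pickle.dumps(param_agg))

print(f"fields: {field_bytes} bytes, parameters: {param_bytes} bytes")
```

Both variants accumulate the same state; only the serialized footprint differs. In Spark the arrays would still have to reach the executors by some other route, which this sketch does not model.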

Diffstat (limited to 'common')

0 files changed, 0 insertions, 0 deletions

