author: sethah <seth.hendrickson16@gmail.com> 2016-06-17 09:58:49 -0700
committer: Xiangrui Meng <meng@databricks.com> 2016-06-17 09:58:49 -0700
commit: 1f0a46958ef51a01560ada23665dccde89696e12 (patch)
tree: 78b28323c95b745de4a697056b90e19e5069875c /common
parent: 34d6c4cd113729fcc1d0bc1df8916d06b8854922 (diff)
[SPARK-16008][ML] Remove unnecessary serialization in logistic regression
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008)

## What changes were proposed in this pull request?

`LogisticAggregator` stores references to two arrays of dimension `numFeatures`. Both arrays are serialized along with the aggregator before the combine op, which is unnecessary. As a result, the shuffle write is ~3x larger than it should be (in MLlib, for instance, it is 3x smaller), and for multiclass logistic regression this factor will go up. This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids the serialization.

## How was this patch tested?

I tested this locally and verified the serialization reduction.

![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png)

Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). A data set of 2M rows and 10k features showed a >2x iteration speedup.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #13729 from sethah/lr_improvement.
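
The effect described above can be illustrated with a minimal, standalone sketch. This is not the Spark code: the class names `FieldAggregator` and `ParamAggregator`, the helper `accumulate`, and the simplified binary logistic update (no intercept, no instance weights) are all assumptions made for illustration, and plain Java serialization stands in for whatever serializer a real job uses. It only shows why an aggregator that holds two extra `numFeatures`-sized arrays as fields serializes to roughly three times the size of one that receives them as `add` parameters.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Arrays;

public class AggregatorSerializationSketch {

    // Hypothetical "before" shape: the two numFeatures-sized inputs are fields,
    // so they are written out every time the aggregator is serialized.
    static class FieldAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        private final double[] coefficients;
        private final double[] featuresStd;
        private final double[] gradientSum;
        private double lossSum = 0.0;

        FieldAggregator(double[] coefficients, double[] featuresStd) {
            this.coefficients = coefficients;
            this.featuresStd = featuresStd;
            this.gradientSum = new double[coefficients.length];
        }

        FieldAggregator add(double label, double[] features) {
            lossSum += accumulate(label, features, coefficients, featuresStd, gradientSum);
            return this;
        }

        double[] gradientSum() { return gradientSum; }
    }

    // Hypothetical "after" shape: only the accumulated state is a field;
    // the two arrays arrive as method parameters and are never serialized with it.
    static class ParamAggregator implements Serializable {
        private static final long serialVersionUID = 1L;
        private final double[] gradientSum;
        private double lossSum = 0.0;

        ParamAggregator(int numFeatures) {
            this.gradientSum = new double[numFeatures];
        }

        ParamAggregator add(double label, double[] features,
                            double[] coefficients, double[] featuresStd) {
            lossSum += accumulate(label, features, coefficients, featuresStd, gradientSum);
            return this;
        }

        double[] gradientSum() { return gradientSum; }
    }

    // Simplified binary logistic update (assumption: no intercept, unit weight).
    // Adds this sample's gradient into grad and returns its loss.
    static double accumulate(double label, double[] x, double[] coef,
                             double[] std, double[] grad) {
        double margin = 0.0;
        for (int i = 0; i < x.length; i++) {
            if (std[i] != 0.0) margin -= coef[i] * x[i] / std[i];
        }
        double multiplier = 1.0 / (1.0 + Math.exp(margin)) - label;
        for (int i = 0; i < x.length; i++) {
            if (std[i] != 0.0) grad[i] += multiplier * x[i] / std[i];
        }
        double logTerm = Math.log1p(Math.exp(margin));
        return label > 0 ? logTerm : logTerm - margin;
    }

    // Size in bytes of the object under plain Java serialization.
    static int serializedSize(Serializable obj) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(obj);
            }
            return bytes.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        int numFeatures = 10_000;
        double[] coef = new double[numFeatures];
        double[] std = new double[numFeatures];
        double[] x = new double[numFeatures];
        Arrays.fill(coef, 0.01);
        Arrays.fill(std, 1.0);
        Arrays.fill(x, 0.5);

        FieldAggregator before = new FieldAggregator(coef, std).add(1.0, x);
        ParamAggregator after = new ParamAggregator(numFeatures).add(1.0, x, coef, std);

        System.out.println("fields as state:  " + serializedSize(before) + " bytes");
        System.out.println("arrays as params: " + serializedSize(after) + " bytes");
    }
}
```

Both variants compute the same gradient; the only difference is which arrays travel with the object when it is serialized. In a real job the caller would still need to make the two arrays available on each executor by some other route, which this sketch does not model.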
Diffstat (limited to 'common')
0 files changed, 0 insertions, 0 deletions