[SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure - spark

author	Xiangrui Meng <meng@databricks.com>	2014-07-26 22:56:07 -0700
committer	Reynold Xin <rxin@apache.org>	2014-07-26 22:56:07 -0700
commit	aaf2b735fddbebccd28012006ee4647af3b3624f (patch)
tree	eb132ba2fa45cddaf7730628403e836afecb34e3 /python
parent	b547f69bdb5f4a6d5f471a2d998c2df6fb2a9347 (diff)
download	spark-aaf2b735fddbebccd28012006ee4647af3b3624f.tar.gz spark-aaf2b735fddbebccd28012006ee4647af3b3624f.tar.bz2 spark-aaf2b735fddbebccd28012006ee4647af3b3624f.zip

[SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure

We saw task serialization problems with large feature dimensions, which can be avoided if we use broadcast variables instead of serializing data directly into the task closure. This PR uses broadcast in both training and prediction, and adds tests to make sure the task size stays small.

Author: Xiangrui Meng <meng@databricks.com>

Closes #1427 from mengxr/broadcast-new and squashes the following commits:

b9a1228 [Xiangrui Meng] style update
b97c184 [Xiangrui Meng] minimal change to LBFGS
9ebadcc [Xiangrui Meng] add task size test to RowMatrix
9427bf0 [Xiangrui Meng] add task size tests to linear methods
e0a5cf2 [Xiangrui Meng] add task size test to GD
28a8411 [Xiangrui Meng] add test for NaiveBayes
380778c [Xiangrui Meng] update KMeans test
bccab92 [Xiangrui Meng] add task size test to LBFGS
02103ba [Xiangrui Meng] remove print
e73d68e [Xiangrui Meng] update tests for k-means
174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
1928a5a [Xiangrui Meng] add test for KMeans task size
e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
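
The size effect described in the commit message can be illustrated with a small, plain-Python simulation. This is not Spark code and not the code from this commit: `broadcast`, `broadcast_value`, `task_payload_size`, and the in-process `_BROADCAST_STORE` are hypothetical stand-ins, and `pickle` is assumed here only as a rough proxy for task serialization. The sketch compares the bytes shipped per task when a large weight vector is captured directly versus when only a small handle is captured.

```python
import pickle

# Hypothetical stand-in for a per-worker broadcast store (not a Spark API).
_BROADCAST_STORE = {}


def broadcast(value):
    """Register a value once and return a small integer handle to it."""
    handle = len(_BROADCAST_STORE)
    _BROADCAST_STORE[handle] = value
    return handle


def broadcast_value(handle):
    """Look up the value behind a handle, as a task would on a worker."""
    return _BROADCAST_STORE[handle]


def task_payload_size(captured):
    """Bytes needed to serialize the state captured by one task."""
    return len(pickle.dumps(captured))


def predict(features, weights):
    """Dot product, standing in for a linear model's prediction."""
    return sum(f * w for f, w in zip(features, weights))


# A model with a large feature dimension.
weights = [0.5] * 100_000
features = [1.0] * 100_000

# Case 1: the weights are captured directly, so every task carries them.
closure_size = task_payload_size(weights)

# Case 2: only the handle is captured; the weights are shipped once.
handle = broadcast(weights)
handle_size = task_payload_size(handle)

result_closure = predict(features, weights)
result_broadcast = predict(features, broadcast_value(handle))

print("bytes per task, direct capture:", closure_size)
print("bytes per task, handle only:   ", handle_size)
```

Both paths compute the same prediction; only the per-task payload differs. In Spark itself the corresponding pattern is to create the variable with `SparkContext.broadcast` and read it inside the task through `.value`.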

Diffstat (limited to 'python')

0 files changed, 0 insertions, 0 deletions

