[SPARK-4531] [MLlib] cache serialized java object

Pyrolite is quite slow compared to the ad hoc serializer in 1.1, and this causes a significant performance regression in 1.2: we cache the serialized Python objects in the JVM and deserialize them into Java objects at every step.

This PR caches the deserialized JavaRDD instead of the PythonRDD, so the Pyrolite deserialization is not repeated. Memory usage should be similar to before, but training should be much faster.

Author: Davies Liu <davies@databricks.com>

Closes #3397 from davies/cache and squashes the following commits:

7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
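
The trade-off described in the commit message can be illustrated with a small standalone sketch. It is not the Spark implementation: `pickle` stands in for Pyrolite, plain lists stand in for the cached RDDs, and the function names `iterate_with_serialized_cache` and `iterate_with_deserialized_cache` are hypothetical. It only shows why caching already-deserialized objects removes the per-step deserialization cost.

```python
import pickle


def iterate_with_serialized_cache(cached_bytes, steps):
    """Old pattern (assumed analogy): the cache holds serialized bytes,
    so every iteration has to deserialize every record again."""
    deserializations = 0
    total = 0.0
    for _ in range(steps):
        rows = [pickle.loads(b) for b in cached_bytes]
        deserializations += len(rows)
        total += sum(row[2] for row in rows)
    return total, deserializations


def iterate_with_deserialized_cache(cached_bytes, steps):
    """New pattern (assumed analogy): deserialize once, cache the objects,
    and reuse them in every iteration."""
    rows = [pickle.loads(b) for b in cached_bytes]  # done a single time
    deserializations = len(rows)
    total = 0.0
    for _ in range(steps):
        total += sum(row[2] for row in rows)
    return total, deserializations


# Hypothetical (user, product, rating) tuples, serialized up front.
ratings = [(1, 10, 4.0), (1, 20, 3.0), (2, 10, 5.0)]
cached = [pickle.dumps(r) for r in ratings]

slow_total, slow_count = iterate_with_serialized_cache(cached, steps=5)
fast_total, fast_count = iterate_with_deserialized_cache(cached, steps=5)
```

Both functions compute the same result; the second performs one deserialization per record in total, while the first performs one per record per step.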
author: Davies Liu <davies@databricks.com> 2014-11-21 15:02:31 -0800
committer: Xiangrui Meng <meng@databricks.com> 2014-11-21 15:02:31 -0800
commit: ce95bd8e130b2c7688b94be40683bdd90d86012d (patch)
tree: 396d4e26517f3fc6e84a904ca6466ffb5da2f222 /python/pyspark/mllib/recommendation.py
parent: a81918c5a66fc6040f9796fc1a9d4e0bfb8d0cbe (diff)
1 files changed, 2 insertions, 2 deletions
diff --git a/python/pyspark/mllib/recommendation.py b/python/pyspark/mllib/recommendation.py
index 2bcbf2aaf8..97ec74eda0 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -19,7 +19,7 @@ from collections import namedtuple
 
 from pyspark import SparkContext
 from pyspark.rdd import RDD
-from pyspark.mllib.common import JavaModelWrapper, callMLlibFunc, _to_java_object_rdd
+from pyspark.mllib.common import JavaModelWrapper, callMLlibFunc
 
 __all__ = ['MatrixFactorizationModel', 'ALS', 'Rating']
 
@@ -110,7 +110,7 @@ class ALS(object):
                 ratings = ratings.map(lambda x: Rating(*x))
             else:
                 raise ValueError("rating should be RDD of Rating or tuple/list")
-        return _to_java_object_rdd(ratings, True)
+        return ratings
 
     @classmethod
     def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False,