[SPARK-3094] [PySpark] compatitable with PyPy

After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example: ``` PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py ``` The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks: Job | CPython 2.7 | PyPy 2.3.1 | Speed up ------- | ------------ | ------------- | ------- Word Count | 41s | 15s | 2.7x Sort | 46s | 44s | 1.05x Stats | 174s | 3.6s | 48x Here is the code used for benchmark: ```python rdd = sc.textFile("text") def wordcount(): rdd.flatMap(lambda x:x.split('/'))\ .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap() def sort(): rdd.sortBy(lambda x:x, 1).count() def stats(): sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats() ``` Author: Davies Liu <davies.liu@gmail.com> Closes #2144 from davies/pypy and squashes the following commits: 9aed6c5 [Davies Liu] use protocol 2 in CloudPickle 4bc1f04 [Davies Liu] refactor b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way 3ca2351 [Davies Liu] Merge branch 'master' into pypy fae8b19 [Davies Liu] improve attrgetter, add tests 591f830 [Davies Liu] try to run tests with PyPy in run-tests c8d62ba [Davies Liu] cleanup f651fd0 [Davies Liu] fix tests using array with PyPy 1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways 3c1dbfe [Davies Liu] Merge branch 'master' into pypy 42fb5fa [Davies Liu] Merge branch 'master' into pypy cb2d724 [Davies Liu] fix tests 9986692 [Davies Liu] Merge branch 'master' into pypy 25b4ca7 [Davies Liu] support PyPy
author: Davies Liu <davies.liu@gmail.com> 2014-09-12 18:42:50 -0700
committer: Josh Rosen <joshrosen@apache.org> 2014-09-12 18:42:50 -0700
commit: 71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7 (patch)
tree: 1cff9839428a177b129de830c3c175b1316a6626 /python/pyspark/serializers.py
parent: 25311c2c545a60eb9dcf704814d4600987852155 (diff)
download: spark-71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7.tar.gz
spark-71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7.tar.bz2
spark-71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7.zip
1 files changed, 7 insertions, 3 deletions
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 7b2710b913..a5f9341e81 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -355,7 +355,8 @@ class PickleSerializer(FramedSerializer):
     def dumps(self, obj):
         return cPickle.dumps(obj, 2)
 
-    loads = cPickle.loads
+    def loads(self, obj):
+        return cPickle.loads(obj)
 
 
 class CloudPickleSerializer(PickleSerializer):
@@ -374,8 +375,11 @@ class MarshalSerializer(FramedSerializer):
     This serializer is faster than PickleSerializer but supports fewer datatypes.
     """
 
-    dumps = marshal.dumps
-    loads = marshal.loads
+    def dumps(self, obj):
+        return marshal.dumps(obj)
+
+    def loads(self, obj):
+        return marshal.loads(obj)
 
 
 class AutoSerializer(FramedSerializer):
author	Davies Liu <davies.liu@gmail.com>	2014-09-12 18:42:50 -0700
committer	Josh Rosen <joshrosen@apache.org>	2014-09-12 18:42:50 -0700
commit	71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7 (patch)
tree	1cff9839428a177b129de830c3c175b1316a6626 /python/pyspark/serializers.py
parent	25311c2c545a60eb9dcf704814d4600987852155 (diff)
download	spark-71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7.tar.gz spark-71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7.tar.bz2 spark-71af030b46a89aaa9a87f18f56b9e1f1cd8ce2e7.zip