[SPARK-4548] []SPARK-4517] improve performance of python broadcast

Re-implement the Python broadcast using file: 1) serialize the python object using cPickle, write into disks. 2) Create a wrapper in JVM (for the dumped file), it read data from during serialization 3) Using TorrentBroadcast or HttpBroadcast to transfer the data (compressed) into executors 4) During deserialization, writing the data into disk. 5) Passing the path into Python worker, read data from disk and unpickle it into python object, until the first access. It fixes the performance regression introduced in #2659, has similar performance as 1.1, but support object larger than 2G, also improve the memory efficiency (only one compressed copy in driver and executor). Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2): name | 1.1 | 1.2 with this patch | improvement ---------|--------|---------|-------- python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% | python-broadcast-w-set | 4.13 | 4.50 | -8.35% | Testing with 100 tasks (16 CPUs): name | 1.1 | 1.2 with this patch | improvement ---------|--------|---------|-------- python-broadcast-w-bytes | 38.16 | 8.40 | 353.98% python-broadcast-w-set | 23.29 | 9.59 | 142.80% Author: Davies Liu <davies@databricks.com> Closes #3417 from davies/pybroadcast and squashes the following commits: 50a58e0 [Davies Liu] address comments b98de1d [Davies Liu] disable gc while unpickle e5ee6b9 [Davies Liu] support large string 09303b8 [Davies Liu] read all data into memory dde02dd [Davies Liu] improve performance of python broadcast
author: Davies Liu <davies@databricks.com> 2014-11-24 17:17:03 -0800
committer: Josh Rosen <joshrosen@databricks.com> 2014-11-24 17:17:03 -0800
commit: 6cf507685efd01df77d663145ae08e48c7f92948 (patch)
tree: bdca89f0ce6e0304e93a605a697adbfec4c6f737 /python/pyspark/tests.py
parent: 050616b408c60eae02256913ceb645912dbff62e (diff)
download: spark-6cf507685efd01df77d663145ae08e48c7f92948.tar.gz
spark-6cf507685efd01df77d663145ae08e48c7f92948.tar.bz2
spark-6cf507685efd01df77d663145ae08e48c7f92948.zip
1 files changed, 4 insertions, 14 deletions
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index 29bcd38908..32645778c2 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -48,7 +48,7 @@ from pyspark.conf import SparkConf
 from pyspark.context import SparkContext
 from pyspark.files import SparkFiles
 from pyspark.serializers import read_int, BatchedSerializer, MarshalSerializer, PickleSerializer, \
-    CloudPickleSerializer, SizeLimitedStream, CompressedSerializer, LargeObjectSerializer
+    CloudPickleSerializer, CompressedSerializer
 from pyspark.shuffle import Aggregator, InMemoryMerger, ExternalMerger, ExternalSorter
 from pyspark.sql import SQLContext, IntegerType, Row, ArrayType, StructType, StructField, \
     UserDefinedType, DoubleType
@@ -237,26 +237,16 @@ class SerializationTestCase(unittest.TestCase):
         self.assertTrue("exit" in foo.func_code.co_names)
         ser.dumps(foo)
 
-    def _test_serializer(self, ser):
+    def test_compressed_serializer(self):
+        ser = CompressedSerializer(PickleSerializer())
         from StringIO import StringIO
         io = StringIO()
         ser.dump_stream(["abc", u"123", range(5)], io)
         io.seek(0)
         self.assertEqual(["abc", u"123", range(5)], list(ser.load_stream(io)))
-        size = io.tell()
         ser.dump_stream(range(1000), io)
         io.seek(0)
-        first = SizeLimitedStream(io, size)
-        self.assertEqual(["abc", u"123", range(5)], list(ser.load_stream(first)))
-        self.assertEqual(range(1000), list(ser.load_stream(io)))
-
-    def test_compressed_serializer(self):
-        ser = CompressedSerializer(PickleSerializer())
-        self._test_serializer(ser)
-
-    def test_large_object_serializer(self):
-        ser = LargeObjectSerializer()
-        self._test_serializer(ser)
+        self.assertEqual(["abc", u"123", range(5)] + range(1000), list(ser.load_stream(io)))
 
 
 class PySparkTestCase(unittest.TestCase):
author	Davies Liu <davies@databricks.com>	2014-11-24 17:17:03 -0800
committer	Josh Rosen <joshrosen@databricks.com>	2014-11-24 17:17:03 -0800
commit	6cf507685efd01df77d663145ae08e48c7f92948 (patch)
tree	bdca89f0ce6e0304e93a605a697adbfec4c6f737 /python/pyspark/tests.py
parent	050616b408c60eae02256913ceb645912dbff62e (diff)
download	spark-6cf507685efd01df77d663145ae08e48c7f92948.tar.gz spark-6cf507685efd01df77d663145ae08e48c7f92948.tar.bz2 spark-6cf507685efd01df77d663145ae08e48c7f92948.zip