path: root/python/pyspark/serializers.py
author    Davies Liu <davies@databricks.com>  2015-01-15 11:40:41 -0800
committer Josh Rosen <joshrosen@databricks.com>  2015-01-15 11:40:41 -0800
commit    3c8650c12ad7a97852e7bd76153210493fd83e92 (patch)
tree      8da63d69fe3596e91a9a00892b2283e08af8fb86 /python/pyspark/serializers.py
parent    4b325c77a270ec32d6858d204313d4f161774fae (diff)
[SPARK-5224] [PySpark] improve performance of parallelize list/ndarray
After the default batchSize changed to 0 (batching based on the size of objects), parallelize() still used BatchedSerializer with batchSize=1; this PR makes parallelize() use batchSize=1024 by default. BatchedSerializer also did not work well with list and numpy.ndarray; this patch improves it by batching via __len__ and __getslice__ (slicing) instead of appending items one at a time.

Benchmark for parallelizing 1 million ints as a list or ndarray:

|               | before | after | improvement |
| ------------- | ------ | ----- | ----------- |
| list          | 11.7 s | 0.8 s | 14x         |
| numpy.ndarray | 32 s   | 0.7 s | 40x         |

Author: Davies Liu <davies@databricks.com>

Closes #4024 from davies/opt_numpy and squashes the following commits:

7618c7c [Davies Liu] improve performance of parallelize list/ndarray
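The speedup comes from replacing per-item accumulation with direct slicing when the input supports it. A minimal sketch of the two strategies the commit contrasts (function names here are illustrative, not PySpark API):

```python
def batched_by_slice(seq, batch_size):
    """Fast path: slice a sized sequence (list, ndarray) into chunks."""
    n = len(seq)
    for i in range(0, n, batch_size):
        yield seq[i:i + batch_size]


def batched_by_append(iterator, batch_size):
    """Fallback: accumulate items one by one (the pre-patch behaviour)."""
    items = []
    for item in iterator:
        items.append(item)
        if len(items) == batch_size:
            yield items
            items = []
    if items:
        yield items
```

Both produce the same batches, but the slice path avoids a Python-level loop over every element, which is where the 14x-40x improvement in the benchmark above comes from.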
Diffstat (limited to 'python/pyspark/serializers.py')
-rw-r--r--  python/pyspark/serializers.py | 4 ++++
1 file changed, 4 insertions(+), 0 deletions(-)
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index bd08c9a6d2..b8bda83517 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -181,6 +181,10 @@ class BatchedSerializer(Serializer):
     def _batched(self, iterator):
         if self.batchSize == self.UNLIMITED_BATCH_SIZE:
             yield list(iterator)
+        elif hasattr(iterator, "__len__") and hasattr(iterator, "__getslice__"):
+            n = len(iterator)
+            for i in xrange(0, n, self.batchSize):
+                yield iterator[i: i + self.batchSize]
         else:
             items = []
             count = 0
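A note on the guard in the added lines: the fast path triggers only for objects exposing both `__len__` and `__getslice__`. `__getslice__` is a Python 2 special method (lists and numpy arrays define it there) that was removed in Python 3, so on Python 3 this check falls through to the generic accumulation branch. A sketch of the same condition, with the function name being illustrative:

```python
def has_fast_path(iterator):
    # Mirrors the patched condition: a sized, sliceable (Python 2
    # style) object qualifies for the slicing fast path.
    return hasattr(iterator, "__len__") and hasattr(iterator, "__getslice__")
```

Plain iterators and generators lack `__len__`, so they always take the item-by-item fallback regardless of Python version.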