author | Davies Liu <davies@databricks.com> | 2015-01-15 11:40:41 -0800
committer | Josh Rosen <joshrosen@databricks.com> | 2015-01-15 11:40:41 -0800
commit | 3c8650c12ad7a97852e7bd76153210493fd83e92 (patch)
tree | 8da63d69fe3596e91a9a00892b2283e08af8fb86 /python
parent | 4b325c77a270ec32d6858d204313d4f161774fae (diff)
[SPARK-5224] [PySpark] improve performance of parallelize list/ndarray
After the default batchSize was changed to 0 (batching based on the size of the objects), parallelize() still used BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default.
Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves BatchedSerializer by slicing the input directly through __len__ and __getslice__ when they are available.
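For context, here is a minimal sketch of the slicing-based fast path. It is illustrative only, not the actual BatchedSerializer code; the helper name `batched` and the default batch size are assumptions for the example.

```python
# Minimal sketch of the slicing fast path (illustrative; the real logic lives
# in BatchedSerializer._batched in pyspark/serializers.py).
def batched(data, batch_size=1024):
    if hasattr(data, "__len__") and hasattr(data, "__getslice__"):
        # list and numpy.ndarray (on Python 2) take this path: slicing copies
        # a whole batch at once instead of appending items one by one.
        for i in xrange(0, len(data), batch_size):
            yield data[i:i + batch_size]
    else:
        # Generic iterators fall back to the per-item path.
        batch = []
        for item in data:
            batch.append(item)
            if len(batch) == batch_size:
                yield batch
                batch = []
        if batch:
            yield batch
```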
Here is the benchmark for parallelizing 1 million ints as a list or a numpy.ndarray:
| | before | after | improvement |
| ------------- | ------ | ----- | ----------- |
| list | 11.7 s | 0.8 s | 14x |
| numpy.ndarray | 32 s | 0.7 s | 40x |
Author: Davies Liu <davies@databricks.com>
Closes #4024 from davies/opt_numpy and squashes the following commits:
7618c7c [Davies Liu] improve performance of parallelize list/ndarray
Diffstat (limited to 'python')
-rw-r--r-- | python/pyspark/context.py | 2
-rw-r--r-- | python/pyspark/serializers.py | 4
2 files changed, 5 insertions, 1 deletion
```diff
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 593d74bca5..64f6a3ca6b 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -319,7 +319,7 @@ class SparkContext(object):
         # Make sure we distribute data evenly if it's smaller than self.batchSize
         if "__len__" not in dir(c):
             c = list(c)    # Make it a list so we can compute its length
-        batchSize = max(1, min(len(c) // numSlices, self._batchSize))
+        batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
         serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
         serializer.dump_stream(c, tempFile)
         tempFile.close()
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index bd08c9a6d2..b8bda83517 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -181,6 +181,10 @@ class BatchedSerializer(Serializer):
     def _batched(self, iterator):
         if self.batchSize == self.UNLIMITED_BATCH_SIZE:
             yield list(iterator)
+        elif hasattr(iterator, "__len__") and hasattr(iterator, "__getslice__"):
+            n = len(iterator)
+            for i in xrange(0, n, self.batchSize):
+                yield iterator[i: i + self.batchSize]
         else:
             items = []
             count = 0
```
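A hypothetical way to time this locally; the master URL, app name, data size, and timing harness below are illustrative and not part of the patch or the original benchmark script.

```python
import time
import numpy as np
from pyspark import SparkContext

sc = SparkContext("local[4]", "parallelize-ndarray-benchmark")

data = np.arange(1000000)              # 1 million ints as an ndarray
start = time.time()
sc.parallelize(data).count()           # batched by slicing with this patch
print("parallelize + count: %.1f s" % (time.time() - start))
sc.stop()
```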