author     Davies Liu <davies@databricks.com>    2015-01-15 11:40:41 -0800
committer  Josh Rosen <joshrosen@databricks.com>    2015-01-15 11:40:41 -0800
commit     3c8650c12ad7a97852e7bd76153210493fd83e92 (patch)
tree       8da63d69fe3596e91a9a00892b2283e08af8fb86 /python/pyspark/context.py
parent     4b325c77a270ec32d6858d204313d4f161774fae (diff)
[SPARK-5224] [PySpark] improve performance of parallelize list/ndarray
After the default batchSize was changed to 0 (batching based on object size), parallelize() still used BatchedSerializer with batchSize=1. This PR makes parallelize() use batchSize=1024 by default. BatchedSerializer also did not work well with list and numpy.ndarray; this PR improves it by using __len__ and __getslice__.

Benchmark for parallelizing 1 million ints as a list or ndarray:

              | before | after | improvement
------------- | ------ | ----- | -----------
list          | 11.7 s | 0.8 s | 14x
numpy.ndarray | 32 s   | 0.7 s | 40x

Author: Davies Liu <davies@databricks.com>

Closes #4024 from davies/opt_numpy and squashes the following commits:

7618c7c [Davies Liu] improve performance of parallelize list/ndarray
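A minimal sketch of the idea (illustrative only, not the actual pyspark.serializers implementation; the helper name iter_batches is hypothetical): a sequence that supports __len__ and slicing can be cut into fixed-size batches with a handful of slice operations instead of wrapping every element in its own batch.

def iter_batches(items, batch_size=1024):
    # Yield slices of `items` with at most `batch_size` elements each.
    # Relies on __len__ and slice support (__getslice__ on Python 2,
    # __getitem__ with a slice object on Python 3); works for both list
    # and numpy.ndarray.
    n = len(items)
    for start in range(0, n, batch_size):
        yield items[start:start + batch_size]

# Example: 1 million ints become ~977 batches of 1024 elements instead of
# 1 million one-element batches.
batches = list(iter_batches(list(range(1000000))))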
Diffstat (limited to 'python/pyspark/context.py')
-rw-r--r--  python/pyspark/context.py  2
1 file changed, 1 insertion, 1 deletion
diff --git a/python/pyspark/context.py b/python/pyspark/context.py
index 593d74bca5..64f6a3ca6b 100644
--- a/python/pyspark/context.py
+++ b/python/pyspark/context.py
@@ -319,7 +319,7 @@ class SparkContext(object):
# Make sure we distribute data evenly if it's smaller than self.batchSize
if "__len__" not in dir(c):
c = list(c) # Make it a list so we can compute its length
- batchSize = max(1, min(len(c) // numSlices, self._batchSize))
+ batchSize = max(1, min(len(c) // numSlices, self._batchSize or 1024))
serializer = BatchedSerializer(self._unbatched_serializer, batchSize)
serializer.dump_stream(c, tempFile)
tempFile.close()
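The effect of the one-line change can be traced with hypothetical values: when _batchSize is 0 (the "auto" default), the old expression collapsed the computed batch size to 1, while the `or 1024` fallback yields full batches. The numbers below are made up for illustration.

_batchSize = 0                 # default: 0 means "choose batch size automatically"
numSlices = 8
c = list(range(1000000))       # 1 million elements

old = max(1, min(len(c) // numSlices, _batchSize))          # min(125000, 0) -> 0, clamped to 1
new = max(1, min(len(c) // numSlices, _batchSize or 1024))  # min(125000, 1024) -> 1024
print(old, new)                # 1 1024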