aboutsummaryrefslogtreecommitdiff
path: root/python/pyspark/serializers.py
diff options
context:
space:
mode:
authorDavies Liu <davies.liu@gmail.com>2014-08-19 14:46:32 -0700
committerJosh Rosen <joshrosen@apache.org>2014-08-19 14:46:32 -0700
commitd7e80c2597d4a9cae2e0cb35a86f7889323f4cbb (patch)
treebff49107d4452ea55eb7df8b1d44bae2ee0b9baa /python/pyspark/serializers.py
parent76eaeb4523ee01cabbea2d867daac48a277885a1 (diff)
downloadspark-d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.tar.gz
spark-d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.tar.bz2
spark-d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.zip
[SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes.
If two RDDs have different batch size in serializers, then it will try to re-serialize the one with smaller batch size, then call RDD.zip() in Spark. Author: Davies Liu <davies.liu@gmail.com> Closes #1894 from davies/zip and squashes the following commits: c4652ea [Davies Liu] add more test cases 6d05fc8 [Davies Liu] Merge branch 'master' into zip 813b1e4 [Davies Liu] add more tests for failed cases a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
Diffstat (limited to 'python/pyspark/serializers.py')
-rw-r--r--python/pyspark/serializers.py3
1 files changed, 3 insertions, 0 deletions
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py
index 74870c0edc..fc49aa42db 100644
--- a/python/pyspark/serializers.py
+++ b/python/pyspark/serializers.py
@@ -255,6 +255,9 @@ class PairDeserializer(CartesianDeserializer):
def load_stream(self, stream):
for (keys, vals) in self.prepare_keys_values(stream):
+ if len(keys) != len(vals):
+ raise ValueError("Can not deserialize RDD with different number of items"
+ " in pair: (%d, %d)" % (len(keys), len(vals)))
for pair in izip(keys, vals):
yield pair