diff options
author | Davies Liu <davies.liu@gmail.com> | 2014-08-19 14:46:32 -0700 |
---|---|---|
committer | Josh Rosen <joshrosen@apache.org> | 2014-08-19 14:46:32 -0700 |
commit | d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb (patch) | |
tree | bff49107d4452ea55eb7df8b1d44bae2ee0b9baa /python/pyspark/serializers.py | |
parent | 76eaeb4523ee01cabbea2d867daac48a277885a1 (diff) | |
download | spark-d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.tar.gz spark-d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.tar.bz2 spark-d7e80c2597d4a9cae2e0cb35a86f7889323f4cbb.zip |
[SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes.
If two RDDs have different batch size in serializers, then it will try to re-serialize the one with smaller batch size, then call RDD.zip() in Spark.
Author: Davies Liu <davies.liu@gmail.com>
Closes #1894 from davies/zip and squashes the following commits:
c4652ea [Davies Liu] add more test cases
6d05fc8 [Davies Liu] Merge branch 'master' into zip
813b1e4 [Davies Liu] add more tests for failed cases
a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
Diffstat (limited to 'python/pyspark/serializers.py')
-rw-r--r-- | python/pyspark/serializers.py | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/python/pyspark/serializers.py b/python/pyspark/serializers.py index 74870c0edc..fc49aa42db 100644 --- a/python/pyspark/serializers.py +++ b/python/pyspark/serializers.py @@ -255,6 +255,9 @@ class PairDeserializer(CartesianDeserializer): def load_stream(self, stream): for (keys, vals) in self.prepare_keys_values(stream): + if len(keys) != len(vals): + raise ValueError("Can not deserialize RDD with different number of items" + " in pair: (%d, %d)" % (len(keys), len(vals))) for pair in izip(keys, vals): yield pair |