path: root/dev/make-distribution.sh
author: Andrew Ray <ray.andrew@gmail.com> 2016-12-08 11:08:12 -0800
committer: Davies Liu <davies.liu@gmail.com> 2016-12-08 11:08:12 -0800
commit: 3c68944b229aaaeeaee3efcbae3e3be9a2914855
tree: 8f6cf65d6396567a42c7d442d37fc1a1f29438b5
parent: ed8869ebbf39783b16daba2e2498a2bc1889306f
[SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records
## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of the RDD cartesian product. The bug was related to batching and showed up in repeated cartesian products, which returned seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order, because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and to borrow the one-line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it can be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer`, as that inheritance was not really proper. If wanted, a new common superclass could be added. Both `CartesianDeserializer` and `PairDeserializer` now extend only `Serializer` (which has no `dump_stream` implementation), since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248), plus one that tests a cartesian combined with zip.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #16121 from aray/fix-cartesian.
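The root cause described above (two iterators sharing one stream while ignoring batch boundaries) can be illustrated with a small toy model. This is a sketch and not the PySpark code: the stream is assumed to be a plain Python iterator of alternating key batches and value batches, and `read_batches`, `cartesian_by_batch`, `load_stream`, and `naive_pairs` are hypothetical names chosen for illustration. `read_batches` only plays the role that the message assigns to `_load_stream_without_unbatching`.

```python
from itertools import chain, product


def read_batches(stream):
    """Toy stand-in for a batch-level reader: yield one whole batch per step."""
    for batch in stream:
        yield batch


def cartesian_by_batch(stream):
    """Pair each key batch with the value batch that follows it in the stream."""
    key_batches = read_batches(stream)
    val_batches = read_batches(stream)
    # zip pulls one key batch, then one value batch, so the two readers
    # alternate on the shared stream exactly at batch boundaries.
    for key_batch, val_batch in zip(key_batches, val_batches):
        # Emit the product as a single batch so boundaries stay intact.
        yield list(product(key_batch, val_batch))


def load_stream(stream):
    """Flatten the batches only at the very end."""
    return chain.from_iterable(cartesian_by_batch(stream))


def naive_pairs(stream):
    """Failure mode: each side flattens the shared stream independently."""
    keys = chain.from_iterable(stream)
    vals = chain.from_iterable(stream)
    return zip(keys, vals)


# Hypothetical stream layout: key batch, value batch, key batch, value batch.
batches = [["a", "b", "c"], [1, 2], ["d"], [3, 4]]

print(list(load_stream(iter(batches))))
# [('a', 1), ('a', 2), ('b', 1), ('b', 2), ('c', 1), ('c', 2), ('d', 3), ('d', 4)]

print(list(naive_pairs(iter(batches))))
# [('a', 1), ('b', 2), ('c', 'd')]
```

In the naive version the value-side iterator runs out of its two-element batch first and pulls the next batch from the shared stream, which is a key batch, so the key `'d'` ends up in a value slot and records are lost. Reading whole batches in lockstep and flattening afterwards avoids this, which is the general shape of the fix the message describes.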
Diffstat (limited to 'dev/make-distribution.sh')
0 files changed, 0 insertions, 0 deletions