[SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records - spark

author	Andrew Ray <ray.andrew@gmail.com>	2016-12-08 11:08:12 -0800
committer	Davies Liu <davies.liu@gmail.com>	2016-12-08 11:08:12 -0800
commit	3c68944b229aaaeeaee3efcbae3e3be9a2914855 (patch)
tree	8f6cf65d6396567a42c7d442d37fc1a1f29438b5 /dev/make-distribution.sh
parent	ed8869ebbf39783b16daba2e2498a2bc1889306f (diff)
## What changes were proposed in this pull request?

Fixes a bug in the Python implementation of the RDD cartesian product. The bug is related to batching and showed up in repeated cartesian products, with seemingly random results. The root cause was multiple iterators pulling from the same stream in the wrong order, because of logic that ignored batching.

`CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one-line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it can be used without additional checks.

`PairDeserializer` no longer extends `CartesianDeserializer`, as that relationship was not really proper. If wanted, a new common superclass could be added. Both `CartesianDeserializer` and `PairDeserializer` now extend only `Serializer` (which has no `dump_stream` implementation), since they are only meant for *de*serialization.

## How was this patch tested?

Additional unit tests (sourced from #14248), plus one testing a cartesian with zip.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #16121 from aray/fix-cartesian.
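
As a rough illustration of the batch-aware approach described above, here is a toy model in plain Python. It is not the PySpark code: `load_batches`, `cartesian_batches`, and the assumption that the stream alternates one key batch with one value batch are all hypothetical, chosen only to show why two readers sharing a stream must advance one whole batch at a time.

```python
from itertools import chain, product


def load_batches(stream):
    """Hypothetical stand-in for a deserializer's batch reader.

    Always yields an iterable (a list), one whole batch per step.
    """
    for batch in stream:
        yield list(batch)


def cartesian_batches(stream):
    """Pair up key and value batches read from the SAME underlying stream.

    Assumption: the stream alternates key batch, value batch, key batch, ...
    zip() advances the key reader first, then the value reader, so each
    reader consumes exactly one batch per step and the order stays correct.
    """
    key_batches = load_batches(stream)
    val_batches = load_batches(stream)
    for key_batch, val_batch in zip(key_batches, val_batches):
        # Keep each product together as a single batch.
        yield list(product(key_batch, val_batch))


def load_stream(stream):
    """Flatten the batches into individual records."""
    return chain.from_iterable(cartesian_batches(stream))


# A shared iterator holding two (key batch, value batch) pairs.
stream = iter([[1, 2], ["a", "b"], [3], ["c"]])
pairs = list(load_stream(stream))
print(pairs)
```

In this sketch, reading record by record instead of batch by batch would interleave keys and values from the shared iterator, which is the kind of ordering problem the description attributes to the original logic.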

