SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys - spark


author	Matei Zaharia <matei@databricks.com>	2014-06-05 23:01:48 -0700
committer	Matei Zaharia <matei@databricks.com>	2014-06-05 23:01:59 -0700
commit	16e3910a0512cd53ad0c9c71ef20a3ee0f10c34f (patch)
tree	2646e2494d3d90b39ede85dfab2a3b391d3865de /sql/hive
parent	715fbfab9b94223ee6cb167cb69e1895ac7101e3 (diff)

SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys

The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.

Author: Matei Zaharia <matei@databricks.com>

Closes #986 from mateiz/spark-2043 and squashes the following commits:

0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.

(cherry picked from commit b45c13e7d798f97b92f1a6329528191b8d779c4f)
Signed-off-by: Matei Zaharia <matei@databricks.com>
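
The peeking approach described in the commit message can be illustrated with a minimal sketch. This is not the code from the patch: the object name `PeekingMergeSketch` and the helper `readCurrentHashGroup` are hypothetical, and the input is assumed to be a stream of key-value pairs already sorted by key hash code. The only library feature relied on is Scala's `BufferedIterator`, whose `head` inspects the next element without consuming it.

```scala
import scala.collection.mutable.ArrayBuffer

object PeekingMergeSketch {
  // Hypothetical helper (not Spark's actual method): drains only the entries
  // whose key hash equals the hash of the stream's current head. The first
  // entry with a different hash is peeked at via `head` but never consumed,
  // so it stays in the stream for the next call.
  def readCurrentHashGroup[K, V](stream: BufferedIterator[(K, V)]): ArrayBuffer[(K, V)] = {
    val group = ArrayBuffer.empty[(K, V)]
    if (stream.hasNext) {
      val currentHash = stream.head._1.hashCode()
      while (stream.hasNext && stream.head._1.hashCode() == currentHash) {
        group += stream.next()
      }
    }
    group
  }

  def main(args: Array[String]): Unit = {
    // "Aa" and "BB" are distinct strings with the same hashCode, so they
    // collide; "C" has a smaller hash and sorts first.
    val sorted = Seq("C" -> 0, "Aa" -> 1, "BB" -> 2, "Aa" -> 3).iterator.buffered

    // Each call returns one complete hash group, including colliding keys.
    while (sorted.hasNext) {
      println(readCurrentHashGroup(sorted).mkString(", "))
    }
  }
}
```

Because the entry that ends a group is only inspected, every entry for a given hash code is still available when that hash code becomes the current one, which is the property the bug description says was violated.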

Diffstat (limited to 'sql/hive')

0 files changed, 0 insertions, 0 deletions

