[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL - spark

diff options

author	Josh Rosen <joshrosen@databricks.com>	2015-07-20 22:38:05 -0700
committer	Josh Rosen <joshrosen@databricks.com>	2015-07-20 22:38:05 -0700
commit	c032b0bf92130dc4facb003f0deaeb1228aefded (patch)
tree	2a48b364f0afb53e1ff10dac554b54d54abfa11b /bin/spark-submit
parent	4d97be95300f729391c17b4c162e3c7fba09b8bf (diff)
download	spark-c032b0bf92130dc4facb003f0deaeb1228aefded.tar.gz spark-c032b0bf92130dc4facb003f0deaeb1228aefded.tar.bz2 spark-c032b0bf92130dc4facb003f0deaeb1228aefded.zip

[SPARK-8797] [SPARK-9146] [SPARK-9145] [SPARK-9147] Support NaN ordering and equality comparisons in Spark SQL

This patch addresses an issue where queries that sorted float or double columns containing NaN values could fail with "Comparison method violates its general contract!" errors from TimSort. The root of this problem is that `NaN > anything`, `NaN == anything`, and `NaN < anything` all return `false`. Per the design specified in SPARK-9079, we have decided that `NaN = NaN` should return true and that NaN should appear last when sorting in ascending order (i.e. it is larger than any other numeric value). In addition to implementing these semantics, this patch also adds canonicalization of NaN values in UnsafeRow, which is necessary in order to be able to do binary equality comparisons on equal NaNs that might have different bit representations (see SPARK-9147). Author: Josh Rosen <joshrosen@databricks.com> Closes #7194 from JoshRosen/nan and squashes the following commits: 983d4fc [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan 88bd73c [Josh Rosen] Fix Row.equals() a702e2e [Josh Rosen] normalization -> canonicalization a7267cf [Josh Rosen] Normalize NaNs in UnsafeRow fe629ae [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan fbb2a29 [Josh Rosen] Fix NaN comparisons in BinaryComparison expressions c1fd4fe [Josh Rosen] Fold NaN test into existing test framework b31eb19 [Josh Rosen] Uncomment failing tests 7fe67af [Josh Rosen] Support NaN == NaN (SPARK-9145) 58bad2c [Josh Rosen] Revert "Compare rows' string representations to work around NaN incomparability." fc6b4d2 [Josh Rosen] Update CodeGenerator 3998ef2 [Josh Rosen] Remove unused code a2ba2e7 [Josh Rosen] Fix prefix comparision for NaNs a30d371 [Josh Rosen] Compare rows' string representations to work around NaN incomparability. 6f03f85 [Josh Rosen] Fix bug in Double / Float ordering 42a1ad5 [Josh Rosen] Stop filtering NaNs in UnsafeExternalSortSuite bfca524 [Josh Rosen] Change ordering so that NaN is maximum value. 8d7be61 [Josh Rosen] Update randomized test to use ScalaTest's assume() b20837b [Josh Rosen] Add failing test for new NaN comparision ordering 5b88b2b [Josh Rosen] Fix compilation of CodeGenerationSuite d907b5b [Josh Rosen] Merge remote-tracking branch 'origin/master' into nan 630ebc5 [Josh Rosen] Specify an ordering for NaN values. 9bf195a [Josh Rosen] Re-enable NaNs in CodeGenerationSuite to produce more regression tests 13fc06a [Josh Rosen] Add regression test for NaN sorting issue f9efbb5 [Josh Rosen] Fix ORDER BY NULL e7dc4fb [Josh Rosen] Add very generic test for ordering 7d5c13e [Josh Rosen] Add regression test for SPARK-8782 (ORDER BY NULL) b55875a [Josh Rosen] Generate doubles and floats over entire possible range. 5acdd5c [Josh Rosen] Infinity and NaN are interesting. ab76cbd [Josh Rosen] Move code to Catalyst package. d2b4a4a [Josh Rosen] Add random data generator test utilities to Spark SQL.

Diffstat (limited to 'bin/spark-submit')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: