author: Josh Rosen <joshrosen@databricks.com> 2016-09-17 11:46:15 -0700
committer: Josh Rosen <joshrosen@databricks.com> 2016-09-17 11:46:15 -0700
commit: 8faa5217b44e8d52eab7eb2d53d0652abaaf43cd
tree: daf1a90737024c0dccd567f66a8b13ee0f2d3c1a /sql/core/src/test/scala
parent: 86c2d393a56bf1e5114bc5a781253c0460efb8af
[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()
## What changes were proposed in this pull request?

`MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read.

This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer).

The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of `PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`.

## How was this patch tested?

The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug.

In addition, I added a new `PartiallySerializedBlockSuite` to unit test that component.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.
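
The failure mode described above can be illustrated with a small, self-contained Python sketch. It is an analogy only, not Spark or Kryo code: `BufferingSerializationStream`, `round_trip`, `zip_only_equal`, and `strict_equal` are hypothetical names invented for this illustration, and the buffer size of 4 is arbitrary. The sketch shows how reading the sink before closing a buffering stream drops the records still held in the buffer, and how a `zip`-only comparison fails to notice the missing records.

```python
class BufferingSerializationStream:
    """Toy stand-in for a serializer that buffers output internally (hypothetical)."""

    def __init__(self, sink, buffer_size=4):
        self._sink = sink
        self._buffer = []
        self._buffer_size = buffer_size
        self._closed = False

    def write_object(self, value):
        # Precondition check, in the spirit of the assertions the patch adds.
        assert not self._closed, "stream already closed"
        self._buffer.append(value)
        if len(self._buffer) >= self._buffer_size:
            self._flush()

    def _flush(self):
        self._sink.extend(self._buffer)
        self._buffer.clear()

    def close(self):
        # Closing flushes whatever is still buffered.
        if not self._closed:
            self._flush()
            self._closed = True


def round_trip(values, close_before_read):
    """Write all values, then read back whatever has reached the sink."""
    sink = []
    stream = BufferingSerializationStream(sink)
    for v in values:
        stream.write_object(v)
    if close_before_read:
        stream.close()
    result = list(sink)  # snapshot of the sink at read time
    stream.close()
    return result


def zip_only_equal(expected, actual):
    # Flawed check: zip stops at the shorter input, so missing records pass.
    return all(a == b for a, b in zip(expected, actual))


def strict_equal(expected, actual):
    # Sound check: compare lengths first, then compare element by element.
    return len(expected) == len(actual) and zip_only_equal(expected, actual)


expected = list(range(10))
buggy = round_trip(expected, close_before_read=False)
fixed = round_trip(expected, close_before_read=True)

print(len(buggy), len(fixed))
print(zip_only_equal(expected, buggy), strict_equal(expected, buggy))
```

With ten records and a four-element buffer, the last two records are still buffered when the sink is read without closing the stream first. The `zip`-only comparison accepts that truncated result, while the length-checked comparison rejects it.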
Diffstat (limited to 'sql/core/src/test/scala')
0 files changed, 0 insertions, 0 deletions