[SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time - spark

diff options

author	Josh Rosen <joshrosen@databricks.com>	2015-05-27 20:19:53 -0700
committer	Patrick Wendell <patrick@databricks.com>	2015-05-27 20:19:53 -0700
commit	852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74 (patch)
tree	96000d8013ffcf860e725fa2188a0aae924dc4f0 /tools/src
parent	3c1f1baaf003d50786d3eee1e288f4bac69096f2 (diff)
download	spark-852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74.tar.gz spark-852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74.tar.bz2 spark-852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74.zip

[SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time

This is a somewhat obscure bug, but I think that it will seriously impact KryoSerializer users who use custom registrators which disabled auto-reset. When auto-reset is disabled, then this breaks things in some of our shuffle paths which actually end up creating multiple OutputStreams from the same shared SerializerInstance (which is unsafe). This was introduced by a patch (SPARK-3386) which enables serializer re-use in some of the shuffle paths, since constructing new serializer instances is actually pretty costly for KryoSerializer. We had already fixed another corner-case (SPARK-7766) bug related to this, but missed this one. I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became. This patch fixes this bug by making it legal to create multiple streams from the same KryoSerializerInstance. Internally, KryoSerializerInstance now implements a `borrowKryo()` / `releaseKryo()` API that's backed by a "pool" of capacity 1. Each call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, it will allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo. I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my regression test hits). Author: Josh Rosen <joshrosen@databricks.com> Closes #6415 from JoshRosen/SPARK-7873 and squashes the following commits: 00b402e [Josh Rosen] Initialize eagerly to fix a failing test ba55d20 [Josh Rosen] Add explanatory comments 3f1da96 [Josh Rosen] Guard against duplicate close() ab457ca [Josh Rosen] Sketch a loan/release based solution. 9816e8f [Josh Rosen] Add a failing test showing how deserialize() and deserializeStream() can interfere. 7350886 [Josh Rosen] Add failing regression test for SPARK-7873

Diffstat (limited to 'tools/src')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: