aboutsummaryrefslogtreecommitdiff
path: root/tools/src
diff options
context:
space:
mode:
authorJosh Rosen <joshrosen@databricks.com>2015-05-27 20:19:53 -0700
committerPatrick Wendell <patrick@databricks.com>2015-05-27 20:19:53 -0700
commit852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74 (patch)
tree96000d8013ffcf860e725fa2188a0aae924dc4f0 /tools/src
parent3c1f1baaf003d50786d3eee1e288f4bac69096f2 (diff)
downloadspark-852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74.tar.gz
spark-852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74.tar.bz2
spark-852f4de2d3d0c5fff2fa66000a7a3088bb3dbe74.zip
[SPARK-7873] Allow KryoSerializerInstance to create multiple streams at the same time
This is a somewhat obscure bug, but I think that it will seriously impact KryoSerializer users who use custom registrators which disabled auto-reset. When auto-reset is disabled, then this breaks things in some of our shuffle paths which actually end up creating multiple OutputStreams from the same shared SerializerInstance (which is unsafe). This was introduced by a patch (SPARK-3386) which enables serializer re-use in some of the shuffle paths, since constructing new serializer instances is actually pretty costly for KryoSerializer. We had already fixed another corner-case (SPARK-7766) bug related to this, but missed this one. I think that the root problem here is that KryoSerializerInstance can be used in a way which is unsafe even within a single thread, e.g. by creating multiple open OutputStreams from the same instance or by interleaving deserialize and deserializeStream calls. I considered a smaller patch which adds assertions to guard against this type of "misuse" but abandoned that approach after I realized how convoluted the Scaladoc became. This patch fixes this bug by making it legal to create multiple streams from the same KryoSerializerInstance. Internally, KryoSerializerInstance now implements a `borrowKryo()` / `releaseKryo()` API that's backed by a "pool" of capacity 1. Each call to a KryoSerializerInstance method will borrow the Kryo, do its work, then release the serializer instance back to the pool. If the pool is empty and we need an instance, it will allocate a new Kryo on-demand. This makes it safe for multiple OutputStreams to be opened from the same serializer. If we try to release a Kryo back to the pool but the pool already contains a Kryo, then we'll just discard the new Kryo. I don't think there's a clear benefit to having a larger pool since our usages tend to fall into two cases, a) where we only create a single OutputStream and b) where we create a huge number of OutputStreams with the same lifecycle, then destroy the KryoSerializerInstance (this is what's happening in the bypassMergeSort code path that my regression test hits). Author: Josh Rosen <joshrosen@databricks.com> Closes #6415 from JoshRosen/SPARK-7873 and squashes the following commits: 00b402e [Josh Rosen] Initialize eagerly to fix a failing test ba55d20 [Josh Rosen] Add explanatory comments 3f1da96 [Josh Rosen] Guard against duplicate close() ab457ca [Josh Rosen] Sketch a loan/release based solution. 9816e8f [Josh Rosen] Add a failing test showing how deserialize() and deserializeStream() can interfere. 7350886 [Josh Rosen] Add failing regression test for SPARK-7873
Diffstat (limited to 'tools/src')
0 files changed, 0 insertions, 0 deletions