[SPARK-1912] fix compress memory issue during reduce

When we need to read a compressed block, we will first create a compress stream instance(LZF or Snappy) and use it to wrap that block. Let's say a reducer task need to read 1000 local shuffle blocks, it will first prepare to read that 1000 blocks, which means create 1000 compression stream instance to wrap them. But the initialization of compression instance will allocate some memory and when we have many compression instance at the same time, it is a problem. Actually reducer reads the shuffle blocks one by one, so we can do the compression instance initialization lazily. Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com> Closes #860 from cloud-fan/fix-compress and squashes the following commits: 0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator' 07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class 2c8adb2 [Wenchen Fan(Cloud)] add inline comment 8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
author: Wenchen Fan(Cloud) <cloud0fan@gmail.com> 2014-06-03 13:18:20 -0700
committer: Patrick Wendell <pwendell@gmail.com> 2014-06-25 13:28:53 -0700
commit: abb62f0b98f0be77cb509cd0513faa8e9393bfff (patch)
tree: 404ea329063a5565c9e1cd27daf9ef24046a7940 /core
parent: b4b0a54cf285614b7926c3c32b8ec652cf6ccef3 (diff)
download: spark-abb62f0b98f0be77cb509cd0513faa8e9393bfff.tar.gz
spark-abb62f0b98f0be77cb509cd0513faa8e9393bfff.tar.bz2
spark-abb62f0b98f0be77cb509cd0513faa8e9393bfff.zip
1 files changed, 20 insertions, 2 deletions
diff --git a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
index 6e450081dc..a41286d3e4 100644
--- a/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
+++ b/core/src/main/scala/org/apache/spark/storage/BlockManager.scala
@@ -1015,8 +1015,26 @@ private[spark] class BlockManager(
       bytes: ByteBuffer,
       serializer: Serializer = defaultSerializer): Iterator[Any] = {
     bytes.rewind()
-    val stream = wrapForCompression(blockId, new ByteBufferInputStream(bytes, true))
-    serializer.newInstance().deserializeStream(stream).asIterator
+
+    def getIterator = {
+      val stream = wrapForCompression(blockId, new ByteBufferInputStream(bytes, true))
+      serializer.newInstance().deserializeStream(stream).asIterator
+    }
+
+    if (blockId.isShuffle) {
+      // Reducer may need to read many local shuffle blocks and will wrap them into Iterators
+      // at the beginning. The wrapping will cost some memory (compression instance
+      // initialization, etc.). Reducer read shuffle blocks one by one so we could do the
+      // wrapping lazily to save memory.
+      class LazyProxyIterator(f: => Iterator[Any]) extends Iterator[Any] {
+        lazy val proxy = f
+        override def hasNext: Boolean = proxy.hasNext
+        override def next(): Any = proxy.next()
+      }
+      new LazyProxyIterator(getIterator)
+    } else {
+      getIterator
+    }
   }
 
   def stop() {
author	Wenchen Fan(Cloud) <cloud0fan@gmail.com>	2014-06-03 13:18:20 -0700
committer	Patrick Wendell <pwendell@gmail.com>	2014-06-25 13:28:53 -0700
commit	abb62f0b98f0be77cb509cd0513faa8e9393bfff (patch)
tree	404ea329063a5565c9e1cd27daf9ef24046a7940 /core
parent	b4b0a54cf285614b7926c3c32b8ec652cf6ccef3 (diff)
download	spark-abb62f0b98f0be77cb509cd0513faa8e9393bfff.tar.gz spark-abb62f0b98f0be77cb509cd0513faa8e9393bfff.tar.bz2 spark-abb62f0b98f0be77cb509cd0513faa8e9393bfff.zip