[SPARK-4808] Removing minimum number of elements read before spill check

In the general case, Spillable's heuristic of checking for memory stress on every 32nd item after 1000 items are read is good enough. In general, we do not want to be enacting the spilling checks until later on in the job; checking for disk-spilling too early can produce unacceptable performance impact in trivial cases.

However, there are non-trivial cases, particularly if each serialized object is large, where checking for the necessity to spill too late would allow the memory to overflow. Consider the case where every item is 1.5 MB in size and the heap size is 1000 MB. If we only try to spill the in-memory contents to disk after 1000 items are read, we would have already accumulated 1500 MB of RAM and overflowed the heap.

Patch #3656 attempted to circumvent this by checking the need to spill on every single item read, but that would cause unacceptable performance in the general case. However, the convoluted cases above should not be forced to be refactored to shrink the data items. Therefore it makes sense that the memory spilling thresholds be configurable.

Author: mcheah <mcheah@palantir.com>

Closes #4420 from mingyukim/memory-spill-configurable and squashes the following commits:

6e2509f [mcheah] [SPARK-4808] Removing minimum number of elements read before spill check
author: mcheah <mcheah@palantir.com> 2015-02-19 18:09:22 -0800
committer: Andrew Or <andrew@databricks.com> 2015-02-19 18:09:22 -0800
commit: 3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099 (patch)
tree: ab3321beda48b03a5b5ca342d8ad02bb92a3d0c4 /core
parent: 0cfd2cebde0b7fac3779eda80d6e42223f8a3d9f (diff)
download: spark-3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099.tar.gz
spark-3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099.tar.bz2
spark-3be92cdac30cf488e09dbdaaa70e5c4cdaa9a099.zip
1 file changed, 1 insertion, 5 deletions
diff --git a/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala b/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala
index 9f54312074..747ecf075a 100644
--- a/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala
+++ b/core/src/main/scala/org/apache/spark/util/collection/Spillable.scala
@@ -42,9 +42,6 @@ private[spark] trait Spillable[C] extends Logging {
   // Memory manager that can be used to acquire/release memory
   private[this] val shuffleMemoryManager = SparkEnv.get.shuffleMemoryManager
 
-  // Threshold for `elementsRead` before we start tracking this collection's memory usage
-  private[this] val trackMemoryThreshold = 1000
-
   // Initial threshold for the size of a collection before we start tracking its memory usage
   // Exposed for testing
   private[this] val initialMemoryThreshold: Long =
@@ -72,8 +69,7 @@ private[spark] trait Spillable[C] extends Logging {
    * @return true if `collection` was spilled to disk; false otherwise
    */
   protected def maybeSpill(collection: C, currentMemory: Long): Boolean = {
-    if (elementsRead > trackMemoryThreshold && elementsRead % 32 == 0 &&
-        currentMemory >= myMemoryThreshold) {
+    if (elementsRead % 32 == 0 && currentMemory >= myMemoryThreshold) {
       // Claim up to double our current memory from the shuffle memory pool
       val amountToRequest = 2 * currentMemory - myMemoryThreshold
       val granted = shuffleMemoryManager.tryToAcquire(amountToRequest)
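
The effect of dropping the `elementsRead > trackMemoryThreshold` guard can be illustrated with a minimal standalone sketch. It is not part of the patch: the object `SpillCheckSketch`, the helper `firstTrigger`, the fixed 1.5 MB item size, and the 5 MB starting threshold are all assumptions chosen for illustration, and the sketch ignores that `myMemoryThreshold` grows whenever the shuffle memory pool grants more memory.

```scala
// Hypothetical sketch, not Spark code: compares the spill-check condition
// before and after this patch for a stream of fixed-size items.
object SpillCheckSketch {
  // Condition before the patch: no check until more than 1000 elements are read.
  def oldShouldCheck(elementsRead: Long, currentMemory: Long, threshold: Long): Boolean =
    elementsRead > 1000 && elementsRead % 32 == 0 && currentMemory >= threshold

  // Condition after the patch: check on every 32nd element from the start.
  def newShouldCheck(elementsRead: Long, currentMemory: Long, threshold: Long): Boolean =
    elementsRead % 32 == 0 && currentMemory >= threshold

  // First element count at which `check` passes, assuming every item
  // adds exactly `itemBytes` to the in-memory collection.
  def firstTrigger(itemBytes: Long, threshold: Long, maxItems: Long)(
      check: (Long, Long, Long) => Boolean): Option[Long] =
    (1L to maxItems).find(n => check(n, n * itemBytes, threshold))

  def main(args: Array[String]): Unit = {
    val mb = 1024L * 1024L
    val itemBytes = 3L * mb / 2L // 1.5 MB per item, as in the commit message
    val threshold = 5L * mb      // assumed starting threshold for this sketch

    val oldFirst = firstTrigger(itemBytes, threshold, 2000L)(oldShouldCheck)
    val newFirst = firstTrigger(itemBytes, threshold, 2000L)(newShouldCheck)

    println(s"old condition first passes at element: $oldFirst")
    println(s"new condition first passes at element: $newFirst")
  }
}
```

Under these assumptions the old condition first passes at element 1024, by which point 1536 MB would already be held in memory, while the new condition first passes at element 32, with 48 MB held.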