SPARK-2792. Fix reading too much or too little data from each stream in ExternalMap / Sorter

All these changes are from mridulm's work in #1609, but extracted here to fix this specific issue and make it easier to merge into 1.1. This particular set of changes makes sure that we read exactly the right range of bytes from each spill file in EAOM (ExternalAppendOnlyMap): some serializers can write bytes after the last object (e.g. the TC_RESET flag in Java serialization), and the previous code would mistakenly read those bytes as part of the next batch. There are also improvements to cleanup to make sure files are closed.

In addition to bringing in the changes to ExternalAppendOnlyMap, I also copied them to the corresponding code in ExternalSorter and updated its test suite to test for the same issues.

Author: Matei Zaharia <matei@databricks.com>

Closes #1722 from mateiz/spark-2792 and squashes the following commits:

5d4bfb5 [Matei Zaharia] Make objectStreamReset counter count the last object written too
18fe865 [Matei Zaharia] Update docs on objectStreamReset
576ee83 [Matei Zaharia] Allow objectStreamReset to be 0
0374217 [Matei Zaharia] Remove super paranoid code to close file handles
bda37bb [Matei Zaharia] Implement Mridul's ExternalAppendOnlyMap fixes in ExternalSorter too
0d6dad7 [Matei Zaharia] Added Mridul's test changes for ExternalAppendOnlyMap
9a78e4b [Matei Zaharia] Add @mridulm's fixes to ExternalAppendOnlyMap for batch sizes
author: Matei Zaharia <matei@databricks.com> 2014-08-04 12:59:18 -0700
committer: Matei Zaharia <matei@databricks.com> 2014-08-04 12:59:18 -0700
commit: 8e7d5ba1a20a8a1f409e9d6472ae3e6c4bc948b4 (patch)
tree: 22128614dde4f880b0036bb7fa816d91318ee183 /docs/configuration.md
parent: 59f84a9531f7974a053fd4963ce9afd88273ea4c (diff)
download: spark-8e7d5ba1a20a8a1f409e9d6472ae3e6c4bc948b4.tar.gz
spark-8e7d5ba1a20a8a1f409e9d6472ae3e6c4bc948b4.tar.bz2
spark-8e7d5ba1a20a8a1f409e9d6472ae3e6c4bc948b4.zip
1 files changed, 1 insertions, 1 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 2a71d7b820..870343f1c0 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -385,7 +385,7 @@ Apart from these, the following properties are also available, and may be useful
     When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
     objects to prevent writing redundant data, however that stops garbage collection of those
     objects. By calling 'reset' you flush that info from the serializer, and allow old
-    objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
+    objects to be collected. To turn off this periodic reset set it to -1.
     By default it will reset the serializer every 100 objects.
   </td>
 </tr>
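
The commit message notes that Java serialization can write bytes after the last object (the TC_RESET flag). The sketch below illustrates only that stream-level behavior with the plain JDK; it is not the Spark code. The class name `ResetMarkerDemo` and the helper `write` are hypothetical names chosen for this example. It serializes the same two objects with and without a trailing `reset()` and compares the resulting byte arrays.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.ObjectStreamConstants;
import java.io.UncheckedIOException;

// Hypothetical demo class; not part of Spark.
class ResetMarkerDemo {

    // Serializes two strings and, if requested, calls reset() after the last one.
    public static byte[] write(boolean resetAtEnd) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject("a");
            out.writeObject("b");
            if (resetAtEnd) {
                // reset() emits a TC_RESET marker into the stream
                // even though no further object follows.
                out.reset();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        byte[] plain = write(false);
        byte[] withReset = write(true);

        // Number of bytes written after the last object.
        System.out.println("extra bytes: " + (withReset.length - plain.length));
        // Whether the final byte is the TC_RESET marker.
        System.out.println("trailing TC_RESET: "
                + (withReset[withReset.length - 1] == ObjectStreamConstants.TC_RESET));
    }
}
```

A reader that assumes a batch ends at the last deserialized object would leave that trailing marker unread, which matches the kind of misalignment the commit message describes for consecutive batches in a spill file.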