SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage

### Why and what? Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into a an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead. This patch adds a Sorter API, intended for in memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way which introduces no more than 1 virtual function invocation of overhead at each abstraction point. Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl ### Memory implications An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default. Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (exposed via YourKit), and undergoes a Java sort. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes). This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7). The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case. This results in a worst-case sorting overhead of 4N bytes. Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements. ### Performance implications As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation. Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter". <table> <tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr> <tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr> <tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr> <tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr> <tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr> </table> The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7. In short, this patch should significantly improve performance for users running either Java 6 or 7. Author: Aaron Davidson <aaron@databricks.com> Closes #1502 from aarondav/sort and squashes the following commits: 652d936 [Aaron Davidson] Update license, move Sorter to java src a7b5b1c [Aaron Davidson] fix licenses 5c0efaf [Aaron Davidson] Update tmpLength ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs 034bf10 [Aaron Davidson] Change to Apache v2 Timsort b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark] 6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
author: Aaron Davidson <aaron@databricks.com> 2014-07-22 11:58:53 -0700
committer: Matei Zaharia <matei@databricks.com> 2014-07-22 11:58:53 -0700
commit: 85d3596e65512d481f4be54df100be6bdc9c8e29 (patch)
tree: a40688197e285c2d623848861d831c05f1c95f70 /LICENSE
parent: 1407871733176c92c67ac547adf603c41a772f7f (diff)
download: spark-85d3596e65512d481f4be54df100be6bdc9c8e29.tar.gz
spark-85d3596e65512d481f4be54df100be6bdc9c8e29.tar.bz2
spark-85d3596e65512d481f4be54df100be6bdc9c8e29.zip
1 files changed, 19 insertions, 1 deletions
diff --git a/LICENSE b/LICENSE
index 383f079df8..65e1f480d9 100644
--- a/LICENSE
+++ b/LICENSE
@@ -442,7 +442,7 @@ Written by Pavel Binko, Dino Ferrero Merlino, Wolfgang Hoschek, Tony Johnson, An
 
 
 ========================================================================
-Fo SnapTree:
+For SnapTree:
 ========================================================================
 
 SNAPTREE LICENSE
@@ -483,6 +483,24 @@ SUCH DAMAGE.
 
 
 ========================================================================
+For Timsort (core/src/main/java/org/apache/spark/util/collection/Sorter.java):
+========================================================================
+Copyright (C) 2008 The Android Open Source Project
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+
+========================================================================
 BSD-style licenses
 ========================================================================
author	Aaron Davidson <aaron@databricks.com>	2014-07-22 11:58:53 -0700
committer	Matei Zaharia <matei@databricks.com>	2014-07-22 11:58:53 -0700
commit	85d3596e65512d481f4be54df100be6bdc9c8e29 (patch)
tree	a40688197e285c2d623848861d831c05f1c95f70 /LICENSE
parent	1407871733176c92c67ac547adf603c41a772f7f (diff)
download	spark-85d3596e65512d481f4be54df100be6bdc9c8e29.tar.gz spark-85d3596e65512d481f4be54df100be6bdc9c8e29.tar.bz2 spark-85d3596e65512d481f4be54df100be6bdc9c8e29.zip