| author | Josh Rosen <joshrosen@databricks.com> | 2016-11-07 16:14:19 -0800 |
|---|---|---|
| committer | Josh Rosen <joshrosen@databricks.com> | 2016-11-07 16:14:19 -0800 |
| commit | 3a710b94b0c853a2dd4c40dca446ecde4e7be959 | |
| tree | 2714991c21133436d28bc726fc55078d7aa98f92 | |
| parent | 19cf208063f035d793d2306295a251a9af7e32f6 | |
[SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer
## What changes were proposed in this pull request?
When profiling heap dumps from the HistoryServer and live Spark web UIs, I found a large amount of memory being wasted on duplicated objects and strings. This patch's changes remove most of this duplication, resulting in over 40% memory savings for some benchmarks.
- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously, every `TaskUIData` object would have its own instances of `InputMetricsUIData`, `OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, but for many tasks these metrics are irrelevant because they're all zero. This patch changes how we construct these metrics in order to re-use a single immutable "empty" value for the cases where these metrics are empty.
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76): Previously, every `TaskInfo` object had its own empty `ListBuffer` for holding updates from named accumulators. Tasks which didn't use named accumulators still paid for the cost of allocating and storing this empty buffer. To avoid this overhead, I changed the `val` with a mutable buffer into a `var` which holds an immutable Scala list, allowing tasks which do not have named accumulator updates to share the same singleton `Nil` object.
- **String.intern() in JSONProtocol** (7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor hostnames and IDs are deserialized from JSON, leading to massive duplication of these string objects. By calling `String.intern()` on the deserialized values we can remove all of this duplication. Since Spark now requires Java 7+, we don't have to worry about string interning exhausting the permgen (see http://java-performance.info/string-intern-in-java-6-7-8/).
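The task-metrics change above can be sketched as a factory that returns one shared immutable instance for the all-zero case. This is only an illustration of the pattern; the class and method names below are not Spark's exact code:

```scala
// Sketch of the shared-empty-value pattern (illustrative names, not Spark's code).
final class InputMetricsUIData(val bytesRead: Long, val recordsRead: Long)

object InputMetricsUIData {
  // One immutable instance shared by every task whose input metrics are all zero.
  private val EMPTY = new InputMetricsUIData(0L, 0L)

  // Factory: reuse the singleton for empty metrics, allocate only when needed.
  def fromValues(bytesRead: Long, recordsRead: Long): InputMetricsUIData =
    if (bytesRead == 0L && recordsRead == 0L) EMPTY
    else new InputMetricsUIData(bytesRead, recordsRead)
}
```

Because the shared instance is immutable, handing the same object to thousands of `TaskUIData` entries is safe, and the per-task allocation disappears for the common all-zero case.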
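The `TaskInfo.accumulables` change boils down to replacing a per-instance mutable buffer with a `var` holding an immutable list. A minimal sketch, with `String` standing in for Spark's `AccumulableInfo`:

```scala
// Before: `val accumulables = ListBuffer[AccumulableInfo]()` allocated a
// buffer for every task, even ones with no named accumulators.
// After (sketched): a `var` holding an immutable List, so tasks without
// named accumulator updates all share the singleton `Nil`.
class TaskInfo {
  private[this] var _accumulables: List[String] = Nil

  def accumulables: List[String] = _accumulables

  def setAccumulables(newAccumulables: Seq[String]): Unit = {
    _accumulables = newAccumulables.toList
  }
}
```

Since `Nil` is a single object shared across the JVM, the empty case now costs no allocation at all; only tasks that actually report accumulator updates pay for a list.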
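A minimal illustration of why `String.intern()` removes the duplication: two equal strings parsed from different JSON events are distinct heap objects, but `intern()` maps both to one canonical copy in the JVM string pool (the hostname below is made up):

```scala
// Simulate two hostname strings deserialized from separate JSON events;
// `new String` guarantees distinct objects, as a JSON parser would produce.
val host1 = new String("host-1234.example.com")
val host2 = new String("host-1234.example.com")

// intern() returns the canonical pooled instance for the string's contents,
// so both calls yield the very same object.
val interned1 = host1.intern()
val interned2 = host2.intern()
```

In the HistoryServer, the same executor hostname and ID appear in every task event for that executor, so interning collapses thousands of copies into one per distinct value.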
## How was this patch tested?
I ran
```scala
sc.parallelize(1 to 100000, 100000).count()
```
in `spark-shell` with event logging enabled, then loaded that event log in the HistoryServer, performed a full GC, and took a heap dump. According to YourKit, the changes in this patch reduced memory consumption by roughly 28 megabytes (or 770k Java objects):
![image](https://cloud.githubusercontent.com/assets/50748/19953276/4f3a28aa-a129-11e6-93df-d7fa91396f66.png)
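For reference, the event-logging setup described above might look like the following; `spark.eventLog.enabled` and `spark.eventLog.dir` are standard Spark configs, but the log directory here is illustrative:

```shell
# Launch spark-shell with event logging enabled so the run can later be
# replayed in the HistoryServer. The directory path is an example only.
./bin/spark-shell \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:/tmp/spark-events
```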
Here's a table illustrating the drop in objects due to deduplication (the drop is <100k for some objects because some events were dropped from the listener bus; this is a separate, existing bug that I'll address separately after CPU-profiling):
![image](https://cloud.githubusercontent.com/assets/50748/19953290/6a271290-a129-11e6-93ad-b825f1448886.png)
Author: Josh Rosen <joshrosen@databricks.com>
Closes #15743 from JoshRosen/spark-ui-memory-usage.
Diffstat (limited to 'sql')
 sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
```diff
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala
index 19b6d26031..948a155457 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/ui/SQLListenerSuite.scala
@@ -374,7 +374,7 @@ class SQLListenerSuite extends SparkFunSuite with SharedSQLContext {
     val sqlMetricInfo = sqlMetric.toInfo(Some(sqlMetric.value), None)
     val nonSqlMetricInfo = nonSqlMetric.toInfo(Some(nonSqlMetric.value), None)
     val taskInfo = createTaskInfo(0, 0)
-    taskInfo.accumulables ++= Seq(sqlMetricInfo, nonSqlMetricInfo)
+    taskInfo.setAccumulables(List(sqlMetricInfo, nonSqlMetricInfo))
     val taskEnd = SparkListenerTaskEnd(0, 0, "just-a-task", null, taskInfo, null)
     listener.onOtherEvent(executionStart)
     listener.onJobStart(jobStart)
```