author: Josh Rosen <joshrosen@databricks.com> 2016-11-04 19:32:26 -0700
committer: Yin Huai <yhuai@databricks.com> 2016-11-04 19:32:26 -0700
commit: 0e3312ee72c44f4c9acafbd80d0c8a14f3aff875
tree: 076f0dc87bad3fc9b1455c4e5adf7002febad7ae /scalastyle-config.xml
parent: 4cee2ce251110218e68c0f8f30363ec2f2498bea
[SPARK-18256] Improve the performance of event log replay in HistoryServer

## What changes were proposed in this pull request?

This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes:

- **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own `Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method.
- **Don't call `Utils.getFormattedClassName` for every event**: the old code called `Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls.

## How was this patch tested?

Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half:

![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png)

Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`:

![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png)

After this patch, these hotspots are completely eliminated.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15756 from JoshRosen/speed-up-jsonprotocol.
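
For illustration only, the two changes described above might look roughly like the following sketch. It is not the code from this patch: `jsonOption` and `formattedClassName` are stand-ins written here under the assumption that they behave like Spark's `Utils.jsonOption` and `Utils.getFormattedClassName`, and `JobStart`, `JobEnd`, `EventNames`, and `ReplaySketch` are hypothetical names. It assumes `json4s-jackson` is on the classpath.

```scala
// Sketch only; requires json4s (json4s-jackson) on the classpath.
import org.json4s._
import org.json4s.jackson.JsonMethods.parse

// Hypothetical event types, used only for this illustration.
case object JobStart
case object JobEnd

object ReplaySketch {
  implicit val formats: Formats = DefaultFormats

  // Stand-in for Utils.jsonOption (assumed behavior): a missing field
  // (JNothing) becomes None through a plain pattern match, so no exception
  // is created for absent fields.
  def jsonOption(json: JValue): Option[JValue] = json match {
    case JNothing => None
    case value => Some(value)
  }

  // Stand-in for Utils.getFormattedClassName (assumed behavior): the simple
  // class name with the '$' that Scala appends to object classes removed.
  def formattedClassName(obj: AnyRef): String =
    obj.getClass.getSimpleName.replace("$", "")

  // Computed once when EventNames is initialized, not once per replayed event.
  object EventNames {
    val jobStart: String = formattedClassName(JobStart)
    val jobEnd: String = formattedClassName(JobEnd)
  }

  def describe(line: String): String = {
    val json = parse(line)
    // Replaces (json \ "Stage Attempt ID").extractOpt[Int].
    val attempt =
      jsonOption(json \ "Stage Attempt ID").map(_.extract[Int]).getOrElse(0)
    (json \ "Event").extract[String] match {
      case EventNames.jobStart => s"start (attempt $attempt)"
      case EventNames.jobEnd => s"end (attempt $attempt)"
      case other => s"unknown: $other"
    }
  }
}
```

The `jsonOption(x).map(_.extract[T])` shape follows the replacement suggested by the style rule's message in the diff below; the `getOrElse(0)` default is an arbitrary choice made for this sketch.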
Diffstat (limited to 'scalastyle-config.xml')
-rw-r--r-- scalastyle-config.xml 6
1 files changed, 6 insertions, 0 deletions
diff --git a/scalastyle-config.xml b/scalastyle-config.xml
index 81d57d723a..48333851ef 100644
--- a/scalastyle-config.xml
+++ b/scalastyle-config.xml
@@ -217,6 +217,12 @@ This file is divided into 3 sections:
of Commons Lang 2 (package org.apache.commons.lang.*)</customMessage>
</check>
+ <check customId="extractopt" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
+ <parameters><parameter name="regex">extractOpt</parameter></parameters>
+ <customMessage>Use Utils.jsonOption(x).map(.extract[T]) instead of .extractOpt[T], as the latter
+ is slower. </customMessage>
+ </check>
+
<check level="error" class="org.scalastyle.scalariform.ImportOrderChecker" enabled="true">
<parameters>
<parameter name="groups">java,scala,3rdParty,spark</parameter>