[SPARK-18256] Improve the performance of event log replay in HistoryServer

## What changes were proposed in this pull request?

This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes:

- **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own `Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method.
- **Don't call `Utils.getFormattedClassName` for every event**: the old code called `Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls.

## How was this patch tested?

Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half:

![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png)

Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`:

![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png)

After this patch, these hotspots are completely eliminated.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15756 from JoshRosen/speed-up-jsonprotocol.
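The second change, storing formatted class names in constants, can be illustrated with a minimal sketch. Everything here is hypothetical and only mirrors the shape of the idea: `StageSubmitted`, `TaskEnd`, `Names`, and the `nameLookups` counter are invented for illustration, and `formattedClassName` is a simplified stand-in for `Utils.getFormattedClassName`, not Spark's implementation.

```scala
object ReplaySketch {
  // Hypothetical event types; not Spark's listener events.
  final case class StageSubmitted(stageId: Int)
  final case class TaskEnd(taskId: Long)

  // Counts how often a class name is derived, to make the cost visible.
  var nameLookups = 0

  // Simplified stand-in for Utils.getFormattedClassName: takes the last
  // segment of the JVM class name and drops any trailing "$".
  def formattedClassName(cls: Class[_]): String = {
    nameLookups += 1
    val trimmed = cls.getName.stripSuffix("$")
    val cut = math.max(trimmed.lastIndexOf('.'), trimmed.lastIndexOf('$'))
    trimmed.substring(cut + 1)
  }

  // Old shape: every match attempt recomputes the names it compares against.
  def describeSlow(eventName: String): String = eventName match {
    case n if n == formattedClassName(classOf[StageSubmitted]) => "stage submitted"
    case n if n == formattedClassName(classOf[TaskEnd]) => "task end"
    case _ => "unknown"
  }

  // New shape: names are computed once, when Names is first initialized.
  object Names {
    val stageSubmitted: String = formattedClassName(classOf[StageSubmitted])
    val taskEnd: String = formattedClassName(classOf[TaskEnd])
  }

  // Matching against the stored constants performs no further name lookups.
  def describeFast(eventName: String): String = eventName match {
    case Names.stageSubmitted => "stage submitted"
    case Names.taskEnd => "task end"
    case _ => "unknown"
  }
}
```

In this sketch, repeated calls to `describeFast` leave `nameLookups` unchanged after the first call, whereas each call to `describeSlow` increments it once per guard evaluated.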
author: Josh Rosen <joshrosen@databricks.com> 2016-11-04 19:32:26 -0700
committer: Yin Huai <yhuai@databricks.com> 2016-11-04 19:32:26 -0700
commit: 0e3312ee72c44f4c9acafbd80d0c8a14f3aff875 (patch)
tree: 076f0dc87bad3fc9b1455c4e5adf7002febad7ae /scalastyle-config.xml
parent: 4cee2ce251110218e68c0f8f30363ec2f2498bea (diff)
download: spark-0e3312ee72c44f4c9acafbd80d0c8a14f3aff875.tar.gz
spark-0e3312ee72c44f4c9acafbd80d0c8a14f3aff875.tar.bz2
spark-0e3312ee72c44f4c9acafbd80d0c8a14f3aff875.zip
1 files changed, 6 insertions, 0 deletions
diff --git a/scalastyle-config.xml b/scalastyle-config.xml
index 81d57d723a..48333851ef 100644
--- a/scalastyle-config.xml
+++ b/scalastyle-config.xml
@@ -217,6 +217,12 @@ This file is divided into 3 sections:
     of Commons Lang 2 (package org.apache.commons.lang.*)</customMessage>
   </check>
 
+  <check customId="extractopt" level="error" class="org.scalastyle.scalariform.TokenChecker" enabled="true">
+    <parameters><parameter name="regex">extractOpt</parameter></parameters>
+    <customMessage>Use Utils.jsonOption(x).map(.extract[T]) instead of .extractOpt[T], as the latter
+    is slower.  </customMessage>
+  </check>
+
   <check level="error" class="org.scalastyle.scalariform.ImportOrderChecker" enabled="true">
     <parameters>
       <parameter name="groups">java,scala,3rdParty,spark</parameter>
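
The style rule's message asks callers to wrap a possibly missing field in an `Option` before extracting it, rather than relying on `extractOpt`. The following is a rough sketch of that pattern, assuming a tiny hand-rolled JSON model so that it runs without dependencies: `JValue`, `JNothing`, `JInt`, `JObject`, `extractInt`, and `attemptId` are hypothetical stand-ins, not the real `json4s` types or Spark code. Only the `jsonOption` helper follows the behavior described in the commit message (absent value becomes `None`, anything else becomes `Some`).

```scala
object JsonOptionSketch {
  // Hypothetical stand-ins for a few JSON AST nodes (not the json4s classes).
  sealed trait JValue
  case object JNothing extends JValue
  final case class JInt(num: BigInt) extends JValue
  final case class JObject(fields: Map[String, JValue]) extends JValue {
    // Field lookup that yields JNothing when the key is absent.
    def \(name: String): JValue = fields.getOrElse(name, JNothing)
  }

  // Absence is detected with a plain pattern match; no exception is created.
  def jsonOption(json: JValue): Option[JValue] = json match {
    case JNothing => None
    case value => Some(value)
  }

  // Stand-in for a typed extraction; throws only on a genuinely wrong type.
  def extractInt(json: JValue): Int = json match {
    case JInt(n) => n.toInt
    case other => throw new IllegalArgumentException(s"not an integer: $other")
  }

  // Optional field read: check for presence first, then extract.
  def attemptId(event: JObject): Option[Int] =
    jsonOption(event \ "Attempt").map(extractInt)
}
```

With this shape, a missing optional field is the cheap path: `attemptId` returns `None` without constructing an exception, which is the cost the commit message attributes to `extractOpt`.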