[SPARK-2321] Stable pull-based progress / status API

This pull request is a first step towards the implementation of a stable, pull-based progress / status API for Spark (see [SPARK-2321](https://issues.apache.org/jira/browse/SPARK-2321)). For now, I'd like to discuss the basic implementation, API names, and overall interface design. Once we arrive at a good design, I'll go back and add additional methods to expose more information via these APIs.

#### Design goals:

- Pull-based API.
- Usable from Java / Scala / Python (eventually, likely with a wrapper).
- Can be extended to expose more information without introducing binary incompatibilities.
- Returns immutable objects.
- Doesn't leak any implementation details, preserving our freedom to change the implementation.

#### Implementation:

- Add public methods (`getJobInfo`, `getStageInfo`) to SparkContext to allow status / progress information to be retrieved.
- Add public interfaces (`SparkJobInfo`, `SparkStageInfo`) for our API return values. These interfaces consist entirely of Java-style getter methods. The interfaces are currently implemented in Java. I decided to explicitly separate the interface from its implementation (`SparkJobInfoImpl`, `SparkStageInfoImpl`) in order to prevent users from constructing these responses themselves.
- Allow an existing JobProgressListener to be used when constructing a live SparkUI. This allows us to reuse these listeners in the implementation of this status API. There are a few reasons why this listener reuse makes sense:
   - The status API and web UI are guaranteed to show consistent information.
   - These listeners are already well-tested.
   - The same garbage-collection / information retention configurations can apply to both this API and the web UI.
- Extend JobProgressListener to maintain `jobId -> Job` and `stageId -> Stage` mappings.

The progress API methods are implemented in a separate trait that's mixed into SparkContext. This helps keep SparkContext.scala from becoming larger and more difficult to read.
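
To make the interface / implementation split and the pull-based model described above concrete, here is a minimal, self-contained Java sketch. It is not the code from this patch and does not depend on Spark: `JobInfo`, `JobInfoImpl`, `ProgressListener`, `onJobStart`, and `onJobEnd` are hypothetical names chosen for illustration, and the status values are plain strings.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical read-only view of a job, loosely modeled on the idea behind
// SparkJobInfo. Callers only ever see this interface.
interface JobInfo {
    int jobId();
    int[] stageIds();
    String status();
}

// Immutable implementation. The constructor is package-private, so code
// outside this package cannot construct responses itself.
final class JobInfoImpl implements JobInfo {
    private final int jobId;
    private final int[] stageIds;
    private final String status;

    JobInfoImpl(int jobId, int[] stageIds, String status) {
        this.jobId = jobId;
        this.stageIds = stageIds.clone(); // defensive copy on the way in
        this.status = status;
    }

    @Override public int jobId() { return jobId; }
    @Override public int[] stageIds() { return stageIds.clone(); } // and on the way out
    @Override public String status() { return status; }
}

// Hypothetical stand-in for a listener that receives scheduler events (push)
// and maintains a jobId -> JobInfo mapping that callers can query (pull).
final class ProgressListener {
    private final Map<Integer, JobInfo> jobs = new HashMap<>();

    synchronized void onJobStart(int jobId, int[] stageIds) {
        jobs.put(jobId, new JobInfoImpl(jobId, stageIds, "RUNNING"));
    }

    synchronized void onJobEnd(int jobId, boolean succeeded) {
        JobInfo old = jobs.get(jobId);
        if (old != null) {
            // Replace the entry instead of mutating it, so snapshots that
            // were already handed out never change.
            String status = succeeded ? "SUCCEEDED" : "FAILED";
            jobs.put(jobId, new JobInfoImpl(jobId, old.stageIds(), status));
        }
    }

    // Pull-based query: empty if the job is unknown (or no longer retained).
    synchronized Optional<JobInfo> getJobInfo(int jobId) {
        return Optional.ofNullable(jobs.get(jobId));
    }
}
```

A caller that polls `getJobInfo(0)` after `onJobStart(0, ...)` gets a snapshot whose `status()` stays `"RUNNING"` even after `onJobEnd(0, true)` is delivered; a second call returns a new object reporting `"SUCCEEDED"`. Because the listener is the single source of truth, any other consumer of the same listener (in this patch, the web UI) would see the same state.
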

Author: Josh Rosen <joshrosen@databricks.com>
Author: Josh Rosen <joshrosen@apache.org>

Closes #2696 from JoshRosen/progress-reporting-api and squashes the following commits:

e6aa78d [Josh Rosen] Add tests.
b585c16 [Josh Rosen] Accept SparkListenerBus instead of more specific subclasses.
c96402d [Josh Rosen] Address review comments.
2707f98 [Josh Rosen] Expose current stage attempt id
c28ba76 [Josh Rosen] Update demo code:
646ff1d [Josh Rosen] Document spark.ui.retainedJobs.
7f47d6d [Josh Rosen] Clean up SparkUI constructors, per Andrew's feedback.
b77b3d8 [Josh Rosen] Merge remote-tracking branch 'origin/master' into progress-reporting-api
787444c [Josh Rosen] Move status API methods into trait that can be mixed into SparkContext.
f9a9a00 [Josh Rosen] More review comments:
3dc79af [Josh Rosen] Remove creation of unused listeners in SparkContext.
249ca16 [Josh Rosen] Address several review comments:
da5648e [Josh Rosen] Add example of basic progress reporting in Java.
7319ffd [Josh Rosen] Add getJobIdsForGroup() and num*Tasks() methods.
cc568e5 [Josh Rosen] Add note explaining that interfaces should not be implemented outside of Spark.
6e840d4 [Josh Rosen] Remove getter-style names and "consistent snapshot" semantics:
08cbec9 [Josh Rosen] Begin to sketch the interfaces for a stable, public status API.
ac2d13a [Josh Rosen] Add jobId->stage, stageId->stage mappings in JobProgressListener
24de263 [Josh Rosen] Create UI listeners in SparkContext instead of in Tabs:
author: Josh Rosen <joshrosen@databricks.com> 2014-10-25 00:06:57 -0700
committer: Josh Rosen <joshrosen@databricks.com> 2014-10-25 00:06:57 -0700
commit: 9530316887612dca060a128fca34dd5a6ab2a9a9 (patch)
tree: c88a0a75186a5f72676f28e131368f62de8f99b6 /docs/configuration.md
parent: 3a845d3c048eebb0bddb3937128746fde3e8e4d8 (diff)
download: spark-9530316887612dca060a128fca34dd5a6ab2a9a9.tar.gz
spark-9530316887612dca060a128fca34dd5a6ab2a9a9.tar.bz2
spark-9530316887612dca060a128fca34dd5a6ab2a9a9.zip
1 files changed, 10 insertions, 1 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 66738d3ca7..3007706a25 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -375,7 +375,16 @@ Apart from these, the following properties are also available, and may be useful
   <td><code>spark.ui.retainedStages</code></td>
   <td>1000</td>
   <td>
-    How many stages the Spark UI remembers before garbage collecting.
+    How many stages the Spark UI and status APIs remember before garbage
+    collecting.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.ui.retainedJobs</code></td>
+  <td>1000</td>
+  <td>
+    How many jobs the Spark UI and status APIs remember before garbage
+    collecting.
   </td>
 </tr>
 <tr>
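
The two properties documented in the diff above (`spark.ui.retainedStages` and `spark.ui.retainedJobs`, both defaulting to 1000) cap how much history is kept. As a rough illustration of what such a cap means for callers of a status API, here is a small Java sketch of a bounded store that evicts the oldest entry once the limit is exceeded. `RetainedEntries` is a hypothetical class, and evicting one entry at a time in insertion order is an assumption made for this example; Spark's actual trimming strategy may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical bounded store keyed by job or stage id. It keeps at most
// `retained` entries and drops the oldest inserted entry first.
final class RetainedEntries<V> extends LinkedHashMap<Integer, V> {
    private final int retained;

    RetainedEntries(int retained) {
        super(16, 0.75f, false); // false = iterate in insertion order
        this.retained = retained;
    }

    // Called by LinkedHashMap after each put; returning true evicts `eldest`.
    @Override
    protected boolean removeEldestEntry(Map.Entry<Integer, V> eldest) {
        return size() > retained;
    }
}
```

With `new RetainedEntries<String>(2)`, inserting ids 0, 1, and 2 leaves only 1 and 2 in the map. A lookup for id 0 then comes back empty, which is why a pull-based caller should be prepared for information about old jobs or stages to be unavailable.
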