path: root/core
Commit message | Author | Age | Files | Lines
* Revert "Preparing Spark release v1.2.0-rc1" | Patrick Wendell | 2014-12-04 | 1 | -1/+1
This reverts commit 1056e9ec13203d0c51564265e94d77a054498fdb.
* Revert "Preparing development version 1.2.1-SNAPSHOT" | Patrick Wendell | 2014-12-04 | 1 | -1/+1
This reverts commit 00316cc87983b844f6603f351a8f0b84fe1f6035.
* [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file. | Reynold Xin | 2014-12-03 | 3 | -13/+40
cc aarondav kayousterhout pwendell

This should go into 1.2?

Author: Reynold Xin <rxin@databricks.com>

Closes #3579 from rxin/SPARK-4085 and squashes the following commits:

255b4fd [Reynold Xin] Updated test.
f9814d9 [Reynold Xin] Code review feedback.
2afaf35 [Reynold Xin] [SPARK-4085] Propagate FetchFailedException when Spark fails to read local shuffle file.

(cherry picked from commit 1826372d0a1bc80db9015106dd5d2d155ada33f5)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-4498][core] Don't transition ExecutorInfo to RUNNING until Driver adds Executor | Mark Hamstra | 2014-12-03 | 2 | -2/+1
The ExecutorInfo only reaches the RUNNING state if the Driver is alive to send the ExecutorStateChanged message to master. Otherwise, appInfo.resetRetryCount() is never called and failing Executors will eventually exceed ApplicationState.MAX_NUM_RETRY, resulting in the application being removed from the master's accounting.

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #3550 from markhamstra/SPARK-4498 and squashes the following commits:

8f543b1 [Mark Hamstra] Don't transition ExecutorInfo to RUNNING until Executor is added by Driver
* [SPARK-4715][Core] Make sure tryToAcquire won't return a negative value | zsxwing | 2014-12-03 | 2 | -3/+19
ShuffleMemoryManager.tryToAcquire may return a negative value. The unit test demonstrates this bug. It will output `0 did not equal -200 granted is negative`.

Author: zsxwing <zsxwing@gmail.com>

Closes #3575 from zsxwing/SPARK-4715 and squashes the following commits:

a193ae6 [zsxwing] Make sure tryToAcquire won't return a negative value

(cherry picked from commit edd3cd477c9d6016bd977c2fa692fdeff5a6e198)
Signed-off-by: Andrew Or <andrew@databricks.com>
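Illustration (not taken from the commit): a minimal Scala sketch of how a grant can go negative and how clamping avoids it. It assumes a fair-share model in which each active thread may hold at most `maxMemory / numActiveThreads` bytes; the object name `GrantSketch` and all parameter names are hypothetical and are not Spark's API.

```scala
object GrantSketch {
  /** Bytes to grant to a requesting thread; never negative.
    * All parameters are hypothetical inputs chosen for this sketch. */
  def maxToGrant(numBytes: Long, maxMemory: Long, numActiveThreads: Int, curMem: Long): Long = {
    val fairShare = maxMemory / numActiveThreads
    // A thread already holding more than its fair share has negative headroom.
    // Without the clamp, min(numBytes, headroom) would itself be negative.
    val headroom = math.max(0L, fairShare - curMem)
    math.min(numBytes, headroom)
  }
}
```

With `maxMemory = 1000`, two active threads and `curMem = 700`, the unclamped headroom is -200, which mirrors the `-200` in the test output quoted above; the clamped version grants 0.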
* [SPARK-4672][Core] checkpoint() should clear f to shorten the serialization chain | JerryLead | 2014-12-02 | 1 | -3/+6
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672

The f closure of `PartitionsRDD(ZippedPartitionsRDD2)` contains a `$outer` that references EdgeRDD/VertexRDD, which causes the task's serialization chain to become very long in iterative GraphX applications. As a result, a StackOverflow error will occur. If we set "f = null" in `clearDependencies()`, checkpoint() can cut off the long serialization chain. More details and explanation can be found in the JIRA.

Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>

Closes #3545 from JerryLead/my_core and squashes the following commits:

f7faea5 [JerryLead] checkpoint() should clear the f to avoid StackOverflow error
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master

(cherry picked from commit 77be8b986fd21b7bbe28aa8db1042cb22bc74fe7)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
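Illustration (not taken from the commit): a minimal Scala sketch of the "set f = null in `clearDependencies()`" idea. `ZippedSketch` is a hypothetical stand-in for an RDD that combines two parent iterators with a closure; it is not the real `ZippedPartitionsRDD2`.

```scala
object ClearClosureSketch {
  // Hypothetical stand-in for an RDD that zips two parents using a closure `f`.
  // `f` is a var so that the reference can be dropped later.
  class ZippedSketch[A, B, C](var f: (Iterator[A], Iterator[B]) => Iterator[C]) {
    def compute(a: Iterator[A], b: Iterator[B]): Iterator[C] = f(a, b)

    // Once the data is checkpointed the closure is no longer needed.
    // Dropping it also drops whatever the closure captured (its `$outer`),
    // so it no longer takes part in serialization.
    def clearDependencies(): Unit = { f = null }
  }
}
```

After `clearDependencies()` the object must not be asked to `compute` again; in this sketch that would throw a `NullPointerException`.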
* [SPARK-4661][Core] Minor code and docs cleanup | zsxwing | 2014-12-01 | 2 | -2/+1
Author: zsxwing <zsxwing@gmail.com>

Closes #3521 from zsxwing/SPARK-4661 and squashes the following commits:

03cbe3f [zsxwing] Minor code and docs cleanup

(cherry picked from commit 30a86acdefd5428af6d6264f59a037e0eefd74b4)
Signed-off-by: Reynold Xin <rxin@databricks.com>
* SPARK-2143 [WEB UI] Add Spark version to UI footer | Sean Owen | 2014-11-30 | 1 | -0/+10
This PR adds the Spark version number to the UI footer; this is how it looks:

![screen shot 2014-11-21 at 22 58 40](https://cloud.githubusercontent.com/assets/822522/5157738/f4822094-7316-11e4-98f1-333a535fdcfa.png)

Author: Sean Owen <sowen@cloudera.com>

Closes #3410 from srowen/SPARK-2143 and squashes the following commits:

e9b3a7a [Sean Owen] Add Spark version to footer
* [SPARK-4597] Use proper exception and reset variable in Utils.createTempDir() | Liang-Chi Hsieh | 2014-11-28 | 1 | -1/+1
`File.exists()` and `File.mkdirs()` only throw `SecurityException` instead of `IOException`. Then, when an exception is thrown, `dir` should be reset too.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #3449 from viirya/fix_createtempdir and squashes the following commits:

36cacbd [Liang-Chi Hsieh] Use proper exception and reset variable.

(cherry picked from commit 49fe8797e64f10c574e0790b32a8c3fdc7e594a0)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
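Illustration (not taken from the commit): a minimal Scala sketch of a retrying temp-directory helper that catches `SecurityException` and resets `dir` before the next attempt. The object name, the `sketch-` prefix and the `maxAttempts` default are assumptions made for this example; this is not the body of `Utils.createTempDir()`.

```scala
import java.io.{File, IOException}
import java.util.UUID

object TempDirSketch {
  /** Tries up to `maxAttempts` times to create a fresh directory under `root`. */
  def createTempDir(root: String, maxAttempts: Int = 10): File = {
    var attempts = 0
    var dir: File = null
    while (dir == null) {
      attempts += 1
      if (attempts > maxAttempts) {
        throw new IOException(
          s"Failed to create a temp directory under $root after $maxAttempts attempts")
      }
      try {
        dir = new File(root, "sketch-" + UUID.randomUUID.toString)
        // Name collision or failed creation: forget this candidate and retry.
        if (dir.exists() || !dir.mkdirs()) {
          dir = null
        }
      } catch {
        // exists() and mkdirs() signal denied access with SecurityException.
        // Reset dir so the loop does not exit with an unusable File.
        case _: SecurityException => dir = null
      }
    }
    dir
  }
}
```

Resetting `dir` in the `catch` branch matters because the assignment happens before `exists()` is called, so without the reset the loop condition would see a non-null value and stop.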
* Preparing development version 1.2.1-SNAPSHOT | Patrick Wendell | 2014-11-28 | 1 | -1/+1
* Preparing Spark release v1.2.0-rc1 | Patrick Wendell | 2014-11-28 | 1 | -1/+1
* Updating version in package.scala | Patrick Wendell | 2014-11-28 | 1 | -1/+1
* Revert "Preparing Spark release v1.2.0-rc1" | Patrick Wendell | 2014-11-28 | 1 | -1/+1
This reverts commit 39c7d1c1f9a7785285cf4c20dfbffd96f72d5634.
* Revert "Preparing development version 1.2.1-SNAPSHOT" | Patrick Wendell | 2014-11-28 | 1 | -1/+1
This reverts commit fc7bff00ac731d2632213a98cd92dc5e84ce7dcd.
* Preparing development version 1.2.1-SNAPSHOT | Patrick Wendell | 2014-11-28 | 1 | -1/+1
* Preparing Spark release v1.2.0-rc1 | Patrick Wendell | 2014-11-28 | 1 | -1/+1
* [SPARK-4619][Storage] Delete redundant time suffix | maji2014 | 2014-11-28 | 1 | -1/+1
`Utils.getUsedTimeMs(startTime)` already includes the time suffix, so there is no need to append it again; this change deletes the redundant suffix.

Author: maji2014 <maji3@asiainfo.com>

Closes #3475 from maji2014/SPARK-4619 and squashes the following commits:

df0da4e [maji2014] delete redundant time suffix

(cherry picked from commit ceb628197099e6c598cde1564ed9c1c3681ea955)
Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-4613][Core] Java API for JdbcRDD | Cheng Lian | 2014-11-27 | 3 | -5/+204
This PR introduces a set of Java APIs for using `JdbcRDD`:

1. Trait (interface) `JdbcRDD.ConnectionFactory`: equivalent to the `getConnection: () => Connection` parameter in the `JdbcRDD` constructor.
2. Two overloaded versions of `JdbcRDD.create`: used to create a `JavaRDD` that wraps a `JdbcRDD`.

Author: Cheng Lian <lian@databricks.com>

Closes #3478 from liancheng/japi-jdbc-rdd and squashes the following commits:

9a54625 [Cheng Lian] Only shutdowns a single DB rather than the whole Derby driver
d4cedc5 [Cheng Lian] Moves Java JdbcRDD test case to a separate test suite
ffcdf2e [Cheng Lian] Java API for JdbcRDD

(cherry picked from commit 120a350240f58196eafcb038ca3a353636d89239)
Signed-off-by: Matei Zaharia <matei@databricks.com>
* [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler | roxchkplusony | 2014-11-27 | 1 | -1/+7
Author: roxchkplusony <roxchkplusony@gmail.com>

Closes #3483 from roxchkplusony/bugfix/4626 and squashes the following commits:

aba9184 [roxchkplusony] replace warning message per review
5e7fdea [roxchkplusony] [SPARK-4626] Kill a task only if the executorId is (still) registered with the scheduler

(cherry picked from commit 84376d31392858f7df215ddb3f05419181152e68)
Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator | CodingCat | 2014-11-26 | 3 | -30/+61
https://issues.apache.org/jira/browse/SPARK-3628

In the current implementation, the accumulator is updated for every successfully finished task, even if the task is from a resubmitted stage, which makes the accumulator counter-intuitive.

In this patch, I changed the way the DAGScheduler updates the accumulator. The DAGScheduler maintains a HashTable mapping the stage id to the received <accumulator_id, value> pairs. Only when the stage becomes independent (no job needs it any more) do we accumulate the values of the <accumulator_id, value> pairs. When a task finishes, we check whether the HashTable already contains its stageId; it saves the <accumulator_id, value> pair only when the task is the first finished task of a new stage or the stage is running for the first attempt...

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator

(cherry picked from commit 5af53ada65f62e6b5987eada288fb48e9211ef9d)
Signed-off-by: Matei Zaharia <matei@databricks.com>
* [SPARK-4612] Reduce task latency and increase scheduling throughput by making configuration initialization lazy | Tathagata Das | 2014-11-25 | 1 | -1/+1
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L337 creates a configuration object for every task that is launched, even if there is no new dependent file/JAR to update. This is a heavy-weight creation that should be avoided if there is no new file/JAR to update. This PR makes that creation lazy.

Quick local test in spark-perf scheduling throughput tests gives the following numbers in a local standalone scheduler mode.

1 job with 10000 tasks: before 7.8395 seconds, after 2.6415 seconds = 3x increase in task scheduling throughput

pwendell JoshRosen

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3463 from tdas/lazy-config and squashes the following commits:

c791c1e [Tathagata Das] Reduce task latency by making configuration initialization lazy

(cherry picked from commit e7f4d2534bb3361ec4b7af0d42bc798a7a425226)
Signed-off-by: Reynold Xin <rxin@databricks.com>
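Illustration (not taken from the commit): a minimal Scala sketch of deferring an expensive object with `lazy val` so it is only built when a task actually has new dependencies. `HeavyConf`, `TaskRunnerSketch` and the `builds` counter are hypothetical names used to make the effect observable; they are not Spark classes.

```scala
object LazyConfigSketch {
  // Counts how often the (hypothetical) heavy configuration object is constructed.
  var builds = 0
  class HeavyConf { builds += 1 }

  class TaskRunnerSketch {
    // Constructed on first access only, and at most once per runner.
    lazy val conf: HeavyConf = new HeavyConf

    // Placeholder for the work that needs the configuration.
    private def fetchDependencies(c: HeavyConf): Unit = ()

    def run(hasNewDependencies: Boolean): Unit =
      if (hasNewDependencies) fetchDependencies(conf)
  }
}
```

A runner whose tasks never bring a new file or JAR never pays the construction cost; the first task that does pays it once.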
* Revert "Preparing Spark release v1.2.0-rc1" | Patrick Wendell | 2014-11-26 | 1 | -1/+1
This reverts commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2.
* Revert "Preparing development version 1.2.1-SNAPSHOT" | Patrick Wendell | 2014-11-26 | 1 | -1/+1
This reverts commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa.
* Preparing development version 1.2.1-SNAPSHOT | Patrick Wendell | 2014-11-26 | 1 | -1/+1
* Preparing Spark release v1.2.0-rc1 | Patrick Wendell | 2014-11-26 | 1 | -1/+1
* Revert "Preparing Spark release v1.2.0-rc1" | Patrick Wendell | 2014-11-26 | 1 | -1/+1
This reverts commit 5247dd859b95a440baa562b9827bdeb26aa6530e.
* Revert "Preparing development version 1.2.1-SNAPSHOT" | Patrick Wendell | 2014-11-26 | 1 | -1/+1
This reverts commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f.
* Preparing development version 1.2.1-SNAPSHOT | Patrick Wendell | 2014-11-26 | 1 | -1/+1
* Preparing Spark release v1.2.0-rc1 | Patrick Wendell | 2014-11-26 | 1 | -1/+1
* Revert "Preparing Spark release v1.2.0-rc1" | Patrick Wendell | 2014-11-26 | 1 | -1/+1
This reverts commit db7f4a898af22a02b36428507f8ef2b429d78dc1.
* Revert "Preparing development version 1.2.1-SNAPSHOT" | Patrick Wendell | 2014-11-26 | 1 | -1/+1
This reverts commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b.
* Preparing development version 1.2.1-SNAPSHOT | Ubuntu | 2014-11-26 | 1 | -1/+1
* Preparing Spark release v1.2.0-rc1 | Ubuntu | 2014-11-26 | 1 | -1/+1
* Revert "Preparing Spark release v1.2.0-snapshot1" | Patrick Wendell | 2014-11-26 | 2 | -2/+2
This reverts commit 38c1fbd9694430cefd962c90bc36b0d108c6124b.
* Revert "Preparing development version 1.2.1-SNAPSHOT" | Patrick Wendell | 2014-11-26 | 2 | -2/+2
This reverts commit d7ac6013483e83caff8ea54c228f37aeca159db8.
* [SPARK-4516] Cap default number of Netty threads at 8 | Aaron Davidson | 2014-11-25 | 1 | -7/+37
In practice, only 2-4 cores should be required to transfer roughly 10 Gb/s, and each core that we use will have an initial overhead of roughly 32 MB of off-heap memory, which comes at a premium.

Thus, this value should still retain maximum throughput and reduce wasted off-heap memory allocation. It can be overridden by setting the number of serverThreads and clientThreads manually in Spark's configuration.

Author: Aaron Davidson <aaron@databricks.com>

Closes #3469 from aarondav/fewer-pools2 and squashes the following commits:

087c59f [Aaron Davidson] [SPARK-4516] Cap default number of Netty threads at 8

(cherry picked from commit f5f2d27385c243959f03a9d78a149d5f405b2f50)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
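Illustration (not taken from the commit): a minimal Scala sketch of a capped default with a manual override, plus the off-heap arithmetic implied by the 32 MB-per-thread figure above. The object and parameter names are hypothetical, and treating a non-positive `configured` value as "not set" is an assumption of this sketch.

```scala
object NettyThreadsSketch {
  val MaxDefaultThreads = 8
  val OverheadPerThreadMb = 32 // rough figure quoted in the commit message

  /** `configured` <= 0 is treated here as "not set by the user". */
  def numThreads(configured: Int, availableCores: Int): Int =
    if (configured > 0) configured
    else math.min(availableCores, MaxDefaultThreads)

  /** Approximate initial off-heap overhead for a given thread count. */
  def initialOverheadMb(threads: Int): Int = threads * OverheadPerThreadMb
}
```

On a 32-core machine the uncapped default would start with roughly 32 × 32 = 1024 MB of off-heap overhead per pool, while the capped default starts with roughly 8 × 32 = 256 MB.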
* Fix SPARK-4471: blockManagerIdFromJson function throws exception while B... | hushan[胡珊] | 2014-11-25 | 2 | -3/+16
Fix [SPARK-4471](https://issues.apache.org/jira/browse/SPARK-4471): the blockManagerIdFromJson function throws an exception when the BlockManagerId is null in MetadataFetchFailedException

Author: hushan[胡珊] <hushan@xiaomi.com>

Closes #3340 from suyanNone/fix-blockmanagerId-jnothing-2 and squashes the following commits:

159f9a3 [hushan[胡珊]] Refine test code for blockmanager is null
4380d73 [hushan[胡珊]] remove useless blank line
3ccf651 [hushan[胡珊]] Fix SPARK-4471: blockManagerIdFromJson function throws exception while metadata fetch failed

(cherry picked from commit 9bdf5da59036c0b052df756fc4a28d64677072e7)
Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4546] Improve HistoryServer first time user experience | Andrew Or | 2014-11-25 | 3 | -21/+39
The documentation points the user to run the following

```
sbin/start-history-server.sh
```

The first thing this does is throw an exception that complains a log directory is not specified. The exception message itself does not say anything about what to set. Instead we should have a default and a landing page with a better message. The new default log directory is `file:/tmp/spark-events`.

This is what it looks like as of this PR:

![after](https://issues.apache.org/jira/secure/attachment/12682985/after.png)

Author: Andrew Or <andrew@databricks.com>

Closes #3411 from andrewor14/minor-history-improvements and squashes the following commits:

f33d6b3 [Andrew Or] Point user to set config if default log dir does not exist
fc4c17a [Andrew Or] Improve HistoryServer UX

(cherry picked from commit 9afcbe494a3535a9bf7958429b72e989972f82d9)
Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4592] Avoid duplicate worker registrations in standalone mode | Andrew Or | 2014-11-25 | 2 | -7/+47
**Summary.** On failover, the Master may receive duplicate registrations from the same worker, causing the worker to exit. This is caused by this commit https://github.com/apache/spark/commit/4afe9a4852ebeb4cc77322a14225cd3dec165f3f, which adds logic for the worker to re-register with the master in case of failures. However, the following race condition may occur:

(1) Master A fails and Worker attempts to reconnect to all masters
(2) Master B takes over and notifies Worker
(3) Worker responds by registering with Master B
(4) Meanwhile, Worker's previous reconnection attempt reaches Master B, causing the same Worker to register with Master B twice

**Fix.** Instead of attempting to register with all known masters, the worker should re-register with only the one that it has been communicating with. This is safe because the fact that a failover has occurred means the old master must have died. Then, when the worker is finally notified of a new master, it gives up on the old one in favor of the new one.

**Caveat.** Even this fix is subject to more obscure race conditions. For instance, if Master B fails and Master A recovers immediately, then Master A may still observe duplicate worker registrations. However, this and other potential race conditions, summarized in [SPARK-4592](https://issues.apache.org/jira/browse/SPARK-4592), are much, much less likely than the one described above, which is deterministically reproducible.

Author: Andrew Or <andrew@databricks.com>

Closes #3447 from andrewor14/standalone-failover and squashes the following commits:

0d9716c [Andrew Or] Move re-registration logic to actor for thread-safety
79286dc [Andrew Or] Preserve old behavior for initial retries
83b321c [Andrew Or] Tweak wording
1fce6a9 [Andrew Or] Active master actor could be null in the beginning
b6f269e [Andrew Or] Avoid duplicate worker registrations

(cherry picked from commit 1b2ab1cd1b7cab9076f3c511188a610eda935701)
Signed-off-by: Andrew Or <andrew@databricks.com>
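Illustration (not taken from the commit): a minimal Scala sketch of the target-selection rule described under **Fix**. Masters are modeled as plain strings, and the fallback to all known masters when no active master is known yet is an assumption inferred from the squashed commit titles ("Preserve old behavior for initial retries", "Active master actor could be null in the beginning"). None of these names are the Worker's real API.

```scala
object ReregisterSketch {
  /** Masters a worker should contact when it retries registration.
    * `activeMaster` is the master it has been communicating with, if any. */
  def reregisterTargets(activeMaster: Option[String], knownMasters: Seq[String]): Seq[String] =
    activeMaster match {
      // After a failover: only retry against the master we were talking to,
      // so a stale attempt cannot land on the newly elected master.
      case Some(master) => Seq(master)
      // Before any master has been confirmed: keep contacting all of them.
      case None => knownMasters
    }
}
```

In the race described above, step (1) would then send the reconnection attempt only to Master A, so nothing stale reaches Master B in step (4).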
* [SPARK-4525] Mesos should decline unused offers | Jongyoul Lee | 2014-11-24 | 2 | -21/+65
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.

I've also done some minor renaming/clean-up of variables in this class and tests.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #3436 from pwendell/mesos-issue and squashes the following commits:

58c35b5 [Patrick Wendell] Adding unit test for this situation
c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
  - Added code for declining unused offers among acceptedOffers
  - Edited testCase for checking declining unused offers

(cherry picked from commit b043c27424d05e3200e7ba99a1a65656b57fa2f0)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Revert "[SPARK-4525] Mesos should decline unused offers" | Patrick Wendell | 2014-11-24 | 2 | -65/+21
This reverts commit 4b4797309457b9301710b6e98550817337005eca.

I accidentally committed this using my own authorship credential. However, I should have given authorship to the original author: Jongyoul Lee.
* [SPARK-4525] Mesos should decline unused offers | Patrick Wendell | 2014-11-24 | 2 | -21/+65
Functionally, this is just a small change on top of #3393 (by jongyoul). The issue being addressed is discussed in the comments there. I have not yet added a test for the bug there. I will add one shortly.

I've also done some minor renaming/clean-up of variables in this class and tests.

Author: Patrick Wendell <pwendell@gmail.com>
Author: Jongyoul Lee <jongyoul@gmail.com>

Closes #3436 from pwendell/mesos-issue and squashes the following commits:

58c35b5 [Patrick Wendell] Adding unit test for this situation
c4f0697 [Patrick Wendell] Additional clean-up and fixes on top of existing fix
f20f1b3 [Jongyoul Lee] [SPARK-4525] MesosSchedulerBackend.resourceOffers cannot decline unused offers from acceptedOffers
  - Added code for declining unused offers among acceptedOffers
  - Edited testCase for checking declining unused offers

(cherry picked from commit b043c27424d05e3200e7ba99a1a65656b57fa2f0)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-4266] [Web-UI] Reduce stage page load time. | Kay Ousterhout | 2014-11-24 | 8 | -27/+39
The commit changes the JavaScript used to show/hide additional metrics in order to reduce page load time. SPARK-4016 significantly increased page load time for the stage page when stages had a lot (thousands or tens of thousands) of tasks, due to the additional JavaScript to hide some metrics by default and stripe the tables. This commit reduces page load time in two ways:

(1) Now, all of the metrics that are hidden by default are hidden by setting "display: none;" using CSS for the page, rather than hiding them using JavaScript after the page loads. Without this change, for stages with thousands of tasks, there was a few second delay after page load, where first the additional metrics were shown, and then after a delay were hidden once the relevant JS finished running.

(2) CSS is used to stripe all of the tables except for the summary table. The summary table needs JavaScript to do the striping because some rows are hidden, but the JavaScript striping is slower, which again resulted in a delay when it was used for the task table (where for a few seconds after page load, all of the rows in the task table would be white, while the browser finished running the JS to stripe the table).

cc pwendell

This change is intended to be backported to 1.2 to avoid a regression in UI performance when users run large jobs.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #3328 from kayousterhout/SPARK-4266 and squashes the following commits:

f964091 [Kay Ousterhout] [SPARK-4266] [Web-UI] Reduce stage page load time.

(cherry picked from commit d24d5bf064572a2319627736b1fbf112b4a78edf)
Signed-off-by: Kay Ousterhout <kayousterhout@gmail.com>
* [SPARK-4548] [SPARK-4517] improve performance of python broadcast | Davies Liu | 2014-11-24 | 1 | -22/+51
Re-implement the Python broadcast using files:

1) Serialize the Python object using cPickle and write it to disk.
2) Create a wrapper in the JVM (for the dumped file); it reads the data from the file during serialization.
3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to the executors.
4) During deserialization, write the data to disk.
5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object; this is deferred until the first access.

This fixes the performance regression introduced in #2659. Performance is similar to 1.1, but objects larger than 2G are now supported, and memory efficiency is improved (only one compressed copy in driver and executor).

Testing with a 500M broadcast and 4 tasks (excluding the benefit from reused worker in 1.2):

name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% |
python-broadcast-w-set | 4.13 | 4.50 | -8.35% |

Testing with 100 tasks (16 CPUs):

name | 1.1 | 1.2 with this patch | improvement
---------|--------|---------|--------
python-broadcast-w-bytes | 38.16 | 8.40 | 353.98%
python-broadcast-w-set | 23.29 | 9.59 | 142.80%

Author: Davies Liu <davies@databricks.com>

Closes #3417 from davies/pybroadcast and squashes the following commits:

50a58e0 [Davies Liu] address comments
b98de1d [Davies Liu] disable gc while unpickle
e5ee6b9 [Davies Liu] support large string
09303b8 [Davies Liu] read all data into memory
dde02dd [Davies Liu] improve performance of python broadcast

(cherry picked from commit 6cf507685efd01df77d663145ae08e48c7f92948)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
* [SPARK-4145] Web UI job pages | Josh Rosen | 2014-11-24 | 21 | -75/+1054
This PR adds two new pages to the Spark Web UI:

- A jobs overview page, which shows details on running / completed / failed jobs.
- A job details page, which displays information on an individual job's stages.

The jobs overview page is now the default UI homepage; the old homepage is still accessible at `/stages`.

### Screenshots

#### New UI homepage

![image](https://cloud.githubusercontent.com/assets/50748/5119035/fd0a69e6-701f-11e4-89cb-db7e9705714f.png)

#### Job details page

(This is effectively a per-job version of the stages page that can be extended later with other things, such as DAG visualizations)

![image](https://cloud.githubusercontent.com/assets/50748/5134910/50b340d4-70c7-11e4-88e1-6b73237ea7c8.png)

### Key changes in this PR

- Rename `JobProgressPage` to `AllStagesPage`
- Expose `StageInfo` objects in the `SparkListenerJobStart` event; add backwards-compatibility tests to JsonProtocol.
- Add additional data structures to `JobProgressListener` to map from stages to jobs.
- Add several fields to `JobUIData`.

I also added ~150 lines of Selenium tests as I uncovered UI issues while developing this patch.

### Limitations

If a job contains stages that aren't run, then its overall job progress bar may be an underestimate of the total job progress; in other words, a completed job may appear to have a progress bar that's not at 100%.

If stages or tasks fail, then the progress bar will not go backwards to reflect the true amount of remaining work.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3009 from JoshRosen/job-page and squashes the following commits:

eb05e90 [Josh Rosen] Disable kill button in completed stages tables.
f00c851 [Josh Rosen] Fix JsonProtocol compatibility
b89c258 [Josh Rosen] More JSON protocol backwards-compatibility fixes.
ff804cd [Josh Rosen] Don't write "Stage Ids" field in JobStartEvent JSON.
6f17f3f [Josh Rosen] Only store StageInfos in SparkListenerJobStart event.
2bbf41a [Josh Rosen] Update job progress bar to reflect skipped tasks/stages.
61c265a [Josh Rosen] Add “skipped stages” table; only display non-empty tables.
1f45d44 [Josh Rosen] Incorporate a bunch of minor review feedback.
0b77e3e [Josh Rosen] More bug fixes for phantom stages.
034aa8d [Josh Rosen] Use `.max()` to find result stage for job.
eebdc2c [Josh Rosen] Don’t display pending stages for completed jobs.
67080ba [Josh Rosen] Ensure that "phantom stages" don't cause memory leaks.
7d10b97 [Josh Rosen] Merge remote-tracking branch 'apache/master' into job-page
d69c775 [Josh Rosen] Fix table sorting on all jobs page.
5eb39dc [Josh Rosen] Add pending stages table to job page.
f2a15da [Josh Rosen] Add status field to job details page.
171b53c [Josh Rosen] Move `startTime` to the start of SparkContext.
e2f2c43 [Josh Rosen] Fix sorting of stages in job details page.
8955f4c [Josh Rosen] Display information for pending stages on jobs page.
8ab6c28 [Josh Rosen] Compute numTasks from job start stage infos.
5884f91 [Josh Rosen] Add StageInfos to SparkListenerJobStart event.
79793cd [Josh Rosen] Track indices of completed stage to avoid overcounting when failures occur.
d62ea7b [Josh Rosen] Add failing Selenium test for stage overcounting issue.
1145c60 [Josh Rosen] Display text instead of progress bar for stages.
3d0a007 [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
8a2351b [Josh Rosen] Add help tooltip to Spark Jobs page.
b7bf30e [Josh Rosen] Add stages progress bar; fix bug where active stages show as completed.
4846ce4 [Josh Rosen] Hide "(Job Group") if no jobs were submitted in job groups.
4d58e55 [Josh Rosen] Change label to "Tasks (for all stages)"
85e9c85 [Josh Rosen] Extract startTime into separate variable.
1cf4987 [Josh Rosen] Fix broken kill links; add Selenium test to avoid future regressions.
56701fa [Josh Rosen] Move last stage name / description logic out of markup.
a475ea1 [Josh Rosen] Add progress bars to jobs page.
45343b8 [Josh Rosen] More comments
4b206fb [Josh Rosen] Merge remote-tracking branch 'origin/master' into job-page
bfce2b9 [Josh Rosen] Address review comments, except for progress bar.
4487dcb [Josh Rosen] [SPARK-4145] Web UI job pages
2568a6c [Josh Rosen] Rename JobProgressPage to AllStagesPage:

(cherry picked from commit 4a90276ab22d6989dffb2ee2d8118d9253365646)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-4479][SQL] Avoids unnecessary defensive copies when sort based shuffle is on | Cheng Lian | 2014-11-24 | 2 | -9/+26
This PR is a workaround for SPARK-4479. Two changes are introduced: when merge sort is bypassed in `ExternalSorter`,

1. also bypass RDD elements buffering as buffering is the reason that `MutableRow` backed row objects must be copied, and
2. avoids defensive copies in `Exchange` operator

Author: Cheng Lian <lian@databricks.com>

Closes #3422 from liancheng/avoids-defensive-copies and squashes the following commits:

591f2e9 [Cheng Lian] Passes all shuffle suites
0c3c91e [Cheng Lian] Fixes shuffle write metrics when merge sort is bypassed
ed5df3c [Cheng Lian] Fixes styling changes
f75089b [Cheng Lian] Avoids unnecessary defensive copies when sort based shuffle is on

(cherry picked from commit a6d7b61f92dc7c1f9632cecb232afa8040ab2b4d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
* [SPARK-4446] [SPARK CORE] | Leolh | 2014-11-19 | 1 | -1/+1
MetadataCleaner schedules its task with the wrong parameter for the delay time.

Author: Leolh <leosandylh@gmail.com>

Closes #3306 from Leolh/master and squashes the following commits:

4a21f4e [Leolh] Update MetadataCleaner.scala

(cherry picked from commit e216ffaead983274428052caa992b20760b2c5e0)
Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-4480] Avoid many small spills in external data structures | Andrew Or | 2014-11-19 | 2 | -12/+18
**Summary.** Currently, we may spill many small files in `ExternalAppendOnlyMap` and `ExternalSorter`. The underlying root cause of this is summarized in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). This PR does not address this root cause, but simply provides the guarantee that we never spill the in-memory data structure if its size is less than a configurable threshold of 5MB. This config is not documented because we don't want users to set it themselves, and it is not hard-coded because we need to change it in tests.

**Symptom.** Each spill is orders of magnitude smaller than 1MB, and there are many spills. In environments where the ulimit is set, this frequently causes "too many open file" exceptions observed in [SPARK-3633](https://issues.apache.org/jira/browse/SPARK-3633).

```
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292769 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4760 B to disk (292770 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4520 B to disk (292771 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4560 B to disk (292772 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4792 B to disk (292773 spills so far)
14/11/13 19:20:43 INFO collection.ExternalSorter: Thread 60 spilling in-memory batch of 4784 B to disk (292774 spills so far)
```

**Reproduction.** I ran the following on a small 4-node cluster with 512MB executors. Note that the back-to-back shuffle here is necessary for reasons described in [SPARK-4452](https://issues.apache.org/jira/browse/SPARK-4452). The second shuffle is a `reduceByKey` because it performs a map-side combine.

```
sc.parallelize(1 to 100000000, 100)
  .map { i => (i, i) }
  .groupByKey()
  .reduceByKey(_ ++ _)
  .count()
```

Before the change, I notice that each thread may spill up to 1000 times, and the size of each spill is on the order of 10KB. After the change, each thread spills only up to 20 times in the worst case, and the size of each spill is on the order of 1MB.

Author: Andrew Or <andrew@databricks.com>

Closes #3353 from andrewor14/avoid-small-spills and squashes the following commits:

49f380f [Andrew Or] Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/spark into avoid-small-spills
27d6966 [Andrew Or] Merge branch 'master' of github.com:apache/spark into avoid-small-spills
f4736e3 [Andrew Or] Fix tests
a919776 [Andrew Or] Avoid many small spills

(cherry picked from commit 0eb4a7fb0fa1fa56677488cbd74eb39e65317621)
Signed-off-by: Andrew Or <andrew@databricks.com>
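Illustration (not taken from the commit): a minimal Scala sketch of the guarantee stated in the summary, namely that an in-memory structure below the threshold is never spilled. The names and the `couldAcquireMore` flag are hypothetical simplifications; the real decision in `ExternalSorter` and `ExternalAppendOnlyMap` involves the shuffle memory manager and is not reproduced here.

```scala
object SpillSketch {
  // 5MB, the threshold quoted in the commit message.
  val DefaultThresholdBytes: Long = 5L * 1024 * 1024

  /** Decide whether to spill the in-memory structure to disk.
    *
    * @param currentMemory    estimated size of the in-memory structure, in bytes
    * @param couldAcquireMore whether more memory could be obtained instead (assumed input)
    * @param threshold        size below which we never spill
    */
  def shouldSpill(
      currentMemory: Long,
      couldAcquireMore: Boolean,
      threshold: Long = DefaultThresholdBytes): Boolean =
    currentMemory >= threshold && !couldAcquireMore
}
```

Under this rule the 4792 B batches in the log excerpt above would stay in memory rather than each producing a spill file.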
* [SPARK-4484] Treat maxResultSize as unlimited when set to 0; improve error message | Nishkam Ravi | 2014-11-19 | 3 | -4/+4
The check for maxResultSize > 0 is missing, which results in failures. Also, the error message needs to be improved so that developers know there is a new parameter to configure.

Author: Nishkam Ravi <nravi@cloudera.com>
Author: nravi <nravi@c1704.halxg.cloudera.com>
Author: nishkamravi2 <nishkamravi@gmail.com>

Closes #3360 from nishkamravi2/master_nravi and squashes the following commits:

5c9a4cb [nishkamravi2] Update TaskSetManagerSuite.scala
535295a [nishkamravi2] Update TaskSetManager.scala
3e1b616 [Nishkam Ravi] Modify test for maxResultSize
9f6583e [Nishkam Ravi] Changes to maxResultSize code (improve error message and add condition to check if maxResultSize > 0)
5f8f9ed [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
636a9ff [nishkamravi2] Update YarnAllocator.scala
8f76c8b [Nishkam Ravi] Doc change for yarn memory overhead
35daa64 [Nishkam Ravi] Slight change in the doc for yarn memory overhead
5ac2ec1 [Nishkam Ravi] Remove out
dac1047 [Nishkam Ravi] Additional documentation for yarn memory overhead issue
42c2c3d [Nishkam Ravi] Additional changes for yarn memory overhead issue
362da5e [Nishkam Ravi] Additional changes for yarn memory overhead
c726bd9 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
f00fa31 [Nishkam Ravi] Improving logging for AM memoryOverhead
1cf2d1e [nishkamravi2] Update YarnAllocator.scala
ebcde10 [Nishkam Ravi] Modify default YARN memory_overhead-- from an additive constant to a multiplier (redone to resolve merge conflicts)
2e69f11 [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark into master_nravi
efd688a [Nishkam Ravi] Merge branch 'master' of https://github.com/apache/spark
2b630f9 [nravi] Accept memory input as "30g", "512M" instead of an int value, to be consistent with rest of Spark
3bf8fad [nravi] Merge branch 'master' of https://github.com/apache/spark
5423a03 [nravi] Merge branch 'master' of https://github.com/apache/spark
eb663ca [nravi] Merge branch 'master' of https://github.com/apache/spark
df2aeb1 [nravi] Improved fix for ConcurrentModificationIssue (Spark-1097, Hadoop-10456)
6b840f0 [nravi] Undo the fix for SPARK-1758 (the problem is fixed)
5108700 [nravi] Fix in Spark for the Concurrent thread modification issue (SPARK-1097, HADOOP-10456)
681b36f [nravi] Fix for SPARK-1758: failing test org.apache.spark.JavaAPISuite.wholeTextFiles

(cherry picked from commit 73fedf5a6e662b640dfe29936753721988bff6ea)
Signed-off-by: Andrew Or <andrew@databricks.com>
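Illustration (not taken from the commit): a minimal Scala sketch of the missing guard, where a limit of 0 means "unlimited". The function and parameter names are hypothetical and do not reproduce the `TaskSetManager` code.

```scala
object ResultSizeSketch {
  /** True when the accumulated result size breaks the configured limit.
    * A `maxResultSize` of 0 (or less) is treated as "no limit". */
  def exceedsLimit(totalResultSize: Long, maxResultSize: Long): Boolean =
    maxResultSize > 0 && totalResultSize > maxResultSize
}
```

Without the `maxResultSize > 0` guard, a setting of 0 would make every non-empty result exceed the limit, which matches the failures described above.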
* [SPARK-4478] Keep totalRegisteredExecutors up-to-date | Akshat Aranya | 2014-11-19 | 1 | -0/+2
This rebases PR 3368.

This commit fixes totalRegisteredExecutors update [SPARK-4478], so that we can correctly keep track of number of registered executors.

Author: Akshat Aranya <aaranya@quantcast.com>

Closes #3373 from coolfrood/topic/SPARK-4478 and squashes the following commits:

8a4d1e4 [Akshat Aranya] Added comment
150ae93 [Akshat Aranya] [SPARK-4478] Keep totalRegisteredExecutors up-to-date

(cherry picked from commit 9ccc53c72c5bcffcc121291710754e1e2d659341)
Signed-off-by: Andrew Or <andrew@databricks.com>