author: Imran Rashid <irashid@cloudera.com> 2016-06-22 08:35:41 -0500
committer: Imran Rashid <irashid@cloudera.com> 2016-06-22 08:35:41 -0500
commit: cf1995a97645f0b44c997f4fdbba631fd6b91a16
tree: 7b5ef038baa09c960ac60a058b57f9884a01c51a
parent: 01277d4b259dcf9cad25eece1377162b7a8c946d
[SPARK-15783][CORE] Fix Flakiness in BlacklistIntegrationSuite
## What changes were proposed in this pull request?

Five changes here; the first two were causing failures in `BlacklistIntegrationSuite`:

1. The testing framework didn't include the reviveOffers thread, so the test involving delay scheduling might never submit offers late enough for delay scheduling to kick in. I added periodic revive offers, just like the real scheduler has.
2. `assertEmptyDataStructures` would occasionally fail because it appeared there was still an active job. This happens because in `DAGScheduler`, the jobWaiter is notified of job completion before the data structures are cleaned up. Most of the time, the test code waiting on the jobWaiter doesn't become active until after the data structures are cleared, but occasionally the race goes the other way and the assertions fail.
3. `DAGSchedulerSuite` was not stopping all the inner parts it set up, so each test leaked a number of threads. Those parts are now stopped too.
4. It turns out that `assertMapOutputAvailable` is not very useful in this framework: most of the places where I tried to use it suffer from a race.
5. When there is an exception in the backend, try to improve the error message a little. Previously, the exception was printed to the console, but the test would fail with a timeout and the logs wouldn't show anything.

## How was this patch tested?

I ran all the tests in `BlacklistIntegrationSuite` 5k times and everything in `DAGSchedulerSuite` 1k times on my laptop. I also ran a full Jenkins build with `BlacklistIntegrationSuite` repeated 500 times and `DAGSchedulerSuite` 50 times; see https://github.com/apache/spark/pull/13548. (I tried more repetitions, but Jenkins timed out.)

To check for more leaked threads, I added some code to dump the list of all threads at the end of each test in `DAGSchedulerSuite`. That is how I discovered that the `mapOutputTracker` and `eventLoop` were leaking threads. (That code was only for testing and is not part of the final PR.)
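The following is a rough illustration only. The actual suites are Scala, and their helpers are not reproduced here, so this is a minimal Python sketch built on assumed names. `PeriodicReviver` is a hypothetical stand-in for a periodic revive-offers thread (item 1). `leaked_threads` is a hypothetical helper that mimics the end-of-test thread dump by diffing the live threads against a snapshot taken before the test. A component that starts such a thread and is never stopped is the kind of leak (item 3) this check would surface.

```python
import threading


class PeriodicReviver:
    """Hypothetical stand-in for a scheduler backend's periodic reviveOffers loop.

    Calls `revive` every `interval_s` seconds on a background thread until
    stop() is called. This lets delay-scheduling decisions be re-evaluated
    even when no new scheduler events arrive.
    """

    def __init__(self, revive, interval_s=0.01):
        self._revive = revive
        self._interval_s = interval_s
        self._stopped = threading.Event()
        self._thread = threading.Thread(
            target=self._run, name="revive-offers", daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait returns False on timeout (so we revive again) and True
        # once stop() has set the event (so the loop exits).
        while not self._stopped.wait(self._interval_s):
            self._revive()

    def stop(self):
        self._stopped.set()
        # Joining makes shutdown deterministic: after stop() returns, the
        # thread is gone and can no longer show up as "leaked".
        if self._thread.is_alive():
            self._thread.join()


def leaked_threads(before):
    """Return the sorted names of live threads that were not in the `before` snapshot."""
    return sorted(t.name for t in threading.enumerate() if t not in before)
```

In a test, take `before = set(threading.enumerate())` during setup and assert that `leaked_threads(before)` is empty during teardown. Joining in `stop()` is the design choice that keeps that assertion from being racy.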
I'll also run Jenkins on this a couple of times as one more check.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13565 from squito/blacklist_extra_tests.
Diffstat (limited to 'python/pyspark')
0 files changed, 0 insertions, 0 deletions