author    | Imran Rashid <irashid@cloudera.com> | 2016-12-14 12:26:49 -0600
committer | Imran Rashid <irashid@cloudera.com> | 2016-12-14 12:27:01 -0600
commit    | ac013ea58933a2057475f7accd197b9ed01b495e (patch)
tree      | 3b03ea6731e439485127178035e0358c940312da /core/src/test/scala/org
parent    | cccd64393ea633e29d4a505fb0a7c01b51a79af8 (diff)
[SPARK-18846][SCHEDULER] Fix flakiness in SchedulerIntegrationSuite
There is a small race in SchedulerIntegrationSuite: the test assumes
that the task scheduler thread processing the last task will finish
before the DAGScheduler processes the task-end event and notifies the
job waiter, but that ordering is not guaranteed.
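The race described above can be sketched independently of Spark. In this illustrative example (the names `jobWaiter` and `taskSetMarkedDone` are stand-ins, not Spark's API), the thread that releases the job waiter still has bookkeeping left to do, so a check made immediately after the waiter returns can observe stale state:

```scala
import java.util.concurrent.CountDownLatch
import java.util.concurrent.atomic.AtomicBoolean

object RaceSketch {
  def main(args: Array[String]): Unit = {
    val jobWaiter = new CountDownLatch(1)
    val taskSetMarkedDone = new AtomicBoolean(false)

    // Stand-in for the scheduler thread: it notifies the job waiter
    // *before* it finishes its own bookkeeping and marks the taskset done.
    new Thread(() => {
      jobWaiter.countDown()        // job waiter is released here...
      Thread.sleep(50)             // ...while this thread is still working
      taskSetMarkedDone.set(true)  // taskset only now removed from the running set
    }).start()

    jobWaiter.await()
    // An immediate assertion here can fail: the waiter has returned, but
    // the taskset may not yet be marked complete.
    println(s"marked done right after waiter returned? ${taskSetMarkedDone.get()}")
  }
}
```

With the deliberate 50 ms delay, the check after `await()` almost always sees `false`, which is exactly the window the test was hitting.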
I ran the test locally many times and it never failed, though
admittedly it never failed locally for me before this change either.
However, I am nearly certain this race is what caused the failure of one
Jenkins build,
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/68694/consoleFull
(that build log is long gone now, sorry -- I had initially fixed this as
part of https://github.com/apache/spark/pull/14079).
Author: Imran Rashid <irashid@cloudera.com>
Closes #16270 from squito/sched_integ_flakiness.
Diffstat (limited to 'core/src/test/scala/org')
-rw-r--r-- | core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala | 14
1 file changed, 12 insertions(+), 2 deletions(-)
diff --git a/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala b/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
index c28aa06623..2ba63da881 100644
--- a/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
+++ b/core/src/test/scala/org/apache/spark/scheduler/SchedulerIntegrationSuite.scala
@@ -28,6 +28,8 @@ import scala.reflect.ClassTag
 
 import org.scalactic.TripleEquals
 import org.scalatest.Assertions.AssertionsHelper
+import org.scalatest.concurrent.Eventually._
+import org.scalatest.time.SpanSugar._
 
 import org.apache.spark._
 import org.apache.spark.TaskState._
@@ -157,8 +159,16 @@ abstract class SchedulerIntegrationSuite[T <: MockBackend: ClassTag] extends Spa
     }
     // When a job fails, we terminate before waiting for all the task end events to come in,
     // so there might still be a running task set. So we only check these conditions
-    // when the job succeeds
-    assert(taskScheduler.runningTaskSets.isEmpty)
+    // when the job succeeds.
+    // When the final task of a taskset completes, we post
+    // the event to the DAGScheduler event loop before we finish processing in the taskscheduler
+    // thread. It's possible the DAGScheduler thread processes the event, finishes the job,
+    // and notifies the job waiter before our original thread in the task scheduler finishes
+    // handling the event and marks the taskset as complete. So its ok if we need to wait a
+    // *little* bit longer for the original taskscheduler thread to finish up to deal w/ the race.
+    eventually(timeout(1 second), interval(10 millis)) {
+      assert(taskScheduler.runningTaskSets.isEmpty)
+    }
     assert(!backend.hasTasks)
   } else {
     assert(failure != null)
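The fix wraps the assertion in ScalaTest's `eventually(timeout(...), interval(...))`, which retries the enclosed block until it passes or the timeout expires. The retry pattern it relies on can be sketched without ScalaTest as a small polling loop (a minimal sketch; `eventually` here is a local helper, not Spark's or ScalaTest's actual implementation):

```scala
import java.util.concurrent.atomic.AtomicBoolean

object EventuallySketch {
  // Poll `condition` until it holds or `timeoutMs` elapses, sleeping
  // `intervalMs` between attempts -- the same shape as ScalaTest's
  // eventually(timeout(...), interval(...)) { assert(...) }.
  def eventually(timeoutMs: Long, intervalMs: Long)(condition: => Boolean): Unit = {
    val deadline = System.currentTimeMillis() + timeoutMs
    while (!condition) {
      if (System.currentTimeMillis() > deadline) {
        throw new AssertionError(s"condition not met within ${timeoutMs}ms")
      }
      Thread.sleep(intervalMs)
    }
  }

  def main(args: Array[String]): Unit = {
    val done = new AtomicBoolean(false)
    // Simulate the task scheduler thread finishing its bookkeeping
    // slightly after the job waiter has already been notified.
    new Thread(() => { Thread.sleep(50); done.set(true) }).start()
    eventually(timeoutMs = 1000, intervalMs = 10) { done.get() }
    println("runningTaskSets drained")
  }
}
```

A short grace period like this tolerates the benign ordering race without masking a real hang, since a genuinely stuck taskset would still fail once the one-second timeout expires.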