aboutsummaryrefslogtreecommitdiff
path: root/launcher
diff options
context:
space:
mode:
authorSean Owen <sowen@cloudera.com>2016-03-16 16:11:24 +0000
committerSean Owen <sowen@cloudera.com>2016-03-16 16:11:24 +0000
commit9412547e7a58c40e7be5acacaf15cc48057cb18a (patch)
tree4c8e11c79c5cd3e8d6bb41da0c2a29ff068455bf /launcher
parent496d2a2b403ac83b390a90519217dd310b0013a4 (diff)
downloadspark-9412547e7a58c40e7be5acacaf15cc48057cb18a.tar.gz
spark-9412547e7a58c40e7be5acacaf15cc48057cb18a.tar.bz2
spark-9412547e7a58c40e7be5acacaf15cc48057cb18a.zip
[SPARK-13823][HOTFIX] Increase tryAcquire timeout and assert it succeeds to fix failure on slow machines
## What changes were proposed in this pull request? I'm seeing several PR builder builds fail after https://github.com/apache/spark/pull/11725/files. Example: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.4/lastFailedBuild/console ``` testCommunication(org.apache.spark.launcher.LauncherServerSuite) Time elapsed: 0.023 sec <<< FAILURE! java.lang.AssertionError: expected:<app-id> but was:<null> at org.apache.spark.launcher.LauncherServerSuite.testCommunication(LauncherServerSuite.java:93) ``` However, other builds pass this same test, including the test when run locally and on the Jenkins PR builder. The failure itself concerns a change to how the test waits on a condition, and the wait can time out; therefore I think this is due to fast/slow machine differences. This is an attempt at a hot fix; it's a little hard to verify since locally and on the PR builder, it passes anyway. The change itself should be harmless anyway. Why didn't this happen before, if the new logic was supposed to be equivalent to the old? I think this is the sequence: - First attempt to acquire semaphore for 10ms actually silently times out - The changed being waited for happens just after that, a bit too late - Assertion passes since condition became true just in time - `release()` fires from the listener - Next `tryAcquire` however immediately succeeds because the first `tryAcquire` didn't acquire anything, but its subsequent condition is not yet true; this would explain why the second one always fails Versus the original using `notifyAll()`, there's a small difference: `wait()`-ing after `notifyAll()` just results in another wait; it doesn't make it return immediately. So this was a tiny latent issue that was masked by the semantics. Now the test asserts that the event actually happened (semaphore was acquired). (The timeout is still here to prevent the test from hanging forever, and to detect really slow response.) The timeout is increased to a second to allow plenty of time anyway. ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11763 from srowen/SPARK-13823.3.
Diffstat (limited to 'launcher')
-rw-r--r--launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java6
1 files changed, 3 insertions, 3 deletions
diff --git a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
index 13f72b757f..5bf2babdd1 100644
--- a/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
+++ b/launcher/src/test/java/org/apache/spark/launcher/LauncherServerSuite.java
@@ -83,17 +83,17 @@ public class LauncherServerSuite extends BaseSuite {
client = new TestClient(s);
client.send(new Hello(handle.getSecret(), "1.4.0"));
- semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
+ assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
// Make sure the server matched the client to the handle.
assertNotNull(handle.getConnection());
client.send(new SetAppId("app-id"));
- semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
+ assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
assertEquals("app-id", handle.getAppId());
client.send(new SetState(SparkAppHandle.State.RUNNING));
- semaphore.tryAcquire(10, TimeUnit.MILLISECONDS);
+ assertTrue(semaphore.tryAcquire(1, TimeUnit.SECONDS));
assertEquals(SparkAppHandle.State.RUNNING, handle.getState());
handle.stop();