[SPARK-19067][SS] Processing-time-based timeout in MapGroupsWithState

## What changes were proposed in this pull request? When a key does not get any new data in `mapGroupsWithState`, the mapping function is never called on it. So we need a timeout feature that calls the function again in such cases, so that the user can decide whether to continue waiting or clean up (remove state, save stuff externally, etc.). Timeouts can be either based on processing time or event time. This JIRA is for processing time, but defines the high level API design for both. The usage would look like this. ``` def stateFunction(key: K, value: Iterator[V], state: KeyedState[S]): U = { ... state.setTimeoutDuration(10000) ... } dataset // type is Dataset[T] .groupByKey[K](keyingFunc) // generates KeyValueGroupedDataset[K, T] .mapGroupsWithState[S, U]( func = stateFunction, timeout = KeyedStateTimeout.withProcessingTime) // returns Dataset[U] ``` Note the following design aspects. - The timeout type is provided as a param in mapGroupsWithState as a parameter global to all the keys. This is so that the planner knows this at planning time, and accordingly optimize the execution based on whether to saves extra info in state or not (e.g. timeout durations or timestamps). - The exact timeout duration is provided inside the function call so that it can be customized on a per key basis. - When the timeout occurs for a key, the function is called with no values, and KeyedState.isTimingOut() set to true. - The timeout is reset for key every time the function is called on the key, that is, when the key has new data, or the key has timed out. So the user has to set the timeout duration everytime the function is called, otherwise there will not be any timeout set. Guarantees provided on timeout of key, when timeout duration is D ms: - Timeout will never be called before real clock time has advanced by D ms - Timeout will be called eventually when there is a trigger with any data in it (i.e. after D ms). So there is a no strict upper bound on when the timeout would occur. For example, if there is no data in the stream (for any key) for a while, then the timeout will not be hit. Implementation details: - Added new param to `mapGroupsWithState` for timeout - Added new method to `StateStore` to filter data based on timeout timestamp - Changed the internal map type of `HDFSBackedStateStore` from Java's `HashMap` to `ConcurrentHashMap` as the latter allows weakly-consistent fail-safe iterators on the map data. See comments in code for more details. - Refactored logic of `MapGroupsWithStateExec` to - Save timeout info to state store for each key that has data. - Then, filter states that should be timed out based on the current batch processing timestamp. - Moved KeyedState for `o.a.s.sql` to `o.a.s.sql.streaming`. I remember that this was a feedback in the MapGroupsWithState PR that I had forgotten to address. ## How was this patch tested? New unit tests in - MapGroupsWithStateSuite for timeouts. - StateStoreSuite for new APIs in StateStore. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #17179 from tdas/mapgroupwithstate-timeout.
author: Tathagata Das <tathagata.das1565@gmail.com> 2017-03-19 14:07:49 -0700
committer: Tathagata Das <tathagata.das1565@gmail.com> 2017-03-19 14:07:49 -0700
commit: 990af630d0d569880edd9c7ce9932e10037a28ab (patch)
tree: 3c25483808ca877693f42d3d10ebd49987e86645 /sql/catalyst/src/test/java/org/apache
parent: 0ee9fbf51ac863e015d57ae7824a39bd3b36141a (diff)
download: spark-990af630d0d569880edd9c7ce9932e10037a28ab.tar.gz
spark-990af630d0d569880edd9c7ce9932e10037a28ab.tar.bz2
spark-990af630d0d569880edd9c7ce9932e10037a28ab.zip
1 files changed, 29 insertions, 0 deletions
diff --git a/sql/catalyst/src/test/java/org/apache/spark/sql/streaming/JavaKeyedStateTimeoutSuite.java b/sql/catalyst/src/test/java/org/apache/spark/sql/streaming/JavaKeyedStateTimeoutSuite.java
new file mode 100644
index 0000000000..02c94b0b32
--- /dev/null
+++ b/sql/catalyst/src/test/java/org/apache/spark/sql/streaming/JavaKeyedStateTimeoutSuite.java
@@ -0,0 +1,29 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.sql.streaming;
+
+import org.apache.spark.sql.catalyst.plans.logical.ProcessingTimeTimeout$;
+import org.junit.Test;
+
+public class JavaKeyedStateTimeoutSuite {
+
+  @Test
+  public void testTimeouts() {
+    assert(KeyedStateTimeout.ProcessingTimeTimeout() == ProcessingTimeTimeout$.MODULE$);
+  }
+}
author	Tathagata Das <tathagata.das1565@gmail.com>	2017-03-19 14:07:49 -0700
committer	Tathagata Das <tathagata.das1565@gmail.com>	2017-03-19 14:07:49 -0700
commit	990af630d0d569880edd9c7ce9932e10037a28ab (patch)
tree	3c25483808ca877693f42d3d10ebd49987e86645 /sql/catalyst/src/test/java/org/apache
parent	0ee9fbf51ac863e015d57ae7824a39bd3b36141a (diff)
download	spark-990af630d0d569880edd9c7ce9932e10037a28ab.tar.gz spark-990af630d0d569880edd9c7ce9932e10037a28ab.tar.bz2 spark-990af630d0d569880edd9c7ce9932e10037a28ab.zip