author: Tyson Condie <tcondie@gmail.com> 2016-11-28 23:07:17 -0800
committer: Tathagata Das <tathagata.das1565@gmail.com> 2016-11-28 23:07:17 -0800
commit: 3c0beea4752d39ee630a107316f40aff4a1b4ae7 (patch)
tree: 9136929f1c607ed7f0218afb56e18b19fe7e1a0b /graphx
parent: e2318ede04fa7a756d1c8151775e1f2406a176ca (diff)
[SPARK-18339][SPARK-18513][SQL] Don't push down current_timestamp for filters in StructuredStreaming and persist batch and watermark timestamps to offset log.
## What changes were proposed in this pull request?

For the following workflow:

1. I have a column called time which is at minute-level precision in a Streaming DataFrame.
2. I want to perform groupBy time, count.
3. Then I want my MemorySink to only have the last 30 minutes of counts, and I perform this by .where('time >= current_timestamp().cast("long") - 30 * 60)

What happens is that the `filter` gets pushed down below the aggregation, so it is applied to the source data of the aggregation instead of to the result of the aggregation (which is where I actually want to filter).

The main issue here is that `current_timestamp` is non-deterministic in the streaming context, so a filter that uses it shouldn't be pushed down. Whether this requires us to store the `current_timestamp` for each trigger of the streaming job is something to discuss.

Furthermore, we want to persist the current batch timestamp and the watermark timestamp to the offset log so that these values are consistent across multiple executions of the same batch.

brkyvz zsxwing tdas

## How was this patch tested?

A test was added to StreamingAggregationSuite ensuring the above use case is handled. The test injects a stream of time values (in seconds) into a query that runs in complete mode and only outputs the (count) aggregation results for the past 10 seconds.

Author: Tyson Condie <tcondie@gmail.com>

Closes #15949 from tcondie/SPARK-18339.
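
To make the filter-placement problem described above concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark code: the `run` function, the `batches` data, and the trigger times are all hypothetical, and the running `Counter` only stands in for the state of a complete-mode streaming aggregation. It compares filtering the input rows (the pushed-down plan) with filtering the aggregated result (the intended plan).

```python
from collections import Counter

WINDOW = 30 * 60  # seconds, mirroring the 30-minute filter in the description


def run(batches, push_down):
    """Toy model of a complete-mode streaming count (not Spark's implementation).

    batches is a list of (now, rows) pairs, where now is the trigger time in
    seconds and rows is a list of time values arriving in that trigger.
    push_down=True filters the input rows before aggregation;
    push_down=False filters the aggregated result instead.
    """
    state = Counter()  # running counts per time value, kept across triggers
    outputs = []
    for now, rows in batches:
        cutoff = now - WINDOW
        if push_down:
            # Filter applied to the source data of the aggregation.
            rows = [t for t in rows if t >= cutoff]
        state.update(rows)
        result = dict(state)
        if not push_down:
            # Filter applied to the result of the aggregation.
            result = {t: c for t, c in result.items() if t >= cutoff}
        outputs.append(result)
    return outputs


# Hypothetical input: two triggers, at t=2000s and t=4000s.
batches = [
    (2000, [600, 600, 1200]),
    (4000, [3000]),
]

pushed = run(batches, push_down=True)
intended = run(batches, push_down=False)

print("pushed down:", pushed[1])
print("intended:   ", intended[1])
```

In this model both plans agree on the first trigger, but on the second trigger the pushed-down plan still reports the groups for 600 and 1200, because they entered the aggregation state while they were inside the window and nothing removes them afterwards. Filtering the aggregated result drops them, which is the behavior the workflow asks for.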
Diffstat (limited to 'graphx')
0 files changed, 0 insertions, 0 deletions