diff options
author | Shixiong Zhu <shixiong@databricks.com> | 2017-02-23 11:25:39 -0800 |
---|---|---|
committer | Tathagata Das <tathagata.das1565@gmail.com> | 2017-02-23 11:25:39 -0800 |
commit | 9bf4e2baad0e2851da554d85223ffaa029cfa490 (patch) | |
tree | a08773e6a82e7d5fa78ca2f71d707e74be36a9cc /bin/run-example | |
parent | 7bf09433f5c5e08154ba106be21fe24f17cd282b (diff) | |
download | spark-9bf4e2baad0e2851da554d85223ffaa029cfa490.tar.gz spark-9bf4e2baad0e2851da554d85223ffaa029cfa490.tar.bz2 spark-9bf4e2baad0e2851da554d85223ffaa029cfa490.zip |
[SPARK-19497][SS] Implement streaming deduplication
## What changes were proposed in this pull request?
This PR adds a special streaming deduplication operator to support `dropDuplicates` with `aggregation` and watermark. It reuses the `dropDuplicates` API but creates new logical plan `Deduplication` and new physical plan `DeduplicationExec`.
The following cases are supported:
- one or multiple `dropDuplicates()` without aggregation (with or without watermark)
- `dropDuplicates` before aggregation
Not supported cases:
- `dropDuplicates` after aggregation
Breaking changes:
- `dropDuplicates` without aggregation doesn't work with `complete` or `update` mode.
## How was this patch tested?
The new unit tests.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #16970 from zsxwing/dedup.
Diffstat (limited to 'bin/run-example')
0 files changed, 0 insertions, 0 deletions