[SPARK-13809][SQL] State store for streaming aggregations - spark

author	Tathagata Das <tathagata.das1565@gmail.com>	2016-03-23 12:48:05 -0700
committer	Tathagata Das <tathagata.das1565@gmail.com>	2016-03-23 12:48:05 -0700
commit	8c826880f5eaa3221c4e9e7d3fece54e821a0b98 (patch)
tree	b6dbe3670844bac231b787ccd9a97d2797f0a181 /core
parent	0a64294fcb4b64bfe095c63c3a494e0f40e22743 (diff)

[SPARK-13809][SQL] State store for streaming aggregations

## What changes were proposed in this pull request?

In this PR, I am implementing a new abstraction for managing streaming state data: the State Store. It is a key-value store for persisting running aggregates for aggregate operations in streaming DataFrames. The motivation and design are discussed here:
https://docs.google.com/document/d/1-ncawFx8JS5Zyfq1HAEGBx56RDet9wfVp_hDM8ZL254/edit#

## How was this patch tested?

- [x] Unit tests
- [x] Cluster tests

**Coverage from unit tests**

<img width="952" alt="screen shot 2016-03-21 at 3 09 40 pm" src="https://cloud.githubusercontent.com/assets/663212/13935872/fdc8ba86-ef76-11e5-93e8-9fa310472c7b.png">

## TODO

- [x] Fix updates() iterator to avoid duplicate updates for same key
- [x] Use Coordinator in ContinuousQueryManager
- [x] Plugging in hadoop conf and other confs
- [x] Unit tests
  - [x] StateStore object lifecycle and methods
  - [x] StateStoreCoordinator communication and logic
  - [x] StateStoreRDD fault-tolerance
  - [x] StateStoreRDD preferred location using StateStoreCoordinator
- [ ] Cluster tests
  - [ ] Whether preferred locations are set correctly
  - [ ] Whether recovery works correctly with distributed storage
- [x] Basic performance tests
- [x] Docs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #11645 from tdas/state-store.
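
As a rough illustration of the idea in the commit message above (a key-value store that persists running aggregates between streaming batches), here is a minimal in-memory sketch in Python. It is not the Spark implementation or its API: the class `ToyStateStore`, its method names, and the use of integer versions per committed batch are all assumptions made for this example. The only details taken from the message are that the store is key-value, that it holds running aggregates, and that an `updates()` iterator should report each changed key once.

```python
from typing import Dict, Iterator, Tuple


class ToyStateStore:
    """Hypothetical in-memory stand-in for a key-value state store.

    Versioning by batch is an assumption of this sketch, not something
    stated in the commit message.
    """

    def __init__(self) -> None:
        # version number -> committed snapshot of key -> running aggregate
        self._versions: Dict[int, Dict[str, int]] = {0: {}}
        self._pending: Dict[str, int] = {}
        self._base_version = 0

    def begin(self, version: int) -> None:
        """Start a batch of updates on top of an already committed version."""
        if version not in self._versions:
            raise KeyError(f"version {version} was never committed")
        self._base_version = version
        self._pending = {}

    def get(self, key: str) -> int:
        """Return the current aggregate for key, or 0 if it has none yet."""
        if key in self._pending:
            return self._pending[key]
        return self._versions[self._base_version].get(key, 0)

    def update(self, key: str, delta: int) -> None:
        """Fold delta into the running aggregate (a sum) for key."""
        self._pending[key] = self.get(key) + delta

    def updates(self) -> Iterator[Tuple[str, int]]:
        """Yield each key changed in this batch once, with its latest value."""
        return iter(self._pending.items())

    def commit(self) -> int:
        """Persist pending changes as a new version and return its number."""
        new_version = self._base_version + 1
        snapshot = dict(self._versions[self._base_version])
        snapshot.update(self._pending)
        self._versions[new_version] = snapshot
        self._pending = {}
        self._base_version = new_version
        return new_version


# Batch 1: a running count per word, starting from the empty version 0.
store = ToyStateStore()
store.begin(0)
for word in ["a", "b", "a"]:
    store.update(word, 1)
changed = dict(store.updates())  # each changed key appears once
v1 = store.commit()

# Batch 2: continue from the state committed by batch 1.
store.begin(v1)
store.update("a", 1)
v2 = store.commit()
```

Keeping pending changes in a dictionary keyed by the state key is what makes `updates()` report a key only once per batch, even when that key was updated several times.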

Diffstat (limited to 'core')

0 files changed, 0 insertions, 0 deletions

