[SPARK-14078] Streaming Parquet Based FileSink - spark


author	Michael Armbrust <michael@databricks.com>	2016-03-23 13:02:40 -0700
committer	Michael Armbrust <michael@databricks.com>	2016-03-23 13:03:25 -0700
commit	6bc4be64f86afcb38e4444c80c9400b7b6b745de (patch)
tree	b4a671d489eef1e850590cf078764fc77e392870 /project
parent	919bf321987712d9143cae3c4e064fcb077ded1f (diff)

[SPARK-14078] Streaming Parquet Based FileSink

This PR adds a new `Sink` implementation that writes out Parquet files. To handle partial failures correctly while maintaining exactly-once semantics, the files for each batch are written to a unique directory and then atomically appended to a metadata log. When a Parquet-based `DataSource` is initialized for reading, we first check for this log directory and, when it is present, use it instead of file listing.

Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.

Author: Michael Armbrust <michael@databricks.com>

Closes #11897 from marmbrus/fileSink.
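The commit protocol described in the message above can be illustrated with a minimal sketch. This is not the code from the PR: it is plain Python rather than Spark's Scala, it writes JSON instead of Parquet, and the names `LOG_DIR`, `write_batch`, and `committed_files` are hypothetical. It only shows the idea of writing each batch to a unique directory, publishing it with one atomic log entry, and having readers prefer the log over a directory listing.

```python
import json
import os
import tempfile
import uuid

LOG_DIR = "_metadata_log"  # hypothetical name, not taken from the commit


def write_batch(base, batch_id, rows):
    """Write one batch; return True only if this call committed it."""
    log_dir = os.path.join(base, LOG_DIR)
    os.makedirs(log_dir, exist_ok=True)
    entry = os.path.join(log_dir, str(batch_id))
    if os.path.exists(entry):
        return False  # already committed, so a retry is a no-op

    # Step 1: write the data into a directory unique to this attempt.
    # Files left behind by a failed attempt are never referenced by the log.
    attempt_dir = os.path.join(base, f"batch-{batch_id}-{uuid.uuid4().hex}")
    os.makedirs(attempt_dir)
    data_file = os.path.join(attempt_dir, "part-00000.json")
    with open(data_file, "w") as f:
        json.dump(rows, f)

    # Step 2: commit by publishing the file list as one log entry.
    # os.link fails if the entry already exists, so only one attempt wins.
    tmp = f"{entry}.{uuid.uuid4().hex}.tmp"
    with open(tmp, "w") as f:
        json.dump([data_file], f)
    try:
        os.link(tmp, entry)
    except FileExistsError:
        return False
    finally:
        os.remove(tmp)
    return True


def committed_files(base):
    """Return the data files a reader should use."""
    log_dir = os.path.join(base, LOG_DIR)
    if not os.path.isdir(log_dir):
        # No log present: fall back to a plain recursive file listing.
        return sorted(
            os.path.join(root, name)
            for root, _, names in os.walk(base)
            for name in names
        )
    files = []
    batch_ids = sorted((n for n in os.listdir(log_dir) if n.isdigit()), key=int)
    for name in batch_ids:
        with open(os.path.join(log_dir, name)) as f:
            files.extend(json.load(f))
    return files


if __name__ == "__main__":
    base = tempfile.mkdtemp()
    write_batch(base, 0, [1, 2])
    write_batch(base, 0, [1, 2])  # retried batch is skipped
    print(committed_files(base))
```

Because readers consult only the log, a batch that failed midway leaves orphaned files that are simply ignored, and a retried batch cannot be counted twice.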

Diffstat (limited to 'project')

0 files changed, 0 insertions, 0 deletions

