author | Michael Armbrust <michael@databricks.com> | 2016-03-23 13:02:40 -0700 |
---|---|---|
committer | Michael Armbrust <michael@databricks.com> | 2016-03-23 13:03:25 -0700 |
commit | 6bc4be64f86afcb38e4444c80c9400b7b6b745de (patch) | |
tree | b4a671d489eef1e850590cf078764fc77e392870 /project | |
parent | 919bf321987712d9143cae3c4e064fcb077ded1f (diff) | |
[SPARK-14078] Streaming Parquet Based FileSink
This PR adds a new `Sink` implementation that writes out Parquet files. To handle partial failures correctly while maintaining exactly-once semantics, the files for each batch are written to a unique directory and then atomically appended to a metadata log. When a Parquet-based `DataSource` is initialized for reading, we first check for this log directory and, when it is present, use it instead of file listing.
Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.
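The commit protocol described above can be illustrated with a minimal sketch. This is not the code from this PR: it is a standalone Python model, assuming a single writer and a local POSIX-like file system. The names `LOG_DIR`, `add_batch`, and `list_committed_files` are hypothetical, and JSON files stand in for Parquet files.

```python
import json
import os
import tempfile
import uuid

LOG_DIR = "_metadata_log"  # hypothetical directory name, not taken from the PR


def add_batch(base_dir, batch_id, rows):
    """Write one batch; return False if this batch id was already committed."""
    log_dir = os.path.join(base_dir, LOG_DIR)
    os.makedirs(log_dir, exist_ok=True)
    entry = os.path.join(log_dir, str(batch_id))
    if os.path.exists(entry):
        return False  # replayed batch: it is already in the log, so skip it

    # Step 1: write the data into a directory unique to this attempt, so a
    # failed attempt can never clobber or mix with another attempt's files.
    attempt_dir = os.path.join(base_dir, f"batch-{batch_id}-{uuid.uuid4().hex}")
    os.makedirs(attempt_dir)
    data_file = os.path.join(attempt_dir, "part-0.json")
    with open(data_file, "w") as f:
        json.dump(rows, f)

    # Step 2: commit by writing the log entry to a temp file and renaming it
    # into place. Readers see either no entry or a complete one.
    fd, tmp = tempfile.mkstemp(dir=log_dir, prefix=".tmp-")
    with os.fdopen(fd, "w") as f:
        json.dump([os.path.relpath(data_file, base_dir)], f)
    os.replace(tmp, entry)
    return True


def list_committed_files(base_dir):
    """Return files named in the log, or None if there is no log directory."""
    log_dir = os.path.join(base_dir, LOG_DIR)
    if not os.path.isdir(log_dir):
        return None  # no log: the caller falls back to plain file listing
    files = []
    # Only numeric names are log entries; leftover temp files are ignored.
    entries = [name for name in os.listdir(log_dir) if name.isdigit()]
    for name in sorted(entries, key=int):
        with open(os.path.join(log_dir, name)) as f:
            files.extend(json.load(f))
    return files
```

In this sketch, files left behind by an attempt that failed before step 2 are never listed in the log, so a reader that consults the log does not see them, and a replayed batch whose entry already exists is skipped rather than written twice.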
Author: Michael Armbrust <michael@databricks.com>
Closes #11897 from marmbrus/fileSink.
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions