path: root/project
author Michael Armbrust <michael@databricks.com> 2016-03-23 13:02:40 -0700
committer Michael Armbrust <michael@databricks.com> 2016-03-23 13:03:25 -0700
commit 6bc4be64f86afcb38e4444c80c9400b7b6b745de
tree b4a671d489eef1e850590cf078764fc77e392870 /project
parent 919bf321987712d9143cae3c4e064fcb077ded1f
[SPARK-14078] Streaming Parquet Based FileSink
This PR adds a new `Sink` implementation that writes out Parquet files. To handle partial failures correctly while maintaining exactly-once semantics, the files for each batch are written to a unique directory and then atomically appended to a metadata log. When a Parquet-based `DataSource` is initialized for reading, we first check for this log directory and, when it is present, use it instead of file listing.

Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.

Author: Michael Armbrust <michael@databricks.com>

Closes #11897 from marmbrus/fileSink.
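
The commit protocol described above can be illustrated with a minimal Python sketch. This is not the code from the PR: the names `write_batch`, `read_committed`, and `_metadata_log` are hypothetical, JSON files stand in for Parquet, and a local filesystem hard link stands in for whatever atomic operation the real sink uses. It only shows the idea that data files become visible to readers when, and only when, their batch entry appears in the log.

```python
import json
import os
import uuid
from pathlib import Path

LOG_DIR = "_metadata_log"  # hypothetical name, not taken from the PR


def write_batch(out_dir: Path, batch_id: int, rows: list) -> bool:
    """Write one batch. Returns False if the batch was already committed."""
    log_dir = out_dir / LOG_DIR
    log_dir.mkdir(parents=True, exist_ok=True)
    entry = log_dir / str(batch_id)
    if entry.exists():
        return False  # replay of a committed batch: do nothing

    # Step 1: write the data into a directory unique to this attempt, so a
    # failed attempt can never collide with a later retry.
    batch_dir = out_dir / f"batch-{batch_id}-{uuid.uuid4().hex}"
    batch_dir.mkdir()
    data_file = batch_dir / "part-00000.json"  # JSON stands in for Parquet
    data_file.write_text(json.dumps(rows))

    # Step 2: commit by publishing a complete log entry in one atomic step.
    # os.link fails if the entry already exists, so only one attempt wins.
    tmp = log_dir / f".{batch_id}.{uuid.uuid4().hex}.tmp"
    tmp.write_text(json.dumps([str(data_file.relative_to(out_dir))]))
    try:
        os.link(tmp, entry)
    except FileExistsError:
        return False
    finally:
        tmp.unlink()
    return True


def read_committed(out_dir: Path) -> list:
    """Read rows, preferring the metadata log over file listing."""
    log_dir = out_dir / LOG_DIR
    if not log_dir.is_dir():
        # No log present: fall back to plain file listing.
        paths = sorted(out_dir.rglob("*.json"))
    else:
        entries = [p for p in log_dir.iterdir() if p.name.isdigit()]
        entries.sort(key=lambda p: int(p.name))
        paths = [out_dir / rel
                 for e in entries
                 for rel in json.loads(e.read_text())]
    rows = []
    for path in paths:
        rows.extend(json.loads(path.read_text()))
    return rows
```

In this sketch, files left behind by an attempt that failed before step 2 are never listed in the log, so `read_committed` ignores them, and retrying a batch that already has a log entry is a no-op. Together those two properties are what give the exactly-once behavior in this simplified setting.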
Diffstat (limited to 'project')
0 files changed, 0 insertions, 0 deletions