author | Reynold Xin <rxin@databricks.com> | 2016-12-21 23:50:35 +0100
---|---|---
committer | Herman van Hovell <hvanhovell@databricks.com> | 2016-12-21 23:50:35 +0100
commit | 354e936187708a404c0349e3d8815a47953123ec (patch) |
tree | 7173f87541395152d3b3f7086fdd820f553c4c08 | /dev/sparktestsupport
parent | 078c71c2dcbb1470d22f8eb8138fb17e3d7c2414 (diff) |
# [SPARK-18775][SQL] Limit the max number of records written per file
## What changes were proposed in this pull request?
Currently, Spark writes a single file out per task, sometimes leading to very large files. It would be great to have an option to limit the max number of records written per file in a task, to avoid humongous files.
This patch introduces a new write config option `maxRecordsPerFile` (defaulting to the session-wide setting `spark.sql.files.maxRecordsPerFile`) that limits the max number of records written to a single file. A non-positive value indicates there is no limit (the same behavior as not setting the flag).
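The rollover semantics described above can be illustrated with a small self-contained sketch (plain Python, not the actual Spark writer code; the function name is invented for illustration): once a file reaches the cap, subsequent records spill into a new file, and a non-positive cap means everything lands in one file.

```python
def split_into_files(records, max_records_per_file):
    """Return the record batches that would land in separate output files.

    A non-positive limit means "no limit", matching the behavior
    described in this patch.
    """
    records = list(records)
    if max_records_per_file <= 0:
        return [records]
    return [records[i:i + max_records_per_file]
            for i in range(0, len(records), max_records_per_file)]

# 25 records with a cap of 10 -> three files (10, 10, and 5 records).
print([len(f) for f in split_into_files(range(25), 10)])  # [10, 10, 5]
print([len(f) for f in split_into_files(range(25), 0)])   # [25]
```

In practice the cap would presumably be set either per write (e.g. `df.write.option("maxRecordsPerFile", 10000)`) or session-wide via `spark.sql.files.maxRecordsPerFile`, matching the two option names introduced by this patch.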
## How was this patch tested?
Added test cases in PartitionedWriteSuite for both dynamic partition insert and non-dynamic partition insert.
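For the dynamic partition insert case, the property being tested is presumably that records are first grouped by partition value and the per-file cap then applies within each partition independently. A hedged, self-contained sketch of that invariant (helper names invented; not the actual PartitionedWriteSuite code):

```python
from collections import defaultdict

def files_per_partition(records, key, max_records_per_file):
    """Group records by partition key, then cap each partition's files.

    Mirrors the expected behavior: the limit applies per file within
    each dynamic partition, and a non-positive limit disables capping.
    """
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    out = {}
    for k, rs in groups.items():
        if max_records_per_file <= 0:
            out[k] = [rs]
        else:
            out[k] = [rs[i:i + max_records_per_file]
                      for i in range(0, len(rs), max_records_per_file)]
    return out

# 30 records split evenly over 2 partitions (15 each), cap 10:
# each partition gets two files of 10 and 5 records.
recs = [(i % 2, i) for i in range(30)]
result = files_per_partition(recs, key=lambda r: r[0],
                             max_records_per_file=10)
print({k: [len(f) for f in fs] for k, fs in result.items()})
# {0: [10, 5], 1: [10, 5]}
```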
Author: Reynold Xin <rxin@databricks.com>
Closes #16204 from rxin/SPARK-18775.
Diffstat (limited to 'dev/sparktestsupport')
0 files changed, 0 insertions, 0 deletions