author | Reynold Xin <rxin@databricks.com> | 2016-12-21 23:50:35 +0100
---|---|---
committer | Herman van Hovell <hvanhovell@databricks.com> | 2016-12-21 23:50:35 +0100
commit | 354e936187708a404c0349e3d8815a47953123ec (patch) |
tree | 7173f87541395152d3b3f7086fdd820f553c4c08 | /dev/sparktestsupport
parent | 078c71c2dcbb1470d22f8eb8138fb17e3d7c2414 (diff) |
# [SPARK-18775][SQL] Limit the max number of records written per file
## What changes were proposed in this pull request?
Currently, Spark writes a single file out per task, sometimes leading to very large files. It would be great to have an option to limit the max number of records written per file in a task, to avoid humongous files.
This patch introduces a new write config option `maxRecordsPerFile` (defaulting to the session-wide setting `spark.sql.files.maxRecordsPerFile`) that limits the max number of records written to a single file. A non-positive value indicates there is no limit (the same behavior as not setting the flag).
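The rollover semantics described above can be illustrated with a small self-contained sketch (plain Python, not the actual Spark writer code; the function name is invented for illustration): once a file reaches the cap, subsequent records spill into a new file, and a non-positive cap means everything lands in one file.

```python
def split_into_files(records, max_records_per_file):
    """Return the record batches that would land in separate output files.

    A non-positive limit means "no limit", matching the behavior
    described in this patch.
    """
    records = list(records)
    if max_records_per_file <= 0:
        return [records]
    return [records[i:i + max_records_per_file]
            for i in range(0, len(records), max_records_per_file)]

# 25 records with a cap of 10 -> three files (10, 10, and 5 records).
print([len(f) for f in split_into_files(range(25), 10)])  # [10, 10, 5]
print([len(f) for f in split_into_files(range(25), 0)])   # [25]
```

In practice the cap would presumably be set either per write (e.g. `df.write.option("maxRecordsPerFile", 10000)`) or session-wide via `spark.sql.files.maxRecordsPerFile`, matching the two option names introduced by this patch.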
## How was this patch tested?
Added test cases in PartitionedWriteSuite for both dynamic partition insert and non-dynamic partition insert.
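For the dynamic partition insert case, the property being tested is presumably that records are first grouped by partition value and the per-file cap then applies within each partition independently. A hedged, self-contained sketch of that invariant (helper names invented; not the actual PartitionedWriteSuite code):

```python
from collections import defaultdict

def files_per_partition(records, key, max_records_per_file):
    """Group records by partition key, then cap each partition's files.

    Mirrors the expected behavior: the limit applies per file within
    each dynamic partition, and a non-positive limit disables capping.
    """
    groups = defaultdict(list)
    for r in records:
        groups[key(r)].append(r)
    out = {}
    for k, rs in groups.items():
        if max_records_per_file <= 0:
            out[k] = [rs]
        else:
            out[k] = [rs[i:i + max_records_per_file]
                      for i in range(0, len(rs), max_records_per_file)]
    return out

# 30 records split evenly over 2 partitions (15 each), cap 10:
# each partition gets two files of 10 and 5 records.
recs = [(i % 2, i) for i in range(30)]
result = files_per_partition(recs, key=lambda r: r[0],
                             max_records_per_file=10)
print({k: [len(f) for f in fs] for k, fs in result.items()})
# {0: [10, 5], 1: [10, 5]}
```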
Author: Reynold Xin <rxin@databricks.com>
Closes #16204 from rxin/SPARK-18775.
Diffstat (limited to 'dev/sparktestsupport')
0 files changed, 0 insertions, 0 deletions