path: root/dev/sparktestsupport
author	Reynold Xin <rxin@databricks.com>	2016-12-21 23:50:35 +0100
committer	Herman van Hovell <hvanhovell@databricks.com>	2016-12-21 23:50:35 +0100
commit	354e936187708a404c0349e3d8815a47953123ec (patch)
tree	7173f87541395152d3b3f7086fdd820f553c4c08 /dev/sparktestsupport
parent	078c71c2dcbb1470d22f8eb8138fb17e3d7c2414 (diff)
download	spark-354e936187708a404c0349e3d8815a47953123ec.tar.gz
	spark-354e936187708a404c0349e3d8815a47953123ec.tar.bz2
	spark-354e936187708a404c0349e3d8815a47953123ec.zip
[SPARK-18775][SQL] Limit the max number of records written per file
## What changes were proposed in this pull request?

Currently, Spark writes a single file per task, which can sometimes produce very large files. This patch adds an option to limit the maximum number of records written per file in a task, to avoid such humongous files. It introduces a new write option, `maxRecordsPerFile` (defaulting to the session-wide setting `spark.sql.files.maxRecordsPerFile`), that caps the number of records written to a single file. A non-positive value means there is no limit (the same behavior as not setting the option).

## How was this patch tested?

Added test cases in PartitionedWriteSuite for both dynamic and non-dynamic partition inserts.

Author: Reynold Xin <rxin@databricks.com>

Closes #16204 from rxin/SPARK-18775.
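Below is a minimal sketch of how the option described above might be used, both as the session-wide conf and as a per-write option. The app name, `local[*]` master, output path, and record counts are illustrative assumptions, not part of the patch.

```scala
import org.apache.spark.sql.SparkSession

object MaxRecordsPerFileDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("maxRecordsPerFile-demo")
      .master("local[*]")               // assumption: local run for illustration
      .getOrCreate()

    // Session-wide default: writes in this session roll over to a new
    // output file once a file reaches 1000 records.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 1000L)

    val df = spark.range(0L, 10000L).toDF("id")

    // Per-write override: cap each output file at 500 records for this write.
    df.write
      .option("maxRecordsPerFile", 500L)
      .mode("overwrite")
      .parquet("/tmp/max-records-demo") // hypothetical output path

    // A non-positive value (e.g. 0 or -1) means no limit, the same
    // behavior as not setting the option at all.
    spark.stop()
  }
}
```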
Diffstat (limited to 'dev/sparktestsupport')
0 files changed, 0 insertions, 0 deletions