author | Michael Armbrust <michael@databricks.com> | 2015-08-07 16:24:50 -0700 |
---|---|---|
committer | Michael Armbrust <michael@databricks.com> | 2015-08-07 16:24:50 -0700 |
commit | 49702bd738de681255a7177339510e0e1b25a8db (patch) | |
tree | 4f982b881a15175228b659f917eb1e12e345d6ff /ec2 | |
parent | 902334fd55bbe40a57c1de2a9bdb25eddf1c8cf6 (diff) | |
[SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions
Previously, we would open a new file for each new dynamic partition written out using `HadoopFsRelation`. For formats like Parquet this is very costly because of the buffers required to get good compression. In this PR I refactor the code so that we can fall back on an external sort when many partitions are seen. As a result, each task will open no more than `spark.sql.sources.maxFiles` files. I also did the following cleanup:
- Instead of keying the file HashMap on an expensive-to-compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations.
- The control flow for instantiating and invoking a writer container has been simplified. Now instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`.
- `InternalOutputWriter` has been removed. Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method. This method can be overridden by internal datasources to avoid the conversion. This change removes a lot of code duplication and per-row `asInstanceOf` checks.
- `commands.scala` has been split up.
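The fallback described above can be illustrated with a minimal, language-neutral sketch. This is not the Scala code from this PR: the function name `write_rows`, the `max_files` parameter, and the dict of lists standing in for output files are all hypothetical, and an in-memory `sorted` call stands in for the external sort.

```python
from itertools import groupby
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (partition key, value); hypothetical row shape


def write_rows(rows: Iterable[Row], max_files: int,
               output: Dict[str, List[int]]) -> int:
    """Write rows grouped by partition key into `output`.

    Each list in `output` stands in for one partition's file(s).
    Returns the peak number of writers that were open at the same time.
    """
    open_writers: Dict[str, List[int]] = {}  # key -> stand-in for an open file
    peak_open = 0
    it = iter(rows)
    for key, value in it:
        if key not in open_writers:
            if len(open_writers) >= max_files:
                # Too many partitions seen: close every open writer and
                # fall back to sorting the current row plus all remaining rows.
                open_writers.clear()
                remaining = sorted([(key, value), *it], key=lambda r: r[0])
                # Sorted input needs only one open writer at a time.
                for k, group in groupby(remaining, key=lambda r: r[0]):
                    output.setdefault(k, []).extend(v for _, v in group)
                return max(peak_open, 1)
            open_writers[key] = output.setdefault(key, [])
            peak_open = max(peak_open, len(open_writers))
        open_writers[key].append(value)
    return peak_open
```

With `max_files=2` and rows keyed `a, b, c, a, b, d`, the first two partitions are written through the map; the third distinct key triggers the sort, after which the remaining rows are written one partition at a time.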
Author: Michael Armbrust <michael@databricks.com>
Closes #8010 from marmbrus/fsWriting and squashes the following commits:
00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes
775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting
17b690e [Michael Armbrust] remove comment
40f0372 [Michael Armbrust] address comments
f5675bd [Michael Armbrust] char -> string
7e2d0a4 [Michael Armbrust] make sure we close current writer
8100100 [Michael Armbrust] delete empty commands.scala
71cc717 [Michael Armbrust] update comment
8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions
Diffstat (limited to 'ec2')
0 files changed, 0 insertions, 0 deletions