[SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions - spark


author	Michael Armbrust <michael@databricks.com>	2015-08-07 16:24:50 -0700
committer	Michael Armbrust <michael@databricks.com>	2015-08-07 16:24:50 -0700
commit	49702bd738de681255a7177339510e0e1b25a8db (patch)
tree	4f982b881a15175228b659f917eb1e12e345d6ff /mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala
parent	902334fd55bbe40a57c1de2a9bdb25eddf1c8cf6 (diff)

[SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions

Previously, we would open a new file for each new dynamic partition written out using `HadoopFsRelation`. For formats like Parquet this is very costly due to the buffers required to get good compression. In this PR I refactor the code, allowing us to fall back on an external sort when many partitions are seen. As such, each task will open no more than `spark.sql.sources.maxFiles` files.

I also did the following cleanup:
 - Instead of keying the file HashMap on an expensive-to-compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations.
 - The control flow for instantiating and invoking a writer container has been simplified. Now, instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`.
 - `InternalOutputWriter` has been removed. Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method. This method can be overridden by internal datasources to avoid the conversion. This change removes a lot of code duplication and per-row `asInstanceOf` checks.
 - `commands.scala` has been split up.

Author: Michael Armbrust <michael@databricks.com>

Closes #8010 from marmbrus/fsWriting and squashes the following commits:

00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes
775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting
17b690e [Michael Armbrust] remove comment
40f0372 [Michael Armbrust] address comments
f5675bd [Michael Armbrust] char -> string
7e2d0a4 [Michael Armbrust] make sure we close current writer
8100100 [Michael Armbrust] delete empty commands.scala
71cc717 [Michael Armbrust] update comment
8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions
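
The fallback described in the commit message can be sketched roughly as follows. This is a minimal, Spark-free illustration and not the code from this patch: `SketchWriter`, `DynamicPartitionSketch`, and this `writeRows` signature are hypothetical names, the partition key is a plain `String` rather than an `UnsafeProjection` result, and an in-memory `sortBy` stands in for the external sort. It also assumes that a partition already written before the fallback may simply get a second file afterwards.

```scala
import scala.collection.mutable

// Hypothetical stand-in for an output writer: it only records rows in memory.
final class SketchWriter(val partition: String) {
  val rows = mutable.ArrayBuffer.empty[String]
  var closed = false

  def write(row: String): Unit = {
    require(!closed, "writer already closed")
    rows += row
    ()
  }

  def close(): Unit = { closed = true }
}

object DynamicPartitionSketch {
  /** Writes (partitionKey, row) pairs while keeping at most `maxFiles` writers open.
    * Returns every writer that was created and the peak number open at once. */
  def writeRows(input: Iterator[(String, String)], maxFiles: Int): (Seq[SketchWriter], Int) = {
    require(maxFiles >= 1, "maxFiles must be positive")
    val open = mutable.LinkedHashMap.empty[String, SketchWriter]
    val all = mutable.ArrayBuffer.empty[SketchWriter]
    var peak = 0

    def newWriter(key: String): SketchWriter = {
      val w = new SketchWriter(key)
      all += w
      w
    }

    // Phase 1: one open writer per partition, looked up in a hash map.
    var pending: Option[(String, String)] = None
    var fellBack = false
    while (!fellBack && input.hasNext) {
      val (key, row) = input.next()
      open.get(key) match {
        case Some(w) => w.write(row)
        case None if open.size < maxFiles =>
          val w = newWriter(key)
          open(key) = w
          peak = math.max(peak, open.size)
          w.write(row)
        case None =>
          // Too many distinct partitions: remember this row and fall back.
          pending = Some((key, row))
          fellBack = true
      }
    }
    open.values.foreach(_.close())
    open.clear()

    if (fellBack) {
      // Phase 2: sort the remaining rows by partition key so that only one
      // writer needs to be open at any time (in-memory sort, for illustration).
      val sorted = (pending.iterator ++ input).toSeq.sortBy(_._1)
      var current: Option[SketchWriter] = None
      for ((key, row) <- sorted) {
        if (!current.exists(_.partition == key)) {
          current.foreach(_.close()) // make sure the previous writer is closed
          current = Some(newWriter(key))
          peak = math.max(peak, 1)
        }
        current.foreach(_.write(row))
      }
      current.foreach(_.close())
    }
    (all.toSeq, peak)
  }
}
```

With `maxFiles = 2` and keys arriving as `a, b, a, c, b, d`, the sketch writes `a` and `b` through the hash map, then falls back when `c` appears and writes the remaining rows in sorted key order with a single open writer.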

Diffstat (limited to 'mllib/src/main/scala/org/apache/spark/ml/classification/DecisionTreeClassifier.scala')

0 files changed, 0 insertions, 0 deletions

