[SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter - spark


author	Wenchen Fan <wenchen@databricks.com>	2017-02-19 18:13:12 -0800
committer	Wenchen Fan <wenchen@databricks.com>	2017-02-19 18:13:12 -0800
commit	776b8f17cfc687a57c005a421a81e591c8d44a3f (patch)
tree	7b034741adc5f765674e7ff6d3f303950a20c2cd /streaming
parent	65fe902e13153ad73a3026a66e73c93393df1abb (diff)
download	spark-776b8f17cfc687a57c005a421a81e591c8d44a3f.tar.gz spark-776b8f17cfc687a57c005a421a81e591c8d44a3f.tar.bz2 spark-776b8f17cfc687a57c005a421a81e591c8d44a3f.zip

[SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter

## What changes were proposed in this pull request?

In `FileFormatWriter`, we sort the input rows by partition columns, bucket id, and sort columns when we want to write data out partitioned or bucketed. However, if the data is already sorted, we sort it again, which is unnecessary.

This PR removes the sorting logic in `FileFormatWriter` and uses `SortExec` instead. We will not add `SortExec` if the data is already sorted.

## How was this patch tested?

I did a micro benchmark manually:

```
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```

The result was about 6.4 seconds before this PR, and is 5.7 seconds afterwards.

close https://github.com/apache/spark/pull/16724

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16898 from cloud-fan/writer.
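
The "already sorted" check described in the commit message can be illustrated with a minimal, self-contained sketch. It assumes the required ordering is partition columns, then the bucket id, then sort columns, and that the write needs no extra sort when this required ordering is a prefix of the ordering the input already has. The object `WriterSortSketch` and its methods are hypothetical names for illustration; columns are modeled as plain strings, and this is not the actual Spark code.

```scala
// Hypothetical, simplified model of the check; not the real FileFormatWriter.
object WriterSortSketch {

  // Ordering the writer needs: partition columns, then bucket id, then sort columns.
  def requiredOrdering(
      partitionCols: Seq[String],
      bucketId: Option[String],
      sortCols: Seq[String]): Seq[String] =
    partitionCols ++ bucketId.toSeq ++ sortCols

  // The input satisfies the requirement when the required ordering
  // is a prefix of the ordering the input already has.
  def orderingMatched(required: Seq[String], actual: Seq[String]): Boolean =
    required.length <= actual.length &&
      required.zip(actual).forall { case (r, a) => r == a }

  // A sort step is added only when the existing ordering does not match.
  def needsSort(
      partitionCols: Seq[String],
      bucketId: Option[String],
      sortCols: Seq[String],
      actualOrdering: Seq[String]): Boolean =
    !orderingMatched(requiredOrdering(partitionCols, bucketId, sortCols), actualOrdering)
}
```

Under these assumptions, the benchmark input (sorted by `part` and written with `partitionBy("part")`) gives `WriterSortSketch.needsSort(Seq("part"), None, Nil, Seq("part"))` evaluating to `false`, so no second sort would be planned.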

Diffstat (limited to 'streaming')

0 files changed, 0 insertions, 0 deletions

