path: root/streaming
author Wenchen Fan <wenchen@databricks.com> 2017-02-19 18:13:12 -0800
committer Wenchen Fan <wenchen@databricks.com> 2017-02-19 18:13:12 -0800
commit 776b8f17cfc687a57c005a421a81e591c8d44a3f
tree 7b034741adc5f765674e7ff6d3f303950a20c2cd /streaming
parent 65fe902e13153ad73a3026a66e73c93393df1abb
[SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter
## What changes were proposed in this pull request?

In `FileFormatWriter`, we sort the input rows by partition columns, bucket id, and sort columns when we write data out partitioned or bucketed. However, if the data is already sorted, we sort it again, which is unnecessary.

This PR removes the sorting logic in `FileFormatWriter` and uses `SortExec` instead. We do not add `SortExec` if the data is already sorted.

## How was this patch tested?

I ran a micro benchmark manually:

```
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```

The result was about 6.4 seconds before this PR and 5.7 seconds afterwards.

close https://github.com/apache/spark/pull/16724

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16898 from cloud-fan/writer.
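
For illustration only, the "skip the sort when the data is already sorted" decision can be sketched in plain Scala without any Spark dependency. This is a simplified model, not the code from this patch: columns are represented as plain strings, sort direction is ignored, and the names `SortCheckSketch`, `requiredOrdering`, and `needsSort` are hypothetical.

```scala
object SortCheckSketch {
  // Ordering the writer needs: partition columns, then the bucket id
  // (if any), then the sort columns. Columns are modelled as strings.
  def requiredOrdering(
      partitionColumns: Seq[String],
      bucketId: Option[String],
      sortColumns: Seq[String]): Seq[String] =
    partitionColumns ++ bucketId.toSeq ++ sortColumns

  // A sort is needed unless the required ordering is a prefix of the
  // ordering the input already has.
  def needsSort(required: Seq[String], actual: Seq[String]): Boolean = {
    val satisfied =
      required.length <= actual.length &&
        required.zip(actual).forall { case (r, a) => r == a }
    !satisfied
  }
}
```

In the benchmark above the input is sorted by `part` and written with `partitionBy("part")`, so under this model `needsSort(Seq("part"), Seq("part"))` is `false` and no extra sort would be added. An unsorted input (an empty actual ordering) would still require one.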
Diffstat (limited to 'streaming')
0 files changed, 0 insertions, 0 deletions