author    Wenchen Fan <wenchen@databricks.com>    2017-02-19 18:13:12 -0800
committer Wenchen Fan <wenchen@databricks.com>    2017-02-19 18:13:12 -0800
commit    776b8f17cfc687a57c005a421a81e591c8d44a3f (patch)
tree      7b034741adc5f765674e7ff6d3f303950a20c2cd /sql/core/pom.xml
parent    65fe902e13153ad73a3026a66e73c93393df1abb (diff)
[SPARK-19563][SQL] avoid unnecessary sort in FileFormatWriter
## What changes were proposed in this pull request?
In `FileFormatWriter`, when writing data out partitioned or bucketed, we sort the input rows by partition columns, bucket id, and sort columns.
However, if the data is already sorted, we sort it again, which is unnecessary.
This PR removes the sorting logic from `FileFormatWriter` and uses `SortExec` instead. We do not add `SortExec` if the data is already sorted.
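The decision above can be sketched as a simple ordering check: an extra sort is needed only when the child's actual output ordering does not already begin with the required ordering (partition columns, then bucket id, then sort columns). This is a minimal illustrative sketch; `SortOrder` and `needsSort` here are assumed names, not Spark's real classes or the exact logic of this PR.

```scala
// Hypothetical sketch of the "skip the sort if already sorted" check.
// SortOrder and needsSort are illustrative names, not Spark internals.
case class SortOrder(column: String)

// An extra SortExec is required only when the actual output ordering
// does not start with the required ordering (partition columns ++
// bucket id ++ sort columns).
def needsSort(required: Seq[SortOrder], actual: Seq[SortOrder]): Boolean =
  required.nonEmpty &&
    !(actual.length >= required.length &&
      required.zip(actual).forall { case (r, a) => r == a })

@main def demo(): Unit = {
  val required = Seq(SortOrder("part"))
  // Already sorted by "part" (and then "id"): no extra sort needed.
  assert(!needsSort(required, Seq(SortOrder("part"), SortOrder("id"))))
  // Sorted only by "id": a SortExec must be inserted.
  assert(needsSort(required, Seq(SortOrder("id"))))
  println("ok")
}
```

With this check in the planner, the benchmark below (input pre-sorted by `part`) hits the "already sorted" path and skips the redundant sort.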
## How was this patch tested?
I did a micro benchmark manually
```scala
val df = spark.range(10000000).select($"id", $"id" % 10 as "part").sort("part")
spark.time(df.write.partitionBy("part").parquet("/tmp/test"))
```
The result was about 6.4 seconds before this PR, and about 5.7 seconds afterwards.
Closes https://github.com/apache/spark/pull/16724
Author: Wenchen Fan <wenchen@databricks.com>
Closes #16898 from cloud-fan/writer.
Diffstat (limited to 'sql/core/pom.xml')
0 files changed, 0 insertions, 0 deletions