[SPARK-16052][SQL] Improve `CollapseRepartition` optimizer for Repartition/RepartitionBy

## What changes were proposed in this pull request?

This PR improves `CollapseRepartition` to optimize the adjacent combinations of **Repartition** and **RepartitionBy**. Also, this PR adds a testsuite for this optimizer.

**Target Scenario**
```scala
scala> val dsView1 = spark.range(8).repartition(8, $"id")
scala> dsView1.createOrReplaceTempView("dsView1")
scala> sql("select id from dsView1 distribute by id").explain(true)
```

**Before**
```scala
scala> sql("select id from dsView1 distribute by id").explain(true)
== Parsed Logical Plan ==
'RepartitionByExpression ['id]
+- 'Project ['id]
   +- 'UnresolvedRelation `dsView1`

== Analyzed Logical Plan ==
id: bigint
RepartitionByExpression [id#0L]
+- Project [id#0L]
   +- SubqueryAlias dsview1
      +- RepartitionByExpression [id#0L], 8
         +- Range (0, 8, splits=8)

== Optimized Logical Plan ==
RepartitionByExpression [id#0L]
+- RepartitionByExpression [id#0L], 8
   +- Range (0, 8, splits=8)

== Physical Plan ==
Exchange hashpartitioning(id#0L, 200)
+- Exchange hashpartitioning(id#0L, 8)
   +- *Range (0, 8, splits=8)
```

**After**
```scala
scala> sql("select id from dsView1 distribute by id").explain(true)
== Parsed Logical Plan ==
'RepartitionByExpression ['id]
+- 'Project ['id]
   +- 'UnresolvedRelation `dsView1`

== Analyzed Logical Plan ==
id: bigint
RepartitionByExpression [id#0L]
+- Project [id#0L]
   +- SubqueryAlias dsview1
      +- RepartitionByExpression [id#0L], 8
         +- Range (0, 8, splits=8)

== Optimized Logical Plan ==
RepartitionByExpression [id#0L]
+- Range (0, 8, splits=8)

== Physical Plan ==
Exchange hashpartitioning(id#0L, 200)
+- *Range (0, 8, splits=8)
```

## How was this patch tested?

Pass the Jenkins tests (including a new testsuite).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13765 from dongjoon-hyun/SPARK-16052.
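
The Before/After plans above suggest the shape of the rewrite for the target scenario: when one `RepartitionByExpression` sits directly on top of another, only the outer one needs to survive. The block below is a toy Python model of that single case, not Spark's Scala implementation. The class names mirror the plan output, while `collapse`, `num_partitions`, and the dataclass layout are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple, Union


@dataclass(frozen=True)
class Range:
    """Leaf node, standing in for `Range (0, 8, splits=8)` in the plans."""
    start: int
    end: int
    splits: int


@dataclass(frozen=True)
class RepartitionByExpression:
    """Hypothetical stand-in for the logical plan node of the same name."""
    exprs: Tuple[str, ...]
    child: "Plan"
    num_partitions: Optional[int] = None  # None models "use the default"


Plan = Union[Range, RepartitionByExpression]


def collapse(plan: Plan) -> Plan:
    """Bottom-up rewrite: drop a RepartitionByExpression that sits
    directly under another one, keeping only the outer node."""
    if isinstance(plan, RepartitionByExpression):
        child = collapse(plan.child)
        if isinstance(child, RepartitionByExpression):
            # The outer repartitioning decides the final layout,
            # so the inner one is redundant in this toy model.
            child = child.child
        return RepartitionByExpression(plan.exprs, child, plan.num_partitions)
    return plan


# Shape of the "Before" optimized plan from the description.
before = RepartitionByExpression(
    ("id",),
    RepartitionByExpression(("id",), Range(0, 8, 8), 8),
)
after = collapse(before)
```

Here `after` has the same shape as the optimized plan in the **After** output: a single `RepartitionByExpression` over `Range`. The other Repartition/RepartitionBy combinations mentioned in the description are not modelled.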
author: Dongjoon Hyun <dongjoon@apache.org> 2016-07-08 16:44:53 +0800
committer: Wenchen Fan <wenchen@databricks.com> 2016-07-08 16:44:53 +0800
commit: dff73bfa5e08c4c065584cfa9655a7839d28ad49 (patch)
tree: a32f2729fe6b5885c1b5a01286fb5ef3cfa50e6c /python
parent: 5bce4580939c27876f11cd75f0dc2190fb9fa908 (diff)
download: spark-dff73bfa5e08c4c065584cfa9655a7839d28ad49.tar.gz
spark-dff73bfa5e08c4c065584cfa9655a7839d28ad49.tar.bz2
spark-dff73bfa5e08c4c065584cfa9655a7839d28ad49.zip
1 files changed, 2 insertions, 2 deletions
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index a0ac7a9342..dd670a9b3d 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -464,10 +464,10 @@ class DataFrame(object):
         +---+-----+
         |age| name|
         +---+-----+
-        |  5|  Bob|
-        |  5|  Bob|
         |  2|Alice|
+        |  5|  Bob|
         |  2|Alice|
+        |  5|  Bob|
         +---+-----+
         >>> data.rdd.getNumPartitions()
         7