author | Dongjoon Hyun <dongjoon@apache.org> | 2016-06-09 22:46:51 -0700 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2016-06-09 22:46:51 -0700 |
commit | 5a3533e779d8e43ce0980203dfd3cbe343cc7d0a (patch) | |
tree | 77c967b1db3c2afe2cb140e76807ef4883854441 /dev/sparktestsupport | |
parent | 6c5fd977fbcb821a57cb4a13bc3d413a695fbc32 (diff) | |
[SPARK-15696][SQL] Improve `crosstab` to have a consistent column order
## What changes were proposed in this pull request?
Currently, `crosstab` returns a DataFrame whose columns are in **arbitrary order**, because the column values are obtained with a plain `distinct`. The documentation of `crosstab`, however, shows the result in sorted order, which differs from the current implementation. This PR explicitly constructs the columns in sorted order to improve the user experience, so the output now matches the documentation.
**Before**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  3|  2|  1|
+---------+---+---+---+
|        2|  1|  0|  2|
|        1|  0|  1|  1|
|        3|  1|  1|  0|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  c|  a|  b|
+---------+---+---+---+
|        2|  1|  2|  0|
|        1|  0|  1|  1|
|        3|  1|  0|  1|
+---------+---+---+---+
```
**After**
```scala
scala> spark.createDataFrame(Seq((1, 1), (1, 2), (2, 1), (2, 1), (2, 3), (3, 2), (3, 3))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  1|  2|  3|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
scala> spark.createDataFrame(Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c"))).toDF("key", "value").stat.crosstab("key", "value").show()
+---------+---+---+---+
|key_value|  a|  b|  c|
+---------+---+---+---+
|        2|  2|  0|  1|
|        1|  1|  1|  0|
|        3|  0|  1|  1|
+---------+---+---+---+
```
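The idea of building the columns in sorted order can be illustrated with plain Scala collections. This is only a sketch of the technique, not the code changed in this PR: the object `CrosstabSketch` and its `crosstab` helper are hypothetical names, and it works on an in-memory `Seq` rather than a DataFrame.

```scala
// Illustrative sketch only: plain Scala collections, not Spark's implementation.
object CrosstabSketch {
  // Returns the column labels and, per key, the counts aligned with those labels.
  def crosstab(pairs: Seq[(Int, String)]): (Seq[String], Seq[(Int, Seq[Long])]) = {
    // Sorting the distinct values explicitly makes the column order deterministic.
    val columns: Seq[String] = pairs.map(_._2).distinct.sorted

    // Count how often each (key, value) pair occurs.
    val counts: Map[(Int, String), Long] =
      pairs.groupBy(identity).map { case (pair, hits) => pair -> hits.size.toLong }

    // Keys are kept in first-occurrence order; only the columns are sorted.
    val rows = pairs.map(_._1).distinct.map { key =>
      key -> columns.map(c => counts.getOrElse((key, c), 0L))
    }
    (columns, rows)
  }
}

val (cols, rows) = CrosstabSketch.crosstab(
  Seq((1, "a"), (1, "b"), (2, "a"), (2, "a"), (2, "c"), (3, "b"), (3, "c")))
```

With the same input as the string example above, `cols` comes out as `a, b, c` regardless of the order in which the values first appear, which is the property this PR adds to `crosstab`.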
## How was this patch tested?
Passes the Jenkins tests with updated test cases.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #13436 from dongjoon-hyun/SPARK-15696.