author | Burak Yavuz <brkyvz@gmail.com> | 2015-08-06 10:29:40 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2015-08-06 10:29:40 -0700 |
commit | 98e69467d4fda2c26a951409b5b7c6f1e9345ce4 (patch) | |
tree | 79802e82268885bacdc4b0e4aecaaf4e936e52b5 /python/pyspark/sql/column.py | |
parent | 076ec056818a65216eaf51aa5b3bd8f697c34748 (diff) | |
download | spark-98e69467d4fda2c26a951409b5b7c6f1e9345ce4.tar.gz spark-98e69467d4fda2c26a951409b5b7c6f1e9345ce4.tar.bz2 spark-98e69467d4fda2c26a951409b5b7c6f1e9345ce4.zip |
[SPARK-9615] [SPARK-9616] [SQL] [MLLIB] Bugs related to FrequentItems when merging and with Tungsten
In short:
1- FrequentItems should not use the InternalRow representation, because the keys in the map get corrupted. For example, when the elements are strings, every key in the Map corresponds to the very last element observed in the partition.
2- Merging two partitions had a bug:
**Existing behavior with size 3:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map()
**Correct behavior:**
Partition A -> Map(1 -> 3, 2 -> 3, 3 -> 4)
Partition B -> Map(4 -> 25)
Result -> Map(3 -> 1, 4 -> 22)
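The correct behavior above can be reproduced with a small sketch. This is not the Spark implementation; it assumes a Misra-Gries style counter map capped at `size` entries (with `size >= 1`), and the names `add` and `merge` are hypothetical helpers chosen for illustration.

```python
def add(counts, key, count, size):
    """Add `count` occurrences of `key` to a counter map capped at `size` keys.

    Assumption: when the map is full and the key is new, counts are reduced
    by the current minimum (or by `count` if it is smaller than the minimum).
    """
    if key in counts:
        counts[key] += count
    elif len(counts) < size:
        counts[key] = count
    else:
        min_count = min(counts.values())
        if count >= min_count:
            # The new key survives: insert it, evict every entry at or below
            # the old minimum, and decrement the remaining entries.
            counts[key] = count
            for k in list(counts):
                if counts[k] <= min_count:
                    del counts[k]
                else:
                    counts[k] -= min_count
        else:
            # The new key does not survive: only decrement existing entries.
            for k in counts:
                counts[k] -= count
    return counts


def merge(a, b, size):
    """Merge partition map `b` into a copy of partition map `a`."""
    merged = dict(a)
    for key, count in b.items():
        add(merged, key, count, size)
    return merged


partition_a = {1: 3, 2: 3, 3: 4}
partition_b = {4: 25}
result = merge(partition_a, partition_b, size=3)
```

Under these assumptions, keys 1 and 2 are evicted because their counts equal the minimum of 3, and the surviving counts are reduced by 3, which gives the result shown in the example.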
cc mengxr rxin JoshRosen
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #7945 from brkyvz/freq-fix and squashes the following commits:
07fa001 [Burak Yavuz] address 2
1dc61a8 [Burak Yavuz] address 1
506753e [Burak Yavuz] fixed and added reg test
47bfd50 [Burak Yavuz] pushing
Diffstat (limited to 'python/pyspark/sql/column.py')
0 files changed, 0 insertions, 0 deletions