author: Zhenhua Wang <wzh_zju@163.com> 2016-11-25 05:02:48 -0800
committer: Herman van Hovell <hvanhovell@databricks.com> 2016-11-25 05:02:48 -0800
commit: 5ecdc7c5c019acc6b1f9c2e6c5b7d35957eadb88
tree: 463100fa970ea587179106de11c1837be55913e1 /mllib
parent: 51b1c1551d3a7147403b9e821fcc7c8f57b4824c
[SPARK-18559][SQL] Fix HLL++ with small relative error
## What changes were proposed in this pull request?

In `HyperLogLogPlusPlus`, if the relative error is so small that p >= 19, the lookup `THRESHOLDS(p-4)` throws an ArrayIndexOutOfBoundsException. We should check `p` and, when p >= 19, fall back to the original HLL result and use the small range correction from the original algorithm.

This PR also fixes the upper bound reported in the `require()` message. The upper bound is computed by:
```
val relativeSD = 1.106d / Math.pow(Math.E, p * Math.log(2.0d) / 2.0d)
```
which is derived from the equation for computing `p`:
```
val p = 2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)
```

## How was this patch tested?

Added test cases for:
1. checking the validity of the parameter relativeSD
2. estimation with a smaller relative error so that p >= 19

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #15990 from wzhfy/hllppRsd.
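
As a rough illustration of the two formulas quoted in the commit message, here is a minimal standalone Scala sketch. It is not the code from the patch. The object name `HllPrecisionSketch` and its helpers are hypothetical, the rounding of `p` up to a whole number of bits is an assumption, and the table size of 15 is only inferred from the statement that `THRESHOLDS(p-4)` fails once p >= 19.

```scala
// Hypothetical helper object for illustration only; not part of Spark.
object HllPrecisionSketch {
  // Inferred from the commit message: THRESHOLDS(p - 4) works for p = 4..18
  // and fails at p = 19, so the table is assumed to have 15 entries.
  val NumThresholds: Int = 15

  // Formula for p quoted in the commit message. Rounding up to an integer
  // number of bits is an assumption made for this sketch.
  def precisionFor(relativeSD: Double): Int =
    math.ceil(2.0d * math.log(1.106d / relativeSD) / math.log(2.0d)).toInt

  // Inverse formula quoted in the commit message: the relative standard
  // deviation that corresponds to a given p.
  def relativeSDFor(p: Int): Double =
    1.106d / math.pow(math.E, p * math.log(2.0d) / 2.0d)

  // True when THRESHOLDS(p - 4) would be a valid lookup; false means the
  // estimator would need the fallback path described in the commit message.
  def hasThreshold(p: Int): Boolean = {
    val idx = p - 4
    idx >= 0 && idx < NumThresholds
  }
}
```

Under these assumptions, a requested relative error of 0.001 gives a `p` above 18, so `hasThreshold` returns false for it, which is the situation the patch guards against.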
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions