author | Zhenhua Wang <wzh_zju@163.com> | 2016-11-25 05:02:48 -0800 |
---|---|---|
committer | Herman van Hovell <hvanhovell@databricks.com> | 2016-11-25 05:02:48 -0800 |
commit | 5ecdc7c5c019acc6b1f9c2e6c5b7d35957eadb88 (patch) | |
tree | 463100fa970ea587179106de11c1837be55913e1 /mllib | |
parent | 51b1c1551d3a7147403b9e821fcc7c8f57b4824c (diff) | |
[SPARK-18559][SQL] Fix HLL++ with small relative error
## What changes were proposed in this pull request?
In `HyperLogLogPlusPlus`, if the relative error is so small that p >= 19, the lookup `THRESHOLDS(p-4)` throws an ArrayIndexOutOfBoundsException. We should check `p` and, when p >= 19, fall back to the original HLL result and use its small-range correction.
The PR also fixes the upper bound reported in the message of `require()`.
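The guard described above can be sketched roughly as follows. This is a minimal illustration, not the code in the patch: the object name, the `estimate` signature, and the `thresholds` parameter are hypothetical, and the HLL++ bias-correction step is omitted for brevity. The fallback branch assumes the small-range correction of the original HyperLogLog algorithm (linear counting when the raw estimate is at most 2.5 * m and at least one register is empty).

```scala
object SmallRangeCorrectionSketch {
  // Linear counting estimate from the number of empty registers.
  def linearCounting(m: Double, zeroRegisters: Int): Double =
    m * math.log(m / zeroRegisters)

  // `thresholds` is a hypothetical table indexed by p - 4, so it only
  // covers p in [4, 4 + thresholds.length).
  def estimate(
      p: Int,
      rawEstimate: Double,
      zeroRegisters: Int,
      thresholds: Array[Double]): Double = {
    val m = (1L << p).toDouble
    if (p - 4 < thresholds.length) {
      // HLL++-style path: the table lookup is in bounds.
      // (Bias correction of rawEstimate is left out of this sketch.)
      if (zeroRegisters > 0) {
        val h = linearCounting(m, zeroRegisters)
        if (h <= thresholds(p - 4)) h else rawEstimate
      } else {
        rawEstimate
      }
    } else {
      // Fallback: original HLL small-range correction, no table needed.
      if (rawEstimate <= 2.5d * m && zeroRegisters > 0) {
        linearCounting(m, zeroRegisters)
      } else {
        rawEstimate
      }
    }
  }
}
```

With a 15-entry table (p = 4 to 18), calling `estimate` with p = 19 takes the fallback branch instead of indexing past the end of the array.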
The upper bound is computed by:
```
val relativeSD = 1.106d / Math.pow(Math.E, p * Math.log(2.0d) / 2.0d)
```
which is derived from the equation for computing `p`:
```
val p = 2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)
```
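As a quick check of the two formulas above, the following sketch evaluates them in both directions. The object and method names are hypothetical, the `ceil` rounding is an assumption (an integer number of bits is needed), and the minimum precision of 4 is inferred from the `THRESHOLDS(p-4)` indexing rather than stated in this message.

```scala
object HllPrecisionSketch {
  // Precision (number of index bits) for a target relative standard deviation.
  // Assumption: the real-valued result is rounded up to a whole number of bits.
  def precisionFor(relativeSD: Double): Int =
    math.ceil(2.0d * math.log(1.106d / relativeSD) / math.log(2.0d)).toInt

  // Inverse: the relative standard deviation that corresponds to exactly p bits.
  // exp(p * ln(2) / 2) is 2^(p/2), so this equals 1.106 / 2^(p/2).
  def relativeSDFor(p: Int): Double =
    1.106d / math.pow(math.E, p * math.log(2.0d) / 2.0d)

  def main(args: Array[String]): Unit = {
    val minP = 4 // assumed minimum precision
    println(f"relativeSD upper bound for p = $minP: ${relativeSDFor(minP)}%.4f")
    println(s"p for relativeSD = 0.05: ${precisionFor(0.05d)}")
  }
}
```

Because a smaller `relativeSD` yields a larger `p`, the smallest allowed `p` determines the largest allowed `relativeSD`, which is why the inverse formula gives the upper bound.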
## How was this patch tested?
Added test cases for:
1. checking the validity of the parameter `relativeSD`
2. estimation with smaller relative error so that p >= 19
Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes #15990 from wzhfy/hllppRsd.
Diffstat (limited to 'mllib')
0 files changed, 0 insertions, 0 deletions