author | larvaboy <larvaboy@gmail.com> | 2014-05-13 21:26:08 -0700 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2014-05-13 21:26:08 -0700 |
commit | c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f (patch) | |
tree | 497a31ae116b285966699ef51ca975160b3845de /docs/cluster-overview.md | |
parent | 92cebada09a7e5a00ab48bcb350a9462949c33eb (diff) | |
Implement ApproximateCountDistinct for SparkSql
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partition, and 2) merging the HyperLogLog results from different partitions.
A simple serializer and test cases are added as well.
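For illustration only, the two-phase scheme described above can be sketched in plain Python. This is not the code from this commit and not stream-lib's HyperLogLog: the `HyperLogLog` class, the `approx_count_distinct` helper, the SHA-1 based hash, and the default of 12 index bits are all assumptions made for the sketch. Skipping `None` values is likewise an assumption, modeled on how SQL `COUNT(DISTINCT ...)` ignores nulls.

```python
import hashlib
import math


class HyperLogLog:
    """Tiny illustrative HyperLogLog sketch (hypothetical, not stream-lib's class)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used to pick a register
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def offer(self, value):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (arbitrary choice for the sketch).
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                    # top p bits select the register
        rest = h & ((1 << (64 - self.p)) - 1)       # remaining 64 - p bits
        # Rank is the 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches is an element-wise max over their registers.
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)       # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return round(self.m * math.log(self.m / zeros))
        return round(raw)


def approx_count_distinct(partitions, p=12):
    # Phase 1: build one sketch per partition (None values skipped by assumption).
    partials = []
    for part in partitions:
        hll = HyperLogLog(p)
        for value in part:
            if value is not None:
                hll.offer(value)
        partials.append(hll)
    # Phase 2: merge the per-partition sketches and read off the estimate.
    total = HyperLogLog(p)
    for hll in partials:
        total = total.merge(hll)
    return total.cardinality()
```

Because the merge is an element-wise maximum, the merged sketch is identical to one built over the union of all partitions, so values that appear in several partitions are not double counted. Only the fixed-size register arrays need to move between partitions, which is presumably why the commit also adds a serializer for the HyperLogLog object.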
Author: larvaboy <larvaboy@gmail.com>
Closes #737 from larvaboy/master and squashes the following commits:
bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
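As a rough illustration of what a user-provided standard deviation controls (see bd8ef3f above): for HyperLogLog with m registers, the relative standard error is approximately 1.04 / sqrt(m), so a tighter target needs more registers. The sketch below uses that textbook relation; the function name `index_bits_for_sd` is hypothetical, and the exact formula and constant used by stream-lib or by this commit may differ.

```python
import math


def index_bits_for_sd(relative_sd):
    """Smallest p such that m = 2**p registers gives 1.04 / sqrt(m) <= relative_sd.

    Uses the textbook HyperLogLog error estimate; this is an assumption,
    not necessarily the mapping used by stream-lib.
    """
    if relative_sd <= 0:
        raise ValueError("relative_sd must be positive")
    needed_registers = (1.04 / relative_sd) ** 2
    # At least one index bit, so there are always two or more registers.
    return max(1, math.ceil(math.log2(needed_registers)))


p = index_bits_for_sd(0.05)
print(p, 2 ** p)  # index bits and register count for a 5% target
```

Halving the target error roughly quadruples the number of registers, so the accuracy setting trades memory and shuffle size against precision.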
Diffstat (limited to 'docs/cluster-overview.md')
0 files changed, 0 insertions, 0 deletions