Implement ApproximateCountDistinct for SparkSql - spark

author	larvaboy <larvaboy@gmail.com>	2014-05-13 21:26:08 -0700
committer	Reynold Xin <rxin@apache.org>	2014-05-13 21:26:08 -0700
commit	c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f (patch)
tree	497a31ae116b285966699ef51ca975160b3845de /core/src/test
parent	92cebada09a7e5a00ab48bcb350a9462949c33eb (diff)

Implement ApproximateCountDistinct for SparkSql

Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partition, and 2) merging the HyperLogLog results from the different partitions. A simple serializer and test cases are added as well.

Author: larvaboy <larvaboy@gmail.com>

Closes #737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
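The two-phase count described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from this commit and not stream-lib's implementation: the `HyperLogLog` class, `sketch_partition`, `approx_count_distinct`, and `precision_for` are hypothetical names written for illustration, the SHA-1 based hash is an arbitrary choice, and skipping nulls is an assumption about the intended semantics. `precision_for` uses the textbook HyperLogLog error bound of roughly 1.04 / sqrt(m), which may differ from the mapping stream-lib applies to a user-provided standard deviation.

```python
import hashlib
import math
from functools import reduce


class HyperLogLog:
    """Minimal HyperLogLog sketch (illustrative only, not stream-lib's code)."""

    def __init__(self, p=12):
        self.p = p                      # number of bits used as the register index
        self.m = 1 << p                 # number of registers
        self.registers = [0] * self.m

    def offer(self, value):
        # 64-bit hash taken from the first 8 bytes of SHA-1 (arbitrary choice).
        digest = hashlib.sha1(repr(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def merge(self, other):
        # Merging two sketches is a register-wise maximum.
        assert self.p == other.p, "sketches must use the same precision"
        merged = HyperLogLog(self.p)
        merged.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]
        return merged

    def cardinality(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)          # bias constant, valid for m >= 128
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros > 0:
            return round(m * math.log(m / zeros))  # small-range (linear counting) correction
        return round(raw)


def precision_for(rsd):
    """Smallest p (at least 7) whose textbook error 1.04 / sqrt(2**p) is <= rsd."""
    return max(7, math.ceil(math.log2((1.04 / rsd) ** 2)))


def sketch_partition(rows, p=12):
    """Phase 1: build one sketch per partition, skipping nulls (an assumption)."""
    hll = HyperLogLog(p)
    for row in rows:
        if row is not None:
            hll.offer(row)
    return hll


def approx_count_distinct(partitions, p=12):
    """Phase 2: merge the per-partition sketches and read off the estimate."""
    partials = [sketch_partition(rows, p) for rows in partitions]
    return reduce(lambda a, b: a.merge(b), partials, HyperLogLog(p)).cardinality()
```

Because the merge is a register-wise maximum, the merged sketch is the same one that would result from feeding every row through a single sketch, so values that appear in several partitions are not double counted. Only the fixed-size register arrays need to move between partitions, which is presumably why the commit adds a serializer for the HyperLogLog state.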

Diffstat (limited to 'core/src/test')

0 files changed, 0 insertions, 0 deletions

