aboutsummaryrefslogtreecommitdiff
path: root/core/src/test
diff options
context:
space:
mode:
authorlarvaboy <larvaboy@gmail.com>2014-05-13 21:26:08 -0700
committerReynold Xin <rxin@apache.org>2014-05-13 21:26:08 -0700
commitc33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f (patch)
tree497a31ae116b285966699ef51ca975160b3845de /core/src/test
parent92cebada09a7e5a00ab48bcb350a9462949c33eb (diff)
downloadspark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.tar.gz
spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.tar.bz2
spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.zip
Implement ApproximateCountDistinct for SparkSql
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions. A simple serializer and test cases are added as well. Author: larvaboy <larvaboy@gmail.com> Closes #737 from larvaboy/master and squashes the following commits: bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct. 9ba8360 [larvaboy] Fix alignment and null handling issues. 95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct. f57917d [larvaboy] Add the parser for the approximate count. a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions. 7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog. 1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class. 653542b [larvaboy] Fix a couple of minor typos.
Diffstat (limited to 'core/src/test')
0 files changed, 0 insertions, 0 deletions