Implement ApproximateCountDistinct for SparkSql

Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partitions, and 2) merge the HyperLogLog results from different partitions. A simple serializer and test cases are added as well. Author: larvaboy <larvaboy@gmail.com> Closes #737 from larvaboy/master and squashes the following commits: bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct. 9ba8360 [larvaboy] Fix alignment and null handling issues. 95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct. f57917d [larvaboy] Add the parser for the approximate count. a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions. 7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog. 1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class. 653542b [larvaboy] Fix a couple of minor typos.
author: larvaboy <larvaboy@gmail.com> 2014-05-13 21:26:08 -0700
committer: Reynold Xin <rxin@apache.org> 2014-05-13 21:26:08 -0700
commit: c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f (patch)
tree: 497a31ae116b285966699ef51ca975160b3845de /core
parent: 92cebada09a7e5a00ab48bcb350a9462949c33eb (diff)
download: spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.tar.gz
spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.tar.bz2
spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.zip
1 files changed, 3 insertions, 3 deletions
diff --git a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
index 5efb4388f6..bc6d204434 100644
--- a/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
+++ b/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala
@@ -217,7 +217,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * Return approximate number of distinct values for each key in this RDD.
    * The accuracy of approximation can be controlled through the relative standard deviation
    * (relativeSD) parameter, which also controls the amount of memory used. Lower values result in
-   * more accurate counts but increase the memory footprint and vise versa. Uses the provided
+   * more accurate counts but increase the memory footprint and vice versa. Uses the provided
    * Partitioner to partition the output RDD.
    */
   def countApproxDistinctByKey(relativeSD: Double, partitioner: Partitioner): RDD[(K, Long)] = {
@@ -232,7 +232,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * Return approximate number of distinct values for each key in this RDD.
    * The accuracy of approximation can be controlled through the relative standard deviation
    * (relativeSD) parameter, which also controls the amount of memory used. Lower values result in
-   * more accurate counts but increase the memory footprint and vise versa. HashPartitions the
+   * more accurate counts but increase the memory footprint and vice versa. HashPartitions the
    * output RDD into numPartitions.
    *
    */
@@ -244,7 +244,7 @@ class PairRDDFunctions[K, V](self: RDD[(K, V)])
    * Return approximate number of distinct values for each key this RDD.
    * The accuracy of approximation can be controlled through the relative standard deviation
    * (relativeSD) parameter, which also controls the amount of memory used. Lower values result in
-   * more accurate counts but increase the memory footprint and vise versa. The default value of
+   * more accurate counts but increase the memory footprint and vice versa. The default value of
    * relativeSD is 0.05. Hash-partitions the output RDD using the existing partitioner/parallelism
    * level.
    */
author	larvaboy <larvaboy@gmail.com>	2014-05-13 21:26:08 -0700
committer	Reynold Xin <rxin@apache.org>	2014-05-13 21:26:08 -0700
commit	c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f (patch)
tree	497a31ae116b285966699ef51ca975160b3845de /core
parent	92cebada09a7e5a00ab48bcb350a9462949c33eb (diff)
download	spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.tar.gz spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.tar.bz2 spark-c33b8dcbf65a3a0c5ee5e65cd1dcdbc7da36aa5f.zip