author | Reynold Xin <rxin@apache.org> | 2013-12-31 17:48:24 -0800 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2013-12-31 17:48:24 -0800 |
commit | 8b8e70ebde880d08ebb3816b2f4003247559c7f8 (patch) | |
tree | aa984e1263c1e825b50c80e6651a35d686bf2c7d /project/SparkBuild.scala | |
parent | 63b411dd8664c27ac55586d8345733afad80961f (diff) | |
parent | bee445c927586136673f39259f23642a5a6e8efe (diff) | |
Merge pull request #73 from falaki/ApproximateDistinctCount
Approximate distinct count
Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximate the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scaladoc and test suites for both methods.
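The accuracy/memory trade-off described above can be illustrated with a simplified HyperLogLog. This is a minimal sketch under stated assumptions, not stream-lib's implementation and not the code merged in this pull request: the class name `SimpleHyperLogLog`, its `relativeSD` parameter, the `mix64` hash, and the clamping of the precision to 4..16 bits are all hypothetical choices made for illustration. It relies on the usual HyperLogLog error estimate of roughly 1.04 / sqrt(m) for m registers, so a smaller target error needs more registers and therefore more memory.

```scala
// Simplified HyperLogLog for illustration only (hypothetical class, not stream-lib).
final class SimpleHyperLogLog(relativeSD: Double) {
  require(relativeSD > 0 && relativeSD < 1, "relativeSD must be in (0, 1)")

  // Standard error is about 1.04 / sqrt(m), so we need m >= (1.04 / relativeSD)^2.
  // p = log2(m), rounded up and clamped to [4, 16] (an assumption of this sketch).
  val p: Int = {
    val bits = math.ceil(2 * math.log(1.04 / relativeSD) / math.log(2)).toInt
    math.max(4, math.min(16, bits))
  }
  val m: Int = 1 << p                       // number of registers
  private val registers = new Array[Byte](m)

  // SplitMix64-style finalizer used to spread the 32-bit hashCode over 64 bits.
  private def mix64(x: Long): Long = {
    var z = x + 0x9e3779b97f4a7c15L
    z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L
    z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL
    z ^ (z >>> 31)
  }

  // Items are identified by hashCode, so hashCode collisions count as one item.
  def add(item: Any): Unit = {
    val h = mix64(item.hashCode.toLong)
    val idx = (h >>> (64 - p)).toInt        // top p bits select a register
    val rest = h << p                       // remaining bits, left-aligned
    // Rank = position of the first 1-bit in the remaining bits.
    val rank = math.min(java.lang.Long.numberOfLeadingZeros(rest), 64 - p) + 1
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // Merging keeps the register-wise maximum, which equals the sketch of the union.
  def merge(other: SimpleHyperLogLog): Unit = {
    require(other.p == p, "sketches must use the same precision")
    var i = 0
    while (i < m) {
      if (other.registers(i) > registers(i)) registers(i) = other.registers(i)
      i += 1
    }
  }

  def estimate: Long = {
    val alpha = m match {
      case 16 => 0.673
      case 32 => 0.697
      case 64 => 0.709
      case _  => 0.7213 / (1 + 1.079 / m)
    }
    val harmonicSum = registers.map(r => math.pow(2.0, -r.toDouble)).sum
    val raw = alpha * m * m / harmonicSum
    val zeros = registers.count(_ == 0)
    // Small-range correction: fall back to linear counting while registers are empty.
    if (raw <= 2.5 * m && zeros > 0) math.round(m * math.log(m.toDouble / zeros))
    else math.round(raw)
  }
}
```

With `relativeSD = 0.05` this sketch allocates 512 one-byte registers, while `relativeSD = 0.01` allocates 16384. Because `merge` only takes register-wise maxima, sketches built independently on separate partitions can be combined afterwards, which is the property that makes this family of algorithms a natural fit for distributed counting.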
Diffstat (limited to 'project/SparkBuild.scala')
-rw-r--r-- | project/SparkBuild.scala | 3 |
1 files changed, 2 insertions, 1 deletions
diff --git a/project/SparkBuild.scala b/project/SparkBuild.scala
index 1df1abc9a3..b3b5fc788f 100644
--- a/project/SparkBuild.scala
+++ b/project/SparkBuild.scala
@@ -247,7 +247,8 @@ object SparkBuild extends Build {
 "com.codahale.metrics" % "metrics-ganglia" % "3.0.0",
 "com.codahale.metrics" % "metrics-graphite" % "3.0.0",
 "com.twitter" %% "chill" % "0.3.1",
-"com.twitter" % "chill-java" % "0.3.1"
+"com.twitter" % "chill-java" % "0.3.1",
+"com.clearspring.analytics" % "stream" % "2.5.1"
 )
 )