author | Reynold Xin <rxin@apache.org> | 2013-12-31 17:48:24 -0800 |
---|---|---|
committer | Reynold Xin <rxin@apache.org> | 2013-12-31 17:48:24 -0800 |
commit | 8b8e70ebde880d08ebb3816b2f4003247559c7f8 (patch) | |
tree | aa984e1263c1e825b50c80e6651a35d686bf2c7d /mllib/pom.xml | |
parent | 63b411dd8664c27ac55586d8345733afad80961f (diff) | |
parent | bee445c927586136673f39259f23642a5a6e8efe (diff) | |
Merge pull request #73 from falaki/ApproximateDistinctCount
Approximate distinct count
Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions to approximate the number of distinct elements and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scaladoc and test suites for both methods.
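The accuracy/memory trade-off mentioned above can be illustrated with a minimal HyperLogLog sketch. This is not the stream-lib implementation used by the commit: the class name, the `p` parameter (number of index bits), and the choice of SHA-1 as the hash are assumptions made for illustration only. With m = 2^p registers, the standard error of the estimate is roughly 1.04 / sqrt(m), so more registers buy more accuracy at the cost of more memory.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical, not stream-lib's class)."""

    def __init__(self, p):
        # p index bits give m = 2**p registers; larger p means more memory
        # and a smaller relative error (about 1.04 / sqrt(m)).
        if not 4 <= p <= 16:
            raise ValueError("p must be between 4 and 16")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant from the HyperLogLog paper.
        if self.m == 16:
            self.alpha = 0.673
        elif self.m == 32:
            self.alpha = 0.697
        elif self.m == 64:
            self.alpha = 0.709
        else:
            self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Derive a deterministic 64-bit hash from the item's string form.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # The top p bits select a register; the remaining bits give the rank.
        idx = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Sketches built on separate partitions combine by element-wise max.
        if other.p != self.p:
            raise ValueError("cannot merge sketches of different sizes")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        # Raw estimate: normalized harmonic mean of 2**register values.
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        # Small-range correction: fall back to linear counting.
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return estimate


if __name__ == "__main__":
    sketch = HyperLogLog(12)  # 4096 registers
    for i in range(100000):
        sketch.add(i)
    print(round(sketch.count()))
```

The `merge` method reflects why this kind of sketch suits a distributed count: each partition can build its own fixed-size sketch, and the sketches can be combined without revisiting the data. How the commit maps its accuracy parameter to a register count is determined by stream-lib and is not reproduced here.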
Diffstat (limited to 'mllib/pom.xml')
0 files changed, 0 insertions, 0 deletions