author | Cheng Lian <lian@databricks.com> | 2016-01-23 00:34:55 -0800 |
---|---|---|
committer | Reynold Xin <rxin@databricks.com> | 2016-01-23 00:34:55 -0800 |
commit | 1c690ddafa8376c55cbc5b7a7a750200abfbe2a6 (patch) | |
tree | 1be95d50cb9c14eb6051c1f068f6f708b1a34e9c /pom.xml | |
parent | 5af5a02160b42115579003b749c4d1831bf9d48e (diff) | |
download | spark-1c690ddafa8376c55cbc5b7a7a750200abfbe2a6.tar.gz spark-1c690ddafa8376c55cbc5b7a7a750200abfbe2a6.tar.bz2 spark-1c690ddafa8376c55cbc5b7a7a750200abfbe2a6.zip |
[SPARK-12933][SQL] Initial implementation of Count-Min sketch
This PR adds an initial implementation of Count-Min sketch, contained in a new module, spark-sketch, under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1].
As required by the [design doc][2], spark-sketch should have no external dependencies.
Two classes, `Murmur3_x86_32` and `Platform`, are copied to spark-sketch from spark-unsafe to provide hashing facilities. They'll also be used in the upcoming bloom filter implementation.
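For readers unfamiliar with the data structure, the following is a minimal, dependency-free sketch of how a Count-Min sketch uses per-row seeded hashes. It is not the code added by this PR: the class name `SimpleCountMinSketch`, its method names, and the FNV-style stand-in hash are all assumptions made for illustration (the PR itself hashes with `Murmur3_x86_32`).

```java
import java.nio.charset.StandardCharsets;

/**
 * Illustrative Count-Min sketch. Hypothetical class, NOT the spark-sketch API.
 * Estimates never undercount as long as only non-negative counts are added.
 */
final class SimpleCountMinSketch {
    private final int depth;      // number of hash rows
    private final int width;      // counters per row
    private final long[][] table; // depth x width counter matrix
    private final int[] seeds;    // one hash seed per row

    SimpleCountMinSketch(int depth, int width) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        for (int i = 0; i < depth; i++) {
            seeds[i] = 0x9E3779B9 * (i + 1); // arbitrary distinct seeds (assumption)
        }
    }

    // Stand-in seeded hash (FNV-1a style plus a mixing step), used here only
    // to keep the example self-contained; it is not Murmur3_x86_32.
    private static int hash(byte[] data, int seed) {
        int h = 0x811C9DC5 ^ seed;
        for (byte b : data) {
            h ^= (b & 0xFF);
            h *= 0x01000193;
        }
        h ^= h >>> 16;
        h *= 0x85EBCA6B;
        h ^= h >>> 13;
        return h;
    }

    // Maps an item to a counter index in the given row.
    private int bucket(byte[] data, int row) {
        return Math.floorMod(hash(data, seeds[row]), width);
    }

    /** Adds {@code count} occurrences of {@code item}: one counter per row is incremented. */
    void add(String item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        for (int row = 0; row < depth; row++) {
            table[row][bucket(data, row)] += count;
        }
    }

    /** Returns the minimum counter across rows, an upper bound on the true count. */
    long estimateCount(String item) {
        byte[] data = item.getBytes(StandardCharsets.UTF_8);
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(data, row)]);
        }
        return min;
    }
}
```

Taking the minimum across rows limits the damage from collisions: a collision can only inflate a counter, so the smallest counter is the least-contaminated estimate. A larger `width` reduces the overestimation error, and a larger `depth` reduces the probability of a bad estimate.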
The following features will be added in future follow-up PRs:
- Serialization support
- DataFrame API integration
[1]: https://github.com/addthis/stream-lib/blob/aac6b4d23a8686b000f80baa447e0922ecac3bcb/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
[2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf
Author: Cheng Lian <lian@databricks.com>
Closes #10851 from liancheng/count-min-sketch.
Diffstat (limited to 'pom.xml')
-rw-r--r-- | pom.xml | 1 |
1 files changed, 1 insertions, 0 deletions
@@ -86,6 +86,7 @@
   </mailingLists>
 
   <modules>
+    <module>common/sketch</module>
     <module>tags</module>
     <module>core</module>
     <module>graphx</module>