path: root/common/sketch/src
Commit message | Author | Age | Files | Lines
* [SPARK-12818] Polishes spark-sketch module | Cheng Lian | 2016-01-29 | 6 | -83/+110

  Fixes various minor code and Javadoc styling issues.

  Author: Cheng Lian <lian@databricks.com>

  Closes #10985 from liancheng/sketch-polishing.
* [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch | Cheng Lian | 2016-01-28 | 2 | -9/+60

  This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.

  Author: Cheng Lian <lian@databricks.com>

  Closes #10968 from liancheng/cms-specialized.
* [SPARK-12938][SQL] DataFrame API for Bloom filter | Wenchen Fan | 2016-01-27 | 4 | -91/+179

  This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the filter. A more performant UDAF version can be built in future follow-up PRs.

  This PR also adds two specialized `put` variants (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #10937 from cloud-fan/bloom-filter.
* [SPARK-12935][SQL] DataFrame API for Count-Min Sketch | Cheng Lian | 2016-01-26 | 3 | -36/+56

  This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.

  Author: Cheng Lian <lian@databricks.com>

  Closes #10911 from liancheng/cms-df-api.
* [SPARK-12937][SQL] bloom filter serialization | Wenchen Fan | 2016-01-26 | 6 | -44/+159

  This PR adds serialization support for `BloomFilter`. A version number is added to version the serialized binary format.

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #10920 from cloud-fan/bloom-filter.
* [SPARK-12934] use try-with-resources for streams | tedyu | 2016-01-25 | 1 | -0/+2

  liancheng please take a look

  Author: tedyu <yuzhihong@gmail.com>

  Closes #10906 from tedyu/master.
* [SPARK-12936][SQL] Initial bloom filter implementation | Wenchen Fan | 2016-01-25 | 5 | -0/+602

  This PR adds an initial implementation of bloom filter in the newly added sketch module. The implementation is based on the [`BloomFilter` class in guava](https://code.google.com/p/guava-libraries/source/browse/guava/src/com/google/common/hash/BloomFilter.java).

  Some differences from the design doc:

  - Exposes `bitSize` instead of `sizeInBytes` to the user.
  - Always requires the `expectedInsertions` parameter when creating a bloom filter.

  Author: Wenchen Fan <wenchen@databricks.com>

  Closes #10883 from cloud-fan/bloom-filter.
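
For readers unfamiliar with the data structure this commit introduces, the following is a minimal, self-contained sketch of a Bloom filter over `long` values. It is not the code from this PR: the class name `TinyBloomFilter`, its method names, and the 64-bit mixing function are assumptions chosen for illustration. It borrows the general Guava-style idea of deriving all probe positions from two 32-bit hash halves, and it takes `expectedInsertions` as a required constructor argument and reports its size in bits, mirroring the two design points listed above.

```java
import java.util.BitSet;

/** Illustrative Bloom filter; all names here are hypothetical, not Spark's API. */
final class TinyBloomFilter {
    private final BitSet bits;
    private final int numBits;
    private final int numHashFunctions;

    TinyBloomFilter(long expectedInsertions, double fpp) {
        if (expectedInsertions <= 0 || fpp <= 0.0 || fpp >= 1.0) {
            throw new IllegalArgumentException("need expectedInsertions > 0 and 0 < fpp < 1");
        }
        double ln2 = Math.log(2);
        // Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hash functions.
        long m = (long) Math.ceil(-expectedInsertions * Math.log(fpp) / (ln2 * ln2));
        this.numBits = (int) Math.min(Integer.MAX_VALUE, Math.max(1L, m));
        this.numHashFunctions =
            Math.max(1, (int) Math.round((double) numBits / expectedInsertions * ln2));
        this.bits = new BitSet(numBits);
    }

    /** Assumed 64-bit mixer (SplitMix64 finalizer); the real module uses Murmur3. */
    private static long mix64(long z) {
        z = (z ^ (z >>> 30)) * 0xbf58476d1ce4e5b9L;
        z = (z ^ (z >>> 27)) * 0x94d049bb133111ebL;
        return z ^ (z >>> 31);
    }

    /** Position of the i-th probe, derived from two 32-bit halves of one hash. */
    private int index(long item, int i) {
        long hash = mix64(item);
        int h1 = (int) hash;
        int h2 = (int) (hash >>> 32);
        int combined = h1 + i * h2;
        if (combined < 0) {
            combined = ~combined; // flip all bits so the value is non-negative
        }
        return combined % numBits;
    }

    /** Sets the probe bits for the item; returns true if any bit changed. */
    boolean putLong(long item) {
        boolean changed = false;
        for (int i = 1; i <= numHashFunctions; i++) {
            int idx = index(item, i);
            changed |= !bits.get(idx);
            bits.set(idx);
        }
        return changed;
    }

    /** False means definitely absent; true means possibly present. */
    boolean mightContainLong(long item) {
        for (int i = 1; i <= numHashFunctions; i++) {
            if (!bits.get(index(item, i))) {
                return false;
            }
        }
        return true;
    }

    long bitSize() {
        return numBits;
    }
}
```

Every value that was inserted is always reported as possibly present; only values that were never inserted can be misreported, at a rate governed by the sizing above.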
* [SPARK-12934][SQL] Count-min sketch serialization | Cheng Lian | 2016-01-25 | 4 | -19/+213

  This PR adds serialization support for `CountMinSketch`. A version number is added to version the serialized binary format.

  Author: Cheng Lian <lian@databricks.com>

  Closes #10893 from liancheng/cms-serialization.
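
As a rough illustration of what a versioned binary format means in practice, the sketch below writes a version number ahead of the payload and rejects unknown versions on read. The byte layout, the class name `VersionedCounts`, and the constant `VERSION` are assumptions made for this example; the actual layout used by `CountMinSketch` is defined in the PR itself and is not reproduced here. Streams are closed with try-with-resources.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

/** Hypothetical versioned format: [int version][int length][long counts...]. */
final class VersionedCounts {
    static final int VERSION = 1; // assumed version number for this example

    static byte[] write(long[] counts) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        // try-with-resources flushes and closes the stream even if a write fails.
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(VERSION);
            out.writeInt(counts.length);
            for (long count : counts) {
                out.writeLong(count);
            }
        }
        return bytes.toByteArray();
    }

    static long[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            int version = in.readInt();
            if (version != VERSION) {
                throw new IOException("Unexpected format version: " + version);
            }
            int length = in.readInt();
            if (length < 0) {
                throw new IOException("Negative length: " + length);
            }
            long[] counts = new long[length];
            for (int i = 0; i < length; i++) {
                counts[i] = in.readLong();
            }
            return counts;
        }
    }
}
```

Putting the version first lets a reader fail fast on data written by an incompatible release instead of misinterpreting the remaining bytes.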
* [SPARK-12933][SQL] Initial implementation of Count-Min sketch | Cheng Lian | 2016-01-23 | 5 | -0/+810

  This PR adds an initial implementation of count min sketch, contained in a new module spark-sketch under `common/sketch`. The implementation is based on the [`CountMinSketch` class in stream-lib][1].

  As required by the [design doc][2], spark-sketch should have no external dependency. Two classes, `Murmur3_x86_32` and `Platform`, are copied to spark-sketch from spark-unsafe for hashing facilities. They'll also be used in the upcoming bloom filter implementation.

  The following features will be added in future follow-up PRs:

  - Serialization support
  - DataFrame API integration

  [1]: https://github.com/addthis/stream-lib/blob/aac6b4d23a8686b000f80baa447e0922ecac3bcb/src/main/java/com/clearspring/analytics/stream/frequency/CountMinSketch.java
  [2]: https://issues.apache.org/jira/secure/attachment/12782378/BloomFilterandCount-MinSketchinSpark2.0.pdf

  Author: Cheng Lian <lian@databricks.com>

  Closes #10851 from liancheng/count-min-sketch.
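
To make the data structure in this commit concrete, here is a minimal count-min sketch using only the JDK, in keeping with the no-external-dependency requirement. It is an illustration rather than the PR's code: the class name `TinyCountMinSketch`, the per-row seeds, and the simple string hash are assumptions (the real module hashes with `Murmur3_x86_32`). Each item increments one counter per row, and the estimate is the minimum of those counters, so it can overcount on collisions but never undercounts.

```java
import java.nio.charset.StandardCharsets;
import java.util.Random;

/** Illustrative count-min sketch; all names here are hypothetical, not Spark's API. */
final class TinyCountMinSketch {
    private final int depth;      // number of rows (independent hash functions)
    private final int width;      // number of counters per row
    private final long[][] table;
    private final int[] seeds;    // one assumed hash seed per row

    TinyCountMinSketch(int depth, int width, long seed) {
        if (depth <= 0 || width <= 0) {
            throw new IllegalArgumentException("depth and width must be positive");
        }
        this.depth = depth;
        this.width = width;
        this.table = new long[depth][width];
        this.seeds = new int[depth];
        Random random = new Random(seed);
        for (int row = 0; row < depth; row++) {
            seeds[row] = random.nextInt();
        }
    }

    /** Simple seeded hash (an assumption for this example), reduced to a column. */
    private int bucket(byte[] item, int row) {
        int h = seeds[row];
        for (byte b : item) {
            h = 31 * h + b;
        }
        h ^= h >>> 16;
        h *= 0x85ebca6b;
        h ^= h >>> 13;
        return Math.floorMod(h, width);
    }

    /** Adds count occurrences of item by incrementing one counter in every row. */
    void add(String item, long count) {
        if (count < 0) {
            throw new IllegalArgumentException("count must be non-negative");
        }
        byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
        for (int row = 0; row < depth; row++) {
            table[row][bucket(bytes, row)] += count;
        }
    }

    /** Returns the minimum counter across rows: an upper bound on the true count. */
    long estimate(String item) {
        byte[] bytes = item.getBytes(StandardCharsets.UTF_8);
        long min = Long.MAX_VALUE;
        for (int row = 0; row < depth; row++) {
            min = Math.min(min, table[row][bucket(bytes, row)]);
        }
        return min;
    }
}
```

Increasing `width` lowers the chance of collisions within a row, while increasing `depth` lowers the chance that every row collides for the same item.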