author | Yun Ni <yunn@uber.com> | 2016-11-28 15:14:46 -0800 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2016-11-28 15:14:46 -0800 |
commit | 05f7c6ffab2a6be548375cd624dc27092677232f (patch) | |
tree | 27a954222f507a44273df13222d0946a7b485eed /dev/run-pip-tests | |
parent | 8b1609bebe489b2ef78db4be6e9836687089fe3d (diff) | |
[SPARK-18408][ML] API Improvements for LSH
## What changes were proposed in this pull request?
(1) Change output schema to `Array of Vector` instead of `Vectors`
(2) Use `numHashTables` as the dimension of Array
(3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH`
(4) Make `randUnitVectors/randCoefficients` private
(5) Make Multi-Probe NN Search and `hashDistance` private for future discussion
Saved for future PRs:
(1) AND-amplification, with `numHashFunctions` as the dimension of each Vector.
(2) `hashDistance` and Multi-Probe NN Search need more discussion. The current implementation is only a backward-compatible one.
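The output shape described in changes (1) and (2) can be illustrated with a small stand-alone sketch. This is not the code from this patch: the function names, the prime modulus, and the `(a * x + b) mod PRIME` hash family are assumptions chosen for illustration, and each vector is assumed to hold a single value because AND-amplification is deferred.

```python
import random

# A large prime used as the hash modulus; chosen here for illustration only.
PRIME = 2038074743


def make_coefficients(num_hash_tables, seed=0):
    """Draw one (a, b) pair per hash table for h(x) = (a * x + b) mod PRIME."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, PRIME), rng.randrange(0, PRIME))
        for _ in range(num_hash_tables)
    ]


def min_hash(active_indices, coefficients):
    """Hash a set of active indices into an array of single-element vectors.

    The outer list has one entry per hash table (numHashTables); each inner
    list stands in for a Vector holding that table's minimum hash value.
    """
    if not active_indices:
        raise ValueError("MinHash needs at least one active index")
    return [
        [float(min((a * (1 + i) + b) % PRIME for i in active_indices))]
        for a, b in coefficients
    ]


coefficients = make_coefficients(num_hash_tables=3, seed=42)
signature = min_hash({0, 2, 5}, coefficients)
print(len(signature))               # 3: one vector per hash table
print([len(v) for v in signature])  # [1, 1, 1]: one hash value per vector
```

With this layout, a later change could widen each inner vector to `numHashFunctions` entries without changing the length of the outer array.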
## How was this patch tested?
Related unit tests are modified to ensure that LSH performance is preserved and that the outputs of the APIs meet expectations.
Author: Yun Ni <yunn@uber.com>
Author: Yunni <Euler57721@gmail.com>
Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
Diffstat (limited to 'dev/run-pip-tests')
0 files changed, 0 insertions, 0 deletions