field | value | date |
---|---|---|
author | Xiangrui Meng <meng@databricks.com> | 2015-08-04 22:28:49 -0700 |
committer | Xiangrui Meng <meng@databricks.com> | 2015-08-04 22:28:49 -0700 |
commit | a02bcf20c4fc9e2e182630d197221729e996afc2 | |
tree | addf5a311acafad2849dd32c7a8c47f88f1f702f /sql/catalyst | |
parent | f7abd6bec9d51ed4ab6359e50eac853e64ecae86 | |
[SPARK-9540] [MLLIB] optimize PrefixSpan implementation
This is a major refactoring of the PrefixSpan implementation. It contains the following changes:
1. Expand the prefix with one item at a time. The existing implementation generates all subsets of each itemset, which might cause scalability issues when the itemset is large.
2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` as the delimiter because negative numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2`.
3. Remember the start indices of all partial projections in the projected postfix to help the next projection.
4. Reuse the original sequence array for projected postfixes.
5. Use `Prefix` IDs in aggregation rather than the prefix contents.
6. Use `ArrayBuilder` for building primitive arrays.
7. Expose `maxLocalProjDBSize`.
8. Tests are unchanged except that they use `0` instead of `-1` as the delimiter.
`Postfix`'s API doc should be a good place to start.
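As a rough illustration of the internal format in item 2 and the one-item expansion in item 1, here is a minimal sketch. It is not the code from this patch: `PrefixSpanFormatSketch`, `toInternal`, and `expand` are hypothetical names, sorting items within an itemset is inferred from the `<(12)(31)>` example, and the prefix layout (a `0` before each itemset, no trailing `0`) is an assumption made for the example.

```scala
object PrefixSpanFormatSketch {

  /** Hypothetical helper: flattens a sequence of itemsets into one 0-delimited array. */
  def toInternal(sequence: Seq[Seq[Int]]): Array[Int] = {
    // Build the primitive array incrementally instead of concatenating arrays.
    val builder = Array.newBuilder[Int]
    builder += 0 // leading delimiter
    sequence.foreach { itemset =>
      require(itemset.forall(_ > 0), "items must be positive; 0 is the delimiter")
      builder ++= itemset.distinct.sorted // assumption: items are sorted within an itemset
      builder += 0 // delimiter closing this itemset
    }
    builder.result()
  }

  /**
   * Hypothetical one-item prefix expansion. Assumption: a prefix is stored with a
   * 0 before each itemset and no trailing 0.
   */
  def expand(prefix: Array[Int], item: Int): Array[Int] = {
    require(item != 0, "0 is reserved as the delimiter")
    require(item > 0 || prefix.nonEmpty, "a partial item needs an itemset to join")
    if (item < 0) prefix :+ (-item) // partial item `_x`: joins the last itemset
    else prefix ++ Array(0, item)   // full item: starts a new itemset
  }

  def main(args: Array[String]): Unit = {
    // <(12)(31)>
    println(toInternal(Seq(Seq(1, 2), Seq(3, 1))).mkString("[", ", ", "]")) // [0, 1, 2, 0, 1, 3, 0]
    // <(12)> expanded with `_3` gives <(123)>
    println(expand(Array(0, 1, 2), -3).mkString("[", ", ", "]")) // [0, 1, 2, 3]
    // <(12)> expanded with `3` gives <(12)(3)>
    println(expand(Array(0, 1, 2), 3).mkString("[", ", ", "]")) // [0, 1, 2, 0, 3]
  }
}
```

Because `0` never appears as an item and negative values are reserved for partial prefix items, a single `Int` is enough to say whether an expansion extends the last itemset or starts a new one.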
Closes #7594
feynmanliang zhangjiajin
Author: Xiangrui Meng <meng@databricks.com>
Closes #7937 from mengxr/SPARK-9540 and squashes the following commits:
2d0ec31 [Xiangrui Meng] address more comments
48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test
65f90e8 [Xiangrui Meng] naming and documentation
8afc86a [Xiangrui Meng] refactor impl
Diffstat (limited to 'sql/catalyst')
0 files changed, 0 insertions, 0 deletions