[SPARK-9540] [MLLIB] optimize PrefixSpan implementation - spark


author	Xiangrui Meng <meng@databricks.com>	2015-08-04 22:28:49 -0700
committer	Xiangrui Meng <meng@databricks.com>	2015-08-04 22:28:49 -0700
commit	a02bcf20c4fc9e2e182630d197221729e996afc2 (patch)
tree	addf5a311acafad2849dd32c7a8c47f88f1f702f /sql/catalyst
parent	f7abd6bec9d51ed4ab6359e50eac853e64ecae86 (diff)
download	spark-a02bcf20c4fc9e2e182630d197221729e996afc2.tar.gz spark-a02bcf20c4fc9e2e182630d197221729e996afc2.tar.bz2 spark-a02bcf20c4fc9e2e182630d197221729e996afc2.zip

[SPARK-9540] [MLLIB] optimize PrefixSpan implementation

This is a major refactoring of the PrefixSpan implementation. It contains the following changes:

1. Expand the prefix with one item at a time. The existing implementation generates all subsets for each itemset, which might have scalability issues when the itemset is large.
2. Use a new internal format. `<(12)(31)>` is represented by `[0, 1, 2, 0, 1, 3, 0]` internally. We use `0` as the delimiter because negative numbers are used to indicate partial prefix items, e.g., `_2` is represented by `-2`.
3. Remember the start indices of all partial projections in the projected postfix to help the next projection.
4. Reuse the original sequence array for projected postfixes.
5. Use `Prefix` IDs in aggregation rather than the prefix content.
6. Use `ArrayBuilder` for building primitive arrays.
7. Expose `maxLocalProjDBSize`.
8. Tests are not changed except for using `0` instead of `-1` as the delimiter.

`Postfix`'s API doc should be a good place to start.

Closes #7594 feynmanliang zhangjiajin

Author: Xiangrui Meng <meng@databricks.com>

Closes #7937 from mengxr/SPARK-9540 and squashes the following commits:

2d0ec31 [Xiangrui Meng] address more comments
48f450c [Xiangrui Meng] address comments from Feynman; fixed a bug in project and added a test
65f90e8 [Xiangrui Meng] naming and documentation
8afc86a [Xiangrui Meng] refactor impl
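
The internal format from item 2 of the commit message can be illustrated with a small sketch. This is not the Spark code (which is Scala); it is a standalone Python illustration, and the names `encode`, `decode`, and `DELIMITER` are hypothetical. It assumes items are positive integers and that each itemset is stored in sorted order, which is inferred from the example `<(12)(31)>` mapping to `[0, 1, 2, 0, 1, 3, 0]`.

```python
from typing import List

# Assumption: 0 separates itemsets, as described in the commit message.
DELIMITER = 0


def encode(sequence: List[List[int]]) -> List[int]:
    """Flatten a sequence of itemsets into one delimiter-separated array."""
    out = [DELIMITER]
    for itemset in sequence:
        # 0 is the delimiter and negatives mark partial prefix items,
        # so only positive item IDs are accepted in this sketch.
        if any(item <= 0 for item in itemset):
            raise ValueError("items must be positive integers")
        # Sorting within an itemset is inferred from the commit's example.
        out.extend(sorted(set(itemset)))
        out.append(DELIMITER)
    return out


def decode(encoded: List[int]) -> List[List[int]]:
    """Rebuild the list of itemsets from the flat array."""
    itemsets: List[List[int]] = []
    current: List[int] = []
    for value in encoded[1:]:  # skip the leading delimiter
        if value == DELIMITER:
            itemsets.append(current)
            current = []
        else:
            current.append(value)
    return itemsets


if __name__ == "__main__":
    flat = encode([[1, 2], [3, 1]])
    print(flat)
    print(decode(flat))
```

Keeping the whole sequence in one flat array is what makes item 4 (reusing the original sequence array for projected postfixes) plausible: a postfix can be described by offsets into the same array instead of a copy.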

Diffstat (limited to 'sql/catalyst')

0 files changed, 0 insertions, 0 deletions

