author     zhangjiajin <zhangjiajin@huawei.com>        2015-07-30 08:14:09 -0700
committer  Xiangrui Meng <meng@databricks.com>         2015-07-30 08:14:09 -0700
commit     d212a314227dec26c0dbec8ed3422d0ec8f818f9 (patch)
tree       32775371b13cab56481318e6133bb6e136e63ad0 /mllib/src/test
parent     c5815930be46a89469440b7c61b59764fb67a54c (diff)
[SPARK-8998] [MLLIB] Distribute PrefixSpan computation for large projected databases
Continuation of work by zhangjiajin. Closes #7412.

Author: zhangjiajin <zhangjiajin@huawei.com>
Author: Feynman Liang <fliang@databricks.com>
Author: zhang jiajin <zhangjiajin@huawei.com>

Closes #7783 from feynmanliang/SPARK-8998-improve-distributed and squashes the following commits:

a61943d [Feynman Liang] Collect small patterns to local
4ddf479 [Feynman Liang] Parallelize freqItemCounts
ad23aa9 [zhang jiajin] Merge pull request #1 from feynmanliang/SPARK-8998-collectBeforeLocal
87fa021 [Feynman Liang] Improve extend prefix readability
c2caa5c [Feynman Liang] Readability improvements and comments
1235cfc [Feynman Liang] Use Iterable[Array[_]] over Array[Array[_]] for database
da0091b [Feynman Liang] Use lists for prefixes to reuse data
cb2a4fc [Feynman Liang] Inline code for readability
01c9ae9 [Feynman Liang] Add getters
6e149fa [Feynman Liang] Fix splitPrefixSuffixPairs
64271b3 [zhangjiajin] Modified codes according to comments.
d2250b7 [zhangjiajin] remove minPatternsBeforeLocalProcessing, add maxSuffixesBeforeLocalProcessing.
b07e20c [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark into CollectEnoughPrefixes
095aa3a [zhangjiajin] Modified the code according to the review comments.
baa2885 [zhangjiajin] Modified the code according to the review comments.
6560c69 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixeSpan
a8fde87 [zhangjiajin] Merge branch 'master' of https://github.com/apache/spark
4dd1c8a [zhangjiajin] initialize file before rebase.
078d410 [zhangjiajin] fix a scala style error.
22b0ef4 [zhangjiajin] Add feature: Collect enough frequent prefixes before projection in PrefixSpan.
ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
ba5df34 [zhangjiajin] Fix a Scala style error.
4c60fb3 [zhangjiajin] Fix some Scala style errors.
1dd33ad [zhangjiajin] Modified the code according to the review comments.
89bc368 [zhangjiajin] Fixed a Scala style error.
a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
951fd42 [zhang jiajin] Delete Prefixspan.scala
575995f [zhangjiajin] Modified the code according to the review comments.
91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
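For orientation, a minimal sketch of how the PrefixSpan API exercised by this suite is driven. The builder calls, parameter values, and result type are taken from the test diff below; the input sequences are illustrative placeholders rather than the suite's actual fixture, and a live SparkContext `sc` is assumed.

import org.apache.spark.mllib.fpm.PrefixSpan
import org.apache.spark.rdd.RDD

// Placeholder item sequences; each Array[Int] is one sequence.
val sequences: RDD[Array[Int]] = sc.parallelize(Seq(
  Array(1, 3, 4, 5),
  Array(2, 3, 1),
  Array(2, 4, 1)
), 2).cache()

// Same builder-style configuration the suite uses.
val prefixspan = new PrefixSpan()
  .setMinSupport(0.33)      // keep patterns supported by >= 33% of sequences
  .setMaxPatternLength(50)  // upper bound on pattern length

// run() yields frequent sequential patterns with their support counts.
val patterns: Array[(Array[Int], Long)] = prefixspan.run(sequences).collect()
patterns.foreach { case (pattern, count) =>
  println(pattern.mkString("[", ", ", "]") + " -> " + count)
}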
Diffstat (limited to 'mllib/src/test')
-rw-r--r--  mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala  |  21
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala
index 9f107c89f6..6dd2dc926a 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/fpm/PrefixSpanSuite.scala
@@ -44,13 +44,6 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext {
 
     val rdd = sc.parallelize(sequences, 2).cache()
 
-    def compareResult(
-        expectedValue: Array[(Array[Int], Long)],
-        actualValue: Array[(Array[Int], Long)]): Boolean = {
-      expectedValue.map(x => (x._1.toSeq, x._2)).toSet ==
-        actualValue.map(x => (x._1.toSeq, x._2)).toSet
-    }
-
     val prefixspan = new PrefixSpan()
       .setMinSupport(0.33)
       .setMaxPatternLength(50)
@@ -76,7 +69,7 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext {
       (Array(4, 5), 2L),
       (Array(5), 3L)
     )
-    assert(compareResult(expectedValue1, result1.collect()))
+    assert(compareResults(expectedValue1, result1.collect()))
 
     prefixspan.setMinSupport(0.5).setMaxPatternLength(50)
     val result2 = prefixspan.run(rdd)
@@ -87,7 +80,7 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext {
       (Array(4), 4L),
       (Array(5), 3L)
     )
-    assert(compareResult(expectedValue2, result2.collect()))
+    assert(compareResults(expectedValue2, result2.collect()))
 
     prefixspan.setMinSupport(0.33).setMaxPatternLength(2)
     val result3 = prefixspan.run(rdd)
@@ -107,6 +100,14 @@ class PrefixSpanSuite extends SparkFunSuite with MLlibTestSparkContext {
       (Array(4, 5), 2L),
       (Array(5), 3L)
     )
-    assert(compareResult(expectedValue3, result3.collect()))
+    assert(compareResults(expectedValue3, result3.collect()))
+  }
+
+  private def compareResults(
+      expectedValue: Array[(Array[Int], Long)],
+      actualValue: Array[(Array[Int], Long)]): Boolean = {
+    expectedValue.map(x => (x._1.toSeq, x._2)).toSet ==
+      actualValue.map(x => (x._1.toSeq, x._2)).toSet
   }
+
 }
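A note on the compareResults helper added above: Scala compares Array instances by reference, so the helper maps each (pattern, count) pair to (pattern.toSeq, count) to get element-wise equality, and building a Set on both sides makes the comparison order-insensitive. A quick REPL-style illustration (not part of the patch):

Array(1, 2) == Array(1, 2)              // false: arrays compare by reference
Array(1, 2).toSeq == Array(1, 2).toSeq  // true: Seq compares element by element

// Hence converting to Set[(Seq[Int], Long)] lets the expected and actual
// pattern collections be compared regardless of ordering.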