author Feynman Liang <fliang@databricks.com> 2015-08-17 17:53:24 -0700
committer Xiangrui Meng <meng@databricks.com> 2015-08-17 17:53:24 -0700
commit 0b6b01761370629ce387c143a25d41f3a334ff28
tree d7cc1c1213dd71b84b0841dcbd22fd2d4cabc628 /docs/mllib-frequent-pattern-mining.md
parent 18523c130548f0438dff8d1f25531fd2ed36e517
[SPARK-9898] [MLLIB] Prefix Span user guide
Adds user guide for `PrefixSpan`, including Scala and Java example code.

mengxr zhangjiajin

Author: Feynman Liang <fliang@databricks.com>

Closes #8253 from feynmanliang/SPARK-9898.
Diffstat (limited to 'docs/mllib-frequent-pattern-mining.md')
-rw-r--r-- docs/mllib-frequent-pattern-mining.md | 96
1 files changed, 96 insertions, 0 deletions
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index bcc066a185..8ea4389266 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -96,3 +96,99 @@ for (FPGrowth.FreqItemset<String> itemset: model.freqItemsets().toJavaRDD().coll
</div>
</div>
+
+## PrefixSpan
+
+PrefixSpan is a sequential pattern mining algorithm described in
+[Pei et al., Mining Sequential Patterns by Pattern-Growth: The
+PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer
+the reader to the referenced paper for a formal definition of the
+sequential pattern mining problem.
+
+MLlib's PrefixSpan implementation takes the following parameters:
+
+* `minSupport`: the minimum support required to be considered a frequent
+ sequential pattern.
+* `maxPatternLength`: the maximum length of a frequent sequential
+ pattern. Any frequent pattern exceeding this length will not be
+ included in the results.
+* `maxLocalProjDBSize`: the maximum number of items allowed in a
+ prefix-projected database before local iterative processing of the
+ projected database begins. This parameter should be tuned with respect
+ to the size of your executors.
+
+**Examples**
+
+The following example illustrates PrefixSpan running on the sequences
+(using the same notation as Pei et al.):
+
+~~~
+ <(12)3>
+ <1(32)(12)>
+ <(12)5>
+ <6>
+~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[`PrefixSpan`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpan) implements the
+PrefixSpan algorithm.
+Calling `PrefixSpan.run` returns a
+[`PrefixSpanModel`](api/scala/index.html#org.apache.spark.mllib.fpm.PrefixSpanModel)
+that stores the frequent sequences with their frequencies.
+
+{% highlight scala %}
+import org.apache.spark.mllib.fpm.PrefixSpan
+
+val sequences = sc.parallelize(Seq(
+ Array(Array(1, 2), Array(3)),
+ Array(Array(1), Array(3, 2), Array(1, 2)),
+ Array(Array(1, 2), Array(5)),
+ Array(Array(6))
+ ), 2).cache()
+val prefixSpan = new PrefixSpan()
+ .setMinSupport(0.5)
+ .setMaxPatternLength(5)
+val model = prefixSpan.run(sequences)
+model.freqSequences.collect().foreach { freqSequence =>
+  println(
+    freqSequence.sequence.map(_.mkString("[", ", ", "]")).mkString("[", ", ", "]") + ", " + freqSequence.freq)
+}
+{% endhighlight %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+[`PrefixSpan`](api/java/org/apache/spark/mllib/fpm/PrefixSpan.html) implements the
+PrefixSpan algorithm.
+Calling `PrefixSpan.run` returns a
+[`PrefixSpanModel`](api/java/org/apache/spark/mllib/fpm/PrefixSpanModel.html)
+that stores the frequent sequences with their frequencies.
+
+{% highlight java %}
+import java.util.Arrays;
+import java.util.List;
+
+import org.apache.spark.mllib.fpm.PrefixSpan;
+import org.apache.spark.mllib.fpm.PrefixSpanModel;
+
+JavaRDD<List<List<Integer>>> sequences = sc.parallelize(Arrays.asList(
+ Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)),
+ Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)),
+ Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)),
+ Arrays.asList(Arrays.asList(6))
+), 2);
+PrefixSpan prefixSpan = new PrefixSpan()
+ .setMinSupport(0.5)
+ .setMaxPatternLength(5);
+PrefixSpanModel<Integer> model = prefixSpan.run(sequences);
+for (PrefixSpan.FreqSequence<Integer> freqSeq: model.freqSequences().toJavaRDD().collect()) {
+ System.out.println(freqSeq.javaSequence() + ", " + freqSeq.freq());
+}
+{% endhighlight %}
+
+</div>
+</div>
+
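
To make the `minSupport` parameter from the added guide concrete, the following plain-Java sketch (no Spark dependency) counts how many of the four example sequences contain a candidate pattern. The class `SupportSketch` and its methods are hypothetical names introduced for illustration. Rounding the threshold with `Math.ceil` is an assumption about how a fractional support maps to a sequence count, not a statement of MLlib's implementation.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: plain Java, no Spark. All names here are hypothetical.
class SupportSketch {

    // The four example sequences from the guide; each inner list is one itemset.
    static List<List<List<Integer>>> exampleDatabase() {
        return Arrays.asList(
            Arrays.asList(Arrays.asList(1, 2), Arrays.asList(3)),
            Arrays.asList(Arrays.asList(1), Arrays.asList(3, 2), Arrays.asList(1, 2)),
            Arrays.asList(Arrays.asList(1, 2), Arrays.asList(5)),
            Arrays.asList(Arrays.asList(6)));
    }

    // True if the pattern's itemsets occur, in order, as subsets of itemsets of the sequence.
    // Matching each pattern itemset at the earliest possible position is sufficient.
    static boolean contains(List<List<Integer>> sequence, List<List<Integer>> pattern) {
        int matched = 0;
        for (List<Integer> itemset : sequence) {
            if (matched < pattern.size() && itemset.containsAll(pattern.get(matched))) {
                matched++;
            }
        }
        return matched == pattern.size();
    }

    // Number of sequences in the database that contain the pattern.
    static long support(List<List<List<Integer>>> db, List<List<Integer>> pattern) {
        long count = 0;
        for (List<List<Integer>> sequence : db) {
            if (contains(sequence, pattern)) {
                count++;
            }
        }
        return count;
    }

    // Assumption: a fractional minSupport is turned into a count by rounding up.
    static long minCount(double minSupport, int numSequences) {
        return (long) Math.ceil(minSupport * numSequences);
    }

    public static void main(String[] args) {
        List<List<List<Integer>>> db = exampleDatabase();
        long threshold = minCount(0.5, db.size());
        List<List<Integer>> pattern = Arrays.asList(Arrays.asList(1), Arrays.asList(3)); // <1 3>
        long s = support(db, pattern);
        System.out.println("support=" + s + ", frequent=" + (s >= threshold));
    }
}
```

Under these assumptions, a pattern such as `<5>`, which occurs in only one of the four sequences, would fall below a `minSupport` of 0.5, while `<(12)>` would pass.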