aboutsummaryrefslogtreecommitdiff
path: root/docs/mllib-frequent-pattern-mining.md
diff options
context:
space:
mode:
authorTimothy Hunter <timhunter@databricks.com>2015-12-10 12:50:46 -0800
committerJoseph K. Bradley <joseph@databricks.com>2015-12-10 12:50:46 -0800
commit2ecbe02d5b28ee562d10c1735244b90a08532c9e (patch)
treec589a01a2900513aa1b277303ed7cdffc1961ba4 /docs/mllib-frequent-pattern-mining.md
parentec5f9ed5de2218938dba52152475daafd4dc4786 (diff)
downloadspark-2ecbe02d5b28ee562d10c1735244b90a08532c9e.tar.gz
spark-2ecbe02d5b28ee562d10c1735244b90a08532c9e.tar.bz2
spark-2ecbe02d5b28ee562d10c1735244b90a08532c9e.zip
[SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.
Replaces a number of occurences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in spark). It also removes some files that I forgot to delete with #10207 Author: Timothy Hunter <timhunter@databricks.com> Closes #10234 from thunterdb/12212.
Diffstat (limited to 'docs/mllib-frequent-pattern-mining.md')
-rw-r--r--docs/mllib-frequent-pattern-mining.md12
1 files changed, 6 insertions, 6 deletions
diff --git a/docs/mllib-frequent-pattern-mining.md b/docs/mllib-frequent-pattern-mining.md
index fe42896a05..2c8a8f2361 100644
--- a/docs/mllib-frequent-pattern-mining.md
+++ b/docs/mllib-frequent-pattern-mining.md
@@ -1,7 +1,7 @@
---
layout: global
-title: Frequent Pattern Mining - MLlib
-displayTitle: <a href="mllib-guide.html">MLlib</a> - Frequent Pattern Mining
+title: Frequent Pattern Mining - spark.mllib
+displayTitle: Frequent Pattern Mining - spark.mllib
---
Mining frequent items, itemsets, subsequences, or other substructures is usually among the
@@ -9,7 +9,7 @@ first steps to analyze a large-scale dataset, which has been an active research
data mining for years.
We refer users to Wikipedia's [association rule learning](http://en.wikipedia.org/wiki/Association_rule_learning)
for more information.
-MLlib provides a parallel implementation of FP-growth,
+`spark.mllib` provides a parallel implementation of FP-growth,
a popular algorithm to mining frequent itemsets.
## FP-growth
@@ -22,13 +22,13 @@ Different from [Apriori-like](http://en.wikipedia.org/wiki/Apriori_algorithm) al
the second step of FP-growth uses a suffix tree (FP-tree) structure to encode transactions without generating candidate sets
explicitly, which are usually expensive to generate.
After the second step, the frequent itemsets can be extracted from the FP-tree.
-In MLlib, we implemented a parallel version of FP-growth called PFP,
+In `spark.mllib`, we implemented a parallel version of FP-growth called PFP,
as described in [Li et al., PFP: Parallel FP-growth for query recommendation](http://dx.doi.org/10.1145/1454008.1454027).
PFP distributes the work of growing FP-trees based on the suffices of transactions,
and hence more scalable than a single-machine implementation.
We refer users to the papers for more details.
-MLlib's FP-growth implementation takes the following (hyper-)parameters:
+`spark.mllib`'s FP-growth implementation takes the following (hyper-)parameters:
* `minSupport`: the minimum support for an itemset to be identified as frequent.
For example, if an item appears 3 out of 5 transactions, it has a support of 3/5=0.6.
@@ -126,7 +126,7 @@ PrefixSpan Approach](http://dx.doi.org/10.1109%2FTKDE.2004.77). We refer
the reader to the referenced paper for formalizing the sequential
pattern mining problem.
-MLlib's PrefixSpan implementation takes the following parameters:
+`spark.mllib`'s PrefixSpan implementation takes the following parameters:
* `minSupport`: the minimum support required to be considered a frequent
sequential pattern.