From 26d35f3fd942761b0adecd1a720e1fa834db4de9 Mon Sep 17 00:00:00 2001
From: Xiangrui Meng
Date: Tue, 22 Apr 2014 11:20:47 -0700
Subject: [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0

Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
---
 docs/mllib-naive-bayes.md | 115 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 docs/mllib-naive-bayes.md

diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
new file mode 100644
index 0000000000..6160fe5b2f
--- /dev/null
+++ b/docs/mllib-naive-bayes.md
@@ -0,0 +1,115 @@
---
layout: global
title: MLlib - Naive Bayes
---

Naive Bayes is a
simple multiclass classification algorithm that assumes independence between every pair of
features. Naive Bayes can be trained very efficiently. In a single pass over the training data, it
computes the conditional probability distribution of each feature given a label, and then it
applies Bayes' theorem to compute the conditional probability distribution of a label given an
observation, which it uses for prediction. For more details, please see the Wikipedia page
[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).

In MLlib, we implemented multinomial naive Bayes, which is typically used for document
classification. In that context, each observation is a document and each feature represents a
term, whose value is the frequency of the term. For its formulation, please see the Wikipedia page
[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
or the section
[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
from the book Introduction to Information Retrieval.
[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by setting the
parameter $\lambda$ (which defaults to $1.0$). For document classification, the input feature
vectors are usually sparse; please supply sparse vectors as input to take advantage of sparsity.
Since the training data is used only once, it is not necessary to cache it.
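As a concrete sketch of the multinomial model with additive smoothing (using our own notation,
which is not from the guide itself): the conditional probability of term $t$ given class $c$ is
estimated as

\[
\hat{P}(t \mid c) = \frac{N_{ct} + \lambda}{N_c + \lambda |V|},
\]

where $N_{ct}$ is the total frequency of term $t$ over all training documents labeled $c$, $N_c =
\sum_{t'} N_{ct'}$, and $|V|$ is the number of features (the vocabulary size). A document with
term-frequency vector $(f_1, \ldots, f_{|V|})$ is then assigned the class $c$ maximizing
$\hat{P}(c) \prod_t \hat{P}(t \mid c)^{f_t}$.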
## Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">

[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.

{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ... // training set
val test: RDD[LabeledPoint] = ... // test set

// Train a model with additive smoothing parameter lambda = 1.0.
val model = NaiveBayes.train(training, lambda = 1.0)
// Predict labels for the test features.
val prediction = model.predict(test.map(_.features))

// Pair each prediction with the true label and compute accuracy.
val predictionAndLabel = prediction.zip(test.map(_.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
{% endhighlight %}
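For illustration, here is one hedged way the `training` and `test` RDDs above might be built. The
toy points below are our own invention and not part of the guide; note that multinomial naive
Bayes expects nonnegative feature values.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A tiny, hypothetical dataset: class 1.0 when the first coordinate
// dominates, class 0.0 when the second does. All values are nonnegative.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 3.0))))
val test = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(4.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 4.0))))
{% endhighlight %}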
</div>
<div data-lang="java" markdown="1">

[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.

{% highlight java %}
import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

// Train a model with additive smoothing parameter lambda = 1.0.
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

// Predict labels for the test features.
JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
    public Vector call(LabeledPoint p) {
      return p.features();
    }
  }));

// Pair each prediction with the true label and compute accuracy.
JavaPairRDD<Double, Double> predictionAndLabel =
  prediction.zip(test.map(new Function<LabeledPoint, Double>() {
    public Double call(LabeledPoint p) {
      return p.label();
    }
  }));
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1().equals(pl._2());
    }
  }).count() / test.count();
{% endhighlight %}
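Because the guide recommends sparse input for document data, here is a brief hedged sketch of
constructing a sparse term-frequency vector in Java; the vocabulary size and indices are invented
for the example.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// A document over a 5-term vocabulary in which only terms 0 and 3 occur,
// with frequencies 2.0 and 1.0; all other entries are implicit zeros.
Vector termFrequencies = Vectors.sparse(5, new int[] {0, 3}, new double[] {2.0, 1.0});
LabeledPoint doc = new LabeledPoint(1.0, termFrequencies);
{% endhighlight %}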
</div>
<div data-lang="python" markdown="1">

[NaiveBayes](api/pyspark/pyspark.mllib.classification.NaiveBayes-class.html) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.

{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

# an RDD of LabeledPoint
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 0.0]),
    ...  # more labeled points
])

# Train a naive Bayes model with additive smoothing parameter 1.0.
model = NaiveBayes.train(data, 1.0)

# Make a prediction for a single feature vector.
prediction = model.predict([0.0, 0.0])
{% endhighlight %}
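To mirror the Scala and Java tabs, accuracy on held-out data might be computed as in the sketch
below; the `test` RDD is a hypothetical stand-in for real held-out points.

{% highlight python %}
# Hypothetical held-out data; in practice this would come from a real split.
test = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

# Pair each prediction with the true label and compute accuracy.
prediction_and_label = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * prediction_and_label.filter(lambda x: x[0] == x[1]).count() / test.count()
{% endhighlight %}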
</div>
</div>