aboutsummaryrefslogtreecommitdiff
path: root/docs/mllib-naive-bayes.md
diff options
context:
space:
mode:
authorXiangrui Meng <meng@databricks.com>2014-04-22 11:20:47 -0700
committerPatrick Wendell <pwendell@gmail.com>2014-04-22 11:20:47 -0700
commit26d35f3fd942761b0adecd1a720e1fa834db4de9 (patch)
tree16e57e2ff01e7cd2d7a1a3c1f3bf98c9cf98a082 /docs/mllib-naive-bayes.md
parentbf9d49b6d1f668b49795c2d380ab7d64ec0029da (diff)
downloadspark-26d35f3fd942761b0adecd1a720e1fa834db4de9.tar.gz
spark-26d35f3fd942761b0adecd1a720e1fa834db4de9.tar.bz2
spark-26d35f3fd942761b0adecd1a720e1fa834db4de9.zip
[SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0
Preview: http://54.82.240.23:4000/mllib-guide.html Table of contents: * Basics * Data types * Summary statistics * Classification and regression * linear support vector machine (SVM) * logistic regression * linear linear squares, Lasso, and ridge regression * decision tree * naive Bayes * Collaborative Filtering * alternating least squares (ALS) * Clustering * k-means * Dimensionality reduction * singular value decomposition (SVD) * principal component analysis (PCA) * Optimization * stochastic gradient descent * limited-memory BFGS (L-BFGS) Author: Xiangrui Meng <meng@databricks.com> Closes #422 from mengxr/mllib-doc and squashes the following commits: 944e3a9 [Xiangrui Meng] merge master f9fda28 [Xiangrui Meng] minor 9474065 [Xiangrui Meng] add alpha to ALS examples 928e630 [Xiangrui Meng] initialization_mode -> initializationMode 5bbff49 [Xiangrui Meng] add imports to labeled point examples c17440d [Xiangrui Meng] fix python nb example 28f40dc [Xiangrui Meng] remove localhost:4000 369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc 7dc95cc [Xiangrui Meng] update linear methods 053ad8a [Xiangrui Meng] add links to go back to the main page abbbf7e [Xiangrui Meng] update ALS argument names 648283e [Xiangrui Meng] level down statistics 14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide 8cd2441 [Xiangrui Meng] minor updates 186ab07 [Xiangrui Meng] update section names 6568d65 [Xiangrui Meng] update toc, level up lr and svm 162ee12 [Xiangrui Meng] rename section names 5c1e1b1 [Xiangrui Meng] minor 8aeaba1 [Xiangrui Meng] wrap long lines 6ce6a6f [Xiangrui Meng] add summary statistics to toc 5760045 [Xiangrui Meng] claim beta cc604bf [Xiangrui Meng] remove classification and regression 92747b3 [Xiangrui Meng] make section titles consistent e605dd6 [Xiangrui Meng] add LIBSVM loader f639674 [Xiangrui Meng] add python section to migration guide c82ffb4 [Xiangrui Meng] clean optimization 31660eb [Xiangrui Meng] update linear algebra and stat 0a40837 [Xiangrui Meng] first pass over linear methods 1fc8271 [Xiangrui Meng] update toc 906ed0a [Xiangrui Meng] add a python example to naive bayes 5f0a700 [Xiangrui Meng] update collaborative filtering 656d416 [Xiangrui Meng] update mllib-clustering 86e143a [Xiangrui Meng] remove data types section from main page 8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples d1b5cbf [Xiangrui Meng] merge master 72e4804 [Xiangrui Meng] one pass over tree guide 64f8995 [Xiangrui Meng] move decision tree guide to a separate file 9fca001 [Xiangrui Meng] add first version of linear algebra guide 53c9552 [Xiangrui Meng] update dependencies f316ec2 [Xiangrui Meng] add migration guide f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction 182460f [Xiangrui Meng] add guide for naive Bayes 137fd1d [Xiangrui Meng] re-organize toc a61e434 [Xiangrui Meng] update mllib's toc
Diffstat (limited to 'docs/mllib-naive-bayes.md')
-rw-r--r--docs/mllib-naive-bayes.md115
1 files changed, 115 insertions, 0 deletions
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
new file mode 100644
index 0000000000..6160fe5b2f
--- /dev/null
+++ b/docs/mllib-naive-bayes.md
@@ -0,0 +1,115 @@
+---
+layout: global
+title: <a href="mllib-guide.html">MLlib</a> - Naive Bayes
+---
+
+Naive Bayes is a simple multiclass classification algorithm with the assumption of independence
+between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to
+the training data, it computes the conditional probability distribution of each feature given label,
+and then it applies Bayes' theorem to compute the conditional probability distribution of label
+given an observation and use it for prediction. For more details, please visit the wikipedia page
+[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).
+
+In MLlib, we implemented multinomial naive Bayes, which is typically used for document
+classification. Within that context, each observation is a document, each feature represents a term,
+whose value is the frequency of the term. For its formulation, please visit the wikipedia page
+[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
+or the section
+[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
+from the book Introduction to Information
+Retrieval. [Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by
+setting the parameter $\lambda$ (default to $1.0$). For document classification, the input feature
+vectors are usually sparse. Please supply sparse vectors as input to take advantage of
+sparsity. Since the training data is only used once, it is not necessary to cache it.
+
+## Examples
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
+multinomial naive Bayes. It takes an RDD of
+[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional
+smoothing parameter `lambda` as input, and output a
+[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
+can be used for evaluation and prediction.
+
+{% highlight scala %}
+import org.apache.spark.mllib.classification.NaiveBayes
+
+val training: RDD[LabeledPoint] = ... // training set
+val test: RDD[LabeledPoint] = ... // test set
+
+val model = NaiveBayes.train(training, lambda = 1.0)
+val prediction = model.predict(test.map(_.features))
+
+val predictionAndLabel = prediction.zip(test.map(_.label))
+val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
+{% endhighlight %}
+</div>
+
+<div data-lang="java" markdown="1">
+
+[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
+multinomial naive Bayes. It takes a Scala RDD of
+[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an
+optionally smoothing parameter `lambda` as input, and output a
+[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
+can be used for evaluation and prediction.
+
+{% highlight java %}
+import org.apache.spark.mllib.classification.NaiveBayes;
+
+JavaRDD<LabeledPoint> training = ... // training set
+JavaRDD<LabeledPoint> test = ... // test set
+
+NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
+
+JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
+ public Vector call(LabeledPoint p) {
+ return p.features();
+ }
+ })
+JavaPairRDD<Double, Double> predictionAndLabel =
+ prediction.zip(test.map(new Function<LabeledPoint, Double>() {
+ public Double call(LabeledPoint p) {
+ return p.label();
+ }
+ })
+double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
+ public Boolean call(Tuple2<Double, Double> pl) {
+ return pl._1() == pl._2();
+ }
+ }).count() / test.count()
+{% endhighlight %}
+</div>
+
+<div data-lang="python" markdown="1">
+
+[NaiveBayes](api/pyspark/pyspark.mllib.classification.NaiveBayes-class.html) implements multinomial
+naive Bayes. It takes an RDD of
+[LabeledPoint](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html) and an optionally
+smoothing parameter `lambda` as input, and output a
+[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
+used for evaluation and prediction.
+
+<!--- TODO: Make Python's example consistent with Scala's and Java's. --->
+{% highlight python %}
+from pyspark.mllib.regression import LabeledPoint
+from pyspark.mllib.classification import NaiveBayes
+
+# an RDD of LabeledPoint
+data = sc.parallelize([
+ LabeledPoint(0.0, [0.0, 0.0])
+ ... # more labeled points
+])
+
+# Train a naive Bayes model.
+model = NaiveBayes.train(data, 1.0)
+
+# Make prediction.
+prediction = model.predict([0.0, 0.0])
+{% endhighlight %}
+
+</div>
+</div>