From 26d35f3fd942761b0adecd1a720e1fa834db4de9 Mon Sep 17 00:00:00 2001
From: Xiangrui Meng
Date: Tue, 22 Apr 2014 11:20:47 -0700
Subject: [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0

Preview: http://54.82.240.23:4000/mllib-guide.html

Table of contents:

* Basics
  * Data types
  * Summary statistics
* Classification and regression
  * linear support vector machine (SVM)
  * logistic regression
  * linear least squares, Lasso, and ridge regression
  * decision tree
  * naive Bayes
* Collaborative Filtering
  * alternating least squares (ALS)
* Clustering
  * k-means
* Dimensionality reduction
  * singular value decomposition (SVD)
  * principal component analysis (PCA)
* Optimization
  * stochastic gradient descent
  * limited-memory BFGS (L-BFGS)

Author: Xiangrui Meng

Closes #422 from mengxr/mllib-doc and squashes the following commits:

944e3a9 [Xiangrui Meng] merge master
f9fda28 [Xiangrui Meng] minor
9474065 [Xiangrui Meng] add alpha to ALS examples
928e630 [Xiangrui Meng] initialization_mode -> initializationMode
5bbff49 [Xiangrui Meng] add imports to labeled point examples
c17440d [Xiangrui Meng] fix python nb example
28f40dc [Xiangrui Meng] remove localhost:4000
369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc
7dc95cc [Xiangrui Meng] update linear methods
053ad8a [Xiangrui Meng] add links to go back to the main page
abbbf7e [Xiangrui Meng] update ALS argument names
648283e [Xiangrui Meng] level down statistics
14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide
8cd2441 [Xiangrui Meng] minor updates
186ab07 [Xiangrui Meng] update section names
6568d65 [Xiangrui Meng] update toc, level up lr and svm
162ee12 [Xiangrui Meng] rename section names
5c1e1b1 [Xiangrui Meng] minor
8aeaba1 [Xiangrui Meng] wrap long lines
6ce6a6f [Xiangrui Meng] add summary statistics to toc
5760045 [Xiangrui Meng] claim beta
cc604bf [Xiangrui Meng] remove classification and regression
92747b3 [Xiangrui Meng] make section titles consistent
e605dd6 [Xiangrui Meng] add LIBSVM loader
f639674 [Xiangrui Meng] add python section to migration guide
c82ffb4 [Xiangrui Meng] clean optimization
31660eb [Xiangrui Meng] update linear algebra and stat
0a40837 [Xiangrui Meng] first pass over linear methods
1fc8271 [Xiangrui Meng] update toc
906ed0a [Xiangrui Meng] add a python example to naive bayes
5f0a700 [Xiangrui Meng] update collaborative filtering
656d416 [Xiangrui Meng] update mllib-clustering
86e143a [Xiangrui Meng] remove data types section from main page
8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples
d1b5cbf [Xiangrui Meng] merge master
72e4804 [Xiangrui Meng] one pass over tree guide
64f8995 [Xiangrui Meng] move decision tree guide to a separate file
9fca001 [Xiangrui Meng] add first version of linear algebra guide
53c9552 [Xiangrui Meng] update dependencies
f316ec2 [Xiangrui Meng] add migration guide
f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction
182460f [Xiangrui Meng] add guide for naive Bayes
137fd1d [Xiangrui Meng] re-organize toc
a61e434 [Xiangrui Meng] update mllib's toc
---
 docs/mllib-naive-bayes.md | 115 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 115 insertions(+)
 create mode 100644 docs/mllib-naive-bayes.md

diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
new file mode 100644
index 0000000000..6160fe5b2f
--- /dev/null
+++ b/docs/mllib-naive-bayes.md
@@ -0,0 +1,115 @@
---
layout: global
title: MLlib - Naive Bayes
---

Naive Bayes is a
simple multiclass classification algorithm that assumes independence between every pair of
features. Naive Bayes can be trained very efficiently. In a single pass over the training data, it
computes the conditional probability distribution of each feature given a label, and then it
applies Bayes' theorem to compute the conditional probability distribution of a label given an
observation, which it uses for prediction. For more details, please see the Wikipedia page
[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).

In MLlib, we implemented multinomial naive Bayes, which is typically used for document
classification. In that context, each observation is a document and each feature represents a
term, whose value is the frequency of the term. For its formulation, please see the Wikipedia page
[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
or the section
[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
from the book Introduction to Information Retrieval.
[Additive smoothing](http://en.wikipedia.org/wiki/Lidstone_smoothing) can be used by setting the
parameter $\lambda$ (which defaults to $1.0$). For document classification, the input feature
vectors are usually sparse; please supply sparse vectors as input to take advantage of sparsity.
Since the training data is used only once, it is not necessary to cache it.
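As a concrete sketch of the multinomial model with additive smoothing (using our own notation,
which is not from the guide itself): the conditional probability of term $t$ given class $c$ is
estimated as

\[
\hat{P}(t \mid c) = \frac{N_{ct} + \lambda}{N_c + \lambda |V|},
\]

where $N_{ct}$ is the total frequency of term $t$ over all training documents labeled $c$, $N_c =
\sum_{t'} N_{ct'}$, and $|V|$ is the number of features (the vocabulary size). A document with
term-frequency vector $(f_1, \ldots, f_{|V|})$ is then assigned the class $c$ maximizing
$\hat{P}(c) \prod_t \hat{P}(t \mid c)^{f_t}$.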
## Examples

<div class="codetabs">
<div data-lang="scala" markdown="1">

[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes an RDD of
[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.

{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

val training: RDD[LabeledPoint] = ... // training set
val test: RDD[LabeledPoint] = ... // test set

// Train a model with additive smoothing parameter lambda = 1.0.
val model = NaiveBayes.train(training, lambda = 1.0)
// Predict labels for the test features.
val prediction = model.predict(test.map(_.features))

// Pair each prediction with the true label and compute accuracy.
val predictionAndLabel = prediction.zip(test.map(_.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()
{% endhighlight %}
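For illustration, here is one hedged way the `training` and `test` RDDs above might be built. The
toy points below are our own invention and not part of the guide; note that multinomial naive
Bayes expects nonnegative feature values.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// A tiny, hypothetical dataset: class 1.0 when the first coordinate
// dominates, class 0.0 when the second does. All values are nonnegative.
val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(2.0, 0.0)),
  LabeledPoint(1.0, Vectors.dense(3.0, 1.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 2.0)),
  LabeledPoint(0.0, Vectors.dense(1.0, 3.0))))
val test = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(4.0, 0.0)),
  LabeledPoint(0.0, Vectors.dense(0.0, 4.0))))
{% endhighlight %}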
</div>
<div data-lang="java" markdown="1">

[NaiveBayes](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayes$) implements
multinomial naive Bayes. It takes a Scala RDD of
[LabeledPoint](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint) and an
optional smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/mllib/index.html#org.apache.spark.mllib.classification.NaiveBayesModel), which
can be used for evaluation and prediction.

{% highlight java %}
import scala.Tuple2;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.regression.LabeledPoint;

JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set

// Train a model with additive smoothing parameter lambda = 1.0.
NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);

// Predict labels for the test features.
JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
    public Vector call(LabeledPoint p) {
      return p.features();
    }
  }));

// Pair each prediction with the true label and compute accuracy.
JavaPairRDD<Double, Double> predictionAndLabel =
  prediction.zip(test.map(new Function<LabeledPoint, Double>() {
    public Double call(LabeledPoint p) {
      return p.label();
    }
  }));
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
    public Boolean call(Tuple2<Double, Double> pl) {
      return pl._1().equals(pl._2());
    }
  }).count() / test.count();
{% endhighlight %}
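Because the guide recommends sparse input for document data, here is a brief hedged sketch of
constructing a sparse term-frequency vector in Java; the vocabulary size and indices are invented
for the example.

{% highlight java %}
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
import org.apache.spark.mllib.regression.LabeledPoint;

// A document over a 5-term vocabulary in which only terms 0 and 3 occur,
// with frequencies 2.0 and 1.0; all other entries are implicit zeros.
Vector termFrequencies = Vectors.sparse(5, new int[] {0, 3}, new double[] {2.0, 1.0});
LabeledPoint doc = new LabeledPoint(1.0, termFrequencies);
{% endhighlight %}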
</div>
<div data-lang="python" markdown="1">

[NaiveBayes](api/pyspark/pyspark.mllib.classification.NaiveBayes-class.html) implements multinomial
naive Bayes. It takes an RDD of
[LabeledPoint](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html) and an optional
smoothing parameter `lambda` as input, and outputs a
[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.

{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes

# an RDD of LabeledPoint
data = sc.parallelize([
    LabeledPoint(0.0, [0.0, 0.0]),
    ...  # more labeled points
])

# Train a naive Bayes model with additive smoothing parameter 1.0.
model = NaiveBayes.train(data, 1.0)

# Make a prediction for a single feature vector.
prediction = model.predict([0.0, 0.0])
{% endhighlight %}
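To mirror the Scala and Java tabs, accuracy on held-out data might be computed as in the sketch
below; the `test` RDD is a hypothetical stand-in for real held-out points.

{% highlight python %}
# Hypothetical held-out data; in practice this would come from a real split.
test = sc.parallelize([
    LabeledPoint(0.0, [0.0, 1.0]),
    LabeledPoint(1.0, [1.0, 0.0]),
])

# Pair each prediction with the true label and compute accuracy.
prediction_and_label = test.map(lambda p: (model.predict(p.features), p.label))
accuracy = 1.0 * prediction_and_label.filter(lambda x: x[0] == x[1]).count() / test.count()
{% endhighlight %}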
</div>
</div>