Naive Bayes - spark.mllib

Naive Bayes is a simple multiclass classification algorithm that assumes every pair of features is conditionally independent given the label. Naive Bayes can be trained very efficiently. In a single pass over the training data, it computes the conditional probability distribution of each feature given the label. It then applies Bayes’ theorem to compute the conditional probability distribution of the label given an observation and uses that distribution for prediction.
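
As a sketch of the prediction rule described above, write $x = (x_1, \dots, x_n)$ for the feature values of an observation and $y$ for a label. This notation is assumed here for illustration and is not taken from the spark.mllib documentation. Under the independence assumption, Bayes’ theorem factors as:

$$
P(y \mid x) = \frac{P(y) \prod_{j=1}^{n} P(x_j \mid y)}{\sum_{y'} P(y') \prod_{j=1}^{n} P(x_j \mid y')},
\qquad
\hat{y} = \arg\max_{y} \Big( \log P(y) + \sum_{j=1}^{n} \log P(x_j \mid y) \Big).
$$

The denominator does not depend on $y$, so prediction only needs the numerator, which is usually evaluated in log space to avoid numerical underflow.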

spark.mllib supports multinomial naive Bayes and Bernoulli naive Bayes. These models are typically used for document classification. Within that context, each observation is a document and each feature represents a term. The feature value is the frequency of the term (in multinomial naive Bayes) or a zero or one indicating whether the term was found in the document (in Bernoulli naive Bayes). Feature values must be nonnegative. The model type is selected with an optional parameter, “multinomial” or “bernoulli”, with “multinomial” as the default. Additive smoothing can be used by setting the parameter $\lambda$ (default $1.0$). For document classification, the input feature vectors are usually sparse, and sparse vectors should be supplied as input to take advantage of sparsity. Since the training data is only used once, it is not necessary to cache it.
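
To make the two model types and the smoothing parameter concrete, the following is a minimal pure-Python sketch of the standard textbook estimators. It is not spark.mllib's implementation. The function names (`train_naive_bayes`, `predict`), the dict-based sparse encoding of a document, and the choice to smooth the class priors with the same $\lambda$ are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def train_naive_bayes(samples, num_features, lam=1.0, model_type="multinomial"):
    """Estimate log-priors and log-likelihoods in a single pass over samples.

    samples: iterable of (label, {feature_index: value}) pairs. The dict is a
    hypothetical sparse encoding for this sketch; absent indices count as 0.
    lam: additive smoothing parameter; must be > 0 here so no log(0) occurs.
    """
    if model_type not in ("multinomial", "bernoulli"):
        raise ValueError("unknown model type: %s" % model_type)
    doc_counts = defaultdict(int)  # number of documents per label
    feature_sums = defaultdict(lambda: defaultdict(float))  # per-label totals
    for label, features in samples:
        doc_counts[label] += 1
        for j, value in features.items():
            if value < 0:
                raise ValueError("feature values must be nonnegative")
            if model_type == "bernoulli" and value not in (0, 1):
                raise ValueError("Bernoulli features must be 0 or 1")
            feature_sums[label][j] += value

    num_docs = sum(doc_counts.values())
    model = {}
    for label, n_c in doc_counts.items():
        # Smoothed class prior (an assumption of this sketch).
        log_prior = math.log(n_c + lam) - math.log(num_docs + lam * len(doc_counts))
        sums = feature_sums[label]
        if model_type == "multinomial":
            # theta_j = (term count + lam) / (all term counts + lam * num_features)
            denom = sum(sums.values()) + lam * num_features
        else:
            # theta_j = (docs containing term + lam) / (docs in class + 2 * lam)
            denom = n_c + 2.0 * lam
        log_theta = [math.log(sums.get(j, 0.0) + lam) - math.log(denom)
                     for j in range(num_features)]
        model[label] = (log_prior, log_theta)
    return model


def predict(model, features, model_type="multinomial"):
    """Return the label with the highest log-posterior score."""
    best_label, best_score = None, -math.inf
    for label, (log_prior, log_theta) in model.items():
        score = log_prior
        if model_type == "multinomial":
            # Only nonzero entries contribute, which is why sparsity helps.
            score += sum(v * log_theta[j] for j, v in features.items())
        else:
            # Bernoulli also scores absent terms through log(1 - theta_j).
            for j, lt in enumerate(log_theta):
                present = features.get(j, 0) > 0
                score += lt if present else math.log1p(-math.exp(lt))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy term-frequency documents: label 0.0 favors term 0, label 1.0 favors term 2.
docs = [
    (0.0, {0: 2.0, 1: 1.0}),
    (0.0, {0: 3.0}),
    (1.0, {2: 4.0}),
    (1.0, {1: 1.0, 2: 2.0}),
]
model = train_naive_bayes(docs, num_features=3, lam=1.0)
print(predict(model, {0: 1.0}), predict(model, {2: 3.0}))
```

With $\lambda > 0$, a term never seen with a label still receives a small nonzero probability, so one unseen term cannot force a class score to negative infinity. The multinomial score touches only the nonzero entries of the input, which is one way to see why sparse input vectors pay off.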

Examples

NaiveBayes implements multinomial naive Bayes. It takes an RDD of LabeledPoint, an optional smoothing parameter lambda, and an optional model type parameter (default is “multinomial”) as input, and outputs a NaiveBayesModel, which can be used for evaluation and prediction.

Refer to the NaiveBayes Scala docs and NaiveBayesModel Scala docs for details on the API.

import org.apache.spark.mllib.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val data = sc.textFile("data/mllib/sample_naive_bayes_data.txt")
val parsedData = data.map { line =>
  val parts = line.split(',')
  LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
}

// Split data into training (60%) and test (40%).
val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
val training = splits(0)
val test = splits(1)

val model = NaiveBayes.train(training, lambda = 1.0, modelType = "multinomial")

val predictionAndLabel = test.map(p => (model.predict(p.features), p.label))
val accuracy = 1.0 * predictionAndLabel.filter(x => x._1 == x._2).count() / test.count()

// Save and load model
model.save(sc, "target/tmp/myNaiveBayesModel")
val sameModel = NaiveBayesModel.load(sc, "target/tmp/myNaiveBayesModel")

Find full example code at "examples/src/main/scala/org/apache/spark/examples/mllib/NaiveBayesExample.scala" in the Spark repo.

NaiveBayes implements multinomial naive Bayes. It takes a Scala RDD of LabeledPoint and an optional smoothing parameter lambda as input, and outputs a NaiveBayesModel, which can be used for evaluation and prediction.

Refer to the NaiveBayes Java docs and NaiveBayesModel Java docs for details on the API.

import scala.Tuple2;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFunction;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.classification.NaiveBayes;
import org.apache.spark.mllib.classification.NaiveBayesModel;
import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;

// loadLibSVMFile expects LIBSVM-format input, so load the LIBSVM sample file
// rather than the comma-separated sample_naive_bayes_data.txt.
String path = "data/mllib/sample_libsvm_data.txt";
JavaRDD<LabeledPoint> inputData = MLUtils.loadLibSVMFile(jsc.sc(), path).toJavaRDD();
JavaRDD<LabeledPoint>[] tmp = inputData.randomSplit(new double[]{0.6, 0.4}, 12345);
JavaRDD<LabeledPoint> training = tmp[0]; // training set
JavaRDD<LabeledPoint> test = tmp[1]; // test set
final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
JavaPairRDD<Double, Double> predictionAndLabel =
  test.mapToPair(new PairFunction<LabeledPoint, Double, Double>() {
    @Override
    public Tuple2<Double, Double> call(LabeledPoint p) {
      return new Tuple2<Double, Double>(model.predict(p.features()), p.label());
    }
  });
double accuracy = predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
  @Override
  public Boolean call(Tuple2<Double, Double> pl) {
    return pl._1().equals(pl._2());
  }
}).count() / (double) test.count();

// Save and load model
model.save(jsc.sc(), "target/tmp/myNaiveBayesModel");
NaiveBayesModel sameModel = NaiveBayesModel.load(jsc.sc(), "target/tmp/myNaiveBayesModel");

Find full example code at "examples/src/main/java/org/apache/spark/examples/mllib/JavaNaiveBayesExample.java" in the Spark repo.

NaiveBayes implements multinomial naive Bayes. It takes an RDD of LabeledPoint and an optional smoothing parameter lambda as input, and outputs a NaiveBayesModel, which can be used for evaluation and prediction.

Note that older releases of the Python API did not support model save/load. The example below calls save and load, so it assumes a release in which NaiveBayesModel provides them.

Refer to the NaiveBayes Python docs and NaiveBayesModel Python docs for more details on the API.

from pyspark.mllib.classification import NaiveBayes, NaiveBayesModel
from pyspark.mllib.linalg import Vectors
from pyspark.mllib.regression import LabeledPoint


def parseLine(line):
    # Each line has the form "label,f1 f2 f3".
    parts = line.split(',')
    label = float(parts[0])
    features = Vectors.dense([float(x) for x in parts[1].split(' ')])
    return LabeledPoint(label, features)

data = sc.textFile('data/mllib/sample_naive_bayes_data.txt').map(parseLine)

# Split data approximately into training (60%) and test (40%)
training, test = data.randomSplit([0.6, 0.4], seed=0)

# Train a naive Bayes model.
model = NaiveBayes.train(training, 1.0)

# Make prediction and test accuracy.
predictionAndLabel = test.map(lambda p: (model.predict(p.features), p.label))
# Index into the tuple: tuple-unpacking lambda parameters are not valid in Python 3.
accuracy = 1.0 * predictionAndLabel.filter(lambda pl: pl[0] == pl[1]).count() / test.count()

# Save and load model
model.save(sc, "target/tmp/myNaiveBayesModel")
sameModel = NaiveBayesModel.load(sc, "target/tmp/myNaiveBayesModel")

Find full example code at "examples/src/main/python/mllib/naive_bayes_example.py" in the Spark repo.
