author     Sean Owen <sowen@cloudera.com>        2014-05-06 20:07:22 -0700
committer  Patrick Wendell <pwendell@gmail.com>  2014-05-06 20:07:22 -0700
commit     25ad8f93012730115a8a1fac649fe3e842c045b3 (patch)
tree       6bc0dfec7014289e39f4c5c9070ed121e00c4398
parent     a000b5c3b0438c17e9973df4832c320210c29c27 (diff)
SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs
While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.

Author: Sean Owen <sowen@cloudera.com>

Closes #653 from srowen/SPARK-1727 and squashes the following commits:

6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
99966a9 [Sean Owen] Update issue tracker URL in docs
23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
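Note on the "mean instead of sum and count" item: several examples below compute mean squared error. A minimal, self-contained sketch of the change (not taken from the patch; the data here is made up, and the RDD name `valuesAndPreds` simply matches the touched examples):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._  // brings in mean() on RDD[Double] in Spark 1.0-era APIs

    val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("mse-sketch"))
    // Hypothetical (label, prediction) pairs.
    val valuesAndPreds = sc.parallelize(Seq((1.0, 0.9), (0.0, 0.2), (1.0, 0.8)))
    // New form used in the docs: mean() over the squared errors...
    val MSE = valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.mean()
    // ...replacing the hand-rolled version:
    //   valuesAndPreds.map { case (v, p) => math.pow(v - p, 2) }.reduce(_ + _) / valuesAndPreds.count
    println("Mean Squared Error = " + MSE)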
-rw-r--r--  docs/README.md                              9
-rw-r--r--  docs/_config.yml                            2
-rw-r--r--  docs/bagel-programming-guide.md             2
-rw-r--r--  docs/cluster-overview.md                    2
-rw-r--r--  docs/configuration.md                      10
-rw-r--r--  docs/java-programming-guide.md             20
-rw-r--r--  docs/mllib-basics.md                       14
-rw-r--r--  docs/mllib-clustering.md                    4
-rw-r--r--  docs/mllib-collaborative-filtering.md       2
-rw-r--r--  docs/mllib-decision-tree.md                 8
-rw-r--r--  docs/mllib-dimensionality-reduction.md      7
-rw-r--r--  docs/mllib-guide.md                         2
-rw-r--r--  docs/mllib-linear-methods.md               13
-rw-r--r--  docs/mllib-naive-bayes.md                  48
-rw-r--r--  docs/scala-programming-guide.md             9
-rw-r--r--  docs/sql-programming-guide.md               1
-rw-r--r--  mllib/data/sample_naive_bayes_data.txt     12
17 files changed, 97 insertions, 68 deletions
diff --git a/docs/README.md b/docs/README.md
index 75b1811ba9..f1eb644f93 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -14,9 +14,10 @@ The markdown code can be compiled to HTML using the
[Jekyll tool](http://jekyllrb.com).
To use the `jekyll` command, you will need to have Jekyll installed.
The easiest way to do this is via a Ruby Gem, see the
-[jekyll installation instructions](http://jekyllrb.com/docs/installation).
-Compiling the site with Jekyll will create a directory called
-_site containing index.html as well as the rest of the compiled files.
+[jekyll installation instructions](http://jekyllrb.com/docs/installation).
+If not already installed, you need to install `kramdown` with `sudo gem install kramdown`.
+Execute `jekyll` from the `docs/` directory. Compiling the site with Jekyll will create a directory called
+`_site` containing index.html as well as the rest of the compiled files.
You can modify the default Jekyll build as follows:
@@ -44,6 +45,6 @@ You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PR
Similarly, you can build just the PySpark epydoc by running `epydoc --config epydoc.conf` from the SPARK_PROJECT_ROOT/pyspark directory. Documentation is only generated for classes that are listed as public in `__init__.py`.
-When you run `jekyll` in the docs directory, it will also copy over the scaladoc for the various Spark subprojects into the docs directory (and then also into the _site directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
+When you run `jekyll` in the `docs` directory, it will also copy over the scaladoc for the various Spark subprojects into the `docs` directory (and then also into the `_site` directory). We use a jekyll plugin to run `sbt/sbt doc` before building the site so if you haven't run it (recently) it may take some time as it generates all of the scaladoc. The jekyll plugin also generates the PySpark docs using [epydoc](http://epydoc.sourceforge.net/).
NOTE: To skip the step of building and copying over the Scala and Python API docs, run `SKIP_API=1 jekyll`.
diff --git a/docs/_config.yml b/docs/_config.yml
index d585b8c5ea..d177e38f88 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -8,5 +8,5 @@ SPARK_VERSION_SHORT: 1.0.0
SCALA_BINARY_VERSION: "2.10"
SCALA_VERSION: "2.10.4"
MESOS_VERSION: 0.13.0
-SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
+SPARK_ISSUE_TRACKER_URL: https://issues.apache.org/jira/browse/SPARK
SPARK_GITHUB_URL: https://github.com/apache/spark
diff --git a/docs/bagel-programming-guide.md b/docs/bagel-programming-guide.md
index da6d0c9dcd..14f43cb6d3 100644
--- a/docs/bagel-programming-guide.md
+++ b/docs/bagel-programming-guide.md
@@ -46,7 +46,7 @@ import org.apache.spark.bagel.Bagel._
Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
{% highlight scala %}
-val input = sc.textFile("pagerank_data.txt")
+val input = sc.textFile("data/pagerank_data.txt")
val numVerts = input.count()
diff --git a/docs/cluster-overview.md b/docs/cluster-overview.md
index 79b0061e2c..162c415b58 100644
--- a/docs/cluster-overview.md
+++ b/docs/cluster-overview.md
@@ -181,7 +181,7 @@ The following table summarizes terms you'll see used to refer to cluster concept
<td>Distinguishes where the driver process runs. In "cluster" mode, the framework launches
the driver inside of the cluster. In "client" mode, the submitter launches the driver
outside of the cluster.</td>
- <tr>
+ </tr>
<tr>
<td>Worker node</td>
<td>Any node that can run application code in the cluster</td>
diff --git a/docs/configuration.md b/docs/configuration.md
index d6f316ba5f..5b034e3cb3 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -26,10 +26,10 @@ application name), as well as arbitrary key-value pairs through the `set()` meth
initialize an application as follows:
{% highlight scala %}
-val conf = new SparkConf()
- .setMaster("local")
- .setAppName("My application")
- .set("spark.executor.memory", "1g")
+val conf = new SparkConf().
+ setMaster("local").
+ setAppName("My application").
+ set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}
@@ -318,7 +318,7 @@ Apart from these, the following properties are also available, and may be useful
When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
objects to prevent writing redundant data, however that stops garbage collection of those
objects. By calling 'reset' you flush that info from the serializer, and allow old
- objects to be collected. To turn off this periodic reset set it to a value of <= 0.
+ objects to be collected. To turn off this periodic reset set it to a value &lt;= 0.
By default it will reset the serializer every 10,000 objects.
</td>
</tr>
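A note on the `SparkConf` hunk above: the dots move from the start of each continuation line to the end of the previous one. The commit does not say why, but a plausible reason is that a trailing dot tells the Scala 2.10 REPL the expression continues, so the snippet can be pasted line by line. Assembled, the updated example reads:

    import org.apache.spark.{SparkConf, SparkContext}

    // Trailing-dot chaining, as in the updated configuration.md example.
    val conf = new SparkConf().
      setMaster("local").
      setAppName("My application").
      set("spark.executor.memory", "1g")
    val sc = new SparkContext(conf)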
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 07c8512bf9..c34eb28fc0 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -55,7 +55,7 @@ classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc (this acheives the "same-result-type" principle used by the [Scala collections
+etc (this achieves the "same-result-type" principle used by the [Scala collections
framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
## Function Interfaces
@@ -102,7 +102,7 @@ the following changes:
`Function` classes will need to use `implements` rather than `extends`.
* Certain transformation functions now have multiple versions depending
on the return type. In Spark core, the map functions (`map`, `flatMap`, and
- `mapPartitons`) have type-specific versions, e.g.
+ `mapPartitions`) have type-specific versions, e.g.
[`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
@@ -115,11 +115,11 @@ As an example, we will implement word count using the Java API.
import org.apache.spark.api.java.*;
import org.apache.spark.api.java.function.*;
-JavaSparkContext sc = new JavaSparkContext(...);
-JavaRDD<String> lines = ctx.textFile("hdfs://...");
+JavaSparkContext jsc = new JavaSparkContext(...);
+JavaRDD<String> lines = jsc.textFile("hdfs://...");
JavaRDD<String> words = lines.flatMap(
new FlatMapFunction<String, String>() {
- public Iterable<String> call(String s) {
+ @Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
}
@@ -140,10 +140,10 @@ Here, the `FlatMapFunction` was created inline; another option is to subclass
{% highlight java %}
class Split extends FlatMapFunction<String, String> {
- public Iterable<String> call(String s) {
+ @Override public Iterable<String> call(String s) {
return Arrays.asList(s.split(" "));
}
-);
+}
JavaRDD<String> words = lines.flatMap(new Split());
{% endhighlight %}
@@ -162,8 +162,8 @@ Continuing with the word count example, we map each word to a `(word, 1)` pair:
import scala.Tuple2;
JavaPairRDD<String, Integer> ones = words.mapToPair(
new PairFunction<String, String, Integer>() {
- public Tuple2<String, Integer> call(String s) {
- return new Tuple2(s, 1);
+ @Override public Tuple2<String, Integer> call(String s) {
+ return new Tuple2<String, Integer>(s, 1);
}
}
);
@@ -178,7 +178,7 @@ occurrences of each word:
{% highlight java %}
JavaPairRDD<String, Integer> counts = ones.reduceByKey(
new Function2<Integer, Integer, Integer>() {
- public Integer call(Integer i1, Integer i2) {
+ @Override public Integer call(Integer i1, Integer i2) {
return i1 + i2;
}
}
diff --git a/docs/mllib-basics.md b/docs/mllib-basics.md
index 710ce1721f..704308802d 100644
--- a/docs/mllib-basics.md
+++ b/docs/mllib-basics.md
@@ -9,7 +9,7 @@ title: <a href="mllib-guide.html">MLlib</a> - Basics
MLlib supports local vectors and matrices stored on a single machine,
as well as distributed matrices backed by one or more RDDs.
In the current implementation, local vectors and matrices are simple data models
-to serve public interfaces. The underly linear algebra operations are provided by
+to serve public interfaces. The underlying linear algebra operations are provided by
[Breeze](http://www.scalanlp.org/) and [jblas](http://jblas.org/).
A training example used in supervised learning is called "labeled point" in MLlib.
@@ -205,7 +205,7 @@ import org.apache.spark.mllib.regression.LabeledPoint;
import org.apache.spark.mllib.util.MLUtils;
import org.apache.spark.rdd.RDDimport;
-RDD[LabeledPoint] training = MLUtils.loadLibSVMData(sc, "mllib/data/sample_libsvm_data.txt")
+RDD<LabeledPoint> training = MLUtils.loadLibSVMData(jsc, "mllib/data/sample_libsvm_data.txt");
{% endhighlight %}
</div>
</div>
@@ -307,6 +307,7 @@ A [`RowMatrix`](api/mllib/index.html#org.apache.spark.mllib.linalg.distributed.R
created from a `JavaRDD<Vector>` instance. Then we can compute its column summary statistics.
{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
@@ -348,10 +349,10 @@ val mat: RowMatrix = ... // a RowMatrix
val summary: MultivariateStatisticalSummary = mat.computeColumnSummaryStatistics()
println(summary.mean) // a dense vector containing the mean value for each column
println(summary.variance) // column-wise variance
-println(summary.numNonzers) // number of nonzeros in each column
+println(summary.numNonzeros) // number of nonzeros in each column
// Compute the covariance matrix.
-val Cov: Matrix = mat.computeCovariance()
+val cov: Matrix = mat.computeCovariance()
{% endhighlight %}
</div>
</div>
@@ -397,11 +398,12 @@ wrapper over `(long, Vector)`. An `IndexedRowMatrix` can be converted to a `Row
its row indices.
{% highlight java %}
+import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.IndexedRow;
import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.RowMatrix;
-JavaRDD[IndexedRow] rows = ... // a JavaRDD of indexed rows
+JavaRDD<IndexedRow> rows = ... // a JavaRDD of indexed rows
// Create an IndexedRowMatrix from a JavaRDD<IndexedRow>.
IndexedRowMatrix mat = new IndexedRowMatrix(rows.rdd());
@@ -458,7 +460,9 @@ wrapper over `(long, long, double)`. A `CoordinateMatrix` can be converted to a
with sparse rows by calling `toIndexedRowMatrix`.
{% highlight scala %}
+import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.linalg.distributed.CoordinateMatrix;
+import org.apache.spark.mllib.linalg.distributed.IndexedRowMatrix;
import org.apache.spark.mllib.linalg.distributed.MatrixEntry;
JavaRDD<MatrixEntry> entries = ... // a JavaRDD of matrix entries
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index b3293afe40..276868fa84 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -18,7 +18,7 @@ models are trained for each cluster).
MLlib supports
[k-means](http://en.wikipedia.org/wiki/K-means_clustering) clustering, one of
the most commonly used clustering algorithms that clusters the data points into
-predfined number of clusters. The MLlib implementation includes a parallelized
+predefined number of clusters. The MLlib implementation includes a parallelized
variant of the [k-means++](http://en.wikipedia.org/wiki/K-means%2B%2B) method
called [kmeans||](http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf).
The implementation in MLlib has the following parameters:
@@ -30,7 +30,7 @@ initialization via k-means\|\|.
* *runs* is the number of times to run the k-means algorithm (k-means is not
guaranteed to find a globally optimal solution, and when run multiple times on
a given dataset, the algorithm returns the best clustering result).
-* *initializiationSteps* determines the number of steps in the k-means\|\| algorithm.
+* *initializationSteps* determines the number of steps in the k-means\|\| algorithm.
* *epsilon* determines the distance threshold within which we consider k-means to have converged.
## Examples
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 79f5e3a7ca..f486c56e55 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -77,7 +77,7 @@ val ratesAndPreds = ratings.map{
}.join(predictions)
val MSE = ratesAndPreds.map{
case ((user, product), (r1, r2)) => math.pow((r1- r2), 2)
-}.reduce(_ + _)/ratesAndPreds.count
+}.mean()
println("Mean Squared Error = " + MSE)
{% endhighlight %}
diff --git a/docs/mllib-decision-tree.md b/docs/mllib-decision-tree.md
index 0693766990..296277e58b 100644
--- a/docs/mllib-decision-tree.md
+++ b/docs/mllib-decision-tree.md
@@ -83,19 +83,19 @@ Section 9.2.4 in
[Elements of Statistical Machine Learning](http://statweb.stanford.edu/~tibs/ElemStatLearn/) for
details). For example, for a binary classification problem with one categorical feature with three
categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical
-features are orded as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
+features are ordered as A followed by C followed B or A, B, C. The two split candidates are A \| C, B
and A , B \| C where \| denotes the split.
### Stopping rule
The recursive tree construction is stopped at a node when one of the two conditions is met:
-1. The node depth is equal to the `maxDepth` training parammeter
+1. The node depth is equal to the `maxDepth` training parameter
2. No split candidate leads to an information gain at the node.
### Practical limitations
-1. The tree implementation stores an Array[Double] of size *O(#features \* #splits \* 2^maxDepth)*
+1. The tree implementation stores an `Array[Double]` of size *O(#features \* #splits \* 2^maxDepth)*
in memory for aggregating histograms over partitions. The current implementation might not scale
to very deep trees since the memory requirement grows exponentially with tree depth.
2. The implemented algorithm reads both sparse and dense data. However, it is not optimized for
@@ -178,7 +178,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
-val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.reduce(_ + _)/valuesAndPreds.count
+val MSE = valuesAndPreds.map{ case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
</div>
diff --git a/docs/mllib-dimensionality-reduction.md b/docs/mllib-dimensionality-reduction.md
index 4e9ecf7c00..ab24663cfe 100644
--- a/docs/mllib-dimensionality-reduction.md
+++ b/docs/mllib-dimensionality-reduction.md
@@ -44,6 +44,10 @@ say, less than $1000$, but many rows, which we call *tall-and-skinny*.
<div class="codetabs">
<div data-lang="scala" markdown="1">
{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+import org.apache.spark.mllib.linalg.SingularValueDecomposition
+
val mat: RowMatrix = ...
// Compute the top 20 singular values and corresponding singular vectors.
@@ -74,6 +78,9 @@ and use them to project the vectors into a low-dimensional space.
The number of columns should be small, e.g, less than 1000.
{% highlight scala %}
+import org.apache.spark.mllib.linalg.Matrix
+import org.apache.spark.mllib.linalg.distributed.RowMatrix
+
val mat: RowMatrix = ...
// Compute the top 10 principal components.
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index c49f857d07..842ca5c8c6 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -94,7 +94,7 @@ import org.apache.spark.mllib.linalg.Vector;
import org.apache.spark.mllib.linalg.Vectors;
double[] array = ... // a double array
-Vector vector = Vectors.dense(array) // a dense vector
+Vector vector = Vectors.dense(array); // a dense vector
{% endhighlight %}
[`Vectors`](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) provides factory methods to
diff --git a/docs/mllib-linear-methods.md b/docs/mllib-linear-methods.md
index ebb555f974..40b7a7f807 100644
--- a/docs/mllib-linear-methods.md
+++ b/docs/mllib-linear-methods.md
@@ -63,7 +63,7 @@ methods MLlib supports:
<tbody>
<tr>
<td>hinge loss</td><td>$\max \{0, 1-y \wv^T \x \}, \quad y \in \{-1, +1\}$</td>
- <td>$\begin{cases}-y \cdot \x & \text{if $y \wv^T \x <1$}, \\ 0 &
+ <td>$\begin{cases}-y \cdot \x &amp; \text{if $y \wv^T \x &lt;1$}, \\ 0 &amp;
\text{otherwise}.\end{cases}$</td>
</tr>
<tr>
@@ -225,10 +225,11 @@ algorithm for 200 iterations.
import org.apache.spark.mllib.optimization.L1Updater
val svmAlg = new SVMWithSGD()
-svmAlg.optimizer.setNumIterations(200)
- .setRegParam(0.1)
- .setUpdater(new L1Updater)
-val modelL1 = svmAlg.run(parsedData)
+svmAlg.optimizer.
+ setNumIterations(200).
+ setRegParam(0.1).
+ setUpdater(new L1Updater)
+val modelL1 = svmAlg.run(training)
{% endhighlight %}
Similarly, you can use replace `SVMWithSGD` by
@@ -322,7 +323,7 @@ val valuesAndPreds = parsedData.map { point =>
val prediction = model.predict(point.features)
(point.label, prediction)
}
-val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.reduce(_ + _) / valuesAndPreds.count
+val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
println("training Mean Squared Error = " + MSE)
{% endhighlight %}
diff --git a/docs/mllib-naive-bayes.md b/docs/mllib-naive-bayes.md
index 6160fe5b2f..c47508b7da 100644
--- a/docs/mllib-naive-bayes.md
+++ b/docs/mllib-naive-bayes.md
@@ -7,13 +7,13 @@ Naive Bayes is a simple multiclass classification algorithm with the assumption
between every pair of features. Naive Bayes can be trained very efficiently. Within a single pass to
the training data, it computes the conditional probability distribution of each feature given label,
and then it applies Bayes' theorem to compute the conditional probability distribution of label
-given an observation and use it for prediction. For more details, please visit the wikipedia page
+given an observation and use it for prediction. For more details, please visit the Wikipedia page
[Naive Bayes classifier](http://en.wikipedia.org/wiki/Naive_Bayes_classifier).
In MLlib, we implemented multinomial naive Bayes, which is typically used for document
classification. Within that context, each observation is a document, each feature represents a term,
-whose value is the frequency of the term. For its formulation, please visit the wikipedia page
-[Multinomial naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
+whose value is the frequency of the term. For its formulation, please visit the Wikipedia page
+[Multinomial Naive Bayes](http://en.wikipedia.org/wiki/Naive_Bayes_classifier#Multinomial_naive_Bayes)
or the section
[Naive Bayes text classification](http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html)
from the book Introduction to Information
@@ -36,9 +36,18 @@ can be used for evaluation and prediction.
{% highlight scala %}
import org.apache.spark.mllib.classification.NaiveBayes
-
-val training: RDD[LabeledPoint] = ... // training set
-val test: RDD[LabeledPoint] = ... // test set
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.mllib.regression.LabeledPoint
+
+val data = sc.textFile("mllib/data/sample_naive_bayes_data.txt")
+val parsedData = data.map { line =>
+ val parts = line.split(',')
+ LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
+}
+// Split data into training (60%) and test (40%).
+val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
+val training = splits(0)
+val test = splits(1)
val model = NaiveBayes.train(training, lambda = 1.0)
val prediction = model.predict(test.map(_.features))
@@ -58,29 +67,36 @@ optionally smoothing parameter `lambda` as input, and output a
can be used for evaluation and prediction.
{% highlight java %}
+import org.apache.spark.api.java.JavaPairRDD;
+import org.apache.spark.api.java.JavaRDD;
+import org.apache.spark.api.java.function.Function;
import org.apache.spark.mllib.classification.NaiveBayes;
+import org.apache.spark.mllib.classification.NaiveBayesModel;
+import org.apache.spark.mllib.regression.LabeledPoint;
+import scala.Tuple2;
JavaRDD<LabeledPoint> training = ... // training set
JavaRDD<LabeledPoint> test = ... // test set
-NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
+final NaiveBayesModel model = NaiveBayes.train(training.rdd(), 1.0);
-JavaRDD<Double> prediction = model.predict(test.map(new Function<LabeledPoint, Vector>() {
- public Vector call(LabeledPoint p) {
- return p.features();
+JavaRDD<Double> prediction =
+ test.map(new Function<LabeledPoint, Double>() {
+ @Override public Double call(LabeledPoint p) {
+ return model.predict(p.features());
}
- })
+ });
JavaPairRDD<Double, Double> predictionAndLabel =
prediction.zip(test.map(new Function<LabeledPoint, Double>() {
- public Double call(LabeledPoint p) {
+ @Override public Double call(LabeledPoint p) {
return p.label();
}
- })
+ }));
double accuracy = 1.0 * predictionAndLabel.filter(new Function<Tuple2<Double, Double>, Boolean>() {
- public Boolean call(Tuple2<Double, Double> pl) {
+ @Override public Boolean call(Tuple2<Double, Double> pl) {
return pl._1() == pl._2();
}
- }).count() / test.count()
+ }).count() / test.count();
{% endhighlight %}
</div>
@@ -93,7 +109,7 @@ smoothing parameter `lambda` as input, and output a
[NaiveBayesModel](api/pyspark/pyspark.mllib.classification.NaiveBayesModel-class.html), which can be
used for evaluation and prediction.
-<!--- TODO: Make Python's example consistent with Scala's and Java's. --->
+<!-- TODO: Make Python's example consistent with Scala's and Java's. -->
{% highlight python %}
from pyspark.mllib.regression import LabeledPoint
from pyspark.mllib.classification import NaiveBayes
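Assembled, the new Scala Naive Bayes example from the hunk above reads roughly as follows (assuming an existing `sc: SparkContext`, e.g. the Spark shell; the evaluation code that follows it in the doc is unchanged and omitted here):

    import org.apache.spark.mllib.classification.NaiveBayes
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint

    // Each line of the sample file is "label,f1 f2 f3".
    val data = sc.textFile("mllib/data/sample_naive_bayes_data.txt")
    val parsedData = data.map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }
    // Split data into training (60%) and test (40%).
    val splits = parsedData.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0)
    val test = splits(1)

    val model = NaiveBayes.train(training, lambda = 1.0)
    val prediction = model.predict(test.map(_.features))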
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index b8d89cf00f..e7ceaa22c3 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -48,12 +48,12 @@ how to access a cluster. To create a `SparkContext` you first need to build a `S
that contains information about your application.
{% highlight scala %}
-val conf = new SparkConf().setAppName(<app name>).setMaster(<master>)
+val conf = new SparkConf().setAppName(appName).setMaster(master)
new SparkContext(conf)
{% endhighlight %}
-The `<master>` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
-to connect to, or a special "local" string to run in local mode, as described below. `<app name>` is
+The `master` parameter is a string specifying a [Spark, Mesos or YARN cluster URL](#master-urls)
+to connect to, or a special "local" string to run in local mode, as described below. `appName` is
a name for your application, which will be shown in the cluster web UI. It's also possible to set
these variables [using a configuration file](cluster-overview.html#loading-configurations-from-a-file)
which avoids hard-coding the master name in your application.
@@ -81,9 +81,8 @@ The master URL passed to Spark can be in one of the following formats:
<table class="table">
<tr><th>Master URL</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
-<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
+<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine). </td></tr>
<tr><td> local[*] </td><td> Run Spark locally with as many worker threads as logical cores on your machine.</td></tr>
-</td></tr>
<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 0c743c9d60..8a785450ad 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -416,3 +416,4 @@ results = hiveCtx.hql("FROM src SELECT key, value").collect()
{% endhighlight %}
</div>
+</div>
diff --git a/mllib/data/sample_naive_bayes_data.txt b/mllib/data/sample_naive_bayes_data.txt
index f874adbaf4..981da382d6 100644
--- a/mllib/data/sample_naive_bayes_data.txt
+++ b/mllib/data/sample_naive_bayes_data.txt
@@ -1,6 +1,6 @@
-0, 1 0 0
-0, 2 0 0
-1, 0 1 0
-1, 0 2 0
-2, 0 0 1
-2, 0 0 2
+0,1 0 0
+0,2 0 0
+1,0 1 0
+1,0 2 0
+2,0 0 1
+2,0 0 2
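Why the sample data reformat matters: the new Scala example splits each line on a comma and then splits the features on single spaces, so the old "0, 1 0 0" format leaves a leading empty token that fails `.toDouble`. A quick check (my illustration, not part of the patch):

    // Old format: a space follows the comma, so the feature string starts with a space.
    " 1 0 0".split(' ')   // Array("", "1", "0", "0") -- the empty first token throws on .toDouble
    // New format parses cleanly.
    "1 0 0".split(' ')    // Array("1", "0", "0")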