Merge pull request #380 from mateiz/py-bayes

Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)
author: Patrick Wendell <pwendell@gmail.com> 2014-01-13 23:08:26 -0800
committer: Patrick Wendell <pwendell@gmail.com> 2014-01-13 23:08:26 -0800
commit: fdaabdc67387524ffb84354f87985f48bd31cf60 (patch)
tree: eb7c3473f653c55b4018e73bf408f73e6d49462a /docs
parent: 4a805aff5e381752afb2bfd579af908d623743ed (diff)
parent: cc93c2abb1a44f230d2951981fdfc2fe8e7df46f (diff)
download: spark-fdaabdc67387524ffb84354f87985f48bd31cf60.tar.gz
spark-fdaabdc67387524ffb84354f87985f48bd31cf60.tar.bz2
spark-fdaabdc67387524ffb84354f87985f48bd31cf60.zip
3 files changed, 22 insertions, 7 deletions
diff --git a/docs/_config.yml b/docs/_config.yml
index 11d18f0ac2..ce0fdf5fb4 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -5,6 +5,6 @@ markdown: kramdown
 # of Spark, Scala, and Mesos.
 SPARK_VERSION: 0.9.0-incubating-SNAPSHOT
 SPARK_VERSION_SHORT: 0.9.0
-SCALA_VERSION: 2.10
+SCALA_VERSION: "2.10"
 MESOS_VERSION: 0.13.0
 SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 45ee166688..1a5c640d10 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -21,6 +21,8 @@ depends on native Fortran routines. You may need to install the
 if it is not already present on your nodes. MLlib will throw a linking error if it cannot 
 detect these libraries automatically.
 
+To use MLlib in Python, you will also need [NumPy](http://www.numpy.org) version 1.7 or newer.
+
 # Binary Classification
 
 Binary classification is a supervised learning problem in which we want to
@@ -316,6 +318,13 @@ other signals), you can use the trainImplicit method to get better results.
 val model = ALS.trainImplicit(ratings, 1, 20, 0.01)
 {% endhighlight %}
 
+# Using MLLib in Java
+
+All of MLlib's methods use Java-friendly types, so you can import and call them there the same
+way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
+Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
+calling `.rdd()` on your `JavaRDD` object.
+
 # Using MLLib in Python
 Following examples can be tested in the PySpark shell.
 
@@ -330,7 +339,7 @@ from numpy import array
 # Load and parse the data
 data = sc.textFile("mllib/data/sample_svm_data.txt")
 parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
-model = LogisticRegressionWithSGD.train(sc, parsedData)
+model = LogisticRegressionWithSGD.train(parsedData)
 
 # Build the model
 labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
@@ -356,7 +365,7 @@ data = sc.textFile("mllib/data/ridge-data/lpsa.data")
 parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
 
 # Build the model
-model = LinearRegressionWithSGD.train(sc, parsedData)
+model = LinearRegressionWithSGD.train(parsedData)
 
 # Evaluate the model on training data
 valuesAndPreds = parsedData.map(lambda point: (point.item(0),
@@ -382,7 +391,7 @@ data = sc.textFile("kmeans_data.txt")
 parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
 
 # Build the model (cluster the data)
-clusters = KMeans.train(sc, parsedData, 2, maxIterations=10,
+clusters = KMeans.train(parsedData, 2, maxIterations=10,
         runs=30, initialization_mode="random")
 
 # Evaluate clustering by computing Within Set Sum of Squared Errors
@@ -411,7 +420,7 @@ data = sc.textFile("mllib/data/als/test.data")
 ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))
 
 # Build the recommendation model using Alternating Least Squares
-model = ALS.train(sc, ratings, 1, 20)
+model = ALS.train(ratings, 1, 20)
 
 # Evaluate the model on training data
 testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
@@ -426,5 +435,5 @@ signals), you can use the trainImplicit method to get better results.
 
 {% highlight python %}
 # Build the recommendation model using Alternating Least Squares based on implicit ratings
-model = ALS.trainImplicit(sc, ratings, 1, 20)
+model = ALS.trainImplicit(ratings, 1, 20)
 {% endhighlight %}
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index c4236f8312..b07899c2e1 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -52,7 +52,7 @@ In addition, PySpark fully supports interactive use---simply run `./bin/pyspark`
 
 # Installing and Configuring PySpark
 
-PySpark requires Python 2.6 or higher.
+PySpark requires Python 2.7 or higher.
 PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.
 We have not tested PySpark with Python 3 or with alternative Python interpreters, such as [PyPy](http://pypy.org/) or [Jython](http://www.jython.org/).
 
@@ -149,6 +149,12 @@ sc = SparkContext(conf = conf)
 [API documentation](api/pyspark/index.html) for PySpark is available as Epydoc.
 Many of the methods also contain [doctests](http://docs.python.org/2/library/doctest.html) that provide additional usage examples.
 
+# Libraries
+
+[MLlib](mllib-guide.html) is also available in PySpark. To use it, you'll need
+[NumPy](http://www.numpy.org) version 1.7 or newer. The [MLlib guide](mllib-guide.html) contains
+some example applications.
+
 # Where to Go from Here
 
 PySpark also includes several sample programs in the [`python/examples` folder](https://github.com/apache/incubator-spark/tree/master/python/examples).
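
The Python hunks above all drop the leading `sc` argument from the `train` calls, which the commit message explains as getting the context from the RDD the user gives. A minimal sketch of that pattern follows, assuming nothing about the real implementation: `FakeContext`, `FakeRDD`, `train_old`, and `train_new` are hypothetical stand-ins written for illustration, and the `context` attribute name on the stand-in is an assumption, not a statement about PySpark internals.

```python
class FakeContext:
    """Hypothetical stand-in for a SparkContext (not the real class)."""

    def __init__(self, name):
        self.name = name


class FakeRDD:
    """Hypothetical stand-in for an RDD that remembers the context that created it."""

    def __init__(self, context, rows):
        self.context = context  # assumed attribute name, for illustration only
        self.rows = list(rows)


def train_old(sc, data):
    """Old call shape: the caller passes the context explicitly."""
    return {"context": sc.name, "num_rows": len(data.rows)}


def train_new(data):
    """New call shape: the context is recovered from the dataset itself."""
    sc = data.context
    return train_old(sc, data)


sc = FakeContext("demo")
rdd = FakeRDD(sc, [[1.0, 2.0], [0.0, 3.0]])

# Both call shapes produce the same result; the new one needs one argument fewer.
print(train_new(rdd) == train_old(sc, rdd))
```

Because the dataset already carries a reference to its context in this sketch, the extra parameter was redundant, and removing it also rules out passing a context that does not match the data.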