author | Patrick Wendell <pwendell@gmail.com> | 2014-01-13 23:08:26 -0800 |
---|---|---|
committer | Patrick Wendell <pwendell@gmail.com> | 2014-01-13 23:08:26 -0800 |
commit | fdaabdc67387524ffb84354f87985f48bd31cf60 | |
tree | eb7c3473f653c55b4018e73bf408f73e6d49462a /docs | |
parent | 4a805aff5e381752afb2bfd579af908d623743ed | |
parent | cc93c2abb1a44f230d2951981fdfc2fe8e7df46f | |
Merge pull request #380 from mateiz/py-bayes

Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
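The third bullet (dropping the explicit SparkContext argument) can be sketched in plain Python. This is a minimal illustration, not PySpark code: `StubContext`, `StubRDD`, `train_old`, `train_new`, and the `ctx` attribute are hypothetical names chosen for this sketch, on the assumption that a dataset object keeps a reference to the context that created it.

```python
class StubContext:
    """Hypothetical stand-in for a SparkContext; it only carries a name."""

    def __init__(self, name):
        self.name = name


class StubRDD:
    """Hypothetical stand-in for an RDD that remembers its creating context."""

    def __init__(self, ctx, rows):
        self.ctx = ctx          # assumption: the dataset exposes its context
        self.rows = list(rows)


def train_old(sc, data):
    """Old call style: the caller passes the context explicitly."""
    return {"context": sc.name, "rows": len(data.rows)}


def train_new(data):
    """New call style: the context is read from the dataset itself."""
    return train_old(data.ctx, data)


sc = StubContext("local")
points = StubRDD(sc, [[1.0, 0.5], [0.0, 1.5]])

# Both call styles see the same context and the same data.
assert train_new(points) == train_old(sc, points)
```

Because the context travels with the data, the caller cannot accidentally pair a dataset with a different context than the one that produced it.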
Diffstat (limited to 'docs')
-rw-r--r-- | docs/_config.yml | 2 |
-rw-r--r-- | docs/mllib-guide.md | 19 |
-rw-r--r-- | docs/python-programming-guide.md | 8 |
3 files changed, 22 insertions, 7 deletions
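The `docs/_config.yml` entry counted above covers a one-line change that quotes the Scala version. The commit message does not say why. A plausible reason, offered here as an assumption, is that an unquoted `2.10` is a YAML floating-point scalar and loses its trailing zero, while `"2.10"` stays a string. The sketch below mimics that effect with Python's built-in `float` and `str` rather than a YAML parser.

```python
# Assumption: an unquoted scalar such as 2.10 is treated as a number,
# while a quoted one is kept as text. Python's float() stands in for
# the numeric interpretation here; no YAML library is involved.
unquoted = float("2.10")   # numeric reading of the version
quoted = "2.10"            # string reading of the version

rendered_unquoted = str(unquoted)  # the trailing zero is gone
rendered_quoted = quoted           # the text is preserved as written

print("unquoted renders as:", rendered_unquoted)
print("quoted renders as:  ", rendered_quoted)
```

Under that assumption, any documentation template that interpolates the version would show the wrong Scala release unless the value is quoted.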
diff --git a/docs/_config.yml b/docs/_config.yml
index 11d18f0ac2..ce0fdf5fb4 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -5,6 +5,6 @@ markdown: kramdown
 # of Spark, Scala, and Mesos.
 SPARK_VERSION: 0.9.0-incubating-SNAPSHOT
 SPARK_VERSION_SHORT: 0.9.0
-SCALA_VERSION: 2.10
+SCALA_VERSION: "2.10"
 MESOS_VERSION: 0.13.0
 SPARK_ISSUE_TRACKER_URL: https://spark-project.atlassian.net
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 45ee166688..1a5c640d10 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -21,6 +21,8 @@ depends on native Fortran routines. You may need to install the
 if it is not already present on your nodes. MLlib will throw a linking error if it cannot
 detect these libraries automatically.
 
+To use MLlib in Python, you will also need [NumPy](http://www.numpy.org) version 1.7 or newer.
+
 # Binary Classification
 
 Binary classification is a supervised learning problem in which we want to
@@ -316,6 +318,13 @@ other signals), you can use the trainImplicit method to get better results.
 val model = ALS.trainImplicit(ratings, 1, 20, 0.01)
 {% endhighlight %}
 
+# Using MLLib in Java
+
+All of MLlib's methods use Java-friendly types, so you can import and call them there the same
+way you do in Scala. The only caveat is that the methods take Scala RDD objects, while the
+Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a Scala one by
+calling `.rdd()` on your `JavaRDD` object.
+
 # Using MLLib in Python
 Following examples can be tested in the PySpark shell.
 
@@ -330,7 +339,7 @@ from numpy import array
 # Load and parse the data
 data = sc.textFile("mllib/data/sample_svm_data.txt")
 parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
-model = LogisticRegressionWithSGD.train(sc, parsedData)
+model = LogisticRegressionWithSGD.train(parsedData)
 
 # Build the model
 labelsAndPreds = parsedData.map(lambda point: (int(point.item(0)),
@@ -356,7 +365,7 @@ data = sc.textFile("mllib/data/ridge-data/lpsa.data")
 parsedData = data.map(lambda line: array([float(x) for x in line.replace(',', ' ').split(' ')]))
 
 # Build the model
-model = LinearRegressionWithSGD.train(sc, parsedData)
+model = LinearRegressionWithSGD.train(parsedData)
 
 # Evaluate the model on training data
 valuesAndPreds = parsedData.map(lambda point: (point.item(0),
@@ -382,7 +391,7 @@ data = sc.textFile("kmeans_data.txt")
 parsedData = data.map(lambda line: array([float(x) for x in line.split(' ')]))
 
 # Build the model (cluster the data)
-clusters = KMeans.train(sc, parsedData, 2, maxIterations=10,
+clusters = KMeans.train(parsedData, 2, maxIterations=10,
         runs=30, initialization_mode="random")
 
 # Evaluate clustering by computing Within Set Sum of Squared Errors
@@ -411,7 +420,7 @@ data = sc.textFile("mllib/data/als/test.data")
 ratings = data.map(lambda line: array([float(x) for x in line.split(',')]))
 
 # Build the recommendation model using Alternating Least Squares
-model = ALS.train(sc, ratings, 1, 20)
+model = ALS.train(ratings, 1, 20)
 
 # Evaluate the model on training data
 testdata = ratings.map(lambda p: (int(p[0]), int(p[1])))
@@ -426,5 +435,5 @@ signals), you can use the trainImplicit method to get better results.
 
 {% highlight python %}
 # Build the recommendation model using Alternating Least Squares based on implicit ratings
-model = ALS.trainImplicit(sc, ratings, 1, 20)
+model = ALS.trainImplicit(ratings, 1, 20)
 {% endhighlight %}
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index c4236f8312..b07899c2e1 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -52,7 +52,7 @@ In addition, PySpark fully supports interactive use---simply run `./bin/pyspark`
 
 # Installing and Configuring PySpark
 
-PySpark requires Python 2.6 or higher.
+PySpark requires Python 2.7 or higher.
 PySpark applications are executed using a standard CPython interpreter in order to support Python modules that use C extensions.
 We have not tested PySpark with Python 3 or with alternative Python interpreters, such as [PyPy](http://pypy.org/) or [Jython](http://www.jython.org/).
 
@@ -149,6 +149,12 @@ sc = SparkContext(conf = conf)
 [API documentation](api/pyspark/index.html) for PySpark is available as Epydoc.
 Many of the methods also contain [doctests](http://docs.python.org/2/library/doctest.html) that provide additional usage examples.
 
+# Libraries
+
+[MLlib](mllib-guide.html) is also available in PySpark. To use it, you'll need
+[NumPy](http://www.numpy.org) version 1.7 or newer. The [MLlib guide](mllib-guide.html) contains
+some example applications.
+
 # Where to Go from Here
 
 PySpark also includes several sample programs in the [`python/examples` folder](https://github.com/apache/incubator-spark/tree/master/python/examples).
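Both guides now state a minimum of NumPy 1.7. A hedged sketch of how a script might check that requirement before using MLlib follows. `version_tuple` and `meets_minimum` are hypothetical helpers written for this illustration, not part of PySpark or NumPy. Components are compared as integers, since comparing the raw strings would rank `1.10` below `1.7`.

```python
import re


def version_tuple(version, parts=2):
    """Return the first `parts` numeric components of a version string as ints."""
    numbers = re.findall(r"\d+", version)
    return tuple(int(n) for n in numbers[:parts])


def meets_minimum(version, minimum=(1, 7)):
    """Return True if `version` is at least `minimum`, compared numerically."""
    return version_tuple(version, len(minimum)) >= minimum


# Plain string comparison gets this case wrong; integer tuples do not.
assert "1.10.2" < "1.7"
assert meets_minimum("1.10.2")

for candidate in ["1.6.2", "1.7.0", "1.10.2"]:
    status = "ok" if meets_minimum(candidate) else "too old"
    print(candidate, status)
```

In a real session the argument would typically be `numpy.__version__`.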