author    Josh Rosen <joshrosen@eecs.berkeley.edu>  2013-01-01 21:25:49 -0800
committer Josh Rosen <joshrosen@eecs.berkeley.edu>  2013-01-01 21:25:49 -0800
commit    ce9f1bbe20eff794cd1d588dc88f109d32588cfe (patch)
tree      ff840eea62e8314dc4cefcaa08534c4b21e544ba /docs/python-programming-guide.md
parent    b58340dbd9a741331fc4c3829b08c093560056c2 (diff)
Add `pyspark` script to replace the other scripts.
Expand the PySpark programming guide.
Diffstat (limited to 'docs/python-programming-guide.md')
-rw-r--r--  docs/python-programming-guide.md  49
1 file changed, 44 insertions(+), 5 deletions(-)
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index d88d4eb42d..d963551296 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -24,6 +24,35 @@ There are a few key differences between the Python and Scala APIs:
- `sample`
- `sort`
+In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types.
+Short functions can be passed to RDD methods using Python's [`lambda`](http://www.diveintopython.net/power_of_introspection/lambda_functions.html) syntax:
+
+{% highlight python %}
+logData = sc.textFile(logFile).cache()
+errors = logData.filter(lambda s: 'ERROR' in s.split())
+{% endhighlight %}
+
+You can also pass functions that are defined using the `def` keyword; this is useful for more complicated functions that cannot be expressed using `lambda`:
+
+{% highlight python %}
+def is_error(line):
+ return 'ERROR' in line.split()
+errors = logData.filter(is_error)
+{% endhighlight %}
+
+Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated to other tasks:
+
+{% highlight python %}
+error_keywords = ["Exception", "Error"]
+def is_error(line):
+ words = line.split()
+ return any(keyword in words for keyword in error_keywords)
+errors = logData.filter(is_error)
+{% endhighlight %}
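+
+For example, in the following minimal sketch each task works on its own serialized copy of `counts`, so the appends are never visible to other tasks or to the driver program:
+
+{% highlight python %}
+counts = []                          # list defined in the driver program
+def count_error(line):
+    if 'ERROR' in line.split():
+        counts.append(1)             # appends to this task's copy of the list
+    return line
+logData.map(count_error).count()     # run the tasks
+print(len(counts))                   # still 0 in the driver
+{% endhighlight %}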
+
+PySpark will automatically ship these functions to workers, along with any objects that they reference.
+Instances of classes will be serialized and shipped to workers by PySpark, but classes themselves cannot be automatically distributed to workers.
+The [Standalone Use](#standalone-use) section describes how to ship code dependencies to workers.
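+
+As a rough sketch of this distinction (the `matchers` module and `KeywordMatcher` class below are hypothetical), an instance created in the driver can be used inside a closure, but the module that defines its class must be made available on the workers, for example as described in [Standalone Use](#standalone-use):
+
+{% highlight python %}
+from matchers import KeywordMatcher               # hypothetical module shipped to workers
+
+matcher = KeywordMatcher(["Exception", "Error"])  # instance created in the driver
+# The `matcher` instance is serialized and sent with the closure, but the
+# KeywordMatcher class itself must be importable on every worker.
+errors = logData.filter(lambda line: matcher.matches(line))
+{% endhighlight %}
+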
# Installing and Configuring PySpark
@@ -34,13 +63,14 @@ By default, PySpark's scripts will run programs using `python`; an alternate Pyt
All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.net/), are bundled with PySpark and automatically imported.
-Standalone PySpark jobs should be run using the `run-pyspark` script, which automatically configures the Java and Python environmnt using the settings in `conf/spark-env.sh`.
+Standalone PySpark jobs should be run using the `pyspark` script, which automatically configures the Java and Python environment using the settings in `conf/spark-env.sh`.
The script automatically adds the `pyspark` package to the `PYTHONPATH`.
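
For example, assuming your job is saved in a file named `my_job.py` (a placeholder name), it could be launched with:

{% highlight shell %}
$ ./pyspark my_job.py
{% endhighlight %}
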
# Interactive Use
-PySpark's `pyspark-shell` script provides a simple way to learn the API:
+The `pyspark` script launches a Python interpreter that is configured to run PySpark jobs.
+When run without any input files, `pyspark` launches a shell that can be used to explore data interactively, which is a simple way to learn the API:
{% highlight python %}
>>> words = sc.textFile("/usr/share/dict/words")
@@ -48,9 +78,18 @@ PySpark's `pyspark-shell` script provides a simple way to learn the API:
[u'spar', u'sparable', u'sparada', u'sparadrap', u'sparagrass']
{% endhighlight %}
+By default, the `pyspark` shell creates a SparkContext that runs jobs locally.
+To connect to a non-local cluster, set the `MASTER` environment variable.
+For example, to use the `pyspark` shell with a [standalone Spark cluster](spark-standalone.html):
+
+{% highlight shell %}
+$ MASTER=spark://IP:PORT ./pyspark
+{% endhighlight %}
+
+
# Standalone Use
-PySpark can also be used from standalone Python scripts by creating a SparkContext in the script and running the script using the `run-pyspark` script in the `pyspark` directory.
+PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `pyspark`.
The Quick Start guide includes a [complete example](quick-start.html#a-standalone-job-in-python) of a standalone Python job.
Code dependencies can be deployed by listing them in the `pyFiles` option in the SparkContext constructor:
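
A minimal sketch of what such a call might look like (the file names are placeholders):

{% highlight python %}
from pyspark import SparkContext

# Ship helper code and a zipped library to each worker node.
sc = SparkContext("local", "Log Analyzer",
                  pyFiles=["my_helpers.py", "deps.zip"])
{% endhighlight %}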
@@ -65,8 +104,8 @@ Code dependencies can be added to an existing SparkContext using its `addPyFile(
# Where to Go from Here
-PySpark includes several sample programs using the Python API in `pyspark/examples`.
-You can run them by passing the files to the `pyspark-run` script included in PySpark -- for example `./pyspark-run examples/wordcount.py`.
+PySpark includes several sample programs using the Python API in `python/examples`.
+You can run them by passing the files to the `pyspark` script -- for example `./pyspark python/examples/wordcount.py`.
Each example program prints usage help when run without any arguments.
We currently provide [API documentation](api/pyspark/index.html) for the Python API, generated using Epydoc.