Diffstat (limited to 'docs')
-rw-r--r-- | docs/python-programming-guide.md | 49 |
-rw-r--r-- | docs/quick-start.md | 4 |
2 files changed, 46 insertions, 7 deletions
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index d88d4eb42d..d963551296 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -24,6 +24,35 @@ There are a few key differences between the Python and Scala APIs:
 - `sample`
 - `sort`
 
+In PySpark, RDDs support the same methods as their Scala counterparts but take Python functions and return Python collection types.
+Short functions can be passed to RDD methods using Python's [`lambda`](http://www.diveintopython.net/power_of_introspection/lambda_functions.html) syntax:
+
+{% highlight python %}
+logData = sc.textFile(logFile).cache()
+errors = logData.filter(lambda s: 'ERROR' in s.split())
+{% endhighlight %}
+
+You can also pass functions that are defined using the `def` keyword; this is useful for more complicated functions that cannot be expressed using `lambda`:
+
+{% highlight python %}
+def is_error(line):
+    return 'ERROR' in line.split()
+errors = logData.filter(is_error)
+{% endhighlight %}
+
+Functions can access objects in enclosing scopes, although modifications to those objects within RDD methods will not be propagated to other tasks:
+
+{% highlight python %}
+error_keywords = ["Exception", "Error"]
+def is_error(line):
+    words = line.split()
+    return any(keyword in words for keyword in error_keywords)
+errors = logData.filter(is_error)
+{% endhighlight %}
+
+PySpark will automatically ship these functions to workers, along with any objects that they reference.
+Instances of classes will be serialized and shipped to workers by PySpark, but classes themselves cannot be automatically distributed to workers.
+The [Standalone Use](#standalone-use) section describes how to ship code dependencies to workers.
 
 # Installing and Configuring PySpark
 
@@ -34,13 +63,14 @@ By default, PySpark's scripts will run programs using `python`; an alternate Pyt
 
 All of PySpark's library dependencies, including [Py4J](http://py4j.sourceforge.net/), are bundled with PySpark and automatically imported.
 
-Standalone PySpark jobs should be run using the `run-pyspark` script, which automatically configures the Java and Python environmnt using the settings in `conf/spark-env.sh`.
+Standalone PySpark jobs should be run using the `pyspark` script, which automatically configures the Java and Python environment using the settings in `conf/spark-env.sh`.
 The script automatically adds the `pyspark` package to the `PYTHONPATH`.
 
 
 # Interactive Use
 
-PySpark's `pyspark-shell` script provides a simple way to learn the API:
+The `pyspark` script launches a Python interpreter that is configured to run PySpark jobs.
+When run without any input files, `pyspark` launches a shell that can be used to explore data interactively, which is a simple way to learn the API:
 
 {% highlight python %}
 >>> words = sc.textFile("/usr/share/dict/words")
@@ -48,9 +78,18 @@ PySpark's `pyspark-shell` script provides a simple way to learn the API:
 [u'spar', u'sparable', u'sparada', u'sparadrap', u'sparagrass']
 {% endhighlight %}
 
+By default, the `pyspark` shell creates a SparkContext that runs jobs locally.
+To connect to a non-local cluster, set the `MASTER` environment variable.
+For example, to use the `pyspark` shell with a [standalone Spark cluster](spark-standalone.html):
+
+{% highlight shell %}
+$ MASTER=spark://IP:PORT ./pyspark
+{% endhighlight %}
+
+
 # Standalone Use
 
-PySpark can also be used from standalone Python scripts by creating a SparkContext in the script and running the script using the `run-pyspark` script in the `pyspark` directory.
+PySpark can also be used from standalone Python scripts by creating a SparkContext in your script and running the script using `pyspark`.
 The Quick Start guide includes a [complete example](quick-start.html#a-standalone-job-in-python) of a standalone Python job.
 
 Code dependencies can be deployed by listing them in the `pyFiles` option in the SparkContext constructor:
@@ -65,8 +104,8 @@ Code dependencies can be added to an existing SparkContext using its `addPyFile(
 
 # Where to Go from Here
 
-PySpark includes several sample programs using the Python API in `pyspark/examples`.
-You can run them by passing the files to the `pyspark-run` script included in PySpark -- for example `./pyspark-run examples/wordcount.py`.
+PySpark includes several sample programs using the Python API in `python/examples`.
+You can run them by passing the files to the `pyspark` script -- for example `./pyspark python/examples/wordcount.py`.
 Each example program prints usage help when run without any arguments.
 
 We currently provide [API documentation](api/pyspark/index.html) for the Python API as Epydoc.
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 8c25df5486..2c7cfbed25 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -258,11 +258,11 @@ We can pass Python functions to Spark, which are automatically serialized along
 For jobs that use custom classes or third-party libraries, we can add those code dependencies to SparkContext to ensure that they will be available on remote machines; this is described in more detail in the [Python programming guide](python-programming-guide).
 `SimpleJob` is simple enough that we do not need to specify any code dependencies.
 
-We can run this job using the `run-pyspark` script in `$SPARK_HOME/pyspark`:
+We can run this job using the `pyspark` script:
 
 {% highlight python %}
 $ cd $SPARK_HOME
-$ ./pyspark/run-pyspark SimpleJob.py
+$ ./pyspark SimpleJob.py
 ...
 Lines with a: 8422, Lines with b: 1836
 {% endhighlight python %}
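To make the standalone-job workflow described above concrete, a minimal script along those lines might look like the sketch below. It is illustrative only: the master URL, job name, log path, and `helpers.py` dependency are placeholder assumptions, and the `SparkContext(master, jobName, pyFiles=...)` constructor form is inferred from the guide's description of the `pyFiles` option rather than quoted from this diff.

{% highlight python %}
# my_job.py -- hypothetical standalone PySpark job, run with `./pyspark my_job.py`.
# Constructor arguments and file names below are illustrative assumptions.
from pyspark import SparkContext

# "helpers.py" stands in for any code dependency to ship to workers via pyFiles.
sc = SparkContext("local", "Error Count", pyFiles=["helpers.py"])

# Build an RDD from a (placeholder) log file and cache it for reuse.
logData = sc.textFile("/var/log/app.log").cache()

def is_error(line):
    # Same predicate style as the guide's examples: match whole words.
    return 'ERROR' in line.split()

# filter() ships is_error to the workers; count() triggers the job.
print("Lines with ERROR: %i" % logData.filter(is_error).count())
{% endhighlight %}

As one of the hunk headers above notes, dependencies can also be attached to an existing context with `addPyFile()`, for example `sc.addPyFile("helpers.py")`.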