# PySpark

PySpark is a Python API for Spark.

PySpark jobs are written in Python and executed using a standard Python interpreter; this supports modules that use Python C extensions. The API is based on the Spark Scala API and uses regular Python functions and lambdas to support user-defined functions. PySpark supports interactive use through a standard Python interpreter; it can automatically serialize closures and ship them to worker processes.

PySpark is built on top of the Spark Java API. Data is uniformly represented as serialized Python objects and stored in Spark Java processes, which communicate with PySpark worker processes over pipes.

## Features

PySpark supports most of the Spark API, including broadcast variables (see the broadcast sketch at the end of this README). RDDs are dynamically typed and can hold any Python object.

PySpark does not support:

- Special functions on RDDs of doubles
- Accumulators

## Examples and Documentation

The PySpark source contains docstrings and doctests that document its API. The public classes are in `context.py` and `rdd.py`.

The `pyspark/pyspark/examples` directory contains a few complete examples.

## Installing PySpark

PySpark requires a development version of Py4J, a Python library for interacting with Java processes. It can be installed from https://github.com/bartdag/py4j; make sure to install a version that contains at least the commits through b7924aabe9.

PySpark uses the `PYTHONPATH` environment variable to search for Python classes; Py4J should be on this path, along with any libraries used by PySpark programs. `PYTHONPATH` will be automatically shipped to worker machines, but the files that it points to must be present on each machine.

PySpark requires the Spark assembly JAR, which can be created by running `sbt/sbt assembly` in the Spark directory.

Additionally, `SPARK_HOME` should be set to the location of the Spark package.

## Running PySpark

The easiest way to run PySpark is to use the `run-pyspark` and `pyspark-shell` scripts, which are included in the `pyspark` directory. These scripts automatically load the `spark-conf.sh` file, set `SPARK_HOME`, and add the `pyspark` package to the `PYTHONPATH`.
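
As a quick smoke test of an installation, a small script in the spirit of the examples in `pyspark/pyspark/examples` can be launched through `run-pyspark`. The following word-count sketch is illustrative only: it assumes a `local` master, that `SparkContext` takes a master URL and a job name, and that the `textFile`, `flatMap`, `map`, `reduceByKey`, and `collect` operations shown here match the installed version of the API.

```python
import sys
from operator import add

from pyspark.context import SparkContext

if __name__ == "__main__":
    # "local" runs Spark in-process, which is convenient for a quick test.
    sc = SparkContext("local", "PythonWordCount")

    # Ordinary Python lambdas serve as the user-defined functions.
    lines = sc.textFile(sys.argv[1])
    counts = (lines.flatMap(lambda line: line.split(' '))
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))

    for (word, count) in counts.collect():
        print("%s: %i" % (word, count))
```

Assuming `run-pyspark` accepts a script path followed by its arguments, this could be invoked as something like `./run-pyspark wordcount.py <input-file>`; the exact invocation depends on where the script lives.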
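
The Features section above notes that broadcast variables are supported. The sketch below shows one way they might be used; it assumes that `SparkContext.broadcast` returns an object whose `value` attribute exposes the broadcast data on the workers, mirroring the Spark Scala API.

```python
from pyspark.context import SparkContext

sc = SparkContext("local", "BroadcastExample")

# Broadcast a read-only lookup table once instead of shipping it
# inside every serialized closure.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

# Worker-side lambdas read the broadcast data through .value.
keys = sc.parallelize(["a", "b", "c", "a"])
print(keys.map(lambda k: lookup.value[k]).collect())  # [1, 2, 3, 1]
```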