# PySpark

PySpark is a Python API for Spark. PySpark jobs are written in Python and executed using a standard Python interpreter; this supports modules that use Python C extensions. The API is based on the Spark Scala API and uses regular Python functions and lambdas to support user-defined functions. PySpark supports interactive use through a standard Python interpreter; it can automatically serialize closures and ship them to worker processes.

PySpark is built on top of the Spark Java API. Data is uniformly represented as serialized Python objects and stored in Spark Java processes, which communicate with PySpark worker processes over pipes.

## Features

PySpark supports most of the Spark API, including broadcast variables. RDDs are dynamically typed and can hold any Python object.

PySpark does not support:

- Special functions on RDDs of doubles
- Accumulators

## Examples and Documentation

The PySpark source contains docstrings and doctests that document its API. The public classes are in `context.py` and `rdd.py`.

The `pyspark/pyspark/examples` directory contains a few complete examples.

## Installing PySpark

To use PySpark, `SPARK_HOME` should be set to the location of the Spark package.

## Running PySpark

The easiest way to run PySpark is to use the `run-pyspark` and `pyspark-shell` scripts, which are included in the `pyspark` directory.
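
As a rough illustration of the programming style described above, a standalone job might look like the following sketch. The master URL `"local"`, the job name, and the sample data are placeholder assumptions, not values required by the package; the RDD operations, lambda-based user-defined functions, and broadcast variables are the features listed earlier.

```python
from pyspark.context import SparkContext

# Connect to a local Spark instance; the master URL and job name here
# are placeholders chosen for the example.
sc = SparkContext("local", "PySparkExample")

# RDDs are dynamically typed, so they can hold arbitrary Python objects.
nums = sc.parallelize([1, 2, 3, 4])

# User-defined functions are plain Python lambdas; PySpark serializes the
# closures and ships them to the worker processes automatically.
squares = nums.map(lambda x: x * x)
odd_squares = squares.filter(lambda x: x % 2 == 1)

# Broadcast variables are supported; their contents are read via .value.
factor = sc.broadcast(10)
scaled = odd_squares.map(lambda x: x * factor.value)

print(scaled.collect())  # e.g. [10, 90]
```

A script like this can be launched with `run-pyspark`; the same expressions can also be entered interactively through `pyspark-shell`.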