# PySpark

PySpark is a Python API for Spark.

PySpark jobs are written in Python and executed using a standard Python
interpreter; this means they can use modules that rely on Python C
extensions.  The API is modeled on the Spark Scala API and uses regular
Python functions and lambdas as user-defined functions.  PySpark supports
interactive use through a standard Python interpreter; it automatically
serializes closures and ships them to worker processes.
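
For example, a small job built from plain lambdas might look like the
following (a minimal sketch, not taken from this README: the
`pyspark.context` import path is inferred from the file layout, and the
`SparkContext("local", ...)` constructor arguments mirror the Spark Scala
API):

```
from pyspark.context import SparkContext   # import path is an assumption

# Constructor arguments mirror the Scala API and are an assumption here.
sc = SparkContext("local", "WordCount")

lines = sc.parallelize(["to be or not to be", "that is the question"])
counts = (lines.flatMap(lambda line: line.split(" "))  # plain lambdas act as
               .map(lambda word: (word, 1))            # user-defined functions
               .reduceByKey(lambda a, b: a + b))
print(counts.collect())
```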

PySpark is built on top of the Spark Java API.  Data is uniformly
represented as serialized Python objects and stored in Spark Java
processes, which communicate with PySpark worker processes over pipes.

## Features

PySpark supports most of the Spark API, including broadcast variables (a
short sketch appears after the list below).  RDDs are dynamically typed
and can hold any Python object.

PySpark does not support:

- Special functions on RDDs of doubles
- Accumulators
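
The broadcast-variable support mentioned above can be used roughly as
follows (a minimal sketch; the `sc.broadcast(...)` call and the `.value`
attribute mirror the Spark Scala API and are assumptions for this code
base):

```
from pyspark.context import SparkContext   # import path is an assumption

sc = SparkContext("local", "BroadcastExample")

# Ship a read-only lookup table to the workers once, rather than inside
# every closure.
lookup = sc.broadcast({"a": 1, "b": 2, "c": 3})

rdd = sc.parallelize(["a", "b", "c", "a"])
print(rdd.map(lambda k: lookup.value[k]).collect())   # -> [1, 2, 3, 1]
```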

## Examples and Documentation

The PySpark source contains docstrings and doctests that document its
API.  The public classes are in `context.py` and `rdd.py`.

The `pyspark/pyspark/examples` directory contains a few complete
examples.

## Installing PySpark

To use PySpark, set the `SPARK_HOME` environment variable to the location
of the Spark package.

## Running PySpark

The easiest way to run PySpark is to use the `run-pyspark` and
`pyspark-shell` scripts, which are included in the `pyspark` directory.
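
An interactive session might look like the following (a hypothetical
transcript: it assumes `pyspark-shell` drops you into a standard Python
interpreter and that you construct the `SparkContext` yourself):

```
>>> from pyspark.context import SparkContext     # import path is an assumption
>>> sc = SparkContext("local", "PySparkShell")   # constructor args are an assumption
>>> sc.parallelize(range(10)).filter(lambda x: x % 2 == 0).collect()
[0, 2, 4, 6, 8]
```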