author     Josh Rosen <joshrosen@eecs.berkeley.edu>  2012-10-20 00:16:41 +0000
committer  Josh Rosen <joshrosen@eecs.berkeley.edu>  2012-10-20 00:22:27 +0000
commit     c23bf1aff4b9a1faf9d32c7b64acad2213f9515c (patch)
tree       85c71f77ef78714eb6a459874354c8ba11afdd2c /pyspark/README
parent     52989c8a2c8c10d7f5610c033f6782e58fd3abc2 (diff)
Add PySpark README and run scripts.
Diffstat (limited to 'pyspark/README')
-rw-r--r--  pyspark/README  58
1 file changed, 58 insertions, 0 deletions
diff --git a/pyspark/README b/pyspark/README
new file mode 100644
index 0000000000..63a1def141
--- /dev/null
+++ b/pyspark/README
@@ -0,0 +1,58 @@
+# PySpark
+
+PySpark is a Python API for Spark.
+
+PySpark jobs are written in Python and executed using a standard Python
+interpreter; this supports modules that use Python C extensions. The
+API is based on the Spark Scala API and uses regular Python functions
+and lambdas to support user-defined functions. PySpark supports
+interactive use through a standard Python interpreter; it can
+automatically serialize closures and ship them to worker processes.
+
+PySpark is built on top of the Spark Java API. Data is uniformly
+represented as serialized Python objects and stored in Spark Java
+processes, which communicate with PySpark worker processes over pipes.
+
+## Features
+
+PySpark supports most of the Spark API, including broadcast variables.
+RDDs are dynamically typed and can hold any Python object.
+
+PySpark does not support:
+
+- Special functions on RDDs of doubles
+- Accumulators
+
+## Examples and Documentation
+
+The PySpark source contains docstrings and doctests that document its
+API. The public classes are in `context.py` and `rdd.py`.
+
+The `pyspark/pyspark/examples` directory contains a few complete
+examples.
+
+## Installing PySpark
+
+PySpark requires a development version of Py4J, a Python library for
+interacting with Java processes. It can be installed from
+https://github.com/bartdag/py4j; make sure to install a version that
+contains at least the commits through 3dbf380d3d.
+
+PySpark uses the `PYTHONPATH` environment variable to search for Python
+classes; Py4J should be on this path, along with any libraries used by
+PySpark programs. `PYTHONPATH` will be automatically shipped to worker
+machines, but the files that it points to must be present on each
+machine.
+
+PySpark requires the Spark assembly JAR, which can be created by running
+`sbt/sbt assembly` in the Spark directory.
+
+Additionally, `SPARK_HOME` should be set to the location of the Spark
+package.
+
+## Running PySpark
+
+The easiest way to run PySpark is to use the `run-pyspark` and
+`pyspark-shell` scripts, which are included in the `pyspark` directory.
+These scripts automatically load the `spark-conf.sh` file, set
+`SPARK_HOME`, and add the `pyspark` package to the `PYTHONPATH`.
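
As a quick illustration of the two API features this README calls out
(regular Python lambdas as user-defined functions, and broadcast
variables), here is a minimal sketch. It is not part of this commit: the
`"local"` master string and the job name are illustrative assumptions,
and it presumes `pyspark` and Py4J are already on `PYTHONPATH` as the
README describes.

    # Minimal sketch of the API described above. Assumes pyspark and Py4J
    # are on PYTHONPATH; the "local" master and job name are illustrative.
    from pyspark.context import SparkContext

    sc = SparkContext("local", "ReadmeExample")

    # User-defined functions are plain Python lambdas:
    squares = sc.parallelize([1, 2, 3, 4]).map(lambda x: x * x)
    print(squares.collect())  # [1, 4, 9, 16]

    # Broadcast variables ship a read-only value to each worker:
    lookup = sc.broadcast({"a": 1, "b": 2})
    mapped = sc.parallelize(["a", "b", "a"]).map(lambda k: lookup.value[k])
    print(mapped.collect())  # [1, 2, 1]

Under the setup described in the README, a script like this would be
launched through the `run-pyspark` wrapper, which takes care of
`SPARK_HOME` and `PYTHONPATH`.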