author     Holden Karau <holden@us.ibm.com>  2016-11-16 14:22:15 -0800
committer  Josh Rosen <joshrosen@databricks.com>  2016-11-16 14:22:15 -0800
commit     a36a76ac43c36a3b897a748bd9f138b629dbc684 (patch)
tree       651dffb9af189f06369b2d3bbc0a897bc3b9f5ee /python/README.md
parent     bb6cdfd9a6a6b6c91aada7c3174436146045ed1e (diff)
[SPARK-1267][SPARK-18129] Allow PySpark to be pip installed
## What changes were proposed in this pull request?

This PR aims to provide a pip-installable PySpark package. It does a bunch of work to copy the jars over and package them with the Python code (to prevent challenges from trying to use different versions of the Python code with different versions of the JARs). It does not currently publish to PyPI, but that is the natural follow-up (SPARK-18129).

Done:
- pip installable on conda [manually tested]
- setup.py installed on a non-pip-managed system (RHEL) with YARN [manually tested]
- Automated testing of this (virtualenv)
- packaging and signing with release-build*

Possible follow-up work:
- release-build update to publish to PyPI (SPARK-18128)
- figure out who owns the pyspark package name on prod PyPI (is it someone within the project, or should we ask PyPI, or should we choose a different name to publish with, like ApachePySpark?)
- Windows support and/or testing (SPARK-18136)
- investigate details of wheel caching and see if we can avoid cleaning the wheel cache during our tests
- consider how we want to number our dev/snapshot versions

Explicitly out of scope:
- Using pip-installed PySpark to start a standalone cluster
- Using pip-installed PySpark for non-Python Spark programs

*I've done some work to test release-build locally, but as a non-committer I've only done local testing.

## How was this patch tested?

Automated testing with virtualenv, manual testing with conda, a system-wide install, and YARN integration. The release-build changes were tested locally as a non-committer (no testing of uploading artifacts to Apache staging websites).

Author: Holden Karau <holden@us.ibm.com>
Author: Juliet Hougland <juliet@cloudera.com>
Author: Juliet Hougland <not@myemail.com>

Closes #15659 from holdenk/SPARK-1267-pip-install-pyspark.
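As context for the change described above, here is a minimal sketch of how a pip-installed PySpark might be smoke-tested from Python. It assumes the packaged PySpark has already been installed into the current environment; the app name and local master URL are illustrative placeholders, not values from the patch.

```python
# Minimal sketch: exercising a pip-installed PySpark in a local virtualenv.
# Assumes a `pip install` of the packaged PySpark has already succeeded; the
# app name and "local[2]" master below are illustrative placeholders.
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("pip-install-smoke-test").setMaster("local[2]")
sc = SparkContext(conf=conf)

# A trivial job confirming that the bundled JARs and the Py4J bridge work.
print(sc.parallelize(range(100)).sum())

sc.stop()
```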
Diffstat (limited to 'python/README.md')
-rw-r--r--  python/README.md | 32
1 file changed, 32 insertions, 0 deletions
diff --git a/python/README.md b/python/README.md
new file mode 100644
index 0000000000..0a5c8010b8
--- /dev/null
+++ b/python/README.md
@@ -0,0 +1,32 @@
+# Apache Spark
+
+Spark is a fast and general cluster computing system for Big Data. It provides
+high-level APIs in Scala, Java, Python, and R, and an optimized engine that
+supports general computation graphs for data analysis. It also supports a
+rich set of higher-level tools including Spark SQL for SQL and DataFrames,
+MLlib for machine learning, GraphX for graph processing,
+and Spark Streaming for stream processing.
+
+<http://spark.apache.org/>
+
+## Online Documentation
+
+You can find the latest Spark documentation, including a programming
+guide, on the [project web page](http://spark.apache.org/documentation.html).
+
+
+## Python Packaging
+
+This README file only contains basic information related to pip-installed PySpark.
+This packaging is currently experimental and may change in future versions (although we will do our best to keep compatibility).
+Using PySpark requires the Spark JARs, and if you are building this from source, please see the build instructions at
+["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).
+
+The Python packaging for Spark is not intended to replace all of the other use cases. This Python-packaged version of Spark is suitable for interacting with an existing cluster (be it Spark standalone, YARN, or Mesos), but it does not contain the tools required to set up your own standalone Spark cluster. You can download the full version of Spark from the [Apache Spark downloads page](http://spark.apache.org/downloads.html).
+
+
+**NOTE:** If you are using this with a Spark standalone cluster, you must ensure that the versions (including the minor version) match, or you may experience odd errors.
+
+## Python Requirements
+
+At its core, PySpark depends on Py4J (currently version 0.10.4), but additional sub-packages have their own requirements (including numpy and pandas). \ No newline at end of file
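The "Python Packaging" section of the new README above notes that the pip-installed package is meant for interacting with an existing cluster rather than standing one up. The following is a hypothetical sketch of what that looks like from a pip-installed client; the standalone master URL and host name are placeholders, not values from this commit.

```python
# Hypothetical sketch: using a pip-installed PySpark against an existing
# standalone cluster. The master URL below is a placeholder.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .master("spark://cluster-host:7077")  # existing standalone master
         .appName("pip-installed-client")
         .getOrCreate())

# As the README's NOTE points out, the client-side Spark version (including
# the minor version) should match the cluster's to avoid odd errors.
print(spark.version)

spark.stop()
```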