-rw-r--r--  README.md                                    5
-rwxr-xr-x  docs/_layouts/global.html                    1
-rw-r--r--  docs/configuration.md                       15
-rw-r--r--  docs/hadoop-third-party-distributions.md   117
-rw-r--r--  docs/index.md                                1
-rw-r--r--  docs/programming-guide.md                    9
6 files changed, 19 insertions, 129 deletions
diff --git a/README.md b/README.md
index 4116ef3563..c0d6a94603 100644
--- a/README.md
+++ b/README.md
@@ -87,10 +87,7 @@ Hadoop, you must build Spark against the same version that your cluster runs.
Please refer to the build documentation at
["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
for detailed guidance on building for a particular distribution of Hadoop, including
-building for particular Hive and Hive Thriftserver distributions. See also
-["Third Party Hadoop Distributions"](http://spark.apache.org/docs/latest/hadoop-third-party-distributions.html)
-for guidance on building a Spark application that works with a particular
-distribution.
+building for particular Hive and Hive Thriftserver distributions.
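+
+For example, to build against a specific Hadoop version, pass the matching profile and
+`hadoop.version` to Maven (a sketch only; the profile and version below are illustrative,
+so substitute the ones for your distribution):
+
+    mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package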
## Configuration
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index b4952fe97c..467ff7a03f 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -112,7 +112,6 @@
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li><a href="security.html">Security</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
- <li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li>
<li class="divider"></li>
<li><a href="building-spark.html">Building Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
diff --git a/docs/configuration.md b/docs/configuration.md
index 682384d424..c276e8e90d 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -1674,3 +1674,18 @@ Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can config
To specify a configuration directory other than the default "SPARK_HOME/conf",
you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc.)
from this directory.
+
+# Inheriting Hadoop Cluster Configuration
+
+If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
+should be included on Spark's classpath:
+
+* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
+* `core-site.xml`, which sets the default filesystem name.
+
+The location of these configuration files varies across CDH and HDP versions, but
+a common location is `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
+configurations on the fly but offer a mechanism to download copies of them.
+
+To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/conf/spark-env.sh`
+to a location containing the configuration files.
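+
+For example, a minimal sketch (assuming the cluster's client configuration lives in
+`/etc/hadoop/conf`, a common default on CDH and HDP):
+
+{% highlight bash %}
+# conf/spark-env.sh
+export HADOOP_CONF_DIR=/etc/hadoop/conf
+{% endhighlight %}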
diff --git a/docs/hadoop-third-party-distributions.md b/docs/hadoop-third-party-distributions.md
deleted file mode 100644
index 795dd82a6b..0000000000
--- a/docs/hadoop-third-party-distributions.md
+++ /dev/null
@@ -1,117 +0,0 @@
----
-layout: global
-title: Third-Party Hadoop Distributions
----
-
-Spark can run against all versions of Cloudera's Distribution Including Apache Hadoop (CDH) and
-the Hortonworks Data Platform (HDP). There are a few things to keep in mind when using Spark
-with these distributions:
-
-# Compile-time Hadoop Version
-
-When compiling Spark, you'll need to specify the Hadoop version by defining the `hadoop.version`
-property. For certain versions, you will need to specify additional profiles. For more detail,
-see the guide on [building with maven](building-spark.html#specifying-the-hadoop-version):
-
- mvn -Dhadoop.version=1.0.4 -DskipTests clean package
- mvn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
-
-The table below lists the corresponding `hadoop.version` code for each CDH/HDP release. Note that
-some Hadoop releases are binary compatible across client versions. This means the pre-built Spark
-distribution may "just work" without you needing to compile. That said, we recommend compiling with
-the _exact_ Hadoop version you are running to avoid any compatibility errors.
-
-<table>
- <tr valign="top">
- <td>
- <h3>CDH Releases</h3>
- <table class="table" style="width:350px; margin-right: 20px;">
- <tr><th>Release</th><th>Version code</th></tr>
- <tr><td>CDH 4.X.X (YARN mode)</td><td>2.0.0-cdh4.X.X</td></tr>
- <tr><td>CDH 4.X.X</td><td>2.0.0-mr1-cdh4.X.X</td></tr>
- </table>
- </td>
- <td>
- <h3>HDP Releases</h3>
- <table class="table" style="width:350px;">
- <tr><th>Release</th><th>Version code</th></tr>
- <tr><td>HDP 1.3</td><td>1.2.0</td></tr>
- <tr><td>HDP 1.2</td><td>1.1.2</td></tr>
- <tr><td>HDP 1.1</td><td>1.0.3</td></tr>
- <tr><td>HDP 1.0</td><td>1.0.3</td></tr>
- <tr><td>HDP 2.0</td><td>2.2.0</td></tr>
- </table>
- </td>
- </tr>
-</table>
-
-In SBT, the equivalent can be achieved by setting the `hadoop.version` property:
-
- build/sbt -Dhadoop.version=1.0.4 assembly
-
-# Linking Applications to the Hadoop Version
-
-In addition to compiling Spark itself against the right version, you need to add a Maven dependency on that
-version of `hadoop-client` to any Spark applications you run, so they can also talk to the HDFS version
-on the cluster. If you are using CDH, you also need to add the Cloudera Maven repository.
-This looks as follows in SBT:
-
-{% highlight scala %}
-libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<version>"
-
-// If using CDH, also add Cloudera repo
-resolvers += "Cloudera Repository" at "https://repository.cloudera.com/artifactory/cloudera-repos/"
-{% endhighlight %}
-
-Or in Maven:
-
-{% highlight xml %}
-<project>
- <dependencies>
- ...
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-client</artifactId>
- <version>[version]</version>
- </dependency>
- </dependencies>
-
- <!-- If using CDH, also add Cloudera repo -->
- <repositories>
- ...
- <repository>
- <id>Cloudera repository</id>
- <url>https://repository.cloudera.com/artifactory/cloudera-repos/</url>
- </repository>
- </repositories>
-</project>
-
-{% endhighlight %}
-
-# Where to Run Spark
-
-As described in the [Hardware Provisioning](hardware-provisioning.html#storage-systems) guide,
-Spark can run in a variety of deployment modes:
-
-* Using a dedicated set of Spark nodes in your cluster. These nodes should be co-located with your
- Hadoop installation.
-* Running on the same nodes as an existing Hadoop installation, with a fixed amount of memory and
- cores dedicated to Spark on each node.
-* Running Spark alongside Hadoop using a cluster resource manager, such as YARN or Mesos.
-
-These options are identical for those using CDH and HDP.
-
-# Inheriting Cluster Configuration
-
-If you plan to read and write from HDFS using Spark, there are two Hadoop configuration files that
-should be included on Spark's classpath:
-
-* `hdfs-site.xml`, which provides default behaviors for the HDFS client.
-* `core-site.xml`, which sets the default filesystem name.
-
-The location of these configuration files varies across CDH and HDP versions, but
-a common location is inside of `/etc/hadoop/conf`. Some tools, such as Cloudera Manager, create
-configurations on-the-fly, but offer a mechanism to download copies of them.
-
-To make these files visible to Spark, set `HADOOP_CONF_DIR` in `$SPARK_HOME/spark-env.sh`
-to a location containing the configuration files.
diff --git a/docs/index.md b/docs/index.md
index c0dc2b8d74..f1d9e012c6 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -117,7 +117,6 @@ options for deployment:
* [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
* [Security](security.html): Spark security support
* [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
-* [3<sup>rd</sup> Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions
* Integration with other storage systems:
* [OpenStack Swift](storage-openstack-swift.html)
* [Building Spark](building-spark.html): build Spark using the Maven system
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 22656fd791..f823b89a4b 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -34,8 +34,7 @@ To write a Spark application, you need to add a Maven dependency on Spark. Spark
version = {{site.SPARK_VERSION}}
In addition, if you wish to access an HDFS cluster, you need to add a dependency on
-`hadoop-client` for your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
+`hadoop-client` for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
@@ -66,8 +65,7 @@ To write a Spark application in Java, you need to add a dependency on Spark. Spa
version = {{site.SPARK_VERSION}}
In addition, if you wish to access an HDFS cluster, you need to add a dependency on
-`hadoop-client` for your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
+`hadoop-client` for your version of HDFS.
groupId = org.apache.hadoop
artifactId = hadoop-client
@@ -93,8 +91,7 @@ This script will load Spark's Java/Scala libraries and allow you to submit appli
You can also use `bin/pyspark` to launch an interactive Python shell.
If you wish to access HDFS data, you need to use a build of PySpark linking
-to your version of HDFS. Some common HDFS version tags are listed on the
-[third party distributions](hadoop-third-party-distributions.html) page.
+to your version of HDFS.
[Prebuilt packages](http://spark.apache.org/downloads.html) are also available on the Spark homepage
for common HDFS versions.