author | Marcelo Vanzin <vanzin@cloudera.com> | 2015-06-26 08:45:22 -0500 |
---|---|---|
committer | Imran Rashid <irashid@cloudera.com> | 2015-06-26 08:45:22 -0500 |
commit | 37bf76a2de2143ec6348a3d43b782227849520cc (patch) | |
tree | 5c4b07354e7bb3dbf0e896ffe448fa7e6451c324 /docs/running-on-yarn.md | |
parent | c9e05a315a96fbf3026a2b3c6934dd2dec420099 (diff) | |
[SPARK-8302] Support heterogeneous cluster install paths on YARN.

Some users have Hadoop installations on different paths across
their cluster. Currently, that makes it hard to set up some
configuration in Spark since that requires hardcoding paths to
jar files or native libraries, which wouldn't work on such a cluster.

This change introduces a couple of YARN-specific configurations
that instruct the backend to replace certain paths when launching
remote processes. That way, if the configuration says the Spark
jar is in "/spark/spark.jar", and also says that "/spark" should be
replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
of the jar.

Coupled with YARN's environment whitelist (which allows certain
env variables to be exposed to containers), this allows users to
support such heterogeneous environments, as long as a single
replacement is enough. (Otherwise, this feature would need to be
extended to support multiple path replacements.)
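
The replacement described above can be illustrated with a minimal Python sketch. This is not Spark's implementation: the function names `replace_gateway_path` and `expand_on_node` are hypothetical, matching on a whole leading path component is an assumption of the sketch, and the second function only simulates how a node might resolve a `{{NAME}}` marker from its own environment.

```python
from typing import Dict, Optional


def replace_gateway_path(path: str,
                         gateway_path: Optional[str],
                         replacement_path: Optional[str]) -> str:
    """Swap the gateway-side prefix of `path` for the replacement (hypothetical helper)."""
    # With either setting missing, the path is passed through unchanged.
    if not gateway_path or not replacement_path:
        return path
    prefix = gateway_path.rstrip("/")
    # Assumption: only rewrite paths that sit under the gateway path,
    # so that "/sparkling/x.jar" is not touched when the prefix is "/spark".
    if path == prefix or path.startswith(prefix + "/"):
        return replacement_path + path[len(prefix):]
    return path


def expand_on_node(path: str, node_env: Dict[str, str]) -> str:
    """Simulate a node resolving {{NAME}} markers from its own environment."""
    for name, value in node_env.items():
        path = path.replace("{{" + name + "}}", value)
    return path


# The example from the commit message: the gateway sees /spark/spark.jar.
launch_path = replace_gateway_path(
    "/spark/spark.jar", "/spark", "{{SPARK_INSTALL_DIR}}")

# Two nodes with different (made-up) install locations resolve it differently.
node_a = expand_on_node(launch_path, {"SPARK_INSTALL_DIR": "/disk1/spark"})
node_b = expand_on_node(launch_path, {"SPARK_INSTALL_DIR": "/opt/spark"})
```

Because only one gateway path and one replacement are configured, a second install location that differs across nodes would not be covered, which is the limitation the commit message notes.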
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #6752 from vanzin/SPARK-8302 and squashes the following commits:
4bff8d4 [Marcelo Vanzin] Add docs, rename configs.
0aa2a02 [Marcelo Vanzin] Only do replacement for paths that need it.
2e9cc9d [Marcelo Vanzin] Style.
a5e1f68 [Marcelo Vanzin] [SPARK-8302] Support heterogeneous cluster install paths on YARN.
Diffstat (limited to 'docs/running-on-yarn.md')
-rw-r--r-- | docs/running-on-yarn.md | 26 |
1 files changed, 26 insertions, 0 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 96cf612c54..3f8a093bbe 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -258,6 +258,32 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
   Principal to be used to login to KDC, while running on secure HDFS.
   </td>
 </tr>
+<tr>
+  <td><code>spark.yarn.config.gatewayPath</code></td>
+  <td>(none)</td>
+  <td>
+    A path that is valid on the gateway host (the host where a Spark application is started) but may
+    differ for paths for the same resource in other nodes in the cluster. Coupled with
+    <code>spark.yarn.config.replacementPath</code>, this is used to support clusters with
+    heterogeneous configurations, so that Spark can correctly launch remote processes.
+    <p/>
+    The replacement path normally will contain a reference to some environment variable exported by
+    YARN (and, thus, visible to Spark containers).
+    <p/>
+    For example, if the gateway node has Hadoop libraries installed on <code>/disk1/hadoop</code>, and
+    the location of the Hadoop install is exported by YARN as the <code>HADOOP_HOME</code>
+    environment variable, setting this value to <code>/disk1/hadoop</code> and the replacement path to
+    <code>$HADOOP_HOME</code> will make sure that paths used to launch remote processes properly
+    reference the local YARN configuration.
+  </td>
+</tr>
+<tr>
+  <td><code>spark.yarn.config.replacementPath</code></td>
+  <td>(none)</td>
+  <td>
+    See <code>spark.yarn.config.gatewayPath</code>.
+  </td>
+</tr>
 </table>
 
 # Launching Spark on YARN