SPARK-2624 add datanucleus jars to the container in yarn-cluster

If `spark-submit` finds the datanucleus jars, it adds them to the driver's classpath, but does not add it to the container. This patch modifies the yarn deployment class to copy all `datanucleus-*` jars found in `[spark-home]/libs` to the container.

Author: Jim Lim <jim@quixey.com>

Closes #3238 from jimjh/SPARK-2624 and squashes the following commits:

3633071 [Jim Lim] SPARK-2624 update documentation and comments
fe95125 [Jim Lim] SPARK-2624 keep java imports together
6c31fe0 [Jim Lim] SPARK-2624 update documentation
6690fbf [Jim Lim] SPARK-2624 add tests
d28d8e9 [Jim Lim] SPARK-2624 add spark.yarn.datanucleus.dir option
84e6cba [Jim Lim] SPARK-2624 add datanucleus jars to the container in yarn-cluster
author: Jim Lim <jim@quixey.com> 2014-12-03 11:16:02 -0800
committer: Andrew Or <andrew@databricks.com> 2014-12-03 11:16:29 -0800
commit: a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53 (patch)
tree: 4d360d83bf07ae47d9b12962c47431d7611568c9 /docs/running-on-yarn.md
parent: d00542987ed80635782dcc826fc0bdbf434fff10 (diff)
download: spark-a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.tar.gz
spark-a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.tar.bz2
spark-a975dc32799bb8a14f9e1c76defaaa7cfbaf8b53.zip
1 files changed, 15 insertions, 0 deletions
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index dfe2db4b3f..45e219e0c1 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -132,6 +132,21 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
     The maximum number of threads to use in the application master for launching executor containers.
   </td>
 </tr>
+<tr>
+  <td><code>spark.yarn.datanucleus.dir</code></td>
+  <td>$SPARK_HOME/lib</td>
+  <td>
+     The location of the DataNucleus jars, in case overriding the default location is desired.
+     By default, Spark on YARN will use the DataNucleus jars installed at
+     <code>$SPARK_HOME/lib</code>, but the jars can also be in a world-readable location on HDFS.
+     This allows YARN to cache it on nodes so that it doesn't need to be distributed each time an
+     application runs. To point to a directory on HDFS, for example, set this configuration to
+     "hdfs:///some/path".
+
+     This is required because the datanucleus jars cannot be packaged into the
+     assembly jar due to metadata conflicts (involving <code>plugin.xml</code>.)
+  </td>
+</tr>
 </table>
 
 # Launching Spark on YARN
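
As a rough illustration of the option this patch documents, the property could be set in `conf/spark-defaults.conf`. This is only a sketch: it assumes a Spark build that actually contains this patch, and `hdfs:///some/path` is the placeholder from the documentation text above, not a real location.

```
# conf/spark-defaults.conf (sketch; assumes a build that includes this patch)
# hdfs:///some/path is the placeholder path used in the added documentation
spark.yarn.datanucleus.dir   hdfs:///some/path
```

Leaving the property unset would keep the documented default of `$SPARK_HOME/lib`.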