path: root/docs/configuration.md
author     Andrew Or <andrewor14@gmail.com>    2014-05-12 19:44:14 -0700
committer  Patrick Wendell <pwendell@gmail.com>    2014-05-12 19:44:14 -0700
commit     2ffd1eafd28635dcecc0ac738d4a62c05d740925 (patch)
tree       0c2b30a97dfd24fc6268d4f429111fe6c7348bbe /docs/configuration.md
parent     ba96bb3d591130075763706526f86fb2aaffa3ae (diff)
[SPARK-1753 / 1773 / 1814] Update outdated docs for spark-submit, YARN, standalone etc.
YARN
- SparkPi was updated to not take in master as an argument; we should update the docs to reflect that.
- The default YARN build guide should be in maven, not sbt.
- This PR also adds a paragraph on steps to debug a YARN application.

Standalone
- Emphasize spark-submit more. Right now it's one small paragraph preceding the legacy way of launching through `org.apache.spark.deploy.Client`.
- The way we set configurations / environment variables according to the old docs is outdated. This needs to reflect changes introduced by the Spark configuration changes we made.

In general, this PR also adds a little more documentation on the new spark-shell, spark-submit, spark-defaults.conf etc here and there.

Author: Andrew Or <andrewor14@gmail.com>

Closes #701 from andrewor14/yarn-docs and squashes the following commits:

e2c2312 [Andrew Or] Merge in changes in #752 (SPARK-1814)
25cfe7b [Andrew Or] Merge in the warning from SPARK-1753
a8c39c5 [Andrew Or] Minor changes
336bbd9 [Andrew Or] Tabs -> spaces
4d9d8f7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
041017a [Andrew Or] Abstract Spark submit documentation to cluster-overview.html
3cc0649 [Andrew Or] Detail how to set configurations + remove legacy instructions
5b7140a [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
85a51fc [Andrew Or] Update run-example, spark-shell, configuration etc.
c10e8c7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
381fe32 [Andrew Or] Update docs for standalone mode
757c184 [Andrew Or] Add a note about the requirements for the debugging trick
f8ca990 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-docs
924f04c [Andrew Or] Revert addition of --deploy-mode
d5fe17b [Andrew Or] Update the YARN docs
Diffstat (limited to 'docs/configuration.md')
-rw-r--r--  docs/configuration.md  64
1 file changed, 41 insertions, 23 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 5b034e3cb3..2eed96f704 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -5,9 +5,9 @@ title: Spark Configuration
Spark provides three locations to configure the system:
-* [Spark properties](#spark-properties) control most application parameters and can be set by passing
- a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
- system properties.
+* [Spark properties](#spark-properties) control most application parameters and can be set by
+ passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext,
+ or through the `conf/spark-defaults.conf` properties file.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
the IP address, through the `conf/spark-env.sh` script on each node.
* [Logging](#configuring-logging) can be configured through `log4j.properties`.
@@ -15,25 +15,41 @@ Spark provides three locations to configure the system:
# Spark Properties
-Spark properties control most application settings and are configured separately for each application.
-The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
-class to your SparkContext constructor.
-Alternatively, Spark will also load them from Java system properties, for compatibility with old versions
-of Spark.
-
-SparkConf lets you configure most of the common properties to initialize a cluster (e.g., master URL and
-application name), as well as arbitrary key-value pairs through the `set()` method. For example, we could
-initialize an application as follows:
+Spark properties control most application settings and are configured separately for each
+application. The preferred way is to set them through
+[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) and pass it as an argument to your
+SparkContext. SparkConf allows you to configure most of the common properties to initialize a
+cluster (e.g. master URL and application name), as well as arbitrary key-value pairs through the
+`set()` method. For example, we could initialize an application as follows:
{% highlight scala %}
-val conf = new SparkConf().
- setMaster("local").
- setAppName("My application").
- set("spark.executor.memory", "1g")
+val conf = new SparkConf()
+ .setMaster("local")
+ .setAppName("CountingSheep")
+ .set("spark.executor.memory", "1g")
val sc = new SparkContext(conf)
{% endhighlight %}
-Most of the properties control internal settings that have reasonable default values. However,
+## Loading Default Configurations
+
+In the case of `spark-shell`, a SparkContext has already been created for you, so you cannot control
+the configuration properties through SparkConf. However, you can still set configuration properties
+through a default configuration file. By default, `spark-shell` (and more generally `spark-submit`)
+will read configuration options from `conf/spark-defaults.conf`, in which each line consists of a
+key and a value separated by whitespace. For example,
+
+ spark.master spark://5.6.7.8:7077
+ spark.executor.memory 512m
+ spark.eventLog.enabled true
+ spark.serializer org.apache.spark.serializer.KryoSerializer
+
+Any values specified in the file will be passed on to the application, and merged with those
+specified through SparkConf. If the same configuration property exists in both `spark-defaults.conf`
+and SparkConf, then the latter will take precedence as it is the most application-specific.
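For illustration, a minimal sketch of how this precedence plays out, assuming the sample `spark-defaults.conf` shown above (the application name and the 2g figure are arbitrary):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// spark-submit first loads conf/spark-defaults.conf; any key set
// explicitly on SparkConf afterwards wins for that key.
val conf = new SparkConf()
  .setAppName("CountingSheep")
  .set("spark.executor.memory", "2g")  // overrides the 512m from spark-defaults.conf

// spark.master, spark.eventLog.enabled and spark.serializer are not set
// here, so the values from spark-defaults.conf still apply.
val sc = new SparkContext(conf)
{% endhighlight %}

Launched through `spark-submit`, such an application would run its executors with 2g of memory while still inheriting the master URL, event logging, and Kryo serializer settings from the file.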
+
+## All Configuration Properties
+
+Most of the properties that control internal settings have reasonable default values. However,
there are at least five properties that you will commonly want to control:
<table class="table">
@@ -101,9 +117,9 @@ Apart from these, the following properties are also available, and may be useful
<td>spark.default.parallelism</td>
<td>
<ul>
+ <li>Local mode: number of cores on the local machine</li>
<li>Mesos fine grained mode: 8</li>
- <li>Local mode: core number of the local machine</li>
- <li>Others: total core number of all executor nodes or 2, whichever is larger</li>
+ <li>Others: total number of cores on all executor nodes or 2, whichever is larger</li>
</ul>
</td>
<td>
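For illustration, a minimal sketch of overriding this derived default from application code (the application name and the value 48 below are arbitrary placeholders):

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Explicitly raise the default parallelism instead of relying on the
// value derived from the cluster; 48 is only an illustrative number.
val conf = new SparkConf()
  .setAppName("ParallelismExample")
  .set("spark.default.parallelism", "48")
val sc = new SparkContext(conf)

// Shuffle operations such as groupByKey and reduceByKey will now use
// 48 tasks when no partition count is passed explicitly.
{% endhighlight %}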
@@ -187,7 +203,7 @@ Apart from these, the following properties are also available, and may be useful
Comma separated list of filter class names to apply to the Spark web ui. The filter should be a
standard javax servlet Filter. Parameters to each filter can also be specified by setting a
java system property of spark.&lt;class name of filter&gt;.params='param1=value1,param2=value2'
- (e.g.-Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
+ (e.g. -Dspark.ui.filters=com.test.filter1 -Dspark.com.test.filter1.params='param1=foo,param2=testing')
</td>
</tr>
<tr>
@@ -696,7 +712,9 @@ Apart from these, the following properties are also available, and may be useful
## Viewing Spark Properties
The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
-This is a useful place to check to make sure that your properties have been set correctly.
+This is a useful place to check to make sure that your properties have been set correctly. Note
+that only values explicitly specified through either `spark-defaults.conf` or SparkConf will
+appear. For all other configuration properties, you can assume the default value is used.
# Environment Variables
@@ -714,8 +732,8 @@ The following variables can be set in `spark-env.sh`:
* `PYSPARK_PYTHON`, the Python binary to use for PySpark
* `SPARK_LOCAL_IP`, to configure which IP address of the machine to bind to.
* `SPARK_PUBLIC_DNS`, the hostname your Spark program will advertise to other machines.
-* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts), such as number of cores
- to use on each machine and maximum memory.
+* Options for the Spark [standalone cluster scripts](spark-standalone.html#cluster-launch-scripts),
+ such as number of cores to use on each machine and maximum memory.
Since `spark-env.sh` is a shell script, some of these can be set programmatically -- for example, you might
compute `SPARK_LOCAL_IP` by looking up the IP of a specific network interface.
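For example, a `spark-env.sh` entry along the following lines would bind Spark to the IPv4 address of a particular interface (the interface name `eth0` and the exact command are placeholders that depend on your system):

    # Use the IPv4 address assigned to eth0 (interface name is only an example)
    SPARK_LOCAL_IP=$(ip -4 -o addr show eth0 | awk '{print $4}' | cut -d/ -f1)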