Merge remote-tracking branch 'upstream/master' into sparsesvd

author: Reza Zadeh <rizlar@gmail.com> 2014-01-11 13:27:15 -0800
committer: Reza Zadeh <rizlar@gmail.com> 2014-01-11 13:27:15 -0800
commit: f324d5355514b1c7ae85019b476046bb64b5593e (patch)
tree: f2774712cb0b4f6558ad00fe0168b00b51f0674a /docs
parent: 1afdeaeb2f436084a6fbe8d73690f148f7b462c4 (diff)
parent: ee6e7f9b8cc56985787546882fba291cf9ad7667 (diff)
download: spark-f324d5355514b1c7ae85019b476046bb64b5593e.tar.gz
spark-f324d5355514b1c7ae85019b476046bb64b5593e.tar.bz2
spark-f324d5355514b1c7ae85019b476046bb64b5593e.zip
2 files changed, 35 insertions, 5 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 6717757781..ad75e06fc7 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -104,14 +104,25 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td>spark.storage.memoryFraction</td>
-  <td>0.66</td>
+  <td>0.6</td>
   <td>
     Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
-    generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase
+    generation of objects in the JVM, which by default is given 0.6 of the heap, but you can increase
     it if you configure your own old generation size.
   </td>
 </tr>
 <tr>
+  <td>spark.shuffle.memoryFraction</td>
+  <td>0.3</td>
+  <td>
+    Fraction of Java heap to use for aggregation and cogroups during shuffles, if
+    <code>spark.shuffle.externalSorting</code> is enabled. At any given time, the collective size of
+    all in-memory maps used for shuffles is bounded by this limit, beyond which the contents will
+    begin to spill to disk. If spills are often, consider increasing this value at the expense of
+    <code>spark.storage.memoryFraction</code>.
+  </td>
+</tr>
+<tr>
   <td>spark.mesos.coarse</td>
   <td>false</td>
   <td>
@@ -371,12 +382,20 @@ Apart from these, the following properties are also available, and may be useful
 
 <tr>
   <td>spark.shuffle.consolidateFiles</td>
-  <td>false</td>
+  <td>true</td>
   <td>
     If set to "true", consolidates intermediate files created during a shuffle. Creating fewer files can improve filesystem performance for shuffles with large numbers of reduce tasks. It is recommended to set this to "true" when using ext4 or xfs filesystems. On ext3, this option might degrade performance on machines with many (>8) cores due to filesystem limitations.
   </td>
 </tr>
 <tr>
+  <td>spark.shuffle.externalSorting</td>
+  <td>true</td>
+  <td>
+    If set to "true", limits the amount of memory used during reduces by spilling data out to disk. This spilling
+    threshold is specified by <code>spark.shuffle.memoryFraction</code>.
+  </td>
+</tr>
+<tr>
   <td>spark.speculation</td>
   <td>false</td>
   <td>
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index b206270107..3bd62646ba 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -101,7 +101,19 @@ With this mode, your application is actually run on the remote machine where the
 
 With yarn-client mode, the application will be launched locally. Just like running application or spark-shell on Local / Mesos / Standalone mode. The launch method is also the similar with them, just make sure that when you need to specify a master url, use "yarn-client" instead. And you also need to export the env value for SPARK_JAR and SPARK_YARN_APP_JAR
 
-In order to tune worker core/number/memory etc. You need to export SPARK_WORKER_CORES, SPARK_WORKER_MEMORY, SPARK_WORKER_INSTANCES e.g. by ./conf/spark-env.sh
+Configuration in yarn-client mode:
+
+In order to tune worker core/number/memory etc. You need to export environment variables or add them to the spark configuration file (./conf/spark_env.sh). The following are the list of options.
+
+* `SPARK_YARN_APP_JAR`, Path to your application's JAR file (required)
+* `SPARK_WORKER_INSTANCES`, Number of workers to start (Default: 2)
+* `SPARK_WORKER_CORES`, Number of cores for the workers (Default: 1).
+* `SPARK_WORKER_MEMORY`, Memory per Worker (e.g. 1000M, 2G) (Default: 1G)
+* `SPARK_MASTER_MEMORY`, Memory for Master (e.g. 1000M, 2G) (Default: 512 Mb)
+* `SPARK_YARN_APP_NAME`, The name of your application (Default: Spark)
+* `SPARK_YARN_QUEUE`, The hadoop queue to use for allocation requests (Default: 'default')
+* `SPARK_YARN_DIST_FILES`, Comma separated list of files to be distributed with the job.
+* `SPARK_YARN_DIST_ARCHIVES`, Comma separated list of archives to be distributed with the job.
 
 For example:
 
@@ -114,7 +126,6 @@ For example:
     SPARK_YARN_APP_JAR=examples/target/scala-{{site.SCALA_VERSION}}/spark-examples-assembly-{{site.SPARK_VERSION}}.jar \
     MASTER=yarn-client ./bin/spark-shell
 
-You can also send extra files to yarn cluster for worker to use by exporting SPARK_YARN_DIST_FILES=file1,file2... etc.
 
 # Building Spark for Hadoop/YARN 2.2.x