author     Brennon York <brennon.york@capitalone.com>   2015-02-25 16:12:56 -0800
committer  Reynold Xin <rxin@databricks.com>            2015-02-25 16:12:56 -0800
commit     46a044a36a2aff1306f7f677e952ce253ddbefac (patch)
tree       96c638352e86ac3f399afb2fe342a3e625f30e2a
parent     41e2e5acb749c25641f1f8dea5a2e1d8af319486 (diff)
[SPARK-1182][Docs] Sort the configuration parameters in configuration.md
Sorts all configuration options present on the `configuration.md` page to ease readability.

Author: Brennon York <brennon.york@capitalone.com>

Closes #3863 from brennonyork/SPARK-1182 and squashes the following commits:

5696f21 [Brennon York] fixed merge conflict with port comments
81a7b10 [Brennon York] capitalized A in Allocation
e240486 [Brennon York] moved all spark.mesos properties into the running-on-mesos doc
7de5f75 [Brennon York] moved serialization from application to compression and serialization section
a16fec0 [Brennon York] moved shuffle settings from network to shuffle
f8fa286 [Brennon York] sorted encryption category
1023f15 [Brennon York] moved initialExecutors
e9d62aa [Brennon York] fixed akka.heartbeat.interval
25e6f6f [Brennon York] moved spark.executer.user*
4625ade [Brennon York] added spark.executor.extra* items
4ee5648 [Brennon York] fixed merge conflicts
1b49234 [Brennon York] sorting mishap
2b5758b [Brennon York] sorting mishap
6fbdf42 [Brennon York] sorting mishap
55dc6f8 [Brennon York] sorted security
ec34294 [Brennon York] sorted dynamic allocation
2a7c4a3 [Brennon York] sorted scheduling
aa9acdc [Brennon York] sorted networking
a4380b8 [Brennon York] sorted execution behavior
27f3919 [Brennon York] sorted compression and serialization
80a5bbb [Brennon York] sorted spark ui
3f32e5b [Brennon York] sorted shuffle behavior
6c51b38 [Brennon York] sorted runtime environment
efe9d6f [Brennon York] sorted application properties
-rw-r--r--  docs/configuration.md    | 984
-rw-r--r--  docs/running-on-mesos.md |  24
2 files changed, 496 insertions, 512 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 81298514a7..8dd2bad613 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -95,41 +95,12 @@ of the most common options to set are:
</td>
</tr>
<tr>
- <td><code>spark.master</code></td>
- <td>(none)</td>
- <td>
- The cluster manager to connect to. See the list of
- <a href="submitting-applications.html#master-urls"> allowed master URL's</a>.
- </td>
-</tr>
-<tr>
<td><code>spark.driver.cores</code></td>
<td>1</td>
<td>
Number of cores to use for the driver process, only in cluster mode.
</td>
</tr>
-<tr>
- <td><code>spark.driver.memory</code></td>
- <td>512m</td>
- <td>
- Amount of memory to use for the driver process, i.e. where SparkContext is initialized.
- (e.g. <code>512m</code>, <code>2g</code>).
-
- <br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
- directly in your application, because the driver JVM has already started at that point.
- Instead, please set this through the <code>--driver-memory</code> command line option
- or in your default properties file.</td>
-</tr>
-<tr>
- <td><code>spark.executor.memory</code></td>
- <td>512m</td>
- <td>
- Amount of memory to use per executor process, in the same format as JVM memory strings
- (e.g. <code>512m</code>, <code>2g</code>).
- </td>
-</tr>
-<tr>
<td><code>spark.driver.maxResultSize</code></td>
<td>1g</td>
<td>
@@ -142,38 +113,35 @@ of the most common options to set are:
</td>
</tr>
<tr>
- <td><code>spark.serializer</code></td>
- <td>org.apache.spark.serializer.<br />JavaSerializer</td>
+ <td><code>spark.driver.memory</code></td>
+ <td>512m</td>
<td>
- Class to use for serializing objects that will be sent over the network or need to be cached
- in serialized form. The default of Java serialization works with any Serializable Java object
- but is quite slow, so we recommend <a href="tuning.html">using
- <code>org.apache.spark.serializer.KryoSerializer</code> and configuring Kryo serialization</a>
- when speed is necessary. Can be any subclass of
- <a href="api/scala/index.html#org.apache.spark.serializer.Serializer">
- <code>org.apache.spark.Serializer</code></a>.
+ Amount of memory to use for the driver process, i.e. where SparkContext is initialized.
+ (e.g. <code>512m</code>, <code>2g</code>).
+
+ <br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
+ directly in your application, because the driver JVM has already started at that point.
+ Instead, please set this through the <code>--driver-memory</code> command line option
+ or in your default properties file.
</td>
</tr>
<tr>
- <td><code>spark.kryo.classesToRegister</code></td>
- <td>(none)</td>
+ <td><code>spark.executor.memory</code></td>
+ <td>512m</td>
<td>
- If you use Kryo serialization, give a comma-separated list of custom class names to register
- with Kryo.
- See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
+ Amount of memory to use per executor process, in the same format as JVM memory strings
+ (e.g. <code>512m</code>, <code>2g</code>).
</td>
</tr>
<tr>
- <td><code>spark.kryo.registrator</code></td>
+ <td><code>spark.extraListeners</code></td>
<td>(none)</td>
<td>
- If you use Kryo serialization, set this class to register your custom classes with Kryo. This
- property is useful if you need to register your classes in a custom way, e.g. to specify a custom
- field serializer. Otherwise <code>spark.kryo.classesToRegister</code> is simpler. It should be
- set to a class that extends
- <a href="api/scala/index.html#org.apache.spark.serializer.KryoRegistrator">
- <code>KryoRegistrator</code></a>.
- See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
+ A comma-separated list of classes that implement <code>SparkListener</code>; when initializing
+ SparkContext, instances of these classes will be created and registered with Spark's listener
+ bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor
+ will be called; otherwise, a zero-argument constructor will be called. If no valid constructor
+ can be found, the SparkContext creation will fail with an exception.
</td>
</tr>
<tr>
@@ -196,14 +164,11 @@ of the most common options to set are:
</td>
</tr>
<tr>
- <td><code>spark.extraListeners</code></td>
+ <td><code>spark.master</code></td>
<td>(none)</td>
<td>
- A comma-separated list of classes that implement <code>SparkListener</code>; when initializing
- SparkContext, instances of these classes will be created and registered with Spark's listener
- bus. If a class has a single-argument constructor that accepts a SparkConf, that constructor
- will be called; otherwise, a zero-argument constructor will be called. If no valid constructor
- can be found, the SparkContext creation will fail with an exception.
+ The cluster manager to connect to. See the list of
+ <a href="submitting-applications.html#master-urls"> allowed master URL's</a>.
</td>
</tr>
</table>
@@ -214,26 +179,26 @@ Apart from these, the following properties are also available, and may be useful
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.driver.extraJavaOptions</code></td>
+ <td><code>spark.driver.extraClassPath</code></td>
<td>(none)</td>
<td>
- A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.
-
+ Extra classpath entries to append to the classpath of the driver.
+
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
directly in your application, because the driver JVM has already started at that point.
- Instead, please set this through the <code>--driver-java-options</code> command line option or in
+ Instead, please set this through the <code>--driver-class-path</code> command line option or in
your default properties file.</td>
</td>
</tr>
<tr>
- <td><code>spark.driver.extraClassPath</code></td>
+ <td><code>spark.driver.extraJavaOptions</code></td>
<td>(none)</td>
<td>
- Extra classpath entries to append to the classpath of the driver.
-
+ A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.
+
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
directly in your application, because the driver JVM has already started at that point.
- Instead, please set this through the <code>--driver-class-path</code> command line option or in
+ Instead, please set this through the <code>--driver-java-options</code> command line option or in
your default properties file.</td>
</td>
</tr>
@@ -261,23 +226,22 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.executor.extraJavaOptions</code></td>
+ <td><code>spark.executor.extraClassPath</code></td>
<td>(none)</td>
<td>
- A string of extra JVM options to pass to executors. For instance, GC settings or other
- logging. Note that it is illegal to set Spark properties or heap size settings with this
- option. Spark properties should be set using a SparkConf object or the
- spark-defaults.conf file used with the spark-submit script. Heap size settings can be set
- with spark.executor.memory.
+ Extra classpath entries to append to the classpath of executors. This exists primarily for
+ backwards-compatibility with older versions of Spark. Users typically should not need to set
+ this option.
</td>
</tr>
<tr>
- <td><code>spark.executor.extraClassPath</code></td>
+ <td><code>spark.executor.extraJavaOptions</code></td>
<td>(none)</td>
<td>
- Extra classpath entries to append to the classpath of executors. This exists primarily
- for backwards-compatibility with older versions of Spark. Users typically should not need
- to set this option.
+ A string of extra JVM options to pass to executors. For instance, GC settings or other logging.
+ Note that it is illegal to set Spark properties or heap size settings with this option. Spark
+ properties should be set using a SparkConf object or the spark-defaults.conf file used with the
+ spark-submit script. Heap size settings can be set with spark.executor.memory.
</td>
</tr>
<tr>
@@ -288,6 +252,24 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
+ <td>(none)</td>
+ <td>
+ Sets the number of latest rolling log files that are going to be retained by the system.
+ Older log files will be deleted. Disabled by default.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
+ <td>(none)</td>
+ <td>
+ Set the max size of the file by which the executor logs will be rolled over.
+ Rolling is disabled by default. Value is set in terms of bytes.
+ See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
+ for automatic cleaning of old logs.
+ </td>
+</tr>
+<tr>
<td><code>spark.executor.logs.rolling.strategy</code></td>
<td>(none)</td>
<td>
@@ -309,24 +291,6 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.executor.logs.rolling.size.maxBytes</code></td>
- <td>(none)</td>
- <td>
- Set the max size of the file by which the executor logs will be rolled over.
- Rolling is disabled by default. Value is set in terms of bytes.
- See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
- for automatic cleaning of old logs.
- </td>
-</tr>
-<tr>
- <td><code>spark.executor.logs.rolling.maxRetainedFiles</code></td>
- <td>(none)</td>
- <td>
- Sets the number of latest rolling log files that are going to be retained by the system.
- Older log files will be deleted. Disabled by default.
- </td>
-</tr>
-<tr>
<td><code>spark.executor.userClassPathFirst</code></td>
<td>false</td>
<td>
@@ -335,12 +299,11 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.python.worker.memory</code></td>
- <td>512m</td>
+ <td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
+ <td>(none)</td>
<td>
- Amount of memory to use per python worker process during aggregation, in the same
- format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>). If the memory
- used during aggregation goes above this amount, it will spill the data into disks.
+ Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
+ process. The user can specify multiple of these to set multiple environment variables.
</td>
</tr>
<tr>
@@ -367,6 +330,15 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.python.worker.memory</code></td>
+ <td>512m</td>
+ <td>
+ Amount of memory to use per python worker process during aggregation, in the same
+ format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>). If the memory
+ used during aggregation goes above this amount, it will spill the data into disks.
+ </td>
+</tr>
+<tr>
<td><code>spark.python.worker.reuse</code></td>
<td>true</td>
<td>
@@ -376,40 +348,38 @@ Apart from these, the following properties are also available, and may be useful
from JVM to Python worker for every task.
</td>
</tr>
+</table>
+
+#### Shuffle Behavior
+<table class="table">
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.executorEnv.[EnvironmentVariableName]</code></td>
- <td>(none)</td>
+ <td><code>spark.reducer.maxMbInFlight</code></td>
+ <td>48</td>
<td>
- Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
- process. The user can specify multiple of these to set multiple environment variables.
+ Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since
+ each output requires us to create a buffer to receive it, this represents a fixed memory
+ overhead per reduce task, so keep it small unless you have a large amount of memory.
</td>
</tr>
<tr>
- <td><code>spark.mesos.executor.home</code></td>
- <td>driver side <code>SPARK_HOME</code></td>
+ <td><code>spark.shuffle.blockTransferService</code></td>
+ <td>netty</td>
<td>
- Set the directory in which Spark is installed on the executors in Mesos. By default, the
- executors will simply use the driver's Spark home directory, which may not be visible to
- them. Note that this is only relevant if a Spark binary package is not specified through
- <code>spark.executor.uri</code>.
+ Implementation to use for transferring shuffle and cached blocks between executors. There
+ are two implementations available: <code>netty</code> and <code>nio</code>. Netty-based
+ block transfer is intended to be simpler but equally efficient and is the default option
+ starting in 1.2.
</td>
</tr>
<tr>
- <td><code>spark.mesos.executor.memoryOverhead</code></td>
- <td>executor memory * 0.07, with minimum of 384</td>
+ <td><code>spark.shuffle.compress</code></td>
+ <td>true</td>
<td>
- This value is an additive for <code>spark.executor.memory</code>, specified in MiB,
- which is used to calculate the total Mesos task memory. A value of <code>384</code>
- implies a 384MiB overhead. Additionally, there is a hard-coded 7% minimum
- overhead. The final overhead will be the larger of either
- `spark.mesos.executor.memoryOverhead` or 7% of `spark.executor.memory`.
+ Whether to compress map output files. Generally a good idea. Compression will use
+ <code>spark.io.compression.codec</code>.
</td>
</tr>
-</table>
-
-#### Shuffle Behavior
-<table class="table">
-<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.shuffle.consolidateFiles</code></td>
<td>false</td>
@@ -421,55 +391,46 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.shuffle.spill</code></td>
- <td>true</td>
+ <td><code>spark.shuffle.file.buffer.kb</code></td>
+ <td>32</td>
<td>
- If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
- This spilling threshold is specified by <code>spark.shuffle.memoryFraction</code>.
+ Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers
+ reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
</td>
</tr>
<tr>
- <td><code>spark.shuffle.spill.compress</code></td>
- <td>true</td>
+ <td><code>spark.shuffle.io.maxRetries</code></td>
+ <td>3</td>
<td>
- Whether to compress data spilled during shuffles. Compression will use
- <code>spark.io.compression.codec</code>.
+ (Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is
+ set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC
+ pauses or transient network connectivity issues.
</td>
</tr>
<tr>
- <td><code>spark.shuffle.memoryFraction</code></td>
- <td>0.2</td>
+ <td><code>spark.shuffle.io.numConnectionsPerPeer</code></td>
+ <td>1</td>
<td>
- Fraction of Java heap to use for aggregation and cogroups during shuffles, if
- <code>spark.shuffle.spill</code> is true. At any given time, the collective size of
- all in-memory maps used for shuffles is bounded by this limit, beyond which the contents will
- begin to spill to disk. If spills are often, consider increasing this value at the expense of
- <code>spark.storage.memoryFraction</code>.
+ (Netty only) Connections between hosts are reused in order to reduce connection buildup for
+ large clusters. For clusters with many hard disks and few hosts, this may result in insufficient
+ concurrency to saturate all disks, and so users may consider increasing this value.
</td>
</tr>
<tr>
- <td><code>spark.shuffle.compress</code></td>
+ <td><code>spark.shuffle.io.preferDirectBufs</code></td>
<td>true</td>
<td>
- Whether to compress map output files. Generally a good idea. Compression will use
- <code>spark.io.compression.codec</code>.
- </td>
-</tr>
-<tr>
- <td><code>spark.shuffle.file.buffer.kb</code></td>
- <td>32</td>
- <td>
- Size of the in-memory buffer for each shuffle file output stream, in kilobytes. These buffers
- reduce the number of disk seeks and system calls made in creating intermediate shuffle files.
+ (Netty only) Off-heap buffers are used to reduce garbage collection during shuffle and cache
+ block transfer. For environments where off-heap memory is tightly limited, users may wish to
+ turn this off to force all allocations from Netty to be on-heap.
</td>
</tr>
<tr>
- <td><code>spark.reducer.maxMbInFlight</code></td>
- <td>48</td>
+ <td><code>spark.shuffle.io.retryWait</code></td>
+ <td>5</td>
<td>
- Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since
- each output requires us to create a buffer to receive it, this represents a fixed memory
- overhead per reduce task, so keep it small unless you have a large amount of memory.
+ (Netty only) Seconds to wait between retries of fetches. The maximum delay caused by retrying
+ is simply <code>maxRetries * retryWait</code>, by default 15 seconds.
</td>
</tr>
<tr>
@@ -482,6 +443,17 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.shuffle.memoryFraction</code></td>
+ <td>0.2</td>
+ <td>
+ Fraction of Java heap to use for aggregation and cogroups during shuffles, if
+ <code>spark.shuffle.spill</code> is true. At any given time, the collective size of
+ all in-memory maps used for shuffles is bounded by this limit, beyond which the contents will
+ begin to spill to disk. If spills are often, consider increasing this value at the expense of
+ <code>spark.storage.memoryFraction</code>.
+ </td>
+</tr>
+<tr>
<td><code>spark.shuffle.sort.bypassMergeThreshold</code></td>
<td>200</td>
<td>
@@ -490,13 +462,19 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.shuffle.blockTransferService</code></td>
- <td>netty</td>
+ <td><code>spark.shuffle.spill</code></td>
+ <td>true</td>
<td>
- Implementation to use for transferring shuffle and cached blocks between executors. There
- are two implementations available: <code>netty</code> and <code>nio</code>. Netty-based
- block transfer is intended to be simpler but equally efficient and is the default option
- starting in 1.2.
+ If set to "true", limits the amount of memory used during reduces by spilling data out to disk.
+ This spilling threshold is specified by <code>spark.shuffle.memoryFraction</code>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.shuffle.spill.compress</code></td>
+ <td>true</td>
+ <td>
+ Whether to compress data spilled during shuffles. Compression will use
+ <code>spark.io.compression.codec</code>.
</td>
</tr>
</table>
@@ -505,26 +483,28 @@ Apart from these, the following properties are also available, and may be useful
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.ui.port</code></td>
- <td>4040</td>
+ <td><code>spark.eventLog.compress</code></td>
+ <td>false</td>
<td>
- Port for your application's dashboard, which shows memory and workload data.
+ Whether to compress logged events, if <code>spark.eventLog.enabled</code> is true.
</td>
</tr>
<tr>
- <td><code>spark.ui.retainedStages</code></td>
- <td>1000</td>
+ <td><code>spark.eventLog.dir</code></td>
+ <td>file:///tmp/spark-events</td>
<td>
- How many stages the Spark UI and status APIs remember before garbage
- collecting.
+ Base directory in which Spark events are logged, if <code>spark.eventLog.enabled</code> is true.
+ Within this base directory, Spark creates a sub-directory for each application, and logs the
+ events specific to the application in this directory. Users may want to set this to
+ a unified location like an HDFS directory so history files can be read by the history server.
</td>
</tr>
<tr>
- <td><code>spark.ui.retainedJobs</code></td>
- <td>1000</td>
+ <td><code>spark.eventLog.enabled</code></td>
+ <td>false</td>
<td>
- How many jobs the Spark UI and status APIs remember before garbage
- collecting.
+ Whether to log Spark events, useful for reconstructing the Web UI after the application has
+ finished.
</td>
</tr>
<tr>
@@ -535,28 +515,26 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.eventLog.enabled</code></td>
- <td>false</td>
+ <td><code>spark.ui.port</code></td>
+ <td>4040</td>
<td>
- Whether to log Spark events, useful for reconstructing the Web UI after the application has
- finished.
+ Port for your application's dashboard, which shows memory and workload data.
</td>
</tr>
<tr>
- <td><code>spark.eventLog.compress</code></td>
- <td>false</td>
+ <td><code>spark.ui.retainedJobs</code></td>
+ <td>1000</td>
<td>
- Whether to compress logged events, if <code>spark.eventLog.enabled</code> is true.
+ How many jobs the Spark UI and status APIs remember before garbage
+ collecting.
</td>
</tr>
<tr>
- <td><code>spark.eventLog.dir</code></td>
- <td>file:///tmp/spark-events</td>
+ <td><code>spark.ui.retainedStages</code></td>
+ <td>1000</td>
<td>
- Base directory in which Spark events are logged, if <code>spark.eventLog.enabled</code> is true.
- Within this base directory, Spark creates a sub-directory for each application, and logs the
- events specific to the application in this directory. Users may want to set this to
- a unified location like an HDFS directory so history files can be read by the history server.
+ How many stages the Spark UI and status APIs remember before garbage
+ collecting.
</td>
</tr>
</table>
@@ -572,12 +550,10 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.rdd.compress</code></td>
- <td>false</td>
+ <td><code>spark.closure.serializer</code></td>
+ <td>org.apache.spark.serializer.<br />JavaSerializer</td>
<td>
- Whether to compress serialized RDD partitions (e.g. for
- <code>StorageLevel.MEMORY_ONLY_SER</code>). Can save substantial space at the cost of some
- extra CPU time.
+ Serializer class to use for closures. Currently only the Java serializer is supported.
</td>
</tr>
<tr>
@@ -594,14 +570,6 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.io.compression.snappy.block.size</code></td>
- <td>32768</td>
- <td>
- Block size (in bytes) used in Snappy compression, in the case when Snappy compression codec
- is used. Lowering this block size will also lower shuffle memory usage when Snappy is used.
- </td>
-</tr>
-<tr>
<td><code>spark.io.compression.lz4.block.size</code></td>
<td>32768</td>
<td>
@@ -610,21 +578,20 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.closure.serializer</code></td>
- <td>org.apache.spark.serializer.<br />JavaSerializer</td>
+ <td><code>spark.io.compression.snappy.block.size</code></td>
+ <td>32768</td>
<td>
- Serializer class to use for closures. Currently only the Java serializer is supported.
+ Block size (in bytes) used in Snappy compression, in the case when Snappy compression codec
+ is used. Lowering this block size will also lower shuffle memory usage when Snappy is used.
</td>
</tr>
<tr>
- <td><code>spark.serializer.objectStreamReset</code></td>
- <td>100</td>
+ <td><code>spark.kryo.classesToRegister</code></td>
+ <td>(none)</td>
<td>
- When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
- objects to prevent writing redundant data, however that stops garbage collection of those
- objects. By calling 'reset' you flush that info from the serializer, and allow old
- objects to be collected. To turn off this periodic reset set it to -1.
- By default it will reset the serializer every 100 objects.
+ If you use Kryo serialization, give a comma-separated list of custom class names to register
+ with Kryo.
+ See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
</td>
</tr>
<tr>
@@ -649,12 +616,16 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.kryoserializer.buffer.mb</code></td>
- <td>0.064</td>
+ <td><code>spark.kryo.registrator</code></td>
+ <td>(none)</td>
<td>
- Initial size of Kryo's serialization buffer, in megabytes. Note that there will be one buffer
- <i>per core</i> on each worker. This buffer will grow up to
- <code>spark.kryoserializer.buffer.max.mb</code> if needed.
+ If you use Kryo serialization, set this class to register your custom classes with Kryo. This
+ property is useful if you need to register your classes in a custom way, e.g. to specify a custom
+ field serializer. Otherwise <code>spark.kryo.classesToRegister</code> is simpler. It should be
+ set to a class that extends
+ <a href="api/scala/index.html#org.apache.spark.serializer.KryoRegistrator">
+ <code>KryoRegistrator</code></a>.
+ See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
</td>
</tr>
<tr>
@@ -666,12 +637,81 @@ Apart from these, the following properties are also available, and may be useful
inside Kryo.
</td>
</tr>
+<tr>
+ <td><code>spark.kryoserializer.buffer.mb</code></td>
+ <td>0.064</td>
+ <td>
+ Initial size of Kryo's serialization buffer, in megabytes. Note that there will be one buffer
+ <i>per core</i> on each worker. This buffer will grow up to
+ <code>spark.kryoserializer.buffer.max.mb</code> if needed.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.rdd.compress</code></td>
+ <td>false</td>
+ <td>
+ Whether to compress serialized RDD partitions (e.g. for
+ <code>StorageLevel.MEMORY_ONLY_SER</code>). Can save substantial space at the cost of some
+ extra CPU time.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.serializer</code></td>
+ <td>org.apache.spark.serializer.<br />JavaSerializer</td>
+ <td>
+ Class to use for serializing objects that will be sent over the network or need to be cached
+ in serialized form. The default of Java serialization works with any Serializable Java object
+ but is quite slow, so we recommend <a href="tuning.html">using
+ <code>org.apache.spark.serializer.KryoSerializer</code> and configuring Kryo serialization</a>
+ when speed is necessary. Can be any subclass of
+ <a href="api/scala/index.html#org.apache.spark.serializer.Serializer">
+ <code>org.apache.spark.Serializer</code></a>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.serializer.objectStreamReset</code></td>
+ <td>100</td>
+ <td>
+ When serializing using org.apache.spark.serializer.JavaSerializer, the serializer caches
+ objects to prevent writing redundant data, however that stops garbage collection of those
+ objects. By calling 'reset' you flush that info from the serializer, and allow old
+ objects to be collected. To turn off this periodic reset set it to -1.
+ By default it will reset the serializer every 100 objects.
+ </td>
+</tr>
</table>
#### Execution Behavior
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
+ <td><code>spark.broadcast.blockSize</code></td>
+ <td>4096</td>
+ <td>
+ Size of each piece of a block in kilobytes for <code>TorrentBroadcastFactory</code>.
+ Too large a value decreases parallelism during broadcast (makes it slower); however, if it is
+ too small, <code>BlockManager</code> might take a performance hit.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.broadcast.factory</code></td>
+ <td>org.apache.spark.broadcast.<br />TorrentBroadcastFactory</td>
+ <td>
+ Which broadcast implementation to use.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.cleaner.ttl</code></td>
+ <td>(infinite)</td>
+ <td>
+ Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks
+ generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be
+ forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in
+ case of Spark Streaming applications). Note that any RDD that persists in memory for more than
+ this duration will be cleared as well.
+ </td>
+</tr>
+<tr>
<td><code>spark.default.parallelism</code></td>
<td>
For distributed shuffle operations like <code>reduceByKey</code> and <code>join</code>, the
@@ -689,19 +729,18 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.broadcast.factory</code></td>
- <td>org.apache.spark.broadcast.<br />TorrentBroadcastFactory</td>
- <td>
- Which broadcast implementation to use.
- </td>
+ <td><code>spark.executor.heartbeatInterval</code></td>
+ <td>10000</td>
+ <td>Interval (milliseconds) between each executor's heartbeats to the driver. Heartbeats let
+ the driver know that the executor is still alive and update it with metrics for in-progress
+ tasks.</td>
</tr>
<tr>
- <td><code>spark.broadcast.blockSize</code></td>
- <td>4096</td>
+ <td><code>spark.files.fetchTimeout</code></td>
+ <td>60</td>
<td>
- Size of each piece of a block in kilobytes for <code>TorrentBroadcastFactory</code>.
- Too large a value decreases parallelism during broadcast (makes it slower); however, if it is
- too small, <code>BlockManager</code> might take a performance hit.
+ Communication timeout to use when fetching files added through SparkContext.addFile() from
+ the driver, in seconds.
</td>
</tr>
<tr>
@@ -713,12 +752,23 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.files.fetchTimeout</code></td>
- <td>60</td>
- <td>
- Communication timeout to use when fetching files added through SparkContext.addFile() from
- the driver, in seconds.
- </td>
+ <td><code>spark.hadoop.cloneConf</code></td>
+ <td>false</td>
+ <td>If set to true, clones a new Hadoop <code>Configuration</code> object for each task. This
+ option should be enabled to work around <code>Configuration</code> thread-safety issues (see
+ <a href="https://issues.apache.org/jira/browse/SPARK-2546">SPARK-2546</a> for more details).
+ This is disabled by default in order to avoid unexpected performance regressions for jobs that
+ are not affected by these issues.</td>
+</tr>
+<tr>
+ <td><code>spark.hadoop.validateOutputSpecs</code></td>
+ <td>true</td>
+ <td>If set to true, validates the output specification (e.g. checking if the output directory already exists)
+ used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing
+ output directories. We recommend that users do not disable this except if trying to achieve compatibility with
+ previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.
+ This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since
+ data may need to be rewritten to pre-existing output directories during checkpoint recovery.</td>
</tr>
<tr>
<td><code>spark.storage.memoryFraction</code></td>
@@ -730,6 +780,15 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.storage.memoryMapThreshold</code></td>
+ <td>2097152</td>
+ <td>
+ Size of a block, in bytes, above which Spark memory maps when reading a block from disk.
+ This prevents Spark from memory mapping very small blocks. In general, memory
+ mapping has high overhead for blocks close to or below the page size of the operating system.
+ </td>
+</tr>
+<tr>
<td><code>spark.storage.unrollFraction</code></td>
<td>0.2</td>
<td>
@@ -748,100 +807,74 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.storage.memoryMapThreshold</code></td>
- <td>2097152</td>
- <td>
- Size of a block, in bytes, above which Spark memory maps when reading a block from disk.
- This prevents Spark from memory mapping very small blocks. In general, memory
- mapping has high overhead for blocks close to or below the page size of the operating system.
- </td>
-</tr>
-<tr>
<td><code>spark.tachyonStore.url</code></td>
<td>tachyon://localhost:19998</td>
<td>
The URL of the underlying Tachyon file system in the TachyonStore.
</td>
</tr>
-<tr>
- <td><code>spark.cleaner.ttl</code></td>
- <td>(infinite)</td>
- <td>
- Duration (seconds) of how long Spark will remember any metadata (stages generated, tasks
- generated, etc.). Periodic cleanups will ensure that metadata older than this duration will be
- forgotten. This is useful for running Spark for many hours / days (for example, running 24/7 in
- case of Spark Streaming applications). Note that any RDD that persists in memory for more than
- this duration will be cleared as well.
- </td>
-</tr>
-<tr>
- <td><code>spark.hadoop.validateOutputSpecs</code></td>
- <td>true</td>
- <td>If set to true, validates the output specification (e.g. checking if the output directory already exists)
- used in saveAsHadoopFile and other variants. This can be disabled to silence exceptions due to pre-existing
- output directories. We recommend that users do not disable this except if trying to achieve compatibility with
- previous versions of Spark. Simply use Hadoop's FileSystem API to delete output directories by hand.
- This setting is ignored for jobs generated through Spark Streaming's StreamingContext, since
- data may need to be rewritten to pre-existing output directories during checkpoint recovery.</td>
-</tr>
-<tr>
- <td><code>spark.hadoop.cloneConf</code></td>
- <td>false</td>
- <td>If set to true, clones a new Hadoop <code>Configuration</code> object for each task. This
- option should be enabled to work around <code>Configuration</code> thread-safety issues (see
- <a href="https://issues.apache.org/jira/browse/SPARK-2546">SPARK-2546</a> for more details).
- This is disabled by default in order to avoid unexpected performance regressions for jobs that
- are not affected by these issues.</td>
-</tr>
-<tr>
- <td><code>spark.executor.heartbeatInterval</code></td>
- <td>10000</td>
- <td>Interval (milliseconds) between each executor's heartbeats to the driver. Heartbeats let
- the driver know that the executor is still alive and update it with metrics for in-progress
- tasks.</td>
-</tr>
</table>
#### Networking
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.driver.host</code></td>
- <td>(local hostname)</td>
+ <td><code>spark.akka.failure-detector.threshold</code></td>
+ <td>300.0</td>
<td>
- Hostname or IP address for the driver to listen on.
- This is used for communicating with the executors and the standalone Master.
+ This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
+ enabled again, if you plan to use this feature (Not recommended). This maps to akka's
+ `akka.remote.transport-failure-detector.threshold`. Tune this in combination of
+ `spark.akka.heartbeat.pauses` and `spark.akka.heartbeat.interval` if you need to.
</td>
</tr>
<tr>
- <td><code>spark.driver.port</code></td>
- <td>(random)</td>
+ <td><code>spark.akka.frameSize</code></td>
+ <td>10</td>
<td>
- Port for the driver to listen on.
- This is used for communicating with the executors and the standalone Master.
+ Maximum message size to allow in "control plane" communication (for serialized tasks and task
+ results), in MB. Increase this if your tasks need to send back large results to the driver
+ (e.g. using <code>collect()</code> on a large dataset).
</td>
</tr>
<tr>
- <td><code>spark.fileserver.port</code></td>
- <td>(random)</td>
+ <td><code>spark.akka.heartbeat.interval</code></td>
+ <td>1000</td>
<td>
- Port for the driver's HTTP file server to listen on.
+ This is set to a larger value to disable the transport failure detector that comes built in to
+ Akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger
+ interval value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more
+ informative for Akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses`
+ if you need to. A likely positive use case for using failure detector would be: a sensistive
+ failure detector can help evict rogue executors quickly. However this is usually not the case
+ as GC pauses and network lags are expected in a real Spark cluster. Apart from that enabling
+ this leads to a lot of exchanges of heart beats between nodes leading to flooding the network
+ with those.
</td>
</tr>
<tr>
- <td><code>spark.broadcast.port</code></td>
- <td>(random)</td>
+ <td><code>spark.akka.heartbeat.pauses</code></td>
+ <td>6000</td>
<td>
- Port for the driver's HTTP broadcast server to listen on.
- This is not relevant for torrent broadcast.
+ This is set to a larger value to disable the transport failure detector that comes built in to Akka.
+ It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart
+ beat pause in seconds for Akka. This can be used to control sensitivity to GC pauses. Tune
+ this along with `spark.akka.heartbeat.interval` if you need to.
</td>
</tr>
<tr>
- <td><code>spark.replClassServer.port</code></td>
- <td>(random)</td>
+ <td><code>spark.akka.threads</code></td>
+ <td>4</td>
<td>
- Port for the driver's HTTP class server to listen on.
- This is only relevant for the Spark shell.
+ Number of actor threads to use for communication. Can be useful to increase on large clusters
+ when the driver has a lot of CPU cores.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.akka.timeout</code></td>
+ <td>100</td>
+ <td>
+ Communication timeout between Spark nodes, in seconds.
</td>
</tr>
<tr>
@@ -852,41 +885,41 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.executor.port</code></td>
+ <td><code>spark.broadcast.port</code></td>
<td>(random)</td>
<td>
- Port for the executor to listen on. This is used for communicating with the driver.
+ Port for the driver's HTTP broadcast server to listen on.
+ This is not relevant for torrent broadcast.
</td>
</tr>
<tr>
- <td><code>spark.port.maxRetries</code></td>
- <td>16</td>
+ <td><code>spark.driver.host</code></td>
+ <td>(local hostname)</td>
<td>
- Default maximum number of retries when binding to a port before giving up.
+ Hostname or IP address for the driver to listen on.
+ This is used for communicating with the executors and the standalone Master.
</td>
</tr>
<tr>
- <td><code>spark.akka.frameSize</code></td>
- <td>10</td>
+ <td><code>spark.driver.port</code></td>
+ <td>(random)</td>
<td>
- Maximum message size to allow in "control plane" communication (for serialized tasks and task
- results), in MB. Increase this if your tasks need to send back large results to the driver
- (e.g. using <code>collect()</code> on a large dataset).
+ Port for the driver to listen on.
+ This is used for communicating with the executors and the standalone Master.
</td>
</tr>
<tr>
- <td><code>spark.akka.threads</code></td>
- <td>4</td>
+ <td><code>spark.executor.port</code></td>
+ <td>(random)</td>
<td>
- Number of actor threads to use for communication. Can be useful to increase on large clusters
- when the driver has a lot of CPU cores.
+ Port for the executor to listen on. This is used for communicating with the driver.
</td>
</tr>
<tr>
- <td><code>spark.akka.timeout</code></td>
- <td>100</td>
+ <td><code>spark.fileserver.port</code></td>
+ <td>(random)</td>
<td>
- Communication timeout between Spark nodes, in seconds.
+ Port for the driver's HTTP file server to listen on.
</td>
</tr>
<tr>
@@ -900,64 +933,18 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.akka.heartbeat.pauses</code></td>
- <td>6000</td>
- <td>
- This is set to a larger value to disable the transport failure detector that comes built in to Akka.
- It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart
- beat pause in seconds for Akka. This can be used to control sensitivity to GC pauses. Tune
- this along with `spark.akka.heartbeat.interval` if you need to.
- </td>
-</tr>
-<tr>
- <td><code>spark.akka.heartbeat.interval</code></td>
- <td>1000</td>
- <td>
- This is set to a larger value to disable the transport failure detector that comes built in to Akka.
- It can be enabled again, if you plan to use this feature (Not recommended). A larger interval
- value in seconds reduces network overhead and a smaller value ( ~ 1 s) might be more informative
- for Akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses` if you need
- to. A likely positive use case for using failure detector would be: a sensistive failure detector
- can help evict rogue executors quickly. However this is usually not the case as GC pauses
- and network lags are expected in a real Spark cluster. Apart from that enabling this leads to
- a lot of exchanges of heart beats between nodes leading to flooding the network with those.
- </td>
-</tr>
-<tr>
- <td><code>spark.shuffle.io.preferDirectBufs</code></td>
- <td>true</td>
- <td>
- (Netty only) Off-heap buffers are used to reduce garbage collection during shuffle and cache
- block transfer. For environments where off-heap memory is tightly limited, users may wish to
- turn this off to force all allocations from Netty to be on-heap.
- </td>
-</tr>
-<tr>
- <td><code>spark.shuffle.io.numConnectionsPerPeer</code></td>
- <td>1</td>
- <td>
- (Netty only) Connections between hosts are reused in order to reduce connection buildup for
- large clusters. For clusters with many hard disks and few hosts, this may result in insufficient
- concurrency to saturate all disks, and so users may consider increasing this value.
- </td>
-</tr>
-<tr>
- <td><code>spark.shuffle.io.maxRetries</code></td>
- <td>3</td>
+ <td><code>spark.port.maxRetries</code></td>
+ <td>16</td>
<td>
- (Netty only) Fetches that fail due to IO-related exceptions are automatically retried if this is
- set to a non-zero value. This retry logic helps stabilize large shuffles in the face of long GC
- pauses or transient network connectivity issues.
+ Default maximum number of retries when binding to a port before giving up.
</td>
</tr>
<tr>
- <td><code>spark.shuffle.io.retryWait</code></td>
- <td>5</td>
+ <td><code>spark.replClassServer.port</code></td>
+ <td>(random)</td>
<td>
- (Netty only) Seconds to wait between retries of fetches. The maximum delay caused by retrying
- is simply <code>maxRetries * retryWait</code>. The default maximum delay is therefore
- 15 seconds, because the default value of <code>maxRetries</code> is 3, and the default
- <code>retryWait</code> here is 5 seconds.
+ Port for the driver's HTTP class server to listen on.
+ This is only relevant for the Spark shell.
</td>
</tr>
</table>
@@ -966,31 +953,6 @@ Apart from these, the following properties are also available, and may be useful
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
- <td><code>spark.task.cpus</code></td>
- <td>1</td>
- <td>
- Number of cores to allocate for each task.
- </td>
-</tr>
-<tr>
- <td><code>spark.task.maxFailures</code></td>
- <td>4</td>
- <td>
- Number of individual task failures before giving up on the job.
- Should be greater than or equal to 1. Number of allowed retries = this value - 1.
- </td>
-</tr>
-<tr>
- <td><code>spark.scheduler.mode</code></td>
- <td>FIFO</td>
- <td>
- The <a href="job-scheduling.html#scheduling-within-an-application">scheduling mode</a> between
- jobs submitted to the same SparkContext. Can be set to <code>FAIR</code>
- to use fair sharing instead of queueing jobs one after another. Useful for
- multi-user services.
- </td>
-</tr>
-<tr>
<td><code>spark.cores.max</code></td>
<td>(not set)</td>
<td>
@@ -1003,43 +965,12 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.mesos.coarse</code></td>
- <td>false</td>
- <td>
- If set to "true", runs over Mesos clusters in
- <a href="running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>,
- where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per
- Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use
- for the whole duration of the Spark job.
- </td>
-</tr>
-<tr>
- <td><code>spark.speculation</code></td>
+ <td><code>spark.localExecution.enabled</code></td>
<td>false</td>
<td>
- If set to "true", performs speculative execution of tasks. This means if one or more tasks are
- running slowly in a stage, they will be re-launched.
- </td>
-</tr>
-<tr>
- <td><code>spark.speculation.interval</code></td>
- <td>100</td>
- <td>
- How often Spark will check for tasks to speculate, in milliseconds.
- </td>
-</tr>
-<tr>
- <td><code>spark.speculation.quantile</code></td>
- <td>0.75</td>
- <td>
- Percentage of tasks which must be complete before speculation is enabled for a particular stage.
- </td>
-</tr>
-<tr>
- <td><code>spark.speculation.multiplier</code></td>
- <td>1.5</td>
- <td>
- How many times slower a task is than the median to be considered for speculation.
+ Enables Spark to run certain jobs, such as first() or take() on the driver, without sending
+ tasks to the cluster. This can make certain jobs execute very quickly, but may require
+ shipping a whole partition of data to the driver.
</td>
</tr>
<tr>
@@ -1055,19 +986,19 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.locality.wait.process</code></td>
+ <td><code>spark.locality.wait.node</code></td>
<td>spark.locality.wait</td>
<td>
- Customize the locality wait for process locality. This affects tasks that attempt to access
- cached data in a particular executor process.
+ Customize the locality wait for node locality. For example, you can set this to 0 to skip
+ node locality and search immediately for rack locality (if your cluster has rack information).
</td>
</tr>
<tr>
- <td><code>spark.locality.wait.node</code></td>
+ <td><code>spark.locality.wait.process</code></td>
<td>spark.locality.wait</td>
<td>
- Customize the locality wait for node locality. For example, you can set this to 0 to skip
- node locality and search immediately for rack locality (if your cluster has rack information).
+ Customize the locality wait for process locality. This affects tasks that attempt to access
+ cached data in a particular executor process.
</td>
</tr>
<tr>
@@ -1078,14 +1009,14 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.scheduler.revive.interval</code></td>
- <td>1000</td>
+ <td><code>spark.scheduler.maxRegisteredResourcesWaitingTime</code></td>
+ <td>30000</td>
<td>
- The interval length for the scheduler to revive the worker resource offers to run tasks
+ Maximum amount of time to wait for resources to register before scheduling begins
(in milliseconds).
</td>
</tr>
-</tr>
+<tr>
<td><code>spark.scheduler.minRegisteredResourcesRatio</code></td>
<td>0.0 for Mesos and Standalone mode, 0.8 for YARN</td>
<td>
@@ -1098,25 +1029,70 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.scheduler.maxRegisteredResourcesWaitingTime</code></td>
- <td>30000</td>
+ <td><code>spark.scheduler.mode</code></td>
+ <td>FIFO</td>
<td>
- Maximum amount of time to wait for resources to register before scheduling begins
+ The <a href="job-scheduling.html#scheduling-within-an-application">scheduling mode</a> between
+ jobs submitted to the same SparkContext. Can be set to <code>FAIR</code>
+ to use fair sharing instead of queueing jobs one after another. Useful for
+ multi-user services.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.scheduler.revive.interval</code></td>
+ <td>1000</td>
+ <td>
+ The interval length for the scheduler to revive the worker resource offers to run tasks
(in milliseconds).
</td>
</tr>
<tr>
- <td><code>spark.localExecution.enabled</code></td>
+ <td><code>spark.speculation</code></td>
<td>false</td>
<td>
- Enables Spark to run certain jobs, such as first() or take() on the driver, without sending
- tasks to the cluster. This can make certain jobs execute very quickly, but may require
- shipping a whole partition of data to the driver.
+ If set to "true", performs speculative execution of tasks. This means if one or more tasks are
+ running slowly in a stage, they will be re-launched.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.speculation.interval</code></td>
+ <td>100</td>
+ <td>
+ How often Spark will check for tasks to speculate, in milliseconds.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.speculation.multiplier</code></td>
+ <td>1.5</td>
+ <td>
+ How many times slower a task is than the median to be considered for speculation.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.speculation.quantile</code></td>
+ <td>0.75</td>
+ <td>
+ Percentage of tasks which must be complete before speculation is enabled for a particular stage.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.task.cpus</code></td>
+ <td>1</td>
+ <td>
+ Number of cores to allocate for each task.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.task.maxFailures</code></td>
+ <td>4</td>
+ <td>
+ Number of individual task failures before giving up on the job.
+ Should be greater than or equal to 1. Number of allowed retries = this value - 1.
</td>
</tr>
</table>
-#### Dynamic allocation
+#### Dynamic Allocation
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
@@ -1136,10 +1112,19 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.dynamicAllocation.executorIdleTimeout</code></td>
+ <td>600</td>
+ <td>
+ If dynamic allocation is enabled and an executor has been idle for more than this duration
+ (in seconds), the executor will be removed. For more detail, see this
+ <a href="job-scheduling.html#resource-allocation-policy">description</a>.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.dynamicAllocation.initialExecutors</code></td>
<td><code>spark.dynamicAllocation.minExecutors</code></td>
- <td>0</td>
<td>
- Lower bound for the number of executors if dynamic allocation is enabled.
+ Initial number of executors to run if dynamic allocation is enabled.
</td>
</tr>
<tr>
@@ -1150,10 +1135,10 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.dynamicAllocation.maxExecutors</code></td>
<td><code>spark.dynamicAllocation.minExecutors</code></td>
+ <td>0</td>
<td>
- Initial number of executors to run if dynamic allocation is enabled.
+ Lower bound for the number of executors if dynamic allocation is enabled.
</td>
</tr>
<tr>
@@ -1174,21 +1159,31 @@ Apart from these, the following properties are also available, and may be useful
<a href="job-scheduling.html#resource-allocation-policy">description</a>.
</td>
</tr>
-<tr>
- <td><code>spark.dynamicAllocation.executorIdleTimeout</code></td>
- <td>600</td>
- <td>
- If dynamic allocation is enabled and an executor has been idle for more than this duration
- (in seconds), the executor will be removed. For more detail, see this
- <a href="job-scheduling.html#resource-allocation-policy">description</a>.
- </td>
-</tr>
</table>
#### Security
<table class="table">
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
+ <td><code>spark.acls.enable</code></td>
+ <td>false</td>
+ <td>
+ Whether Spark acls should are enabled. If enabled, this checks to see if the user has
+ access permissions to view or modify the job. Note this requires the user to be known,
+ so if the user comes across as null no checks are done. Filters can be used with the UI
+ to authenticate and set the user.
+ </td>
+</tr>
+<tr>
+ <td><code>spark.admin.acls</code></td>
+ <td>Empty</td>
+ <td>
+ Comma separated list of users/administrators that have view and modify access to all Spark jobs.
+ This can be used if you run on a shared cluster and have a set of administrators or devs who
+ help debug when things work.
+ </td>
+</tr>
+<tr>
<td><code>spark.authenticate</code></td>
<td>false</td>
<td>
@@ -1205,6 +1200,15 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.core.connection.ack.wait.timeout</code></td>
+ <td>60</td>
+ <td>
+ Number of seconds for the connection to wait for ack to occur before timing
+ out and giving up. To avoid unwilling timeout caused by long pause like GC,
+ you can set larger value.
+ </td>
+</tr>
+<tr>
<td><code>spark.core.connection.auth.wait.timeout</code></td>
<td>30</td>
<td>
@@ -1213,12 +1217,11 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.core.connection.ack.wait.timeout</code></td>
- <td>60</td>
+ <td><code>spark.modify.acls</code></td>
+ <td>Empty</td>
<td>
- Number of seconds for the connection to wait for ack to occur before timing
- out and giving up. To avoid unwilling timeout caused by long pause like GC,
- you can set larger value.
+ Comma separated list of users that have modify access to the Spark job. By default only the
+ user that started the Spark job has access to modify it (kill it for example).
</td>
</tr>
<tr>
@@ -1236,16 +1239,6 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.acls.enable</code></td>
- <td>false</td>
- <td>
- Whether Spark acls should are enabled. If enabled, this checks to see if the user has
- access permissions to view or modify the job. Note this requires the user to be known,
- so if the user comes across as null no checks are done. Filters can be used with the UI
- to authenticate and set the user.
- </td>
-</tr>
-<tr>
<td><code>spark.ui.view.acls</code></td>
<td>Empty</td>
<td>
@@ -1253,23 +1246,6 @@ Apart from these, the following properties are also available, and may be useful
user that started the Spark job has view access.
</td>
</tr>
-<tr>
- <td><code>spark.modify.acls</code></td>
- <td>Empty</td>
- <td>
- Comma separated list of users that have modify access to the Spark job. By default only the
- user that started the Spark job has access to modify it (kill it for example).
- </td>
-</tr>
-<tr>
- <td><code>spark.admin.acls</code></td>
- <td>Empty</td>
- <td>
- Comma separated list of users/administrators that have view and modify access to all Spark jobs.
- This can be used if you run on a shared cluster and have a set of administrators or devs who
- help debug when things work.
- </td>
-</tr>
</table>
#### Encryption
@@ -1294,6 +1270,23 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
+ <td><code>spark.ssl.enabledAlgorithms</code></td>
+ <td>Empty</td>
+ <td>
+ A comma separated list of ciphers. The specified ciphers must be supported by JVM.
+ The reference list of protocols one can find on
+ <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a>
+ page.
+ </td>
+ </tr>
+ <tr>
+ <td><code>spark.ssl.keyPassword</code></td>
+ <td>None</td>
+ <td>
+ A password to the private key in key-store.
+ </td>
+ </tr>
+ <tr>
<td><code>spark.ssl.keyStore</code></td>
<td>None</td>
<td>
@@ -1309,10 +1302,12 @@ Apart from these, the following properties are also available, and may be useful
</td>
</tr>
<tr>
- <td><code>spark.ssl.keyPassword</code></td>
+ <td><code>spark.ssl.protocol</code></td>
<td>None</td>
<td>
- A password to the private key in key-store.
+ A protocol name. The protocol must be supported by JVM. The reference list of protocols
+ one can find on <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a>
+ page.
</td>
</tr>
<tr>
@@ -1330,25 +1325,6 @@ Apart from these, the following properties are also available, and may be useful
A password to the trust-store.
</td>
</tr>
- <tr>
- <td><code>spark.ssl.protocol</code></td>
- <td>None</td>
- <td>
- A protocol name. The protocol must be supported by JVM. The reference list of protocols
- one can find on <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a>
- page.
- </td>
- </tr>
- <tr>
- <td><code>spark.ssl.enabledAlgorithms</code></td>
- <td>Empty</td>
- <td>
- A comma separated list of ciphers. The specified ciphers must be supported by JVM.
- The reference list of protocols one can find on
- <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">this</a>
- page.
- </td>
- </tr>
</table>
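
For readers skimming the reordered configuration.md entries above, the following is a minimal sketch (not part of the patch) of how a few of those properties are typically set programmatically; the application name, the `ConfExample` object, and the `Checkout` class are invented for illustration, while the property names and the client-mode caveat come straight from the documentation text in the diff.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical payload class registered with Kryo below.
case class Checkout(userId: Long, amountCents: Long)

object ConfExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("conf-example")
      // Prefer Kryo over the default JavaSerializer when serialization speed matters.
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
      // Equivalent to listing the class in spark.kryo.classesToRegister.
      .registerKryoClasses(Array(classOf[Checkout]))
      // Executor memory can safely be set programmatically.
      .set("spark.executor.memory", "2g")

    // spark.driver.memory is deliberately not set here: in client mode the driver
    // JVM is already running, so per the note in configuration.md it has to come
    // from --driver-memory or from the default properties file instead.

    val sc = new SparkContext(conf)
    try {
      println(sc.parallelize(1 to 10).sum())
    } finally {
      sc.stop()
    }
  }
}
```
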
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index 78358499fd..db1173a06b 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -197,7 +197,11 @@ See the [configuration page](configuration.html) for information on Spark config
<td><code>spark.mesos.coarse</code></td>
<td>false</td>
<td>
- Set the run mode for Spark on Mesos. For more information about the run mode, refer to #Mesos Run Mode section above.
+ If set to "true", runs over Mesos clusters in
+ <a href="running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>,
+ where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per
+ Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use
+ for the whole duration of the Spark job.
</td>
</tr>
<tr>
@@ -211,19 +215,23 @@ See the [configuration page](configuration.html) for information on Spark config
</tr>
<tr>
<td><code>spark.mesos.executor.home</code></td>
- <td>SPARK_HOME</td>
+ <td>driver side <code>SPARK_HOME</code></td>
<td>
- The location where the mesos executor will look for Spark binaries to execute, and uses the SPARK_HOME setting on default.
- This variable is only used when no spark.executor.uri is provided, and assumes Spark is installed on the specified location
- on each slave.
+ Set the directory in which Spark is installed on the executors in Mesos. By default, the
+ executors will simply use the driver's Spark home directory, which may not be visible to
+ them. Note that this is only relevant if a Spark binary package is not specified through
+ <code>spark.executor.uri</code>.
</td>
</tr>
<tr>
<td><code>spark.mesos.executor.memoryOverhead</code></td>
- <td>384</td>
+ <td>executor memory * 0.07, with minimum of 384</td>
<td>
- The amount of memory that Mesos executor will request for the task to account for the overhead of running the executor itself.
- The final total amount of memory allocated is the maximum value between executor memory plus memoryOverhead, and overhead fraction (1.07) plus the executor memory.
+ This value is an additive for <code>spark.executor.memory</code>, specified in MiB,
+ which is used to calculate the total Mesos task memory. A value of <code>384</code>
+ implies a 384MiB overhead. Additionally, there is a hard-coded 7% minimum
+ overhead. The final overhead will be the larger of either
+ `spark.mesos.executor.memoryOverhead` or 7% of `spark.executor.memory`.
</td>
</tr>
</table>
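
To make the `spark.mesos.executor.memoryOverhead` description above concrete, here is a small standalone sketch (not code from Spark itself; the helper and object names are invented) of how the final Mesos task memory works out under that rule, assuming all sizes are in MiB.

```scala
object MesosOverheadExample {
  // Sketch of the overhead rule described above: the effective overhead is the
  // larger of the configured value and 7% of spark.executor.memory, with the
  // default overhead being 7% floored at 384 MiB.
  def mesosTaskMemoryMiB(executorMemoryMiB: Int, memoryOverheadMiB: Option[Int]): Int = {
    val sevenPercent = (executorMemoryMiB * 0.07).toInt      // 7% of spark.executor.memory
    val default      = math.max(384, sevenPercent)           // default overhead
    val configured   = memoryOverheadMiB.getOrElse(default)  // spark.mesos.executor.memoryOverhead
    executorMemoryMiB + math.max(configured, sevenPercent)
  }

  def main(args: Array[String]): Unit = {
    // 4 GiB executors with the default overhead: 7% of 4096 is ~286 MiB, below
    // the 384 MiB floor, so the Mesos task asks for 4096 + 384 = 4480 MiB.
    println(mesosTaskMemoryMiB(4096, None))      // 4480
    // An explicit 512 MiB overhead exceeds the 7% minimum: 4096 + 512 = 4608 MiB.
    println(mesosTaskMemoryMiB(4096, Some(512))) // 4608
  }
}
```
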