author     Jacek Laskowski <jacek.laskowski@deepsense.io>    2015-09-21 19:46:39 +0100
committer  Sean Owen <sowen@cloudera.com>                    2015-09-21 19:46:39 +0100
commit     ca9fe540fe04e2e230d1e76526b5502bab152914 (patch)
tree       48b2bde988e1162e2528aae9452f1b84d3680148 /docs
parent     ebbf85f07bb8de0d566f1ae4b41f26421180bebe (diff)
[SPARK-10662] [DOCS] Code snippets are not properly formatted in tables
* Backticks are processed properly in Spark Properties table
* Removed unnecessary spaces
* See http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html

Author: Jacek Laskowski <jacek.laskowski@deepsense.io>

Closes #8795 from jaceklaskowski/docs-yarn-formatting.
Diffstat (limited to 'docs')
-rw-r--r--  docs/configuration.md             97
-rw-r--r--  docs/programming-guide.md        100
-rw-r--r--  docs/running-on-mesos.md          14
-rw-r--r--  docs/running-on-yarn.md          106
-rw-r--r--  docs/sql-programming-guide.md     16
-rw-r--r--  docs/submitting-applications.md    8
6 files changed, 171 insertions, 170 deletions
diff --git a/docs/configuration.md b/docs/configuration.md
index 5ec097c78a..b22587c703 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -34,20 +34,20 @@ val conf = new SparkConf()
val sc = new SparkContext(conf)
{% endhighlight %}
-Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may
+Note that we can have more than 1 thread in local mode, and in cases like Spark Streaming, we may
actually require one to prevent any sort of starvation issues.
-Properties that specify some time duration should be configured with a unit of time.
+Properties that specify some time duration should be configured with a unit of time.
The following format is accepted:
-
+
25ms (milliseconds)
5s (seconds)
10m or 10min (minutes)
3h (hours)
5d (days)
1y (years)
-
-
+
+
Properties that specify a byte size should be configured with a unit of size.
The following format is accepted:
@@ -140,7 +140,7 @@ of the most common options to set are:
<td>
Amount of memory to use for the driver process, i.e. where SparkContext is initialized.
(e.g. <code>1g</code>, <code>2g</code>).
-
+
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
directly in your application, because the driver JVM has already started at that point.
Instead, please set this through the <code>--driver-memory</code> command line option
@@ -207,7 +207,7 @@ Apart from these, the following properties are also available, and may be useful
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
directly in your application, because the driver JVM has already started at that point.
- Instead, please set this through the <code>--driver-class-path</code> command line option or in
+ Instead, please set this through the <code>--driver-class-path</code> command line option or in
your default properties file.</td>
</td>
</tr>
@@ -216,10 +216,10 @@ Apart from these, the following properties are also available, and may be useful
<td>(none)</td>
<td>
A string of extra JVM options to pass to the driver. For instance, GC settings or other logging.
-
+
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
directly in your application, because the driver JVM has already started at that point.
- Instead, please set this through the <code>--driver-java-options</code> command line option or in
+ Instead, please set this through the <code>--driver-java-options</code> command line option or in
your default properties file.</td>
</td>
</tr>
@@ -228,10 +228,10 @@ Apart from these, the following properties are also available, and may be useful
<td>(none)</td>
<td>
Set a special library path to use when launching the driver JVM.
-
+
<br /><em>Note:</em> In client mode, this config must not be set through the <code>SparkConf</code>
directly in your application, because the driver JVM has already started at that point.
- Instead, please set this through the <code>--driver-library-path</code> command line option or in
+ Instead, please set this through the <code>--driver-library-path</code> command line option or in
your default properties file.</td>
</td>
</tr>
@@ -242,7 +242,7 @@ Apart from these, the following properties are also available, and may be useful
(Experimental) Whether to give user-added jars precedence over Spark's own jars when loading
classes in the driver. This feature can be used to mitigate conflicts between Spark's
dependencies and user dependencies. It is currently an experimental feature.
-
+
This is used in cluster mode only.
</td>
</tr>
@@ -250,8 +250,8 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executor.extraClassPath</code></td>
<td>(none)</td>
<td>
- Extra classpath entries to prepend to the classpath of executors. This exists primarily for
- backwards-compatibility with older versions of Spark. Users typically should not need to set
+ Extra classpath entries to prepend to the classpath of executors. This exists primarily for
+ backwards-compatibility with older versions of Spark. Users typically should not need to set
this option.
</td>
</tr>
@@ -259,9 +259,9 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.executor.extraJavaOptions</code></td>
<td>(none)</td>
<td>
- A string of extra JVM options to pass to executors. For instance, GC settings or other logging.
- Note that it is illegal to set Spark properties or heap size settings with this option. Spark
- properties should be set using a SparkConf object or the spark-defaults.conf file used with the
+ A string of extra JVM options to pass to executors. For instance, GC settings or other logging.
+ Note that it is illegal to set Spark properties or heap size settings with this option. Spark
+ properties should be set using a SparkConf object or the spark-defaults.conf file used with the
spark-submit script. Heap size settings can be set with spark.executor.memory.
</td>
</tr>
@@ -305,7 +305,7 @@ Apart from these, the following properties are also available, and may be useful
<td>daily</td>
<td>
Set the time interval by which the executor logs will be rolled over.
- Rolling is disabled by default. Valid values are `daily`, `hourly`, `minutely` or
+ Rolling is disabled by default. Valid values are <code>daily</code>, <code>hourly</code>, <code>minutely</code> or
any interval in seconds. See <code>spark.executor.logs.rolling.maxRetainedFiles</code>
for automatic cleaning of old logs.
</td>
@@ -330,13 +330,13 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.python.profile</code></td>
<td>false</td>
<td>
- Enable profiling in Python worker, the profile result will show up by `sc.show_profiles()`,
+ Enable profiling in Python worker; the profile result will show up by <code>sc.show_profiles()</code>,
or it will be displayed before the driver exiting. It also can be dumped into disk by
- `sc.dump_profiles(path)`. If some of the profile results had been displayed manually,
+ <code>sc.dump_profiles(path)</code>. If some of the profile results had been displayed manually,
they will not be displayed automatically before driver exiting.
- By default the `pyspark.profiler.BasicProfiler` will be used, but this can be overridden by
- passing a profiler class in as a parameter to the `SparkContext` constructor.
+ By default the <code>pyspark.profiler.BasicProfiler</code> will be used, but this can be overridden by
+ passing a profiler class in as a parameter to the <code>SparkContext</code> constructor.
</td>
</tr>
<tr>
@@ -460,11 +460,11 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.shuffle.service.enabled</code></td>
<td>false</td>
<td>
- Enables the external shuffle service. This service preserves the shuffle files written by
- executors so the executors can be safely removed. This must be enabled if
+ Enables the external shuffle service. This service preserves the shuffle files written by
+ executors so the executors can be safely removed. This must be enabled if
<code>spark.dynamicAllocation.enabled</code> is "true". The external shuffle service
must be set up in order to enable it. See
- <a href="job-scheduling.html#configuration-and-setup">dynamic allocation
+ <a href="job-scheduling.html#configuration-and-setup">dynamic allocation
configuration and setup documentation</a> for more information.
</td>
</tr>
@@ -747,9 +747,9 @@ Apart from these, the following properties are also available, and may be useful
<td>1 in YARN mode, all the available cores on the worker in standalone mode.</td>
<td>
The number of cores to use on each executor. For YARN and standalone mode only.
-
- In standalone mode, setting this parameter allows an application to run multiple executors on
- the same worker, provided that there are enough cores on that worker. Otherwise, only one
+
+ In standalone mode, setting this parameter allows an application to run multiple executors on
+ the same worker, provided that there are enough cores on that worker. Otherwise, only one
executor per application will run on each worker.
</td>
</tr>
@@ -893,14 +893,14 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.akka.heartbeat.interval</code></td>
<td>1000s</td>
<td>
- This is set to a larger value to disable the transport failure detector that comes built in to
- Akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger
- interval value reduces network overhead and a smaller value ( ~ 1 s) might be more
- informative for Akka's failure detector. Tune this in combination of `spark.akka.heartbeat.pauses`
- if you need to. A likely positive use case for using failure detector would be: a sensistive
- failure detector can help evict rogue executors quickly. However this is usually not the case
- as GC pauses and network lags are expected in a real Spark cluster. Apart from that enabling
- this leads to a lot of exchanges of heart beats between nodes leading to flooding the network
+ This is set to a larger value to disable the transport failure detector that comes built in to
+ Akka. It can be enabled again, if you plan to use this feature (Not recommended). A larger
+ interval value reduces network overhead and a smaller value ( ~ 1 s) might be more
+ informative for Akka's failure detector. Tune this in combination with <code>spark.akka.heartbeat.pauses</code>
+ if you need to. A likely positive use case for using the failure detector would be: a sensitive
+ failure detector can help evict rogue executors quickly. However this is usually not the case
+ as GC pauses and network lags are expected in a real Spark cluster. Apart from that enabling
+ this leads to a lot of exchanges of heart beats between nodes leading to flooding the network
with those.
</td>
</tr>
@@ -909,9 +909,9 @@ Apart from these, the following properties are also available, and may be useful
<td>6000s</td>
<td>
This is set to a larger value to disable the transport failure detector that comes built in to Akka.
- It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart
+ It can be enabled again, if you plan to use this feature (Not recommended). Acceptable heart
beat pause for Akka. This can be used to control sensitivity to GC pauses. Tune
- this along with `spark.akka.heartbeat.interval` if you need to.
+ this along with <code>spark.akka.heartbeat.interval</code> if you need to.
</td>
</tr>
<tr>
@@ -978,7 +978,7 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.network.timeout</code></td>
<td>120s</td>
<td>
- Default timeout for all network interactions. This config will be used in place of
+ Default timeout for all network interactions. This config will be used in place of
<code>spark.core.connection.ack.wait.timeout</code>, <code>spark.akka.timeout</code>,
<code>spark.storage.blockManagerSlaveTimeoutMs</code>,
<code>spark.shuffle.io.connectionTimeout</code>, <code>spark.rpc.askTimeout</code> or
@@ -991,8 +991,8 @@ Apart from these, the following properties are also available, and may be useful
<td>
Maximum number of retries when binding to a port before giving up.
When a port is given a specific value (non 0), each subsequent retry will
- increment the port used in the previous attempt by 1 before retrying. This
- essentially allows it to try a range of ports from the start port specified
+ increment the port used in the previous attempt by 1 before retrying. This
+ essentially allows it to try a range of ports from the start port specified
to port + maxRetries.
</td>
</tr>
@@ -1191,7 +1191,7 @@ Apart from these, the following properties are also available, and may be useful
<td><code>spark.dynamicAllocation.executorIdleTimeout</code></td>
<td>60s</td>
<td>
- If dynamic allocation is enabled and an executor has been idle for more than this duration,
+ If dynamic allocation is enabled and an executor has been idle for more than this duration,
the executor will be removed. For more detail, see this
<a href="job-scheduling.html#resource-allocation-policy">description</a>.
</td>
@@ -1424,11 +1424,11 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
Enables or disables Spark Streaming's internal backpressure mechanism (since 1.5).
- This enables the Spark Streaming to control the receiving rate based on the
+ This enables the Spark Streaming to control the receiving rate based on the
current batch scheduling delays and processing times so that the system receives
- only as fast as the system can process. Internally, this dynamically sets the
+ only as fast as the system can process. Internally, this dynamically sets the
maximum receiving rate of receivers. This rate is upper bounded by the values
- `spark.streaming.receiver.maxRate` and `spark.streaming.kafka.maxRatePerPartition`
+ <code>spark.streaming.receiver.maxRate</code> and <code>spark.streaming.kafka.maxRatePerPartition</code>
if they are set (see below).
</td>
</tr>
@@ -1542,15 +1542,15 @@ The following variables can be set in `spark-env.sh`:
<tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr>
<tr>
<td><code>JAVA_HOME</code></td>
- <td>Location where Java is installed (if it's not on your default `PATH`).</td>
+ <td>Location where Java is installed (if it's not on your default <code>PATH</code>).</td>
</tr>
<tr>
<td><code>PYSPARK_PYTHON</code></td>
- <td>Python binary executable to use for PySpark in both driver and workers (default is `python`).</td>
+ <td>Python binary executable to use for PySpark in both driver and workers (default is <code>python</code>).</td>
</tr>
<tr>
<td><code>PYSPARK_DRIVER_PYTHON</code></td>
- <td>Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).</td>
+ <td>Python binary executable to use for PySpark in driver only (default is <code>PYSPARK_PYTHON</code>).</td>
</tr>
<tr>
<td><code>SPARK_LOCAL_IP</code></td>
@@ -1580,4 +1580,3 @@ Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can config
To specify a different configuration directory other than the default "SPARK_HOME/conf",
you can set SPARK_CONF_DIR. Spark will use the configuration files (spark-defaults.conf, spark-env.sh, log4j.properties, etc)
from this directory.
-
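For quick reference, here is a minimal sketch (not part of the patch) of how a few of the properties touched in this file can be set programmatically. All values are illustrative placeholders, and `spark.driver.memory` is deliberately omitted because, as the table notes, it cannot be set through `SparkConf` in client mode.
{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only; property names are taken from the table above.
val conf = new SparkConf()
  .setAppName("config-units-example")
  .setMaster("local[2]")                                          // more than one thread, e.g. for streaming
  .set("spark.network.timeout", "120s")                           // durations carry a unit: ms, s, m/min, h, d, y
  .set("spark.executor.memory", "2g")                             // byte sizes carry a unit as well
  .set("spark.executor.extraJavaOptions", "-XX:+PrintGCDetails")  // JVM flags only, never Spark properties or heap size

val sc = new SparkContext(conf)
println(sc.getConf.get("spark.network.timeout"))                  // prints "120s"
sc.stop()
{% endhighlight %}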
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 4cf83bb392..8ad238315f 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -182,8 +182,8 @@ in-process.
In the Spark shell, a special interpreter-aware SparkContext is already created for you, in the
variable called `sc`. Making your own SparkContext will not work. You can set which master the
context connects to using the `--master` argument, and you can add JARs to the classpath
-by passing a comma-separated list to the `--jars` argument. You can also add dependencies
-(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
+by passing a comma-separated list to the `--jars` argument. You can also add dependencies
+(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. SonaType)
can be passed to the `--repositories` argument. For example, to run `bin/spark-shell` on exactly
four cores, use:
@@ -217,7 +217,7 @@ context connects to using the `--master` argument, and you can add Python .zip,
to the runtime path by passing a comma-separated list to `--py-files`. You can also add dependencies
(e.g. Spark Packages) to your shell session by supplying a comma-separated list of maven coordinates
to the `--packages` argument. Any additional repositories where dependencies might exist (e.g. SonaType)
-can be passed to the `--repositories` argument. Any python dependencies a Spark Package has (listed in
+can be passed to the `--repositories` argument. Any python dependencies a Spark Package has (listed in
the requirements.txt of that package) must be manually installed using pip when necessary.
For example, to run `bin/pyspark` on exactly four cores, use:
@@ -249,8 +249,8 @@ the [IPython Notebook](http://ipython.org/notebook.html) with PyLab plot support
$ PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook" ./bin/pyspark
{% endhighlight %}
-After the IPython Notebook server is launched, you can create a new "Python 2" notebook from
-the "Files" tab. Inside the notebook, you can input the command `%pylab inline` as part of
+After the IPython Notebook server is launched, you can create a new "Python 2" notebook from
+the "Files" tab. Inside the notebook, you can input the command `%pylab inline` as part of
your notebook before you start to try Spark from the IPython notebook.
</div>
@@ -418,9 +418,9 @@ Apart from text files, Spark's Python API also supports several other data forma
**Writable Support**
-PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts Writables to base Java types, and pickles the
-resulting Java objects using [Pyrolite](https://github.com/irmen/Pyrolite/). When saving an RDD of key-value pairs to SequenceFile,
-PySpark does the reverse. It unpickles Python objects into Java objects and then converts them to Writables. The following
+PySpark SequenceFile support loads an RDD of key-value pairs within Java, converts Writables to base Java types, and pickles the
+resulting Java objects using [Pyrolite](https://github.com/irmen/Pyrolite/). When saving an RDD of key-value pairs to SequenceFile,
+PySpark does the reverse. It unpickles Python objects into Java objects and then converts them to Writables. The following
Writables are automatically converted:
<table class="table">
@@ -435,9 +435,9 @@ Writables are automatically converted:
<tr><td>MapWritable</td><td>dict</td></tr>
</table>
-Arrays are not handled out-of-the-box. Users need to specify custom `ArrayWritable` subtypes when reading or writing. When writing,
-users also need to specify custom converters that convert arrays to custom `ArrayWritable` subtypes. When reading, the default
-converter will convert custom `ArrayWritable` subtypes to Java `Object[]`, which then get pickled to Python tuples. To get
+Arrays are not handled out-of-the-box. Users need to specify custom `ArrayWritable` subtypes when reading or writing. When writing,
+users also need to specify custom converters that convert arrays to custom `ArrayWritable` subtypes. When reading, the default
+converter will convert custom `ArrayWritable` subtypes to Java `Object[]`, which then get pickled to Python tuples. To get
Python `array.array` for arrays of primitive types, users need to specify custom converters.
**Saving and Loading SequenceFiles**
@@ -454,7 +454,7 @@ classes can be specified, but for standard Writables this is not required.
**Saving and Loading Other Hadoop Input/Output Formats**
-PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, for both 'new' and 'old' Hadoop MapReduce APIs.
+PySpark can also read any Hadoop InputFormat or write any Hadoop OutputFormat, for both 'new' and 'old' Hadoop MapReduce APIs.
If required, a Hadoop configuration can be passed in as a Python dict. Here is an example using the
Elasticsearch ESInputFormat:
@@ -474,15 +474,15 @@ Note that, if the InputFormat simply depends on a Hadoop configuration and/or in
the key and value classes can easily be converted according to the above table,
then this approach should work well for such cases.
-If you have custom serialized binary data (such as loading data from Cassandra / HBase), then you will first need to
+If you have custom serialized binary data (such as loading data from Cassandra / HBase), then you will first need to
transform that data on the Scala/Java side to something which can be handled by Pyrolite's pickler.
-A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) trait is provided
-for this. Simply extend this trait and implement your transformation code in the ```convert```
-method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, are packaged into your Spark job jar and included on the PySpark
+A [Converter](api/scala/index.html#org.apache.spark.api.python.Converter) trait is provided
+for this. Simply extend this trait and implement your transformation code in the ```convert```
+method. Remember to ensure that this class, along with any dependencies required to access your ```InputFormat```, are packaged into your Spark job jar and included on the PySpark
classpath.
-See the [Python examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python) and
-the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters)
+See the [Python examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python) and
+the [Converter examples]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples/pythonconverters)
for examples of using Cassandra / HBase ```InputFormat``` and ```OutputFormat``` with custom converters.
</div>
@@ -758,7 +758,7 @@ One of the harder things about Spark is understanding the scope and life cycle o
#### Example
-Consider the naive RDD element sum below, which behaves completely differently depending on whether execution is happening within the same JVM. A common example of this is when running Spark in `local` mode (`--master = local[n]`) versus deploying a Spark application to a cluster (e.g. via spark-submit to YARN):
+Consider the naive RDD element sum below, which behaves completely differently depending on whether execution is happening within the same JVM. A common example of this is when running Spark in `local` mode (`--master = local[n]`) versus deploying a Spark application to a cluster (e.g. via spark-submit to YARN):
<div class="codetabs">
@@ -777,7 +777,7 @@ println("Counter value: " + counter)
<div data-lang="java" markdown="1">
{% highlight java %}
int counter = 0;
-JavaRDD<Integer> rdd = sc.parallelize(data);
+JavaRDD<Integer> rdd = sc.parallelize(data);
// Wrong: Don't do this!!
rdd.foreach(x -> counter += x);
@@ -803,7 +803,7 @@ print("Counter value: " + counter)
#### Local vs. cluster modes
-The primary challenge is that the behavior of the above code is undefined. In local mode with a single JVM, the above code will sum the values within the RDD and store it in **counter**. This is because both the RDD and the variable **counter** are in the same memory space on the driver node.
+The primary challenge is that the behavior of the above code is undefined. In local mode with a single JVM, the above code will sum the values within the RDD and store it in **counter**. This is because both the RDD and the variable **counter** are in the same memory space on the driver node.
However, in `cluster` mode, what happens is more complicated, and the above may not work as intended. To execute jobs, Spark breaks up the processing of RDD operations into tasks - each of which is operated on by an executor. Prior to execution, Spark computes the **closure**. The closure is those variables and methods which must be visible for the executor to perform its computations on the RDD (in this case `foreach()`). This closure is serialized and sent to each executor. In `local` mode, there is only one executor so everything shares the same closure. In other modes, however, this is not the case and the executors running on separate worker nodes each have their own copy of the closure.
@@ -813,9 +813,9 @@ To ensure well-defined behavior in these sorts of scenarios one should use an [`
In general, closures - constructs like loops or locally defined methods, should not be used to mutate some global state. Spark does not define or guarantee the behavior of mutations to objects referenced from outside of closures. Some code that does this may work in local mode, but that's just by accident and such code will not behave as expected in distributed mode. Use an Accumulator instead if some global aggregation is needed.
-#### Printing elements of an RDD
+#### Printing elements of an RDD
Another common idiom is attempting to print out the elements of an RDD using `rdd.foreach(println)` or `rdd.map(println)`. On a single machine, this will generate the expected output and print all the RDD's elements. However, in `cluster` mode, the output to `stdout` being called by the executors is now writing to the executor's `stdout` instead, not the one on the driver, so `stdout` on the driver won't show these! To print all elements on the driver, one can use the `collect()` method to first bring the RDD to the driver node thus: `rdd.collect().foreach(println)`. This can cause the driver to run out of memory, though, because `collect()` fetches the entire RDD to a single machine; if you only need to print a few elements of the RDD, a safer approach is to use the `take()`: `rdd.take(100).foreach(println)`.
-
+
### Working with Key-Value Pairs
<div class="codetabs">
@@ -859,7 +859,7 @@ only available on RDDs of key-value pairs.
The most common ones are distributed "shuffle" operations, such as grouping or aggregating the elements
by a key.
-In Java, key-value pairs are represented using the
+In Java, key-value pairs are represented using the
[scala.Tuple2](http://www.scala-lang.org/api/{{site.SCALA_VERSION}}/index.html#scala.Tuple2) class
from the Scala standard library. You can simply call `new Tuple2(a, b)` to create a tuple, and access
its fields later with `tuple._1()` and `tuple._2()`.
@@ -974,7 +974,7 @@ for details.
<td> <b>groupByKey</b>([<i>numTasks</i>]) <a name="GroupByLink"></a> </td>
<td> When called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable&lt;V&gt;) pairs. <br />
<b>Note:</b> If you are grouping in order to perform an aggregation (such as a sum or
- average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
+ average) over each key, using <code>reduceByKey</code> or <code>aggregateByKey</code> will yield much better
performance.
<br />
<b>Note:</b> By default, the level of parallelism in the output depends on the number of partitions of the parent RDD.
@@ -1025,7 +1025,7 @@ for details.
<tr>
<td> <b>repartitionAndSortWithinPartitions</b>(<i>partitioner</i>) <a name="Repartition2Link"></a></td>
<td> Repartition the RDD according to the given partitioner and, within each resulting partition,
- sort records by their keys. This is more efficient than calling <code>repartition</code> and then sorting within
+ sort records by their keys. This is more efficient than calling <code>repartition</code> and then sorting within
each partition because it can push the sorting down into the shuffle machinery. </td>
</tr>
</table>
@@ -1038,7 +1038,7 @@ RDD API doc
[Java](api/java/index.html?org/apache/spark/api/java/JavaRDD.html),
[Python](api/python/pyspark.html#pyspark.RDD),
[R](api/R/index.html))
-
+
and pair RDD functions doc
([Scala](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions),
[Java](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html))
@@ -1094,7 +1094,7 @@ for details.
</tr>
<tr>
<td> <b>foreach</b>(<i>func</i>) </td>
- <td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an <a href="#AccumLink">Accumulator</a> or interacting with external storage systems.
+ <td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an <a href="#AccumLink">Accumulator</a> or interacting with external storage systems.
<br /><b>Note</b>: modifying variables other than Accumulators outside of the <code>foreach()</code> may result in undefined behavior. See <a href="#ClosuresLink">Understanding closures </a> for more details.</td>
</tr>
</table>
@@ -1118,13 +1118,13 @@ co-located to compute the result.
In Spark, data is generally not distributed across partitions to be in the necessary place for a
specific operation. During computations, a single task will operate on a single partition - thus, to
organize all the data for a single `reduceByKey` reduce task to execute, Spark needs to perform an
-all-to-all operation. It must read from all partitions to find all the values for all keys,
-and then bring together values across partitions to compute the final result for each key -
+all-to-all operation. It must read from all partitions to find all the values for all keys,
+and then bring together values across partitions to compute the final result for each key -
this is called the **shuffle**.
Although the set of elements in each partition of newly shuffled data will be deterministic, and so
-is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably
-ordered data following shuffle then it's possible to use:
+is the ordering of partitions themselves, the ordering of these elements is not. If one desires predictably
+ordered data following shuffle then it's possible to use:
* `mapPartitions` to sort each partition using, for example, `.sorted`
* `repartitionAndSortWithinPartitions` to efficiently sort partitions while simultaneously repartitioning
@@ -1141,26 +1141,26 @@ network I/O. To organize data for the shuffle, Spark generates sets of tasks - *
organize the data, and a set of *reduce* tasks to aggregate it. This nomenclature comes from
MapReduce and does not directly relate to Spark's `map` and `reduce` operations.
-Internally, results from individual map tasks are kept in memory until they can't fit. Then, these
-are sorted based on the target partition and written to a single file. On the reduce side, tasks
+Internally, results from individual map tasks are kept in memory until they can't fit. Then, these
+are sorted based on the target partition and written to a single file. On the reduce side, tasks
read the relevant sorted blocks.
-
-Certain shuffle operations can consume significant amounts of heap memory since they employ
-in-memory data structures to organize records before or after transferring them. Specifically,
-`reduceByKey` and `aggregateByKey` create these structures on the map side, and `'ByKey` operations
-generate these on the reduce side. When data does not fit in memory Spark will spill these tables
+
+Certain shuffle operations can consume significant amounts of heap memory since they employ
+in-memory data structures to organize records before or after transferring them. Specifically,
+`reduceByKey` and `aggregateByKey` create these structures on the map side, and `'ByKey` operations
+generate these on the reduce side. When data does not fit in memory Spark will spill these tables
to disk, incurring the additional overhead of disk I/O and increased garbage collection.
Shuffle also generates a large number of intermediate files on disk. As of Spark 1.3, these files
-are preserved until the corresponding RDDs are no longer used and are garbage collected.
-This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
-Garbage collection may happen only after a long period time, if the application retains references
-to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
+are preserved until the corresponding RDDs are no longer used and are garbage collected.
+This is done so the shuffle files don't need to be re-created if the lineage is re-computed.
+Garbage collection may happen only after a long period time, if the application retains references
+to these RDDs or if GC does not kick in frequently. This means that long-running Spark jobs may
consume a large amount of disk space. The temporary storage directory is specified by the
`spark.local.dir` configuration parameter when configuring the Spark context.
Shuffle behavior can be tuned by adjusting a variety of configuration parameters. See the
-'Shuffle Behavior' section within the [Spark Configuration Guide](configuration.html).
+'Shuffle Behavior' section within the [Spark Configuration Guide](configuration.html).
## RDD Persistence
@@ -1246,7 +1246,7 @@ efficiency. We recommend going through the following process to select one:
This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library](tuning.html) to
-make the objects much more space-efficient, but still reasonably fast to access.
+make the objects much more space-efficient, but still reasonably fast to access.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter
a large amount of the data. Otherwise, recomputing a partition may be as fast as reading it from
@@ -1345,7 +1345,7 @@ Accumulators are variables that are only "added" to through an associative opera
therefore be efficiently supported in parallel. They can be used to implement counters (as in
MapReduce) or sums. Spark natively supports accumulators of numeric types, and programmers
can add support for new types. If accumulators are created with a name, they will be
-displayed in Spark's UI. This can be useful for understanding the progress of
+displayed in Spark's UI. This can be useful for understanding the progress of
running stages (NOTE: this is not yet supported in Python).
An accumulator is created from an initial value `v` by calling `SparkContext.accumulator(v)`. Tasks
@@ -1474,8 +1474,8 @@ vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
</div>
-For accumulator updates performed inside <b>actions only</b>, Spark guarantees that each task's update to the accumulator
-will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware
+For accumulator updates performed inside <b>actions only</b>, Spark guarantees that each task's update to the accumulator
+will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware
of that each task's update may be applied more than once if tasks or job stages are re-executed.
Accumulators do not change the lazy evaluation model of Spark. If they are being updated within an operation on an RDD, their value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like `map()`. The below code fragment demonstrates this property:
@@ -1486,7 +1486,7 @@ Accumulators do not change the lazy evaluation model of Spark. If they are being
{% highlight scala %}
val accum = sc.accumulator(0)
data.map { x => accum += x; f(x) }
-// Here, accum is still 0 because no actions have caused the `map` to be computed.
+// Here, accum is still 0 because no actions have caused the map to be computed.
{% endhighlight %}
</div>
@@ -1553,7 +1553,7 @@ Several changes were made to the Java API:
code that `extends Function` should `implement Function` instead.
* New variants of the `map` transformations, like `mapToPair` and `mapToDouble`, were added to create RDDs
of special data types.
-* Grouping operations like `groupByKey`, `cogroup` and `join` have changed from returning
+* Grouping operations like `groupByKey`, `cogroup` and `join` have changed from returning
`(Key, List<Value>)` pairs to `(Key, Iterable<Value>)`.
</div>
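To make the closure and accumulator hunks above concrete, here is a small sketch (not part of the patch, local mode only) that contrasts the broken captured-counter pattern with an accumulator updated inside an action.
{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("closure-example").setMaster("local[2]"))
val rdd = sc.parallelize(1 to 5)

// Wrong: `counter` is captured in the closure; executors mutate their own copies.
var counter = 0
rdd.foreach(x => counter += x)
println("Captured counter: " + counter)        // may remain 0 when running on a cluster

// Right: an accumulator updated inside an action is applied exactly once per task.
val accum = sc.accumulator(0)
rdd.foreach(x => accum += x)                   // foreach is an action, so the updates are guaranteed
println("Accumulator value: " + accum.value)   // 15

// Printing a handful of elements safely on the driver:
rdd.take(3).foreach(println)
sc.stop()
{% endhighlight %}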
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index 330c159c67..460a66f37d 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -245,7 +245,7 @@ See the [configuration page](configuration.html) for information on Spark config
<td><code>spark.mesos.coarse</code></td>
<td>false</td>
<td>
- If set to "true", runs over Mesos clusters in
+ If set to <code>true</code>, runs over Mesos clusters in
<a href="running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>,
where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per
Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use
@@ -254,16 +254,16 @@ See the [configuration page](configuration.html) for information on Spark config
</tr>
<tr>
<td><code>spark.mesos.extra.cores</code></td>
- <td>0</td>
+ <td><code>0</code></td>
<td>
Set the extra amount of cpus to request per task. This setting is only used for Mesos coarse grain mode.
The total amount of cores requested per task is the number of cores in the offer plus the extra cores configured.
- Note that total amount of cores the executor will request in total will not exceed the spark.cores.max setting.
+ Note that the total amount of cores the executor will request will not exceed the <code>spark.cores.max</code> setting.
</td>
</tr>
<tr>
<td><code>spark.mesos.mesosExecutor.cores</code></td>
- <td>1.0</td>
+ <td><code>1.0</code></td>
<td>
(Fine-grained mode only) Number of cores to give each Mesos executor. This does not
include the cores used to run the Spark tasks. In other words, even if no Spark task
@@ -287,7 +287,7 @@ See the [configuration page](configuration.html) for information on Spark config
<td>
Set the list of volumes which will be mounted into the Docker image, which was set using
<code>spark.mesos.executor.docker.image</code>. The format of this property is a comma-separated list of
- mappings following the form passed to <tt>docker run -v</tt>. That is they take the form:
+ mappings following the form passed to <code>docker run -v</code>. That is they take the form:
<pre>[host_path:]container_path[:ro|:rw]</pre>
</td>
@@ -318,7 +318,7 @@ See the [configuration page](configuration.html) for information on Spark config
<td>executor memory * 0.10, with minimum of 384</td>
<td>
The amount of additional memory, specified in MB, to be allocated per executor. By default,
- the overhead will be larger of either 384 or 10% of `spark.executor.memory`. If it's set,
+ the overhead will be larger of either 384 or 10% of <code>spark.executor.memory</code>. If set,
the final overhead will be this value.
</td>
</tr>
@@ -339,7 +339,7 @@ See the [configuration page](configuration.html) for information on Spark config
</tr>
<tr>
<td><code>spark.mesos.secret</code></td>
- <td>(none)/td>
+ <td>(none)</td>
<td>
Set the secret with which Spark framework will use to authenticate with Mesos.
</td>
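As a hedged illustration of the Mesos properties reformatted above: the master URL and all values below are placeholders, not recommendations.
{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}

// Property names come from the table above; the Mesos master URL is a placeholder.
val conf = new SparkConf()
  .setAppName("mesos-coarse-example")
  .setMaster("mesos://host:5050")
  .set("spark.mesos.coarse", "true")        // coarse-grained mode: one long-lived Mesos task per machine
  .set("spark.mesos.extra.cores", "1")      // extra cores requested per task (coarse-grained mode only)
  .set("spark.cores.max", "8")              // upper bound on the cores the application will acquire

val sc = new SparkContext(conf)
sc.stop()
{% endhighlight %}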
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 3a961d245f..0e25ccf512 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -23,7 +23,7 @@ Unlike [Spark standalone](spark-standalone.html) and [Mesos](running-on-mesos.ht
To launch a Spark application in `yarn-cluster` mode:
$ ./bin/spark-submit --class path.to.your.Class --master yarn-cluster [options] <app jar> [app options]
-
+
For example:
$ ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
@@ -43,7 +43,7 @@ To launch a Spark application in `yarn-client` mode, do the same, but replace `y
## Adding Other JARs
-In `yarn-cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command.
+In `yarn-cluster` mode, the driver runs on a different machine than the client, so `SparkContext.addJar` won't work out of the box with files that are local to the client. To make files on the client available to `SparkContext.addJar`, include them with the `--jars` option in the launch command.
$ ./bin/spark-submit --class my.main.Class \
--master yarn-cluster \
@@ -64,16 +64,16 @@ Most of the configs are the same for Spark on YARN as for other deployment modes
# Debugging your Application
-In YARN terminology, executors and application masters run inside "containers". YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the `yarn.log-aggregation-enable` config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the "yarn logs" command.
+In YARN terminology, executors and application masters run inside "containers". YARN has two modes for handling container logs after an application has completed. If log aggregation is turned on (with the `yarn.log-aggregation-enable` config), container logs are copied to HDFS and deleted on the local machine. These logs can be viewed from anywhere on the cluster with the `yarn logs` command.
yarn logs -applicationId <app ID>
-
+
will print out the contents of all log files from all containers from the given application. You can also view the container log files directly in HDFS using the HDFS shell or API. The directory where they are located can be found by looking at your YARN configs (`yarn.nodemanager.remote-app-log-dir` and `yarn.nodemanager.remote-app-log-dir-suffix`). The logs are also available on the Spark Web UI under the Executors Tab. You need to have both the Spark history server and the MapReduce history server running and configure `yarn.log.server.url` in `yarn-site.xml` properly. The log URL on the Spark history server UI will redirect you to the MapReduce history server to show the aggregated logs.
When log aggregation isn't turned on, logs are retained locally on each machine under `YARN_APP_LOGS_DIR`, which is usually configured to `/tmp/logs` or `$HADOOP_HOME/logs/userlogs` depending on the Hadoop version and installation. Viewing logs for a container requires going to the host that contains them and looking in this directory. Subdirectories organize log files by application ID and container ID. The logs are also available on the Spark Web UI under the Executors Tab and do not require running the MapReduce history server.
To review per-container launch environment, increase `yarn.nodemanager.delete.debug-delay-sec` to a
-large value (e.g. 36000), and then access the application cache through `yarn.nodemanager.local-dirs`
+large value (e.g. `36000`), and then access the application cache through `yarn.nodemanager.local-dirs`
on the nodes on which containers are launched. This directory contains the launch script, JARs, and
all environment variables used for launching each container. This process is useful for debugging
classpath problems in particular. (Note that enabling this requires admin privileges on cluster
@@ -92,7 +92,7 @@ Note that for the first option, both executors and the application master will s
log4j configuration, which may cause issues when they run on the same node (e.g. trying to write
to the same log file).
-If you need a reference to the proper location to put log files in the YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your log4j.properties. For example, `log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`. For streaming application, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log file, and logs can be accessed using YARN's log utility.
+If you need a reference to the proper location to put log files in YARN so that YARN can properly display and aggregate them, use `spark.yarn.app.container.log.dir` in your `log4j.properties`. For example, `log4j.appender.file_appender.File=${spark.yarn.app.container.log.dir}/spark.log`. For streaming applications, configuring `RollingFileAppender` and setting file location to YARN's log directory will avoid disk overflow caused by large log files, and logs can be accessed using YARN's log utility.
#### Spark Properties
@@ -100,24 +100,26 @@ If you need a reference to the proper location to put log files in the YARN so t
<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
<tr>
<td><code>spark.yarn.am.memory</code></td>
- <td>512m</td>
+ <td><code>512m</code></td>
<td>
Amount of memory to use for the YARN Application Master in client mode, in the same format as JVM memory strings (e.g. <code>512m</code>, <code>2g</code>).
In cluster mode, use <code>spark.driver.memory</code> instead.
+ <p/>
+ Use lower-case suffixes, e.g. <code>k</code>, <code>m</code>, <code>g</code>, <code>t</code>, and <code>p</code>, for kibi-, mebi-, gibi-, tebi-, and pebibytes, respectively.
</td>
</tr>
<tr>
<td><code>spark.driver.cores</code></td>
- <td>1</td>
+ <td><code>1</code></td>
<td>
Number of cores used by the driver in YARN cluster mode.
- Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN AM.
- In client mode, use <code>spark.yarn.am.cores</code> to control the number of cores used by the YARN AM instead.
+ Since the driver is run in the same JVM as the YARN Application Master in cluster mode, this also controls the cores used by the YARN Application Master.
+ In client mode, use <code>spark.yarn.am.cores</code> to control the number of cores used by the YARN Application Master instead.
</td>
</tr>
<tr>
<td><code>spark.yarn.am.cores</code></td>
- <td>1</td>
+ <td><code>1</code></td>
<td>
Number of cores to use for the YARN Application Master in client mode.
In cluster mode, use <code>spark.driver.cores</code> instead.
@@ -125,39 +127,39 @@ If you need a reference to the proper location to put log files in the YARN so t
</tr>
<tr>
<td><code>spark.yarn.am.waitTime</code></td>
- <td>100s</td>
+ <td><code>100s</code></td>
<td>
- In `yarn-cluster` mode, time for the application master to wait for the
- SparkContext to be initialized. In `yarn-client` mode, time for the application master to wait
+ In <code>yarn-cluster</code> mode, time for the YARN Application Master to wait for the
+ SparkContext to be initialized. In <code>yarn-client</code> mode, time for the YARN Application Master to wait
for the driver to connect to it.
</td>
</tr>
<tr>
<td><code>spark.yarn.submit.file.replication</code></td>
- <td>The default HDFS replication (usually 3)</td>
+ <td>The default HDFS replication (usually <code>3</code>)</td>
<td>
HDFS replication level for the files uploaded into HDFS for the application. These include things like the Spark jar, the app jar, and any distributed cache files/archives.
</td>
</tr>
<tr>
<td><code>spark.yarn.preserve.staging.files</code></td>
- <td>false</td>
+ <td><code>false</code></td>
<td>
- Set to true to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
+ Set to <code>true</code> to preserve the staged files (Spark jar, app jar, distributed cache files) at the end of the job rather than delete them.
</td>
</tr>
<tr>
<td><code>spark.yarn.scheduler.heartbeat.interval-ms</code></td>
- <td>3000</td>
+ <td><code>3000</code></td>
<td>
The interval in ms in which the Spark application master heartbeats into the YARN ResourceManager.
- The value is capped at half the value of YARN's configuration for the expiry interval
- (<code>yarn.am.liveness-monitor.expiry-interval-ms</code>).
+ The value is capped at half the value of YARN's configuration for the expiry interval, i.e.
+ <code>yarn.am.liveness-monitor.expiry-interval-ms</code>.
</td>
</tr>
<tr>
<td><code>spark.yarn.scheduler.initial-allocation.interval</code></td>
- <td>200ms</td>
+ <td><code>200ms</code></td>
<td>
The initial interval in which the Spark application master eagerly heartbeats to the YARN ResourceManager
when there are pending container allocation requests. It should be no larger than
@@ -177,8 +179,8 @@ If you need a reference to the proper location to put log files in the YARN so t
<td><code>spark.yarn.historyServer.address</code></td>
<td>(none)</td>
<td>
- The address of the Spark history server (i.e. host.com:18080). The address should not contain a scheme (http://). Defaults to not being set since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI.
- For this property, YARN properties can be used as variables, and these are substituted by Spark at runtime. For eg, if the Spark history server runs on the same node as the YARN ResourceManager, it can be set to `${hadoopconf-yarn.resourcemanager.hostname}:18080`.
+ The address of the Spark history server, e.g. <code>host.com:18080</code>. The address should not contain a scheme (<code>http://</code>). Defaults to not being set since the history server is an optional service. This address is given to the YARN ResourceManager when the Spark application finishes to link the application from the ResourceManager UI to the Spark history server UI.
+ For this property, YARN properties can be used as variables, and these are substituted by Spark at runtime. For example, if the Spark history server runs on the same node as the YARN ResourceManager, it can be set to <code>${hadoopconf-yarn.resourcemanager.hostname}:18080</code>.
</td>
</tr>
<tr>
@@ -197,42 +199,42 @@ If you need a reference to the proper location to put log files in the YARN so t
</tr>
<tr>
<td><code>spark.executor.instances</code></td>
- <td>2</td>
+ <td><code>2</code></td>
<td>
- The number of executors. Note that this property is incompatible with <code>spark.dynamicAllocation.enabled</code>. If both <code>spark.dynamicAllocation.enabled</code> and <code>spark.executor.instances</code> are specified, dynamic allocation is turned off and the specified number of <code>spark.executor.instances</code> is used.
+ The number of executors. Note that this property is incompatible with <code>spark.dynamicAllocation.enabled</code>. If both <code>spark.dynamicAllocation.enabled</code> and <code>spark.executor.instances</code> are specified, dynamic allocation is turned off and the specified number of <code>spark.executor.instances</code> is used.
</td>
</tr>
<tr>
<td><code>spark.yarn.executor.memoryOverhead</code></td>
<td>executorMemory * 0.10, with minimum of 384 </td>
<td>
- The amount of off heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
+ The amount of off-heap memory (in megabytes) to be allocated per executor. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the executor size (typically 6-10%).
</td>
</tr>
<tr>
<td><code>spark.yarn.driver.memoryOverhead</code></td>
<td>driverMemory * 0.10, with minimum of 384 </td>
<td>
- The amount of off heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%).
+ The amount of off-heap memory (in megabytes) to be allocated per driver in cluster mode. This is memory that accounts for things like VM overheads, interned strings, other native overheads, etc. This tends to grow with the container size (typically 6-10%).
</td>
</tr>
<tr>
<td><code>spark.yarn.am.memoryOverhead</code></td>
<td>AM memory * 0.10, with minimum of 384 </td>
<td>
- Same as <code>spark.yarn.driver.memoryOverhead</code>, but for the Application Master in client mode.
+ Same as <code>spark.yarn.driver.memoryOverhead</code>, but for the YARN Application Master in client mode.
</td>
</tr>
<tr>
<td><code>spark.yarn.am.port</code></td>
<td>(random)</td>
<td>
- Port for the YARN Application Master to listen on. In YARN client mode, this is used to communicate between the Spark driver running on a gateway and the Application Master running on YARN. In YARN cluster mode, this is used for the dynamic executor feature, where it handles the kill from the scheduler backend.
+ Port for the YARN Application Master to listen on. In YARN client mode, this is used to communicate between the Spark driver running on a gateway and the YARN Application Master running on YARN. In YARN cluster mode, this is used for the dynamic executor feature, where it handles the kill from the scheduler backend.
</td>
</tr>
<tr>
<td><code>spark.yarn.queue</code></td>
- <td>default</td>
+ <td><code>default</code></td>
<td>
The name of the YARN queue to which the application is submitted.
</td>
@@ -245,18 +247,18 @@ If you need a reference to the proper location to put log files in the YARN so t
By default, Spark on YARN will use a Spark jar installed locally, but the Spark jar can also be
in a world-readable location on HDFS. This allows YARN to cache it on nodes so that it doesn't
need to be distributed each time an application runs. To point to a jar on HDFS, for example,
- set this configuration to "hdfs:///some/path".
+ set this configuration to <code>hdfs:///some/path</code>.
</td>
</tr>
<tr>
<td><code>spark.yarn.access.namenodes</code></td>
<td>(none)</td>
<td>
- A list of secure HDFS namenodes your Spark application is going to access. For
- example, `spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032`.
- The Spark application must have acess to the namenodes listed and Kerberos must
- be properly configured to be able to access them (either in the same realm or in
- a trusted realm). Spark acquires security tokens for each of the namenodes so that
+ A comma-separated list of secure HDFS namenodes your Spark application is going to access. For
+ example, <code>spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032</code>.
+ The Spark application must have access to the namenodes listed and Kerberos must
+ be properly configured to be able to access them (either in the same realm or in
+ a trusted realm). Spark acquires security tokens for each of the namenodes so that
the Spark application can access those remote HDFS clusters.
</td>
</tr>
@@ -264,18 +266,18 @@ If you need a reference to the proper location to put log files in the YARN so t
<td><code>spark.yarn.appMasterEnv.[EnvironmentVariableName]</code></td>
<td>(none)</td>
<td>
- Add the environment variable specified by <code>EnvironmentVariableName</code> to the
- Application Master process launched on YARN. The user can specify multiple of
- these and to set multiple environment variables. In `yarn-cluster` mode this controls
- the environment of the SPARK driver and in `yarn-client` mode it only controls
- the environment of the executor launcher.
+ Add the environment variable specified by <code>EnvironmentVariableName</code> to the
+ Application Master process launched on YARN. The user can specify multiple of
+ these to set multiple environment variables. In <code>yarn-cluster</code> mode this controls
+ the environment of the Spark driver and in <code>yarn-client</code> mode it only controls
+ the environment of the executor launcher.
</td>
</tr>
<tr>
<td><code>spark.yarn.containerLauncherMaxThreads</code></td>
- <td>25</td>
+ <td><code>25</code></td>
<td>
- The maximum number of threads to use in the application master for launching executor containers.
+ The maximum number of threads to use in the YARN Application Master for launching executor containers.
</td>
</tr>
<tr>
@@ -283,19 +285,19 @@ If you need a reference to the proper location to put log files in the YARN so t
<td>(none)</td>
<td>
A string of extra JVM options to pass to the YARN Application Master in client mode.
- In cluster mode, use `spark.driver.extraJavaOptions` instead.
+ In cluster mode, use <code>spark.driver.extraJavaOptions</code> instead.
</td>
</tr>
<tr>
<td><code>spark.yarn.am.extraLibraryPath</code></td>
<td>(none)</td>
<td>
- Set a special library path to use when launching the application master in client mode.
+ Set a special library path to use when launching the YARN Application Master in client mode.
</td>
</tr>
<tr>
<td><code>spark.yarn.maxAppAttempts</code></td>
- <td>yarn.resourcemanager.am.max-attempts in YARN</td>
+ <td><code>yarn.resourcemanager.am.max-attempts</code> in YARN</td>
<td>
The maximum number of attempts that will be made to submit the application.
It should be no larger than the global number of max attempts in the YARN configuration.
@@ -303,10 +305,10 @@ If you need a reference to the proper location to put log files in the YARN so t
</tr>
<tr>
<td><code>spark.yarn.submit.waitAppCompletion</code></td>
- <td>true</td>
+ <td><code>true</code></td>
<td>
In YARN cluster mode, controls whether the client waits to exit until the application completes.
- If set to true, the client process will stay alive reporting the application's status.
+ If set to <code>true</code>, the client process will stay alive reporting the application's status.
Otherwise, the client process will exit after submission.
</td>
</tr>
@@ -332,7 +334,7 @@ If you need a reference to the proper location to put log files in the YARN so t
<td>(none)</td>
<td>
The full path to the file that contains the keytab for the principal specified above.
- This keytab will be copied to the node running the Application Master via the Secure Distributed Cache,
+ This keytab will be copied to the node running the YARN Application Master via the Secure Distributed Cache,
for renewing the login tickets and the delegation tokens periodically.
</td>
</tr>
@@ -371,14 +373,14 @@ If you need a reference to the proper location to put log files in the YARN so t
</tr>
<tr>
<td><code>spark.yarn.security.tokens.${service}.enabled</code></td>
- <td>true</td>
+ <td><code>true</code></td>
<td>
Controls whether to retrieve delegation tokens for non-HDFS services when security is enabled.
By default, delegation tokens for all supported services are retrieved when those services are
configured, but it's possible to disable that behavior if it somehow conflicts with the
application being run.
<p/>
- Currently supported services are: hive, hbase
+ Currently supported services are: <code>hive</code>, <code>hbase</code>
</td>
</tr>
</table>
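As a purely illustrative sketch of how the properties in this table can be combined (the queue name, namenode addresses, environment variable, and token setting are placeholder values, not recommendations), each key can also be passed with `--conf` at submission time:

{% highlight bash %}
# Placeholder values; shown only to illustrate passing YARN properties via --conf
./bin/spark-submit \
  --class org.apache.spark.examples.SparkPi \
  --master yarn-cluster \
  --conf spark.yarn.queue=myqueue \
  --conf spark.yarn.access.namenodes=hdfs://nn1.com:8032,hdfs://nn2.com:8032 \
  --conf spark.yarn.appMasterEnv.JAVA_HOME=/usr/lib/jvm/java-8 \
  --conf spark.yarn.submit.waitAppCompletion=false \
  --conf spark.yarn.security.tokens.hive.enabled=false \
  /path/to/examples.jar
{% endhighlight %}

The same keys can equally be placed in `conf/spark-defaults.conf` instead of the command line.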
@@ -387,5 +389,5 @@ If you need a reference to the proper location to put log files in the YARN so t
- Whether core requests are honored in scheduling decisions depends on which scheduler is in use and how it is configured.
- In `yarn-cluster` mode, the local directories used by the Spark executors and the Spark driver will be the local directories configured for YARN (Hadoop YARN config `yarn.nodemanager.local-dirs`). If the user specifies `spark.local.dir`, it will be ignored. In `yarn-client` mode, the Spark executors will use the local directories configured for YARN while the Spark driver will use those defined in `spark.local.dir`. This is because the Spark driver does not run on the YARN cluster in `yarn-client` mode, only the Spark executors do.
-- The `--files` and `--archives` options support specifying file names with the # similar to Hadoop. For example you can specify: `--files localtest.txt#appSees.txt` and this will upload the file you have locally named localtest.txt into HDFS but this will be linked to by the name `appSees.txt`, and your application should use the name as `appSees.txt` to reference it when running on YARN.
+- The `--files` and `--archives` options support specifying file names with the `#` syntax, similar to Hadoop. For example, you can specify `--files localtest.txt#appSees.txt`; this uploads the file you have locally named `localtest.txt` into HDFS, but it is linked to by the name `appSees.txt`, and your application should use the name `appSees.txt` to reference it when running on YARN (see the sketch after these notes).
- The `--jars` option allows the `SparkContext.addJar` function to work if you are using it with local files and running in `yarn-cluster` mode. It does not need to be used if you are using it with HDFS, HTTP, HTTPS, or FTP files.
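As a minimal sketch of the `--files` renaming behaviour described above (the application class and jar names are hypothetical):

{% highlight bash %}
# localtest.txt is uploaded to HDFS and exposed to the application under the link name appSees.txt
./bin/spark-submit \
  --class org.example.MyApp \
  --master yarn-cluster \
  --files localtest.txt#appSees.txt \
  --jars my-local-dependency.jar \
  my-app.jar
{% endhighlight %}

Inside the application, the file is then opened simply as `appSees.txt`.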
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 7ae9244c27..a1cbc7de97 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -1676,7 +1676,7 @@ results <- collect(sql(sqlContext, "FROM src SELECT key, value"))
### Interacting with Different Versions of Hive Metastore
One of the most important pieces of Spark SQL's Hive support is interaction with Hive metastore,
-which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
+which enables Spark SQL to access metadata of Hive tables. Starting from Spark 1.4.0, a single binary
build of Spark SQL can be used to query different versions of Hive metastores, using the configuration described below.
Note that independent of the version of Hive that is being used to talk to the metastore, internally Spark SQL
will compile against Hive 1.2.1 and use those classes for internal execution (serdes, UDFs, UDAFs, etc).
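For illustration, assuming the `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` options described below, a session talking to an older metastore might be started like this (the version shown is only an example):

{% highlight bash %}
# Query a 0.13.1 metastore while Spark SQL itself still compiles against Hive 1.2.1
./bin/spark-shell \
  --conf spark.sql.hive.metastore.version=0.13.1 \
  --conf spark.sql.hive.metastore.jars=maven
{% endhighlight %}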
@@ -1706,8 +1706,8 @@ The following options can be used to configure the version of Hive that is used
either <code>1.2.1</code> or not defined.
<li><code>maven</code></li>
Use Hive jars of specified version downloaded from Maven repositories. This configuration
- is not generally recommended for production deployments.
- <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
+ is not generally recommended for production deployments.
+ <li>A classpath in the standard format for the JVM. This classpath must include all of Hive
and its dependencies, including the correct version of Hadoop. These jars only need to be
present on the driver, but if you are running in yarn cluster mode then you must ensure
    they are packaged with your application.</li>
@@ -1806,7 +1806,7 @@ the Data Sources API. The following options are supported:
<div data-lang="scala" markdown="1">
{% highlight scala %}
-val jdbcDF = sqlContext.read.format("jdbc").options(
+val jdbcDF = sqlContext.read.format("jdbc").options(
Map("url" -> "jdbc:postgresql:dbserver",
"dbtable" -> "schema.tablename")).load()
{% endhighlight %}
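One detail worth keeping in mind with this example: the JDBC driver for the target database must be on Spark's classpath. One possible way to do that for the shell (the PostgreSQL driver jar name and version are placeholders) is:

{% highlight bash %}
# The driver jar name/version is a placeholder; use the driver for your database
SPARK_CLASSPATH=postgresql-9.3-1102-jdbc41.jar ./bin/spark-shell
{% endhighlight %}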
@@ -2023,11 +2023,11 @@ options.
- Optimized execution using manually managed memory (Tungsten) is now enabled by default, along with
code generation for expression evaluation. These features can both be disabled by setting
- `spark.sql.tungsten.enabled` to `false.
- - Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
+ `spark.sql.tungsten.enabled` to `false`.
+ - Parquet schema merging is no longer enabled by default. It can be re-enabled by setting
`spark.sql.parquet.mergeSchema` to `true`.
- - Resolution of strings to columns in python now supports using dots (`.`) to qualify the column or
- access nested values. For example `df['table.column.nestedField']`. However, this means that if
+ - Resolution of strings to columns in Python now supports using dots (`.`) to qualify the column or
+   access nested values. For example, `df['table.column.nestedField']`. However, this means that if
your column name contains any dots you must now escape them using backticks (e.g., ``table.`column.with.dots`.nested``).
- In-memory columnar storage partition pruning is on by default. It can be disabled by setting
`spark.sql.inMemoryColumnarStorage.partitionPruning` to `false`.
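As a hedged sketch only, the 1.5 defaults mentioned in these notes can be switched back at launch time; nothing here is required, it merely shows where the switches live:

{% highlight bash %}
# Revert the behaviour changes discussed above (values shown only to illustrate the properties)
./bin/spark-shell \
  --conf spark.sql.tungsten.enabled=false \
  --conf spark.sql.parquet.mergeSchema=true \
  --conf spark.sql.inMemoryColumnarStorage.partitionPruning=false
{% endhighlight %}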
diff --git a/docs/submitting-applications.md b/docs/submitting-applications.md
index 7ea4d6f1a3..915be0f479 100644
--- a/docs/submitting-applications.md
+++ b/docs/submitting-applications.md
@@ -103,7 +103,7 @@ run it with `--help`. Here are a few examples of common options:
export HADOOP_CONF_DIR=XXX
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
- --master yarn-cluster \ # can also be `yarn-client` for client mode
+ --master yarn-cluster \ # can also be yarn-client for client mode
--executor-memory 20G \
--num-executors 50 \
/path/to/examples.jar \
@@ -174,9 +174,9 @@ This can use up a significant amount of space over time and will need to be clea
is handled automatically, and with Spark standalone, automatic cleanup can be configured with the
`spark.worker.cleanup.appDataTtl` property.
-Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates
-with `--packages`. All transitive dependencies will be handled when using this command. Additional
-repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag `--repositories`.
+Users may also include any other dependencies by supplying a comma-delimited list of maven coordinates
+with `--packages`. All transitive dependencies will be handled when using this command. Additional
+repositories (or resolvers in SBT) can be added in a comma-delimited fashion with the flag `--repositories`.
These commands can be used with `pyspark`, `spark-shell`, and `spark-submit` to include Spark Packages.
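For instance, a sketch with purely hypothetical Maven coordinates and repository URL:

{% highlight bash %}
# The coordinates and repository below are placeholders, shown only to illustrate the flags
./bin/spark-shell \
  --packages com.example:example-spark-lib_2.10:1.0.0 \
  --repositories https://repo.example.com/maven
{% endhighlight %}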
For Python, the equivalent `--py-files` option can be used to distribute `.egg`, `.zip` and `.py` libraries