 docs/_config.yml                |  8
 docs/_layouts/global.html       | 32
 docs/api.md                     |  6
 docs/bagel-programming-guide.md |  2
 docs/configuration.md           | 16
 docs/contributing-to-spark.md   |  2
 docs/ec2-scripts.md             |  4
 docs/index.md                   | 26
 docs/java-programming-guide.md  | 26
 docs/quick-start.md             | 14
 docs/running-on-mesos.md        |  4
 docs/scala-programming-guide.md | 24
 docs/spark-standalone.md        |  4
 docs/tuning.md                  | 16
 14 files changed, 95 insertions(+), 89 deletions(-)
diff --git a/docs/_config.yml b/docs/_config.yml
index 8412645ab2..88567403a0 100644
--- a/docs/_config.yml
+++ b/docs/_config.yml
@@ -1,2 +1,8 @@
pygments: true
-markdown: kramdown
\ No newline at end of file
+markdown: kramdown
+
+# These allow the documentation to be updated with new releases
+# of Spark, Scala, and Mesos.
+SPARK_VERSION: 0.6.0
+SCALA_VERSION: 2.9.2
+MESOS_VERSION: 0.9.0
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index dbe9d085a2..d185790bb5 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -6,10 +6,10 @@
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
- <title>{{ page.title }} - Spark 0.6.0 Documentation</title>
+ <title>{{ page.title }} - Spark {{site.SPARK_VERSION}} Documentation</title>
<meta name="description" content="">
- <link rel="stylesheet" href="{{HOME_PATH}}css/bootstrap.min.css">
+ <link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
padding-top: 60px;
@@ -17,12 +17,12 @@
}
</style>
<meta name="viewport" content="width=device-width">
- <link rel="stylesheet" href="{{HOME_PATH}}css/bootstrap-responsive.min.css">
- <link rel="stylesheet" href="{{HOME_PATH}}css/main.css">
+ <link rel="stylesheet" href="css/bootstrap-responsive.min.css">
+ <link rel="stylesheet" href="css/main.css">
- <script src="{{HOME_PATH}}js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+ <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
- <link rel="stylesheet" href="{{HOME_PATH}}css/pygments-default.css">
+ <link rel="stylesheet" href="css/pygments-default.css">
</head>
<body>
<!--[if lt IE 7]>
@@ -34,17 +34,17 @@
<div class="navbar navbar-fixed-top" id="topbar">
<div class="navbar-inner">
<div class="container">
- <a class="brand" href="{{HOME_PATH}}index.html"></a>
+ <a class="brand" href="index.html"></a>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li some how.-->
- <li><a href="{{HOME_PATH}}index.html">Overview</a></li>
+ <li><a href="index.html">Overview</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
- <li><a href="{{HOME_PATH}}quick-start.html">Quick Start</a></li>
- <li><a href="{{HOME_PATH}}scala-programming-guide.html">Scala</a></li>
- <li><a href="{{HOME_PATH}}java-programming-guide.html">Java</a></li>
+ <li><a href="quick-start.html">Quick Start</a></li>
+ <li><a href="scala-programming-guide.html">Scala</a></li>
+ <li><a href="java-programming-guide.html">Java</a></li>
</ul>
</li>
@@ -53,15 +53,15 @@
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
- <li><a href="{{HOME_PATH}}ec2-scripts.html">Amazon EC2</a></li>
- <li><a href="{{HOME_PATH}}spark-standalone.html">Standalone Mode</a></li>
- <li><a href="{{HOME_PATH}}running-on-mesos.html">Mesos</a></li>
- <li><a href="{{HOME_PATH}}running-on-yarn.html">YARN</a></li>
+ <li><a href="ec2-scripts.html">Amazon EC2</a></li>
+ <li><a href="spark-standalone.html">Standalone Mode</a></li>
+ <li><a href="running-on-mesos.html">Mesos</a></li>
+ <li><a href="running-on-yarn.html">YARN</a></li>
</ul>
</li>
<li class="dropdown">
- <a href="{{HOME_PATH}}api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
+ <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
diff --git a/docs/api.md b/docs/api.md
index 3bdac174f9..43548b223c 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -5,6 +5,6 @@ title: Spark API documentation (Scaladoc)
Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory.
-- [Core]({{HOME_PATH}}api/core/index.html)
-- [Examples]({{HOME_PATH}}api/examples/index.html)
-- [Bagel]({{HOME_PATH}}api/bagel/index.html)
+- [Core](api/core/index.html)
+- [Examples](api/examples/index.html)
+- [Bagel](api/bagel/index.html)
diff --git a/docs/bagel-programming-guide.md b/docs/bagel-programming-guide.md
index 0c925c176c..cfaa73f45d 100644
--- a/docs/bagel-programming-guide.md
+++ b/docs/bagel-programming-guide.md
@@ -19,7 +19,7 @@ To write a Bagel application, you will need to add Spark, its dependencies, and
## Programming Model
-Bagel operates on a graph represented as a [distributed dataset]({{HOME_PATH}}scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
+Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
diff --git a/docs/configuration.md b/docs/configuration.md
index db90b5bc16..4270e50f47 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -23,7 +23,7 @@ the copy executable.
Inside `spark-env.sh`, you can set the following environment variables:
* `SCALA_HOME` to point to your Scala installation.
-* `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster]({{HOME_PATH}}running-on-mesos.html).
+* `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster](running-on-mesos.html).
* `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's -Xmx option, e.g. `300m` or `1g`)
* `SPARK_JAVA_OPTS` to add JVM options. This includes any system properties that you'd like to pass with `-D`.
* `SPARK_CLASSPATH` to add elements to Spark's classpath.
@@ -53,9 +53,9 @@ there are at least four properties that you will commonly want to control:
<td>
Class to use for serializing objects that will be sent over the network or need to be cached
in serialized form. The default of Java serialization works with any Serializable Java object but is
- quite slow, so we recommend <a href="{{HOME_PATH}}tuning.html">using <code>spark.KryoSerializer</code>
+ quite slow, so we recommend <a href="tuning.html">using <code>spark.KryoSerializer</code>
and configuring Kryo serialization</a> when speed is necessary. Can be any subclass of
- <a href="{{HOME_PATH}}api/core/index.html#spark.Serializer"><code>spark.Serializer</code></a>).
+ <a href="api/core/index.html#spark.Serializer"><code>spark.Serializer</code></a>).
</td>
</tr>
<tr>
@@ -64,8 +64,8 @@ there are at least four properties that you will commonly want to control:
<td>
If you use Kryo serialization, set this class to register your custom classes with Kryo.
You need to set it to a class that extends
- <a href="{{HOME_PATH}}api/core/index.html#spark.KryoRegistrator"><code>spark.KryoRegistrator</code></a>).
- See the <a href="{{HOME_PATH}}tuning.html#data-serialization">tuning guide</a> for more details.
+ <a href="api/core/index.html#spark.KryoRegistrator"><code>spark.KryoRegistrator</code></a>).
+ See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
</td>
</tr>
<tr>
@@ -81,8 +81,8 @@ there are at least four properties that you will commonly want to control:
<td>spark.cores.max</td>
<td>(infinite)</td>
<td>
- When running on a <a href="{{HOME_PATH}}spark-standalone.html">standalone deploy cluster</a> or a
- <a href="{{HOME_PATH}}running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained"
+ When running on a <a href="spark-standalone.html">standalone deploy cluster</a> or a
+ <a href="running-on-mesos.html#mesos-run-modes">Mesos cluster in "coarse-grained"
sharing mode</a>, how many CPU cores to request at most. The default will use all available cores.
</td>
</tr>
@@ -98,7 +98,7 @@ Apart from these, the following properties are also available, and may be useful
<td>false</td>
<td>
If set to "true", runs over Mesos clusters in
- <a href="{{HOME_PATH}}running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>,
+ <a href="running-on-mesos.html#mesos-run-modes">"coarse-grained" sharing mode</a>,
where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task.
This gives lower-latency scheduling for short queries, but leaves resources in use for the whole
duration of the Spark job.
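
For orientation, here is a minimal sketch of a driver program that sets the properties discussed above before creating its `SparkContext`; the registrator class, core cap, cluster URL, and paths are placeholders rather than anything prescribed by these docs.

{% highlight scala %}
import spark.SparkContext

object ConfiguredJob {
  def main(args: Array[String]) {
    // System properties must be set before the SparkContext is created.
    System.setProperty("spark.serializer", "spark.KryoSerializer")
    System.setProperty("spark.kryo.registrator", "mypkg.MyRegistrator")  // hypothetical registrator class
    System.setProperty("spark.cores.max", "16")                          // placeholder core cap

    val sc = new SparkContext("mesos://HOST:5050", "Configured Job",
                              "/home/user/spark", List("my-job.jar"))
    // ... define and run RDD operations with sc ...
  }
}
{% endhighlight %}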
diff --git a/docs/contributing-to-spark.md b/docs/contributing-to-spark.md
index a0e645d6cc..c6e01c62d8 100644
--- a/docs/contributing-to-spark.md
+++ b/docs/contributing-to-spark.md
@@ -12,7 +12,7 @@ The Spark team welcomes contributions in the form of GitHub pull requests. Here
* Always import packages using absolute paths (e.g. `scala.collection.Map` instead of `collection.Map`).
* No "infix" syntax for methods other than operators. For example, don't write `table containsKey myKey`; replace it with `table.containsKey(myKey)`.
- Make sure that your code passes the unit tests. You can run the tests with `sbt/sbt test` in the root directory of Spark.
- But first, make sure that you have [configured a spark-env.sh]({{HOME_PATH}}configuration.html) with at least
+ But first, make sure that you have [configured a spark-env.sh](configuration.html) with at least
`SCALA_HOME`, as some of the tests try to spawn subprocesses using this.
- Add new unit tests for your code. We use [ScalaTest](http://www.scalatest.org/) for testing. Just add a new Suite in `core/src/test`, or methods to an existing Suite.
- If you'd like to report a bug but don't have time to fix it, you can still post it to our [issues page](https://github.com/mesos/spark/issues), or email the [mailing list](http://www.spark-project.org/mailing-lists.html).
diff --git a/docs/ec2-scripts.md b/docs/ec2-scripts.md
index a7b1d6f200..6e1f7fd3b1 100644
--- a/docs/ec2-scripts.md
+++ b/docs/ec2-scripts.md
@@ -106,7 +106,7 @@ This file needs to be copied to **every machine** to reflect the change. The eas
is to use a script we provide called `copy-dir`. First edit your `spark-env.sh` file on the master,
then run `~/mesos-ec2/copy-dir /root/spark/conf` to RSYNC it to all the workers.
-The [configuration guide]({{HOME_PATH}}configuration.html) describes the available configuration options.
+The [configuration guide](configuration.html) describes the available configuration options.
# Terminating a Cluster
@@ -146,7 +146,7 @@ section.
- Support for spot instances is limited.
If you have a patch or suggestion for one of these limitations, feel free to
-[contribute]({{HOME_PATH}}contributing-to-spark.html) it!
+[contribute](contributing-to-spark.html) it!
# Using a Newer Spark Version
diff --git a/docs/index.md b/docs/index.md
index 5a53da7024..b6f08b5377 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -37,7 +37,7 @@ For example, `./run spark.examples.SparkPi` will run a sample program that estim
examples prints usage help if no params are given.
Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
-to connect to. This can be a [URL for a distributed cluster]({{HOME_PATH}}scala-programming-guide.html#master-urls),
+to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing.
@@ -56,27 +56,27 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
**Quick start:**
-* [Spark Quick Start]({{HOME_PATH}}quick-start.html): a quick intro to the Spark API
+* [Spark Quick Start](quick-start.html): a quick intro to the Spark API
**Programming guides:**
-* [Spark Programming Guide]({{HOME_PATH}}scala-programming-guide.html): how to get started using Spark, and details on the Scala API
-* [Java Programming Guide]({{HOME_PATH}}java-programming-guide.html): using Spark from Java
+* [Spark Programming Guide](scala-programming-guide.html): how to get started using Spark, and details on the Scala API
+* [Java Programming Guide](java-programming-guide.html): using Spark from Java
**Deployment guides:**
-* [Running Spark on Amazon EC2]({{HOME_PATH}}ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
-* [Standalone Deploy Mode]({{HOME_PATH}}spark-standalone.html): launch a standalone cluster quickly without Mesos
-* [Running Spark on Mesos]({{HOME_PATH}}running-on-mesos.html): deploy a private cluster using
+* [Running Spark on Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
+* [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without Mesos
+* [Running Spark on Mesos](running-on-mesos.html): deploy a private cluster using
[Apache Mesos](http://incubator.apache.org/mesos)
-* [Running Spark on YARN]({{HOME_PATH}}running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
+* [Running Spark on YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
**Other documents:**
-* [Configuration]({{HOME_PATH}}configuration.html): customize Spark via its configuration system
-* [Tuning Guide]({{HOME_PATH}}tuning.html): best practices to optimize performance and memory use
-* [API Docs (Scaladoc)]({{HOME_PATH}}api/core/index.html)
-* [Bagel]({{HOME_PATH}}bagel-programming-guide.html): an implementation of Google's Pregel on Spark
+* [Configuration](configuration.html): customize Spark via its configuration system
+* [Tuning Guide](tuning.html): best practices to optimize performance and memory use
+* [API Docs (Scaladoc)](api/core/index.html)
+* [Bagel](bagel-programming-guide.html): an implementation of Google's Pregel on Spark
* [Contributing to Spark](contributing-to-spark.html)
**External resources:**
@@ -96,4 +96,4 @@ To get help using Spark or keep up with Spark development, sign up for the [spar
If you're in the San Francisco Bay Area, there's a regular [Spark meetup](http://www.meetup.com/spark-users/) every few weeks. Come by to meet the developers and other users.
-Finally, if you'd like to contribute code to Spark, read [how to contribute]({{HOME_PATH}}contributing-to-spark.html).
+Finally, if you'd like to contribute code to Spark, read [how to contribute](contributing-to-spark.html).
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 4a36934553..24aa2d5c6b 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -5,14 +5,14 @@ title: Java Programming Guide
The Spark Java API exposes all the Spark features available in the Scala version to Java.
To learn the basics of Spark, we recommend reading through the
-[Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html) first; it should be
+[Scala Programming Guide](scala-programming-guide.html) first; it should be
easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java.
The Spark Java API is defined in the
-[`spark.api.java`]({{HOME_PATH}}api/core/index.html#spark.api.java.package) package, and includes
-a [`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) for
-initializing Spark and [`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
+[`spark.api.java`](api/core/index.html#spark.api.java.package) package, and includes
+a [`JavaSparkContext`](api/core/index.html#spark.api.java.JavaSparkContext) for
+initializing Spark and [`JavaRDD`](api/core/index.html#spark.api.java.JavaRDD) classes,
which support the same methods as their Scala counterparts but take Java functions and return
Java data and collection types. The main differences have to do with passing functions to RDD
operations (e.g. map) and handling RDDs of different types, as discussed next.
@@ -23,12 +23,12 @@ There are a few key differences between the Java and Scala APIs:
* Java does not support anonymous or first-class functions, so functions must
be implemented by extending the
- [`spark.api.java.function.Function`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function),
- [`Function2`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function2), etc.
+ [`spark.api.java.function.Function`](api/core/index.html#spark.api.java.function.Function),
+ [`Function2`](api/core/index.html#spark.api.java.function.Function2), etc.
classes.
* To maintain type safety, the Java API defines specialized Function and RDD
classes for key-value pairs and doubles. For example,
- [`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
+ [`JavaPairRDD`](api/core/index.html#spark.api.java.JavaPairRDD)
stores key-value pairs.
* RDD methods like `collect()` and `countByKey()` return Java collections types,
such as `java.util.List` and `java.util.Map`.
@@ -44,8 +44,8 @@ In the Scala API, these methods are automatically added using Scala's
[implicit conversions](http://www.scala-lang.org/node/130) mechanism.
In the Java API, the extra methods are defined in the
-[`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
-and [`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD)
+[`JavaPairRDD`](api/core/index.html#spark.api.java.JavaPairRDD)
+and [`JavaDoubleRDD`](api/core/index.html#spark.api.java.JavaDoubleRDD)
classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
@@ -76,9 +76,9 @@ class has a single abstract method, `call()`, that must be implemented.
# Other Features
The Java API supports other Spark features, including
-[accumulators]({{HOME_PATH}}scala-programming-guide.html#accumulators),
-[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast-variables), and
-[caching]({{HOME_PATH}}scala-programming-guide.html#rdd-persistence).
+[accumulators](scala-programming-guide.html#accumulators),
+[broadcast variables](scala-programming-guide.html#broadcast-variables), and
+[caching](scala-programming-guide.html#rdd-persistence).
# Example
@@ -173,7 +173,7 @@ just a matter of style.
# Javadoc
We currently provide documentation for the Java API as Scaladoc, in the
-[`spark.api.java` package]({{HOME_PATH}}api/core/index.html#spark.api.java.package), because
+[`spark.api.java` package](api/core/index.html#spark.api.java.package), because
some of the classes are implemented in Scala. The main downside is that the types and function
definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of
`T reduce(Function2<T, T> func)`).
diff --git a/docs/quick-start.md b/docs/quick-start.md
index f9356afe9a..2e88fd863e 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -8,7 +8,7 @@ title: Spark Quick Start
# Introduction
-This document provides a quick-and-dirty look at Spark's API. See the [programming guide]({{HOME_PATH}}/scala-programming-guide.html) for a complete reference. To follow along with this guide, you only need to have successfully [built spark]({{HOME_PATH}}) on one machine. Building Spark is as simple as running
+This document provides a quick-and-dirty look at Spark's API. See the [programming guide](scala-programming-guide.html) for a complete reference. To follow along with this guide, you only need to have successfully [built Spark](index.html) on one machine. Building Spark is as simple as running
{% highlight bash %}
$ sbt/sbt package
@@ -29,7 +29,7 @@ scala> val textFile = sc.textFile("README.md")
textFile: spark.RDD[String] = spark.MappedRDD@2ee9b6e3
{% endhighlight %}
-RDD's have _[actions]({{HOME_PATH}}/scala-programming-guide.html#actions)_, which return values, and _[transformations]({{HOME_PATH}}/scala-programming-guide.html#transformations)_, which return pointers to new RDD's. Let's start with a few actions:
+RDDs have _[actions](scala-programming-guide.html#actions)_, which return values, and _[transformations](scala-programming-guide.html#transformations)_, which return pointers to new RDDs. Let's start with a few actions:
{% highlight scala %}
scala> textFile.count() // Number of items in this RDD
@@ -39,7 +39,7 @@ scala> textFile.first() // First item in this RDD
res1: String = # Spark
{% endhighlight %}
-Now let's use a transformation. We will use the [filter]({{HOME_PATH}}/scala-programming-guide.html#transformations)() transformation to return a new RDD with a subset of the items in the file.
+Now let's use a transformation. We will use the [filter](scala-programming-guide.html#transformations)() transformation to return a new RDD with a subset of the items in the file.
{% highlight scala %}
scala> val sparkLinesOnly = textFile.filter(line => line.contains("Spark"))
@@ -61,7 +61,7 @@ scala> textFile.map(line => line.split(" ").size).reduce((a, b) => if (a < b) {b
res4: Long = 16
{% endhighlight %}
-This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to [map]({{HOME_PATH}}/scala-programming-guide.html#transformations)() and [reduce]({{HOME_PATH}}/scala-programming-guide.html#actions)() are scala closures. We can easily include functions declared elsewhere, or include existing functions in our anonymous closures. For instance, we can use `Math.max()` to make this code easier to understand.
+This first maps a line to an integer value, creating a new RDD. `reduce` is called on that RDD to find the largest line count. The arguments to [map](scala-programming-guide.html#transformations)() and [reduce](scala-programming-guide.html#actions)() are Scala closures. We can easily include functions declared elsewhere, or include existing functions in our anonymous closures. For instance, we can use `Math.max()` to make this code easier to understand.
{% highlight scala %}
scala> import java.lang.Math;
@@ -78,7 +78,7 @@ scala> val wordCountRDD = textFile.flatMap(line => line.split(" ")).map(word =>
wordCountRDD: spark.RDD[(java.lang.String, Int)] = spark.ShuffledAggregatedRDD@71f027b8
{% endhighlight %}
-Here, we combined the [flatMap]({{HOME_PATH}}/scala-programming-guide.html#transformations)(), [map]({{HOME_PATH}}/scala-programming-guide.html#transformations)() and [reduceByKey]({{HOME_PATH}}/scala-programming-guide.html#transformations)() transformations to create per-word counts in the file. To collect the word counts in our shell, we can use the [collect]({{HOME_PATH}}/scala-programming-guide.html#actions)() action:
+Here, we combined the [flatMap](scala-programming-guide.html#transformations)(), [map](scala-programming-guide.html#transformations)() and [reduceByKey](scala-programming-guide.html#transformations)() transformations to create per-word counts in the file. To collect the word counts in our shell, we can use the [collect](scala-programming-guide.html#actions)() action:
{% highlight scala %}
scala> wordCountRDD.collect()
@@ -158,7 +158,7 @@ $ sbt run
Lines with a: 8422, Lines with b: 1836
{% endhighlight %}
-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode]({{HOME_PATH}}/spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
+This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
# A Spark Job In Java
Now say we wanted to write a custom job using the Spark API. We will walk through a simple job in both Scala (with sbt) and Java (with Maven). If you are using other build systems, please reference the Spark assembly jar in the developer guide. The first step is to publish Spark to our local Ivy/Maven repositories. From the Spark directory:
@@ -235,5 +235,5 @@ $ mvn exec:java -Dexec.mainClass="SimpleJob"
Lines with a: 8422, Lines with b: 1836
{% endhighlight %}
-This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode]({{HOME_PATH}}/spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
+This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation and consider using a distributed input source, such as HDFS.
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index d9c9c897aa..0fc71bfbd5 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -5,7 +5,7 @@ title: Running Spark on Mesos
Spark can run on private clusters managed by the [Apache Mesos](http://incubator.apache.org/mesos/) resource manager. Follow the steps below to install Mesos and Spark:
-1. Download and build Spark using the instructions [here]({{HOME_PATH}}index.html).
+1. Download and build Spark using the instructions [here](index.html).
2. Download Mesos 0.9.0 from a [mirror](http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-0.9.0-incubating/).
3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.macosx`, that you can run. See the README file in Mesos for other options. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`.
4. Build Mesos using `make`, and then install it using `make install`.
@@ -24,7 +24,7 @@ Spark can run on private clusters managed by the [Apache Mesos](http://incubator
new SparkContext("mesos://HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar"))
{% endhighlight %}
-If you want to run Spark on Amazon EC2, you can use the Spark [EC2 launch scripts]({{HOME_PATH}}ec2-scripts.html), which provide an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured. This will get you a cluster in about five minutes without any configuration on your part.
+If you want to run Spark on Amazon EC2, you can use the Spark [EC2 launch scripts](ec2-scripts.html), which provide an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured. This will get you a cluster in about five minutes without any configuration on your part.
# Mesos Run Modes
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 983e291543..70d1dc988c 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -35,7 +35,7 @@ This is done through the following constructor:
new SparkContext(master, jobName, [sparkHome], [jars])
{% endhighlight %}
-The `master` parameter is a string specifying a [Mesos]({{HOME_PATH}}running-on-mesos.html) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
+The `master` parameter is a string specifying a [Mesos](running-on-mesos.html) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running in distributed mode, as described later.
In the Spark interpreter, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, run `MASTER=local[4] ./spark-shell` to run locally with four cores.
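
As a concrete, hedged illustration of the constructor above (the job name, paths, and jar name are placeholders; in the interpreter you would use the provided `sc` instead of creating your own):

{% highlight scala %}
import spark.SparkContext
import spark.SparkContext._   // implicit conversions, e.g. for operations on pair RDDs

// Local mode with four worker threads; swap in a mesos:// or spark:// URL for a cluster.
val sc = new SparkContext("local[4]", "Example Job", "/home/user/spark", List("target/example.jar"))

val data = sc.parallelize(1 to 1000)
println(data.filter(_ % 2 == 0).count())   // => 500
{% endhighlight %}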
@@ -48,16 +48,16 @@ The master URL passed to Spark can be in one of the following formats:
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (ideally, set this to the number of cores on your machine).
</td></tr>
-<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="{{HOME_PATH}}spark-standalone.html">Spark standalone
+<tr><td> spark://HOST:PORT </td><td> Connect to the given <a href="spark-standalone.html">Spark standalone
cluster</a> master. The port must be whichever one your master is configured to use, which is 7077 by default.
</td></tr>
-<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="{{HOME_PATH}}running-on-mesos.html">Mesos</a> cluster.
+<tr><td> mesos://HOST:PORT </td><td> Connect to the given <a href="running-on-mesos.html">Mesos</a> cluster.
The host parameter is the hostname of the Mesos master. The port must be whichever one the master is configured to use,
which is 5050 by default.
</td></tr>
</table>
-For running on YARN, Spark launches an instance of the standalone deploy cluster within YARN; see [running on YARN]({{HOME_PATH}}running-on-yarn.html) for details.
+For running on YARN, Spark launches an instance of the standalone deploy cluster within YARN; see [running on YARN](running-on-yarn.html) for details.
### Deploying Code on a Cluster
@@ -116,7 +116,7 @@ All transformations in Spark are <i>lazy</i>, in that they do not compute their
By default, each transformed RDD is recomputed each time you run an action on it. However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options.
-The following tables list the transformations and actions currently supported (see also the [RDD API doc]({{HOME_PATH}}api/core/index.html#spark.RDD) for details):
+The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/core/index.html#spark.RDD) for details):
### Transformations
@@ -185,7 +185,7 @@ The following tables list the transformations and actions currently supported (s
</tr>
</table>
-A complete list of transformations is available in the [RDD API doc]({{HOME_PATH}}api/core/index.html#spark.RDD).
+A complete list of transformations is available in the [RDD API doc](api/core/index.html#spark.RDD).
### Actions
@@ -233,7 +233,7 @@ A complete list of transformations is available in the [RDD API doc]({{HOME_PATH
</tr>
</table>
-A complete list of actions is available in the [RDD API doc]({{HOME_PATH}}api/core/index.html#spark.RDD).
+A complete list of actions is available in the [RDD API doc](api/core/index.html#spark.RDD).
## RDD Persistence
@@ -241,7 +241,7 @@ One of the most important capabilities in Spark is *persisting* (or *caching*) a
You can mark an RDD to be persisted using the `persist()` or `cache()` methods on it. The first time it is computed in an action, it will be kept in memory on the nodes. The cache is fault-tolerant -- if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.
-In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`spark.storage.StorageLevel`]({{HOME_PATH}}api/core/index.html#spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is:
+In addition, each RDD can be stored using a different *storage level*, allowing you, for example, to persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or even replicate it across nodes. These levels are chosen by passing a [`spark.storage.StorageLevel`](api/core/index.html#spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is:
<table class="table">
<tr><th style="width:23%">Storage Level</th><th>Meaning</th></tr>
@@ -259,7 +259,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
<td> MEMORY_ONLY_SER </td>
<td> Store RDD as <i>serialized</i> Java objects (one byte array per partition).
This is generally more space-efficient than deserialized objects, especially when using a
- <a href="{{HOME_PATH}}tuning.html">fast serializer</a>, but more CPU-intensive to read.
+ <a href="tuning.html">fast serializer</a>, but more CPU-intensive to read.
</td>
</tr>
<tr>
@@ -284,7 +284,7 @@ We recommend going through the following process to select one:
* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY`), leave them that way. This is the most
CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
-* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
+* If not, try using `MEMORY_ONLY_SER` and [selecting a fast serialization library](tuning.html) to make the objects
much more space-efficient, but still reasonably fast to access.
* Don't spill to disk unless the functions that computed your datasets are expensive, or they filter a large
amount of the data. Otherwise, recomputing a partition is about as fast as reading it from disk.
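
A minimal sketch of the storage-level advice above, assuming an existing `SparkContext` named `sc`; the HDFS path is a placeholder:

{% highlight scala %}
import spark.storage.StorageLevel

val lines  = sc.textFile("hdfs://namenode:9000/data/events.log")   // placeholder path
val parsed = lines.map(_.split("\t"))

// Serialized in-memory storage: more compact than MEMORY_ONLY, at some CPU cost to read.
parsed.persist(StorageLevel.MEMORY_ONLY_SER)

println(parsed.count())   // first action computes the RDD and caches its partitions
println(parsed.count())   // later actions reuse the cached, serialized partitions
{% endhighlight %}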
@@ -339,6 +339,6 @@ res2: Int = 10
You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website.
In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
-For help on optimizing your program, the [configuration]({{HOME_PATH}}configuration.html) and
-[tuning]({{HOME_PATH}}tuning.html) guides provide information on best practices. They are especially important for
+For help on optimizing your program, the [configuration](configuration.html) and
+[tuning](tuning.html) guides provide information on best practices. They are especially important for
making sure that your data is stored in memory in an efficient format.
diff --git a/docs/spark-standalone.md b/docs/spark-standalone.md
index 7bad006a23..ae630a0371 100644
--- a/docs/spark-standalone.md
+++ b/docs/spark-standalone.md
@@ -68,7 +68,7 @@ Finally, the following configuration options can be passed to the master and wor
To launch a Spark standalone cluster with the deploy scripts, you need to set up two files, `conf/spark-env.sh` and `conf/slaves`. The `conf/spark-env.sh` file lets you specify global settings for the master and slave instances, such as memory, or port numbers to bind to, while `conf/slaves` is a list of slave nodes. The system requires that all the slave machines have the same configuration files, so *copy these files to each machine*.
-In `conf/spark-env.sh`, you can set the following parameters, in addition to the [standard Spark configuration settongs]({{HOME_PATH}}configuration.html):
+In `conf/spark-env.sh`, you can set the following parameters, in addition to the [standard Spark configuration settings](configuration.html):
<table class="table">
<tr><th style="width:21%">Environment Variable</th><th>Meaning</th></tr>
@@ -123,7 +123,7 @@ Note that the scripts must be executed on the machine you want to run the Spark
# Connecting a Job to the Cluster
To run a job on the Spark cluster, simply pass the `spark://IP:PORT` URL of the master to the [`SparkContext`
-constructor]({{HOME_PATH}}scala-programming-guide.html#initializing-spark).
+constructor](scala-programming-guide.html#initializing-spark).
To run an interactive Spark shell against the cluster, run the following command:
diff --git a/docs/tuning.md b/docs/tuning.md
index 9ce9d4d2ef..58b52b3376 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -7,15 +7,15 @@ Because of the in-memory nature of most Spark computations, Spark programs can b
by any resource in the cluster: CPU, network bandwidth, or memory.
Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
also need to do some tuning, such as
-[storing RDDs in serialized form]({{HOME_PATH}}/scala-programming-guide.html#rdd-persistence), to
+[storing RDDs in serialized form](scala-programming-guide.html#rdd-persistence), to
make the memory usage smaller.
This guide will cover two main topics: data serialization, which is crucial for good network
performance, and memory tuning. We also sketch several smaller topics.
-This document assumes that you have familiarity with the Spark API and have already read the [Scala]({{HOME_PATH}}/scala-programming-guide.html) or [Java]({{HOME_PATH}}/java-programming-guide.html) programming guides. After reading this guide, do not hesitate to reach out to the [Spark mailing list](http://groups.google.com/group/spark-users) with performance related concerns.
+This document assumes that you have familiarity with the Spark API and have already read the [Scala](scala-programming-guide.html) or [Java](java-programming-guide.html) programming guides. After reading this guide, do not hesitate to reach out to the [Spark mailing list](http://groups.google.com/group/spark-users) with performance related concerns.
# The Spark Storage Model
-Spark's key abstraction is a distributed dataset, or RDD. RDD's consist of partitions. RDD partitions are stored either in memory or on disk, with replication or without replication, depending on the chosen [persistence options]({{HOME_PATH}}/scala-programming-guide.html#rdd-persistence). When RDD's are stored in memory, they can be stored as deserialized Java objects, or in a serialized form, again depending on the persistence option chosen. When RDD's are transferred over the network, or spilled to disk, they are always serialized. Spark can use different serializers, configurable with the `spark.serializer` option.
+Spark's key abstraction is a distributed dataset, or RDD. RDDs consist of partitions. RDD partitions are stored either in memory or on disk, with or without replication, depending on the chosen [persistence options](scala-programming-guide.html#rdd-persistence). When RDDs are stored in memory, they can be stored as deserialized Java objects, or in a serialized form, again depending on the persistence option chosen. When RDDs are transferred over the network, or spilled to disk, they are always serialized. Spark can use different serializers, configurable with the `spark.serializer` option.
# Serialization Options
@@ -45,7 +45,7 @@ You can switch to using Kryo by calling `System.setProperty("spark.serializer",
registration requirement, but we recommend trying it in any network-intensive application.
Finally, to register your classes with Kryo, create a public class that extends
-[`spark.KryoRegistrator`]({{HOME_PATH}}api/core/index.html#spark.KryoRegistrator) and set the
+[`spark.KryoRegistrator`](api/core/index.html#spark.KryoRegistrator) and set the
`spark.kryo.registrator` system property to point to it, as follows:
{% highlight scala %}
@@ -107,11 +107,11 @@ There are several ways to reduce this cost and still make Java objects efficient
3. If you have less than 32 GB of RAM, set the JVM flag `-XX:+UseCompressedOops` to make pointers be
four bytes instead of eight. Also, on Java 7 or later, try `-XX:+UseCompressedStrings` to store
ASCII strings as just 8 bits per character. You can add these options in
- [`spark-env.sh`]({{HOME_PATH}}configuration.html#environment-variables-in-spark-envsh).
+ [`spark-env.sh`](configuration.html#environment-variables-in-spark-envsh).
When your objects are still too large to efficiently store despite this tuning, a much simpler way
to reduce memory usage is to store them in *serialized* form, using the serialized StorageLevels in
-the [RDD persistence API]({{HOME_PATH}}scala-programming-guide#rdd-persistence).
+the [RDD persistence API](scala-programming-guide.html#rdd-persistence).
Spark will then store each RDD partition as one large byte array.
The only downside of storing data in serialized form is slower access times, due to having to
deserialize each object on the fly.
@@ -196,7 +196,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a
(though you can control it through optional parameters to `SparkContext.textFile`, etc), but for
distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses a default value of 8.
You can pass the level of parallelism as a second argument (see the
-[`spark.PairRDDFunctions`]({{HOME_PATH}}api/core/index.html#spark.PairRDDFunctions) documentation),
+[`spark.PairRDDFunctions`](api/core/index.html#spark.PairRDDFunctions) documentation),
or set the system property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
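
A brief sketch of both ways to raise the reduce-side parallelism described above, assuming an existing `SparkContext` named `sc`; the value 32 is an arbitrary placeholder:

{% highlight scala %}
import spark.SparkContext._   // implicit conversion enabling reduceByKey on (K, V) RDDs

val pairs = sc.parallelize(Seq("a", "b", "a", "c")).map(w => (w, 1))

// Per-operation: pass the level of parallelism as a second argument.
val counts = pairs.reduceByKey((a, b) => a + b, 32)

// Globally: change the default for distributed "reduce" operations
// (set this before the SparkContext is created).
System.setProperty("spark.default.parallelism", "32")
{% endhighlight %}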
@@ -213,7 +213,7 @@ number of cores in your clusters.
## Broadcasting Large Variables
-Using the [broadcast functionality]({{HOME_PATH}}scala-programming-guide#broadcast-variables)
+Using the [broadcast functionality](scala-programming-guide.html#broadcast-variables)
available in `SparkContext` can greatly reduce the size of each serialized task, and the cost
of launching a job over a cluster. If your tasks use any large object from the driver program
inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.
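
A small sketch of the broadcast pattern described above, assuming an existing `SparkContext` named `sc`; the lookup data is invented for illustration:

{% highlight scala %}
// A static lookup table shipped to each worker once, instead of with every task's closure.
val lookup   = Map("US" -> "United States", "DE" -> "Germany")   // placeholder data
val lookupBc = sc.broadcast(lookup)

val codes    = sc.parallelize(Seq("US", "DE", "US", "FR"))
val expanded = codes.map(code => lookupBc.value.getOrElse(code, "unknown"))

println(expanded.collect().mkString(", "))
// => United States, Germany, United States, unknown
{% endhighlight %}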