author    Matei Zaharia <matei@eecs.berkeley.edu>  2012-09-12 19:37:51 -0700
committer Matei Zaharia <matei@eecs.berkeley.edu>  2012-09-12 19:37:51 -0700
commit    c92e6169cf83d0fb87220999db993869912e6438 (patch)
tree      8813df080f1c04e276c0134c97f24c55d4d43cb7
parent    4b8b0d5c08f82ce2ce28c6495a9875a549cf8b3a (diff)
parent    1bcd09e09313fb7306a6c11e7e21d8ee73eb8edf (diff)
Merge pull request #198 from andyk/doc
Fix links and make things a bit prettier.
-rw-r--r--  docs/EC2-Scripts.md             | 146
-rw-r--r--  docs/README.md                  |  15
-rw-r--r--  docs/_config.yml                |   1
-rwxr-xr-x  docs/_layouts/global.html       |  17
-rw-r--r--  docs/bagel-programming-guide.md |   2
-rw-r--r--  docs/configuration.md           |   4
-rw-r--r--  docs/contributing-to-spark.md   |  23
-rwxr-xr-x  docs/css/main.css               |  25
-rw-r--r--  docs/css/pygments-default.css   |  76
-rw-r--r--  docs/ec2-scripts.md             |  14
-rw-r--r--  docs/index.md                   |  16
-rw-r--r--  docs/programming-guide.md       |  67
-rw-r--r--  docs/running-on-amazon-ec2.md   |   2
-rw-r--r--  docs/running-on-mesos.md        |  14
14 files changed, 199 insertions, 223 deletions
diff --git a/docs/EC2-Scripts.md b/docs/EC2-Scripts.md
deleted file mode 100644
index 35d28c47d0..0000000000
--- a/docs/EC2-Scripts.md
+++ /dev/null
@@ -1,146 +0,0 @@
----
-layout: global
-title: Using the Spark EC2 Scripts
----
-The `spark-ec2` script located in Spark's `ec2` directory allows you
-to launch, manage, and shut down Spark clusters on Amazon EC2. It builds
-on the [Mesos EC2 script](https://github.com/mesos/mesos/wiki/EC2-Scripts)
-in Apache Mesos.
-
-`spark-ec2` is designed to manage multiple named clusters. You can
-launch a new cluster (telling the script its size and giving it a name),
-shut down an existing cluster, or log into a cluster. Each cluster is
-identified by placing its machines into EC2 security groups whose names
-are derived from the name of the cluster. For example, a cluster named
-`test` will contain a master node in a security group called
-`test-master`, and a number of slave nodes in a security group called
-`test-slaves`. The `spark-ec2` script will create these security groups
-for you based on the cluster name you request. You can also use them to
-identify machines belonging to each cluster in the EC2 Console or
-ElasticFox.
-
-This guide describes how to get set up to run clusters, how to launch
-clusters, how to run jobs on them, and how to shut them down.
-
-Before You Start
-================
-
-- Create an Amazon EC2 key pair for yourself. This can be done by
- logging into your Amazon Web Services account through the [AWS
- console](http://aws.amazon.com/console/), clicking Key Pairs on the
- left sidebar, and creating and downloading a key. Make sure that you
- set the permissions for the private key file to `600` (i.e. only you
- can read and write it) so that `ssh` will work.
-- Whenever you want to use the `spark-ec2` script, set the environment
- variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` to your
- Amazon EC2 access key ID and secret access key. These can be
- obtained from the [AWS homepage](http://aws.amazon.com/) by clicking
- Account \> Security Credentials \> Access Credentials.
-
-Launching a Cluster
-===================
-
-- Go into the `ec2` directory in the release of Spark you downloaded.
-- Run
- `./spark-ec2 -k <keypair> -i <key-file> -s <num-slaves> launch <cluster-name>`,
- where `<keypair>` is the name of your EC2 key pair (that you gave it
- when you created it), `<key-file>` is the private key file for your
- key pair, `<num-slaves>` is the number of slave nodes to launch (try
- 1 at first), and `<cluster-name>` is the name to give to your
- cluster.
-- After everything launches, check that Mesos is up and sees all the
- slaves by going to the Mesos Web UI link printed at the end of the
- script (`http://<master-hostname>:8080`).
-
-You can also run `./spark-ec2 --help` to see more usage options. The
-following options are worth pointing out:
-
-- `--instance-type=<INSTANCE_TYPE>` can be used to specify an EC2
-instance type to use. For now, the script only supports 64-bit instance
-types, and the default type is `m1.large` (which has 2 cores and 7.5 GB
-RAM). Refer to the Amazon pages about [EC2 instance
-types](http://aws.amazon.com/ec2/instance-types) and [EC2
-pricing](http://aws.amazon.com/ec2/#pricing) for information about other
-instance types.
-- `--zone=<EC2_ZONE>` can be used to specify an EC2 availability zone
-to launch instances in. Sometimes, you will get an error because there
-is not enough capacity in one zone, and you should try to launch in
-another. This happens mostly with the `m1.large` instance types;
-extra-large (both `m1.xlarge` and `c1.xlarge`) instances tend to be more
-available.
-- `--ebs-vol-size=GB` will attach an EBS volume with a given amount
- of space to each node so that you can have a persistent HDFS cluster
- on your nodes across cluster restarts (see below).
-- If one of your launches fails due to e.g. not having the right
-permissions on your private key file, you can run `launch` with the
-`--resume` option to restart the setup process on an existing cluster.
-
-Running Jobs
-============
-
-- Go into the `ec2` directory in the release of Spark you downloaded.
-- Run `./spark-ec2 -k <keypair> -i <key-file> login <cluster-name>` to
- SSH into the cluster, where `<keypair>` and `<key-file>` are as
- above. (This is just for convenience; you could also use
- the EC2 console.)
-- To deploy code or data within your cluster, you can log in and use the
- provided script `~/mesos-ec2/copy-dir`, which,
- given a directory path, RSYNCs it to the same location on all the slaves.
-- If your job needs to access large datasets, the fastest way to do
- that is to load them from Amazon S3 or an Amazon EBS device into an
- instance of the Hadoop Distributed File System (HDFS) on your nodes.
- The `spark-ec2` script already sets up an HDFS instance for you. It's
- installed in `/root/ephemeral-hdfs`, and can be accessed using the
- `bin/hadoop` script in that directory. Note that the data in this
- HDFS goes away when you stop and restart a machine.
-- There is also a *persistent HDFS* instance in
- `/root/persistent-hdfs` that will keep data across cluster restarts.
- Typically each node has relatively little space for persistent data
- (about 3 GB), but you can use the `--ebs-vol-size` option to
- `spark-ec2` to attach a persistent EBS volume to each node for
- storing the persistent HDFS.
-- Finally, if you get errors while running your jobs, look at the slave's logs
- for that job using the Mesos web UI (`http://<master-hostname>:8080`).
-
-Terminating a Cluster
-=====================
-
-***Note that there is no way to recover data on EC2 nodes after shutting
-them down! Make sure you have copied everything important off the nodes
-before stopping them.***
-
-- Go into the `ec2` directory in the release of Spark you downloaded.
-- Run `./spark-ec2 destroy <cluster-name>`.
-
-Pausing and Restarting Clusters
-===============================
-
-The `spark-ec2` script also supports pausing a cluster. In this case,
-the VMs are stopped but not terminated, so they
-***lose all data on ephemeral disks*** but keep the data in their
-root partitions and their `persistent-hdfs`. Stopped machines will not
-cost you any EC2 cycles, but ***will*** continue to cost money for EBS
-storage.
-
-- To stop one of your clusters, go into the `ec2` directory and run
-`./spark-ec2 stop <cluster-name>`.
-- To restart it later, run
-`./spark-ec2 -i <key-file> start <cluster-name>`.
-- To ultimately destroy the cluster and stop consuming EBS space, run
-`./spark-ec2 destroy <cluster-name>` as described in the previous
-section.
-
-Limitations
-===========
-
-- `spark-ec2` currently only launches machines in the US-East region of EC2.
- It should not be hard to make it launch VMs in other zones, but you will need
- to create your own AMIs in them.
-- Support for "cluster compute" nodes is limited -- there's no way to specify a
- locality group. However, you can launch slave nodes in your `<clusterName>-slaves`
- group manually and then use `spark-ec2 launch --resume` to start a cluster with
- them.
-- Support for spot instances is limited.
-
-If you have a patch or suggestion for one of these limitations, feel free to
-[[contribute|Contributing to Spark]] it!
diff --git a/docs/README.md b/docs/README.md
index e2ae05722f..9f179a437a 100644
--- a/docs/README.md
+++ b/docs/README.md
@@ -4,10 +4,25 @@ This readme will walk you through navigating and building the Spark documentatio
Read on to learn more about viewing the documentation as plain text (i.e., Markdown) or building documentation that corresponds to whichever version of Spark you currently have checked out of revision control.
+## Generating the Documentation HTML
+
We include the Spark documentation as part of the source (as opposed to using a hosted wiki as the definitive documentation) to enable the documentation to evolve along with the source code and be captured by revision control (currently git). This way the code automatically includes the version of the documentation that is relevant regardless of which version or release you have checked out or downloaded.
In this directory you will find text files formatted using Markdown, with an ".md" suffix. You can read those files directly if you want. Start with index.md.
To make things quite a bit prettier and make the links easier to follow, generate the HTML version of the documentation from these source files by running `jekyll` in the docs directory (you will need to have Jekyll installed; the easiest way to do this is via a Ruby gem). This will create a directory called _site, which will contain index.html as well as the rest of the compiled files. Read more about Jekyll at https://github.com/mojombo/jekyll/wiki.
+## Pygments
+
+We also use pygments (http://pygments.org) for syntax highlighting, so you will need to install it (it requires Python) by running `sudo easy_install Pygments`.
+
+To mark a block of code in your markdown to be syntax highlighted by Jekyll during the compile phase, use the following syntax:
+
+ {% highlight scala %}
+ // Your scala code goes here, you can replace scala with many other
+ // supported languages too.
+ {% endhighlight %}
+
+## Scaladoc
+
You can build just the Spark scaladoc by running `sbt/sbt doc` from the SPARK_PROJECT_ROOT directory.
diff --git a/docs/_config.yml b/docs/_config.yml
new file mode 100644
index 0000000000..b136b57555
--- /dev/null
+++ b/docs/_config.yml
@@ -0,0 +1 @@
+pygments: true
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index a2f1927e6b..402adca72c 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -10,17 +10,18 @@
<meta name="description" content="">
<meta name="viewport" content="width=device-width">
- <link rel="stylesheet" href="css/bootstrap.min.css">
+ <link rel="stylesheet" href="{{HOME_PATH}}css/bootstrap.min.css">
<style>
body {
padding-top: 60px;
padding-bottom: 40px;
}
</style>
- <link rel="stylesheet" href="css/bootstrap-responsive.min.css">
- <link rel="stylesheet" href="css/main.css">
+ <link rel="stylesheet" href="{{HOME_PATH}}css/bootstrap-responsive.min.css">
+ <link rel="stylesheet" href="{{HOME_PATH}}css/main.css">
- <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+ <script src="{{HOME_PATH}}js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+ <link rel="stylesheet" href="{{HOME_PATH}}css/pygments-default.css">
</head>
<body>
<!--[if lt IE 7]>
@@ -37,13 +38,13 @@
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</a>
- <a class="brand" href="#">Spark</a>
+ <a class="brand" href="{{HOME_PATH}}index.html">Spark</a>
<div class="nav-collapse collapse">
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li somehow.-->
- <li><a href="/">Home</a></li>
- <li><a href="/programming-guide.html">Programming Guide</a></li>
- <li><a href="/api">API (Scaladoc)</a></li>
+ <li><a href="{{HOME_PATH}}index.html">Home</a></li>
+ <li><a href="{{HOME_PATH}}programming-guide.html">Programming Guide</a></li>
+ <li><a href="{{HOME_PATH}}api">API (Scaladoc)</a></li>
<!--
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Versions ({{ page.spark-version }})<b class="caret"></b></a>
diff --git a/docs/bagel-programming-guide.md b/docs/bagel-programming-guide.md
index d4d08f8cb1..23f69a3ded 100644
--- a/docs/bagel-programming-guide.md
+++ b/docs/bagel-programming-guide.md
@@ -20,7 +20,7 @@ To write a Bagel application, you will need to add Spark, its dependencies, and
## Programming Model
-Bagel operates on a graph represented as a [[distributed dataset|Spark Programming Guide]] of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
+Bagel operates on a graph represented as a [distributed dataset]({{HOME_PATH}}programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
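To make the superstep model above concrete, here is a minimal sketch of what a PageRank-style compute function could look like. The class and method shapes below are hypothetical illustrations of the model just described, not Bagel's actual API; consult the Bagel source for the real signatures.

{% highlight scala %}
// Hypothetical shapes, for illustration only (not Bagel's real API).
case class PRVertex(id: String, rank: Double, outEdges: Seq[String], active: Boolean)
case class PRMessage(targetId: String, rankShare: Double)

// Each superstep: combine incoming rank shares into a new rank, then
// send a share of that rank along every outgoing link.
def compute(self: PRVertex, msgs: Seq[PRMessage], superstep: Int): (PRVertex, Seq[PRMessage]) = {
  val newRank =
    if (msgs.nonEmpty) 0.15 + 0.85 * msgs.map(_.rankShare).sum
    else self.rank
  val halt = superstep >= 10
  val outbox =
    if (halt) Seq.empty[PRMessage]
    else self.outEdges.map(dest => PRMessage(dest, newRank / self.outEdges.size))
  (self.copy(rank = newRank, active = !halt), outbox)
}
{% endhighlight %}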
diff --git a/docs/configuration.md b/docs/configuration.md
index 07190b2931..ab854de386 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -9,7 +9,7 @@ Spark is configured primarily through the `conf/spark-env.sh` script. This scrip
Inside this script, you can set several environment variables:
* `SCALA_HOME` to point to your Scala installation.
-* `MESOS_NATIVE_LIBRARY` if you are [[running on a Mesos cluster|Running Spark on Mesos]].
+* `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster]({{HOME_PATH}}running-on-mesos.html).
* `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's -Xmx option, e.g. `300m` or `1g`)
* `SPARK_JAVA_OPTS` to add JVM options. This includes system properties that you'd like to pass with `-D`.
* `SPARK_CLASSPATH` to add elements to Spark's classpath.
@@ -21,4 +21,4 @@ The most important thing to set first will probably be the memory (`SPARK_MEM`).
## Logging Configuration
-Spark uses [[log4j|http://logging.apache.org/log4j/]] for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
+Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
diff --git a/docs/contributing-to-spark.md b/docs/contributing-to-spark.md
index fc7544887b..3585bda2d3 100644
--- a/docs/contributing-to-spark.md
+++ b/docs/contributing-to-spark.md
@@ -4,23 +4,14 @@ title: How to Contribute to Spark
---
# Contributing to Spark
-The Spark team welcomes contributions in the form of GitHub pull requests.
-Here are a few tips to get your contribution in:
+The Spark team welcomes contributions in the form of GitHub pull requests. Here are a few tips to get your contribution in:
-- Break your work into small, single-purpose patches if possible. It's much harder to merge
- in a large change with a lot of disjoint features.
-- Submit the patch as a GitHub pull request. For a tutorial, see
- the GitHub guides on [[forking a repo|https://help.github.com/articles/fork-a-repo]]
- and [[sending a pull request|https://help.github.com/articles/using-pull-requests]].
-- Follow the style of the existing codebase. Specifically, we use [[standard Scala
- style guide|http://docs.scala-lang.org/style/]], but with the following changes:
+- Break your work into small, single-purpose patches if possible. It's much harder to merge in a large change with a lot of disjoint features.
+- Submit the patch as a GitHub pull request. For a tutorial, see the GitHub guides on [forking a repo](https://help.github.com/articles/fork-a-repo) and [sending a pull request](https://help.github.com/articles/using-pull-requests).
+- Follow the style of the existing codebase. Specifically, we use the [standard Scala style guide](http://docs.scala-lang.org/style/), but with the following changes:
* Maximum line length of 100 characters.
* Always import packages using absolute paths (e.g. `scala.collection.Map` instead of `collection.Map`).
- * No "infix" syntax for methods other than operators. For example, don't write
- `table containsKey myKey`; replace it with `table.containsKey(myKey)`.
-- Add unit tests to your new code. We use [[ScalaTest|http://www.scalatest.org/]] for
- testing. Just add a new Suite in `core/src/test`, or methods to an existing Suite.
+ * No "infix" syntax for methods other than operators. For example, don't write `table containsKey myKey`; replace it with `table.containsKey(myKey)`.
+- Add unit tests to your new code. We use [ScalaTest](http://www.scalatest.org/) for testing. Just add a new Suite in `core/src/test`, or methods to an existing Suite.
-If you'd like to report a bug but don't have time to fix it, you can still post it to
-our [[issues page|https://github.com/mesos/spark/issues]]. Also, feel free to email
-the [[mailing list|http://www.spark-project.org/mailing-lists.html]].
+If you'd like to report a bug but don't have time to fix it, you can still post it to our [issues page](https://github.com/mesos/spark/issues). Also, feel free to email the [mailing list](http://www.spark-project.org/mailing-lists.html).
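As a small illustration of the unit-testing tip in the checklist above, a new test added under `core/src/test` might look like the following sketch; the class name, test name, and assertion are hypothetical, and it simply assumes ScalaTest's `FunSuite` style.

{% highlight scala %}
import org.scalatest.FunSuite

// Hypothetical suite, shown only to illustrate the pattern.
class MyFeatureSuite extends FunSuite {
  test("my new utility handles the empty case") {
    assert(Seq.empty[Int].sum === 0)
  }
}
{% endhighlight %}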
diff --git a/docs/css/main.css b/docs/css/main.css
index b351c82415..8432d0f911 100755
--- a/docs/css/main.css
+++ b/docs/css/main.css
@@ -1,3 +1,28 @@
+---
+---
/* ==========================================================================
Author's custom styles
========================================================================== */
+
+/*.brand {
+ background: url({{HOME_PATH}}img/spark-logo.jpg) no-repeat left center;
+ height: 40px;
+ width: 100px;
+}
+*/
+
+body {
+ line-height: 1.6; /* Inspired by Github's wiki style */
+}
+
+h1 {
+ font-size: 28px;
+}
+
+code {
+ color: #333;
+}
+
+.container {
+ max-width: 914px;
+}
diff --git a/docs/css/pygments-default.css b/docs/css/pygments-default.css
new file mode 100644
index 0000000000..f5815c25ca
--- /dev/null
+++ b/docs/css/pygments-default.css
@@ -0,0 +1,76 @@
+/*
+Documentation for pygments (and Jekyll for that matter) is super sparse.
+To generate this, I had to run
+ `pygmentize -S default -f html > pygments-default.css`
+But first I had to install pygments via easy_install pygments
+
+I had to override the conflicting bootstrap style rules by linking to
+this stylesheet lower in the html than the bootstrap css.
+
+Also, I was thrown off for a while at first when I was using a markdown
+code block inside my {% highlight scala %} ... {% endhighlight %} tags
+(I was using 4 spaces for this); it turns out that pygments will
+insert the code (or pre?) tags for you.
+
+*/
+.hll { background-color: #ffffcc }
+.c { color: #408080; font-style: italic } /* Comment */
+.err { border: 1px solid #FF0000 } /* Error */
+.k { color: #008000; font-weight: bold } /* Keyword */
+.o { color: #666666 } /* Operator */
+.cm { color: #408080; font-style: italic } /* Comment.Multiline */
+.cp { color: #BC7A00 } /* Comment.Preproc */
+.c1 { color: #408080; font-style: italic } /* Comment.Single */
+.cs { color: #408080; font-style: italic } /* Comment.Special */
+.gd { color: #A00000 } /* Generic.Deleted */
+.ge { font-style: italic } /* Generic.Emph */
+.gr { color: #FF0000 } /* Generic.Error */
+.gh { color: #000080; font-weight: bold } /* Generic.Heading */
+.gi { color: #00A000 } /* Generic.Inserted */
+.go { color: #808080 } /* Generic.Output */
+.gp { color: #000080; font-weight: bold } /* Generic.Prompt */
+.gs { font-weight: bold } /* Generic.Strong */
+.gu { color: #800080; font-weight: bold } /* Generic.Subheading */
+.gt { color: #0040D0 } /* Generic.Traceback */
+.kc { color: #008000; font-weight: bold } /* Keyword.Constant */
+.kd { color: #008000; font-weight: bold } /* Keyword.Declaration */
+.kn { color: #008000; font-weight: bold } /* Keyword.Namespace */
+.kp { color: #008000 } /* Keyword.Pseudo */
+.kr { color: #008000; font-weight: bold } /* Keyword.Reserved */
+.kt { color: #B00040 } /* Keyword.Type */
+.m { color: #666666 } /* Literal.Number */
+.s { color: #BA2121 } /* Literal.String */
+.na { color: #7D9029 } /* Name.Attribute */
+.nb { color: #008000 } /* Name.Builtin */
+.nc { color: #0000FF; font-weight: bold } /* Name.Class */
+.no { color: #880000 } /* Name.Constant */
+.nd { color: #AA22FF } /* Name.Decorator */
+.ni { color: #999999; font-weight: bold } /* Name.Entity */
+.ne { color: #D2413A; font-weight: bold } /* Name.Exception */
+.nf { color: #0000FF } /* Name.Function */
+.nl { color: #A0A000 } /* Name.Label */
+.nn { color: #0000FF; font-weight: bold } /* Name.Namespace */
+.nt { color: #008000; font-weight: bold } /* Name.Tag */
+.nv { color: #19177C } /* Name.Variable */
+.ow { color: #AA22FF; font-weight: bold } /* Operator.Word */
+.w { color: #bbbbbb } /* Text.Whitespace */
+.mf { color: #666666 } /* Literal.Number.Float */
+.mh { color: #666666 } /* Literal.Number.Hex */
+.mi { color: #666666 } /* Literal.Number.Integer */
+.mo { color: #666666 } /* Literal.Number.Oct */
+.sb { color: #BA2121 } /* Literal.String.Backtick */
+.sc { color: #BA2121 } /* Literal.String.Char */
+.sd { color: #BA2121; font-style: italic } /* Literal.String.Doc */
+.s2 { color: #BA2121 } /* Literal.String.Double */
+.se { color: #BB6622; font-weight: bold } /* Literal.String.Escape */
+.sh { color: #BA2121 } /* Literal.String.Heredoc */
+.si { color: #BB6688; font-weight: bold } /* Literal.String.Interpol */
+.sx { color: #008000 } /* Literal.String.Other */
+.sr { color: #BB6688 } /* Literal.String.Regex */
+.s1 { color: #BA2121 } /* Literal.String.Single */
+.ss { color: #19177C } /* Literal.String.Symbol */
+.bp { color: #008000 } /* Name.Builtin.Pseudo */
+.vc { color: #19177C } /* Name.Variable.Class */
+.vg { color: #19177C } /* Name.Variable.Global */
+.vi { color: #19177C } /* Name.Variable.Instance */
+.il { color: #666666 } /* Literal.Number.Integer.Long */
diff --git a/docs/ec2-scripts.md b/docs/ec2-scripts.md
index 35d28c47d0..73578c8457 100644
--- a/docs/ec2-scripts.md
+++ b/docs/ec2-scripts.md
@@ -122,11 +122,11 @@ root partitions and their `persistent-hdfs`. Stopped machines will not
cost you any EC2 cycles, but ***will*** continue to cost money for EBS
storage.
-- To stop one of your clusters, go into the `ec2` directory and run
+- To stop one of your clusters, go into the `ec2` directory and run
`./spark-ec2 stop <cluster-name>`.
-- To restart it later, run
+- To restart it later, run
`./spark-ec2 -i <key-file> start <cluster-name>`.
-- To ultimately destroy the cluster and stop consuming EBS space, run
+- To ultimately destroy the cluster and stop consuming EBS space, run
`./spark-ec2 destroy <cluster-name>` as described in the previous
section.
@@ -137,10 +137,10 @@ Limitations
It should not be hard to make it launch VMs in other zones, but you will need
to create your own AMIs in them.
- Support for "cluster compute" nodes is limited -- there's no way to specify a
- locality group. However, you can launch slave nodes in your `<clusterName>-slaves`
- group manually and then use `spark-ec2 launch --resume` to start a cluster with
- them.
+ locality group. However, you can launch slave nodes in your
+ `<clusterName>-slaves` group manually and then use `spark-ec2 launch
+ --resume` to start a cluster with them.
- Support for spot instances is limited.
If you have a patch or suggestion for one of these limitations, feel free to
-[[contribute|Contributing to Spark]] it!
+[contribute]({{HOME_PATH}}contributing-to-spark.html) it!
diff --git a/docs/index.md b/docs/index.md
index 3d1c0a45ba..48ab151e41 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -14,7 +14,7 @@ Get Spark by checking out the master branch of the Git repository, using `git cl
Spark requires [Scala 2.9](http://www.scala-lang.org/).
In addition, to run Spark on a cluster, you will need to install [Mesos](http://incubator.apache.org/mesos/), using the steps in
-[[Running Spark on Mesos]]. However, if you just want to run Spark on a single machine (possibly using multiple cores),
+[Running Spark on Mesos]({{HOME_PATH}}running-on-mesos.html). However, if you just want to run Spark on a single machine (possibly using multiple cores),
you do not need Mesos.
To build and run Spark, you will need to have Scala's `bin` directory in your `PATH`,
@@ -51,12 +51,12 @@ of `project/SparkBuild.scala`, then rebuilding Spark (`sbt/sbt clean compile`).
# Where to Go from Here
-* [Spark Programming Guide](/programming-guide.html): how to get started using Spark, and details on the API
-* [Running Spark on Amazon EC2](/running-on-amazon-ec2.html): scripts that let you launch a cluster on EC2 in about 5 minutes
-* [Running Spark on Mesos](/running-on-mesos.html): instructions on how to deploy to a private cluster
-* [Configuration](/configuration.html)
-* [Bagel Programming Guide](/bagel-programming-guide.html): implementation of Google's Pregel on Spark
-* [Spark Debugger](/spark-debugger.html): experimental work on a debugger for Spark jobs
+* [Spark Programming Guide]({{HOME_PATH}}programming-guide.html): how to get started using Spark, and details on the API
+* [Running Spark on Amazon EC2]({{HOME_PATH}}running-on-amazon-ec2.html): scripts that let you launch a cluster on EC2 in about 5 minutes
+* [Running Spark on Mesos]({{HOME_PATH}}running-on-mesos.html): instructions on how to deploy to a private cluster
+* [Configuration]({{HOME_PATH}}configuration.html)
+* [Bagel Programming Guide]({{HOME_PATH}}bagel-programming-guide.html): implementation of Google's Pregel on Spark
+* [Spark Debugger]({{HOME_PATH}}spark-debugger.html): experimental work on a debugger for Spark jobs
* [Contributing to Spark](contributing-to-spark.html)
# Other Resources
@@ -73,4 +73,4 @@ To keep up with Spark development or get help, sign up for the [spark-users mail
If you're in the San Francisco Bay Area, there's a regular [Spark meetup](http://www.meetup.com/spark-users/) every few weeks. Come by to meet the developers and other users.
-If you'd like to contribute code to Spark, read [how to contribute](Contributing to Spark).
+If you'd like to contribute code to Spark, read [how to contribute]({{HOME_PATH}}contributing-to-spark.html).
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 8106e5bee6..94d304e23a 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -14,17 +14,21 @@ To write a Spark application, you will need to add both Spark and its dependenci
In addition, you'll need to import some Spark classes and implicit conversions. Add the following lines at the top of your program:
- import spark.SparkContext
- import SparkContext._
+{% highlight scala %}
+import spark.SparkContext
+import SparkContext._
+{% endhighlight %}
# Initializing Spark
The first thing a Spark program must do is to create a `SparkContext` object, which tells Spark how to access a cluster.
This is done through the following constructor:
- new SparkContext(master, jobName, [sparkHome], [jars])
+{% highlight scala %}
+new SparkContext(master, jobName, [sparkHome], [jars])
+{% endhighlight %}
-The `master` parameter is a string specifying a [Mesos](Running Spark on Mesos) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running on Mesos, as described later.
+The `master` parameter is a string specifying a [Mesos]({{HOME_PATH}}running-on-mesos.html) cluster to connect to, or a special "local" string to run in local mode, as described below. `jobName` is a name for your job, which will be shown in the Mesos web UI when running on a cluster. Finally, the last two parameters are needed to deploy your code to a cluster if running on Mesos, as described later.
In the Spark interpreter, a special interpreter-aware SparkContext is already created for you, in the variable called `sc`. Making your own SparkContext will not work. You can set which master the context connects to using the `MASTER` environment variable. For example, run `MASTER=local[4] ./spark-shell` to run locally with four cores.
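As a quick illustration of the constructor described above, here are two hedged sketches: one for local mode and one for a Mesos cluster. The hostname, Spark path, and JAR name are placeholders, not values from a real deployment.

{% highlight scala %}
// Local mode with 4 worker threads; no deployment parameters needed.
val localSc = new SparkContext("local[4]", "My Local Job")

// Mesos cluster: also tell Spark where it is installed on the workers
// and which JARs contain your job's code (placeholder values).
val clusterSc = new SparkContext(
  "<mesos-master-hostname>:5050",   // master
  "My Cluster Job",                 // jobName
  "/home/user/spark",               // sparkHome on the workers
  List("target/my-job.jar"))        // jars to ship
{% endhighlight %}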
@@ -36,7 +40,7 @@ The master name can be in one of three formats:
<tr><th>Master Name</th><th>Meaning</th></tr>
<tr><td> local </td><td> Run Spark locally with one worker thread (i.e. no parallelism at all). </td></tr>
<tr><td> local[K] </td><td> Run Spark locally with K worker threads (which should be set to the number of cores on your machine). </td></tr>
-<tr><td> HOST:PORT </td><td> Connect Spark to the given <a href="https://github.com/mesos/spark/wiki/Running-spark-on-mesos">Mesos</a> master to run on a cluster. The host parameter is the hostname of the Mesos master. The port must be whichever one the master is configured to use, which is 5050 by default.
+<tr><td> HOST:PORT </td><td> Connect Spark to the given <a href="{{HOME_PATH}}running-on-mesos.html">Mesos</a> master to run on a cluster. The host parameter is the hostname of the Mesos master. The port must be whichever one the master is configured to use, which is 5050 by default.
<br /><br />
<strong>NOTE:</strong> In earlier versions of Mesos (the <code>old-mesos</code> branch of Spark), you need to use master@HOST:PORT.
</td></tr>
@@ -49,7 +53,7 @@ If you want to run your job on a cluster, you will need to specify the two optio
* `sparkHome`: The path at which Spark is installed on your worker machines (it should be the same on all of them).
* `jars`: A list of JAR files on the local machine containing your job's code and any dependencies, which Spark will deploy to all the worker nodes. You'll need to package your job into a set of JARs using your build system. For example, if you're using SBT, the [sbt-assembly](https://github.com/sbt/sbt-assembly) plugin is a good way to make a single JAR with your code and dependencies.
-If some classes will be shared across _all_ your jobs, it's also possible to copy them to the workers manually and set the `SPARK_CLASSPATH` environment variable in `conf/spark-env.sh` to point to them; see [[Configuration]] for details.
+If some classes will be shared across _all_ your jobs, it's also possible to copy them to the workers manually and set the `SPARK_CLASSPATH` environment variable in `conf/spark-env.sh` to point to them; see [Configuration]({{HOME_PATH}}configuration.html) for details.
# Distributed Datasets
@@ -60,11 +64,13 @@ Spark revolves around the concept of a _resilient distributed dataset_ (RDD), wh
Parallelized collections are created by calling `SparkContext`'s `parallelize` method on an existing Scala collection (a `Seq` object). The elements of the collection are copied to form a distributed dataset that can be operated on in parallel. For example, here is some interpreter output showing how to create a parallel collection from an array:
- scala> val data = Array(1, 2, 3, 4, 5)
- data: Array[Int] = Array(1, 2, 3, 4, 5)
-
- scala> val distData = sc.parallelize(data)
- distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
+{% highlight scala %}
+scala> val data = Array(1, 2, 3, 4, 5)
+data: Array[Int] = Array(1, 2, 3, 4, 5)
+
+scala> val distData = sc.parallelize(data)
+distData: spark.RDD[Int] = spark.ParallelCollection@10d13e3e
+{% endhighlight %}
Once created, the distributed dataset (`distData` here) can be operated on in parallel. For example, we might call `distData.reduce(_ + _)` to add up the elements of the array. We describe operations on distributed datasets later on.
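For instance, the `distData.reduce(_ + _)` call mentioned above looks like this in the interpreter (the `res` name is whatever the interpreter assigns in your session):

{% highlight scala %}
scala> distData.reduce(_ + _)   // sums the elements 1 + 2 + 3 + 4 + 5
res0: Int = 15
{% endhighlight %}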
@@ -72,12 +78,14 @@ One important parameter for parallel collections is the number of *slices* to cu
## Hadoop Datasets
-Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, [Amazon S3|http://wiki.apache.org/hadoop/AmazonS3]], Hypertable, HBase, etc). Spark supports text files, [[SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop InputFormat.
+Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by Hadoop (including your local file system, [Amazon S3](http://wiki.apache.org/hadoop/AmazonS3), Hypertable, HBase, etc). Spark supports text files, [SequenceFiles](http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/mapred/SequenceFileInputFormat.html), and any other Hadoop InputFormat.
Text file RDDs can be created using `SparkContext`'s `textFile` method. This method takes a URI for the file (either a local path on the machine, or a `hdfs://`, `s3n://`, `kfs://`, etc. URI). Here is an example invocation:
- scala> val distFile = sc.textFile("data.txt")
- distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
+{% highlight scala %}
+scala> val distFile = sc.textFile("data.txt")
+distFile: spark.RDD[String] = spark.HadoopRDD@1d4cee08
+{% endhighlight %}
Once created, `distFile` can be acted on by dataset operations. For example, we can add up the sizes of all the lines using the `map` and `reduce` operations as follows: `distFile.map(_.size).reduce(_ + _)`.
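Continuing that session, the line-size computation mentioned above could be run as follows; the result shown is a made-up placeholder, since it depends entirely on the contents of `data.txt`:

{% highlight scala %}
scala> distFile.map(_.size).reduce(_ + _)   // total characters across all lines
res1: Int = 1024   // placeholder value; depends on data.txt
{% endhighlight %}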
@@ -142,11 +150,13 @@ Broadcast variables allow the programmer to keep a read-only variable cached on
Broadcast variables are created from a variable `v` by calling `SparkContext.broadcast(v)`. The broadcast variable is a wrapper around `v`, and its value can be accessed by calling the `value` method. The interpreter session below shows this:
- scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
- broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-a864-4c7d-b9bf-d87e1a4e787c)
+{% highlight scala %}
+scala> val broadcastVar = sc.broadcast(Array(1, 2, 3))
+broadcastVar: spark.Broadcast[Array[Int]] = spark.Broadcast(b5c40191-a864-4c7d-b9bf-d87e1a4e787c)
- scala> broadcastVar.value
- res0: Array[Int] = Array(1, 2, 3)
+scala> broadcastVar.value
+res0: Array[Int] = Array(1, 2, 3)
+{% endhighlight %}
After the broadcast variable is created, it should be used instead of the value `v` in any functions run on the cluster so that `v` is not shipped to the nodes more than once. In addition, the object `v` should not be modified after it is broadcast in order to ensure that all nodes get the same value of the broadcast variable (e.g. if the variable is shipped to a new node later).
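As a hedged sketch of the guidance above, a function shipped to the cluster would read `broadcastVar.value` inside the closure rather than capturing the original local array; `distData` here is assumed to be the parallelized collection from the earlier example.

{% highlight scala %}
// Reference the broadcast wrapper inside the closure so the array is
// shipped to each node only once, instead of once per task.
val shifted = distData.map(x => x + broadcastVar.value.sum)
{% endhighlight %}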
@@ -157,15 +167,18 @@ Accumulators are variables that are only "added" to through an associative opera
An accumulator is created from an initial value `v` by calling `SparkContext.accumulator(v)`. Tasks running on the cluster can then add to it using the `+=` operator. However, they cannot read its value. Only the driver program can read the accumulator's value, using its `value` method.
The interpreter session below shows an accumulator being used to add up the elements of an array:
- scala> val accum = sc.accumulator(0)
- accum: spark.Accumulator[Int] = 0
-
- scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
- ...
- 10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
-
- scala> accum.value
- res2: Int = 10
+
+{% highlight scala %}
+scala> val accum = sc.accumulator(0)
+accum: spark.Accumulator[Int] = 0
+
+scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
+...
+10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
+
+scala> accum.value
+res2: Int = 10
+{% endhighlight %}
# Where to Go from Here
diff --git a/docs/running-on-amazon-ec2.md b/docs/running-on-amazon-ec2.md
index 26cf9bd767..4e1c191bda 100644
--- a/docs/running-on-amazon-ec2.md
+++ b/docs/running-on-amazon-ec2.md
@@ -6,7 +6,7 @@ This guide describes how to get Spark running on an EC2 cluster. It assumes you
# For Spark 0.5
-Spark now includes some [EC2 Scripts](/ec2-scripts.html) for launching and managing clusters on EC2. You can typically launch a cluster in about five minutes. Follow the instructions at this link for details.
+Spark now includes some [EC2 Scripts]({{HOME_PATH}}ec2-scripts.html) for launching and managing clusters on EC2. You can typically launch a cluster in about five minutes. Follow the instructions at this link for details.
# For older versions of Spark
diff --git a/docs/running-on-mesos.md b/docs/running-on-mesos.md
index b6bfff9da3..9807228121 100644
--- a/docs/running-on-mesos.md
+++ b/docs/running-on-mesos.md
@@ -4,12 +4,12 @@ title: Running Spark on Mesos
---
# Running Spark on Mesos
-To run on a cluster, Spark uses the [[Apache Mesos|http://incubator.apache.org/mesos/]] resource manager. Follow the steps below to install Mesos and Spark:
+To run on a cluster, Spark uses the [Apache Mesos](http://incubator.apache.org/mesos/) resource manager. Follow the steps below to install Mesos and Spark:
### For Spark 0.5:
-1. Download and build Spark using the instructions [[here|Home]].
-2. Download Mesos 0.9.0 from a [[mirror|http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-0.9.0-incubating/]].
+1. Download and build Spark using the instructions [here]({{HOME_PATH}}index.html).
+2. Download Mesos 0.9.0 from a [mirror](http://www.apache.org/dyn/closer.cgi/incubator/mesos/mesos-0.9.0-incubating/).
3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.macosx`, that you can run. See the README file in Mesos for other options. **Note:** If you want to run Mesos without installing it into the default paths on your system (e.g. if you don't have administrative privileges to install it), you should also pass the `--prefix` option to `configure` to tell it where to install. For example, pass `--prefix=/home/user/mesos`. By default the prefix is `/usr/local`.
4. Build Mesos using `make`, and then install it using `make install`.
5. Create a file called `spark-env.sh` in Spark's `conf` directory, by copying `conf/spark-env.sh.template`, and add the following lines in it:
@@ -26,7 +26,7 @@ To run on a cluster, Spark uses the [[Apache Mesos|http://incubator.apache.org/m
### For Spark versions before 0.5:
-1. Download and build Spark using the instructions [[here|Home]].
+1. Download and build Spark using the instructions [here]({{HOME_PATH}}index.html).
2. Download either revision 1205738 of Mesos if you're using the master branch of Spark, or the pre-protobuf branch of Mesos if you're using Spark 0.3 or earlier (note that for new users, _we recommend the master branch instead of 0.3_). For revision 1205738 of Mesos, use:
<pre>
svn checkout -r 1205738 http://svn.apache.org/repos/asf/incubator/mesos/trunk mesos
@@ -35,20 +35,20 @@ For the pre-protobuf branch (for Spark 0.3 and earlier), use:
<pre>git clone git://github.com/mesos/mesos
cd mesos
git checkout --track origin/pre-protobuf</pre>
-3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.template.macosx`, so you can just run the one on your platform if it exists. See the [[Mesos wiki|https://github.com/mesos/mesos/wiki]] for other configuration options.
+3. Configure Mesos using the `configure` script, passing the location of your `JAVA_HOME` using `--with-java-home`. Mesos comes with "template" configure scripts for different platforms, such as `configure.template.macosx`, so you can just run the one on your platform if it exists. See the [Mesos wiki](https://github.com/mesos/mesos/wiki) for other configuration options.
4. Build Mesos using `make`.
5. In Spark's `conf/spark-env.sh` file, add `export MESOS_HOME=<path to Mesos directory>`. If you don't have a `spark-env.sh`, copy `conf/spark-env.sh.template`. You should also set `SCALA_HOME` there if it's not on your system's default path.
6. Copy Spark and Mesos to the _same_ path on all the nodes in the cluster.
7. Configure Mesos for deployment:
* On your master node, edit `MESOS_HOME/conf/masters` to list your master and `MESOS_HOME/conf/slaves` to list the slaves. Also, edit `MESOS_HOME/conf/mesos.conf` and add the line `failover_timeout=1` to change a timeout parameter that is too high by default.
* Run `MESOS_HOME/deploy/start-mesos` to start it up. If all goes well, you should see Mesos's web UI on port 8080 of the master machine.
- * See Mesos's [[deploy instructions|https://github.com/mesos/mesos/wiki/Deploy-Scripts]] for more information on deploying it.
+ * See Mesos's [deploy instructions](https://github.com/mesos/mesos/wiki/Deploy-Scripts) for more information on deploying it.
8. To run a Spark job against the cluster, when you create your `SparkContext`, pass the string `master@HOST:5050` as the first parameter, where `HOST` is the machine running your Mesos master. In addition, pass the location of Spark on your nodes as the third parameter, and a list of JAR files containing your job's code as the fourth (these will automatically get copied to the workers). For example:
<pre>new SparkContext("master@HOST:5050", "My Job Name", "/home/user/spark", List("my-job.jar"))</pre>
## Running on Amazon EC2
-If you want to run Spark on Amazon EC2, there's an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured: the [[EC2 launch scripts|Running-Spark-on-Amazon-EC2]]. This will get you a cluster in about five minutes without any configuration on your part.
+If you want to run Spark on Amazon EC2, there's an easy way to launch a cluster with Mesos, Spark, and HDFS pre-configured: the [EC2 launch scripts]({{HOME_PATH}}running-on-amazon-ec2.html). This will get you a cluster in about five minutes without any configuration on your part.
## Running Alongside Hadoop