author     Matei Zaharia <matei@eecs.berkeley.edu>  2013-08-30 15:04:43 -0700
committer  Matei Zaharia <matei@eecs.berkeley.edu>  2013-08-30 15:04:43 -0700
commit     4293533032bd5c354bb011f8d508b99615c6e0f0 (patch)
tree       e82fd2cc72c90ed98f5b0f1f4a74593cf3e6c54b /docs
parent     f3a964848dd2ba65491f3eea8a54439069aa1b29 (diff)
Update docs about HDFS versions
Diffstat (limited to 'docs')
-rw-r--r--  docs/index.md                    27
-rw-r--r--  docs/quick-start.md              20
-rw-r--r--  docs/scala-programming-guide.md  14
3 files changed, 41 insertions, 20 deletions
diff --git a/docs/index.md b/docs/index.md
index 5aa7f74059..cb51d4cadc 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -3,42 +3,37 @@ layout: global
title: Spark Overview
---
-Apache Spark is a cluster computing engine that aims to make data analytics both easier and faster.
-It provides rich, language-integrated APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html), and a powerful execution engine that supports general operator graphs.
+Apache Spark is a cluster computing system that aims to make data analytics faster to run and faster to write.
+It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html), and a general execution engine that supports rich operator graphs.
Spark can run on the Apache Mesos cluster manager, Hadoop YARN, Amazon EC2, or without an independent resource manager ("standalone mode").
# Downloading
-Get Spark from the [downloads page](http://spark.incubator.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}.
+Get Spark by visiting the [downloads page](http://spark.incubator.apache.org/downloads.html) of the Apache Spark site. This documentation is for Spark version {{site.SPARK_VERSION}}.
# Building
-Spark requires [Scala {{site.SCALA_VERSION}}](http://www.scala-lang.org/). You will need to have Scala's `bin` directory in your `PATH`,
-or you will need to set the `SCALA_HOME` environment variable to point
-to where you've installed Scala. Scala must also be accessible through one
-of these methods on slave nodes on your cluster.
-
Spark uses [Simple Build Tool](http://www.scala-sbt.org), which is bundled with it. To compile the code, go into the top-level Spark directory and run
sbt/sbt assembly
-Spark also supports building using Maven. If you would like to build using Maven, see the [instructions for building Spark with Maven](building-with-maven.html).
+For its Scala API, Spark {{site.SPARK_VERSION}} depends on Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you will need to use this same version of Scala in your own program -- newer major versions may not work. You can get the right version of Scala from [scala-lang.org](http://www.scala-lang.org/download/).
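For sbt users, this usually just means pinning `scalaVersion` in the build definition. A minimal sketch (the setting is standard sbt; the value should match the Scala release Spark was built against):

{% highlight scala %}
// Hypothetical sbt setting: build against the same Scala release as Spark.
scalaVersion := "{{site.SCALA_VERSION}}"
{% endhighlight %}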
# Testing the Build
-Spark comes with a number of sample programs in the `examples` directory.
+Spark comes with several sample programs in the `examples` directory.
To run one of the samples, use `./run-example <class> <params>` in the top-level Spark directory
-(the `run` script sets up the appropriate paths and launches that program).
-For example, `./run-example spark.examples.SparkPi` will run a sample program that estimates Pi. Each of the
-examples prints usage help if no params are given.
+(the `run-example` script sets up the appropriate paths and launches that program).
+For example, `./run-example spark.examples.SparkPi` will run a sample program that estimates Pi. Each
+example prints usage help if no params are given.
Note that all of the sample programs take a `<master>` parameter specifying the cluster URL
to connect to. This can be a [URL for a distributed cluster](scala-programming-guide.html#master-urls),
or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using
`local` for testing.
-Finally, Spark can be used interactively from a modified version of the Scala interpreter that you can start through
-`./spark-shell`. This is a great way to learn Spark.
+Finally, Spark can be used interactively through modified versions of the Scala shell (`./spark-shell`) or
+Python interpreter (`./pyspark`). These are a great way to learn Spark.
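As a taste of the interactive shell, here is a small hypothetical `./spark-shell` session (the shell predefines a `SparkContext` named `sc`; the data and output below are illustrative):

{% highlight scala %}
// Hypothetical spark-shell session: count the even numbers in a small local dataset.
scala> val nums = sc.parallelize(1 to 1000)
scala> nums.filter(_ % 2 == 0).count()
res0: Long = 500
{% endhighlight %}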
# A Note About Hadoop Versions
@@ -50,7 +45,7 @@ You can do this by setting the `SPARK_HADOOP_VERSION` variable when compiling:
SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
In addition, if you wish to run Spark on [YARN](running-on-yarn.md), you should also
-set `SPARK_YARN` to `true`:
+set `SPARK_YARN`:
SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 4e9deadbaa..bac5d690a6 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -142,7 +142,13 @@ resolvers ++= Seq(
"Spray Repository" at "http://repo.spray.cc/")
{% endhighlight %}
-Of course, for sbt to work correctly, we'll need to layout `SimpleJob.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the job's code, then use `sbt run` to execute our example job.
+If you also wish to read data from Hadoop's HDFS, you will need to add a dependency on `hadoop-client` for your version of HDFS:
+
+{% highlight scala %}
+libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "<your-hdfs-version>"
+{% endhighlight %}
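For instance, a sketch of how this line might sit next to the Spark dependency already declared in `simple.sbt` (the HDFS version shown is illustrative; use the one your cluster runs):

{% highlight scala %}
// Hypothetical excerpt from simple.sbt: Spark core plus a hadoop-client matching the cluster's HDFS.
libraryDependencies ++= Seq(
  "org.spark-project" %% "spark-core" % "{{site.SPARK_VERSION}}",
  "org.apache.hadoop" % "hadoop-client" % "1.0.4"  // replace with your HDFS version
)
{% endhighlight %}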
+
+Finally, for sbt to work correctly, we'll need to lay out `SimpleJob.scala` and `simple.sbt` according to the typical directory structure. Once that is in place, we can create a JAR package containing the job's code, then use `sbt run` to execute our example job.
{% highlight bash %}
$ find .
@@ -223,6 +229,16 @@ To build the job, we also write a Maven `pom.xml` file that lists Spark as a dep
</project>
{% endhighlight %}
+If you also wish to read data from Hadoop's HDFS, you will need to add a dependency on `hadoop-client` for your version of HDFS:
+
+{% highlight xml %}
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client</artifactId>
+      <version>...</version>
+    </dependency>
+{% endhighlight %}
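As with the sbt dependency above, the `<version>` here should match the HDFS release your cluster runs; by default Spark itself links against HDFS 1.0.4.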
+
We lay out these files according to the canonical Maven directory structure:
{% highlight bash %}
$ find .
@@ -281,3 +297,5 @@ Lines with a: 46, Lines with b: 23
{% endhighlight python %}
This example only runs the job locally; for a tutorial on running jobs across several machines, see the [Standalone Mode](spark-standalone.html) documentation, and consider using a distributed input source, such as HDFS.
+
+Also, this example links against the default version of HDFS that Spark builds with (1.0.4). You can run it against other HDFS versions by [building Spark with another HDFS version](index.html#a-note-about-hadoop-versions).
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index db584d2096..e321b8f5b8 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -17,15 +17,23 @@ This guide shows each of these features and walks through some samples. It assum
# Linking with Spark
-To write a Spark application, you will need to add both Spark and its dependencies to your CLASSPATH. If you use sbt or Maven, Spark is available through Maven Central at:
+Spark {{site.SPARK_VERSION}} uses Scala {{site.SCALA_VERSION}}. If you write applications in Scala, you'll need to use this same version of Scala in your program -- newer major versions may not work.
+
+To write a Spark application, you need to add a dependency on Spark. If you use SBT or Maven, Spark is available through Maven Central at:
groupId = org.spark-project
artifactId = spark-core_{{site.SCALA_VERSION}}
version = {{site.SPARK_VERSION}}
-For other build systems or environments, you can run `sbt/sbt assembly` to build both Spark and its dependencies into one JAR (`core/target/spark-core-assembly-0.6.0.jar`), then add this to your CLASSPATH.
+In addition, if you wish to access an HDFS cluster, you need to add a dependency on `hadoop-client` for your version of HDFS:
+
+    groupId = org.apache.hadoop
+    artifactId = hadoop-client
+    version = <your-hdfs-version>
+
+For other build systems, you can run `sbt/sbt assembly` to pack Spark and its dependencies into one JAR (`assembly/target/scala-{{site.SCALA_VERSION}}/spark-assembly-{{site.SPARK_VERSION}}-hadoop*.jar`), then add this to your CLASSPATH. Set the HDFS version as described [here](index.html#a-note-about-hadoop-versions).
-In addition, you'll need to import some Spark classes and implicit conversions. Add the following lines at the top of your program:
+Finally, you need to import some Spark classes and implicit conversions into your program. Add the following lines:
{% highlight scala %}
import spark.SparkContext