author    Patrick Wendell <pwendell@gmail.com>  2013-08-24 14:50:58 -0700
committer Patrick Wendell <pwendell@gmail.com>  2013-08-24 14:50:58 -0700
commit    4879685910a1ee9f314fce1efe8d3ed879f3e64c
tree      3aef0af8b939a2b25b13b8b16fae006ba8b111fc
parent    c02585ea130045ef27e579172ac2acc71bc8da63
parent    d282c1ebbbe1aebbd409c06efedf95fb77833c35
Merge remote-tracking branch 'mesos/master' into ec2-updates
Diffstat (limited to 'docs')
 docs/_plugins/copy_api_dirs.rb      |  2
 docs/building-with-maven.md         | 35
 docs/configuration.md               |  2
 docs/running-on-yarn.md             | 20
 docs/streaming-custom-receivers.md  | 51
 docs/streaming-programming-guide.md |  3
 6 files changed, 84 insertions(+), 29 deletions(-)
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index 217254c59f..c574ea7f5c 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -18,7 +18,7 @@
require 'fileutils'
include FileUtils
-if ENV['SKIP_API'] != '1'
+if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
# Build Scaladoc for Java/Scala
projects = ["core", "examples", "repl", "bagel", "streaming", "mllib"]
diff --git a/docs/building-with-maven.md b/docs/building-with-maven.md
index 04cd79d039..a9f2cb8a7a 100644
--- a/docs/building-with-maven.md
+++ b/docs/building-with-maven.md
@@ -8,22 +8,26 @@ title: Building Spark with Maven
Building Spark using Maven requires Maven 3 (the build process is tested with Maven 3.0.4) and Java 1.6 or newer.
-Building with Maven requires that a Hadoop profile be specified explicitly at the command line, there is no default. There are two profiles to choose from, one for building for Hadoop 1 or Hadoop 2.
+## Specifying the Hadoop version ##
-for Hadoop 1 (using 0.20.205.0) use:
+To enable support for HDFS and other Hadoop-supported storage systems, specify the exact Hadoop version by setting the "hadoop.version" property. If unset, Spark will build against Hadoop 1.0.4 by default.
- $ mvn -Phadoop1 clean install
+For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop versions without YARN, use:
+ # Apache Hadoop 1.2.1
+ $ mvn -Dhadoop.version=1.2.1 clean install
-for Hadoop 2 (using 2.0.0-mr1-cdh4.1.1) use:
+ # Cloudera CDH 4.2.0 with MapReduce v1
+ $ mvn -Dhadoop.version=2.0.0-mr1-cdh4.2.0 clean install
- $ mvn -Phadoop2 clean install
+For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions with YARN, enable the "hadoop2-yarn" profile:
-It uses the scala-maven-plugin which supports incremental and continuous compilation. E.g.
+ # Apache Hadoop 2.0.5-alpha
+ $ mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha clean install
- $ mvn -Phadoop2 scala:cc
+ # Cloudera CDH 4.2.0 with MapReduce v2
+ $ mvn -Phadoop2-yarn -Dhadoop.version=2.0.0-cdh4.2.0 clean install
-…should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively.
## Spark Tests in Maven ##
@@ -31,11 +35,11 @@ Tests are run by default via the scalatest-maven-plugin. With this you can do th
Skip test execution (but not compilation):
- $ mvn -DskipTests -Phadoop2 clean install
+ $ mvn -Dhadoop.version=... -DskipTests clean install
To run a specific test suite:
- $ mvn -Phadoop2 -Dsuites=spark.repl.ReplSuite test
+ $ mvn -Dhadoop.version=... -Dsuites=spark.repl.ReplSuite test
## Setting up JVM Memory Usage Via Maven ##
@@ -53,6 +57,15 @@ To fix these, you can do the following:
export MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=128M"
+## Continuous Compilation ##
+
+We use the scala-maven-plugin which supports incremental and continuous compilation. E.g.
+
+ $ mvn scala:cc
+
+…should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively.
+
+
## Using With IntelliJ IDEA ##
This setup works fine in IntelliJ IDEA 11.1.4. After opening the project via the pom.xml file in the project root folder, you only need to activate either the hadoop1 or hadoop2 profile in the "Maven Properties" popout. We have not tried Eclipse/Scala IDE with this.
@@ -61,6 +74,6 @@ This setup works fine in IntelliJ IDEA 11.1.4. After opening the project via the
It includes support for building a Debian package containing a 'fat-jar' which includes the repl, the examples and bagel. This can be created by specifying the deb profile:
- $ mvn -Phadoop2,deb clean install
+ $ mvn -Pdeb clean install
The Debian package can then be found under repl/target. We added the short commit hash to the file name so that we can distinguish individual packages built for SNAPSHOT versions.
diff --git a/docs/configuration.md b/docs/configuration.md
index dff08a06f5..b125eeb03c 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -146,7 +146,7 @@ Apart from these, the following properties are also available, and may be useful
</tr>
<tr>
<td>spark.ui.port</td>
- <td>33000</td>
+ <td>3030</td>
<td>
Port for your application's dashboard, which shows memory and workload data
</td>
diff --git a/docs/running-on-yarn.md b/docs/running-on-yarn.md
index 9c2cedfd88..6bada9bdd7 100644
--- a/docs/running-on-yarn.md
+++ b/docs/running-on-yarn.md
@@ -6,7 +6,7 @@ title: Launching Spark on YARN
Experimental support for running over a [YARN (Hadoop
NextGen)](http://hadoop.apache.org/docs/r2.0.2-alpha/hadoop-yarn/hadoop-yarn-site/YARN.html)
cluster was added to Spark in version 0.6.0. This was merged into master as part of the 0.7 effort.
-To build spark core with YARN support, please use the hadoop2-yarn profile.
+To build Spark with YARN support, please use the hadoop2-yarn profile.
Ex: mvn -Phadoop2-yarn clean install
# Building spark core consolidated jar.
@@ -15,18 +15,12 @@ We need a consolidated spark core jar (which bundles all the required dependenci
This can be built either through sbt or via Maven.
- Building spark assembled jar via sbt.
- It is a manual process of enabling it in project/SparkBuild.scala.
-Please comment out the
- HADOOP_VERSION, HADOOP_MAJOR_VERSION and HADOOP_YARN
-variables before the line 'For Hadoop 2 YARN support'
-Next, uncomment the subsequent 3 variable declaration lines (for these three variables) which enable hadoop yarn support.
+Enable YARN support by setting `SPARK_WITH_YARN=true` when invoking sbt:
-Assembly of the jar Ex:
-
- ./sbt/sbt clean assembly
+ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_WITH_YARN=true ./sbt/sbt clean assembly
The assembled jar would typically be something like:
-`./core/target/spark-core-assembly-0.8.0-SNAPSHOT.jar`
+`./yarn/target/spark-yarn-assembly-0.8.0-SNAPSHOT.jar`
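
For orientation, the following is a minimal, hypothetical sketch of how a Scala build definition can branch on such an environment flag. It is not the actual contents of project/SparkBuild.scala; the variable name and default are illustrative only.

{% highlight scala %}
// Hypothetical sketch only: reading the SPARK_WITH_YARN flag in a
// Scala build file. The real logic in project/SparkBuild.scala may differ.
val isYarnEnabled: Boolean =
  scala.util.Properties.envOrNone("SPARK_WITH_YARN") match {
    case None    => false         // default: build without YARN support
    case Some(v) => v.toBoolean   // "true" pulls in the yarn subproject
  }
{% endhighlight %}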
- Building spark assembled jar via Maven.
@@ -34,16 +28,16 @@ The assembled jar would typically be something like :
Something like this. Ex:
- mvn -Phadoop2-yarn clean package -DskipTests=true
+ mvn -Phadoop2-yarn -Dhadoop.version=2.0.5-alpha clean package -DskipTests=true
This will build the shaded (consolidated) jar. Typically something like:
-`./repl-bin/target/spark-repl-bin-<VERSION>-shaded-hadoop2-yarn.jar`
+`./yarn/target/spark-yarn-bin-<VERSION>-shaded.jar`
# Preparations
-- Building spark core assembled jar (see above).
+- Building spark-yarn assembly (see above).
- Your application code must be packaged into a separate JAR file.
If you want to test out the YARN deployment mode, you can use the current Spark examples. A `spark-examples_{{site.SCALA_VERSION}}-{{site.SPARK_VERSION}}` file can be generated by running `sbt/sbt package`. NOTE: since the documentation you're reading is for Spark version {{site.SPARK_VERSION}}, we are assuming here that you have downloaded Spark {{site.SPARK_VERSION}} or checked it out of source control. If you are using a different version of Spark, the version numbers in the jar generated by the sbt package command will obviously be different.
diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md
index 5476c00d02..dfa343bf94 100644
--- a/docs/streaming-custom-receivers.md
+++ b/docs/streaming-custom-receivers.md
@@ -7,10 +7,45 @@ A "Spark Streaming" receiver can be a simple network stream, streams of messages
This guide shows the programming model and features by walking through a simple sample receiver and corresponding Spark Streaming application.
+### Write a simple receiver
-## A quick and naive walk-through
+This starts with implementing [NetworkReceiver](#References).
-### Write a simple receiver
+The following is a simple socket text-stream receiver.
+
+{% highlight scala %}
+
+ // Imports added for completeness; package names follow the
+ // NetworkReceiver scaladoc linked in the References below.
+ import java.io.{BufferedReader, InputStreamReader}
+ import java.net.Socket
+
+ import spark.storage.StorageLevel
+ import spark.streaming.dstream.NetworkReceiver
+
+ class SocketTextStreamReceiver(host: String, port: Int)
+   extends NetworkReceiver[String] {
+
+ protected lazy val blocksGenerator: BlockGenerator =
+ new BlockGenerator(StorageLevel.MEMORY_ONLY_SER_2)
+
+ protected def onStart() = {
+ blocksGenerator.start()
+ val socket = new Socket(host, port)
+ val dataInputStream = new BufferedReader(new InputStreamReader(socket.getInputStream(), "UTF-8"))
+ var data: String = dataInputStream.readLine()
+ while (data != null) {
+ blocksGenerator += data
+ data = dataInputStream.readLine()
+ }
+ }
+
+ protected def onStop() {
+ blocksGenerator.stop()
+ }
+
+ }
+
+{% endhighlight %}
+
+
+All we did here is extend NetworkReceiver and call the blocksGenerator's `+=` method to push our blocks of data. Please refer to the scala-docs of NetworkReceiver for more details.
+
+
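As a usage sketch, here is how the receiver above could be wired into a complete (if minimal) streaming application. This assumes `import spark.streaming.{StreamingContext, Seconds}`; the master string, application name, and port are placeholder values.

{% highlight scala %}
// Sketch: plugging SocketTextStreamReceiver into a StreamingContext.
// "local[2]", "SocketReceiverDemo", and port 8445 are illustrative only.
val ssc = new StreamingContext("local[2]", "SocketReceiverDemo", Seconds(1))
val lines = ssc.networkStream[String](
  new SocketTextStreamReceiver("localhost", 8445))
lines.print()  // print the first few elements of each batch
ssc.start()
{% endhighlight %}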
+### An Actor as Receiver
This starts with implementing [Actor](#References).
@@ -46,7 +81,16 @@ All we did here is mixed in trait Receiver and called pushBlock api method to pu
{% endhighlight %}
-* Plug-in the actor configuration into the spark streaming context and create a DStream.
+* Plug the custom receiver into the Spark Streaming context and create a DStream.
+
+{% highlight scala %}
+
+ val lines = ssc.networkStream[String](new SocketTextStreamReceiver(
+ "localhost", 8445))
+
+{% endhighlight %}
+
+* Or plug the actor in as a receiver into the Spark Streaming context and create a DStream.
{% highlight scala %}
@@ -99,3 +143,4 @@ _A more comprehensive example is provided in the spark streaming examples_
## References
1. [Akka Actor documentation](http://doc.akka.io/docs/akka/2.0.5/scala/actors.html)
+2. [NetworkReceiver](http://spark-project.org/docs/latest/api/streaming/index.html#spark.streaming.dstream.NetworkReceiver)
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 8cd1b0cd66..a74c17bdb7 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -301,6 +301,9 @@ dstream.checkpoint(checkpointInterval) // checkpointInterval must be a multiple
For DStreams that must be checkpointed (that is, DStreams created by `updateStateByKey` and `reduceByKeyAndWindow` with inverse function), the checkpoint interval of the DStream is by default set to a multiple of the DStream's sliding interval such that it is at least 10 seconds.
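As a concrete illustration (a sketch, assuming `stateDstream` is a DStream created by `updateStateByKey` and that 10 seconds is a multiple of its sliding interval), the interval can also be set explicitly:

{% highlight scala %}
// Sketch: explicitly set the checkpoint interval of a stateful DStream.
// `stateDstream` is a placeholder name, not an API-provided value.
stateDstream.checkpoint(Seconds(10))
{% endhighlight %}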
+## Customizing Receiver
+Spark comes with built-in support for the most common usage scenarios, where the input stream source can be anything from a network socket stream to one of several message queues. Beyond these, it is also possible to supply your own custom receiver via a convenient API; a quick sketch follows below. Find more details in the [Custom Receiver Guide](streaming-custom-receivers.html).
+
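As a quick taste (a sketch reusing the `SocketTextStreamReceiver` developed in that guide; host and port are placeholders), a custom receiver plugs in through `networkStream`:

{% highlight scala %}
// Sketch: creating a DStream from a custom receiver, per the guide above.
// SocketTextStreamReceiver is defined in the Custom Receiver Guide.
val lines = ssc.networkStream[String](
  new SocketTextStreamReceiver("localhost", 8445))
{% endhighlight %}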
# Performance Tuning
Getting the best performance of a Spark Streaming application on a cluster requires a bit of tuning. This section explains a number of the parameters and configurations that can be tuned to improve the performance of your application. At a high level, you need to consider two things:
<ol>