<!DOCTYPE html>
<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
<title>Building Spark - Spark 2.0.0 Documentation</title>
<link rel="stylesheet" href="css/bootstrap.min.css">
<style>
body {
padding-top: 60px;
padding-bottom: 40px;
}
</style>
<meta name="viewport" content="width=device-width">
<link rel="stylesheet" href="css/bootstrap-responsive.min.css">
<link rel="stylesheet" href="css/main.css">
<script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
<link rel="stylesheet" href="css/pygments-default.css">
</head>
<body>
<!--[if lt IE 7]>
<p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
<![endif]-->
<!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
<div class="navbar navbar-fixed-top" id="topbar">
<div class="navbar-inner">
<div class="container">
<div class="brand"><a href="index.html">
<img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">2.0.0</span>
</div>
<ul class="nav">
<!--TODO(andyk): Add class="active" attribute to li some how.-->
<li><a href="index.html">Overview</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
<li><a href="programming-guide.html">Spark Programming Guide</a></li>
<li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
<li><a href="sql-programming-guide.html">DataFrames, Datasets and SQL</a></li>
<li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
<li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
<li><a href="sparkr.html">SparkR (R on Spark)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
<li><a href="api/java/index.html">Java</a></li>
<li><a href="api/python/index.html">Python</a></li>
<li><a href="api/R/index.html">R</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="cluster-overview.html">Overview</a></li>
<li><a href="submitting-applications.html">Submitting Applications</a></li>
<li class="divider"></li>
<li><a href="spark-standalone.html">Spark Standalone</a></li>
<li><a href="running-on-mesos.html">Mesos</a></li>
<li><a href="running-on-yarn.html">YARN</a></li>
</ul>
</li>
<li class="dropdown">
<a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="configuration.html">Configuration</a></li>
<li><a href="monitoring.html">Monitoring</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
<li><a href="job-scheduling.html">Job Scheduling</a></li>
<li><a href="security.html">Security</a></li>
<li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
<li class="divider"></li>
<li><a href="building-spark.html">Building Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects">Supplemental Projects</a></li>
</ul>
</li>
</ul>
<!--<p class="navbar-text pull-right"><span class="version-text">v2.0.0</span></p>-->
</div>
</div>
</div>
<div class="container-wrapper">
<div class="content" id="content">
<h1 class="title">Building Spark</h1>
<ul id="markdown-toc">
<li><a href="#building-apache-spark" id="markdown-toc-building-apache-spark">Building Apache Spark</a> <ul>
<li><a href="#apache-maven" id="markdown-toc-apache-maven">Apache Maven</a> <ul>
<li><a href="#setting-up-mavens-memory-usage" id="markdown-toc-setting-up-mavens-memory-usage">Setting up Maven’s Memory Usage</a></li>
<li><a href="#buildmvn" id="markdown-toc-buildmvn">build/mvn</a></li>
</ul>
</li>
<li><a href="#building-a-runnable-distribution" id="markdown-toc-building-a-runnable-distribution">Building a Runnable Distribution</a></li>
<li><a href="#specifying-the-hadoop-version" id="markdown-toc-specifying-the-hadoop-version">Specifying the Hadoop Version</a></li>
<li><a href="#building-with-hive-and-jdbc-support" id="markdown-toc-building-with-hive-and-jdbc-support">Building With Hive and JDBC Support</a></li>
<li><a href="#packaging-without-hadoop-dependencies-for-yarn" id="markdown-toc-packaging-without-hadoop-dependencies-for-yarn">Packaging without Hadoop Dependencies for YARN</a></li>
<li><a href="#building-for-scala-210" id="markdown-toc-building-for-scala-210">Building for Scala 2.10</a></li>
<li><a href="#building-submodules-individually" id="markdown-toc-building-submodules-individually">Building submodules individually</a></li>
<li><a href="#continuous-compilation" id="markdown-toc-continuous-compilation">Continuous Compilation</a></li>
<li><a href="#speeding-up-compilation-with-zinc" id="markdown-toc-speeding-up-compilation-with-zinc">Speeding up Compilation with Zinc</a></li>
<li><a href="#building-with-sbt" id="markdown-toc-building-with-sbt">Building with SBT</a></li>
<li><a href="#intellij-idea-or-eclipse" id="markdown-toc-intellij-idea-or-eclipse">IntelliJ IDEA or Eclipse</a></li>
</ul>
</li>
<li><a href="#running-tests" id="markdown-toc-running-tests">Running Tests</a> <ul>
<li><a href="#testing-with-sbt" id="markdown-toc-testing-with-sbt">Testing with SBT</a></li>
<li><a href="#running-java-8-test-suites" id="markdown-toc-running-java-8-test-suites">Running Java 8 Test Suites</a></li>
<li><a href="#pyspark-tests-with-maven" id="markdown-toc-pyspark-tests-with-maven">PySpark Tests with Maven</a></li>
<li><a href="#running-r-tests" id="markdown-toc-running-r-tests">Running R Tests</a></li>
<li><a href="#running-docker-based-integration-test-suites" id="markdown-toc-running-docker-based-integration-test-suites">Running Docker-based Integration Test Suites</a></li>
</ul>
</li>
</ul>
<h1 id="building-apache-spark">Building Apache Spark</h1>
<h2 id="apache-maven">Apache Maven</h2>
<p>The Maven-based build is the build of reference for Apache Spark.
Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.</p>
<h3 id="setting-up-mavens-memory-usage">Setting up Maven’s Memory Usage</h3>
<p>You’ll need to configure Maven to use more memory than usual by setting <code>MAVEN_OPTS</code>. We recommend the following settings:</p>
<pre><code>export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
</code></pre>
<p>If you don’t run this, you may see errors like the following:</p>
<pre><code>[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.11/classes...
[ERROR] PermGen space -> [Help 1]
[INFO] Compiling 203 Scala sources and 9 Java sources to /Users/me/Development/spark/core/target/scala-2.11/classes...
[ERROR] Java heap space -> [Help 1]
</code></pre>
<p>You can fix this by setting the <code>MAVEN_OPTS</code> variable as discussed before.</p>
<p><strong>Note:</strong></p>
<ul>
<li>For Java 8 and above, this step is not required.</li>
<li>If using <code>build/mvn</code> with no <code>MAVEN_OPTS</code> set, the script will automate this for you.</li>
</ul>
<h3 id="buildmvn">build/mvn</h3>
<p>Spark now comes packaged with a self-contained Maven installation, located under the <code>build/</code> directory, to ease building and deployment of Spark from source. This script will automatically download and set up all necessary build requirements (<a href="https://maven.apache.org/">Maven</a>, <a href="http://www.scala-lang.org/">Scala</a>, and <a href="https://github.com/typesafehub/zinc">Zinc</a>) locally within the <code>build/</code> directory itself. It honors any <code>mvn</code> binary if one is already present; however, it will pull down its own copy of Scala and Zinc regardless, to ensure the proper version requirements are met. <code>build/mvn</code> acts as a pass-through to the <code>mvn</code> call, allowing an easy transition from previous build methods. As an example, one can build a version of Spark as follows:</p>
<pre><code>./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -DskipTests clean package
</code></pre>
<p>Other build examples can be found below.</p>
<h2 id="building-a-runnable-distribution">Building a Runnable Distribution</h2>
<p>To create a Spark distribution like those distributed by the
<a href="http://spark.apache.org/downloads.html">Spark Downloads</a> page, and that is laid out so as
to be runnable, use <code>./dev/make-distribution.sh</code> in the project root directory. It can be configured
with Maven profile settings and so on like the direct Maven build. Example:</p>
<pre><code>./dev/make-distribution.sh --name custom-spark --tgz -Psparkr -Phadoop-2.4 -Phive -Phive-thriftserver -Pyarn
</code></pre>
<p>For more information on usage, run <code>./dev/make-distribution.sh --help</code>.</p>
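<p>For example, the command above should produce both a <code>dist/</code> directory and, because of <code>--tgz</code>, a tarball in the project root whose name incorporates the Spark version and the <code>--name</code> value; the exact file name shown below is only illustrative:</p>
<pre><code># the unpacked distribution is written to dist/; the tarball name depends on the version and --name
ls dist/
ls spark-*-bin-custom-spark.tgz
</code></pre>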
<h2 id="specifying-the-hadoop-version">Specifying the Hadoop Version</h2>
<p>Because HDFS is not protocol-compatible across versions, if you want to read from HDFS, you’ll need to build Spark against the specific HDFS version in your environment. You can do this through the <code>hadoop.version</code> property. If unset, Spark will build against Hadoop 2.2.0 by default. Note that certain build profiles are required for particular Hadoop versions:</p>
<table class="table">
<thead>
<tr><th>Hadoop version</th><th>Profile required</th></tr>
</thead>
<tbody>
<tr><td>2.2.x</td><td>hadoop-2.2</td></tr>
<tr><td>2.3.x</td><td>hadoop-2.3</td></tr>
<tr><td>2.4.x</td><td>hadoop-2.4</td></tr>
<tr><td>2.6.x</td><td>hadoop-2.6</td></tr>
<tr><td>2.7.x and later 2.x</td><td>hadoop-2.7</td></tr>
</tbody>
</table>
<p>You can enable the <code>yarn</code> profile and optionally set the <code>yarn.version</code> property if it is different from <code>hadoop.version</code>. Spark only supports YARN versions 2.2.0 and later.</p>
<p>Examples:</p>
<pre><code># Apache Hadoop 2.2.X
./build/mvn -Pyarn -Phadoop-2.2 -DskipTests clean package
# Apache Hadoop 2.3.X
./build/mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -DskipTests clean package
# Apache Hadoop 2.4.X or 2.5.X
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=VERSION -DskipTests clean package
# Apache Hadoop 2.6.X
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -DskipTests clean package
# Apache Hadoop 2.7.X and later
./build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=VERSION -DskipTests clean package
# Different versions of HDFS and YARN.
./build/mvn -Pyarn -Phadoop-2.3 -Dhadoop.version=2.3.0 -Dyarn.version=2.2.0 -DskipTests clean package
</code></pre>
<h2 id="building-with-hive-and-jdbc-support">Building With Hive and JDBC Support</h2>
<p>To enable Hive integration for Spark SQL along with its JDBC server and CLI,
add the <code>-Phive</code> and <code>-Phive-thriftserver</code> profiles to your existing build options.
By default Spark will build with Hive 1.2.1 bindings.</p>
<pre><code># Apache Hadoop 2.4.X with Hive 1.2.1 support
./build/mvn -Pyarn -Phadoop-2.4 -Dhadoop.version=2.4.0 -Phive -Phive-thriftserver -DskipTests clean package
</code></pre>
<h2 id="packaging-without-hadoop-dependencies-for-yarn">Packaging without Hadoop Dependencies for YARN</h2>
<p>The assembly directory produced by <code>mvn package</code> will, by default, include all of Spark’s
dependencies, including Hadoop and some of its ecosystem projects. On YARN deployments, this
causes multiple versions of these to appear on executor classpaths: the version packaged in
the Spark assembly and the version on each node, included with <code>yarn.application.classpath</code>.
The <code>hadoop-provided</code> profile builds the assembly without including Hadoop-ecosystem projects,
like ZooKeeper and Hadoop itself.</p>
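<p>For example, to combine this with a YARN build (the Hadoop profile shown here is only illustrative):</p>
<pre><code># build an assembly that expects Hadoop and its ecosystem jars to be provided by the cluster
./build/mvn -Pyarn -Phadoop-2.7 -Phadoop-provided -DskipTests clean package
</code></pre>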
<h2 id="building-for-scala-210">Building for Scala 2.10</h2>
<p>To produce a Spark package compiled with Scala 2.10, use the <code>-Dscala-2.10</code> property:</p>
<pre><code>./dev/change-scala-version.sh 2.10
./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
</code></pre>
<h2 id="building-submodules-individually">Building submodules individually</h2>
<p>It’s possible to build Spark sub-modules using the <code>mvn -pl</code> option.</p>
<p>For instance, you can build the Spark Streaming module using:</p>
<pre><code>./build/mvn -pl :spark-streaming_2.11 clean install
</code></pre>
<p>where <code>spark-streaming_2.11</code> is the <code>artifactId</code> as defined in the <code>streaming/pom.xml</code> file.</p>
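<p>If the submodule’s dependencies are not yet installed in your local Maven repository, Maven’s standard <code>-am</code> (also-make) flag should build them in the same invocation; a sketch:</p>
<pre><code># build spark-streaming together with the Spark modules it depends on
./build/mvn -pl :spark-streaming_2.11 -am clean install
</code></pre>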
<h2 id="continuous-compilation">Continuous Compilation</h2>
<p>We use the scala-maven-plugin, which supports incremental and continuous compilation. E.g.</p>
<pre><code>./build/mvn scala:cc
</code></pre>
<p>should run continuous compilation (i.e. wait for changes). However, this has not been tested
extensively. A couple of gotchas to note:</p>
<ul>
<li>
<p>it only scans the paths <code>src/main</code> and <code>src/test</code> (see
<a href="http://scala-tools.org/mvnsites/maven-scala-plugin/usage_cc.html">docs</a>), so it will only work
from within certain submodules that have that structure.</p>
</li>
<li>
<p>you’ll typically need to run <code>mvn install</code> from the project root for compilation within
specific submodules to work; this is because submodules that depend on other submodules do so via
the <code>spark-parent</code> module.</p>
</li>
</ul>
<p>Thus, the full flow for running continuous-compilation of the <code>core</code> submodule may look more like:</p>
<pre><code>$ ./build/mvn install
$ cd core
$ ../build/mvn scala:cc
</code></pre>
<h2 id="speeding-up-compilation-with-zinc">Speeding up Compilation with Zinc</h2>
<p><a href="https://github.com/typesafehub/zinc">Zinc</a> is a long-running server version of SBT’s incremental
compiler. When run locally as a background process, it speeds up builds of Scala-based projects
like Spark. Developers who regularly recompile Spark with Maven will be the most interested in
Zinc. The project site gives instructions for building and running <code>zinc</code>; OS X users can
install it using <code>brew install zinc</code>.</p>
<p>If using <code>build/mvn</code>, <code>zinc</code> will automatically be downloaded and leveraged for all
builds. This process will auto-start after the first time <code>build/mvn</code> is called and bind to port
3030 unless the <code>ZINC_PORT</code> environment variable is set. The <code>zinc</code> process can subsequently be
shut down at any time by running <code>build/zinc-&lt;version&gt;/bin/zinc -shutdown</code> and will automatically
restart whenever <code>build/mvn</code> is called.</p>
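<p>For example, to run the <code>zinc</code> server on a non-default port and shut it down when finished (the port number below is arbitrary):</p>
<pre><code># have build/mvn start zinc on port 3031 instead of the default 3030
export ZINC_PORT=3031
./build/mvn -DskipTests clean package
# stop the zinc background process once you are done building
./build/zinc-&lt;version&gt;/bin/zinc -shutdown
</code></pre>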
<h2 id="building-with-sbt">Building with SBT</h2>
<p>Maven is the official build tool recommended for packaging Spark, and is the <em>build of reference</em>.
But SBT is supported for day-to-day development since it can provide much faster iterative
compilation. More advanced developers may wish to use SBT.</p>
<p>The SBT build is derived from the Maven POM files, and so the same Maven profiles and variables
can be set to control the SBT build. For example:</p>
<pre><code>./build/sbt -Pyarn -Phadoop-2.3 package
</code></pre>
<p>To avoid the overhead of launching sbt each time you need to re-compile, you can launch sbt
in interactive mode by running <code>build/sbt</code>, and then run all build commands at the command
prompt. For more recommendations on reducing build time, refer to the
<a href="https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-ReducingBuildTimes">wiki page</a>.</p>
<h2 id="encrypted-filesystems">Encrypted Filesystems</h2>
<p>When building on an encrypted filesystem (if your home directory is encrypted, for example), then the Spark build might fail with a “Filename too long” error. As a workaround, add the following in the configuration args of the <code>scala-maven-plugin</code> in the project <code>pom.xml</code>:</p>
<pre><code>&lt;arg&gt;-Xmax-classfile-name&lt;/arg&gt;
&lt;arg&gt;128&lt;/arg&gt;
</code></pre>
<p>and in <code>project/SparkBuild.scala</code> add:</p>
<pre><code>scalacOptions in Compile ++= Seq("-Xmax-classfile-name", "128"),
</code></pre>
<p>to the <code>sharedSettings</code> val. See also <a href="https://github.com/apache/spark/pull/2883/files">this PR</a> if you are unsure of where to add these lines.</p>
<h2 id="intellij-idea-or-eclipse">IntelliJ IDEA or Eclipse</h2>
<p>For help in setting up IntelliJ IDEA or Eclipse for Spark development, and troubleshooting, refer to the
<a href="https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools#UsefulDeveloperTools-IDESetup">wiki page for IDE setup</a>.</p>
<h1 id="running-tests">Running Tests</h1>
<p>Tests are run by default via the <a href="http://www.scalatest.org/user_guide/using_the_scalatest_maven_plugin">ScalaTest Maven plugin</a>.</p>
<p>Some of the tests require Spark to be packaged first, so always run <code>mvn package</code> with <code>-DskipTests</code> the first time. The following is an example of a correct (build, test) sequence:</p>
<pre><code>./build/mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive -Phive-thriftserver clean package
./build/mvn -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
</code></pre>
<p>The ScalaTest plugin also supports running only a specific Scala test suite as follows:</p>
<pre><code>./build/mvn -P... -Dtest=none -DwildcardSuites=org.apache.spark.repl.ReplSuite test
./build/mvn -P... -Dtest=none -DwildcardSuites=org.apache.spark.repl.* test
</code></pre>
<p>or a Java test:</p>
<pre><code>./build/mvn test -P... -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite
</code></pre>
<h2 id="testing-with-sbt">Testing with SBT</h2>
<p>Some of the tests require Spark to be packaged first, so always run <code>build/sbt package</code> the first time. The following is an example of a correct (build, test) sequence:</p>
<pre><code>./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver package
./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver test
</code></pre>
<p>To run only a specific test suite:</p>
<pre><code>./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.ReplSuite"
./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver "test-only org.apache.spark.repl.*"
</code></pre>
<p>To run the test suites of a specific sub-project:</p>
<pre><code>./build/sbt -Pyarn -Phadoop-2.3 -Phive -Phive-thriftserver core/test
</code></pre>
<h2 id="running-java-8-test-suites">Running Java 8 Test Suites</h2>
<p>To run only the Java 8 tests and nothing else:</p>
<pre><code>./build/mvn install -DskipTests
./build/mvn -pl :java8-tests_2.11 test
</code></pre>
<p>or</p>
<pre><code>./build/sbt java8-tests/test
</code></pre>
<p>Java 8 tests are automatically enabled when a Java 8 JDK is detected.
If you have JDK 8 installed but it is not the system default, you can set <code>JAVA_HOME</code> to point to JDK 8 before running the tests.</p>
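<p>For example (the JDK path below is only a placeholder for wherever JDK 8 is installed on your machine):</p>
<pre><code># point the build at a JDK 8 installation that is not the system default
export JAVA_HOME=/path/to/jdk1.8.0
./build/mvn install -DskipTests
./build/mvn -pl :java8-tests_2.11 test
</code></pre>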
<h2 id="pyspark-tests-with-maven">PySpark Tests with Maven</h2>
<p>If you are building PySpark and wish to run the PySpark tests, you will need to build Spark with Hive support.</p>
<pre><code>./build/mvn -DskipTests clean package -Phive
./python/run-tests
</code></pre>
<p>The run-tests script can also be limited to a specific Python version or a specific module:</p>
<pre><code>./python/run-tests --python-executables=python --modules=pyspark-sql
</code></pre>
<p><strong>Note:</strong> You can also run Python tests with an sbt build, provided you build Spark with Hive support.</p>
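<p>A sketch of the equivalent sbt-based sequence, assuming the same Hive requirement:</p>
<pre><code># build Spark with Hive support using sbt, then run the Python tests
./build/sbt -Phive package
./python/run-tests
</code></pre>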
<h2 id="running-r-tests">Running R Tests</h2>
<p>To run the SparkR tests you will need to install the R package <code>testthat</code>
(run <code>install.packages("testthat")</code> from an R shell). You can run just the SparkR tests using
the command:</p>
<pre><code>./R/run-tests.sh
</code></pre>
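<p>If you prefer to install <code>testthat</code> non-interactively before running the script, something like the following should work (the CRAN mirror URL is only an example):</p>
<pre><code># install the testthat package from the command line, then run the SparkR tests
Rscript -e 'install.packages("testthat", repos="http://cran.r-project.org")'
./R/run-tests.sh
</code></pre>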
<h2 id="running-docker-based-integration-test-suites">Running Docker-based Integration Test Suites</h2>
<p>In order to run the Docker integration tests, you have to install the <code>docker</code> engine on your box.
The instructions for installation can be found at <a href="https://docs.docker.com/engine/installation/">the Docker site</a>.
Once installed, the <code>docker</code> service needs to be started, if not already running.
On Linux, this can be done by running <code>sudo service docker start</code>.</p>
<pre><code>./build/mvn install -DskipTests
./build/mvn -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.11
</code></pre>
<p>or</p>
<pre><code>./build/sbt docker-integration-tests/test
</code></pre>
</div>
<!-- /container -->
</div>
<script src="js/vendor/jquery-1.8.0.min.js"></script>
<script src="js/vendor/bootstrap.min.js"></script>
<script src="js/vendor/anchor.min.js"></script>
<script src="js/main.js"></script>
<!-- MathJax Section -->
<script type="text/x-mathjax-config">
MathJax.Hub.Config({
TeX: { equationNumbers: { autoNumber: "AMS" } }
});
</script>
<script>
// Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS.
// We could use "//cdn.mathjax...", but that won't support "file://".
(function(d, script) {
script = d.createElement('script');
script.type = 'text/javascript';
script.async = true;
script.onload = function(){
MathJax.Hub.Config({
tex2jax: {
inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
processEscapes: true,
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
});
};
script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
d.getElementsByTagName('head')[0].appendChild(script);
}(document));
</script>
</body>
</html>