<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>
Documentation | Apache Spark
</title>
<!-- Bootstrap core CSS -->
<link href="/css/cerulean.min.css" rel="stylesheet">
<link href="/css/custom.css" rel="stylesheet">
<script type="text/javascript">
<!-- Google Analytics initialization -->
var _gaq = _gaq || [];
_gaq.push(['_setAccount', 'UA-32518208-2']);
_gaq.push(['_trackPageview']);
(function() {
var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
})();
<!-- Adds slight delay to links to allow async reporting -->
function trackOutboundLink(link, category, action) {
try {
_gaq.push(['_trackEvent', category , action]);
} catch(err){}
setTimeout(function() {
document.location.href = link.href;
}, 100);
}
</script>
<!-- HTML5 shim and Respond.js IE8 support of HTML5 elements and media queries -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/libs/html5shiv/3.7.0/html5shiv.js"></script>
<script src="https://oss.maxcdn.com/libs/respond.js/1.3.0/respond.min.js"></script>
<![endif]-->
</head>
<body>
<div class="container" style="max-width: 1200px;">
<div class="masthead">
<p class="lead">
<a href="/">
<img src="/images/spark-logo.png"
style="height:100px; width:auto; vertical-align: bottom; margin-top: 20px;"></a><span class="tagline">
Lightning-fast cluster computing
</span>
</p>
</div>
<nav class="navbar navbar-default" role="navigation">
<!-- Brand and toggle get grouped for better mobile display -->
<div class="navbar-header">
<button type="button" class="navbar-toggle" data-toggle="collapse"
data-target="#navbar-collapse-1">
<span class="sr-only">Toggle navigation</span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
<span class="icon-bar"></span>
</button>
</div>
<!-- Collect the nav links, forms, and other content for toggling -->
<div class="collapse navbar-collapse" id="navbar-collapse-1">
<ul class="nav navbar-nav">
<li><a href="/downloads.html">Download</a></li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Related Projects <b class="caret"></b>
</a>
<ul class="dropdown-menu">
<li><a href="http://shark.cs.berkeley.edu">Shark (SQL)</a></li>
<li><a href="/streaming/">Spark Streaming</a></li>
<li><a href="/mllib/">MLlib (machine learning)</a></li>
<li><a href="http://amplab.github.io/graphx/">GraphX (graph)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Documentation <b class="caret"></b>
</a>
<ul class="dropdown-menu">
<li><a href="/documentation.html">Overview</a></li>
<li><a href="/docs/latest/">Latest Release</a></li>
<li><a href="/examples.html">Examples</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">
Community <b class="caret"></b>
</a>
<ul class="dropdown-menu">
<li><a href="/community.html">Mailing Lists</a></li>
<li><a href="/community.html#events">Events and Meetups</a></li>
<li><a href="/community.html#history">Project History</a></li>
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark">Powered By</a></li>
</ul>
</li>
<li><a href="/faq.html">FAQ</a></li>
</ul>
</div>
<!-- /.navbar-collapse -->
</nav>
<div class="row">
<div class="col-md-3 col-md-push-9">
<div class="news" style="margin-bottom: 20px;">
<h5>Latest News</h5>
<ul class="list-unstyled">
<li><a href="/news/spark-1-0-0-released.html">Spark 1.0.0 released</a>
<span class="small">(May 30, 2014)</span></li>
<li><a href="/news/spark-summit-agenda-posted.html">Spark Summit agenda posted</a>
<span class="small">(May 11, 2014)</span></li>
<li><a href="/news/spark-0-9-1-released.html">Spark 0.9.1 released</a>
<span class="small">(Apr 09, 2014)</span></li>
<li><a href="/news/submit-talks-to-spark-summit-2014.html">Submissions and registration open for Spark Summit 2014</a>
<span class="small">(Mar 20, 2014)</span></li>
</ul>
<p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p>
</div>
<div class="hidden-xs hidden-sm">
<a href="/downloads.html" class="btn btn-success btn-lg btn-block" style="margin-bottom: 30px;">
Download Spark
</a>
<p style="font-size: 16px; font-weight: 500; color: #555;">
Related Projects:
</p>
<ul class="list-narrow">
<li><a href="http://shark.cs.berkeley.edu">Shark (SQL)</a></li>
<li><a href="/streaming/">Spark Streaming</a></li>
<li><a href="/mllib/">MLlib (machine learning)</a></li>
<li><a href="http://amplab.github.io/graphx/">GraphX (graph)</a></li>
</ul>
</div>
</div>
<div class="col-md-9 col-md-pull-3">
<h2>Spark Documentation</h2>
<p>Setup instructions, programming guides, and other documentation are available for each version of Spark below:</p>
<ul>
<li><a href="/docs/latest/">Spark 1.0.0 (latest release)</a></li>
<li><a href="/docs/0.9.1/">Spark 0.9.1</a></li>
<li><a href="/docs/0.8.1/">Spark 0.8.1</a></li>
<li><a href="/docs/0.7.3/">Spark 0.7.3</a></li>
<li><a href="/docs/0.6.2/">Spark 0.6.2</a></li>
</ul>
<p>The documentation linked to above covers getting started with Spark, as well the built-in components <a href="/docs/latest/mllib-guide.html">MLlib</a>,
<a href="/docs/latest/streaming-programming-guide.html">Spark Streaming</a>, and <a href="/docs/latest/graphx-guide.html">GraphX</a>.</p>
<p>In addition, this page lists other resources for learning Spark.</p>
<h3>Videos</h3>
<p>See the <a href="http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w">Apache Spark YouTube Channel</a> for videos from Spark events. There are separate <a href="http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w/playlists">playlists</a> for videos of different topics. Besides browsing through playlists, you can also find direct links to videos below.</p>
<h4>Screencast Tutorial Videos</h4>
<ul>
<li><a href="/screencasts/1-first-steps-with-spark.html">Screencast 1: First Steps with Spark</a></li>
<li><a href="/screencasts/2-spark-documentation-overview.html">Screencast 2: Spark Documentation Overview</a></li>
<li><a href="/screencasts/3-transformations-and-caching.html">Screencast 3: Transformations and Caching</a></li>
<li><a href="/screencasts/4-a-standalone-job-in-spark.html">Screencast 4: A Spark Standalone Job in Scala</a></li>
</ul>
<h4>Spark Summit Videos</h4>
<ul>
<li>Videos from Spark Summit 2013, San Francisco, Dec 2-3 2013
<ul>
<li>See the <a href="http://spark-summit.org/2013#agendapluginwidget-4">event agenda</a> for the full agenda with inline links to all videos and slides</li>
<li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwjXj33QvAXN0Vlx0gc6u0je">YouTube playist of all Keynotes</a></li>
<li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwiNcKwIkDEQZBejiqxEJ79U">YouTube playist of Track A (Spark Applications)</a></li>
<li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwiNcKwIkDEQZBejiqxEJ79U">YouTube playist of Track B (Spark Deployment, Scheduling & Perf, Related projects)</a></li>
<li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwjR1Umntxz52zv3EcKpbzCp">YouTube playist of the Training Day (i.e. the 2nd day of the summit)</a></li>
</ul>
</li>
</ul>
<h4><a name="meetup-videos"></a>Meetup Talk Videos</h4>
<p>In addition to the videos listed below, you can also view <a href="http://www.meetup.com/spark-users/files/">all slides from Bay Area meetups here</a>.
<style type="text/css">
.video-meta-info {
font-size: 0.95em;
}
</style></p>
<ul>
<li><a href="http://www.youtube.com/watch?v=NUQ-8to2XAk&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Spark 1.0 and Beyond</a> (<a href="http://files.meetup.com/3138542/Spark%201.0%20Meetup.ppt">slides</a>) <span class="video-meta-info">by Patrick Wendell, at Cisco in San Jose, 2014-04-23</span></li>
<li><a href="http://www.youtube.com/watch?v=ju2OQEXqONU&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Adding Native SQL Support to Spark with Catalyst</a> (<a href="http://files.meetup.com/3138542/Spark%20SQL%20Meetup%20-%204-8-2012.pdf">slides</a>) <span class="video-meta-info">by Michael Armbrust, at Tagged in SF, 2014-04-08</span></li>
<li><a href="http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">SparkR and GraphX</a> (slides: <a href="http://files.meetup.com/3138542/SparkR-meetup.pdf">SparkR</a>, <a href="http://files.meetup.com/3138542/graphx%40spark_meetup03_2014.pdf">GraphX</a>) <span class="video-meta-info">by Shivaram Venkataraman & Dan Crankshaw, at SkyDeck in Berkeley, 2014-03-25</span></li>
<li><a href="http://www.youtube.com/watch?v=5niXiiEX5pE&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Simple deployment w/ SIMR & Advanced Shark Analytics w/ TGFs</a> (<a href="http://files.meetup.com/3138542/tgf.pptx">slides</a>) <span class="video-meta-info">by Ali Ghodsi, at Huawei in Santa Clara, 2014-02-05</span></li>
<li><a href="http://www.youtube.com/watch?v=C7gWtxelYNM&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Stores, Monoids & Dependency Injection - Abstractions for Spark</a> (<a href="http://files.meetup.com/3138542/Abstractions%20for%20spark%20streaming%20-%20spark%20meetup%20presentation.pdf">slides</a>) <span class="video-meta-info">by Ryan Weald, at Sharethrough in SF, 2014-01-17</span></li>
<li><a href="https://www.youtube.com/watch?v=IxDnF_X4M-8">Distributed Machine Learning using MLbase</a> (<a href="http://files.meetup.com/3138542/sparkmeetup_8_6_13_final_reduced.pdf">slides</a>) <span class="video-meta-info">by Evan Sparks & Ameet Talwalkar, at Twitter in SF, 2013-08-06</span></li>
<li><a href="https://www.youtube.com/watch?v=vJQ2RZj9hqs">GraphX Preview: Graph Analysis on Spark</a> <span class="video-meta-info">by Reynold Xin & Joseph Gonzalez, at Flurry in SF, 2013-07-02</span></li>
<li><a href="http://www.youtube.com/watch?v=D1knCQZQQnw">Deep Dive with Spark Streaming</a> (<a href="http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617">slides</a>) <span class="video-meta-info">by Tathagata Das, at Plug and Play in Sunnyvale, 2013-06-17</span></li>
<li><a href="https://www.youtube.com/watch?v=cAZ624-69PQ">Tachyon and Shark update</a> (slides: <a href="http://files.meetup.com/3138542/2013-05-09%20Shark%20%40%20Spark%20Meetup.pdf">Shark</a>, <a href="http://files.meetup.com/3138542/Tachyon_2013-05-09_Spark_Meetup.pdf">Tachyon</a>) <span class="video-meta-info">by Ali Ghodsi, Haoyuan Li, Reynold Xin, Google Ventures, 2013-05-09</span></li>
<li><a href="https://www.youtube.com/playlist?list=PLxwbieuTaYXmWTBovyyw2NibPfUaJk-h4">Spark 0.7: Overview, pySpark, & Streaming</a> <span class="video-meta-info">by Matei Zaharia, Josh Rosen, Tathagata Das, at Conviva on 2013-02-21</span></li>
<li><a href="https://www.youtube.com/watch?v=49Hr5xZyTEA">Introduction to Spark Internals</a> (<a href="http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx">slides</a>) <span class="video-meta-info">by Matei Zaharia, at Yahoo in Sunnyvale, 2012-12-18</span></li>
</ul>
<h3>Hands-On Exercises</h3>
<ul>
<li><a href="http://spark-summit.org/2013/exercises/">Hands-on exercises</a> are available online from Spark Summit 2013. These exercises let you launch a small EC2 cluster, load a dataset, and query it with Spark, Shark, Spark Streaming, and MLlib.</li>
</ul>
<p><a name="summit"></a></p>
<h3>Training Materials</h3>
<ul>
<li>The 2nd day of <a href="http://spark-summit.org/2013">Spark Summit 2013</a> was a training session, and you can find the slides and videos from that inline in <a href="http://spark-summit.org/summit-2013/#agendapluginwidget-5">the training day agenda</a>.
The session also included <a href="http://spark-summit.org/2013/exercises/">exercises</a> that you can walk through yourself, which will guide you through launching a Spark cluster on EC2 and using various Spark components to analyze real data.</li>
<li>The <a href="https://amplab.cs.berkeley.edu/">UC Berkeley AMPLab</a> regularly hosts training camps on Spark and related projects.
Slides, videos and EC2-based exercises from each of these are available online:
<ul>
<li><a href="http://ampcamp.berkeley.edu/4/">AMP Camp 4</a> (Strata Santa Clara, Feb 2014) — focus on BlinkDB, MLlib, GraphX, Tachyon</li>
<li><a href="http://ampcamp.berkeley.edu/3/">AMP Camp 3</a> (Berkeley, CA, Aug 2013)</li>
<li><a href="http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/">AMP Camp 2</a> (Strata Santa Clara, Feb 2013)</li>
<li><a href="http://ampcamp.berkeley.edu/agenda-2012/">AMP Camp 1</a> (Berkeley, CA, Aug 2012)</li>
</ul>
</li>
</ul>
<h3>External Tutorials, Blog Posts, and Talks</h3>
<ul>
<li><a href="http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark">Using Parquet and Scrooge with Spark</a> — Scala-friendly Parquet and Avro usage tutorial from Ooyala's Evan Chan</li>
<li><a href="http://codeforhire.com/2014/02/18/using-spark-with-mongodb/">Using Spark with MongoDB</a> — by Sampo Niskanen from Wellmo</li>
<li><a href="http://spark-summit.org/2013">Spark Summit 2013</a> — contained 30 talks about Spark use cases, available as slides and videos</li>
<li><a href="http://www.pwendell.com/2013/09/28/declarative-streams.html">Sampling Twitter Using Declarative Streams</a> — Spark Streaming tutorial by Patrick Wendell</li>
<li><a href="http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/">A Powerful Big Data Trio: Spark, Parquet and Avro</a> — Using Parquet in Spark by Matt Massie</li>
<li><a href="http://www.slideshare.net/EvanChan2/cassandra2013-spark-talk-final">Real-time Analytics with Cassandra, Spark, and Shark</a> — Presentation by Evan Chan from Ooyala at 2013 Cassandra Summit</li>
<li><a href="http://syndeticlogic.net/?p=311">Getting Spark Setup in Eclipse</a> — Developer blog post by James Percent</li>
<li><a href="http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923">Run Spark and Shark on Amazon Elastic MapReduce</a> — Article by Amazon Elastic MapReduce team member Parviz Deyhim</li>
<li><a href="http://blog.quantifind.com/posts/spark-unit-test/">Unit testing with Spark</a> — Quantifind tech blog post by Imran Rashid</li>
<li><a href="http://blog.quantifind.com/posts/logging-post/">Configuring Spark logs</a> — Quantifind tech blog by Imran Rashid</li>
<li><a href="http://www.ibm.com/developerworks/library/os-spark/">Spark, an alternative for fast data analytics</a> — IBM Developer Works article by M. Tim Jones</li>
</ul>
<h3>Books</h3>
<ul>
<li><a href="http://www.packtpub.com/fast-data-processing-with-spark/book">Fast Data Processing with Spark</a>, by Holden Karau (Packt Publishing)</li>
</ul>
<h3>Examples</h3>
<ul>
<li>The <a href="/examples.html">Spark examples page</a> shows the basic API in Scala, Java and Python.</li>
</ul>
<h3>Wiki</h3>
<ul><li>
The <a href="https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage">Spark wiki</a> contains
information for developers, such as architecture documents and how to <a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">contribute</a> to Spark.
</li></ul>
<h3>Research Papers</h3>
<p>
Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
The <a href="/research.html">research page</a> lists some of the original motivation and direction.
The following papers have been published about Spark and related projects.
</p>
<ul>
<li>
<a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf">Shark: SQL and Rich Analytics at Scale</a>. Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>Technical Report UCB/EECS-2012-214</em>. November 2012.
</li>
<li>
<a href="http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf">Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters</a>. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. <em>HotCloud 2012</em>. June 2012.
</li>
<li>
<a href="http://www.cs.berkeley.edu/~matei/papers/2012/sigmod_shark_demo.pdf">Shark: Fast Data Analysis Using Coarse-grained Distributed Memory</a> (demo). Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. <em>SIGMOD 2012</em>. May 2012. <b>Best Demo Award</b>.
</li>
<li>
<a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a>. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>NSDI 2012</em>. April 2012. <b>Best Paper Award</b> and <b>Honorable Mention for Community Award</b>.
</li>
<li>
<a href="http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a>. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>Technical Report UCB/EECS-2011-82</em>. July 2011.</li>
<li>
<a href="http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf">Spark: Cluster Computing with Working Sets</a>. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>HotCloud 2010</em>. June 2010.
</li>
</ul>
</div>
</div>
<footer class="small">
<hr>
Apache Spark, Spark, Apache, and the Spark logo are trademarks of
<a href="http://www.apache.org">The Apache Software Foundation</a>.
</footer>
</div>
<script src="https://code.jquery.com/jquery.js"></script>
<script src="//netdna.bootstrapcdn.com/bootstrap/3.0.3/js/bootstrap.min.js"></script>
<script src="/js/lang-tabs.js"></script>
</body>
</html>