documentation.md


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143

---
layout: global
title: Documentation
type: "page singular"
navigation:
  weight: 3
  show: true
---

<h2>Spark Documentation</h2>

<p>Setup instructions, programming guides, and other documentation are available for each version of Spark below:</p>

<ul>
  <li><a href="{{site.url}}docs/latest/">Spark 0.9.1 (latest release)</a></li>
  <li><a href="{{site.url}}docs/0.8.1/">Spark 0.8.1</a></li>
  <li><a href="{{site.url}}docs/0.7.3/">Spark 0.7.3</a></li>
  <li><a href="{{site.url}}docs/0.6.2/">Spark 0.6.2</a></li>
</ul>

<p>The documentation linked to above covers getting started with Spark, as well the built-in components <a href="{{site.url}}docs/latest/mllib-guide.html">MLlib</a>,
<a href="{{site.url}}docs/latest/streaming-programming-guide.html">Spark Streaming</a>, and <a href="{{site.url}}docs/latest/graphx-guide.html">GraphX</a>.</p>

<p>In addition, this page lists other resources for learning Spark.</p>

<h3>Videos</h3>
See the <a href="http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w">Apache Spark YouTube Channel</a> for videos from Spark events. There are separate <a href="http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w/playlists">playlists</a> for videos of different topics. Besides browsing through playlists, you can also find direct links to videos below.

<h4>Screencast Tutorial Videos</h4>
<ul>
  <li><a href="{{site.url}}screencasts/1-first-steps-with-spark.html">Screencast 1: First Steps with Spark</a></li>
  <li><a href="{{site.url}}screencasts/2-spark-documentation-overview.html">Screencast 2: Spark Documentation Overview</a></li>
<li><a href="{{site.url}}screencasts/3-transformations-and-caching.html">Screencast 3: Transformations and Caching</a></li>
<li><a href="{{site.url}}screencasts/4-a-standalone-job-in-spark.html">Screencast 4: A Spark Standalone Job in Scala</a></li>

</ul>

<h4>Spark Summit Videos</h4>
<ul>
  <li>Videos from Spark Summit 2013, San Francisco, Dec 2-3 2013
    <ul>
      <li>See the <a href="http://spark-summit.org/2013#agendapluginwidget-4">event agenda</a> for the full agenda with inline links to all videos and slides</li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwjXj33QvAXN0Vlx0gc6u0je">YouTube playist of all Keynotes</a></li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwiNcKwIkDEQZBejiqxEJ79U">YouTube playist of Track A (Spark Applications)</a></li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwiNcKwIkDEQZBejiqxEJ79U">YouTube playist of Track B (Spark Deployment, Scheduling & Perf, Related projects)</a></li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwjR1Umntxz52zv3EcKpbzCp">YouTube playist of the Training Day (i.e. the 2nd day of the summit)</a></li>
    </ul>
  </li>
</ul>

<h4>Meetup Talk Videos</h4>
<ul>
  <li><a href="http://www.youtube.com/watch?v=ju2OQEXqONU&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Adding Native SQL Support to Spark with Catalyst</a> (<a href="http://files.meetup.com/3138542/Spark%20SQL%20Meetup%20-%204-8-2012.pdf">slides</a>) by Michael Armbrust, at Tagged in San Francisco, 2014-04-08</li>
  <li><a href="http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">SparkR and GraphX</a> (<a href="http://files.meetup.com/3138542/SparkR-meetup.pdf">SparkR slides</a>, <a href="http://files.meetup.com/3138542/graphx%40spark_meetup03_2014.pdf">GraphX slides</a>) by Shivaram Venkataraman &amp; Dan Crankshaw, at SkyDeck in Berkeley, 2014-03-25</li>
  <li><a href="http://www.youtube.com/watch?v=5niXiiEX5pE&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Simple deployment with SIMR and Advanced Shark Analytics with TGFs</a> (<a href="http://files.meetup.com/3138542/tgf.pptx">slides</a>) by Ali Ghodsi, at Huawei in Santa Clara, 2014-02-05</li>
  <li><a href="http://www.youtube.com/watch?v=C7gWtxelYNM&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Stores, Monoids and Dependency Injection - Abstractions for Spark</a> (<a href="http://files.meetup.com/3138542/Abstractions%20for%20spark%20streaming%20-%20spark%20meetup%20presentation.pdf">slides</a>) by Ryan Weald, at Sharethrough in San Francisco, 2014-01-17</li>
</ul>


<h3>Hands-On Exercises</h3>

<ul>
  <li><a href="http://spark-summit.org/2013/exercises/">Hands-on exercises</a> are available online from Spark Summit 2013. These exercises let you launch a small EC2 cluster, load a dataset, and query it with Spark, Shark, Spark Streaming, and MLLib.</li>
</ul>

<a name="summit"></a>
<h3>Training Materials</h3>
<ul>
  <li>The 2nd day of <a href="http://spark-summit.org/2013">Spark Summit 2013</a> was a training session, and you can find the slides and videos from that inline in <a href="http://spark-summit.org/summit-2013/#agendapluginwidget-5">the training day agenda</a>.
    The session also included <a href="http://spark-summit.org/2013/exercises/">exercises</a> that you can walk through yourself, which will guide you through launching a Spark cluster on EC2 and using various Spark components to analyze real data.</li>
  <li>The <a href="https://amplab.cs.berkeley.edu/">UC Berkeley AMPLab</a> regularly hosts training camps on Spark and related projects.
Slides, videos and EC2-based exercises from each of these are available online:
<ul>
    <li><a href="http://ampcamp.berkeley.edu/4/">AMP Camp 4</a> (Strata Santa Clara, Feb 2014) &mdash; focus on BlinkDB, MLlib, GraphX, Tachyon</li>
    <li><a href="http://ampcamp.berkeley.edu/3/">AMP Camp 3</a> (Berkeley, CA, Aug 2013)</li>
    <li><a href="http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/">AMP Camp 2</a> (Strata Santa Clara, Feb 2013)</li>
    <li><a href="http://ampcamp.berkeley.edu/agenda-2012/">AMP Camp 1</a> (Berkeley, CA, Aug 2012)</li>
    </ul>
  </li>
</ul>

<h3>External Tutorials, Blog Posts, and Talks</h3>

<ul>
  <li><a href="http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark">Using Parquet and Scrooge with Spark</a> &mdash; Scala-friendly Parquet and Avro usage tutorial from Ooyala's Evan Chan</li>
  <li><a href="http://codeforhire.com/2014/02/18/using-spark-with-mongodb/">Using Spark with MongoDB</a> &mdash; by Sampo Niskanen from Wellmo</li>
  <li><a href="http://spark-summit.org/2013">Spark Summit 2013</a> &mdash; contained 30 talks about Spark use cases, available as slides and videos</li>
  <li><a href="http://www.pwendell.com/2013/09/28/declarative-streams.html">Sampling Twitter Using Declarative Streams</a> &mdash; Spark Streaming tutorial by Patrick Wendell</li>
  <li><a href="http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/">A Powerful Big Data Trio: Spark, Parquet and Avro</a> &mdash; Using Parquet in Spark by Matt Massie</li>
  <li><a href="http://www.slideshare.net/EvanChan2/cassandra2013-spark-talk-final">Real-time Analytics with Cassandra, Spark, and Shark</a> &mdash; Presentation by Evan Chan from Ooyala at 2013 Cassandra Summit</li>
  <li><a href="http://syndeticlogic.net/?p=311">Getting Spark Setup in Eclipse</a> &mdash; Developer blog post by James Percent</li>
  <li><a href="http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923">Run Spark and Shark on Amazon Elastic MapReduce</a> &mdash; Article by Amazon Elastic MapReduce team member Parviz Deyhim</li>
  <li><a href="http://blog.quantifind.com/posts/spark-unit-test/">Unit testing with Spark</a> &mdash; Quantifind tech blog post by Imran Rashid</li>
  <li><a href="http://blog.quantifind.com/posts/logging-post/">Configuring Spark logs</a> &mdash; Quantifind tech blog by Imran Rashid</li>
  <li><a href="http://www.ibm.com/developerworks/library/os-spark/">Spark, an alternative for fast data analytics</a> &mdash; IBM Developer Works article by M. Tim Jones</li>
</ul>

<h3>Books</h3>

<ul>
  <li><a href="http://www.packtpub.com/fast-data-processing-with-spark/book">Fast Data Processing with Spark</a>, by Holden Karau (Packt Publishing)</li>
</ul>

<h3>Examples</h3>

<ul>
  <li>The <a href="{{site.url}}examples.html">Spark examples page</a> shows the basic API in Scala, Java and Python.</li>
</ul>

<h3>Wiki</h3>

<ul><li>
The <a href="https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage">Spark wiki</a> contains
information for developers, such as architecture documents and how to <a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">contribute</a> to Spark.
</li></ul>

<h3>Research Papers</h3>

<p>
Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
The <a href="{{site.url}}research.html">research page</a> lists some of the original motivation and direction.
The following papers have been published about Spark and related projects.
</p>

<ul>
  <li>
    <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf">Shark: SQL and Rich Analytics at Scale</a>. Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>Technical Report UCB/EECS-2012-214</em>. November 2012.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf">Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters</a>.  Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. <em>HotCloud 2012</em>. June 2012.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2012/sigmod_shark_demo.pdf">Shark: Fast Data Analysis Using Coarse-grained Distributed Memory</a> (demo). Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. <em>SIGMOD 2012</em>. May 2012. <b>Best Demo Award</b>.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a>.  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>NSDI 2012</em>. April 2012. <b>Best Paper Award</b> and <b>Honorable Mention for Community Award</b>.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a>.  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>Technical Report UCB/EECS-2011-82</em>.  July 2011.</li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf">Spark: Cluster Computing with Working Sets</a>. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>HotCloud 2010</em>. June 2010.
  </li>
</ul>