Spark Documentation
Setup instructions, programming guides, and other documentation are available for each version of Spark below:
The documentation linked to above covers getting started with Spark, as well as the built-in components MLlib,
Spark Streaming, and GraphX.
In addition, this page lists other resources for learning Spark.
Videos
See the Apache Spark YouTube Channel for videos from Spark events. There are separate playlists for videos on different topics. Besides browsing through playlists, you can also find direct links to videos below.
Screencast Tutorial Videos
Spark Summit Videos
- Videos from Spark Summit 2013, San Francisco, Dec 2-3 2013
Meetup Talk Videos
In addition to the videos listed below, you can also view all slides from Bay Area meetups here.
- Spark 1.0 and Beyond (slides) by Patrick Wendell, at Cisco in San Jose, 2014-04-23
- Adding Native SQL Support to Spark with Catalyst (slides) by Michael Armbrust, at Tagged in SF, 2014-04-08
- SparkR and GraphX (slides: SparkR, GraphX) by Shivaram Venkataraman & Dan Crankshaw, at SkyDeck in Berkeley, 2014-03-25
- Simple deployment w/ SIMR & Advanced Shark Analytics w/ TGFs (slides) by Ali Ghodsi, at Huawei in Santa Clara, 2014-02-05
- Stores, Monoids & Dependency Injection - Abstractions for Spark (slides) by Ryan Weald, at Sharethrough in SF, 2014-01-17
- Distributed Machine Learning using MLbase (slides) by Evan Sparks & Ameet Talwalkar, at Twitter in SF, 2013-08-06
- GraphX Preview: Graph Analysis on Spark by Reynold Xin & Joseph Gonzalez, at Flurry in SF, 2013-07-02
- Deep Dive with Spark Streaming (slides) by Tathagata Das, at Plug and Play in Sunnyvale, 2013-06-17
- Tachyon and Shark update (slides: Shark, Tachyon) by Ali Ghodsi, Haoyuan Li, Reynold Xin, at Google Ventures, 2013-05-09
- Spark 0.7: Overview, pySpark, & Streaming by Matei Zaharia, Josh Rosen, Tathagata Das, at Conviva, 2013-02-21
- Introduction to Spark Internals (slides) by Matei Zaharia, at Yahoo in Sunnyvale, 2012-12-18
Hands-On Exercises
- Hands-on exercises are available online from Spark Summit 2013. These exercises let you launch a small EC2 cluster, load a dataset, and query it with Spark, Shark, Spark Streaming, and MLlib.
Training Materials
- The second day of Spark Summit 2013 was a training session; the slides and videos from it are available inline in the training day agenda.
The session also included exercises that you can walk through yourself, which will guide you through launching a Spark cluster on EC2 and using various Spark components to analyze real data.
- The UC Berkeley AMPLab regularly hosts training camps on Spark and related projects.
Slides, videos and EC2-based exercises from each of these are available online:
External Tutorials, Blog Posts, and Talks
Books
Examples
Wiki
- The Spark wiki contains information for developers, such as architecture documents and how to contribute to Spark.
Research Papers
Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
The research page lists some of the original motivation and direction.
The following papers have been published about Spark and related projects.
- Shark: SQL and Rich Analytics at Scale. Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. Technical Report UCB/EECS-2012-214. November 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. HotCloud 2012. June 2012.
- Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo). Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. SIGMOD 2012. May 2012. Best Demo Award.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Technical Report UCB/EECS-2011-82. July 2011.
- Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.