Spark Documentation
Setup instructions, programming guides, and other documentation are available for each version of Spark below:
- Spark 0.8.1 (latest release)
- Spark 0.8.0
- Spark 0.7.3
- Spark 0.6.2
- Spark 0.5.x (hosted on GitHub)
Read these documents to get started with Spark. In addition, this page lists some external resources for learning Spark.
Video Tutorials
- Screencast 1: First Steps with Spark
- Screencast 2: Spark Documentation Overview
- Screencast 3: Transformations and Caching
- Screencast 4: A Spark Standalone Job in Scala
Hands-On Exercises
- Hands-on exercises are available online. They let you launch a small EC2 cluster, load a dataset, and query it with Spark, Shark, Spark Streaming, and MLlib.
Spark Summit Slides and Videos
- Spark Summit 2013 was held in downtown San Francisco in December 2013. Slides and videos of all talks are available for free; look for the links next to the talk titles on the event agenda.
AMP Camp Slides and Videos
- The UC Berkeley AMPLab regularly hosts two-day training camps on Spark and related "big data" components. Slides and videos from each camp are posted online:
  - AMP Camp Three Big Data Bootcamp Berkeley (August 2013)
  - AMP Camp Two Big Data Bootcamp Strata (February 2013)
  - AMP Camp One Big Data Bootcamp Berkeley (August 2012)
Books
- Fast Data Processing with Spark, by Holden Karau (Packt Publishing)
External Tutorials, Development Blogs, and Talks
- Sampling Twitter Using Declarative Streams -- Spark Streaming tutorial by Patrick Wendell
- A Powerful Big Data Trio: Spark, Parquet and Avro -- Using Parquet in Spark by Matt Massie
- Real-time Analytics with Cassandra, Spark, and Shark -- Presentation by Evan Chan from Ooyala at the 2013 Cassandra Summit
- Getting Spark Setup in Eclipse -- Developer blog post by James Percent
- Run Spark and Shark on Amazon Elastic MapReduce -- Article by Amazon AWS Elastic MapReduce team member Parviz Deyhim
- Unit testing with Spark -- Quantifind tech blog post by Imran Rashid
- Configuring Spark logs -- Quantifind tech blog post by Imran Rashid
- Spark, an alternative for fast data analytics -- IBM developerWorks article by M. Tim Jones
Spark Internals
Research Papers
- Shark: SQL and Rich Analytics at Scale. Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. Technical Report UCB/EECS-2012-214. November 2012.
- Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters. Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. HotCloud 2012. June 2012.
- Shark: Fast Data Analysis Using Coarse-grained Distributed Memory (demo). Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. SIGMOD 2012. May 2012. Best Demo Award.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. NSDI 2012. April 2012. Best Paper Award and Honorable Mention for Community Award.
- Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. Technical Report UCB/EECS-2011-82. July 2011.
- Spark: Cluster Computing with Working Sets. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. HotCloud 2010. June 2010.