---
layout: global
title: Documentation
type: "page singular"
navigation:
  weight: 3
  show: true
---

<h2>Spark Documentation</h2>

<p>Setup instructions, programming guides, and other documentation are available for each version of Spark below:</p>

<ul>
  <li><a href="{{site.url}}docs/latest/">Spark 1.0.1 (latest release)</a></li>
  <li><a href="{{site.url}}docs/0.9.1/">Spark 0.9.1</a></li>
  <li><a href="{{site.url}}docs/0.8.1/">Spark 0.8.1</a></li>
  <li><a href="{{site.url}}docs/0.7.3/">Spark 0.7.3</a></li>
  <li><a href="{{site.url}}docs/0.6.2/">Spark 0.6.2</a></li>
</ul>

<p>The documentation linked above covers getting started with Spark, as well as the built-in components <a href="{{site.url}}docs/latest/mllib-guide.html">MLlib</a>,
<a href="{{site.url}}docs/latest/streaming-programming-guide.html">Spark Streaming</a>, and <a href="{{site.url}}docs/latest/graphx-guide.html">GraphX</a>.</p>

<p>In addition, this page lists other resources for learning Spark.</p>

<h3>Videos</h3>
See the <a href="http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w">Apache Spark YouTube Channel</a> for videos from Spark events. There are separate <a href="http://www.youtube.com/channel/UCRzsq7k4-kT-h3TDUBQ82-w/playlists">playlists</a> for videos on different topics. Besides browsing the playlists, you can also find direct links to videos below.

<h4>Screencast Tutorial Videos</h4>
<ul>
  <li><a href="{{site.url}}screencasts/1-first-steps-with-spark.html">Screencast 1: First Steps with Spark</a></li>
  <li><a href="{{site.url}}screencasts/2-spark-documentation-overview.html">Screencast 2: Spark Documentation Overview</a></li>
  <li><a href="{{site.url}}screencasts/3-transformations-and-caching.html">Screencast 3: Transformations and Caching</a></li>
  <li><a href="{{site.url}}screencasts/4-a-standalone-job-in-spark.html">Screencast 4: A Spark Standalone Job in Scala</a></li>
</ul>

<h4>Spark Summit Videos</h4>
<ul>
  <li>Videos from Spark Summit 2013, San Francisco, Dec 2-3 2013
    <ul>
      <li>See the <a href="http://spark-summit.org/2013#agendapluginwidget-4">event agenda</a> for the full agenda with inline links to all videos and slides</li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwjXj33QvAXN0Vlx0gc6u0je">YouTube playlist of all Keynotes</a></li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwiNcKwIkDEQZBejiqxEJ79U">YouTube playlist of Track A (Spark Applications)</a></li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwiNcKwIkDEQZBejiqxEJ79U">YouTube playlist of Track B (Spark Deployment, Scheduling &amp; Perf, Related projects)</a></li>
      <li><a href="http://www.youtube.com/playlist?list=PL-x35fyliRwjR1Umntxz52zv3EcKpbzCp">YouTube playlist of the Training Day (i.e. the 2nd day of the summit)</a></li>
    </ul>
  </li>
</ul>

<h4><a name="meetup-videos"></a>Meetup Talk Videos</h4>
In addition to the videos listed below, you can also view <a href="http://www.meetup.com/spark-users/files/">all slides from Bay Area meetups here</a>.
<style type="text/css">
  .video-meta-info {
    font-size: 0.95em;
  }
</style>
<ul>
  <li><a href="http://www.youtube.com/watch?v=NUQ-8to2XAk&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Spark 1.0 and Beyond</a> (<a href="http://files.meetup.com/3138542/Spark%201.0%20Meetup.ppt">slides</a>) <span class="video-meta-info">by Patrick Wendell, at Cisco in San Jose, 2014-04-23</span></li>

  <li><a href="http://www.youtube.com/watch?v=ju2OQEXqONU&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Adding Native SQL Support to Spark with Catalyst</a> (<a href="http://files.meetup.com/3138542/Spark%20SQL%20Meetup%20-%204-8-2012.pdf">slides</a>) <span class="video-meta-info">by Michael Armbrust, at Tagged in SF, 2014-04-08</span></li>

  <li><a href="http://www.youtube.com/watch?v=MY0NkZY_tJw&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">SparkR and GraphX</a> (slides: <a href="http://files.meetup.com/3138542/SparkR-meetup.pdf">SparkR</a>, <a href="http://files.meetup.com/3138542/graphx%40spark_meetup03_2014.pdf">GraphX</a>) <span class="video-meta-info">by Shivaram Venkataraman &amp; Dan Crankshaw, at SkyDeck in Berkeley, 2014-03-25</span></li>

  <li><a href="http://www.youtube.com/watch?v=5niXiiEX5pE&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Simple deployment w/ SIMR &amp; Advanced Shark Analytics w/ TGFs</a> (<a href="http://files.meetup.com/3138542/tgf.pptx">slides</a>) <span class="video-meta-info">by Ali Ghodsi, at Huawei in Santa Clara, 2014-02-05</span></li>

  <li><a href="http://www.youtube.com/watch?v=C7gWtxelYNM&list=PL-x35fyliRwiP3YteXbnhk0QGOtYLBT3a">Stores, Monoids &amp; Dependency Injection - Abstractions for Spark</a> (<a href="http://files.meetup.com/3138542/Abstractions%20for%20spark%20streaming%20-%20spark%20meetup%20presentation.pdf">slides</a>) <span class="video-meta-info">by Ryan Weald, at Sharethrough in SF, 2014-01-17</span></li>

  <li><a href="https://www.youtube.com/watch?v=IxDnF_X4M-8">Distributed Machine Learning using MLbase</a> (<a href="http://files.meetup.com/3138542/sparkmeetup_8_6_13_final_reduced.pdf">slides</a>) <span class="video-meta-info">by Evan Sparks &amp; Ameet Talwalkar, at Twitter in SF, 2013-08-06</span></li>

  <li><a href="https://www.youtube.com/watch?v=vJQ2RZj9hqs">GraphX Preview: Graph Analysis on Spark</a> <span class="video-meta-info">by Reynold Xin &amp; Joseph Gonzalez, at Flurry in SF, 2013-07-02</span></li>

  <li><a href="http://www.youtube.com/watch?v=D1knCQZQQnw">Deep Dive with Spark Streaming</a> (<a href="http://www.slideshare.net/spark-project/deep-divewithsparkstreaming-tathagatadassparkmeetup20130617">slides</a>) <span class="video-meta-info">by Tathagata Das, at Plug and Play in Sunnyvale, 2013-06-17</span></li>

  <li><a href="https://www.youtube.com/watch?v=cAZ624-69PQ">Tachyon and Shark update</a> (slides: <a href="http://files.meetup.com/3138542/2013-05-09%20Shark%20%40%20Spark%20Meetup.pdf">Shark</a>, <a href="http://files.meetup.com/3138542/Tachyon_2013-05-09_Spark_Meetup.pdf">Tachyon</a>) <span class="video-meta-info">by Ali Ghodsi, Haoyuan Li &amp; Reynold Xin, at Google Ventures, 2013-05-09</span></li>

  <li><a href="https://www.youtube.com/playlist?list=PLxwbieuTaYXmWTBovyyw2NibPfUaJk-h4">Spark 0.7: Overview, pySpark, &amp; Streaming</a> <span class="video-meta-info">by Matei Zaharia, Josh Rosen &amp; Tathagata Das, at Conviva, 2013-02-21</span></li>

  <li><a href="https://www.youtube.com/watch?v=49Hr5xZyTEA">Introduction to Spark Internals</a> (<a href="http://files.meetup.com/3138542/dev-meetup-dec-2012.pptx">slides</a>) <span class="video-meta-info">by Matei Zaharia, at Yahoo in Sunnyvale, 2012-12-18</span></li>




</ul>


<h3>Hands-On Exercises</h3>

<ul>
  <li><a href="http://spark-summit.org/2013/exercises/">Hands-on exercises</a> are available online from Spark Summit 2013. These exercises let you launch a small EC2 cluster, load a dataset, and query it with Spark, Shark, Spark Streaming, and MLlib.</li>
</ul>

<a name="summit"></a>
<h3>Training Materials</h3>
<ul>
  <li>The 2nd day of <a href="http://spark-summit.org/2013">Spark Summit 2013</a> was a training session; its slides and videos are linked inline in <a href="http://spark-summit.org/summit-2013/#agendapluginwidget-5">the training day agenda</a>.
    The session also included <a href="http://spark-summit.org/2013/exercises/">exercises</a> that you can work through yourself. They guide you through launching a Spark cluster on EC2 and using various Spark components to analyze real data.</li>
  <li>The <a href="https://amplab.cs.berkeley.edu/">UC Berkeley AMPLab</a> regularly hosts training camps on Spark and related projects.
    Slides, videos, and EC2-based exercises from each of these are available online:
    <ul>
      <li><a href="http://ampcamp.berkeley.edu/4/">AMP Camp 4</a> (Strata Santa Clara, Feb 2014) &mdash; focus on BlinkDB, MLlib, GraphX, Tachyon</li>
      <li><a href="http://ampcamp.berkeley.edu/3/">AMP Camp 3</a> (Berkeley, CA, Aug 2013)</li>
      <li><a href="http://ampcamp.berkeley.edu/amp-camp-two-strata-2013/">AMP Camp 2</a> (Strata Santa Clara, Feb 2013)</li>
      <li><a href="http://ampcamp.berkeley.edu/agenda-2012/">AMP Camp 1</a> (Berkeley, CA, Aug 2012)</li>
    </ul>
  </li>
</ul>

<h3>External Tutorials, Blog Posts, and Talks</h3>

<ul>
  <li><a href="http://engineering.ooyala.com/blog/using-parquet-and-scrooge-spark">Using Parquet and Scrooge with Spark</a> &mdash; Scala-friendly Parquet and Avro usage tutorial from Ooyala's Evan Chan</li>
  <li><a href="http://codeforhire.com/2014/02/18/using-spark-with-mongodb/">Using Spark with MongoDB</a> &mdash; by Sampo Niskanen from Wellmo</li>
  <li><a href="http://spark-summit.org/2013">Spark Summit 2013</a> &mdash; contained 30 talks about Spark use cases, available as slides and videos</li>
  <li><a href="http://www.pwendell.com/2013/09/28/declarative-streams.html">Sampling Twitter Using Declarative Streams</a> &mdash; Spark Streaming tutorial by Patrick Wendell</li>
  <li><a href="http://zenfractal.com/2013/08/21/a-powerful-big-data-trio/">A Powerful Big Data Trio: Spark, Parquet and Avro</a> &mdash; Using Parquet in Spark by Matt Massie</li>
  <li><a href="http://www.slideshare.net/EvanChan2/cassandra2013-spark-talk-final">Real-time Analytics with Cassandra, Spark, and Shark</a> &mdash; Presentation by Evan Chan from Ooyala at 2013 Cassandra Summit</li>
  <li><a href="http://syndeticlogic.net/?p=311">Getting Spark Setup in Eclipse</a> &mdash; Developer blog post by James Percent</li>
  <li><a href="http://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923">Run Spark and Shark on Amazon Elastic MapReduce</a> &mdash; Article by Amazon Elastic MapReduce team member Parviz Deyhim</li>
  <li><a href="http://blog.quantifind.com/posts/spark-unit-test/">Unit testing with Spark</a> &mdash; Quantifind tech blog post by Imran Rashid</li>
  <li><a href="http://blog.quantifind.com/posts/logging-post/">Configuring Spark logs</a> &mdash; Quantifind tech blog by Imran Rashid</li>
  <li><a href="http://www.ibm.com/developerworks/library/os-spark/">Spark, an alternative for fast data analytics</a> &mdash; IBM developerWorks article by M. Tim Jones</li>
</ul>

<h3>Books</h3>

<ul>
  <li><a href="http://www.packtpub.com/fast-data-processing-with-spark/book">Fast Data Processing with Spark</a>, by Holden Karau (Packt Publishing)</li>
</ul>

<h3>Examples</h3>

<ul>
  <li>The <a href="{{site.url}}examples.html">Spark examples page</a> shows the basic API in Scala, Java and Python.</li>
</ul>

<h3>Wiki</h3>

<ul><li>
The <a href="https://cwiki.apache.org/confluence/display/SPARK/Wiki+Homepage">Spark wiki</a> contains
information for developers, such as architecture documents and how to <a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">contribute</a> to Spark.
</li></ul>

<h3>Research Papers</h3>

<p>
Spark was initially developed as a UC Berkeley research project, and much of the design is documented in papers.
The <a href="{{site.url}}research.html">research page</a> lists some of the original motivation and direction.
The following papers have been published about Spark and related projects.
</p>

<ul>
  <li>
    <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf">Shark: SQL and Rich Analytics at Scale</a>. Reynold Xin, Joshua Rosen, Matei Zaharia, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>Technical Report UCB/EECS-2012-214</em>. November 2012.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf">Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters</a>.  Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica. <em>HotCloud 2012</em>. June 2012.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2012/sigmod_shark_demo.pdf">Shark: Fast Data Analysis Using Coarse-grained Distributed Memory</a> (demo). Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Haoyuan Li, Scott Shenker, Ion Stoica. <em>SIGMOD 2012</em>. May 2012. <b>Best Demo Award</b>.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a>.  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>NSDI 2012</em>. April 2012. <b>Best Paper Award</b> and <b>Honorable Mention for Community Award</b>.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2011/tr_spark.pdf">Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing</a>.  Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>Technical Report UCB/EECS-2011-82</em>.  July 2011.
  </li>
  <li>
    <a href="http://www.cs.berkeley.edu/~matei/papers/2010/hotcloud_spark.pdf">Spark: Cluster Computing with Working Sets</a>. Matei Zaharia, Mosharaf Chowdhury, Michael J. Franklin, Scott Shenker, Ion Stoica. <em>HotCloud 2010</em>. June 2010.
  </li>
</ul>