<!DOCTYPE html>
<!--[if IE 6]>
<html id="ie6" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html id="ie7" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html id="ie8" dir="ltr" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->
<html dir="ltr" lang="en-US">
<!--<![endif]-->
<head>
  <link rel="shortcut icon" href="/favicon.ico" />
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width" />
  <title>
     Examples | Apache Spark
    
  </title>

  <link rel="stylesheet" type="text/css" media="all" href="/css/style.css" />
  <link rel="stylesheet" href="/css/pygments-default.css">

  <script type="text/javascript">
  // Google Analytics initialization
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32518208-2']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

  // Adds slight delay to links to allow async reporting
  function trackOutboundLink(link, category, action) {  
    try { 
      _gaq.push(['_trackEvent', category , action]); 
    } catch(err){}
 
    setTimeout(function() {
      document.location.href = link.href;
    }, 100);
  }
  </script>

  <link rel='canonical' href='/examples.html' />

  <style type="text/css">
    #site-title,
    #site-description {
      position: absolute !important;
      clip: rect(1px 1px 1px 1px); /* IE6, IE7 */
      clip: rect(1px, 1px, 1px, 1px);
    }
  </style>
  <style type="text/css" id="custom-background-css">
    body.custom-background { background-color: #f1f1f1; }
  </style>
</head>

<body class="page singular">
<div id="page" class="hfeed">

  <header id="branding" role="banner">
  <hgroup>
    <h1 id="site-title"><span><a href="/" title="Spark" rel="home">Spark</a></span></h1>
    <h2 id="site-description">Lightning-Fast Cluster Computing</h2>
  </hgroup>

  <a href="/">
    <img src="/images/spark-project-header1.png" width="1000" height="220" alt="Spark: Lightning-Fast Cluster Computing" title="Spark: Lightning-Fast Cluster Computing" />
  </a>

  <nav id="access" role="navigation">
    <h3 class="assistive-text">Main menu</h3>
    <div class="menu-main-menu-container">
      <ul id="menu-main-menu" class="menu">
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/index.html">Home</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/downloads.html">Downloads</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/documentation.html">Documentation</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page current-menu-item">
          <a href="/examples.html">Examples</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/mailing-lists.html">Mailing Lists</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/research.html">Research</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/faq.html">FAQ</a>
        </li>
        
      </ul></div>
  </nav><!-- #access -->
</header><!-- #branding -->



  <div id="main">
    <div id="primary">
      <div id="content" role="main">
        
          <article class="page type-page status-publish hentry">
            <h2>Spark Examples</h2>

<p>Spark is built around <em>distributed datasets</em> that support two types of parallel operations: transformations, which are lazy and yield another distributed dataset (e.g., <code>map</code>, <code>filter</code>, and <code>join</code>), and actions, which force the computation of a dataset and return a result (e.g., <code>count</code>). The following examples show off some of the available operations and features. Several additional examples are distributed with Spark, both for core Spark (<a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/scala/org/apache/spark/examples">Scala examples</a>, <a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/java/org/apache/spark/examples">Java examples</a>, <a href="https://github.com/apache/incubator-spark/tree/master/python/examples">Python examples</a>) and streaming Spark (<a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples">Scala examples</a>, <a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/java/org/apache/spark/streaming/examples">Java examples</a>).</p>
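
<p>For instance, assuming a Spark context named <code>spark</code> (as in the examples below), a transformation such as <code>filter</code> only defines a new dataset, while an action such as <code>count</code> actually runs the computation:</p>

<div class="code">
<span class="keyword">val</span> data = spark.parallelize(1 to 1000)<br />
<span class="comment">// Transformation: lazy, returns a new distributed dataset</span><br />
<span class="keyword">val</span> evens = data.<span class="sparkop">filter</span>(<span class="closure">x =&gt; x % 2 == 0</span>)<br />
<span class="comment">// Action: forces the computation and returns a result (500)</span><br />
evens.<span class="sparkop">count</span>()<br />
</div>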

<h3>Text Search</h3>

<p>In this example, we search through the error messages in a log file:</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> file = spark.textFile(<span class="string">"hdfs://..."</span>)<br />
<span class="keyword">val</span> errors = file.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("ERROR")</span>)<br />
<span class="comment">// Count all the errors</span><br />
errors.<span class="sparkop">count</span>()<br />
<span class="comment">// Count errors mentioning MySQL</span><br />
errors.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("MySQL")</span>).<span class="sparkop">count</span>()<br />
<span class="comment">// Fetch the MySQL errors as an array of strings</span><br />
errors.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("MySQL")</span>).<span class="sparkop">collect</span>()<br />
</div>
<p></p>

<p>The red code fragments are Scala function literals (closures) that get passed automatically to the cluster. The blue ones are Spark operations.</p>

<h3>In-Memory Text Search</h3>

<p>Spark can <em>cache</em> datasets in memory to speed up reuse. In the example above, we can load just the error messages into RAM using:</p>

<p>
</p>
<div class="code">
errors.<span class="sparkop">cache</span>()
</div>
<p></p>

<p>After the first action that uses <code>errors</code>, later ones will be much faster.</p>
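
<p>For example, continuing the text search above (a sketch; actual timings depend on the dataset and cluster):</p>

<div class="code">
errors.<span class="sparkop">cache</span>()<br />
<span class="comment">// First action: reads the file and populates the in-memory cache</span><br />
errors.<span class="sparkop">count</span>()<br />
<span class="comment">// Later actions reuse the cached data and run much faster</span><br />
errors.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("MySQL")</span>).<span class="sparkop">count</span>()<br />
</div>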

<h3>Word Count</h3>

<p>In this example, we use a few more transformations to build a dataset of (String, Int) pairs called <code>counts</code> and then save it to a file.</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> file = spark.textFile(<span class="string">"hdfs://..."</span>)<br />
<span class="keyword">val</span> counts = file.<span class="sparkop">flatMap</span>(<span class="closure">line =&gt; line.split(" ")</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.<span class="sparkop">map</span>(<span class="closure">word =&gt; (word, 1)</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.<span class="sparkop">reduceByKey</span>(<span class="closure">_ + _</span>)<br />
counts.<span class="sparkop">saveAsTextFile</span>(<span class="string">"hdfs://..."</span>)
</div>
<p></p>

<h3>Estimating Pi</h3>

<p>Spark can also be used for compute-intensive tasks. This code estimates <span style="font-family: serif; font-size: 120%;">π</span> by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1, 1)) and see how many fall in the unit circle. Since the quarter of the circle that lies inside the square has area <span style="font-family: serif; font-size: 120%;">π / 4</span> while the square has area 1, the fraction of points landing inside the circle should be <span style="font-family: serif; font-size: 120%;">π / 4</span>, so we use this fraction to get our estimate.</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> count = spark.parallelize(1 to NUM_SAMPLES).<span class="sparkop">map</span>(<span class="closure">i =&gt;<br />
&nbsp;&nbsp;<span class="keyword">val</span> x = Math.random<br />
&nbsp;&nbsp;<span class="keyword">val</span> y = Math.random<br />
&nbsp;&nbsp;<span class="keyword">if</span> (x*x + y*y &lt; 1) 1.0 <span class="keyword">else</span> 0.0<br />
</span>).<span class="sparkop">reduce</span>(<span class="closure">_ + _</span>)<br />
println(<span class="string">"Pi is roughly "</span> + 4 * count / NUM_SAMPLES)<br />
</div>
<p></p>

<h3>Logistic Regression</h3>

<p>This is an iterative machine learning algorithm that seeks to find the hyperplane that best separates two sets of points in a multi-dimensional feature space. It can be used, for example, to classify messages as spam or non-spam. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input data in RAM across iterations.</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> points = spark.textFile(...).<span class="sparkop">map</span>(parsePoint).<span class="sparkop">cache</span>()<br />
<span class="keyword">var</span> w = Vector.random(D) <span class="comment">// current separating plane</span><br />
<span class="keyword">for</span> (i &lt;- 1 to ITERATIONS) {<br />
&nbsp;&nbsp;<span class="keyword">val</span> gradient = points.<span class="sparkop">map</span>(<span class="closure">p =&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x<br />
&nbsp;&nbsp;</span>).<span class="sparkop">reduce</span>(<span class="closure">_ + _</span>)<br />
&nbsp;&nbsp;w -= gradient<br />
}<br />
println(<span class="string">"Final separating plane: "</span> + w)<br />
</div>
<p></p>

<p>Note that <code>w</code> gets shipped automatically to the cluster with every <code>map</code> call.</p>
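
<p>The code above assumes a <code>parsePoint</code> helper and a simple dense <code>Vector</code> type supporting <code>dot</code> and element-wise arithmetic, neither of which is shown here. A hypothetical <code>parsePoint</code> for space-separated input (a label followed by D feature values) might look like:</p>

<div class="code">
<span class="comment">// Hypothetical helper, assumed by the example above</span><br />
<span class="keyword">case class</span> DataPoint(x: Vector, y: Double)<br />
<span class="keyword">def</span> parsePoint(line: String): DataPoint = {<br />
&nbsp;&nbsp;<span class="keyword">val</span> nums = line.split(' ').map(_.toDouble)<br />
&nbsp;&nbsp;DataPoint(<span class="keyword">new</span> Vector(nums.tail), nums.head)<br />
}<br />
</div>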

<p>The graph below compares the performance of this Spark program against a Hadoop implementation on 30 GB of data on an 80-core cluster, showing the benefit of in-memory caching:</p>

<p align="center">
<img src="/images/spark-lr.png" alt="Logistic regression performance in Spark vs Hadoop" />
</p>


          </article><!-- #post -->
        
      </div><!-- #content -->
      
      <footer id="colophon" role="contentinfo">
  <div id="site-generator">
    <p style="padding-top: 0; padding-bottom: 15px;">
      Apache Spark is an effort undergoing incubation at The Apache Software Foundation.
      <a href="http://incubator.apache.org/" style="border: none;">
        <img style="vertical-align: middle; border: none;" src="/images/incubator-logo.png" alt="Apache Incubator" title="Apache Incubator" />
      </a>  
    </p>
  </div>
</footer><!-- #colophon -->

    </div><!-- #primary -->
  </div><!-- #main -->
</div><!-- #page -->


</body>
</html>