site/examples.html


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244

<!DOCTYPE html>
<!--[if IE 6]>
<html id="ie6" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html id="ie7" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html id="ie8" dir="ltr" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->
<html dir="ltr" lang="en-US">
<!--<![endif]-->
<head>
  <link rel="shortcut icon" href="/favicon.ico" />
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width" />
  <title>
     Examples | Apache Spark
    
  </title>

  <link rel="stylesheet" type="text/css" media="all" href="/css/style.css" />
  <link rel="stylesheet" href="/css/pygments-default.css">

  <script type="text/javascript">
  <!-- Google Analytics initialization -->
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32518208-2']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

  <!-- Adds slight delay to links to allow async reporting -->
  function trackOutboundLink(link, category, action) {  
    try { 
      _gaq.push(['_trackEvent', category , action]); 
    } catch(err){}
 
    setTimeout(function() {
      document.location.href = link.href;
    }, 100);
  }
  </script>

  <link rel='canonical' href='/index.html' />

  <style type="text/css">
    #site-title,
    #site-description {
      position: absolute !important;
      clip: rect(1px 1px 1px 1px); /* IE6, IE7 */
      clip: rect(1px, 1px, 1px, 1px);
    }
  </style>
  <style type="text/css" id="custom-background-css">
    body.custom-background { background-color: #f1f1f1; }
  </style>
</head>

<!--body class="page singular"-->
<body class="page singular">
<div id="page" class="hfeed">

  <header id="branding" role="banner">
  <hgroup>
    <h1 id="site-title"><span><a href="/" title="Spark" rel="home">Spark</a></span></h1>
    <h2 id="site-description">Lightning-Fast Cluster Computing</h2>
  </hgroup>

  <a href="/">
    <img src="/images/spark-project-header1.png" width="1000" height="220" alt="Spark: Lightning-Fast Cluster Computing" title="Spark: Lightning-Fast Cluster Computing" />
  </a>

  <nav id="access" role="navigation">
    <h3 class="assistive-text">Main menu</h3>
    <div class="menu-main-menu-container">
      <ul id="menu-main-menu" class="menu">
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/index.html">Home</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/downloads.html">Downloads</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/documentation.html">Documentation</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page current-menu-item">
          <a href="/examples.html">Examples</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/mailing-lists.html">Mailing Lists</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/research.html">Research</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/faq.html">FAQ</a>
        </li>
        
      </ul></div>
  </nav><!-- #access -->
</header><!-- #branding -->


  <div id="main">
    <div id="primary">
      <div id="content" role="main">
        
          <article class="page type-page status-publish hentry">
            <h2>Spark Examples</h2>

<p>Spark is built around <em>distributed datasets</em> that support types of parallel operations: transformations, which are lazy and yield another distributed dataset (e.g., <code>map</code>, <code>filter</code>, and <code>join</code>), and actions, which force the computation of a dataset and return a result (e.g., <code>count</code>). The following examples show off some of the available operations and features. Several additional examples are distributed with Spark:</p>

<ul>
  <li>Core Spark: <a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/scala/org/apache/spark/examples">Scala examples</a>, <a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/java/org/apache/spark/examples">Java examples</a>, <a href="https://github.com/apache/incubator-spark/tree/master/python/examples">Python examples</a></li>
  <li>Streaming Spark: <a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples">Scala examples</a>, <a href="https://github.com/apache/incubator-spark/tree/master/examples/src/main/java/org/apache/spark/streaming/examples">Java examples</a></li>
</ul>

<h3>Text Search</h3>

<p>In this example, we search through the error messages in a log file:</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> file = spark.textFile(<span class="string">"hdfs://..."</span>)<br />
<span class="keyword">val</span> errors = file.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("ERROR")</span>)<br />
<span class="comment">// Count all the errors</span><br />
errors.<span class="sparkop">count</span>()<br />
<span class="comment">// Count errors mentioning MySQL</span><br />
errors.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("MySQL")</span>).<span class="sparkop">count</span>()<br />
<span class="comment">// Fetch the MySQL errors as an array of strings</span><br />
errors.<span class="sparkop">filter</span>(<span class="closure">line =&gt; line.contains("MySQL")</span>).<span class="sparkop">collect</span>()<br />
</div>
<p></p>

<p>The red code fragments are Scala function literals (closures) that get passed automatically to the cluster. The blue ones are Spark operations.</p>

<h3>In-Memory Text Search</h3>

<p>Spark can <em>cache</em> datasets in memory to speed up reuse. In the example above, we can load just the error messages in RAM using:</p>

<p>
</p>
<div class="code">
errors.<span class="sparkop">cache</span>()
</div>
<p></p>

<p>After the first action that uses <code>errors</code>, later ones will be much faster.</p>

<h3>Word Count</h3>

<p>In this example, we use a few more transformations to build a dataset of (String, Int) pairs called <code>counts</code> and then save it to a file.</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> file = spark.textFile(<span class="string">"hdfs://..."</span>)<br />
<span class="keyword">val</span> counts = file.<span class="sparkop">flatMap</span>(<span class="closure">line =&gt; line.split(" ")</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.<span class="sparkop">map</span>(<span class="closure">word =&gt; (word, 1)</span>)<br />
&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;.<span class="sparkop">reduceByKey</span>(<span class="closure">_ + _</span>)<br />
counts.<span class="sparkop">saveAsTextFile</span>(<span class="string">"hdfs://..."</span>)
</div>
<p></p>

<h3>Estimating Pi</h3>

<p>Spark can also be used for compute-intensive tasks. This code estimates <span style="font-family: serif; font-size: 120%;">π</span> by "throwing darts" at a circle. We pick random points in the unit square ((0, 0) to (1,1)) and see how many fall in the unit circle. The fraction should be <span style="font-family: serif; font-size: 120%;">π / 4</span>, so we use this to get our estimate.</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> count = spark.parallelize(1 to NUM_SAMPLES).<span class="sparkop">map</span>(<span class="closure">i =&gt;<br />
&nbsp;&nbsp;<span class="keyword">val</span> x = Math.random<br />
&nbsp;&nbsp;<span class="keyword">val</span> y = Math.random<br />
&nbsp;&nbsp;<span class="keyword">if</span> (x*x + y*y &lt; 1) 1.0 <span class="keyword">else</span> 0.0<br />
</span>).<span class="sparkop">reduce</span>(<span class="closure">_ + _</span>)<br />
println(<span class="string">"Pi is roughly "</span> + 4 * count / NUM_SAMPLES)<br />
</div>
<p></p>

<h3>Logistic Regression</h3>

<p>This is an iterative machine learning algorithm that seeks to find the best hyperplane that separates two sets of points in a multi-dimensional feature space. It can be used to classify messages into spam vs non-spam, for example. Because the algorithm applies the same MapReduce operation repeatedly to the same dataset, it benefits greatly from caching the input data in RAM across iterations.</p>

<p>
</p>
<div class="code">
<span class="keyword">val</span> points = spark.textFile(...).<span class="sparkop">map</span>(parsePoint).<span class="sparkop">cache</span>()<br />
<span class="keyword">var</span> w = Vector.random(D) <span class="comment">// current separating plane</span><br />
<span class="keyword">for</span> (i &lt;- 1 to ITERATIONS) {<br />
&nbsp;&nbsp;<span class="keyword">val</span> gradient = points.<span class="sparkop">map</span>(<span class="closure">p =&gt;<br />
&nbsp;&nbsp;&nbsp;&nbsp;(1 / (1 + exp(-p.y*(w dot p.x))) - 1) * p.y * p.x<br />
&nbsp;&nbsp;</span>).<span class="sparkop">reduce</span>(<span class="closure">_ + _</span>)<br />
&nbsp;&nbsp;w -= gradient<br />
}<br />
println(<span class="string">"Final separating plane: "</span> + w)<br />
</div>
<p></p>

<p>Note that <code>w</code> gets shipped automatically to the cluster with every <code>map</code> call.</p>

<p>The graph below compares the performance of this Spark program against a Hadoop implementation on 30 GB of data on an 80-core cluster, showing the benefit of in-memory caching:</p>

<p align="center">
<img src="/images/spark-lr.png" alt="Logistic regression performance in Spark vs Hadoop" />
</p>


          </article><!-- #post -->
        
      </div><!-- #content -->
      
      <footer id="colophon" role="contentinfo">
  <div id="site-generator">
    <p style="padding-top: 0; padding-bottom: 15px;">
      Apache Spark is an effort undergoing incubation at The Apache Software Foundation.
      <a href="http://incubator.apache.org/" style="border: none;">
        <img style="vertical-align: middle; border: none;" src="/images/incubator-logo.png" alt="Apache Incubator" title="Apache Incubator" />
      </a>  
    </p>
  </div>
</footer><!-- #colophon -->

    </div><!-- #primary -->
  </div><!-- #main -->
</div><!-- #page -->


</body>
</html>