<!DOCTYPE html>
<!--[if IE 6]>
<html id="ie6" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html id="ie7" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html id="ie8" dir="ltr" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->
<html dir="ltr" lang="en-US">
<!--<![endif]-->
<head>
  <link rel="shortcut icon" href="/favicon.ico" />
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width" />
  <title>
     Spark Release 0.6.0 | Apache Spark
    
  </title>

  <link rel="stylesheet" type="text/css" media="all" href="/css/style.css" />
  <link rel="stylesheet" href="/css/pygments-default.css">

  <script type="text/javascript">
  // Google Analytics initialization
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32518208-2']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

  // Adds slight delay to links to allow async reporting
  function trackOutboundLink(link, category, action) {  
    try { 
      _gaq.push(['_trackEvent', category , action]); 
    } catch(err){}
 
    setTimeout(function() {
      document.location.href = link.href;
    }, 100);
  }
  </script>

  <link rel='canonical' href='/releases/spark-release-0-6-0.html' />

  <style type="text/css">
    #site-title,
    #site-description {
      position: absolute !important;
      clip: rect(1px 1px 1px 1px); /* IE6, IE7 */
      clip: rect(1px, 1px, 1px, 1px);
    }
  </style>
  <style type="text/css" id="custom-background-css">
    body.custom-background { background-color: #f1f1f1; }
  </style>
</head>

<!--body class="page singular"-->
<body class="singular">
<div id="page" class="hfeed">

  <header id="branding" role="banner">
  <hgroup>
    <h1 id="site-title"><span><a href="/" title="Spark" rel="home">Spark</a></span></h1>
    <h2 id="site-description">Lightning-Fast Cluster Computing</h2>
  </hgroup>

  <a id="main-logo" href="/">
    <img style="height:175px; width:auto;" src="/images/spark-project-header1-cropped.png" alt="Spark: Lightning-Fast Cluster Computing" title="Spark: Lightning-Fast Cluster Computing" />
  </a>
  <div class="widget-summit">
    <a href="http://spark-summit.org"><img src="/images/Summit-Logo-FINALtr-150x150px.png" /></a>
    <div class="text">
      <a href="http://spark-summit.org/2013">
        
        <strong>Videos and Slides<br/>
        Available Now!</strong>
      </a>
    </div>
  </div>

  <nav id="access" role="navigation">
    <h3 class="assistive-text">Main menu</h3>
    <div class="menu-main-menu-container">
      <ul id="menu-main-menu" class="menu">
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/index.html">Home</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/downloads.html">Downloads</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/documentation.html">Documentation</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/examples.html">Examples</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/mailing-lists.html">Mailing Lists</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/research.html">Research</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/faq.html">FAQ</a>
        </li>
        
      </ul></div>
  </nav><!-- #access -->
</header><!-- #branding -->



  <div id="main">
    <div id="primary">
      <div id="content" role="main">
        
          <article class="page type-page status-publish hentry">
            <h2>Spark Release 0.6.0</h2>


<p>Spark 0.6.0 is a major release that brings several new features, architectural changes, and performance enhancements. The most visible additions are a standalone deploy mode, a Java API, and expanded documentation, but there are also numerous changes under the hood that improve performance by as much as 2x in some cases.</p>

<p>You can download this release as either a <a href="http://github.com/downloads/mesos/spark/spark-0.6.0-sources.tar.gz">source package</a> (2 MB tar.gz) or a <a href="http://github.com/downloads/mesos/spark/spark-0.6.0-prebuilt.tar.gz">prebuilt package</a> (48 MB tar.gz).</p>

<h3>Simpler Deployment</h3>

<p>In addition to running on Mesos, Spark now has a <a href="/docs/0.6.0/spark-standalone.html">standalone deploy mode</a> that lets you quickly launch a cluster without installing an external cluster manager. The standalone mode only needs Java installed on each machine, and Spark deployed to it.</p>
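
<p>For example, once a standalone master is up, a driver program can connect to the cluster simply by passing the master&#8217;s <tt>spark://</tt> URL when creating its <tt>SparkContext</tt>. The snippet below is a minimal Scala sketch; the host name, port, and application name are placeholders, so use the URL your master actually reports (see the standalone mode guide):</p>

<pre><code>import spark.SparkContext

// "masterhost:7077" is a placeholder; use the spark:// URL printed by your master.
val sc = new SparkContext("spark://masterhost:7077", "StandaloneExample")

// From here on, jobs run on the standalone cluster just as they would on Mesos.
val data = sc.parallelize(1 to 1000)
println(data.filter(_ % 2 == 0).count())</code></pre>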

<p>In addition, there is <a href="/docs/0.6.0/running-on-yarn.html">experimental support for running on YARN</a> (Hadoop NextGen), currently in a separate branch.</p>

<h3>Java API</h3>

<p>Java programmers can now use Spark through a new <a href="/docs/0.6.0/java-programming-guide.html">Java API layer</a>. This layer makes available all of Spark&#8217;s features, including parallel transformations, distributed datasets, broadcast variables, and accumulators, in a Java-friendly manner.</p>

<h3>Expanded Documentation</h3>

<p>Spark&#8217;s <a href="/docs/0.6.0/">documentation</a> has been expanded with a new <a href="/docs/0.6.0/quick-start.html">quick start guide</a>, additional deployment instructions, a configuration guide, a tuning guide, and improved <a href="/docs/0.6.0/api/core">Scaladoc</a> API documentation.</p>

<h3>Engine Changes</h3>

<p>Under the hood, Spark 0.6 has new, custom storage and communication layers brought in from the upcoming <a href="http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf">Spark Streaming</a> project. These can improve performance over past versions by as much as 2x. Specifically:</p>

<ul>
  <li>A new communication manager using asynchronous Java NIO lets shuffle operations run faster, especially when sending large amounts of data or when jobs have many tasks.</li>
  <li>A new storage manager supports per-dataset storage level settings (e.g. whether to keep the dataset in memory, serialized or deserialized, on disk, or replicated across nodes).</li>
  <li>Spark's scheduler and control plane have been optimized to better support ultra-low-latency jobs (under 500ms) and high-throughput scheduling decisions.</li>
</ul>

<h3>New APIs</h3>

<ul>
  <li>This release adds the ability to control caching strategies on a per-RDD level, so that different RDDs may be stored in memory, on disk, as serialized bytes, etc. You can choose your storage level using the <a href="/docs/0.6.0/scala-programming-guide.html#rdd-persistence"><tt>persist()</tt> method</a> on RDD (see the sketch after this list).</li>
  <li>A new Accumulable class generalizes Accumulators to the case where the type being accumulated is not the same as the type of the elements being added (e.g. when you wish to accumulate a collection, such as a Set, by adding individual elements).</li>
  <li>You can now dynamically add files or JARs that should be shipped to your workers with <tt>SparkContext.addFile/Jar</tt>.</li>
  <li>More Spark operators (e.g. joins) support custom partitioners.</li>
</ul>
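
<p>As a rough illustration of the first and third items above, the sketch below sets a per-RDD storage level and ships a supporting file to the workers. It is only a sketch: the paths are placeholders, and the exact set of <tt>StorageLevel</tt> constants is listed in the 0.6.0 RDD persistence documentation.</p>

<pre><code>import spark.SparkContext
import spark.storage.StorageLevel

val sc = new SparkContext("local", "PersistenceExample")

// Ship a supporting file to every worker node (the path is a placeholder).
sc.addFile("/path/to/lookup-table.txt")

val lines = sc.textFile("/path/to/input.txt")

// Keep this particular dataset on disk only; other RDDs in the same program can
// use different levels, replacing the old global spark.cache.class setting.
lines.persist(StorageLevel.DISK_ONLY)

// cache() is still shorthand for the default in-memory storage level.
val words = lines.flatMap(_.split(" ")).cache()
println(words.count())</code></pre>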

<h3>Enhanced Debugging</h3>

<p>Spark&#8217;s logs now indicate which operation in your program each RDD and job belongs to, making it easier to tie problems back to the parts of your code that produced them.</p>

<h3>Maven Artifacts</h3>

<p>Spark is now available in Maven Central, making it easier to link it into your programs without having to build the JAR yourself. Use the following Maven identifiers to add it to a project:</p>

<ul>
  <li>groupId: org.spark-project</li>
  <li>artifactId: spark-core_2.9.2</li>
  <li>version: 0.6.0</li>
</ul>
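
<p>In an sbt build, for example, these coordinates correspond to a single library dependency, as in the sketch below (adjust to your own sbt version and build layout):</p>

<pre><code>// In your sbt build definition; spark-core_2.9.2 is resolved from Maven Central.
scalaVersion := "2.9.2"

libraryDependencies += "org.spark-project" % "spark-core_2.9.2" % "0.6.0"</code></pre>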

<h3>Compatibility</h3>

<p>This release is source-compatible with Spark 0.5 programs, but you will need to recompile them against 0.6. In addition,  the configuration for caching has changed: instead of having a <tt>spark.cache.class</tt> parameter that sets one caching strategy for all RDDs, you can now set a <a href="/docs/0.6.0/scala-programming-guide.html#rdd-persistence">per-RDD storage level</a>. Spark will warn if you try to set <tt>spark.cache.class</tt>.</p>

<h3>Credits</h3>

<p>Spark 0.6 was the work of a large set of new contributors from Berkeley and outside.</p>

<ul>
  <li>Tathagata Das contributed the new communication layer, and parts of the storage layer.</li>
  <li>Haoyuan Li contributed the new storage manager.</li>
  <li>Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.</li>
  <li>Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.</li>
  <li>Josh Rosen contributed the Java API, and several bug fixes.</li>
  <li>Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.</li>
  <li>Reynold Xin contributed numerous bug and performance fixes.</li>
  <li>Imran Rashid contributed the new Accumulable class.</li>
  <li>Harvey Feng contributed improvements to shuffle operations.</li>
  <li>Shivaram Venkataraman improved Spark's memory estimation and wrote a memory tuning guide.</li>
  <li>Ravi Pandya contributed Spark run scripts for Windows.</li>
  <li>Mosharaf Chowdhury provided several fixes to broadcast.</li>
  <li>Henry Milner pointed out several bugs in sampling algorithms.</li>
  <li>Ray Racine provided improvements to the EC2 scripts.</li>
  <li>Paul Ruan and Bill Zhao helped with testing.</li>
</ul>

<p style="padding-top:5px;">Thanks also to all the Spark users who have diligently suggested features or reported bugs.</p>

          </article><!-- #post -->
        
      </div><!-- #content -->
      
      <footer id="colophon" role="contentinfo">
  <div id="site-generator">
    <p style="padding-top: 0; padding-bottom: 15px;">
      Apache Spark is an effort undergoing incubation at The Apache Software Foundation.
      <a href="http://incubator.apache.org/" style="border: none;">
        <img style="vertical-align: middle; border: none;" src="/images/incubator-logo.png" alt="Apache Incubator" title="Apache Incubator" />
      </a>  
    </p>
  </div>
</footer><!-- #colophon -->

    </div><!-- #primary -->
  </div><!-- #main -->
</div><!-- #page -->


</body>
</html>