
<!DOCTYPE html>
<!--[if IE 6]>
<html id="ie6" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 7]>
<html id="ie7" dir="ltr" lang="en-US">
<![endif]-->
<!--[if IE 8]>
<html id="ie8" dir="ltr" lang="en-US">
<![endif]-->
<!--[if !(IE 6) | !(IE 7) | !(IE 8)  ]><!-->
<html dir="ltr" lang="en-US">
<!--<![endif]-->
<head>
  <link rel="shortcut icon" href="/favicon.ico" />
  <meta charset="UTF-8" />
  <meta name="viewport" content="width=device-width" />
  <title>
     Spark Release 0.8.1 | Apache Spark
    
  </title>

  <link rel="stylesheet" type="text/css" media="all" href="/css/style.css" />
  <link rel="stylesheet" href="/css/pygments-default.css">

  <script type="text/javascript">
  <!-- Google Analytics initialization -->
  var _gaq = _gaq || [];
  _gaq.push(['_setAccount', 'UA-32518208-2']);
  _gaq.push(['_trackPageview']);
  (function() {
    var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
    ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
    var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
  })();

  <!-- Adds slight delay to links to allow async reporting -->
  function trackOutboundLink(link, category, action) {  
    try { 
      _gaq.push(['_trackEvent', category , action]); 
    } catch(err){}
 
    setTimeout(function() {
      document.location.href = link.href;
    }, 100);
  }
  </script>

  <link rel='canonical' href='/index.html' />

  <style type="text/css">
    #site-title,
    #site-description {
      position: absolute !important;
      clip: rect(1px 1px 1px 1px); /* IE6, IE7 */
      clip: rect(1px, 1px, 1px, 1px);
    }
  </style>
  <style type="text/css" id="custom-background-css">
    body.custom-background { background-color: #f1f1f1; }
  </style>
</head>

<!--body class="page singular"-->
<body class="singular">
<div id="page" class="hfeed">

  <header id="branding" role="banner">
  <hgroup>
    <h1 id="site-title"><span><a href="/" title="Spark" rel="home">Spark</a></span></h1>
    <h2 id="site-description">Lightning-Fast Cluster Computing</h2>
  </hgroup>

  <a id="main-logo" href="/">
    <img style="height:175px; width:auto;" src="/images/spark-project-header1-cropped.png" alt="Spark: Lightning-Fast Cluster Computing" title="Spark: Lightning-Fast Cluster Computing" />
  </a>
  <div class="widget-summit">
    <a href="http://spark-summit.org"><img src="/images/Summit-Logo-FINALtr-150x150px.png" /></a>
    <div class="text">
      <a href="http://spark-summit.org/2013">
        
        <strong>Videos and Slides<br/>
        Available Now!</strong>
      </a>
    </div>
  </div>

  <nav id="access" role="navigation">
    <h3 class="assistive-text">Main menu</h3>
    <div class="menu-main-menu-container">
      <ul id="menu-main-menu" class="menu">
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/index.html">Home</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/downloads.html">Downloads</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/documentation.html">Documentation</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/examples.html">Examples</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/mailing-lists.html">Mailing Lists</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/research.html">Research</a>
        </li>
        
        <li class="menu-item menu-item-type-post_type menu-item-object-page ">
          <a href="/faq.html">FAQ</a>
        </li>
        
      </ul></div>
  </nav><!-- #access -->
</header><!-- #branding -->


  <div id="main">
    <div id="primary">
      <div id="content" role="main">
        
          <article class="page type-page status-publish hentry">
            <h2>Spark Release 0.8.1</h2>


<p>Apache Spark 0.8.1 is a maintenance and performance release for the Scala 2.9 version of Spark. It also adds several new features, such as high availability for standalone mode, that will appear in Spark 0.9 but that developers also wanted to have available on Scala 2.9. Contributions to 0.8.1 came from 41 developers.</p>

<h3 id="yarn-22-support">YARN 2.2 Support</h3>
<p>Support has been added for running Spark on YARN 2.2 and newer. Due to a change in the YARN API between previous versions and 2.2+, this was not supported in Spark 0.8.0. See the <a href="/docs/0.8.1/running-on-yarn.html">YARN documentation</a> for specific instructions on how to build Spark for YARN 2.2+. We’ve also included a pre-compiled binary for YARN 2.2.</p>

<h3 id="high-availability-mode-for-standalone-cluster-manager">High Availability Mode for Standalone Cluster Manager</h3>
<p>The standalone cluster manager now has a high availability (H/A) mode that can tolerate master failures. This is particularly useful for long-running applications such as streaming jobs and the Shark server, where the scheduler master previously represented a single point of failure. Instructions for deploying H/A mode are included <a href="/docs/0.8.1/spark-standalone.html#high-availability">in the documentation</a>. The current implementation uses ZooKeeper for coordination.</p>
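
<p>As a rough sketch of what an H/A deployment might look like, the fragment below assumes that recovery is configured through the <code>spark.deploy.recoveryMode</code> and <code>spark.deploy.zookeeper.url</code> properties, passed to the master daemons via <code>SPARK_DAEMON_JAVA_OPTS</code> in <code>conf/spark-env.sh</code>. The ZooKeeper host names are hypothetical placeholders; the linked documentation remains the authoritative reference.</p>
<pre><code># conf/spark-env.sh on each master (sketch; zk1/zk2 host names are hypothetical)
export SPARK_DAEMON_JAVA_OPTS="-Dspark.deploy.recoveryMode=ZOOKEEPER -Dspark.deploy.zookeeper.url=zk1.example.com:2181,zk2.example.com:2181"
</code></pre>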

<h3 id="performance-optimizations">Performance Optimizations</h3>
<p>This release adds several performance optimizations:</p>

<ul>
  <li>Optimized hashtables for shuffle data - reduces memory and CPU consumption</li>
  <li>Efficient encoding for JobConfs - improves latency for stages reading large numbers of blocks from HDFS, S3, and HBase</li>
  <li>Shuffle file consolidation (off by default) - reduces the number of files created in large shuffles for better filesystem performance. This change works best on filesystems newer than ext3 (we recommend ext4 or XFS). It will be the default in Spark 0.9, but we’ve left it off in this release for compatibility. Unless you are using ext3, we recommend turning it on by setting <code>spark.shuffle.consolidateFiles</code> to “true”.</li>
  <li>Torrent broadcast (off by default) - a faster broadcast implementation for large objects.</li>
  <li>Support for fetching large result sets - allows tasks to return large results without tuning Akka buffer sizes.</li>
</ul>
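
<p>A minimal sketch of turning on shuffle file consolidation, assuming Spark properties are supplied as Java system properties through <code>SPARK_JAVA_OPTS</code> in <code>conf/spark-env.sh</code>; your deployment may set properties differently, so treat this as illustrative only.</p>
<pre><code># conf/spark-env.sh (sketch; appends to any options already set)
export SPARK_JAVA_OPTS="$SPARK_JAVA_OPTS -Dspark.shuffle.consolidateFiles=true"
</code></pre>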

<h3 id="mllib-improvements">MLlib Improvements</h3>
<ul>
  <li>Added a new variant of Alternating Least Squares matrix factorization for implicit feedback.</li>
</ul>

<h3 id="python-improvements">Python Improvements</h3>
<ul>
  <li>It is now possible to set Spark config properties directly from Python</li>
  <li>Python now supports sort operations</li>
  <li>Accumulators now have an explicitly named <code>add</code> method</li>
</ul>

<h3 id="new-operators-and-usability-improvements">New Operators and Usability Improvements</h3>
<ul>
  <li><code>local://</code> URIs - allow users to specify files already present on slaves as dependencies</li>
  <li>A new “result fetching” state has been added to the UI</li>
  <li>New Spark Streaming operators: <code>transformWith</code>, <code>leftOuterJoin</code>, <code>rightOuterJoin</code></li>
  <li>New Spark operators: <code>repartition</code></li>
  <li>You can now run Spark applications as a different user in standalone and Mesos modes</li>
</ul>

<h3 id="notable-bug-fixes">Notable Bug Fixes</h3>
<ul>
  <li>Fixed an edge case that could cause data loss for Kafka ingest to Spark Streaming</li>
  <li>Fixed a scheduler hang that occurred after certain task failures</li>
  <li>Fixed a packaging bug that prevented log output during streaming examples</li>
  <li>Sorting order has been fixed in certain UI fields</li>
</ul>

<h3 id="credits">Credits</h3>

<ul>
  <li>Michael Armbrust – build fix</li>
  <li>Pierre Borckmans – typo fix in documentation</li>
  <li>Evan Chan – <code>local://</code> scheme for dependency jars</li>
  <li>Ewen Cheslack-Postava – <code>add</code> method for Python accumulators, support for setting config properties in Python</li>
  <li>Mosharaf Chowdhury – optimized broadcast implementation</li>
  <li>Frank Dai – documentation fix</li>
  <li>Aaron Davidson – shuffle file consolidation, H/A mode for standalone scheduler, cleaned up representation of block IDs, several improvements and bug fixes</li>
  <li>Tathagata Das – new streaming operators, fix for Kafka concurrency bug</li>
  <li>Ankur Dave – support for pausing spot clusters on EC2</li>
  <li>Harvey Feng – optimization to JobConf broadcasts, bug fixes, YARN 2.2 build</li>
  <li>Ali Ghodsi – YARN 2.2 build</li>
  <li>Thomas Graves – Spark YARN integration including secure HDFS access over YARN</li>
  <li>Li Guoqiang – fix for Maven build</li>
  <li>Stephen Haberman – bug fix</li>
  <li>Haidar Hadi – documentation fix</li>
  <li>Nathan Howell – bug fix relating to YARN</li>
  <li>Holden Karau – Java version of <code>mapPartitionsWithIndex</code></li>
  <li>Du Li – bug fix in make-distribution.sh</li>
  <li>Raymond Liu – work on YARN 2.2 build</li>
  <li>Xi Liu – bug fix and code clean-up</li>
  <li>David McCauley – bug fix in standalone mode JSON output</li>
  <li>Michael (wannabeast) – bug fix in memory store</li>
  <li>Fabrizio Milo – typos in documentation, clean-up in DAGScheduler, typo in scaladoc</li>
  <li>Mridul Muralidharan – fixes to metadata cleaner and speculative execution</li>
  <li>Sundeep Narravula – build fix, bug fixes in scheduler and tests, code clean-up</li>
  <li>Kay Ousterhout – optimized result fetching, new information in UI, scheduler clean-up and bug fixes</li>
  <li>Nick Pentreath – implicit feedback variant of ALS algorithm</li>
  <li>Imran Rashid – improvement to executor launch</li>
  <li>Ahir Reddy – Spark support for SIMR</li>
  <li>Josh Rosen – memory use optimization, clean up of BlockManager code, Java and Python clean-up/fixes</li>
  <li>Henry Saputra – build fix</li>
  <li>Jerry Shao – refactoring of fair scheduler, support for running Spark as a specific user, bug fix</li>
  <li>Mingfei Shi – documentation for JobLogger</li>
  <li>Andre Schumacher – sortByKey in PySpark and associated changes</li>
  <li>Karthik Tunga – bug fix in launch script</li>
  <li>Patrick Wendell – <code>repartition</code> operator, shuffle write metrics, various fixes and release management</li>
  <li>Neal Wiggins – import clean-up, documentation fixes</li>
  <li>Andrew Xia – bug fix in UI</li>
  <li>Reynold Xin – task killing, support for setting job properties in Spark shell, logging improvements, Kryo improvements, several bug fixes</li>
  <li>Matei Zaharia – optimized hashmap for shuffle data, PySpark documentation, optimizations to Kryo serializer</li>
  <li>Wu Zeming – bug fix in executors UI</li>
</ul>

<p>Thanks to everyone who contributed!</p>

          </article><!-- #post -->
        
      </div><!-- #content -->
      
      <footer id="colophon" role="contentinfo">
  <div id="site-generator">
    <p style="padding-top: 0; padding-bottom: 15px;">
      Apache Spark is an effort undergoing incubation at The Apache Software Foundation.
      <a href="http://incubator.apache.org/" style="border: none;">
        <img style="vertical-align: middle; border: none;" src="/images/incubator-logo.png" alt="Apache Incubator" title="Apache Incubator" />
      </a>  
    </p>
  </div>
</footer><!-- #colophon -->

    </div><!-- #primary -->
  </div><!-- #main -->
</div><!-- #page -->


</body>
</html>