author    Andy Konwinski <andrew@apache.org>    2013-08-23 17:17:53 +0000
committer Andy Konwinski <andrew@apache.org>    2013-08-23 17:17:53 +0000
commit    81d6089b47ec4d3e7fe17074f3b5fadec8070071 (patch)
tree      1401e9f4bc6e1b9f4596ebecc5b7332d9ed96f3a /releases
parent    71bac61ea11df8144a9a3d2be75ef996517b136d (diff)
Initial port of Spark website from spark-project.org wordpress to Jekyll.
Diffstat (limited to 'releases')
-rw-r--r--    releases/_posts/2011-07-14-spark-release-0-3.md      62
-rw-r--r--    releases/_posts/2012-06-12-spark-release-0-5-0.md    36
-rw-r--r--    releases/_posts/2012-10-11-spark-release-0-5-1.md    46
-rw-r--r--    releases/_posts/2012-10-15-spark-release-0-6-0.md    90
-rw-r--r--    releases/_posts/2012-11-22-spark-release-0-5-2.md    15
-rw-r--r--    releases/_posts/2012-11-22-spark-release-0-6-1.md    30
-rw-r--r--    releases/_posts/2013-02-07-spark-release-0-6-2.md    43
-rw-r--r--    releases/_posts/2013-02-27-spark-release-0-7-0.md    112
-rw-r--r--    releases/_posts/2013-06-02-spark-release-0-7-2.md    56
-rw-r--r--    releases/_posts/2013-07-16-spark-release-0-7-3.md    49
10 files changed, 539 insertions(+), 0 deletions(-)
diff --git a/releases/_posts/2011-07-14-spark-release-0-3.md b/releases/_posts/2011-07-14-spark-release-0-3.md
new file mode 100644
index 000000000..4238398f4
--- /dev/null
+++ b/releases/_posts/2011-07-14-spark-release-0-3.md
@@ -0,0 +1,62 @@
+---
+layout: post
+title: Spark Release 0.3
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+---
+Spark 0.3 brings a variety of new features. You can download it for either <a href="https://github.com/mesos/spark/tarball/0.3-scala-2.9">Scala 2.9</a> or <a href="https://github.com/mesos/spark/tarball/0.3-scala-2.8">Scala 2.8</a>.
+
+<h3>Scala 2.9 Support</h3>
+
+This is the first release to support Scala 2.9 in addition to 2.8. Future releases are likely to be 2.9-only unless there is high demand for 2.8.
+
+<h3>Save Operations</h3>
+
+You can now save distributed datasets to the Hadoop filesystem (HDFS), Amazon S3, Hypertable, and any other storage system supported by Hadoop. There are convenience methods for several common formats, like text files and SequenceFiles. For example, to save a dataset as text:
+
+<div class="code">
+<span class="keyword">val</span> numbers = spark.parallelize(1 to 100)<br> numbers.<span class="sparkop">saveAsTextFile</span>(<span class="string">"hdfs://..."</span>)
+</div>
+
+<h3>Native Types for SequenceFiles</h3>
+
+When working with SequenceFiles, which store objects that implement Hadoop's Writable interface, Spark now lets you use native types for certain common Writable types, like IntWritable and Text. For example:
+
+<div class="code">
+<span class="comment">// Will read a SequenceFile of (IntWritable, Text)</span><br>
+<span class="keyword">val</span> data = spark.sequenceFile[Int, String](<span class="string">"hdfs://..."</span>)
+</div>
+
+Similarly, you can save datasets of basic types directly as SequenceFiles:
+
+<div class="code">
+<span class="comment">// Will write a SequenceFile of (IntWritable, IntWritable)</span><br>
+<span class="keyword">val</span> squares = spark.parallelize(1 to 100).<span class="sparkop">map</span>(<span class="closure">n =&gt; (n, n*n)</span>)<br>
+squares.<span class="sparkop">saveAsSequenceFile</span>(<span class="string">"hdfs://..."</span>)
+</div>
+
+<h3>Maven Integration</h3>
+
+Spark now fetches dependencies via Maven and can publish Maven artifacts for easier dependency management.
+
+<h3>Faster Broadcast &amp; Shuffle</h3>
+
+This release includes broadcast and shuffle algorithms from <a href="http://www.mosharaf.com/wp-content/uploads/orchestra-sigcomm11.pdf">this paper</a> to better support applications that communicate large amounts of data.
+
+<h3>Support for Non-Filesystem Hadoop Input Formats</h3>
+
+The new <tt>SparkContext.hadoopRDD</tt> method allows reading data from Hadoop-compatible storage systems other than file systems, such as HBase, Hypertable, etc.
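+
+As a rough sketch of what such a call can look like (here <tt>spark</tt> is a SparkContext as in the examples above; <tt>MyInputFormat</tt>, <tt>MyKey</tt>, and <tt>MyValue</tt> are placeholders for whatever Hadoop InputFormat and key/value types your storage system provides, not names from this release):
+
+<div class="code">
+<span class="keyword">val</span> conf = <span class="keyword">new</span> org.apache.hadoop.mapred.JobConf()<br>
+<span class="comment">// configure conf for the target system, e.g. table name and columns</span><br>
+<span class="keyword">val</span> records = spark.<span class="sparkop">hadoopRDD</span>(conf, classOf[MyInputFormat], classOf[MyKey], classOf[MyValue])
+</div>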
+
+<h3>Other Features</h3>
+
+<ul>
+ <li>Outer join operators (<tt>leftOuterJoin</tt>, <tt>rightOuterJoin</tt>, etc).</li>
+ <li>Support for Scala 2.9 interpreter features (history search, Ctrl-C current line, etc) in the 2.9 version.</li>
+ <li>Better default levels of parallelism for various operations.</li>
+ <li>Ability to control number of splits in a file.</li>
+ <li>Various bug fixes.</li>
+</ul>
diff --git a/releases/_posts/2012-06-12-spark-release-0-5-0.md b/releases/_posts/2012-06-12-spark-release-0-5-0.md
new file mode 100644
index 000000000..df27a0996
--- /dev/null
+++ b/releases/_posts/2012-06-12-spark-release-0-5-0.md
@@ -0,0 +1,36 @@
+---
+layout: post
+title: Spark Release 0.5.0
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '1'
+---
+Spark 0.5.0 brings several new features and sets the stage for some big changes coming this summer as we incorporate code from the <a href="http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf">Spark Streaming</a> project. You can download it as a <a href="https://github.com/mesos/spark/zipball/v0.5.0">zip</a> or <a href="https://github.com/mesos/spark/tarball/v0.5.0">tar.gz</a>.
+
+<h3>Mesos 0.9 Support</h3>
+
+This release runs on <a href="http://www.mesosproject.org/">Apache Mesos 0.9</a>, the first Apache Incubator release of Mesos, which contains significant usability and stability improvements. Most notable are better memory accounting for applications with long-term memory use, easier access to old jobs' traces and logs (by keeping a history of executed tasks on the web UI), and simpler installation.
+
+<h3>Performance Improvements</h3>
+Spark's scheduling is more communication-efficient when sending out operations on RDDs with large lineage graphs. In addition, the cache replacement policy has been improved to replace data more intelligently when an RDD does not fit in the cache, shuffles are more efficient, and the serializer used for shipping closures is now configurable, making it possible to use libraries faster than Java serialization for this purpose.
+
+<h3>Debug Improvements</h3>
+
+Spark now reports exceptions on the worker nodes back to the master, so you can see them all in one log file. It also automatically marks and filters duplicate errors.
+
+<h3>New Operators</h3>
+
+These include <tt>sortByKey</tt> for parallel sorting, <tt>takeSample</tt>, and more efficient <tt>fold</tt> and <tt>aggregate</tt> operators. In addition, more of the old operators make use of, and retain, RDD partitioning information to reduce communication cost. For example, if you <tt>join</tt> two hash-partitioned RDDs that were partitioned in the same way, Spark will not shuffle any data across the network.
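+
+For example, a quick sketch of the new operators (<tt>sc</tt> is a SparkContext; the signatures are sketched from memory, so treat this as illustrative):
+
+<div class="code">
+<span class="keyword">import</span> spark.SparkContext._  <span class="comment">// enables operations on RDDs of pairs</span><br>
+<span class="keyword">val</span> pairs = sc.parallelize(Seq((<span class="string">"b"</span>, 2), (<span class="string">"a"</span>, 1), (<span class="string">"c"</span>, 3)))<br>
+<span class="keyword">val</span> sorted = pairs.<span class="sparkop">sortByKey</span>()  <span class="comment">// parallel sort by key</span><br>
+<span class="keyword">val</span> sample = pairs.<span class="sparkop">takeSample</span>(<span class="keyword">false</span>, 2, 42)  <span class="comment">// 2 elements, no replacement, seed 42</span><br>
+<span class="keyword">val</span> total = pairs.map(_._2).<span class="sparkop">fold</span>(0)(_ + _)  <span class="comment">// 6</span>
+</div>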
+
+<h3>EC2 Launch Script Improvements</h3>
+
+Spark's EC2 launch scripts are now included in the main package, and have the ability to discover and use the latest Spark AMI automatically instead of launching a hardcoded machine image ID.
+
+<h3>New Hadoop API Support</h3>
+
+You can now use Spark to read and write data to storage formats in the new <tt>org.apache.hadoop.mapreduce</tt> packages (the "new Hadoop" API). In addition, this release fixes an issue caused by an HDFS initialization bug in some recent versions of HDFS.
diff --git a/releases/_posts/2012-10-11-spark-release-0-5-1.md b/releases/_posts/2012-10-11-spark-release-0-5-1.md
new file mode 100644
index 000000000..c5c935ed6
--- /dev/null
+++ b/releases/_posts/2012-10-11-spark-release-0-5-1.md
@@ -0,0 +1,46 @@
+---
+layout: post
+title: Spark Release 0.5.1
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '1'
+---
+Spark 0.5.1 is a maintenance release that adds several important bug fixes and usability features. You can download it as a <a href="http://github.com/downloads/mesos/spark/spark-0.5.1.tgz">tar.gz file</a>.
+
+<h3>Maven Publishing</h3>
+
+Spark is now available in Maven Central, making it easier to link it into your programs without having to build the JAR yourself. Use the following Maven identifiers to add it to a project:
+<ul>
+ <li>groupId: org.spark-project</li>
+ <li>artifactId: spark-core_2.9.2</li>
+ <li>version: 0.5.1</li>
+</ul>
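+
+For example, with sbt you would add something like the following line (assuming your build already resolves from Maven Central):
+
+<div class="code">
+libraryDependencies += <span class="string">"org.spark-project"</span> % <span class="string">"spark-core_2.9.2"</span> % <span class="string">"0.5.1"</span>
+</div>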
+
+<h3>Scala 2.9.2</h3>
+
+Spark now builds against Scala 2.9.2 by default.
+
+<h3>Improved Accumulators</h3>
+
+The new Accumulable class generalizes Accumulators to the case where the type being accumulated is not the same as the type of the elements being added (e.g. you wish to accumulate a collection, such as a Set, by adding individual elements). The interface is also more efficient, avoiding the creation of temporary objects. (Contributed by Imran Rashid.)
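+
+A hedged sketch of accumulating a Set of words by adding individual elements (<tt>sc</tt> is a SparkContext; the AccumulableParam method names follow the API as documented in later releases and may differ slightly here):
+
+<div class="code">
+<span class="keyword">import</span> spark.AccumulableParam<br>
+<span class="keyword">object</span> SetParam <span class="keyword">extends</span> AccumulableParam[Set[String], String] {<br>
+&nbsp;&nbsp;<span class="keyword">def</span> addAccumulator(acc: Set[String], elem: String) = acc + elem  <span class="comment">// add one element</span><br>
+&nbsp;&nbsp;<span class="keyword">def</span> addInPlace(s1: Set[String], s2: Set[String]) = s1 ++ s2  <span class="comment">// merge partial sets</span><br>
+&nbsp;&nbsp;<span class="keyword">def</span> zero(initial: Set[String]) = Set.empty[String]<br>
+}<br>
+<span class="keyword">val</span> uniqueWords = sc.<span class="sparkop">accumulable</span>(Set.empty[String])(SetParam)<br>
+sc.parallelize(Seq(<span class="string">"a"</span>, <span class="string">"b"</span>, <span class="string">"a"</span>)).foreach(w =&gt; uniqueWords += w)<br>
+println(uniqueWords.value)  <span class="comment">// Set(a, b)</span>
+</div>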
+
+<h3>Bug Fixes</h3>
+
+<ul>
+ <li>Spark's algorithm for estimating the sizes of objects (in order to manage memory correctly) has been improved
+ to handle JVMs with 32- vs 64-bit pointers and to measure objects more accurately. (Contributed by Shivaram Venkataraman.)</li>
+ <li>Improved algorithms for taking random samples out of datasets to avoid biases that could occur in the previous ones. (Suggested by Henry Milner.)</li>
+ <li>Improved load balancing across nodes in sort operations.</li>
+ <li>Fixed a shuffle bug that could cause reduce tasks to fail to receive a map task's full output.</li>
+ <li>Fixed a bug with locating custom KryoSerializers.</li>
+ <li>Reduced memory consumption of <tt>saveAsObjectFile</tt> when objects are large.</li>
+</ul>
+
+<h3>EC2 Improvements</h3>
+
+Spark's EC2 launch script now configures Spark's memory limit automatically based on the machine's available RAM.
diff --git a/releases/_posts/2012-10-15-spark-release-0-6-0.md b/releases/_posts/2012-10-15-spark-release-0-6-0.md
new file mode 100644
index 000000000..fb17f3037
--- /dev/null
+++ b/releases/_posts/2012-10-15-spark-release-0-6-0.md
@@ -0,0 +1,90 @@
+---
+layout: post
+title: Spark Release 0.6.0
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '4'
+---
+Spark 0.6.0 is a major release that brings several new features, architectural changes, and performance enhancements. The most visible additions are a standalone deploy mode, a Java API, and expanded documentation; but there are also numerous other changes under the hood, which improve performance in some cases by as much as 2x.
+
+You can download this release as either a <a href="http://github.com/downloads/mesos/spark/spark-0.6.0-sources.tar.gz">source package</a> (2 MB tar.gz) or <a href="http://github.com/downloads/mesos/spark/spark-0.6.0-prebuilt.tar.gz">prebuilt package</a> (48 MB tar.gz).
+
+<h3>Simpler Deployment</h3>
+
+In addition to running on Mesos, Spark now has a <a href="/docs/0.6.0/spark-standalone.html">standalone deploy mode</a> that lets you quickly launch a cluster without installing an external cluster manager. The standalone mode requires only Java installed on each machine, with Spark deployed alongside it.
+
+In addition, there is <a href="/docs/0.6.0/running-on-yarn.html">experimental support for running on YARN</a> (Hadoop NextGen), currently in a separate branch.
+
+<h3>Java API</h3>
+
+Java programmers can now use Spark through a new <a href="/docs/0.6.0/java-programming-guide.html">Java API layer</a>. This layer makes available all of Spark's features, including parallel transformations, distributed datasets, broadcast variables, and accumulators, in a Java-friendly manner.
+
+<h3>Expanded Documentation</h3>
+
+Spark's <a href="/docs/0.6.0/">documentation</a> has been expanded with a new <a href="/docs/0.6.0/quick-start.html">quick start guide</a>, additional deployment instructions, configuration guide, tuning guide, and improved <a href="/docs/0.6.0/api/core">Scaladoc</a> API documentation.
+
+<h3>Engine Changes</h3>
+
+Under the hood, Spark 0.6 has new, custom storage and communication layers brought in from the upcoming <a href="http://www.cs.berkeley.edu/~matei/papers/2012/hotcloud_spark_streaming.pdf">Spark Streaming</a> project. These can improve performance over past versions by as much as 2x. Specifically:
+
+<ul>
+ <li>A new communication manager using asynchronous Java NIO lets shuffle operations run faster, especially when sending large amounts of data or when jobs have many tasks.</li>
+  <li>A new storage manager supports per-dataset storage level settings (e.g. whether to keep the dataset in memory, in serialized or deserialized form, on disk, or even replicated across nodes).</li>
+ <li>Spark's scheduler and control plane have been optimized to better support ultra-low-latency jobs (under 500ms) and high-throughput scheduling decisions.</li>
+</ul>
+
+<h3>New APIs</h3>
+
+<ul>
+  <li>This release adds the ability to control caching strategies on a per-RDD level, so that different RDDs may be stored in memory, on disk, as serialized bytes, etc. You can choose your storage level using the <a href="/docs/0.6.0/scala-programming-guide.html#rdd-persistence"><tt>persist()</tt> method</a> on RDD (see the sketch after this list).</li>
+ <li>A new Accumulable class generalizes Accumulators for the case when the type being accumulated is not the same as the types of elements being added (e.g. you wish to accumulate a collection, such as a Set, by adding individual elements).</li>
+ <li>You can now dynamically add files or JARs that should be shipped to your workers with <tt>SparkContext.addFile/Jar</tt>.</li>
+ <li>More Spark operators (e.g. joins) support custom partitioners.</li>
+</ul>
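+
+A short sketch combining these APIs (<tt>sc</tt> is a SparkContext; the storage level constant and file paths are illustrative, so check the linked persistence docs for the exact names):
+
+<div class="code">
+<span class="keyword">import</span> spark.storage.StorageLevel<br>
+<span class="keyword">val</span> logs = sc.textFile(<span class="string">"hdfs://..."</span>).<span class="sparkop">persist</span>(StorageLevel.DISK_ONLY)  <span class="comment">// keep this RDD on disk only</span><br>
+sc.<span class="sparkop">addFile</span>(<span class="string">"hdfs://.../lookup.txt"</span>)  <span class="comment">// ship a data file to the workers</span><br>
+sc.<span class="sparkop">addJar</span>(<span class="string">"/path/to/deps.jar"</span>)  <span class="comment">// ship a JAR to the workers</span>
+</div>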
+
+<h3>Enhanced Debugging</h3>
+
+Spark's logs now indicate which operation in your program each RDD and job belongs to, making it easier to trace problems back to the parts of your code responsible for them.
+
+<h3>Maven Artifacts</h3>
+
+Spark is now available in Maven Central, making it easier to link it into your programs without having to build the JAR yourself. Use the following Maven identifiers to add it to a project:
+
+<ul>
+ <li>groupId: org.spark-project</li>
+ <li>artifactId: spark-core_2.9.2</li>
+ <li>version: 0.6.0</li>
+</ul>
+
+<h3>Compatibility</h3>
+
+This release is source-compatible with Spark 0.5 programs, but you will need to recompile them against 0.6. In addition, the configuration for caching has changed: instead of having a <tt>spark.cache.class</tt> parameter that sets one caching strategy for all RDDs, you can now set a <a href="/docs/0.6.0/scala-programming-guide.html#rdd-persistence">per-RDD storage level</a>. Spark will warn if you try to set <tt>spark.cache.class</tt>.
+
+<h3>Credits</h3>
+
+Spark 0.6 was the work of a large set of new contributors from Berkeley and outside.
+
+<ul>
+ <li>Tathagata Das contributed the new communication layer, and parts of the storage layer.</li>
+ <li>Haoyuan Li contributed the new storage manager.</li>
+ <li>Denny Britz contributed the YARN deploy mode, key aspects of the standalone one, and several other features.</li>
+ <li>Andy Konwinski contributed the revamped documentation site, Maven publishing, and several API docs.</li>
+ <li>Josh Rosen contributed the Java API, and several bug fixes.</li>
+ <li>Patrick Wendell contributed the enhanced debugging feature and helped with testing and documentation.</li>
+ <li>Reynold Xin contributed numerous bug and performance fixes.</li>
+ <li>Imran Rashid contributed the new Accumulable class.</li>
+ <li>Harvey Feng contributed improvements to shuffle operations.</li>
+ <li>Shivaram Venkataraman improved Spark's memory estimation and wrote a memory tuning guide.</li>
+  <li>Ravi Pandya contributed Spark run scripts for Windows.</li>
+  <li>Mosharaf Chowdhury provided several fixes to broadcast.</li>
+ <li>Henry Milner pointed out several bugs in sampling algorithms.</li>
+ <li>Ray Racine provided improvements to the EC2 scripts.</li>
+ <li>Paul Ruan and Bill Zhao helped with testing.</li>
+</ul>
+
+<p style="padding-top:5px;">Thanks also to all the Spark users who have diligently suggested features or reported bugs.</p>
diff --git a/releases/_posts/2012-11-22-spark-release-0-5-2.md b/releases/_posts/2012-11-22-spark-release-0-5-2.md
new file mode 100644
index 000000000..794f3521d
--- /dev/null
+++ b/releases/_posts/2012-11-22-spark-release-0-5-2.md
@@ -0,0 +1,15 @@
+---
+layout: post
+title: Spark Release 0.5.2
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '1'
+---
+Spark 0.5.2 is a minor release, whose main addition is to allow Spark to compile against Hadoop 2 distributions. To do this, edit <code>project/SparkBuild.scala</code> and change both the <code>HADOOP_VERSION</code> and <code>HADOOP_MAJOR_VERSION</code> variables, then recompile Spark. This change was contributed by Thomas Dudziak.
+
+You can download Spark 0.5.2 as a <a href="https://github.com/downloads/mesos/spark/spark-0.5.2-sources.tgz">tar.gz file</a> (2 MB).
diff --git a/releases/_posts/2012-11-22-spark-release-0-6-1.md b/releases/_posts/2012-11-22-spark-release-0-6-1.md
new file mode 100644
index 000000000..6a0799429
--- /dev/null
+++ b/releases/_posts/2012-11-22-spark-release-0-6-1.md
@@ -0,0 +1,30 @@
+---
+layout: post
+title: Spark Release 0.6.1
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '4'
+---
+Spark 0.6.1 is a maintenance release that contains several important bug fixes and performance improvements. You can download it as a <a href="https://github.com/downloads/mesos/spark/spark-0.6.1-sources.tgz">source package</a> (2.4 MB tar.gz) or <a href="https://github.com/downloads/mesos/spark/spark-0.6.1-prebuilt.tgz">prebuilt package</a> (48 MB tar.gz).
+
+The fixes and improvements in this version include:
+<ul>
+ <li>Fixed overly aggressive message timeouts that could cause workers to disconnect from the cluster</li>
+  <li>Fixed a bug in the standalone deploy mode that did not expose hostnames to the scheduler, affecting HDFS locality</li>
+ <li>Improved connection reuse in shuffle, which can greatly speed up small shuffles (contributed by Reynold Xin)</li>
+ <li>Fixed some potential deadlocks in the block manager (contributed by Tathagata Das)</li>
+ <li>Fixed a bug getting IDs of failed hosts from Mesos (contributed by Imran Rashid)</li>
+ <li>Several EC2 script improvements, like better handling of spot instances (contributed by Josh Rosen)</li>
+ <li>Made the local IP address that Spark binds to customizable (contributed by Mikhail Bautin)</li>
+ <li>Support for Hadoop 2 distributions (contributed by Thomas Dudziak)</li>
+ <li>Support for locating Scala on Debian distributions (contributed by Thomas Dudziak)</li>
+ <li>Improved standalone cluster web UI to show more information about jobs</li>
+ <li>Added an option to spread out jobs over the standalone cluster instead of concentrating them on a small number of nodes (<code>spark.deploy.spreadOut</code>)</li>
+</ul>
+
+We recommend that all Spark 0.6 users update to this maintenance release.
diff --git a/releases/_posts/2013-02-07-spark-release-0-6-2.md b/releases/_posts/2013-02-07-spark-release-0-6-2.md
new file mode 100644
index 000000000..e44c72a70
--- /dev/null
+++ b/releases/_posts/2013-02-07-spark-release-0-6-2.md
@@ -0,0 +1,43 @@
+---
+layout: post
+title: Spark Release 0.6.2
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '4'
+ _wpas_done_all: '1'
+---
+Spark 0.6.2 is a maintenance release that contains several bug fixes and usability improvements. You can download it as a <a href="http://spark-project.org/files/spark-0.6.2-sources.tgz">source package</a> (2.5 MB tar.gz) or <a href="http://spark-project.org/files/spark-0.6.2-prebuilt.tgz">prebuilt package</a> (48 MB tar.gz).
+
+We recommend that all Spark 0.6 users update to this maintenance release.
+
+The fixes and improvements in this version include:
+<ul>
+ <li>A number of fault tolerance fixes regarding detecting dead nodes, handling missing map output fetches, and allowing failed nodes to rejoin the cluster</li>
+ <li>Documentation fixes that clarify the configuration for the standalone mode and improve the quick start instructions</li>
+ <li>A connection reuse bug fix that improves shuffle performance</li>
+ <li>Support for launching a cluster across multiple availability zones in the EC2 scripts</li>
+ <li>Support for deleting security groups when an EC2 cluster is terminated</li>
+  <li>Improved memory configuration for the standalone deploy cluster daemons: instead of using <code>SPARK_MEM</code> for their memory, which often led people to give them much more memory than they intended, they now use a separate variable, <code>SPARK_DAEMON_MEMORY</code>, with a reasonable default of 512 MB</li>
+ <li>Fixes to the Windows run scripts for Spark</li>
+ <li>Better detection of a machine's external IP address</li>
+ <li>Several small optimizations and bug fixes</li>
+</ul>
+
+In total, eleven people contributed to this release:
+<ul>
+ <li>Stephen Haberman (bug fix)</li>
+ <li>Shane Huang (shuffle fix)</li>
+ <li>Fernand Pajot (bug fix)</li>
+ <li>Andrew Psaltis (bug fix)</li>
+ <li>Imran Rashid (standalone cluster, bug fix)</li>
+ <li>Charles Reiss (fault recovery fixes, node re-registration, tests)</li>
+ <li>Josh Rosen (fault recovery, Java API fixes, deploy scripts)</li>
+ <li>Peter Sankauskas (EC2 scripts)</li>
+ <li>Lee Moon Soo (bug fix)</li>
+ <li>Patrick Wendell (bugs, docs)</li>
+ <li>Matei Zaharia (fault recovery, UI, docs, bug fixes)</li>
+</ul>
diff --git a/releases/_posts/2013-02-27-spark-release-0-7-0.md b/releases/_posts/2013-02-27-spark-release-0-7-0.md
new file mode 100644
index 000000000..ffb48a9e8
--- /dev/null
+++ b/releases/_posts/2013-02-27-spark-release-0-7-0.md
@@ -0,0 +1,112 @@
+---
+layout: post
+title: Spark Release 0.7.0
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '4'
+ _wpas_done_all: '1'
+---
+The Spark team is proud to release version 0.7.0, a new major release that brings several new features. Most notable are a <a href="/docs/0.7.0/python-programming-guide.html">Python API for Spark</a> and an <a href="/docs/0.7.0/streaming-programming-guide.html">alpha of Spark Streaming</a>. (Details on Spark Streaming can also be found in this <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf">technical report</a>.) The release also adds numerous other improvements across the board. Overall, this is our biggest release to date, with 31 contributors, of which 20 were external to Berkeley.
+
+You can download Spark 0.7.0 as either a <a href="/files/spark-0.7.0-sources.tgz">source package</a> (4 MB tar.gz) or <a href="/files/spark-0.7.0-prebuilt.tgz">prebuilt package</a> (60 MB tar.gz).
+
+<h3>Python API</h3>
+
+Spark 0.7 adds a <a href="/docs/0.7.0/python-programming-guide.html">Python API</a> called PySpark that makes it possible to use Spark from Python, both in standalone programs and in interactive Python shells. It uses the standard CPython runtime, so your programs can call into native libraries like NumPy and SciPy. Like the Scala and Java APIs, PySpark will automatically ship functions from your main program, along with the variables they depend on, to the cluster. PySpark supports most Spark features, including RDDs, accumulators, broadcast variables, and HDFS input and output.
+
+<h3>Spark Streaming Alpha</h3>
+
+Spark Streaming is a new extension of Spark that adds near-real-time processing capability. It offers a simple and high-level API, where users can transform streams using parallel operations like <tt>map</tt>, <tt>filter</tt>, <tt>reduce</tt>, and new sliding window functions. It automatically distributes work over a cluster and provides efficient fault recovery with exactly-once semantics for transformations, without relying on costly transactions to an external system. Spark Streaming is described in more detail in <a href="/talks/strata_spark_streaming.ppt">these slides</a> and <a href="http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf">our technical report</a>. This release is our first alpha of Spark Streaming, with most of the functionality implemented and APIs in Java and Scala.
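+
+As a taste of the API, here is a hedged sketch of windowed word counting; the context constructor and input-source method shown are assumptions based on later versions of the streaming API, and may differ in this alpha:
+
+<div class="code">
+<span class="keyword">import</span> spark.streaming._<br>
+<span class="keyword">import</span> spark.streaming.StreamingContext._  <span class="comment">// operations on streams of pairs</span><br>
+<span class="keyword">val</span> ssc = <span class="keyword">new</span> StreamingContext(<span class="string">"local"</span>, <span class="string">"WordCount"</span>, Seconds(1))<br>
+<span class="keyword">val</span> lines = ssc.socketTextStream(<span class="string">"localhost"</span>, 9999)  <span class="comment">// assumed input source</span><br>
+<span class="keyword">val</span> counts = lines.flatMap(_.split(<span class="string">" "</span>)).map(w =&gt; (w, 1)).<span class="sparkop">reduceByKeyAndWindow</span>(_ + _, Seconds(30), Seconds(5))  <span class="comment">// 30s window, sliding every 5s</span><br>
+counts.print()<br>
+ssc.start()
+</div>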
+
+<h3>Memory Dashboard</h3>
+
+Spark jobs now launch a web dashboard for monitoring the memory usage of each distributed dataset (RDD) in the program. Look for lines like this in your log:
+
+<tt>15:08:44 INFO BlockManagerUI: Started BlockManager web UI at http://mbk.local:63814</tt>
+
+You can also control which port to use through the <tt>spark.ui.port</tt> property.
+
+<h3>Maven Build</h3>
+
+Spark can now be built using Maven in addition to SBT. The Maven build enables easier publishing to repositories of your choice, easy selection of Hadoop versions using Maven profiles (<tt>-Phadoop1</tt> or <tt>-Phadoop2</tt>), and Debian packaging via <tt>mvn -Phadoop1,deb install</tt>.
+
+<h3>New Operations</h3>
+
+This release adds several RDD transformations, including <tt>keys</tt>, <tt>values</tt>, <tt>keyBy</tt>, <tt>subtract</tt>, <tt>coalesce</tt>, and <tt>zip</tt>. It also adds <tt>SparkContext.hadoopConfiguration</tt> to allow programs to configure Hadoop input/output settings globally across operations. Finally, it adds the <tt>RDD.toDebugString()</tt> method, which can be used to print an RDD's lineage graph for troubleshooting.
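+
+A brief sketch of a few of these (<tt>sc</tt> is a SparkContext):
+
+<div class="code">
+<span class="keyword">import</span> spark.SparkContext._  <span class="comment">// enables operations on RDDs of pairs</span><br>
+<span class="keyword">val</span> pairs = sc.parallelize(1 to 10).<span class="sparkop">keyBy</span>(_ % 3)  <span class="comment">// RDD[(Int, Int)] keyed by n mod 3</span><br>
+<span class="keyword">val</span> ks = pairs.<span class="sparkop">keys</span>  <span class="comment">// just the keys</span><br>
+<span class="keyword">val</span> fewer = pairs.<span class="sparkop">coalesce</span>(2)  <span class="comment">// shrink to 2 partitions</span><br>
+println(fewer.<span class="sparkop">toDebugString</span>())  <span class="comment">// print the lineage graph</span>
+</div>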
+
+<h3>EC2 Improvements</h3>
+
+<ul>
+ <li>Spark will now read S3 credentials from the <tt>AWS_ACCESS_KEY_ID</tt> and <tt>AWS_SECRET_ACCESS_KEY</tt> environment variables, if set, making it easier to access Amazon S3.</li>
+  <li>This release fixes a bug with S3 access that would leave streams open when they were not fully read (e.g. when calling <tt>RDD.first()</tt> or a SQL query with a limit), causing nodes to hang.</li>
+ <li>The EC2 scripts now support both standalone and Mesos clusters, and launch Ganglia on the cluster.</li>
+ <li>Spark EC2 clusters can now be spread across multiple availability zones.</li>
+</ul>
+
+<h3>Other Improvements</h3>
+
+<ul>
+ <li>Shuffle operations like <tt>groupByKey</tt> and <tt>reduceByKey</tt> now try to infer parallelism from the size of the parent RDD (unless <tt>spark.default.parallelism</tt> is set).</li>
+ <li>Several performance improvements to shuffles.</li>
+ <li>Standalone deploy cluster now spreads jobs out across machines by default, leading to better data locality.</li>
+  <li>Better error reporting when jobs aren't being launched due to insufficient resources.</li>
+ <li>Standalone deploy web UI now includes JSON endpoints for querying cluster state.</li>
+ <li>Better support for IBM JVM.</li>
+ <li>Default Hadoop version dependency updated to 1.0.4.</li>
+ <li>Improved failure handling and reporting of error messages.</li>
+ <li>Separate configuration for standalone cluster daemons and user applications.</li>
+ <li>Significant refactoring of the scheduler codebase to enable richer unit testing.</li>
+ <li>Several bug and performance fixes throughout.</li>
+</ul>
+
+<h3>Compatibility</h3>
+
+This release is API-compatible with Spark 0.6 programs, but the following features changed slightly:
+<ul>
+ <li>Parallel shuffle operations where you don't specify a level of parallelism use the number of partitions of the parent RDD instead of a constant default. However, if you set <tt>spark.default.parallelism</tt>, they will use that.</li>
+  <li><tt>SparkContext.addFile</tt>, which distributes a file to worker nodes, is no longer guaranteed to put it in the executor's working directory---instead, you can find the directory it used with <tt>SparkFiles.getRootDirectory</tt>, or locate a particular file with <tt>SparkFiles.get</tt> (see the sketch after this list). This was done to avoid cluttering the local directory when running in local mode.</li>
+</ul>
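+
+A hedged sketch of the new lookup methods (<tt>sc</tt> is a SparkContext; the file name is illustrative):
+
+<div class="code">
+<span class="keyword">import</span> spark.SparkFiles<br>
+sc.<span class="sparkop">addFile</span>(<span class="string">"hdfs://.../lookup.txt"</span>)<br>
+<span class="comment">// later, on the driver or inside a task:</span><br>
+<span class="keyword">val</span> path = SparkFiles.get(<span class="string">"lookup.txt"</span>)  <span class="comment">// absolute path to the shipped copy</span><br>
+<span class="keyword">val</span> dir = SparkFiles.getRootDirectory()  <span class="comment">// directory containing all added files</span>
+</div>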
+
+<h3>Credits</h3>
+
+Spark 0.7 was the work of many contributors from Berkeley and outside---in total, 31 different contributors, of which 20 were from outside Berkeley. Here are the people who contributed, along with areas they worked on:
+
+<ul>
+ <li>Mikhail Bautin -- Maven build</li>
+ <li>Denny Britz -- memory dashboard, streaming, bug fixes</li>
+ <li>Paul Cavallaro -- error reporting</li>
+ <li>Tathagata Das -- streaming (lead developer), 24/7 operation, bug fixes, docs</li>
+ <li>Thomas Dudziak -- Maven build, Hadoop 2 support</li>
+ <li>Harvey Feng -- bug fix</li>
+ <li>Stephen Haberman -- new RDD operations, configuration, S3 improvements, code cleanup, bug fixes</li>
+ <li>Tyson Hamilton -- JSON status endpoints</li>
+ <li>Mark Hamstra -- API improvements, docs</li>
+ <li>Michael Heuer -- docs</li>
+ <li>Shane Huang -- shuffle performance fixes</li>
+ <li>Andy Konwinski -- docs</li>
+ <li>Ryan LeCompte -- streaming</li>
+ <li>Haoyuan Li -- streaming</li>
+ <li>Richard McKinley -- build</li>
+ <li>Sean McNamara -- streaming</li>
+ <li>Lee Moon Soo -- bug fix</li>
+ <li>Fernand Pajot -- bug fix</li>
+ <li>Nick Pentreath -- Python API, examples</li>
+ <li>Andrew Psaltis -- bug fixes</li>
+ <li>Imran Rashid -- memory dashboard, bug fixes</li>
+ <li>Charles Reiss -- fault recovery fixes, code cleanup, testability, error reporting</li>
+ <li>Josh Rosen -- Python API (lead developer), EC2 scripts, bug fixes</li>
+ <li>Peter Sankauskas -- EC2 scripts</li>
+ <li>Prashant Sharma -- streaming</li>
+ <li>Shivaram Venkataraman -- EC2 scripts, optimizations</li>
+ <li>Patrick Wendell -- streaming, bug fixes, examples, docs</li>
+ <li>Reynold Xin -- optimizations, UI</li>
+ <li>Haitao Yao -- run scripts</li>
+ <li>Matei Zaharia -- streaming, fault recovery, Python API, code cleanup, bug fixes, docs</li>
+ <li>Eric Zhang -- examples</li>
+</ul>
+
+Thanks to everyone who contributed!
diff --git a/releases/_posts/2013-06-02-spark-release-0-7-2.md b/releases/_posts/2013-06-02-spark-release-0-7-2.md
new file mode 100644
index 000000000..9b1ed38f0
--- /dev/null
+++ b/releases/_posts/2013-06-02-spark-release-0-7-2.md
@@ -0,0 +1,56 @@
+---
+layout: post
+title: Spark Release 0.7.2
+categories: []
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '4'
+ _wpas_done_all: '1'
+---
+Spark 0.7.2 is a maintenance release that contains multiple bug fixes and improvements. You can download it as a <a href="http://spark-project.org/download-spark-0.7.2-sources">source package</a> (4 MB tar.gz) or get prebuilt packages for <a href="http://spark-project.org/download-spark-0.7.2-prebuilt-hadoop1">Hadoop 1 / CDH3</a> or <a href="http://spark-project.org/download-spark-0.7.2-prebuilt-cdh4">CDH 4</a> (61 MB tar.gz).
+
+
+We recommend that all users update to this maintenance release.
+
+
+The fixes and improvements in this version include:
+<ul>
+ <li>Scala version updated to 2.9.3.</li>
+ <li>Several improvements to Bagel, including performance fixes and a configurable storage level.</li>
+  <li>New API methods: subtractByKey, foldByKey, mapWith, filterWith, foreachPartition, and others (see the sketch after this list).</li>
+ <li>A new metrics reporting interface, SparkListener, to collect information about each computation stage: task lengths, bytes shuffled, etc.</li>
+ <li>Several new examples using the Java API, including K-means and computing pi.</li>
+ <li>Support for launching multiple worker instances per host in the standalone mode.</li>
+ <li>Various bug fixes across the board.</li>
+</ul>
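+
+A quick sketch of two of the new pair-RDD methods (<tt>sc</tt> is a SparkContext; treat the exact signatures as illustrative):
+
+<div class="code">
+<span class="keyword">import</span> spark.SparkContext._  <span class="comment">// enables operations on RDDs of pairs</span><br>
+<span class="keyword">val</span> pairs = sc.parallelize(Seq((<span class="string">"a"</span>, 1), (<span class="string">"b"</span>, 2), (<span class="string">"a"</span>, 3)))<br>
+<span class="keyword">val</span> sums = pairs.<span class="sparkop">foldByKey</span>(0)(_ + _)  <span class="comment">// per-key fold: (a,4), (b,2)</span><br>
+<span class="keyword">val</span> others = sc.parallelize(Seq((<span class="string">"b"</span>, 9)))<br>
+<span class="keyword">val</span> onlyA = pairs.<span class="sparkop">subtractByKey</span>(others)  <span class="comment">// keep pairs whose key is absent from others</span>
+</div>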
+
+The following people contributed to this release:
+<ul>
+ <li>Jey Kottalam (Maven build, bug fixes, EC2 scripts, packaging the release)</li>
+ <li>Andrew Ash (bug fixes, docs)</li>
+ <li>Andrey Kouznetsov (bug fixes)</li>
+ <li>Andy Konwinski (docs)</li>
+ <li>Charles Reiss (bug fixes)</li>
+ <li>Christoph Grothaus (bug fixes)</li>
+ <li>Erik van Oosten (bug fixes)</li>
+ <li>Giovanni Delussu (bug fixes)</li>
+ <li>Hiral Patel (bug fixes)</li>
+ <li>Holden Karau (error reporting, EC2 scripts)</li>
+ <li>Imran Rashid (metrics reporting system)</li>
+ <li>Josh Rosen (EC2 scripts)</li>
+ <li>Mark Hamstra (new API methods, tests)</li>
+ <li>Mikhail Bautin (build)</li>
+ <li>Mosharaf Chowdhury (bug fixes)</li>
+ <li>Nick Pentreath (Bagel, examples)</li>
+ <li>Patrick Wendell (bug fixes)</li>
+ <li>Reynold Xin (bug fixes)</li>
+ <li>Stephen Haberman (bug fixes, tests, subtractByKey)</li>
+ <li>Kalpit Shah (build, multiple workers per host)</li>
+ <li>Mike Potts (run scripts)</li>
+ <li>Matei Zaharia (Bagel, bug fixes, build)</li>
+</ul>
+
+We thank everyone who helped with this release, and hope to see more contributions from you in the future!
diff --git a/releases/_posts/2013-07-16-spark-release-0-7-3.md b/releases/_posts/2013-07-16-spark-release-0-7-3.md
new file mode 100644
index 000000000..39b35fb8f
--- /dev/null
+++ b/releases/_posts/2013-07-16-spark-release-0-7-3.md
@@ -0,0 +1,49 @@
+---
+layout: post
+title: Spark Release 0.7.3
+categories:
+- Releases
+tags: []
+status: publish
+type: post
+published: true
+meta:
+ _edit_last: '4'
+ _wpas_done_all: '1'
+---
+Spark 0.7.3 is a maintenance release with several bug fixes, performance fixes, and new features. You can download it as a <a href="/download/spark-0.7.3-sources.tgz">source package</a> (4 MB tar.gz) or get prebuilt packages for <a href="/download/spark-0.7.3-prebuilt-hadoop1.tgz">Hadoop 1 / CDH3</a> or for <a href="/download/spark-0.7.3-prebuilt-cdh4.tgz">CDH 4</a> (61 MB tar.gz).
+
+We recommend that all users update to this maintenance release.
+
+The improvements in this release include:
+
+<ul>
+ <li><b>New "add JARs" functionality in Spark shell:</b> Users of <code>spark-shell</code> can now set the <code>ADD_JARS</code> environment variable to add a list of JARs to their clusters; these will also be sent to workers.</li>
+ <li><b>Windows fixes:</b> Spark standalone clusters now properly kill executors when a job ends or fails. In addition, adding JAR paths with backslashes will now work correctly.</li>
+  <li><b>Streaming API fixes:</b> The Kafka and Twitter APIs for Spark Streaming have been updated. In the Twitter case, this is to deal with the username/password authentication method being disabled by Twitter, while in the Kafka case, it is to allow receiving messages other than strings. Note that these are breaking API changes, as the Streaming API is still in alpha.</li>
+  <li><b>Python performance:</b> Spark's mechanism for spawning Python VMs has been improved to spawn them faster when the JVM has a large heap size, speeding up the Python API.</li>
+ <li><b>Mesos fixes:</b> JARs added to your job will now be on the classpath when deserializing task results in Mesos.</li>
+ <li><b>Error reporting:</b> Better error reporting for non-serializable exceptions and overly large task results.</li>
+  <li><b>Examples:</b> Added an example of stateful stream processing with <code>updateStateByKey</code> (see the sketch after this list).</li>
+ <li><b>Build:</b> Spark Streaming no longer depends on the Twitter4J repo, which should allow it to build in China.</li>
+ <li><b>Bug fixes</b> in <code>foldByKey</code>, streaming <code>count</code>, statistics methods, documentation, and web UI.</li>
+</ul>
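+
+For flavor, a hedged sketch of stateful counting with <code>updateStateByKey</code> (the surrounding stream setup is omitted, and <tt>wordDstream</tt> is assumed to be a DStream of (String, Int) pairs):
+
+<div class="code">
+<span class="comment">// keep a running count per key across batches</span><br>
+<span class="keyword">val</span> updateFunc = (values: Seq[Int], state: Option[Int]) =&gt; Some(state.getOrElse(0) + values.sum)<br>
+<span class="keyword">val</span> runningCounts = wordDstream.<span class="sparkop">updateStateByKey</span>[Int](updateFunc)
+</div>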
+
+The following people contributed to this release:
+
+<ul>
+ <li>Charles Reiss (Mesos)</li>
+ <li>Christoph Grothaus (Windows spawn fixes)</li>
+ <li>Christopher Nguyen (bug fixes)</li>
+ <li>James Phillpotts (Twitter input stream)</li>
+ <li>Jey Kottalam (Python performance)</li>
+ <li>Josh Rosen (usability)</li>
+ <li>Konstantin Boudnik (build)</li>
+ <li>Mark Hamstra (build)</li>
+ <li>Matei Zaharia (Windows, docs, ADD_JARS, Python, streaming)</li>
+ <li>Patrick Wendell (usability)</li>
+ <li>Tathagata Das (streaming fixes)</li>
+ <li>Jerry Shao (bug fixes)</li>
+ <li>S. Kumar (examples)</li>
+ <li>Sean McNamara (Kafka input streams, streaming fixes)</li>
+</ul>