---
layout: post
title: Spark Release 0.7.0
categories: []
tags: []
status: publish
type: post
published: true
meta:
  _edit_last: '4'
  _wpas_done_all: '1'
---
The Spark team is proud to release version 0.7.0, a new major release that brings several new features. Most notable are a Python API for Spark and an alpha of Spark Streaming. (Details on Spark Streaming can also be found in this technical report.) The release also adds numerous other improvements across the board. Overall, this is our biggest release to date, with 31 contributors, of whom 20 were external to Berkeley. You can download Spark 0.7.0 as either a source package (4 MB tar.gz) or a prebuilt package (60 MB tar.gz).

Python API

Spark 0.7 adds a Python API called PySpark that makes it possible to use Spark from Python, both in standalone programs and in interactive Python shells. It uses the standard CPython runtime, so your programs can call into native libraries like NumPy and SciPy. Like the Scala and Java APIs, PySpark will automatically ship functions from your main program, along with the variables they depend on, to the cluster. PySpark supports most Spark features, including RDDs, accumulators, broadcast variables, and HDFS input and output.
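
For example, a word count in PySpark might look like the sketch below. This is a minimal illustration, not a verbatim example from the docs: the master URL, app name, and input path ("data.txt") are placeholders.

```python
from pyspark import SparkContext

# Connect to a local Spark instance; a cluster URL could be used instead.
sc = SparkContext("local", "PySparkWordCount")

# "data.txt" is a placeholder input path (local file or HDFS).
lines = sc.textFile("data.txt")
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

for word, count in counts.collect():
    print("%s: %d" % (word, count))
```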

Spark Streaming Alpha

Spark Streaming is a new extension of Spark that adds near-real-time processing capability. It offers a simple and high-level API, where users can transform streams using parallel operations like map, filter, reduce, and new sliding window functions. It automatically distributes work over a cluster and provides efficient fault recovery with exactly-once semantics for transformations, without relying on costly transactions to an external system. Spark Streaming is described in more detail in these slides and our technical report. This release is our first alpha of Spark Streaming, with most of the functionality implemented and APIs in Java and Scala.
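
As an illustration, a streaming word count over a sliding window might look like the following sketch in the alpha Scala API. The socket input source, method names (e.g. socketTextStream), and window arguments are assumptions based on the programming guide, and the exact signatures may differ in the alpha.

```scala
import spark.streaming.{Seconds, StreamingContext}
import spark.streaming.StreamingContext._

// Create a context that batches incoming data into 1-second intervals.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))

// Assumed text source reading from a socket on localhost:9999.
val lines = ssc.socketTextStream("localhost", 9999)
val words = lines.flatMap(_.split(" "))

// Count words over a 30-second sliding window, recomputed every second.
val counts = words.map(word => (word, 1))
                  .reduceByKeyAndWindow(_ + _, Seconds(30), Seconds(1))
counts.print()

ssc.start()
```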

Memory Dashboard

Spark jobs now launch a web dashboard for monitoring the memory usage of each distributed dataset (RDD) in the program. Look for lines like this in your log:

    15:08:44 INFO BlockManagerUI: Started BlockManager web UI at http://mbk.local:63814

You can also control which port to use through the spark.ui.port property.
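
If you want a fixed port, a minimal sketch is shown below, assuming 0.7's system-property-based configuration; the port number 33000 and the app name are arbitrary placeholders.

```scala
// Set the web UI port as a Java system property before creating the context.
System.setProperty("spark.ui.port", "33000")

val sc = new spark.SparkContext("local", "MyApp")
// The log should now report the BlockManager web UI on port 33000.
```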

Maven Build

Spark can now be built using Maven in addition to SBT. The Maven build enables easier publishing to repositories of your choice, easy selection of Hadoop versions using Maven profiles (-Phadoop1 or -Phadoop2), and Debian packaging using mvn -Phadoop1,deb install.

New Operations

This release adds several RDD transformations, including keys, values, keyBy, subtract, coalesce, and zip. It also adds SparkContext.hadoopConfiguration, which lets programs configure Hadoop input/output settings globally across operations. Finally, it adds the RDD.toDebugString() method, which prints an RDD's lineage graph for troubleshooting, as shown in the sketch below.
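
The following sketch exercises these additions from the Scala shell. The sample data is made up, the commented results assume a single local partition (shuffle-based operations like subtract may return elements in a different order), and the Hadoop property shown is just an illustrative global setting.

```scala
import spark.SparkContext
import spark.SparkContext._   // brings in pair-RDD operations like keys/values

val sc = new SparkContext("local", "NewOperations")

val pairs = sc.parallelize(Seq((1, "a"), (2, "b"), (3, "c")))
pairs.keys.collect()    // Array(1, 2, 3)
pairs.values.collect()  // Array(a, b, c)

val nums = sc.parallelize(1 to 4)
nums.keyBy(_ % 2).collect()                         // Array((1,1), (0,2), (1,3), (0,4))
nums.subtract(sc.parallelize(Seq(2, 3))).collect()  // elements 1 and 4
nums.coalesce(1)                                    // shrink to a single partition
nums.zip(nums.map(_ * 10)).collect()                // Array((1,10), (2,20), (3,30), (4,40))

// Configure a Hadoop I/O setting globally for this context.
sc.hadoopConfiguration.set("io.file.buffer.size", "65536")

// Print the RDD's lineage graph for troubleshooting.
println(nums.toDebugString)
```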

EC2 Improvements

Other Improvements

Compatibility

This release is API-compatible with Spark 0.6 programs, but the following features changed slightly:

Credits

Spark 0.7 was the work of many contributors from Berkeley and outside: 31 different contributors in total, of whom 20 were from outside Berkeley. Here are the people who contributed, along with the areas they worked on: Thanks to everyone who contributed!