---
layout: post
title: Spark Release 0.3
categories:
- Releases
tags: []
status: publish
type: post
published: true
---
Spark 0.3 brings a variety of new features. You can download it for either Scala 2.9 or Scala 2.8.
## Scala 2.9 Support
This is the first release to support Scala 2.9 in addition to 2.8. Future releases are likely to be 2.9-only unless there is high demand for 2.8.
## Save Operations

You can now save distributed datasets to the Hadoop filesystem (HDFS), Amazon S3, Hypertable, and any other storage system supported by Hadoop. There are convenience methods for several common formats, like text files and SequenceFiles. For example, to save a dataset as text:

```scala
val numbers = spark.parallelize(1 to 100)
numbers.saveAsTextFile("hdfs://...")
```
## Native Types for SequenceFiles

When working with SequenceFiles, which store objects that implement Hadoop's Writable interface, Spark now lets you use native types in place of certain common Writable types, like IntWritable and Text. For example:

```scala
val data = spark.sequenceFile[Int, String]("hdfs://...")
```
Similarly, you can save datasets of basic types directly as SequenceFiles:

```scala
val squares = spark.parallelize(1 to 100).map(n => (n, n*n))
squares.saveAsSequenceFile("hdfs://...")
```
## Maven Integration
Spark now fetches dependencies via Maven and can publish Maven artifacts for easier dependency management.
## Faster Broadcast & Shuffle

This release includes new broadcast and shuffle algorithms, based on published research, to better support applications that communicate large amounts of data.
## Support for Non-Filesystem Hadoop Input Formats
The new SparkContext.hadoopRDD method allows reading data from Hadoop-compatible storage systems other than file systems, such as HBase, Hypertable, etc.
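As a rough sketch, hadoopRDD takes a Hadoop JobConf along with the InputFormat, key, and value classes for the source. The class names below are placeholders, not real APIs; the actual classes and configuration keys come from your storage system's Hadoop integration (for HBase, for instance, its TableInputFormat):

```scala
import org.apache.hadoop.mapred.JobConf

// Configure the input source. The settings are system-specific:
// for HBase you would set the table name, columns to scan, etc.
val conf = new JobConf()
// conf.set(...)  // storage-system-specific keys go here

// MyInputFormat, MyKey, and MyValue are hypothetical stand-ins for the
// mapred InputFormat and Writable key/value classes your system provides.
val records = spark.hadoopRDD(conf,
  classOf[MyInputFormat],
  classOf[MyKey],
  classOf[MyValue])
```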
## Other Features
- Outer join operators (leftOuterJoin, rightOuterJoin, etc.).
- Support for Scala 2.9 interpreter features (history search, Ctrl-C to cancel the current line, etc.) in the 2.9 version.
- Better default levels of parallelism for various operations.
- Ability to control the number of splits in a file.
- Various bug fixes.
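To illustrate the new outer join operators, here is a small sketch run against two pair datasets; the variable names and data are illustrative, and the result shown in the comment reflects the usual left-outer-join semantics, where keys missing on the right side yield None:

```scala
val counts = spark.parallelize(Seq(("a", 1), ("b", 2)))
val labels = spark.parallelize(Seq(("a", "apple")))

// leftOuterJoin keeps every key from the left dataset; right-side values
// are wrapped in Option, so elements have the shape (key, (left, Option[right])).
val joined = counts.leftOuterJoin(labels)
// roughly: ("a", (1, Some("apple"))), ("b", (2, None))
```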