From fb48445129a467feb986dd50585e5e678cad8e1f Mon Sep 17 00:00:00 2001
From: Patrick Wendell
Date: Fri, 19 Dec 2014 18:53:06 +0000
Subject: Release note typo's

---
 site/releases/spark-release-1-2-0.html | 9 ++++-----
 1 file changed, 4 insertions(+), 5 deletions(-)

diff --git a/site/releases/spark-release-1-2-0.html b/site/releases/spark-release-1-2-0.html
index e76d6a64c..96569fa2b 100644
--- a/site/releases/spark-release-1-2-0.html
+++ b/site/releases/spark-release-1-2-0.html
@@ -171,13 +171,13 @@

To download Spark 1.2, visit the downloads page.

Spark Core

-In 1.2 Spark core upgrades to major subsystems to improve the performance and stability of very large scale shuffles. The first is Spark’s communication manager used during bulk transfers, which upgrades to a netty-based implementation. The second is Spark’s shuffle mechanism, which upgrades to the “sort based” shuffle initially released in Spark 1.1. These both improve the performance and stability of very large scale shuffles. Spark also adds an elastic scaling mechanism designed to improve cluster utilization during long running ETL-style jobs. This is currently supported on YARN and will make its way to other cluster managers in future versions. Finally, Spark 1.2 adds support for Scala 2.11. For instructions on building for Scala 2.11 see the build documentation.
+In 1.2 Spark core upgrades two major subsystems to improve the performance and stability of very large scale shuffles. The first is Spark’s communication manager used during bulk transfers, which upgrades to a netty-based implementation. The second is Spark’s shuffle mechanism, which upgrades to the “sort based” shuffle initially released in Spark 1.1. These both improve the performance and stability of very large scale shuffles. Spark also adds an elastic scaling mechanism designed to improve cluster utilization during long running ETL-style jobs. This is currently supported on YARN and will make its way to other cluster managers in future versions. Finally, Spark 1.2 adds support for Scala 2.11. For instructions on building for Scala 2.11 see the build documentation.
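For readers who want to pin these behaviors explicitly, here is a minimal sketch that sets the relevant 1.2 configuration keys when constructing an application. The app name and executor bounds are illustrative, and dynamic allocation additionally requires the external shuffle service:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: pin the new 1.2 shuffle defaults and opt in to elastic
    // scaling on YARN. App name and executor bounds are illustrative.
    val conf = new SparkConf()
      .setAppName("shuffle-1-2-sketch")
      .set("spark.shuffle.blockTransferService", "netty") // netty-based transfers
      .set("spark.shuffle.manager", "sort")               // sort-based shuffle
      .set("spark.dynamicAllocation.enabled", "true")     // elastic scaling (YARN)
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "20")
      .set("spark.shuffle.service.enabled", "true")       // needed by dynamic allocation
    val sc = new SparkContext(conf)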

Spark Streaming

This release includes two major feature additions to Spark’s streaming library: a Python API and a write ahead log for full driver H/A. The Python API covers almost all the DStream transformations and output operations. Input sources based on text files and text over sockets are currently supported. Support for Kafka and Flume input streams in Python will be added in the next release. Second, Spark streaming now features H/A driver support through a write ahead log (WAL). In Spark 1.1 and earlier, some buffered (received but not yet processed) data can be lost during driver restarts. To prevent this, Spark 1.2 adds an optional WAL, which buffers received data into a fault-tolerant file system (e.g. HDFS). See the streaming programming guide for more details.
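As a sketch of the new durability option, the snippet below enables the WAL and structures the driver for restart recovery via a checkpoint; the checkpoint path, host, port, master, and batch interval are all illustrative:

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val checkpointDir = "hdfs:///tmp/streaming-checkpoint" // illustrative path

    def createContext(): StreamingContext = {
      val conf = new SparkConf()
        .setAppName("wal-sketch")
        .setMaster("local[2]") // for local testing; use YARN/standalone in production
        // Opt in to the write ahead log for received data.
        .set("spark.streaming.receiver.writeAheadLog.enable", "true")
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint(checkpointDir) // WAL data lands in this fault-tolerant directory
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.count().print()
      ssc
    }

    // On driver restart, recover the context (and buffered data) from the checkpoint.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()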

MLLib

-Spark 1.2 previews a new set of machine learning API’s in a package called spark.ml that supports learning pipelines, where multiple algorithms are run in sequence with varying parameters. This type of pipeline is common in practical machine learning deployments. The new ML package uses Spark’s SchemaRDD to represent ML datasets, providing directly interoperability with Spark SQL. In addition to the new API, Spark 1.2 extends decision trees with two tree ensemble methods: random forests and gradient-boosted trees, among the most successful tree-based models for classification and regression. Finally, MLlib’s Python implementation receives a major update in 1.2 to simplify the process of adding Python APIs, along with better Python API coverage.
+Spark 1.2 previews a new set of machine learning API’s in a package called spark.ml that supports learning pipelines, where multiple algorithms are run in sequence with varying parameters. This type of pipeline is common in practical machine learning deployments. The new ML package uses Spark’s SchemaRDD to represent ML datasets, providing direct interoperability with Spark SQL. In addition to the new API, Spark 1.2 extends decision trees with two tree ensemble methods: random forests and gradient-boosted trees, among the most successful tree-based models for classification and regression. Finally, MLlib’s Python implementation receives a major update in 1.2 to simplify the process of adding Python APIs, along with better Python API coverage.
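A condensed sketch of the pipeline concept, in the spirit of the 1.2 spark.ml alpha examples; the documents, parameters, and app name are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
    import org.apache.spark.sql.SQLContext

    case class LabeledDocument(id: Long, text: String, label: Double)

    val sc = new SparkContext(new SparkConf().setAppName("pipeline-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext._ // implicitly converts an RDD of case classes to a SchemaRDD

    // Tiny illustrative training set.
    val training = sc.parallelize(Seq(
      LabeledDocument(0L, "spark shuffle netty", 1.0),
      LabeledDocument(1L, "hadoop map reduce", 0.0)))

    // Three stages run in sequence: tokenize text, hash tokens into
    // feature vectors, then fit a logistic regression model.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol(tokenizer.getOutputCol).setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training) // the SchemaRDD flows through all stages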

Spark SQL

In this release Spark SQL adds a new API for external data sources. This API supports mounting external data sources as temporary tables, with support for optimizations such as predicate pushdown. Spark’s Parquet and JSON bindings have been re-written to use this API and we expect a variety of community projects to integrate with other systems and formats during the 1.2 lifecycle.
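The new API is also reachable directly from SQL. The sketch below mounts a JSON file as a temporary table through the built-in JSON binding; the file path and query are illustrative:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("datasources-sketch").setMaster("local[2]"))
    val sqlContext = new SQLContext(sc)

    // Mount a JSON file as a temporary table via the data sources API.
    sqlContext.sql("""
      CREATE TEMPORARY TABLE people
      USING org.apache.spark.sql.json
      OPTIONS (path 'examples/src/main/resources/people.json')
    """)

    // Query it like any other table.
    sqlContext.sql("SELECT name FROM people").collect().foreach(println)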

@@ -185,8 +185,7 @@

Hive integration has been improved with support for the fixed-precision decimal type and Hive 0.13. Spark SQL also adds dynamically partitioned inserts, a popular Hive feature. An internal re-architecting around caching improves the performance and semantics of caching SchemaRDD instances and adds support for statistics-based partition pruning for cached data.
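A sketch of a dynamically partitioned insert followed by table caching; the table and column names are illustrative and assume pre-existing Hive tables:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.hive.HiveContext

    val sc = new SparkContext(new SparkConf().setAppName("hive-sketch").setMaster("local[2]"))
    val hiveContext = new HiveContext(sc)

    // Dynamic partitioned insert: the partition value (dt) comes from the
    // query output rather than being spelled out per partition.
    hiveContext.sql("SET hive.exec.dynamic.partition = true")
    hiveContext.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
    hiveContext.sql("""
      INSERT OVERWRITE TABLE page_views PARTITION (dt)
      SELECT user_id, url, event_date AS dt FROM raw_events
    """)

    // Cache the table; 1.2's reworked caching also enables
    // statistics-based partition pruning over the cached data.
    hiveContext.cacheTable("page_views")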

GraphX

-In 1.2 GraphX graduates from an alpha component and adds a stable API. This means applications written against GraphX are guaranteed to work with future Spark versions without code changes.
-In 1.2 GraphX graduates from an alpha component and adds a stable API. This means applications written against GraphX are guaranteed to work with future Spark versions without code changes. A new core API, aggregateMessages, is introduced to replace the now deprecated mapReduceTriplet API. The new aggregateMessages API features a more imperative programming model and improves performance. Some early test users found 20% - 1X performance improvement by switching to the new API.
+In 1.2 GraphX graduates from an alpha component and adds a stable API. This means applications written against GraphX are guaranteed to work with future Spark versions without code changes. A new core API, aggregateMessages, is introduced to replace the now deprecated mapReduceTriplet API. The new aggregateMessages API features a more imperative programming model and improves performance. Some early test users found 20% - 1X performance improvement by switching to the new API.
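To make the new model concrete, the sketch below computes vertex degrees with aggregateMessages on a toy graph; passing TripletFields.None declares that no vertex attributes are read, which is part of where the speedup can come from:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.graphx.{Edge, Graph, TripletFields, VertexRDD}

    val sc = new SparkContext(new SparkConf().setAppName("aggmsg-sketch").setMaster("local[2]"))

    // Tiny illustrative graph.
    val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
    val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(1L, 3L, 1)))
    val graph = Graph(vertices, edges)

    // aggregateMessages: imperatively send a message from each edge
    // context, then merge messages arriving at the same vertex.
    val degrees: VertexRDD[Int] = graph.aggregateMessages[Int](
      ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) }, // count each endpoint once
      _ + _,                                         // commutative merge
      TripletFields.None                             // no vertex attributes shipped
    )
    degrees.collect().foreach(println)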

In addition, Spark now supports graph checkpointing and lineage truncation, which are necessary to support large numbers of iterations in production jobs. Finally, a handful of performance improvements have been added for PageRank and graph loading.
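A minimal sketch of the checkpointing hook, assuming the SparkContext and graph from the sketch above; the directory is illustrative:

    // Truncate lineage periodically during long iterative jobs.
    sc.setCheckpointDir("hdfs:///tmp/graphx-checkpoints")
    graph.checkpoint() // marks the graph's RDDs for checkpointing; lineage
                       // is truncated once they are materialized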

@@ -200,7 +199,7 @@ In 1.2 GraphX graduates from an alpha component and adds a stable API. This mean

Upgrading to Spark 1.2

-Spark 1.2 is binary compatible with Spark 1.0 and 1.1, so no code changes are necessary. This excludes API’s marked explicitly as unstable. Spark changes default configuration in a handful of cases for improved performance. Users who want to preserve identical configurations to Spark 1.1 can roll back these changes.
+Spark 1.2 is binary compatible with Spark 1.0 and 1.1, so no code changes are necessary. This excludes APIs marked explicitly as unstable. Spark changes default configuration in a handful of cases for improved performance. Users who want to preserve identical configurations to Spark 1.1 can roll back these changes.

  1. spark.shuffle.blockTransferService has been changed from nio to netty
  2. spark.shuffle.manager has been changed from hash to sort
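For users who prefer the 1.1 behavior, a minimal sketch of rolling the shuffle defaults back; the values mirror the list above:

    import org.apache.spark.SparkConf

    // Pin the Spark 1.1 shuffle defaults; pass this conf to the SparkContext
    // constructor (or set the same keys in spark-defaults.conf).
    val conf = new SparkConf()
      .set("spark.shuffle.blockTransferService", "nio")
      .set("spark.shuffle.manager", "hash")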