author    Reynold Xin <rxin@databricks.com>    2016-07-26 15:29:07 -0700
committer Reynold Xin <rxin@databricks.com>    2016-07-26 15:29:07 -0700
commit    7cd1fdf235b270b2aa38f8bb68d2e451ff618e2e (patch)
tree      05e0508a64ac18f75d0a0d670cb0b73aa9033a94
parent    175d31a253b26e5af63dfb28235b3ff0a3d74bc9 (diff)
More comprehensive new features
-rw-r--r--  releases/_posts/2016-07-27-spark-release-2-0-0.md | 40
-rw-r--r--  site/releases/spark-release-2-0-0.html            | 58
2 files changed, 66 insertions, 32 deletions
diff --git a/releases/_posts/2016-07-27-spark-release-2-0-0.md b/releases/_posts/2016-07-27-spark-release-2-0-0.md
index 9969ce850..8d3596703 100644
--- a/releases/_posts/2016-07-27-spark-release-2-0-0.md
+++ b/releases/_posts/2016-07-27-spark-release-2-0-0.md
@@ -34,38 +34,46 @@ One of the largest changes in Spark 2.0 is the new updated APIs:
- SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.
- A new, streamlined configuration API for SparkSession
- Simpler, more performant accumulator API
+ - A new, improved Aggregator API for typed aggregation in Datasets
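As a rough illustration of the new entry point and configuration API (a minimal sketch, not taken from the release notes; the app name and config key are illustrative):

```scala
// Minimal sketch of the unified SparkSession entry point (Spark 2.0).
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("example")
  .config("spark.sql.shuffle.partitions", "200") // streamlined per-session config
  .getOrCreate()

// The simplified accumulator API is reachable from the underlying SparkContext:
val errors = spark.sparkContext.longAccumulator("errors")
```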
#### SQL
Spark 2.0 substantially improved SQL functionality, with SQL:2003 support. Spark SQL can now run all 99 TPC-DS queries. More prominently, we have improved:
+ - A native SQL parser that supports both ANSI SQL and HiveQL
+ - Native DDL command implementations
- Subquery support, including
- - Uncorrelated Scalar Subqueries
- - Correlated Scalar Subqueries
- - NOT IN predicate Subqueries (in WHERE/HAVING clauses)
- - IN predicate subqueries (in WHERE/HAVING clauses)
- - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
+ - Uncorrelated scalar subqueries
+ - Correlated scalar subqueries
+ - NOT IN predicate subqueries (in WHERE/HAVING clauses)
+ - IN predicate subqueries (in WHERE/HAVING clauses)
+ - (NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)
- View canonicalization support
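For illustration, a sketch of the newly supported subquery forms, run through the SQL interface (table and column names here are made up):

```scala
// Hypothetical tables; only the subquery shapes matter.
spark.sql("""
  SELECT e.name,
         (SELECT max(salary) FROM employees) AS top_salary            -- uncorrelated scalar
  FROM employees e
  WHERE e.dept_id IN (SELECT id FROM departments)                     -- IN predicate
    AND EXISTS (SELECT 1 FROM awards a WHERE a.emp_id = e.id)         -- correlated EXISTS
""").show()
```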
In addition, when building without Hive support, Spark SQL should provide almost all of the functionality available when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.
-#### Performance
+#### New Features
+
+ - Native CSV data source, based on Databricks' [spark-csv module](https://github.com/databricks/spark-csv)
+ - Off-heap memory management for both caching and runtime execution
+ - Hive-style bucketing support
+ - Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.
+
+
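A brief sketch of how these surface in the API (paths, table, and column names are hypothetical):

```scala
// Native CSV data source:
val people = spark.read
  .option("header", "true")       // first line holds column names
  .option("inferSchema", "true")  // sample the file to infer column types
  .csv("data/people.csv")         // hypothetical path

// Hive-style bucketing on write:
people.write.bucketBy(8, "age").saveAsTable("people_bucketed")

// Approximate quantiles backed by a sketch (median with 1% relative error):
val Array(median) = people.stat.approxQuantile("age", Array(0.5), 0.01)
```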
+#### Performance and Runtime
- Substantial (2-10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.
- Improved Parquet scan throughput through vectorization
- Improved ORC performance
- Many improvements in the Catalyst query optimizer for common workloads
- Improved window function performance via native implementations for all window functions
+ - Automatic file coalescing for native data sources
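One way to observe whole-stage code generation at work (a sketch; exact plan output varies by version):

```scala
// Operators fused by whole-stage code generation are marked with '*'
// in the physical plan printed by explain().
spark.range(1000).selectExpr("sum(id)").explain()
```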
### MLlib
-The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.
-
-#### API changes
-The largest API change is in linear algebra. The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944)
-See the MLlib migration guide for a full list of API changes.
+The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.
#### New features
@@ -99,9 +107,14 @@ Spark 2.0 ships the initial experimental release for Structured Streaming, a hig
For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.
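A minimal Structured Streaming sketch against the new high-level API (the socket source and console sink are illustrative choices, not from the release notes):

```scala
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// A streaming aggregation expressed as an ordinary DataFrame query:
val counts = lines.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete") // rewrite the full aggregate on each trigger
  .format("console")
  .start()
```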
-### Operational and Packaging Improvements
+### Dependency and Packaging Improvements
+
+There are a variety of changes to Spark's operations and packaging process:
-There are a variety of improvements to Spark's operations and packaging process. The most prominent change is that Spark 2.0 no longer requires a fat assembly jar for production deployment.
+ - Spark 2.0 no longer requires a fat assembly jar for production deployment.
+ - The Akka dependency has been removed; as a result, user applications can program against any version of Akka.
+ - The Kryo version has been bumped to 3.0.
+ - The default build now uses Scala 2.11 rather than Scala 2.10.
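For example, an application built against the new defaults might declare (a hypothetical sbt sketch):

```scala
// build.sbt sketch: Scala 2.11 is now the default build target.
scalaVersion := "2.11.8"

// No fat assembly jar is required for deployment; Spark itself is "provided".
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.0.0" % "provided"
```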
### Removals, Behavior Changes and Deprecations
@@ -134,6 +147,7 @@ The following changes might require updating existing applications that depend o
- Java RDD’s flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java Iterator so the functions do not need to materialize all the data.
- Java RDD’s countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.
- When writing Parquet files, the summary files are not written by default. To re-enable it, users must set “parquet.enable.summary-metadata” to true.
+- The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes.
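In practice the linear-algebra change is an import-level move (a sketch):

```scala
// Before (RDD-based API): import org.apache.spark.mllib.linalg.Vectors
// After (DataFrame-based API, Spark 2.0):
import org.apache.spark.ml.linalg.Vectors

val v = Vectors.dense(1.0, 2.0, 3.0)
```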
For a more complete list, please see [SPARK-11806](https://issues.apache.org/jira/browse/SPARK-11806) for deprecations and removals.
diff --git a/site/releases/spark-release-2-0-0.html b/site/releases/spark-release-2-0-0.html
index ffa82552f..cf6f86b7e 100644
--- a/site/releases/spark-release-2-0-0.html
+++ b/site/releases/spark-release-2-0-0.html
@@ -195,18 +195,18 @@
<li><a href="#core-and-spark-sql">Core and Spark SQL</a> <ul>
<li><a href="#programming-apis">Programming APIs</a></li>
<li><a href="#sql">SQL</a></li>
- <li><a href="#performance">Performance</a></li>
+ <li><a href="#new-features">New Features</a></li>
+ <li><a href="#performance-and-runtime">Performance and Runtime</a></li>
</ul>
</li>
<li><a href="#mllib">MLlib</a> <ul>
- <li><a href="#api-changes">API changes</a></li>
- <li><a href="#new-features">New features</a></li>
+ <li><a href="#new-features-1">New features</a></li>
<li><a href="#speedscaling">Speed/scaling</a></li>
</ul>
</li>
<li><a href="#sparkr">SparkR</a></li>
<li><a href="#streaming">Streaming</a></li>
- <li><a href="#operational-and-packaging-improvements">Operational and Packaging Improvements</a></li>
+ <li><a href="#dependency-and-packaging-improvements">Dependency and Packaging Improvements</a></li>
<li><a href="#removals-behavior-changes-and-deprecations">Removals, Behavior Changes and Deprecations</a> <ul>
<li><a href="#removals">Removals</a></li>
<li><a href="#behavior-changes">Behavior Changes</a></li>
@@ -232,6 +232,7 @@
<li>SparkSession: new entry point that replaces the old SQLContext and HiveContext for DataFrame and Dataset APIs. SQLContext and HiveContext are kept for backward compatibility.</li>
<li>A new, streamlined configuration API for SparkSession</li>
<li>Simpler, more performant accumulator API</li>
+ <li>A new, improved Aggregator API for typed aggregation in Datasets</li>
</ul>
<h4 id="sql">SQL</h4>
@@ -239,18 +240,32 @@
<p>Spark 2.0 substantially improved SQL functionality, with SQL:2003 support. Spark SQL can now run all 99 TPC-DS queries. More prominently, we have improved:</p>
<ul>
- <li>Subquery support, including</li>
- <li>Uncorrelated Scalar Subqueries</li>
- <li>Correlated Scalar Subqueries</li>
- <li>NOT IN predicate Subqueries (in WHERE/HAVING clauses)</li>
- <li>IN predicate subqueries (in WHERE/HAVING clauses)</li>
- <li>(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)</li>
+ <li>A native SQL parser that supports both ANSI SQL and HiveQL</li>
+ <li>Native DDL command implementations</li>
+ <li>Subquery support, including
+ <ul>
+ <li>Uncorrelated scalar subqueries</li>
+ <li>Correlated scalar subqueries</li>
+ <li>NOT IN predicate subqueries (in WHERE/HAVING clauses)</li>
+ <li>IN predicate subqueries (in WHERE/HAVING clauses)</li>
+ <li>(NOT) EXISTS predicate subqueries (in WHERE/HAVING clauses)</li>
+ </ul>
+ </li>
<li>View canonicalization support</li>
</ul>
<p>In addition, when building without Hive support, Spark SQL should provide almost all of the functionality available when building with Hive support, with the exception of Hive connectivity, Hive UDFs, and script transforms.</p>
-<h4 id="performance">Performance</h4>
+<h4 id="new-features">New Features</h4>
+
+<ul>
+ <li>Native CSV data source, based on Databricks&#8217; <a href="https://github.com/databricks/spark-csv">spark-csv module</a></li>
+ <li>Off-heap memory management for both caching and runtime execution</li>
+ <li>Hive-style bucketing support</li>
+ <li>Approximate summary statistics using sketches, including approximate quantile, Bloom filter, and count-min sketch.</li>
+</ul>
+
+<h4 id="performance-and-runtime">Performance and Runtime</h4>
<ul>
<li>Substantial (2-10X) performance speedups for common operators in SQL and DataFrames via a new technique called whole-stage code generation.</li>
@@ -258,16 +273,13 @@
<li>Improved ORC performance</li>
<li>Many improvements in the Catalyst query optimizer for common workloads</li>
<li>Improved window function performance via native implementations for all window functions</li>
+ <li>Automatic file coalescing for native data sources</li>
</ul>
<h3 id="mllib">MLlib</h3>
-<p>The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.</p>
+<p>The DataFrame-based API is now the primary API. The RDD-based API is entering maintenance mode. See the MLlib guide for details.</p>
-<h4 id="api-changes">API changes</h4>
-<p>The largest API change is in linear algebra. The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944)
-See the MLlib migration guide for a full list of API changes.</p>
-
-<h4 id="new-features">New features</h4>
+<h4 id="new-features-1">New features</h4>
<ul>
<li>ML persistence: The DataFrames-based API provides near-complete support for saving and loading ML models and Pipelines in Scala, Java, Python, and R. See this blog post for details. (SPARK-6725, SPARK-11939, SPARK-14311)</li>
@@ -300,9 +312,16 @@ See the MLlib migration guide for a full list of API changes.</p>
<p>For the DStream API, the most prominent update is the new experimental support for Kafka 0.10.</p>
-<h3 id="operational-and-packaging-improvements">Operational and Packaging Improvements</h3>
+<h3 id="dependency-and-packaging-improvements">Dependency and Packaging Improvements</h3>
+
+<p>There are a variety of changes to Spark&#8217;s operations and packaging process:</p>
-<p>There are a variety of improvements to Spark&#8217;s operations and packaging process. The most prominent change is that Spark 2.0 no longer requires a fat assembly jar for production deployment.</p>
+<ul>
+ <li>Spark 2.0 no longer requires a fat assembly jar for production deployment.</li>
+ <li>The Akka dependency has been removed; as a result, user applications can program against any version of Akka.</li>
+ <li>The Kryo version has been bumped to 3.0.</li>
+ <li>The default build now uses Scala 2.11 rather than Scala 2.10.</li>
+</ul>
<h3 id="removals-behavior-changes-and-deprecations">Removals, Behavior Changes and Deprecations</h3>
@@ -337,6 +356,7 @@ See the MLlib migration guide for a full list of API changes.</p>
<li>Java RDD’s flatMap and mapPartitions functions used to require functions returning Java Iterable. They have been updated to require functions returning Java Iterator so the functions do not need to materialize all the data.</li>
<li>Java RDD’s countByKey and countApproxDistinctByKey now return a map from K to java.lang.Long, rather than to java.lang.Object.</li>
<li>When writing Parquet files, the summary files are not written by default. To re-enable it, users must set “parquet.enable.summary-metadata” to true.</li>
+ <li>The DataFrame-based API (spark.ml) now depends upon local linear algebra in spark.ml.linalg, rather than in spark.mllib.linalg. This removes the last dependencies of spark.ml.* on spark.mllib.*. (SPARK-13944) See the MLlib migration guide for a full list of API changes.</li>
</ul>
<p>For a more complete list, please see <a href="https://issues.apache.org/jira/browse/SPARK-11806">SPARK-11806</a> for deprecations and removals.</p>