Diffstat (limited to 'site/releases/spark-release-1-1-0.html')
-rw-r--r-- site/releases/spark-release-1-1-0.html | 6
1 files changed, 3 insertions, 3 deletions
diff --git a/site/releases/spark-release-1-1-0.html b/site/releases/spark-release-1-1-0.html
index 895522451..f2d1a6737 100644
--- a/site/releases/spark-release-1-1-0.html
+++ b/site/releases/spark-release-1-1-0.html
@@ -197,7 +197,7 @@
<p>Spark SQL adds a number of new features and performance improvements in this release. A <a href="http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#running-the-thrift-jdbc-server">JDBC/ODBC server</a> allows users to connect to SparkSQL from many different applications and provides shared access to cached tables. A new module provides <a href="http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#json-datasets">support for loading JSON data</a> directly into Spark’s SchemaRDD format, including automatic schema inference. Spark SQL introduces <a href="http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#other-configuration-options">dynamic bytecode generation</a> in this release, a technique which significantly speeds up execution for queries that perform complex expression evaluation. This release also adds support for registering Python, Scala, and Java lambda functions as UDFs, which can then be called directly in SQL. Spark 1.1 adds a <a href="http://spark.apache.org/docs/1.1.0/sql-programming-guide.html#programmatically-specifying-the-schema">public types API to allow users to create SchemaRDD’s from custom data sources</a>. Finally, many optimizations have been added to the native Parquet support as well as throughout the engine.</p>
<h3 id="mllib">MLlib</h3>
-<p>MLlib adds several new algorithms and optimizations in this release. 1.1 introduces a <a href="https://issues.apache.org/jira/browse/SPARK-2359">new library of statistical packages</a> which provides exploratory analytic functions. These include stratified sampling, correlations, chi-squared tests and support for creating random datasets. This release adds utilities for feature extraction (<a href="https://issues.apache.org/jira/browse/SPARK-2510">Word2Vec</a> and <a href="https://issues.apache.org/jira/browse/SPARK-2511">TF-IDF</a>) and feature transformation (<a href="https://issues.apache.org/jira/browse/SPARK-2272">normalization and standard scaling</a>). Also new are support for <a href="https://issues.apache.org/jira/browse/SPARK-1553">nonnegative matrix factorization</a> and <a href="https://issues.apache.org/jira/browse/SPARK-1782">SVD via Lanczos</a>. The decision tree algorithm has been <a href="https://issues.apache.org/jira/browse/SPARK-2478">added in Python and Java</a>. A tree aggregation primitive has been added to help optimize many existing algorithms. Performance improves across the board in MLlib 1.1, with improvements of around 2-3X for many algorithms and up to 5X for large scale decision tree problems. </p>
+<p>MLlib adds several new algorithms and optimizations in this release. 1.1 introduces a <a href="https://issues.apache.org/jira/browse/SPARK-2359">new library of statistical packages</a> which provides exploratory analytic functions. These include stratified sampling, correlations, chi-squared tests and support for creating random datasets. This release adds utilities for feature extraction (<a href="https://issues.apache.org/jira/browse/SPARK-2510">Word2Vec</a> and <a href="https://issues.apache.org/jira/browse/SPARK-2511">TF-IDF</a>) and feature transformation (<a href="https://issues.apache.org/jira/browse/SPARK-2272">normalization and standard scaling</a>). Also new are support for <a href="https://issues.apache.org/jira/browse/SPARK-1553">nonnegative matrix factorization</a> and <a href="https://issues.apache.org/jira/browse/SPARK-1782">SVD via Lanczos</a>. The decision tree algorithm has been <a href="https://issues.apache.org/jira/browse/SPARK-2478">added in Python and Java</a>. A tree aggregation primitive has been added to help optimize many existing algorithms. Performance improves across the board in MLlib 1.1, with improvements of around 2-3X for many algorithms and up to 5X for large scale decision tree problems.</p>
<h3 id="graphx-and-spark-streaming">GraphX and Spark Streaming</h3>
<p>Spark streaming adds a new data source <a href="https://issues.apache.org/jira/browse/SPARK-1981">Amazon Kinesis</a>. For the Apache Flume, a new mode is supported which <a href="https://issues.apache.org/jira/browse/SPARK-1729">pulls data from Flume</a>, simplifying deployment and providing high availability. The first of a set of <a href="https://issues.apache.org/jira/browse/SPARK-2438">streaming machine learning algorithms</a> is introduced with streaming linear regression. Finally, <a href="https://issues.apache.org/jira/browse/SPARK-1341">rate limiting</a> has been added for streaming inputs. GraphX adds <a href="https://issues.apache.org/jira/browse/SPARK-1991">custom storage levels for vertices and edges</a> along with <a href="https://issues.apache.org/jira/browse/SPARK-2748">improved numerical precision</a> across the board. Finally, GraphX adds a new label propagation algorithm.</p>
@@ -215,7 +215,7 @@
<ul>
<li>The default value of <code>spark.io.compression.codec</code> is now <code>snappy</code> for improved memory usage. Old behavior can be restored by switching to <code>lzf</code>.</li>
- <li>The default value of <code>spark.broadcast.factory</code> is now <code>org.apache.spark.broadcast.TorrentBroadcastFactory</code> for improved efficiency of broadcasts. Old behavior can be restored by switching to <code>org.apache.spark.broadcast.HttpBroadcastFactory</code>. </li>
+ <li>The default value of <code>spark.broadcast.factory</code> is now <code>org.apache.spark.broadcast.TorrentBroadcastFactory</code> for improved efficiency of broadcasts. Old behavior can be restored by switching to <code>org.apache.spark.broadcast.HttpBroadcastFactory</code>.</li>
<li>PySpark now performs external spilling during aggregations. Old behavior can be restored by setting <code>spark.shuffle.spill</code> to <code>false</code>.</li>
<li>PySpark uses a new heuristic for determining the parallelism of shuffle operations. Old behavior can be restored by setting <code>spark.default.parallelism</code> to the number of cores in the cluster.</li>
</ul>
@@ -275,7 +275,7 @@
<li>Daneil Darabos &#8211; bug fixes and UI enhancements</li>
<li>Daoyuan Wang &#8211; SQL fixes</li>
<li>David Lemieux &#8211; bug fix</li>
- <li>Davies Liu &#8211; PySpark fixes and spilling </li>
+ <li>Davies Liu &#8211; PySpark fixes and spilling</li>
<li>DB Tsai &#8211; online summaries in MLlib and other MLlib features</li>
<li>Derek Ma &#8211; bug fix</li>
<li>Doris Xin &#8211; MLlib stats library and several fixes</li>