author     Sean Owen <sowen@cloudera.com>  2016-10-09 12:43:24 +0100
committer  Sean Owen <sowen@cloudera.com>  2016-10-09 12:43:24 +0100
commit     4fbd720ddf9dc899a7e62b3103da63f169174580 (patch)
tree       8594b184271cecab411f29400f88eb904930a507 /site/releases/spark-release-1-2-0.html
parent     5a6c152c8aa9dec698d13ad283ab37f043aeb6f4 (diff)
Add new book per request; again fix minor differences in rendering from others' different/older jekyll versions?
Diffstat (limited to 'site/releases/spark-release-1-2-0.html')
-rw-r--r--  site/releases/spark-release-1-2-0.html | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/site/releases/spark-release-1-2-0.html b/site/releases/spark-release-1-2-0.html
index 219990e69..0afe0ca3e 100644
--- a/site/releases/spark-release-1-2-0.html
+++ b/site/releases/spark-release-1-2-0.html
@@ -194,7 +194,7 @@
<p>In 1.2, Spark core upgrades two major subsystems to improve the performance and stability of very large scale shuffles. The first is Spark’s communication manager used during bulk transfers, which upgrades to a <a href="https://issues.apache.org/jira/browse/SPARK-2468">netty-based implementation</a>. The second is Spark’s shuffle mechanism, which upgrades to the <a href="https://issues.apache.org/jira/browse/SPARK-3280">“sort based” shuffle initially released in Spark 1.1</a>. Spark also adds an <a href="https://issues.apache.org/jira/browse/SPARK-3174">elastic scaling mechanism</a> designed to improve cluster utilization during long-running ETL-style jobs. This is currently supported on YARN and will make its way to other cluster managers in future versions. Finally, Spark 1.2 adds support for Scala 2.11. For instructions on building for Scala 2.11, see the <a href="/docs/1.2.0/building-spark.html#building-for-scala-211">build documentation</a>.</p>
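As a rough illustration of the elastic scaling mechanism, the sketch below enables dynamic executor allocation through SparkConf. The property keys are the dynamic-allocation settings this feature introduces; the application name and executor bounds are placeholder assumptions.

    import org.apache.spark.{SparkConf, SparkContext}

    // Minimal sketch: enable elastic scaling (dynamic executor allocation) on YARN.
    // The external shuffle service must also be enabled so executors can be released
    // without losing their shuffle files. Executor bounds are example values.
    val conf = new SparkConf()
      .setAppName("ElasticScalingSketch")                  // assumed app name
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.shuffle.service.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")    // assumed floor
      .set("spark.dynamicAllocation.maxExecutors", "50")   // assumed ceiling
    val sc = new SparkContext(conf)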
<h3 id="spark-streaming">Spark Streaming</h3>
-<p>This release includes two major feature additions to Spark’s streaming library: a Python API and a write ahead log for full driver H/A. The <a href="https://issues.apache.org/jira/browse/SPARK-2377">Python API</a> covers almost all the DStream transformations and output operations. Input sources based on text files and text over sockets are currently supported; support for Kafka and Flume input streams in Python will be added in the next release. The second addition is H/A driver support through a <a href="https://issues.apache.org/jira/browse/SPARK-3129">write ahead log (WAL)</a>. In Spark 1.1 and earlier, some buffered (received but not yet processed) data could be lost during driver restarts. To prevent this, Spark 1.2 adds an optional WAL, which persists received data to a fault-tolerant file system (e.g. HDFS). See the <a href="/docs/1.2.0/streaming-programming-guide.html">streaming programming guide</a> for more details. </p>
+<p>This release includes two major feature additions to Spark’s streaming library: a Python API and a write ahead log for full driver H/A. The <a href="https://issues.apache.org/jira/browse/SPARK-2377">Python API</a> covers almost all the DStream transformations and output operations. Input sources based on text files and text over sockets are currently supported; support for Kafka and Flume input streams in Python will be added in the next release. The second addition is H/A driver support through a <a href="https://issues.apache.org/jira/browse/SPARK-3129">write ahead log (WAL)</a>. In Spark 1.1 and earlier, some buffered (received but not yet processed) data could be lost during driver restarts. To prevent this, Spark 1.2 adds an optional WAL, which persists received data to a fault-tolerant file system (e.g. HDFS). See the <a href="/docs/1.2.0/streaming-programming-guide.html">streaming programming guide</a> for more details.</p>
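A minimal sketch of how the WAL might be enabled, assuming an HDFS checkpoint directory and a local socket source; the host, port, and paths are illustrative.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Sketch: turn on the receiver write ahead log so buffered (received but not
    // yet processed) data survives a driver restart. Received blocks are persisted
    // under the checkpoint directory on a fault-tolerant file system such as HDFS.
    val conf = new SparkConf()
      .setAppName("StreamingWALSketch")                               // assumed app name
      .set("spark.streaming.receiver.writeAheadLog.enable", "true")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("hdfs://namenode:8020/spark/checkpoints")          // illustrative path

    val lines = ssc.socketTextStream("localhost", 9999)               // text-over-socket source
    lines.count().print()
    ssc.start()
    ssc.awaitTermination()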
<h3 id="mllib">MLLib</h3>
<p>Spark 1.2 previews a new set of machine learning API’s in a package called spark.ml that <a href="https://issues.apache.org/jira/browse/SPARK-3530">supports learning pipelines</a>, where multiple algorithms are run in sequence with varying parameters. This type of pipeline is common in practical machine learning deployments. The new ML package uses Spark’s SchemaRDD to represent <a href="https://issues.apache.org/jira/browse/SPARK-3573">ML datasets</a>, providing direct interoperability with Spark SQL. In addition to the new API, Spark 1.2 extends decision trees with two tree ensemble methods: <a href="https://issues.apache.org/jira/browse/SPARK-1545">random forests</a> and <a href="https://issues.apache.org/jira/browse/SPARK-1547">gradient-boosted trees</a>, among the most successful tree-based models for classification and regression. Finally, MLlib&#8217;s Python implementation receives a major update in 1.2 to simplify the process of adding Python APIs, along with better Python API coverage.</p>
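To make the pipeline concept concrete, here is a hedged sketch of the spark.ml API: a text-classification pipeline chaining a tokenizer, a feature hasher, and logistic regression. The training and test inputs are assumed SchemaRDDs, and the column names and parameter values are illustrative.

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Sketch of a spark.ml learning pipeline: stages run in sequence, each reading
    // and writing columns of a SchemaRDD. "training" and "test" are assumed
    // SchemaRDDs with a "text" column ("training" also has a "label" column).
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    val model = pipeline.fit(training)       // fits all three stages as one estimator
    val predictions = model.transform(test)  // appends prediction columns to test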