author    Reynold Xin <rxin@apache.org>  2014-11-05 22:42:47 +0000
committer Reynold Xin <rxin@apache.org>  2014-11-05 22:42:47 +0000
commit    a6867061b120dbb948e67821aeccc8306cd7d92a (patch)
tree      bdaef6cc448c148948174268ab2bf45df985d7f7 /site/faq.html
parent    df01dbc95bad1227527e32feed84ef5906ae5939 (diff)
Added sort benchmark news and various minor updates.
Diffstat (limited to 'site/faq.html')
-rw-r--r--  site/faq.html  |  25
1 file changed, 13 insertions(+), 12 deletions(-)
diff --git a/site/faq.html b/site/faq.html
index af410d27e..8a939f63c 100644
--- a/site/faq.html
+++ b/site/faq.html
@@ -102,7 +102,6 @@
<ul class="dropdown-menu">
<li><a href="/documentation.html">Overview</a></li>
<li><a href="/docs/latest/">Latest Release (Spark 1.1.0)</a></li>
- <li><a href="/examples.html">Examples</a></li>
</ul>
</li>
<li class="dropdown">
@@ -116,6 +115,7 @@
<li><a href="https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark">Powered By</a></li>
</ul>
</li>
+ <li><a href="/examples.html">Examples</a></li>
<li><a href="/faq.html">FAQ</a></li>
</ul>
</div>
@@ -129,6 +129,9 @@
<h5>Latest News</h5>
<ul class="list-unstyled">
+ <li><a href="/news/spark-wins-daytona-gray-sort-100tb-benchmark.html">Spark wins Daytona Gray Sort 100TB Benchmark</a>
+ <span class="small">(Nov 05, 2014)</span></li>
+
<li><a href="/news/proposals-open-for-spark-summit-east.html">Submissions open for Spark Summit East 2015 in New York</a>
<span class="small">(Oct 18, 2014)</span></li>
@@ -138,9 +141,6 @@
<li><a href="/news/spark-1-0-2-released.html">Spark 1.0.2 released</a>
<span class="small">(Aug 05, 2014)</span></li>
- <li><a href="/news/spark-0-9-2-released.html">Spark 0.9.2 released</a>
- <span class="small">(Jul 23, 2014)</span></li>
-
</ul>
<p class="small" style="text-align: right;"><a href="/news/index.html">Archive</a></p>
</div>
@@ -166,20 +166,21 @@
<p class="question">How does Spark relate to Hadoop?</p>
<p class="answer">
-Spark is a fast and powerful engine for processing Hadoop data.
-It runs in Hadoop clusters through
-<a href="http://hadoop.apache.org/docs/current2/hadoop-yarn/hadoop-yarn-site/YARN.html">Hadoop YARN</a>
-or Spark's <a href="/docs/latest/spark-standalone.html">standalone mode</a>, and it can process
-data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat.
-It is designed to perform both general data processing (similar to MapReduce) and new workloads like
-streaming, interactive queries, and machine learning.
+Spark is a fast and general processing engine compatible with Hadoop data. It can run in Hadoop clusters through YARN or Spark's standalone mode, and it can process data in HDFS, HBase, Cassandra, Hive, and any Hadoop InputFormat. It is designed to perform both batch processing (similar to MapReduce) and new workloads like streaming, interactive queries, and machine learning.
</p>
<p class="question">Which languages does Spark support?</p>
<p class="answer">Spark supports Scala, Java and Python.</p>
+<p class="question">What is the largest data size Spark can scale to?</p>
+<p class="answer">Spark has been shown to work well from megabytes of data to petabytes. It has been used to sort 100 TB of data 3X faster than Hadoop MapReduce on 1/10th of the machines, <a href="http://databricks.com/blog/2014/11/05/spark-officially-sets-a-new-record-in-large-scale-sorting.html">winning the 2014 Daytona GraySort Benchmark</a>. It has also been used to <a href="http://databricks.com/blog/2014/10/10/spark-petabyte-sort.html">sort 1 PB of data</a>. There are also production workloads that <a href="http://databricks.com/blog/2014/08/14/mining-graph-data-with-spark-at-alibaba-taobao.html">use Spark to do ETL and data analysis on PBs of data</a>.
+</p>
+
<p class="question">How large a cluster can Spark scale to?</p>
-<p class="answer">We have seen multiple deployments on over 1000 nodes.</p>
+<p class="answer">Many organizations run Spark on clusters with thousands of nodes.</p>
+
+<p class="question">What happens if my dataset does not fit in memory?</p>
+<p class="answer">Often each partition of data is small and does fit in memory, and these partitions are processed a few at a time. For very large partitions that do not fit in memory, Spark's built-in operators perform external operations on datasets.</p>
<p class="question">What happens when a cached dataset does not fit in memory?</p>
<p class="answer">Spark can either spill it to disk or recompute the partitions that don't fit in RAM each time they are requested. By default, it uses recomputation, but you can set a dataset's <a href="/docs/latest/scala-programming-guide.html#rdd-persistence">storage level</a> to <code>MEMORY_AND_DISK</code> to avoid this. </p>