author     Matei Zaharia <matei@eecs.berkeley.edu>  2013-08-30 10:16:26 -0700
committer  Matei Zaharia <matei@eecs.berkeley.edu>  2013-08-30 10:16:26 -0700
commit     23762efda25c67a347d2bb2383f6272993a431e4 (patch)
tree       ca0bf3aeec3e58401b2c7507465745d95e5b2769
parent     1b0f69c6234bc97d331cf4b1785078789cb69b47 (diff)
New hardware provisioning doc, and updates to menus
-rwxr-xr-x  docs/_layouts/global.html      22
-rw-r--r--  docs/hardware-provisioning.md  70
2 files changed, 83 insertions(+), 9 deletions(-)
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index a76346f428..a014554462 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -61,20 +61,24 @@
<a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
<ul class="dropdown-menu">
<li><a href="quick-start.html">Quick Start</a></li>
- <li><a href="scala-programming-guide.html">Scala</a></li>
- <li><a href="java-programming-guide.html">Java</a></li>
- <li><a href="python-programming-guide.html">Python</a></li>
+ <li><a href="scala-programming-guide.html">Spark in Scala</a></li>
+ <li><a href="java-programming-guide.html">Spark in Java</a></li>
+ <li><a href="python-programming-guide.html">Spark in Python</a></li>
+ <li class="divider"></li>
<li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
+ <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
</ul>
</li>
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
- <li><a href="api/core/index.html">Spark Java/Scala (Scaladoc)</a></li>
- <li><a href="api/pyspark/index.html">Spark Python (Epydoc)</a></li>
- <li><a href="api/streaming/index.html">Spark Streaming Java/Scala (Scaladoc) </a></li>
- <li><a href="api/mllib/index.html">Spark ML Library (Scaladoc) </a></li>
+ <li><a href="api/core/index.html">Spark Core for Java/Scala</a></li>
+ <li><a href="api/pyspark/index.html">Spark Core for Python</a></li>
+ <li class="divider"></li>
+ <li><a href="api/streaming/index.html">Spark Streaming</a></li>
+ <li><a href="api/bagel/index.html">Bagel (Pregel on Spark)</a></li>
+ <li><a href="api/mllib/index.html">MLlib (Machine Learning)</a></li>
</ul>
</li>
@@ -91,10 +95,10 @@
<li class="dropdown">
<a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
<ul class="dropdown-menu">
- <li><a href="building-with-maven.html">Building Spark with Maven</a></li>
<li><a href="configuration.html">Configuration</a></li>
<li><a href="tuning.html">Tuning Guide</a></li>
- <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+ <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
+ <li><a href="building-with-maven.html">Building Spark with Maven</a></li>
<li><a href="contributing-to-spark.html">Contributing to Spark</a></li>
</ul>
</li>
diff --git a/docs/hardware-provisioning.md b/docs/hardware-provisioning.md
new file mode 100644
index 0000000000..d21e2a3d70
--- /dev/null
+++ b/docs/hardware-provisioning.md
@@ -0,0 +1,70 @@
+---
+layout: global
+title: Hardware Provisioning
+---
+
+A common question that Spark developers receive is how to configure hardware for Spark. While the
+right hardware will depend on the situation, we make the following recommendations.
+
+# Storage Systems
+
+Because most Spark jobs will likely have to read input data from an external storage system (e.g.
+the Hadoop File System, or HBase), it is important to place Spark **as close to this system as
+possible**. We recommend the following:
+
+* If at all possible, run Spark on the same nodes as HDFS. The simplest way is to set up a Spark
+[standalone mode cluster](spark-standalone.html) on the same nodes, and configure Spark and
+Hadoop's memory and CPU usage to avoid interference (for Hadoop, the relevant options are
+`mapred.child.java.opts` for the per-task memory and `mapred.tasktracker.map.tasks.maximum`
+and `mapred.tasktracker.reduce.tasks.maximum` for the number of tasks). Alternatively, you can run
+Hadoop and Spark on a common cluster manager like [Mesos](running-on-mesos.html) or
+[Hadoop YARN](running-on-yarn.html).
+
+* If this is not possible, run Spark on different nodes in the same local-area network as HDFS.
+If your cluster spans multiple racks, include some Spark nodes on each rack.
+
+* For low-latency data stores like HBase, it may be preferable to run computing jobs on different
+nodes than the storage system to avoid interference.
+
+# Local Disks
+
+While Spark can perform a lot of its computation in memory, it still uses local disks to store
+data that doesn't fit in RAM, as well as to preserve intermediate output between stages. We
+recommend having **4-8 disks** per node, configured _without_ RAID (just as separate mount points).
+In Linux, mount the disks with the [`noatime` option](http://www.centos.org/docs/5/html/Global_File_System/s2-manage-mountnoatime.html)
+to reduce unnecessary writes. In Spark, [configure](configuration.html) the `spark.local.dir`
+variable to be a comma-separated list of the local disks. If you are running HDFS, it's fine to
+use the same disks as HDFS.
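+
+As a minimal sketch (the mount points and master URL below are hypothetical, and this assumes the
+`org.apache.spark` package names and configuration through Java system properties as described in
+the [configuration guide](configuration.html)), pointing `spark.local.dir` at several disks from
+the Scala API might look like:
+
+{% highlight scala %}
+import org.apache.spark.SparkContext
+
+// Hypothetical mount points: list each local disk, comma-separated, with no RAID.
+System.setProperty("spark.local.dir", "/mnt/disk1/spark,/mnt/disk2/spark,/mnt/disk3/spark")
+
+// Set the property before constructing the SparkContext.
+val sc = new SparkContext("spark://master:7077", "LocalDirExample")
+{% endhighlight %}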
+
+# Memory
+
+In general, Spark can run well with anywhere from **8 GB to hundreds of gigabytes** of memory per
+machine. In all cases, we recommend allocating at most 75% of the memory for Spark; leave the
+rest for the operating system and buffer cache.
+
+How much memory you need will depend on your application. To determine how much your
+application uses for a certain dataset size, load part of your dataset in a Spark RDD and use the
+Storage tab of Spark's monitoring UI (`http://<driver-node>:3030`) to see its size in memory.
+Note that memory usage is greatly affected by storage level and serialization format -- see
+the [tuning guide](tuning.html) for tips on how to reduce it.
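+
+As a rough example (the HDFS path is hypothetical, and `sc` stands for an already-created
+SparkContext), you can cache a slice of the data, force it to be computed, and then read off its
+in-memory size from the Storage tab:
+
+{% highlight scala %}
+// Cache a sample of the dataset; the path below is just a placeholder.
+val sample = sc.textFile("hdfs://namenode:9000/data/sample").cache()
+
+// count() forces the RDD to be computed and stored, so it appears on the Storage tab.
+sample.count()
+{% endhighlight %}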
+
+Finally, note that the Java VM does not always behave well with more than 200 GB of RAM. If you
+purchase machines with more RAM than this, you can run _multiple worker JVMs per node_. In
+Spark's [standalone mode](spark-standalone.html), you can set the number of workers per node
+with the `SPARK_WORKER_INSTANCES` variable in `conf/spark-env.sh`, and the number of cores
+per worker with `SPARK_WORKER_CORES`.
+
+# Network
+
+In our experience, when the data is in memory, many Spark applications are network-bound.
+Using a **10 Gigabit** or higher network is the best way to make these applications faster.
+This is especially true for "distributed reduce" applications such as group-bys, reduce-bys, and
+SQL joins. In any given application, you can see how much data Spark shuffles across the network
+from the application's monitoring UI (`http://<driver-node>:3030`).
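+
+For example, a shuffle-heavy aggregation like the sketch below (the input path and tab-separated
+record layout are hypothetical, and `sc` is an existing SparkContext) sends its map output across
+the network during `reduceByKey`; that is the shuffle traffic reported in the monitoring UI:
+
+{% highlight scala %}
+// The implicit conversions in SparkContext._ provide reduceByKey on RDDs of pairs.
+import org.apache.spark.SparkContext._
+
+// Count events per key; reduceByKey shuffles the map output across the network.
+val counts = sc.textFile("hdfs://namenode:9000/events")
+  .map(line => (line.split("\t")(0), 1))
+  .reduceByKey(_ + _)
+counts.count()
+{% endhighlight %}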
+
+# CPU Cores
+
+Spark scales well to tens of CPU cores per machine because it performs minimal sharing between
+threads. You should likely provision at least **8-16 cores** per machine. Depending on the CPU
+cost of your workload, you may also need more: once data is in memory, most applications are
+either CPU- or network-bound.