author    Reynold Xin <rxin@apache.org>    2015-09-17 22:11:21 +0000
committer Reynold Xin <rxin@apache.org>    2015-09-17 22:11:21 +0000
commit    6f57b0c45a7d1b6255067c6e9bc549baa491acac (patch)
tree      dbf7d7a7700e9e6bad3c8289ab831bc9c2c20d62 /site/docs/1.5.0/tuning.html
parent    ee9ffe89d608e7640a2487406b618d27e58026d6 (diff)
download  spark-website-6f57b0c45a7d1b6255067c6e9bc549baa491acac.tar.gz
          spark-website-6f57b0c45a7d1b6255067c6e9bc549baa491acac.tar.bz2
          spark-website-6f57b0c45a7d1b6255067c6e9bc549baa491acac.zip
add 1.5.0 back
Diffstat (limited to 'site/docs/1.5.0/tuning.html')
-rw-r--r--  site/docs/1.5.0/tuning.html  476
1 file changed, 476 insertions, 0 deletions
diff --git a/site/docs/1.5.0/tuning.html b/site/docs/1.5.0/tuning.html
new file mode 100644
index 000000000..80f18f41c
--- /dev/null
+++ b/site/docs/1.5.0/tuning.html
@@ -0,0 +1,476 @@
+<!DOCTYPE html>
+<!--[if lt IE 7]> <html class="no-js lt-ie9 lt-ie8 lt-ie7"> <![endif]-->
+<!--[if IE 7]> <html class="no-js lt-ie9 lt-ie8"> <![endif]-->
+<!--[if IE 8]> <html class="no-js lt-ie9"> <![endif]-->
+<!--[if gt IE 8]><!--> <html class="no-js"> <!--<![endif]-->
+ <head>
+ <meta charset="utf-8">
+ <meta http-equiv="X-UA-Compatible" content="IE=edge,chrome=1">
+ <title>Tuning - Spark 1.5.0 Documentation</title>
+
+ <meta name="description" content="Tuning and performance optimization guide for Spark 1.5.0">
+
+
+
+
+ <link rel="stylesheet" href="css/bootstrap.min.css">
+ <style>
+ body {
+ padding-top: 60px;
+ padding-bottom: 40px;
+ }
+ </style>
+ <meta name="viewport" content="width=device-width">
+ <link rel="stylesheet" href="css/bootstrap-responsive.min.css">
+ <link rel="stylesheet" href="css/main.css">
+
+ <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
+
+ <link rel="stylesheet" href="css/pygments-default.css">
+
+
+ <!-- Google analytics script -->
+ <script type="text/javascript">
+ var _gaq = _gaq || [];
+ _gaq.push(['_setAccount', 'UA-32518208-2']);
+ _gaq.push(['_trackPageview']);
+
+ (function() {
+ var ga = document.createElement('script'); ga.type = 'text/javascript'; ga.async = true;
+ ga.src = ('https:' == document.location.protocol ? 'https://ssl' : 'http://www') + '.google-analytics.com/ga.js';
+ var s = document.getElementsByTagName('script')[0]; s.parentNode.insertBefore(ga, s);
+ })();
+ </script>
+
+
+ </head>
+ <body>
+ <!--[if lt IE 7]>
+ <p class="chromeframe">You are using an outdated browser. <a href="http://browsehappy.com/">Upgrade your browser today</a> or <a href="http://www.google.com/chromeframe/?redirect=true">install Google Chrome Frame</a> to better experience this site.</p>
+ <![endif]-->
+
+ <!-- This code is taken from http://twitter.github.com/bootstrap/examples/hero.html -->
+
+ <div class="navbar navbar-fixed-top" id="topbar">
+ <div class="navbar-inner">
+ <div class="container">
+ <div class="brand"><a href="index.html">
+ <img src="img/spark-logo-hd.png" style="height:50px;"/></a><span class="version">1.5.0</span>
+ </div>
+ <ul class="nav">
+ <!--TODO(andyk): Add class="active" attribute to li some how.-->
+ <li><a href="index.html">Overview</a></li>
+
+ <li class="dropdown">
+ <a href="#" class="dropdown-toggle" data-toggle="dropdown">Programming Guides<b class="caret"></b></a>
+ <ul class="dropdown-menu">
+ <li><a href="quick-start.html">Quick Start</a></li>
+ <li><a href="programming-guide.html">Spark Programming Guide</a></li>
+ <li class="divider"></li>
+ <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
+ <li><a href="sql-programming-guide.html">DataFrames and SQL</a></li>
+ <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
+ <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
+ <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+ <li><a href="sparkr.html">SparkR (R on Spark)</a></li>
+ </ul>
+ </li>
+
+ <li class="dropdown">
+ <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
+ <ul class="dropdown-menu">
+ <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li>
+ <li><a href="api/java/index.html">Java</a></li>
+ <li><a href="api/python/index.html">Python</a></li>
+ <li><a href="api/R/index.html">R</a></li>
+ </ul>
+ </li>
+
+ <li class="dropdown">
+ <a href="#" class="dropdown-toggle" data-toggle="dropdown">Deploying<b class="caret"></b></a>
+ <ul class="dropdown-menu">
+ <li><a href="cluster-overview.html">Overview</a></li>
+ <li><a href="submitting-applications.html">Submitting Applications</a></li>
+ <li class="divider"></li>
+ <li><a href="spark-standalone.html">Spark Standalone</a></li>
+ <li><a href="running-on-mesos.html">Mesos</a></li>
+ <li><a href="running-on-yarn.html">YARN</a></li>
+ <li class="divider"></li>
+ <li><a href="ec2-scripts.html">Amazon EC2</a></li>
+ </ul>
+ </li>
+
+ <li class="dropdown">
+ <a href="api.html" class="dropdown-toggle" data-toggle="dropdown">More<b class="caret"></b></a>
+ <ul class="dropdown-menu">
+ <li><a href="configuration.html">Configuration</a></li>
+ <li><a href="monitoring.html">Monitoring</a></li>
+ <li><a href="tuning.html">Tuning Guide</a></li>
+ <li><a href="job-scheduling.html">Job Scheduling</a></li>
+ <li><a href="security.html">Security</a></li>
+ <li><a href="hardware-provisioning.html">Hardware Provisioning</a></li>
+ <li><a href="hadoop-third-party-distributions.html">3<sup>rd</sup>-Party Hadoop Distros</a></li>
+ <li class="divider"></li>
+ <li><a href="building-spark.html">Building Spark</a></li>
+ <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark">Contributing to Spark</a></li>
+ <li><a href="https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects">Supplemental Projects</a></li>
+ </ul>
+ </li>
+ </ul>
+ <!--<p class="navbar-text pull-right"><span class="version-text">v1.5.0</span></p>-->
+ </div>
+ </div>
+ </div>
+
+ <div class="container" id="content">
+
+ <h1 class="title">Tuning Spark</h1>
+
+
+ <ul id="markdown-toc">
+ <li><a href="#data-serialization">Data Serialization</a></li>
+ <li><a href="#memory-tuning">Memory Tuning</a> <ul>
+ <li><a href="#determining-memory-consumption">Determining Memory Consumption</a></li>
+ <li><a href="#tuning-data-structures">Tuning Data Structures</a></li>
+ <li><a href="#serialized-rdd-storage">Serialized RDD Storage</a></li>
+ <li><a href="#garbage-collection-tuning">Garbage Collection Tuning</a></li>
+ </ul>
+ </li>
+ <li><a href="#other-considerations">Other Considerations</a> <ul>
+ <li><a href="#level-of-parallelism">Level of Parallelism</a></li>
+ <li><a href="#memory-usage-of-reduce-tasks">Memory Usage of Reduce Tasks</a></li>
+ <li><a href="#broadcasting-large-variables">Broadcasting Large Variables</a></li>
+ <li><a href="#data-locality">Data Locality</a></li>
+ </ul>
+ </li>
+ <li><a href="#summary">Summary</a></li>
+</ul>
+
+<p>Because of the in-memory nature of most Spark computations, Spark programs can be bottlenecked
+by any resource in the cluster: CPU, network bandwidth, or memory.
+Most often, if the data fits in memory, the bottleneck is network bandwidth, but sometimes, you
+also need to do some tuning, such as
+<a href="programming-guide.html#rdd-persistence">storing RDDs in serialized form</a>, to
+decrease memory usage.
+This guide will cover two main topics: data serialization, which is crucial for good network
+performance and can also reduce memory use, and memory tuning. We also sketch several smaller topics.</p>
+
+<h1 id="data-serialization">Data Serialization</h1>
+
+<p>Serialization plays an important role in the performance of any distributed application.
+Formats that are slow to serialize objects into, or consume a large number of
+bytes, will greatly slow down the computation.
+Often, this will be the first thing you should tune to optimize a Spark application.
+Spark aims to strike a balance between convenience (allowing you to work with any Java type
+in your operations) and performance. It provides two serialization libraries:</p>
+
+<ul>
+ <li><a href="http://docs.oracle.com/javase/6/docs/api/java/io/Serializable.html">Java serialization</a>:
+By default, Spark serializes objects using Java&#8217;s <code>ObjectOutputStream</code> framework, and can work
+with any class you create that implements
+<a href="http://docs.oracle.com/javase/6/docs/api/java/io/Serializable.html"><code>java.io.Serializable</code></a>.
+You can also control the performance of your serialization more closely by extending
+<a href="http://docs.oracle.com/javase/6/docs/api/java/io/Externalizable.html"><code>java.io.Externalizable</code></a>.
+Java serialization is flexible but often quite slow, and leads to large
+serialized formats for many classes.</li>
+ <li><a href="https://github.com/EsotericSoftware/kryo">Kryo serialization</a>: Spark can also use
+the Kryo library (version 2) to serialize objects more quickly. Kryo is significantly
+faster and more compact than Java serialization (often as much as 10x), but does not support all
+<code>Serializable</code> types and requires you to <em>register</em> the classes you&#8217;ll use in the program in advance
+for best performance.</li>
+</ul>
+
+<p>You can switch to using Kryo by initializing your job with a <a href="configuration.html#spark-properties">SparkConf</a>
+and calling <code>conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")</code>.
+This setting configures the serializer used not only for shuffling data between worker
+nodes but also for serializing RDDs to disk. The only reason Kryo is not the default is the custom
+registration requirement, but we recommend trying it in any network-intensive application.</p>
+
+<p>Spark automatically includes Kryo serializers for the many commonly-used core Scala classes covered
+in the AllScalaRegistrar from the <a href="https://github.com/twitter/chill">Twitter chill</a> library.</p>
+
+<p>To register your own custom classes with Kryo, use the <code>registerKryoClasses</code> method.</p>
+
+<div class="highlight"><pre><code class="language-scala" data-lang="scala"><span class="k">val</span> <span class="n">conf</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkConf</span><span class="o">().</span><span class="n">setMaster</span><span class="o">(...).</span><span class="n">setAppName</span><span class="o">(...)</span>
+<span class="n">conf</span><span class="o">.</span><span class="n">registerKryoClasses</span><span class="o">(</span><span class="nc">Array</span><span class="o">(</span><span class="n">classOf</span><span class="o">[</span><span class="kt">MyClass1</span><span class="o">],</span> <span class="n">classOf</span><span class="o">[</span><span class="kt">MyClass2</span><span class="o">]))</span>
+<span class="k">val</span> <span class="n">sc</span> <span class="k">=</span> <span class="k">new</span> <span class="nc">SparkContext</span><span class="o">(</span><span class="n">conf</span><span class="o">)</span></code></pre></div>
+
+<p>The <a href="https://github.com/EsotericSoftware/kryo">Kryo documentation</a> describes more advanced
+registration options, such as adding custom serialization code.</p>
+
+<p>If your objects are large, you may also need to increase the <code>spark.kryoserializer.buffer</code>
+config property. The default is 64KB, but this value needs to be large enough to hold the <em>largest</em>
+object you will serialize.</p>
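+
+<p>For example, a minimal sketch of raising the buffer on a <code>SparkConf</code> (the <code>64m</code> value is illustrative, not a recommendation):</p>
+
+<div class="highlight"><pre><code class="language-scala">// Illustrative values only: size the buffer for your largest serialized object.
+val conf = new SparkConf()
+  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+  .set("spark.kryoserializer.buffer", "64m")</code></pre></div>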
+
+<p>Finally, if you don&#8217;t register your custom classes, Kryo will still work, but it will have to store
+the full class name with each object, which is wasteful.</p>
+
+<h1 id="memory-tuning">Memory Tuning</h1>
+
+<p>There are three considerations in tuning memory usage: the <em>amount</em> of memory used by your objects
+(you may want your entire dataset to fit in memory), the <em>cost</em> of accessing those objects, and the
+overhead of <em>garbage collection</em> (if you have high turnover in terms of objects).</p>
+
+<p>By default, Java objects are fast to access, but can easily consume a factor of 2-5x more space
+than the &#8220;raw&#8221; data inside their fields. This is due to several reasons:</p>
+
+<ul>
+ <li>Each distinct Java object has an &#8220;object header&#8221;, which is about 16 bytes and contains information
+such as a pointer to its class. For an object with very little data in it (say one <code>Int</code> field), this
+can be bigger than the data.</li>
+ <li>Java <code>String</code>s have about 40 bytes of overhead over the raw string data (since they store it in an
+array of <code>Char</code>s and keep extra data such as the length), and store each character
+as <em>two</em> bytes due to <code>String</code>&#8217;s internal usage of UTF-16 encoding. Thus a 10-character string can
+easily consume 60 bytes.</li>
+ <li>Common collection classes, such as <code>HashMap</code> and <code>LinkedList</code>, use linked data structures, where
+there is a &#8220;wrapper&#8221; object for each entry (e.g. <code>Map.Entry</code>). This object not only has a header,
+but also pointers (typically 8 bytes each) to the next object in the list.</li>
+ <li>Collections of primitive types often store them as &#8220;boxed&#8221; objects such as <code>java.lang.Integer</code>.</li>
+</ul>
+
+<p>This section will discuss how to determine the memory usage of your objects, and how to improve
+it &#8211; either by changing your data structures, or by storing data in a serialized format.
+We will then cover tuning Spark&#8217;s cache size and the Java garbage collector.</p>
+
+<h2 id="determining-memory-consumption">Determining Memory Consumption</h2>
+
+<p>The best way to determine how much memory a dataset will require is to create an RDD, put it
+into cache, and look at the &#8220;Storage&#8221; page in the web UI. The page will tell you how much memory the RDD
+is occupying.</p>
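+
+<p>A minimal sketch of this (the input path is a placeholder):</p>
+
+<div class="highlight"><pre><code class="language-scala">// The path below is a placeholder; use a dataset of your own.
+val rdd = sc.textFile("hdfs://path/to/data").cache()
+rdd.count()  // materializes the cache so the "Storage" page can report its size</code></pre></div>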
+
+<p>To estimate the memory consumption of a particular object, use <code>SizeEstimator</code>&#8217;s <code>estimate</code> method.
+This is useful for experimenting with different data layouts to trim memory usage, as well as
+determining the amount of space a broadcast variable will occupy on each executor heap.</p>
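+
+<p>For example, a small sketch using <code>SizeEstimator</code> from <code>org.apache.spark.util</code> (the sample object is invented for illustration):</p>
+
+<div class="highlight"><pre><code class="language-scala">import org.apache.spark.util.SizeEstimator
+
+// Estimate the in-memory footprint of a candidate data layout.
+val sample = Array.fill(1000)("some-key" -> 42)
+println(SizeEstimator.estimate(sample))  // size in bytes</code></pre></div>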
+
+<h2 id="tuning-data-structures">Tuning Data Structures</h2>
+
+<p>The first way to reduce memory consumption is to avoid the Java features that add overhead, such as
+pointer-based data structures and wrapper objects. There are several ways to do this:</p>
+
+<ol>
+ <li>Design your data structures to prefer arrays of objects, and primitive types, instead of the
+standard Java or Scala collection classes (e.g. <code>HashMap</code>). The <a href="http://fastutil.di.unimi.it">fastutil</a>
+library provides convenient collection classes for primitive types that are compatible with the
+Java standard library. (A short sketch of this idea follows the list.)</li>
+ <li>Avoid nested structures with a lot of small objects and pointers when possible.</li>
+ <li>Consider using numeric IDs or enumeration objects instead of strings for keys.</li>
+ <li>If you have less than 32 GB of RAM, set the JVM flag <code>-XX:+UseCompressedOops</code> to make pointers be
+four bytes instead of eight. You can add these options in
+<a href="configuration.html#environment-variables"><code>spark-env.sh</code></a>.</li>
+</ol>
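+
+<p>As a sketch of the first suggestion above, here is the same small mapping held in a standard collection versus flat parallel arrays (the names and data are invented for illustration):</p>
+
+<div class="highlight"><pre><code class="language-scala">// Pointer-heavy layout: each Map entry is a wrapper object holding boxed values.
+val boxed: Map[String, Int] = Map("a" -> 1, "b" -> 2)
+
+// Flatter layout: parallel arrays avoid per-entry wrappers and boxing.
+// (Illustrative only -- not a drop-in replacement for Map.)
+val keys: Array[String] = Array("a", "b")
+val values: Array[Int]  = Array(1, 2)
+def lookup(k: String): Option[Int] = {
+  val i = keys.indexOf(k)
+  if (i >= 0) Some(values(i)) else None
+}</code></pre></div>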
+
+<h2 id="serialized-rdd-storage">Serialized RDD Storage</h2>
+
+<p>When your objects are still too large to efficiently store despite this tuning, a much simpler way
+to reduce memory usage is to store them in <em>serialized</em> form, using the serialized StorageLevels in
+the <a href="programming-guide.html#rdd-persistence">RDD persistence API</a>, such as <code>MEMORY_ONLY_SER</code>.
+Spark will then store each RDD partition as one large byte array.
+The only downside of storing data in serialized form is slower access times, due to having to
+deserialize each object on the fly.
+We highly recommend <a href="#data-serialization">using Kryo</a> if you want to cache data in serialized form, as
+it leads to much smaller sizes than Java serialization (and certainly than raw Java objects).</p>
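+
+<p>A minimal sketch, assuming <code>rdd</code> is an RDD you have already built:</p>
+
+<div class="highlight"><pre><code class="language-scala">import org.apache.spark.storage.StorageLevel
+
+// Store each partition as a single serialized byte array instead of live Java objects.
+rdd.persist(StorageLevel.MEMORY_ONLY_SER)</code></pre></div>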
+
+<h2 id="garbage-collection-tuning">Garbage Collection Tuning</h2>
+
+<p>JVM garbage collection can be a problem when you have large &#8220;churn&#8221; in terms of the RDDs
+stored by your program. (It is usually not a problem in programs that just read an RDD once
+and then run many operations on it.) When Java needs to evict old objects to make room for new ones, it will
+need to trace through all your Java objects and find the unused ones. The main point to remember here is
+that <em>the cost of garbage collection is proportional to the number of Java objects</em>, so using data
+structures with fewer objects (e.g. an array of <code>Int</code>s instead of a <code>LinkedList</code>) greatly lowers
+this cost. An even better method is to persist objects in serialized form, as described above: now
+there will be only <em>one</em> object (a byte array) per RDD partition. Before trying other
+techniques, the first thing to try if GC is a problem is to use <a href="#serialized-rdd-storage">serialized caching</a>.</p>
+
+<p>GC can also be a problem due to interference between your tasks&#8217; working memory (the
+amount of space needed to run the task) and the RDDs cached on your nodes. We will discuss how to control
+the space allocated to the RDD cache to mitigate this.</p>
+
+<p><strong>Measuring the Impact of GC</strong></p>
+
+<p>The first step in GC tuning is to collect statistics on how frequently garbage collection occurs and the amount of
+time spent in GC. This can be done by adding <code>-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps</code> to the Java options. (See the <a href="configuration.html#Dynamically-Loading-Spark-Properties">configuration guide</a> for info on passing Java options to Spark jobs.) The next time your Spark job runs, you will see messages printed in the worker&#8217;s logs
+each time a garbage collection occurs. Note these logs will be on your cluster&#8217;s worker nodes (in the <code>stdout</code> files in
+their work directories), <em>not</em> on your driver program.</p>
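+
+<p>One way to pass these options, sketched below, is through the <code>spark.executor.extraJavaOptions</code> property:</p>
+
+<div class="highlight"><pre><code class="language-scala">// Executor JVMs will log each collection; messages land in the executors' stdout files.
+val conf = new SparkConf()
+  .set("spark.executor.extraJavaOptions",
+       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")</code></pre></div>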
+
+<p><strong>Cache Size Tuning</strong></p>
+
+<p>One important configuration parameter for GC is the amount of memory that should be used for caching RDDs.
+By default, Spark uses 60% of the configured executor memory (<code>spark.executor.memory</code>) to
+cache RDDs. This means that 40% of memory is available for any objects created during task execution.</p>
+
+<p>In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
+memory, lowering this value will help reduce the memory consumption. To change this to, say, 50%, you can call
+<code>conf.set("spark.storage.memoryFraction", "0.5")</code> on your SparkConf. Combined with the use of serialized caching,
+using a smaller cache should be sufficient to mitigate most of the garbage collection problems.
+In case you are interested in further tuning the Java GC, continue reading below.</p>
+
+<p><strong>Advanced GC Tuning</strong></p>
+
+<p>To further tune garbage collection, we first need to understand some basic information about memory management in the JVM:</p>
+
+<ul>
+ <li>
+    <p>Java heap space is divided into two regions, Young and Old. The Young generation is meant to hold short-lived objects
+while the Old generation is intended for objects with longer lifetimes.</p>
+ </li>
+ <li>
+    <p>The Young generation is further divided into three regions: Eden, Survivor1, and Survivor2.</p>
+ </li>
+ <li>
+ <p>A simplified description of the garbage collection procedure: When Eden is full, a minor GC is run on Eden and objects
+that are alive from Eden and Survivor1 are copied to Survivor2. The Survivor regions are swapped. If an object is old
+enough or Survivor2 is full, it is moved to Old. Finally when Old is close to full, a full GC is invoked.</p>
+ </li>
+</ul>
+
+<p>The goal of GC tuning in Spark is to ensure that only long-lived RDDs are stored in the Old generation and that
+the Young generation is sufficiently sized to store short-lived objects. This will help avoid full GCs to collect
+temporary objects created during task execution. Some steps which may be useful are:</p>
+
+<ul>
+ <li>
+    <p>Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
+before a task completes, it means that there isn&#8217;t enough memory available for executing tasks.</p>
+ </li>
+ <li>
+ <p>In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of memory used for caching.
+This can be done using the <code>spark.storage.memoryFraction</code> property. It is better to cache fewer objects than to slow
+down task execution!</p>
+ </li>
+ <li>
+ <p>If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
+can set the size of Eden to be an over-estimate of how much memory each task will need. If the size of Eden
+is determined to be <code>E</code>, then you can set the size of the Young generation using the option <code>-Xmn=4/3*E</code>. (The scaling
+up by 4/3 is to account for space used by survivor regions as well.)</p>
+ </li>
+ <li>
+    <p>As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
+the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
+size of the compressed block. So if we wish to have 3 or 4 tasks&#8217; worth of working space, and the HDFS block size is 64 MB,
+we can estimate the size of Eden to be <code>4*3*64MB</code>. (A small worked sketch of this calculation follows the list.)</p>
+ </li>
+ <li>
+ <p>Monitor how the frequency and time taken by garbage collection changes with the new settings.</p>
+ </li>
+</ul>
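+
+<p>As a worked sketch of the Eden calculation above (the numbers are the illustrative ones from the HDFS example, not recommendations):</p>
+
+<div class="highlight"><pre><code class="language-scala">// Back-of-the-envelope Eden sizing for the HDFS example above.
+val hdfsBlockMB         = 64  // on-disk HDFS block size
+val decompressionFactor = 3   // a decompressed block is often 2-3x the compressed size
+val concurrentTasks     = 4   // tasks whose working sets share Eden at once
+val edenMB     = concurrentTasks * decompressionFactor * hdfsBlockMB  // 768 MB
+val youngGenMB = edenMB * 4 / 3  // value to pass via -Xmn, allowing for survivor regions</code></pre></div>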
+
+<p>Our experience suggests that the effect of GC tuning depends on your application and the amount of memory available.
+There are <a href="http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html">many more tuning options</a> described online,
+but at a high level, managing how frequently full GC takes place can help in reducing the overhead.</p>
+
+<h1 id="other-considerations">Other Considerations</h1>
+
+<h2 id="level-of-parallelism">Level of Parallelism</h2>
+
+<p>Clusters will not be fully utilized unless you set the level of parallelism for each operation high
+enough. Spark automatically sets the number of &#8220;map&#8221; tasks to run on each file according to its size
+(though you can control it through optional parameters to <code>SparkContext.textFile</code>, etc), and for
+distributed &#8220;reduce&#8221; operations, such as <code>groupByKey</code> and <code>reduceByKey</code>, it uses the largest
+parent RDD&#8217;s number of partitions. You can pass the level of parallelism as a second argument
+(see the <a href="api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions"><code>spark.PairRDDFunctions</code></a> documentation),
+or set the config property <code>spark.default.parallelism</code> to change the default.
+In general, we recommend 2-3 tasks per CPU core in your cluster.</p>
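+
+<p>Both approaches are sketched below, assuming <code>conf</code> is your <code>SparkConf</code>, <code>pairs</code> is a hypothetical pair RDD, and 200 is purely an illustrative partition count:</p>
+
+<div class="highlight"><pre><code class="language-scala">// Option 1: raise the default used by distributed "reduce" operations.
+conf.set("spark.default.parallelism", "200")
+
+// Option 2: pass the partition count directly to one operation.
+// pairs: RDD[(String, Int)], assumed to exist already.
+val counts = pairs.reduceByKey((a, b) => a + b, 200)</code></pre></div>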
+
+<h2 id="memory-usage-of-reduce-tasks">Memory Usage of Reduce Tasks</h2>
+
+<p>Sometimes, you will get an OutOfMemoryError not because your RDDs don&#8217;t fit in memory, but because the
+working set of one of your tasks, such as one of the reduce tasks in <code>groupByKey</code>, was too large.
+Spark&#8217;s shuffle operations (<code>sortByKey</code>, <code>groupByKey</code>, <code>reduceByKey</code>, <code>join</code>, etc) build a hash table
+within each task to perform the grouping, which can often be large. The simplest fix here is to
+<em>increase the level of parallelism</em>, so that each task&#8217;s input set is smaller. Spark can efficiently
+support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has
+a low task launching cost, so you can safely increase the level of parallelism to more than the
+number of cores in your cluster.</p>
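+
+<p>For example (a sketch, with <code>pairs</code> a hypothetical pair RDD and 500 an illustrative count):</p>
+
+<div class="highlight"><pre><code class="language-scala">// pairs: RDD[(String, Int)], assumed to exist already.
+// More partitions => each reduce task builds a smaller hash table.
+val grouped = pairs.groupByKey(500)</code></pre></div>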
+
+<h2 id="broadcasting-large-variables">Broadcasting Large Variables</h2>
+
+<p>Using the <a href="programming-guide.html#broadcast-variables">broadcast functionality</a>
+available in <code>SparkContext</code> can greatly reduce the size of each serialized task, and the cost
+of launching a job over a cluster. If your tasks use any large object from the driver program
+inside of them (e.g. a static lookup table), consider turning it into a broadcast variable.
+Spark prints the serialized size of each task on the master, so you can look at that to
+decide whether your tasks are too large; in general tasks larger than about 20 KB are probably
+worth optimizing.</p>
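+
+<p>A minimal sketch (the lookup table and <code>records</code> RDD are invented for illustration):</p>
+
+<div class="highlight"><pre><code class="language-scala">// Ship the table to every executor once, instead of once per serialized task.
+val lookupTable: Map[String, Int] = Map("a" -> 1, "b" -> 2)
+val bcTable = sc.broadcast(lookupTable)
+
+// records: RDD[String], assumed to exist already.
+val resolved = records.map(word => bcTable.value.getOrElse(word, -1))</code></pre></div>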
+
+<h2 id="data-locality">Data Locality</h2>
+
+<p>Data locality can have a major impact on the performance of Spark jobs. If data and the code that
+operates on it are together then computation tends to be fast. But if code and data are separated,
+one must move to the other. Typically it is faster to ship serialized code from place to place than
+a chunk of data because code size is much smaller than data. Spark builds its scheduling around
+this general principle of data locality.</p>
+
+<p>Data locality is how close data is to the code processing it. There are several levels of
+locality based on the data&#8217;s current location. In order from closest to farthest:</p>
+
+<ul>
+ <li><code>PROCESS_LOCAL</code> data is in the same JVM as the running code. This is the best locality
+possible</li>
+ <li><code>NODE_LOCAL</code> data is on the same node. Examples might be in HDFS on the same node, or in
+another executor on the same node. This is a little slower than <code>PROCESS_LOCAL</code> because the data
+has to travel between processes</li>
+ <li><code>NO_PREF</code> data is accessed equally quickly from anywhere and has no locality preference</li>
+ <li><code>RACK_LOCAL</code> data is on the same rack of servers. Data is on a different server on the same rack
+so needs to be sent over the network, typically through a single switch</li>
+ <li><code>ANY</code> data is elsewhere on the network and not in the same rack</li>
+</ul>
+
+<p>Spark prefers to schedule all tasks at the best locality level, but this is not always possible. In
+situations where there is no unprocessed data on any idle executor, Spark switches to lower locality
+levels. There are two options: a) wait until a busy CPU frees up to start a task on data on the same
+server, or b) immediately start a new task in a farther away place that requires moving data there.</p>
+
+<p>What Spark typically does is wait a bit in the hopes that a busy CPU frees up. Once that timeout
+expires, it starts moving the data from far away to the free CPU. The wait timeout for fallback
+between each level can be configured individually or all together in one parameter; see the
+<code>spark.locality</code> parameters on the <a href="configuration.html#scheduling">configuration page</a> for details.
+You should increase these settings if your tasks are long and you see poor locality, but the default
+usually works well.</p>
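+
+<p>For instance, a sketch of lengthening the top-level wait on a <code>SparkConf</code> (the <code>10s</code> value is illustrative):</p>
+
+<div class="highlight"><pre><code class="language-scala">// Wait longer for a free CPU at the current locality level before falling back
+// to a less local one (the default in this release is 3s; 10s is illustrative).
+conf.set("spark.locality.wait", "10s")</code></pre></div>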
+
+<h1 id="summary">Summary</h1>
+
+<p>This has been a short guide to point out the main concerns you should know about when tuning a
+Spark application &#8211; most importantly, data serialization and memory tuning. For most programs,
+switching to Kryo serialization and persisting data in serialized form will solve most common
+performance issues. Feel free to ask on the
+<a href="https://spark.apache.org/community.html">Spark mailing list</a> about other tuning best practices.</p>
+
+
+ </div> <!-- /container -->
+
+ <script src="js/vendor/jquery-1.8.0.min.js"></script>
+ <script src="js/vendor/bootstrap.min.js"></script>
+ <script src="js/vendor/anchor.min.js"></script>
+ <script src="js/main.js"></script>
+
+ <!-- MathJax Section -->
+ <script type="text/x-mathjax-config">
+ MathJax.Hub.Config({
+ TeX: { equationNumbers: { autoNumber: "AMS" } }
+ });
+ </script>
+ <script>
+ // Note that we load MathJax this way to work with local file (file://), HTTP and HTTPS.
+ // We could use "//cdn.mathjax...", but that won't support "file://".
+ (function(d, script) {
+ script = d.createElement('script');
+ script.type = 'text/javascript';
+ script.async = true;
+ script.onload = function(){
+ MathJax.Hub.Config({
+ tex2jax: {
+ inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
+ displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
+ processEscapes: true,
+ skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+ }
+ });
+ };
+ script.src = ('https:' == document.location.protocol ? 'https://' : 'http://') +
+ 'cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML';
+ d.getElementsByTagName('head')[0].appendChild(script);
+ }(document));
+ </script>
+ </body>
+</html>