author    Sean Owen <sowen@cloudera.com>  2016-08-22 11:15:53 -0700
committer Yin Huai <yhuai@databricks.com>  2016-08-22 11:15:53 -0700
commit    342278c09cf6e79ed4f63422988a6bbd1e7d8a91 (patch)
tree      7416fb89bab92686f249497e934de75cb4994f78 /docs/tuning.md
parent    209e1b3c0683a9106428e269e5041980b6cc327f (diff)
[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6
## What changes were proposed in this pull request?

Collect GC discussion in one section, and document findings about G1 GC heap region size.

## How was this patch tested?

Jekyll doc build

Author: Sean Owen <sowen@cloudera.com>

Closes #14732 from srowen/SPARK-16320.
Diffstat (limited to 'docs/tuning.md')
-rw-r--r--  docs/tuning.md | 36
1 file changed, 17 insertions(+), 19 deletions(-)
diff --git a/docs/tuning.md b/docs/tuning.md
index 976f2eb8a7..cbf37213aa 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -122,21 +122,8 @@ large records.
`R` is the storage space within `M` where cached blocks are immune to being evicted by execution.
The value of `spark.memory.fraction` should be set in order to fit this amount of heap space
-comfortably within the JVM's old or "tenured" generation. Otherwise, when much of this space is
-used for caching and execution, the tenured generation will be full, which causes the JVM to
-significantly increase time spent in garbage collection. See
-<a href="https://docs.oracle.com/javase/8/docs/technotes/guides/vm/gctuning/sizing.html">Java GC sizing documentation</a>
-for more information.
-
-The tenured generation size is controlled by the JVM's `NewRatio` parameter, which defaults to 2,
-meaning that the tenured generation is 2 times the size of the new generation (the rest of the heap).
-So, by default, the tenured generation occupies 2/3 or about 0.66 of the heap. A value of
-0.6 for `spark.memory.fraction` keeps storage and execution memory within the old generation with
-room to spare. If `spark.memory.fraction` is increased to, say, 0.8, then `NewRatio` may have to
-increase to 6 or more.
-
-`NewRatio` is set as a JVM flag for executors, which means adding
-`spark.executor.extraJavaOptions=-XX:NewRatio=x` to a Spark job's configuration.
+comfortably within the JVM's old or "tenured" generation. See the discussion of advanced GC
+tuning below for details.
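As a minimal sketch of setting this value (all numbers here are illustrative assumptions: 0.6 is simply the Spark 2.0 default, and `my-app.jar` is a placeholder):

```bash
# Illustrative only: with a 4g executor heap, M = (4096MB - 300MB) * 0.6,
# or roughly 2.2g, which should sit comfortably inside the tenured generation.
./bin/spark-submit \
  --conf spark.executor.memory=4g \
  --conf spark.memory.fraction=0.6 \
  my-app.jar
```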
## Determining Memory Consumption
@@ -217,14 +204,22 @@ temporary objects created during task execution. Some steps which may be useful
* Check if there are too many garbage collections by collecting GC stats. If a full GC is invoked multiple times
before a task completes, it means that there isn't enough memory available for executing tasks.
-* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
- memory used for caching by lowering `spark.memory.storageFraction`; it is better to cache fewer
- objects than to slow down task execution!
-
* If there are too many minor collections but not many major GCs, allocating more memory for Eden would help. You
can set the size of Eden to be an over-estimate of how much memory each task will need. If the size of Eden
is determined to be `E`, then you can set the size of the Young generation using the option `-Xmn=4/3*E`. (The scaling
up by 4/3 is to account for space used by survivor regions as well.)
+
+* In the GC stats that are printed, if the OldGen is close to being full, reduce the amount of
+ memory used for caching by lowering `spark.memory.fraction`; it is better to cache fewer
+ objects than to slow down task execution. Alternatively, consider decreasing the size of
+ the Young generation. This means lowering `-Xmn` if you've set it as above. If not, try changing the
+ value of the JVM's `NewRatio` parameter. Many JVMs default this to 2, meaning that the Old generation
+ occupies 2/3 of the heap. `NewRatio` should be set large enough that the Old generation's share of the heap exceeds `spark.memory.fraction`.
+
+* Try the G1GC garbage collector with `-XX:+UseG1GC`. It can improve performance in some situations where
+ garbage collection is a bottleneck. Note that with large executor heap sizes, it may be important to
+ increase the [G1 region size](https://blogs.oracle.com/g1gc/entry/g1_gc_tuning_a_case)
+ with `-XX:G1HeapRegionSize` (a combined sketch of these flags follows this list).
* As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
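As a combined sketch of the Young-generation and G1 flags above (every value is an illustrative assumption: an Eden estimate `E` of 2g, a region size of 16m, and the placeholder `my-app.jar`):

```bash
# Illustrative: with Eden estimated at E = 2g, the Young generation is
# 4/3 * E, roughly 2731m; the extra 1/3 covers the survivor regions.
./bin/spark-submit \
  --conf "spark.executor.extraJavaOptions=-Xmn2731m" \
  my-app.jar

# Illustrative alternative: switch to G1 and, for large executor heaps,
# raise the region size. (-XX:NewRatio=3 would instead keep the Old
# generation at 3/4 of the heap, above the 0.6 default of
# spark.memory.fraction.)
./bin/spark-submit \
  --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:G1HeapRegionSize=16m" \
  my-app.jar
```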
@@ -237,6 +232,9 @@ Our experience suggests that the effect of GC tuning depends on your application
There are [many more tuning options](http://www.oracle.com/technetwork/java/javase/gc-tuning-6-140523.html) described online,
but at a high level, managing how frequently full GC takes place can help in reducing the overhead.
+GC tuning flags for executors can be specified by setting `spark.executor.extraJavaOptions` in
+a job's configuration.
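For example, a sketch of collecting the GC stats discussed above, using the standard HotSpot GC-logging flags (`my-app.jar` is again a placeholder):

```bash
# Illustrative: print timing and detail for every collection to each
# executor's stdout, which surfaces in the worker logs.
./bin/spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  my-app.jar
```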
+
# Other Considerations
## Level of Parallelism