From efc5423210d1aadeaea78273a4a8f10425753079 Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Sun, 7 Oct 2012 11:30:53 -0700
Subject: Made compression configurable separately for shuffle, broadcast and RDDs

---
 docs/configuration.md | 41 +++++++++++++++++++++++++++--------------
 1 file changed, 27 insertions(+), 14 deletions(-)

diff --git a/docs/configuration.md b/docs/configuration.md
index 0987f7f7b1..db90b5bc16 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -113,29 +113,34 @@ Apart from these, the following properties are also available, and may be useful
 <tr>
-  <td>spark.blockManager.compress</td>
-  <td>false</td>
+  <td>spark.storage.memoryFraction</td>
+  <td>0.66</td>
   <td>
-    Set to "true" to have Spark compress map output files, RDDs that get cached on disk,
-    and RDDs that get cached in serialized form. Generally a good idea when dealing with
-    large datasets, but might add some CPU overhead.
+    Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
+    generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase
+    it if you configure your own old generation size.
+  </td>
+</tr>
+<tr>
+  <td>spark.shuffle.compress</td>
+  <td>true</td>
+  <td>
+    Whether to compress map output files. Generally a good idea.
   </td>
 </tr>
 <tr>
   <td>spark.broadcast.compress</td>
-  <td>false</td>
+  <td>true</td>
   <td>
-    Set to "true" to have Spark compress broadcast variables before sending them.
-    Generally a good idea when broadcasting large values.
+    Whether to compress broadcast variables before sending them. Generally a good idea.
   </td>
 </tr>
 <tr>
-  <td>spark.storage.memoryFraction</td>
-  <td>0.66</td>
+  <td>spark.rdd.compress</td>
+  <td>false</td>
   <td>
-    Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
-    generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase
-    it if you configure your own old generation size.
+    Whether to compress serialized RDD partitions (e.g. for StorageLevel.MEMORY_ONLY_SER).
+    Can save substantial space at the cost of some extra CPU time.
   </td>
 </tr>
@@ -180,11 +185,19 @@ Apart from these, the following properties are also available, and may be useful
     poor data locality, but the default generally works well.
   </td>
 </tr>
+<tr>
+  <td>spark.akka.threads</td>
+  <td>4</td>
+  <td>
+    Number of actor threads to use for communication. Can be useful to increase on large clusters
+    when the master has a lot of CPU cores.
+  </td>
+</tr>
 <tr>
   <td>spark.master.host</td>
   <td>(local hostname)</td>
   <td>
-    Hostname for the master to listen on (it will bind to this hostname's IP address).
+    Hostname or IP address for the master to listen on.
   </td>
 </tr>
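As a usage sketch of the properties this patch documents: in the Spark version it targets, settings of this kind are plain Java system properties, read when the `SparkContext` is constructed. The object name below is illustrative; only the property keys, defaults, and `StorageLevel.MEMORY_ONLY_SER` come from the patched docs.

```scala
// Sketch only: assumes the 0.6-era `spark` package layout and the
// system-property configuration style described in the patched docs.
import spark.SparkContext
import spark.storage.StorageLevel

object CompressionExample {
  def main(args: Array[String]) {
    // Properties must be set before the SparkContext is created.
    System.setProperty("spark.shuffle.compress", "true")      // default: true
    System.setProperty("spark.rdd.compress", "true")          // default: false
    System.setProperty("spark.storage.memoryFraction", "0.5") // default: 0.66

    val sc = new SparkContext("local", "CompressionExample")

    // spark.rdd.compress only applies to serialized storage levels such as
    // MEMORY_ONLY_SER; deserialized caching (MEMORY_ONLY) is unaffected.
    val nums = sc.parallelize(1 to 1000000).persist(StorageLevel.MEMORY_ONLY_SER)
    println(nums.count())
    sc.stop()
  }
}
```

Note that per the docs change above, shuffle and broadcast compression are now on by default while RDD compression is opt-in, so only the `spark.rdd.compress` line changes behavior here.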