author     Matei Zaharia <matei@databricks.com>  2013-12-30 22:17:28 -0500
committer  Matei Zaharia <matei@databricks.com>  2013-12-30 22:17:28 -0500
commit     0fa5809768cf60ec62b4277f04e23a44dc1582e2 (patch)
tree       fee16620755769a70975c41d894db43633b18098 /docs/tuning.md
parent     994f080f8ae3372366e6004600ba791c8a372ff0 (diff)
Updated docs for SparkConf and handled review comments
Diffstat (limited to 'docs/tuning.md')
-rw-r--r--  docs/tuning.md  21
1 file changed, 11 insertions(+), 10 deletions(-)
diff --git a/docs/tuning.md b/docs/tuning.md
index a4be188169..bbb870085c 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -38,14 +38,15 @@ in your operations) and performance. It provides two serialization libraries:
`Serializable` types and requires you to *register* the classes you'll use in the program in advance
for best performance.
-You can switch to using Kryo by calling `System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`
-*before* creating your SparkContext. This setting configures the serializer used for not only shuffling data between worker
+You can switch to using Kryo by initializing your job with a [SparkConf](configuration.html#spark-properties)
+and calling `conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")`.
+This setting configures the serializer used not only for shuffling data between worker
nodes but also for serializing RDDs to disk. The only reason Kryo is not the default is the custom
registration requirement, but we recommend trying it in any network-intensive application.
Finally, to register your classes with Kryo, create a public class that extends
[`org.apache.spark.serializer.KryoRegistrator`](api/core/index.html#org.apache.spark.serializer.KryoRegistrator) and set the
-`spark.kryo.registrator` system property to point to it, as follows:
+`spark.kryo.registrator` config property to point to it, as follows:
{% highlight scala %}
import com.esotericsoftware.kryo.Kryo
@@ -58,17 +59,17 @@ class MyRegistrator extends KryoRegistrator {
}
}
-// Make sure to set these properties *before* creating a SparkContext!
-System.setProperty("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
-System.setProperty("spark.kryo.registrator", "mypackage.MyRegistrator")
-val sc = new SparkContext(...)
+val conf = new SparkConf().setMaster(...).setAppName(...)
+conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
+conf.set("spark.kryo.registrator", "mypackage.MyRegistrator")
+val sc = new SparkContext(conf)
{% endhighlight %}
The [Kryo documentation](http://code.google.com/p/kryo/) describes more advanced
registration options, such as adding custom serialization code.
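For a sense of what that looks like, here is a hedged sketch of a hand-written serializer against the Kryo 2.x API; `MyClass` and its single field are hypothetical, and the registration call belongs inside your registrator's `registerClasses` method:

{% highlight scala %}
import com.esotericsoftware.kryo.{Kryo, Serializer}
import com.esotericsoftware.kryo.io.{Input, Output}

// Hypothetical domain class, used only for this sketch.
class MyClass(val name: String)

// Writes the one field directly instead of letting Kryo
// reflect over the whole object graph.
class MyClassSerializer extends Serializer[MyClass] {
  override def write(kryo: Kryo, output: Output, obj: MyClass) {
    output.writeString(obj.name)
  }
  override def read(kryo: Kryo, input: Input, t: Class[MyClass]): MyClass =
    new MyClass(input.readString())
}

// Then, inside registerClasses(kryo: Kryo):
// kryo.register(classOf[MyClass], new MyClassSerializer)
{% endhighlight %}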
If your objects are large, you may also need to increase the `spark.kryoserializer.buffer.mb`
-system property. The default is 2, but this value needs to be large enough to hold the *largest*
+config property. The default is 2, but this value needs to be large enough to hold the *largest*
object you will serialize.
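For example, if the largest record you serialize approaches tens of megabytes, you might raise the buffer as in this sketch (64 is an illustrative value, not a recommendation):

{% highlight scala %}
// Value is in megabytes; it must exceed the largest single object
// you serialize, at the cost of extra buffer memory per core.
conf.set("spark.kryoserializer.buffer.mb", "64")
{% endhighlight %}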
Finally, if you don't register your classes, Kryo will still work, but it will have to store the
@@ -165,7 +166,7 @@ cache RDDs. This means that 33% of memory is available for any objects created d
If your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
memory, lowering this value helps reduce memory consumption. To change this to, say, 50%, you can call
-`System.setProperty("spark.storage.memoryFraction", "0.5")`. Combined with the use of serialized caching,
+`conf.set("spark.storage.memoryFraction", "0.5")` on your SparkConf. Combined with the use of serialized caching,
using a smaller cache should be sufficient to mitigate most of the garbage collection problems.
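Put together, a minimal sketch of the two settings in one job might look like this; the RDD and storage level are illustrative:

{% highlight scala %}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val conf = new SparkConf().setMaster(...).setAppName(...)
conf.set("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext(conf)

// MEMORY_ONLY_SER stores each partition as one serialized byte
// array, leaving far fewer objects for the garbage collector to scan.
val data = sc.parallelize(1 to 1000000)
data.persist(StorageLevel.MEMORY_ONLY_SER)
{% endhighlight %}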
If you are interested in further tuning the Java GC, continue reading below.
@@ -219,7 +220,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a
distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses the largest
parent RDD's number of partitions. You can pass the level of parallelism as a second argument
(see the [`spark.PairRDDFunctions`](api/core/index.html#org.apache.spark.rdd.PairRDDFunctions) documentation),
-or set the system property `spark.default.parallelism` to change the default.
+or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
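For instance, on a 16-core cluster you might aim for roughly 48 reduce tasks; `pairs` below is a hypothetical RDD of key-value pairs:

{% highlight scala %}
// Explicit level of parallelism, passed as the second argument.
val counts = pairs.reduceByKey(_ + _, 48)

// Or set a default for all distributed reduce operations,
// before the SparkContext is created.
conf.set("spark.default.parallelism", "48")
{% endhighlight %}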
## Memory Usage of Reduce Tasks