From 874a9fd407943c7102395cfc64762dfd0ecf9b00 Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Wed, 26 Sep 2012 19:17:58 -0700
Subject: More updates to docs, including tuning guide

---
 docs/configuration.md | 174 ++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 167 insertions(+), 7 deletions(-)

(limited to 'docs/configuration.md')

diff --git a/docs/configuration.md b/docs/configuration.md
index 0f16676f6d..93a644910c 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -2,21 +2,181 @@
 layout: global
 title: Spark Configuration
 ---
-Spark is configured primarily through the `conf/spark-env.sh` script. This script doesn't exist in the Git repository, but you can create it by copying `conf/spark-env.sh.template`. Make sure the script is executable.
-Inside this script, you can set several environment variables:
+Spark provides three main locations to configure the system:
+
+* The [`conf/spark-env.sh` script](#environment-variables-in-spark-envsh), in which you can set environment variables
+  that affect how the JVM is launched, most notably the amount of memory per JVM.
+* [Java system properties](#system-properties), which control internal configuration parameters and can be set either
+  programmatically (by calling `System.setProperty` *before* creating a `SparkContext`) or through the
+  `SPARK_JAVA_OPTS` environment variable in `spark-env.sh`.
+* [Logging configuration](#configuring-logging), which is done through `log4j.properties`.
+
+
+# Environment Variables in spark-env.sh
+
+Spark determines how to initialize the JVM on worker nodes, or even on the local node when you run `spark-shell`,
+by running the `conf/spark-env.sh` script in the directory where it is installed. This script does not exist by default
+in the Git repository, but you can create it by copying `conf/spark-env.sh.template`. Make sure that you make
+the copy executable.
+
+Inside `spark-env.sh`, you can set the following environment variables:

 * `SCALA_HOME` to point to your Scala installation.
 * `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster]({{HOME_PATH}}running-on-mesos.html).
 * `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's -Xmx option, e.g. `300m` or `1g`)
-* `SPARK_JAVA_OPTS` to add JVM options. This includes system properties that you'd like to pass with `-D`.
+* `SPARK_JAVA_OPTS` to add JVM options. This includes any system properties that you'd like to pass with `-D`.
 * `SPARK_CLASSPATH` to add elements to Spark's classpath.
 * `SPARK_LIBRARY_PATH` to add search directories for native libraries.

-The `spark-env.sh` script is executed both when you submit jobs with `run`, when you start the interpreter with `spark-shell`, and on each worker node on a Mesos cluster to set up the environment for that worker.
+The most important settings to get right first are `SCALA_HOME`, without which `spark-shell` cannot run, and
+`MESOS_NATIVE_LIBRARY` if you are running on Mesos. The next setting is usually the memory (`SPARK_MEM`): set it high
+enough to run your job, but lower than the total memory on each machine (leave at least 1 GB for the operating system).
+
+
+# System Properties
+
+To set a system property for configuring Spark, you need to either pass it with a -D flag to the JVM (for example
+`java -Dspark.cores.max=5 MyProgram`) or call `System.setProperty` in your code *before* creating your Spark context,
+as follows:
+
+{% highlight scala %}
+System.setProperty("spark.cores.max", "5")
+val sc = new SparkContext(...)
+{% endhighlight %}
+
+Most of the configurable system properties control internal settings that have reasonable default values. However,
+there are at least four properties that you will commonly want to control:
+
+<table>
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.serializer</td>
+  <td>spark.JavaSerializer</td>
+  <td>
+    Class to use for serializing objects that will be sent over the network or need to be cached
+    in serialized form. The default of Java serialization works with any Serializable Java object but is
+    quite slow, so we recommend using <code>spark.KryoSerializer</code> and configuring Kryo serialization
+    when speed is necessary. Can be any subclass of <code>spark.Serializer</code>.
+  </td>
+</tr>
+<tr>
+  <td>spark.kryo.registrator</td>
+  <td>(none)</td>
+  <td>
+    If you use Kryo serialization, set this class to register your custom classes with Kryo.
+    You need to set it to a class that extends <code>spark.KryoRegistrator</code>.
+    See the tuning guide for more details, and the short sketch after this table for an example.
+  </td>
+</tr>
+<tr>
+  <td>spark.local.dir</td>
+  <td>/tmp</td>
+  <td>
+    Directory to use for "scratch" space in Spark, including map output files and RDDs that get stored
+    on disk. This should be on a fast, local disk in your system.
+  </td>
+</tr>
+<tr>
+  <td>spark.cores.max</td>
+  <td>(infinite)</td>
+  <td>
+    When running on a standalone deploy cluster or a Mesos cluster in "coarse-grained" sharing mode,
+    how many CPU cores to request at most. The default will use all available cores.
+  </td>
+</tr>
+</table>
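For illustration, here is a minimal sketch of how the first two settings above fit together: it switches the serializer to Kryo and registers a custom class before the `SparkContext` is created. The `MyClass` and `MyRegistrator` names (and the local job) are hypothetical stand-ins; the tuning guide describes this setup in full.

{% highlight scala %}
import com.esotericsoftware.kryo.Kryo
import spark.{KryoRegistrator, SparkContext}

// A stand-in application class that you want serialized efficiently.
class MyClass(val id: Int, val name: String)

// Hypothetical registrator; spark.kryo.registrator must name it by its fully
// qualified class name (here it lives in the default package).
class MyRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[MyClass])
  }
}

// Both properties must be set before the SparkContext is constructed.
System.setProperty("spark.serializer", "spark.KryoSerializer")
System.setProperty("spark.kryo.registrator", "MyRegistrator")
val sc = new SparkContext("local", "KryoExample")
{% endhighlight %}

Registering classes up front lets Kryo write small numeric identifiers instead of full class names, which is where most of the space savings come from.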
+
+
+Apart from these, the following properties are also available, and may be useful in some situations:
+
+<table>
+<tr><th>Property Name</th><th>Default</th><th>Meaning</th></tr>
+<tr>
+  <td>spark.mesos.coarse</td>
+  <td>false</td>
+  <td>
+    If set to "true", runs over Mesos clusters in "coarse-grained" sharing mode, where Spark acquires
+    one long-lived Mesos task on each machine instead of one Mesos task per Spark task. This gives
+    lower-latency scheduling for short queries, but leaves resources in use for the whole duration of
+    the Spark job.
+  </td>
+</tr>
+<tr>
+  <td>spark.default.parallelism</td>
+  <td>8</td>
+  <td>
+    Default number of tasks to use for distributed shuffle operations (<code>groupByKey</code>,
+    <code>reduceByKey</code>, etc) when not set by user. A short sketch after this table shows how it
+    interacts with an explicit split count.
+  </td>
+</tr>
+<tr>
+  <td>spark.storage.memoryFraction</td>
+  <td>0.66</td>
+  <td>
+    Fraction of Java heap to use for Spark's memory cache. This should not be larger than the "old"
+    generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase
+    it if you configure your own old generation size.
+  </td>
+</tr>
+<tr>
+  <td>spark.blockManager.parallelFetches</td>
+  <td>4</td>
+  <td>
+    Number of map output files to fetch concurrently from each reduce task.
+  </td>
+</tr>
+<tr>
+  <td>spark.closure.serializer</td>
+  <td>spark.JavaSerializer</td>
+  <td>
+    Serializer class to use for closures. Generally Java is fine unless your distributed functions
+    (e.g. map functions) reference large objects in the driver program.
+  </td>
+</tr>
+<tr>
+  <td>spark.kryoserializer.buffer.mb</td>
+  <td>32</td>
+  <td>
+    Maximum object size to allow within Kryo (the library needs to create a buffer at least as
+    large as the largest single object you'll serialize). Increase this if you get a "buffer limit
+    exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker.
+  </td>
+</tr>
+<tr>
+  <td>spark.broadcast.factory</td>
+  <td>spark.broadcast.HttpBroadcastFactory</td>
+  <td>
+    Which broadcast implementation to use.
+  </td>
+</tr>
+<tr>
+  <td>spark.locality.wait</td>
+  <td>3000</td>
+  <td>
+    Number of milliseconds to wait to launch a data-local task before giving up and launching it
+    in a non-data-local location. You should increase this if your tasks are long and you are seeing
+    poor data locality, but the default generally works well.
+  </td>
+</tr>
+<tr>
+  <td>spark.master.host</td>
+  <td>(local hostname)</td>
+  <td>
+    Hostname for the master to listen on (it will bind to this hostname's IP address).
+  </td>
+</tr>
+<tr>
+  <td>spark.master.port</td>
+  <td>(random)</td>
+  <td>
+    Port for the master to listen on.
+  </td>
+</tr>
+</table>
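For illustration, here is a minimal sketch of how `spark.default.parallelism` interacts with an explicit split count passed to a shuffle operation; the property value, data, and job name are arbitrary examples.

{% highlight scala %}
import spark.SparkContext
import spark.SparkContext._   // brings in the pair-RDD operations such as reduceByKey

// Must be set before the SparkContext is created; 16 is an arbitrary example value.
System.setProperty("spark.default.parallelism", "16")

val sc = new SparkContext("local[4]", "ParallelismExample")
val pairs = sc.parallelize(1 to 1000).map(i => (i % 10, i))

// Per the table above, a shuffle with no explicit split count falls back to
// spark.default.parallelism for its number of reduce tasks.
val summedDefault = pairs.reduceByKey(_ + _)

// An explicit second argument overrides that default for this one operation.
val summedExplicit = pairs.reduceByKey(_ + _, 32)
{% endhighlight %}

Passing an explicit count is useful when one particular shuffle needs far more (or fewer) reduce tasks than the global default.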
+
-The most important thing to set first will probably be the memory (`SPARK_MEM`). Make sure you set it high enough to be able to run your job but lower than the total memory on the machines (leave at least 1 GB for the operating system).

-## Logging Configuration
+# Configuring Logging

-Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties` file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
+Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties`
+file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
--
cgit v1.2.3