---
layout: global
title: Spark Configuration
---
Spark provides three main locations to configure the system:
* The [`conf/spark-env.sh` script](#environment-variables-in-spark-envsh), in which you can set environment variables
that affect how the JVM is launched, most notably the amount of memory per JVM.
* [Java system properties](#system-properties), which control internal configuration parameters and can be set either
programmatically (by calling `System.setProperty` *before* creating a `SparkContext`) or through the
`SPARK_JAVA_OPTS` environment variable in `spark-env.sh`.
* [Logging configuration](#configuring-logging), which is done through `log4j.properties`.
# Environment Variables in spark-env.sh
Spark determines how to initialize the JVM on worker nodes, or even on the local node when you run `spark-shell`,
by running the `conf/spark-env.sh` script in the directory where it is installed. This script does not exist by default
in the Git repository, but you can create it by copying `conf/spark-env.sh.template`. Make sure to make
the copy executable.
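For example, from the root of your Spark checkout:

{% highlight bash %}
cp conf/spark-env.sh.template conf/spark-env.sh
chmod +x conf/spark-env.sh
{% endhighlight %}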
Inside `spark-env.sh`, you can set the following environment variables:
* `SCALA_HOME` to point to your Scala installation.
* `MESOS_NATIVE_LIBRARY` if you are [running on a Mesos cluster](running-on-mesos.html).
* `SPARK_MEM` to set the amount of memory used per node (this should be in the same format as the JVM's `-Xmx` option, e.g. `300m` or `1g`).
* `SPARK_JAVA_OPTS` to add JVM options. This includes any system properties that you'd like to pass with `-D`.
* `SPARK_CLASSPATH` to add elements to Spark's classpath.
* `SPARK_LIBRARY_PATH` to add search directories for native libraries.
The most important settings to make first are `SCALA_HOME`, without which `spark-shell` cannot run, and `MESOS_NATIVE_LIBRARY`
if you are running on Mesos. The next setting will probably be the memory (`SPARK_MEM`). Make sure you set it high enough to run your job, but lower than the total memory on each machine (leave at least 1 GB for the operating system).
# System Properties
To set a system property for configuring Spark, you need to either pass it with a `-D` flag to the JVM (for example `java -Dspark.cores.max=5 MyProgram`) or call `System.setProperty` in your code *before* creating your Spark context, as follows:
{% highlight scala %}
System.setProperty("spark.cores.max", "5")
val sc = new SparkContext(...)
{% endhighlight %}
Most of the configurable system properties control internal settings that have reasonable default values. However,
there are several properties that you will commonly want to control:
| Property Name | Default | Meaning |
|---------------|---------|---------|
| `spark.mesos.coarse` | false | If set to "true", runs over Mesos clusters in "coarse-grained" sharing mode, where Spark acquires one long-lived Mesos task on each machine instead of one Mesos task per Spark task. This gives lower-latency scheduling for short queries, but leaves resources in use for the whole duration of the Spark job. |
| `spark.default.parallelism` | 8 | Default number of tasks to use for distributed shuffle operations (`groupByKey`, `reduceByKey`, etc) when not set by the user. |
| `spark.storage.memoryFraction` | 0.66 | Fraction of the Java heap to use for Spark's memory cache. This should not be larger than the "old" generation of objects in the JVM, which by default is given 2/3 of the heap, but you can increase it if you configure your own old generation size. |
| `spark.shuffle.compress` | true | Whether to compress map output files. Generally a good idea. |
| `spark.broadcast.compress` | true | Whether to compress broadcast variables before sending them. Generally a good idea. |
| `spark.rdd.compress` | false | Whether to compress serialized RDD partitions (e.g. for `StorageLevel.MEMORY_ONLY_SER`). Can save substantial space at the cost of some extra CPU time. |
| `spark.reducer.maxMbInFlight` | 48 | Maximum size (in megabytes) of map outputs to fetch simultaneously from each reduce task. Since each output requires us to create a buffer to receive it, this represents a fixed memory overhead per reduce task, so keep it small unless you have a large amount of memory. |
| `spark.closure.serializer` | spark.JavaSerializer | Serializer class to use for closures. Generally Java is fine unless your distributed functions (e.g. map functions) reference large objects in the driver program. |
| `spark.kryoserializer.buffer.mb` | 32 | Maximum object size to allow within Kryo (the library needs to create a buffer at least as large as the largest single object you'll serialize). Increase this if you get a "buffer limit exceeded" exception inside Kryo. Note that there will be one buffer per core on each worker. |
| `spark.broadcast.factory` | spark.broadcast.HttpBroadcastFactory | Which broadcast implementation to use. |
| `spark.locality.wait` | 3000 | Number of milliseconds to wait to launch a data-local task before giving up and launching it in a non-data-local location. You should increase this if your tasks are long and you are seeing poor data locality, but the default generally works well. |
| `spark.akka.threads` | 4 | Number of actor threads to use for communication. Can be useful to increase on large clusters when the master has a lot of CPU cores. |
| `spark.master.host` | (local hostname) | Hostname or IP address for the master to listen on. |
| `spark.master.port` | (random) | Port for the master to listen on. |
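Any of these properties can also be passed through the `SPARK_JAVA_OPTS` environment variable in `spark-env.sh` rather than set programmatically. For instance (the values here are illustrative, not recommendations):

{% highlight bash %}
# In conf/spark-env.sh; example values only:
export SPARK_JAVA_OPTS="-Dspark.mesos.coarse=true -Dspark.locality.wait=5000"
{% endhighlight %}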
# Configuring Logging
Spark uses [log4j](http://logging.apache.org/log4j/) for logging. You can configure it by adding a `log4j.properties`
file in the `conf` directory. One way to start is to copy the existing `log4j.properties.template` located there.
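For example, from the root of your Spark checkout (the `log4j.rootCategory` line shown is an assumption about the template's contents, used here only to illustrate where the logging level is set):

{% highlight bash %}
cp conf/log4j.properties.template conf/log4j.properties
# Then edit conf/log4j.properties to change the root logger level, e.g.:
#   log4j.rootCategory=WARN, console
{% endhighlight %}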