author     Ryan Williams <ryan.blake.williams@gmail.com>    2014-12-15 14:52:17 -0800
committer  Patrick Wendell <pwendell@gmail.com>             2014-12-15 14:52:17 -0800
commit     8176b7a02e6b62bbce194c3ce9802d58b7472101 (patch)
tree       030a8c3c865df112667dbf329f7552f866a482be /docs
parent     38703bbca86003995f32b2e948ad7c7c358aa99a (diff)
[SPARK-4668] Fix some documentation typos.
Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #3523 from ryan-williams/tweaks and squashes the following commits:

d2eddaa [Ryan Williams] code review feedback
ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
c6cfad9 [Ryan Williams] remove unnecessary if statement
b74ea35 [Ryan Williams] comment fix
b0221f0 [Ryan Williams] fix a gendered pronoun
c71ffed [Ryan Williams] use names on a few boolean parameters
89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
83e8358 [Ryan Williams] fix pom.xml typo
dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
Diffstat (limited to 'docs')
-rw-r--r--  docs/building-spark.md   16
-rw-r--r--  docs/configuration.md     6
-rw-r--r--  docs/tuning.md            8
3 files changed, 22 insertions, 8 deletions
diff --git a/docs/building-spark.md b/docs/building-spark.md
index 4922e877e9..70165eabca 100644
--- a/docs/building-spark.md
+++ b/docs/building-spark.md
@@ -124,7 +124,21 @@ We use the scala-maven-plugin which supports incremental and continuous compilat
mvn scala:cc
-should run continuous compilation (i.e. wait for changes). However, this has not been tested extensively.
+should run continuous compilation (i.e. wait for changes). However, this has not been tested
+extensively. A couple of gotchas to note:
+* it only scans the paths `src/main` and `src/test` (see
+[docs](http://scala-tools.org/mvnsites/maven-scala-plugin/usage_cc.html)), so it will only work
+from within certain submodules that have that structure.
+* you'll typically need to run `mvn install` from the project root for compilation within
+specific submodules to work; this is because submodules that depend on other submodules do so via
+the `spark-parent` module.
+
+Thus, the full flow for running continuous compilation of the `core` submodule may look more like:
+```
+$ mvn install
+$ cd core
+$ mvn scala:cc
+```
# Using With IntelliJ IDEA
diff --git a/docs/configuration.md b/docs/configuration.md
index acee267883..64aa94f622 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -75,8 +75,8 @@ in the `spark-defaults.conf` file.
The application web UI at `http://<driver>:4040` lists Spark properties in the "Environment" tab.
This is a useful place to check to make sure that your properties have been set correctly. Note
-that only values explicitly specified through either `spark-defaults.conf` or SparkConf will
-appear. For all other configuration properties, you can assume the default value is used.
+that only values explicitly specified through `spark-defaults.conf`, `SparkConf`, or the command
+line will appear. For all other configuration properties, you can assume the default value is used.
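
To make that precedence concrete, here is a minimal sketch (the app name, master, and memory value are invented for illustration) of setting a property programmatically on a `SparkConf`; the same property could equally come from `spark-defaults.conf` or `spark-submit --conf`:

```
import org.apache.spark.{SparkConf, SparkContext}

// Properties set explicitly here (or via spark-defaults.conf / --conf on the
// spark-submit command line) show up in the web UI's "Environment" tab;
// anything left unset silently falls back to its default.
val conf = new SparkConf()
  .setMaster("local[*]")                     // master set here only for the sketch
  .setAppName("ConfPrecedenceDemo")          // hypothetical app name
  .set("spark.executor.memory", "2g")        // hypothetical value
val sc = new SparkContext(conf)
```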
## Available Properties
@@ -310,7 +310,7 @@ Apart from these, the following properties are also available, and may be useful
<td>(none)</td>
<td>
Add the environment variable specified by <code>EnvironmentVariableName</code> to the Executor
- process. The user can specify multiple of these and to set multiple environment variables.
+ process. The user can specify multiple of these to set multiple environment variables.
</td>
</tr>
<tr>
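
As a small, hypothetical sketch of this property family (the variable names below are invented), each `spark.executorEnv.*` entry becomes one environment variable in the executor process:

```
import org.apache.spark.SparkConf

// Each spark.executorEnv.<NAME> entry sets one environment variable <NAME>
// in every executor process; names and values here are purely illustrative.
val conf = new SparkConf()
  .set("spark.executorEnv.MY_NATIVE_LIB_DIR", "/opt/native/lib")
  .set("spark.executorEnv.MY_LOG_LEVEL", "INFO")
```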
diff --git a/docs/tuning.md b/docs/tuning.md
index c4ca766328..e2fdcfe6a3 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -111,7 +111,7 @@ pointer-based data structures and wrapper objects. There are several ways to do
3. Consider using numeric IDs or enumeration objects instead of strings for keys.
4. If you have less than 32 GB of RAM, set the JVM flag `-XX:+UseCompressedOops` to make pointers be
four bytes instead of eight. You can add these options in
- [`spark-env.sh`](configuration.html#environment-variables-in-spark-envsh).
+ [`spark-env.sh`](configuration.html#environment-variables).
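
A minimal sketch of passing that flag, assuming `spark.executor.extraJavaOptions` is used rather than the `spark-env.sh` route mentioned above:

```
import org.apache.spark.SparkConf

// Sketch: hand -XX:+UseCompressedOops to executor JVMs as an extra Java option;
// editing spark-env.sh, as the text suggests, is an alternative way to do this.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-XX:+UseCompressedOops")
```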
## Serialized RDD Storage
@@ -154,7 +154,7 @@ By default, Spark uses 60% of the configured executor memory (`spark.executor.me
cache RDDs. This means that 40% of memory is available for any objects created during task execution.
In case your tasks slow down and you find that your JVM is garbage-collecting frequently or running out of
-memory, lowering this value will help reduce the memory consumption. To change this to say 50%, you can call
+memory, lowering this value will help reduce the memory consumption. To change this to, say, 50%, you can call
`conf.set("spark.storage.memoryFraction", "0.5")` on your SparkConf. Combined with the use of serialized caching,
using a smaller cache should be sufficient to mitigate most of the garbage collection problems.
In case you are interested in further tuning the Java GC, continue reading below.
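
Putting the two suggestions together, a rough sketch (the master, app name, and input path are placeholders) of a smaller storage fraction combined with serialized caching:

```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: cap the cache at 50% of executor memory and store blocks serialized,
// trading a little CPU for less memory pressure and fewer GC pauses.
val conf = new SparkConf()
  .setMaster("local[*]")                               // for the sketch only
  .setAppName("MemoryFractionDemo")                    // hypothetical name
  .set("spark.storage.memoryFraction", "0.5")
val sc = new SparkContext(conf)

val cached = sc.textFile("hdfs:///path/to/input")      // placeholder path
  .persist(StorageLevel.MEMORY_ONLY_SER)               // serialized caching
```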
@@ -190,7 +190,7 @@ temporary objects created during task execution. Some steps which may be useful
* As an example, if your task is reading data from HDFS, the amount of memory used by the task can be estimated using
the size of the data block read from HDFS. Note that the size of a decompressed block is often 2 or 3 times the
- size of the block. So if we wish to have 3 or 4 tasks worth of working space, and the HDFS block size is 64 MB,
+ size of the block. So if we wish to have 3 or 4 tasks' worth of working space, and the HDFS block size is 64 MB,
 we can estimate the size of Eden to be `4*3*64MB` (a quick numeric sketch follows this list).
* Monitor how the frequency and time taken by garbage collection changes with the new settings.
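
The Eden estimate above, written out as a quick sketch (the task count, decompression factor, and block size are the example's assumptions, not measured values):

```
// Rule of thumb from the text: tasks per executor x decompression factor x
// HDFS block size gives a rough Eden size.
val tasksPerExecutor   = 4
val decompressionRatio = 3
val hdfsBlockMB        = 64
val edenMB = tasksPerExecutor * decompressionRatio * hdfsBlockMB   // 768 MB

// One (assumed, not prescribed) way to apply it would be -Xmn768m among the
// executor's Java options.
println(s"Estimated Eden size: ${edenMB}m")
```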
@@ -219,7 +219,7 @@ working set of one of your tasks, such as one of the reduce tasks in `groupByKey
Spark's shuffle operations (`sortByKey`, `groupByKey`, `reduceByKey`, `join`, etc) build a hash table
within each task to perform the grouping, which can often be large. The simplest fix here is to
*increase the level of parallelism*, so that each task's input set is smaller. Spark can efficiently
-support tasks as short as 200 ms, because it reuses one worker JVMs across all tasks and it has
+support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and it has
a low task launching cost, so you can safely increase the level of parallelism to more than the
number of cores in your clusters.
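
A hedged sketch of raising the parallelism for one such shuffle (the input path and the partition count of 200 are illustrative assumptions, not recommendations):

```
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: give reduceByKey an explicit partition count so each task's hash
// table stays small; spark.default.parallelism would set a cluster-wide default.
val conf = new SparkConf().setMaster("local[*]").setAppName("ParallelismDemo")
val sc = new SparkContext(conf)

val counts = sc.textFile("hdfs:///path/to/logs")       // placeholder input
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _, 200)                             // numPartitions = 200
```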