author    Matei Zaharia <matei@eecs.berkeley.edu>  2012-09-25 23:51:04 -0700
committer Matei Zaharia <matei@eecs.berkeley.edu>  2012-09-25 23:51:04 -0700
commit    c5754bb9399a59c4a83d28e618fea87900aa8f8a
tree      98e64c44646814907ec87d7f93e89f96653705a8  /docs/scala-programming-guide.md
parent    f1246cc7c18bd0c155f920f4dc593e88147a94e4
Fixes to Java guide
Diffstat (limited to 'docs/scala-programming-guide.md')
-rw-r--r--  docs/scala-programming-guide.md | 7
1 file changed, 6 insertions, 1 deletion
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 1936c1969d..9a97736b6b 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -206,6 +206,10 @@ The following tables list the transformations and actions currently supported (s
<td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
</tr>
<tr>
+ <td> <b>countByKey</b>() </td>
+ <td> Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. </td>
+</tr>
+<tr>
<td> <b>foreach</b>(<i>func</i>) </td>
<td> Run a function <i>func</i> on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. </td>
</tr>
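
To illustrate the action this hunk documents, here is a minimal sketch of `countByKey` (assuming a `SparkContext` named `sc` and the `spark` package layout of this era; depending on the version, the counts may come back as `Long` rather than `Int`):

```scala
import spark.SparkContext._  // implicit conversions that add the pair-RDD operations

// Hypothetical input: countByKey tallies how often each key appears,
// ignoring the values.
val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
val counts = pairs.countByKey()  // e.g. Map(a -> 2, b -> 1)
```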
@@ -273,6 +277,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
As you can see, Spark supports a variety of storage levels that give different tradeoffs between memory usage
and CPU efficiency. We recommend going through the following process to select one:
+
* If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY_DESER`), leave them that way. This is the most
CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
* If not, try using `MEMORY_ONLY` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
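
As a sketch of the selection process above (assuming a `SparkContext` named `sc` and the `spark.storage.StorageLevel` constants this version of the guide references):

```scala
import spark.storage.StorageLevel

val lines = sc.textFile("hdfs://...")          // placeholder path
lines.persist(StorageLevel.MEMORY_ONLY_DESER)  // the default: deserialized objects in memory
// If the dataset does not fit in memory this way, trade CPU for space
// with serialized in-memory storage instead:
// lines.persist(StorageLevel.MEMORY_ONLY)
```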
@@ -329,4 +334,4 @@ res2: Int = 10
You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website.
-In addition, Spark includes several sample jobs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them using by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
+In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.