From c5754bb9399a59c4a83d28e618fea87900aa8f8a Mon Sep 17 00:00:00 2001
From: Matei Zaharia
Date: Tue, 25 Sep 2012 23:51:04 -0700
Subject: Fixes to Java guide

---
 docs/css/main.css               |  4 ++
 docs/java-programming-guide.md  | 90 ++++++++++++++++++++++++-----------------
 docs/scala-programming-guide.md |  7 +++-
 3 files changed, 63 insertions(+), 38 deletions(-)
(limited to 'docs')

diff --git a/docs/css/main.css b/docs/css/main.css
index 8c2dc74029..c8aaa8ad22 100755
--- a/docs/css/main.css
+++ b/docs/css/main.css
@@ -48,6 +48,10 @@ code {
   color: #902000;
 }
 
+a code {
+  color: #0088cc;
+}
+
 pre {
   font-family: "Menlo", "Lucida Console", monospace;
 }

diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 546d69bfe5..2411e07849 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -3,22 +3,22 @@ layout: global
 title: Java Programming Guide
 ---
 
-The Spark Java API
-([spark.api.java]({{HOME_PATH}}api/core/index.html#spark.api.java.package)) defines
-[`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) and
-[`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
-which support
-the same methods as their Scala counterparts but take Java functions and return
-Java data and collection types.
-
-Because Java API is similar to the Scala API, this programming guide only
-covers Java-specific features;
-the [Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html)
-provides a more general introduction to Spark concepts and should be read
-first.
-
-
-# Key differences in the Java API
+The Spark Java API exposes all the Spark features available in the Scala version to Java.
+To learn the basics of Spark, we recommend reading through the
+[Scala Programming Guide]({{HOME_PATH}}scala-programming-guide.html) first; it should be
+easy to follow even if you don't know Scala.
+This guide will show how to use the Spark features described there in Java.
+
+The Spark Java API is defined in the
+[`spark.api.java`]({{HOME_PATH}}api/core/index.html#spark.api.java.package) package, and includes
+a [`JavaSparkContext`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaSparkContext) for
+initializing Spark and [`JavaRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaRDD) classes,
+which support the same methods as their Scala counterparts but take Java functions and return
+Java data and collection types. The main differences have to do with passing functions to RDD
+operations (e.g. map) and handling RDDs of different types, as discussed next.
+
+# Key Differences in the Java API
+
 There are a few key differences between the Java and Scala APIs:
 
@@ -27,21 +27,25 @@ There are a few key differences between the Java and Scala APIs:
 * Java does not support anonymous or first-class functions, so functions must
   be implemented by extending the
   [`Function`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function),
   [`Function2`]({{HOME_PATH}}api/core/index.html#spark.api.java.function.Function2),
   etc. classes.
 * To maintain type safety, the Java API defines specialized Function and RDD
-  classes for key-value pairs and doubles.
-* RDD methods like `collect` and `countByKey` return Java collections types,
+  classes for key-value pairs and doubles. For example,
+  [`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
+  stores key-value pairs.
+* RDD methods like `collect()` and `countByKey()` return Java collection types,
   such as `java.util.List` and `java.util.Map`.
-
+* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
+  by the `scala.Tuple2` class, and need to be created using `new Tuple2(key, value)`
 
 ## RDD Classes
 
-Spark defines additional operations on RDDs of doubles and key-value pairs, such
-as `stdev` and `join`.
+
+Spark defines additional operations on RDDs of key-value pairs and doubles, such
+as `reduceByKey`, `join`, and `stdev`.
 In the Scala API, these methods are automatically added using Scala's
 [implicit conversions](http://www.scala-lang.org/node/130) mechanism.
-In the Java API, the extra methods are defined in
-[`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD) and
+In the Java API, the extra methods are defined in the [`JavaPairRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaPairRDD)
+and [`JavaDoubleRDD`]({{HOME_PATH}}api/core/index.html#spark.api.java.JavaDoubleRDD)
 classes. RDD methods like `map` are overloaded by specialized `PairFunction`
 and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
 types. Common methods like `filter` and `sample` are implemented by
@@ -57,22 +61,25 @@ class has a single abstract method, `call()`, that must be implemented.
 
 <table class="table">
 <tr><th>Class</th><th>Function Type</th></tr>
 
-<tr><td>Function&lt;T, R&gt;</td><td>T -&gt; R</td></tr>
-<tr><td>DoubleFunction&lt;T&gt;</td><td>T -&gt; Double</td></tr>
-<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T -&gt; Tuple2&lt;K, V&gt;</td></tr>
+<tr><td>Function&lt;T, R&gt;</td><td>T =&gt; R</td></tr>
+<tr><td>DoubleFunction&lt;T&gt;</td><td>T =&gt; Double</td></tr>
+<tr><td>PairFunction&lt;T, K, V&gt;</td><td>T =&gt; Tuple2&lt;K, V&gt;</td></tr>
 
-<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T -&gt; Iterable&lt;R&gt;</td></tr>
-<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T -&gt; Iterable&lt;Double&gt;</td></tr>
-<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T -&gt; Iterable&lt;Tuple2&lt;K, V&gt;&gt;</td></tr>
+<tr><td>FlatMapFunction&lt;T, R&gt;</td><td>T =&gt; Iterable&lt;R&gt;</td></tr>
+<tr><td>DoubleFlatMapFunction&lt;T&gt;</td><td>T =&gt; Iterable&lt;Double&gt;</td></tr>
+<tr><td>PairFlatMapFunction&lt;T, K, V&gt;</td><td>T =&gt; Iterable&lt;Tuple2&lt;K, V&gt;&gt;</td></tr>
 
-<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 -&gt; R (function of two arguments)</td></tr>
+<tr><td>Function2&lt;T1, T2, R&gt;</td><td>T1, T2 =&gt; R (function of two arguments)</td></tr>
 </table>
 
+
 # Other Features
+
 The Java API supports other Spark features, including
 [accumulators]({{HOME_PATH}}scala-programming-guide.html#accumulators),
-[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast_variables), and
-[caching]({{HOME_PATH}}scala-programming-guide.html#caching).
+[broadcast variables]({{HOME_PATH}}scala-programming-guide.html#broadcast-variables), and
+[caching]({{HOME_PATH}}scala-programming-guide.html#rdd-persistence).
+
 
 # Example
 
@@ -130,8 +137,6 @@ JavaPairRDD<String, Integer> ones = words.map(
 
 Note that `map` was passed a `PairFunction` and returned a `JavaPairRDD`.
 
-
-
 To finish the word count program, we will use `reduceByKey` to count the
 occurrences of each word:
 
@@ -161,12 +166,23 @@ JavaPairRDD<String, Integer> counts = lines.flatMap(
   ...
 );
 {% endhighlight %}
+
 There is no performance difference between these approaches; the choice is
-a matter of style.
+just a matter of style.
+
+# Javadoc
+
+We currently provide documentation for the Java API as Scaladoc, in the
+[`spark.api.java` package]({{HOME_PATH}}api/core/index.html#spark.api.java.package), because
+some of the classes are implemented in Scala. The main downside is that the types and function
+definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of
+`T reduce(Function2 func)`).
+We hope to generate documentation with Java-style syntax in the future.
+
+# Where to Go from Here
 
-# Where to go from here
-Spark includes several sample jobs using the Java API in
+Spark includes several sample programs using the Java API in
 `examples/src/main/java`. You can run them by passing the class name to the
 `run` script included in Spark -- for example, `./run
 spark.examples.JavaWordCount`. Each example program prints usage help when run

diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 1936c1969d..9a97736b6b 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -205,6 +205,10 @@ The following tables list the transformations and actions currently supported (s
   <td> saveAsSequenceFile(path) </td>
   <td> Write the elements of the dataset as a Hadoop SequenceFile in a given path in the local filesystem, HDFS or any other Hadoop-supported file system. This is only available on RDDs of key-value pairs that either implement Hadoop's Writable interface or are implicitly convertible to Writable (Spark includes conversions for basic types like Int, Double, String, etc). </td>
 </tr>
+<tr>
+  <td> countByKey() </td>
+  <td> Only available on RDDs of type (K, V). Returns a `Map` of (K, Int) pairs with the count of each key. </td>
+</tr>
 <tr>
   <td> foreach(func) </td>
   <td> Run a function func on each element of the dataset. This is usually done for side effects such as updating an accumulator variable (see below) or interacting with external storage systems. </td>
@@ -273,6 +277,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
 As you can see, Spark supports a variety of storage levels that give different
 tradeoffs between memory usage and CPU efficiency. We recommend going through
 the following process to select one:
+
 * If your RDDs fit comfortably with the default storage level (`MEMORY_ONLY_DESER`), leave them that way.
   This is the most CPU-efficient option, allowing operations on the RDDs to run as fast as possible.
 * If not, try using `MEMORY_ONLY` and [selecting a fast serialization library]({{HOME_PATH}}tuning.html) to make the objects
@@ -329,4 +334,4 @@ res2: Int = 10
 
 You can see some [example Spark programs](http://www.spark-project.org/examples.html) on the Spark website.
 
-In addition, Spark includes several sample jobs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them using by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
+In addition, Spark includes several sample programs in `examples/src/main/scala`. Some of them have both Spark versions and local (non-parallel) versions, allowing you to see what had to be changed to make the program run on a cluster. You can run them by passing the class name to the `run` script included in Spark -- for example, `./run spark.examples.SparkPi`. Each example program prints usage help when run without any arguments.
--
cgit v1.2.3
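The Example section of the Java guide appears above only in the fragments that the patch touches. For readers who want to see the pieces assembled, the following is a sketch of a complete word count in the style the guide describes. The class name, the `local` master argument in the two-argument `JavaSparkContext` constructor, and reading the input path from `args[0]` are illustrative assumptions, not anything specified by the commit:

{% highlight java %}
import java.util.Arrays;
import java.util.List;

import scala.Tuple2;

import spark.api.java.JavaPairRDD;
import spark.api.java.JavaRDD;
import spark.api.java.JavaSparkContext;
import spark.api.java.function.FlatMapFunction;
import spark.api.java.function.Function2;
import spark.api.java.function.PairFunction;

public class WordCountSketch {
  public static void main(String[] args) {
    // Master URL and job name are illustrative; adjust them for your setup.
    JavaSparkContext sc = new JavaSparkContext("local", "WordCountSketch");
    JavaRDD<String> lines = sc.textFile(args[0]);

    // FlatMapFunction<T, R> (T => Iterable<R>): split each line into words.
    JavaRDD<String> words = lines.flatMap(new FlatMapFunction<String, String>() {
      public Iterable<String> call(String line) {
        return Arrays.asList(line.split(" "));
      }
    });

    // PairFunction<T, K, V> (T => Tuple2<K, V>): emit a (word, 1) pair per word.
    // Passing a PairFunction makes this overload of map return a JavaPairRDD.
    JavaPairRDD<String, Integer> ones = words.map(new PairFunction<String, String, Integer>() {
      public Tuple2<String, Integer> call(String w) {
        return new Tuple2<String, Integer>(w, 1);
      }
    });

    // Function2<T1, T2, R> (T1, T2 => R): sum the counts for each word.
    JavaPairRDD<String, Integer> counts = ones.reduceByKey(new Function2<Integer, Integer, Integer>() {
      public Integer call(Integer a, Integer b) {
        return a + b;
      }
    });

    // collect() returns a java.util.List, as the guide notes.
    List<Tuple2<String, Integer>> output = counts.collect();
    for (Tuple2<String, Integer> pair : output) {
      System.out.println(pair._1() + ": " + pair._2());
    }
  }
}
{% endhighlight %}

As the guide points out, the anonymous inner classes could equally be replaced by named top-level classes extending `FlatMapFunction`, `PairFunction`, and `Function2`; the choice is just a matter of style.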