| author | Prashant Sharma <prashant.s@imaginea.com> | 2014-03-03 22:31:30 -0800 |
|---|---|---|
| committer | Patrick Wendell <pwendell@gmail.com> | 2014-03-03 22:31:30 -0800 |
| commit | 181ec5030792a10f3ce77e997d0e2eda9bcd6139 (patch) | |
| tree | 9b88504e5a3eca8177e4ebe1257ea9ce56120c13 /docs/java-programming-guide.md | |
| parent | b14ede789abfabe25144385e8dc2fb96691aba81 (diff) | |
[java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>
Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:
95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
85a954e [Prashant Sharma] Nit. import orderings.
673f7ac [Prashant Sharma] Added support for -java-home as well
80a13e8 [Prashant Sharma] Used fake class tag syntax
26eb3f6 [Prashant Sharma] Patrick's comments on PR.
35d8d79 [Prashant Sharma] Specified java 8 building in the docs
31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
4ab87d3 [Prashant Sharma] Review feedback on the pr
c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
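The change at the heart of the squashed commits above is turning the `org.apache.spark.api.java.function.*` abstract classes into single-abstract-method (SAM) interfaces, which is exactly what makes them valid Java 8 lambda targets. A minimal, Spark-free sketch of why this matters (the `Function` interface below is a simplified stand-in for Spark's, not the real class):

```java
// Simplified stand-in for org.apache.spark.api.java.function.Function.
// As an interface with one abstract method it is a SAM type, so a lambda
// can implement it; as an abstract class (the pre-1.0 design) it could not.
interface Function<T, R> {
    R call(T t) throws Exception;
}

public class LambdaTarget {
    // Pre-1.0 style: an anonymous inner class implementing call().
    static final Function<String, Integer> ANON = new Function<String, Integer>() {
        public Integer call(String s) {
            return s.length();
        }
    };

    // Java 8 style: a lambda, legal only because Function is now an interface.
    static final Function<String, Integer> LAMBDA = s -> s.length();

    public static void main(String[] args) throws Exception {
        System.out.println(ANON.call("spark"));   // prints 5
        System.out.println(LAMBDA.call("spark")); // prints 5
    }
}
```

Both values have the same static type, so existing anonymous-class call sites keep compiling after the switch to interfaces; only `extends` must become `implements`.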
Diffstat (limited to 'docs/java-programming-guide.md')

-rw-r--r--  docs/java-programming-guide.md | 56

1 file changed, 43 insertions(+), 13 deletions(-)
```diff
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 5c73dbb25e..6632360f6e 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -21,15 +21,21 @@ operations (e.g. map) and handling RDDs of different types, as discussed next.
 
 There are a few key differences between the Java and Scala APIs:
 
-* Java does not support anonymous or first-class functions, so functions must
-  be implemented by extending the
+* Java does not support anonymous or first-class functions, so functions are passed
+  using anonymous classes that implement the
   [`org.apache.spark.api.java.function.Function`](api/core/index.html#org.apache.spark.api.java.function.Function),
   [`Function2`](api/core/index.html#org.apache.spark.api.java.function.Function2), etc.
-  classes.
+  interfaces.
 * To maintain type safety, the Java API defines specialized Function and RDD
   classes for key-value pairs and doubles. For example,
   [`JavaPairRDD`](api/core/index.html#org.apache.spark.api.java.JavaPairRDD)
   stores key-value pairs.
+* Some methods are defined on the basis of the passed anonymous function's
+  (a.k.a lambda expression) return type,
+  for example mapToPair(...) or flatMapToPair returns
+  [`JavaPairRDD`](api/core/index.html#org.apache.spark.api.java.JavaPairRDD),
+  similarly mapToDouble and flatMapToDouble returns
+  [`JavaDoubleRDD`](api/core/index.html#org.apache.spark.api.java.JavaDoubleRDD).
 * RDD methods like `collect()` and `countByKey()` return Java collections types,
   such as `java.util.List` and `java.util.Map`.
 * Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
@@ -53,10 +59,10 @@ each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
 etc (this achieves the "same-result-type" principle used by the [Scala collections
 framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
 
-## Function Classes
+## Function Interfaces
 
-The following table lists the function classes used by the Java API. Each
-class has a single abstract method, `call()`, that must be implemented.
+The following table lists the function interfaces used by the Java API. Each
+interface has a single abstract method, `call()`, that must be implemented.
 
 <table class="table">
   <tr><th>Class</th><th>Function Type</th></tr>
@@ -78,7 +84,6 @@ RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, suc
 declared in the
 [org.apache.spark.api.java.StorageLevels](api/core/index.html#org.apache.spark.api.java.StorageLevels)
 class. To define your own storage level, you can use StorageLevels.create(...).
 
-
 # Other Features
 
@@ -86,6 +91,21 @@ The Java API supports other Spark features, including
 [broadcast variables](scala-programming-guide.html#broadcast-variables), and
 [caching](scala-programming-guide.html#rdd-persistence).
 
+# Upgrading From Pre-1.0 Versions of Spark
+
+In version 1.0 of Spark the Java API was refactored to better support Java 8
+lambda expressions. Users upgrading from older versions of Spark should note
+the following changes:
+
+* All `org.apache.spark.api.java.function.*` have been changed from abstract
+  classes to interfaces. This means that concrete implementations of these
+  `Function` classes will need to use `implements` rather than `extends`.
+* Certain transformation functions now have multiple versions depending
+  on the return type. In Spark core, the map functions (map, flatMap,
+  mapPartitions) have type-specific versions, e.g.
+  [`mapToPair`](api/core/index.html#org.apache.spark.api.java.JavaRDD@mapToPair[K2,V2](f:org.apache.spark.api.java.function.PairFunction[T,K2,V2]):org.apache.spark.api.java.JavaPairRDD[K2,V2])
+  and [`mapToDouble`](api/core/index.html#org.apache.spark.api.java.JavaRDD@mapToDouble[R](f:org.apache.spark.api.java.function.DoubleFunction[T]):org.apache.spark.api.java.JavaDoubleRDD).
+  Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaDStream@transformToPair[K2,V2](transformFunc:org.apache.spark.api.java.function.Function[R,org.apache.spark.api.java.JavaPairRDD[K2,V2]]):org.apache.spark.streaming.api.java.JavaPairDStream[K2,V2]).
 
 # Example
 
@@ -127,11 +147,20 @@ class Split extends FlatMapFunction<String, String> {
 JavaRDD<String> words = lines.flatMap(new Split());
 {% endhighlight %}
 
+Java 8+ users can also write the above `FlatMapFunction` in a more concise way using
+a lambda expression:
+
+{% highlight java %}
+JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));
+{% endhighlight %}
+
+This lambda syntax can be applied to all anonymous classes in Java 8.
+
 Continuing with the word count example, we map each word to a `(word, 1)` pair:
 
 {% highlight java %}
 import scala.Tuple2;
-JavaPairRDD<String, Integer> ones = words.map(
+JavaPairRDD<String, Integer> ones = words.mapToPair(
   new PairFunction<String, String, Integer>() {
     public Tuple2<String, Integer> call(String s) {
       return new Tuple2(s, 1);
@@ -140,7 +169,7 @@ JavaPairRDD<String, Integer> ones = words.map(
 );
 {% endhighlight %}
 
-Note that `map` was passed a `PairFunction<String, String, Integer>` and
+Note that `mapToPair` was passed a `PairFunction<String, String, Integer>` and
 returned a `JavaPairRDD<String, Integer>`.
 
 To finish the word count program, we will use `reduceByKey` to count the
@@ -164,7 +193,7 @@ possible to chain the RDD transformations, so the word count example could also
 be written as:
 
 {% highlight java %}
-JavaPairRDD<String, Integer> counts = lines.flatMap(
+JavaPairRDD<String, Integer> counts = lines.flatMapToPair(
   ...
 ).map(
   ...
@@ -180,10 +209,11 @@ just a matter of style.
 
 We currently provide documentation for the Java API as Scaladoc, in the
 [`org.apache.spark.api.java` package](api/core/index.html#org.apache.spark.api.java.package), because
-some of the classes are implemented in Scala. The main downside is that the types and function
+some of the classes are implemented in Scala. It is important to note that the types and function
 definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of
-`T reduce(Function2<T, T> func)`).
-We hope to generate documentation with Java-style syntax in the future.
+`T reduce(Function2<T, T> func)`). In addition, the Scala `trait` modifier is used for Java
+interface classes. We hope to generate documentation with Java-style syntax in the future to
+avoid these quirks.
 
 # Where to Go from Here
```
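Putting the renamed operators together, the upgraded word count chains one lambda per step: `lines.flatMap(...).mapToPair(...).reduceByKey(...)`. The same lambda shapes can be exercised without a Spark cluster using plain `java.util.stream`; the sketch below is illustrative only (the class name and the use of `Collectors.toMap` as a stand-in for `mapToPair` plus `reduceByKey` are this example's own, not Spark API):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class WordCountSketch {
    // Mirrors the Spark chain:
    //   lines.flatMap(s -> Arrays.asList(s.split(" ")))
    //        .mapToPair(w -> new Tuple2<>(w, 1))
    //        .reduceByKey((a, b) -> a + b)
    static Map<String, Integer> wordCount(List<String> lines) {
        return lines.stream()
            // flatMap step: split each line into words
            .flatMap(s -> Arrays.stream(s.split(" ")))
            // mapToPair + reduceByKey step: (word, 1) pairs merged by summing
            .collect(Collectors.toMap(w -> w, w -> 1, (a, b) -> a + b));
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = wordCount(Arrays.asList("to be or", "not to be"));
        System.out.println(counts.get("to")); // prints 2
        System.out.println(counts.get("be")); // prints 2
        System.out.println(counts.get("or")); // prints 1
    }
}
```

Each lambda here targets a SAM interface (`java.util.function.Function`, `BinaryOperator`) exactly the way the Spark lambdas target `FlatMapFunction`, `PairFunction`, and `Function2` after this refactoring.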