author     Matei Zaharia <matei@databricks.com>    2014-05-30 00:34:33 -0700
committer  Patrick Wendell <pwendell@gmail.com>    2014-05-30 00:34:33 -0700
commit     c8bf4131bc2a2e147e977159fc90e94b85738830 (patch)
tree       a2f885df8fb6654bd7750bb344b97a6cb6889bf3 /docs/java-programming-guide.md
parent     eeee978a348ec2a35cc27865cea6357f9db75b74 (diff)
[SPARK-1566] consolidate programming guide, and general doc updates
This is a fairly large PR to clean up and update the docs for 1.0. The major changes are:
* A unified programming guide for all languages replaces language-specific ones and shows language-specific info in tabs
* New programming guide sections on key-value pairs, unit testing, input formats beyond text, migrating from 0.9, and passing functions to Spark
* Spark-submit guide moved to a separate page and expanded slightly
* Various cleanups of the menu system, security docs, and others
* Updated look of title bar to differentiate the docs from previous Spark versions
You can find the updated docs at http://people.apache.org/~matei/1.0-docs/_site/ and in particular http://people.apache.org/~matei/1.0-docs/_site/programming-guide.html.
Author: Matei Zaharia <matei@databricks.com>
Closes #896 from mateiz/1.0-docs and squashes the following commits:
03e6853 [Matei Zaharia] Some tweaks to configuration and YARN docs
0779508 [Matei Zaharia] tweak
ef671d4 [Matei Zaharia] Keep frames in JavaDoc links, and other small tweaks
1bf4112 [Matei Zaharia] Review comments
4414f88 [Matei Zaharia] tweaks
d04e979 [Matei Zaharia] Fix some old links to Java guide
a34ed33 [Matei Zaharia] tweak
541bb3b [Matei Zaharia] miscellaneous changes
fcefdec [Matei Zaharia] Moved submitting apps to separate doc
61d72b4 [Matei Zaharia] stuff
181f217 [Matei Zaharia] migration guide, remove old language guides
e11a0da [Matei Zaharia] Add more API functions
6a030a9 [Matei Zaharia] tweaks
8db0ae3 [Matei Zaharia] Added key-value pairs section
318d2c9 [Matei Zaharia] tweaks
1c81477 [Matei Zaharia] New section on basics and function syntax
e38f559 [Matei Zaharia] Actually added programming guide to Git
a33d6fe [Matei Zaharia] First pass at updating programming guide to support all languages, plus other tweaks throughout
3b6a876 [Matei Zaharia] More CSS tweaks
01ec8bf [Matei Zaharia] More CSS tweaks
e6d252e [Matei Zaharia] Change color of doc title bar to differentiate from 0.9.0
Diffstat (limited to 'docs/java-programming-guide.md')
-rw-r--r--  docs/java-programming-guide.md  215
1 file changed, 2 insertions(+), 213 deletions(-)
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 943fdd9d01..bb539582f6 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -1,218 +1,7 @@
 ---
 layout: global
 title: Java Programming Guide
+redirect: programming-guide.html
 ---
-The Spark Java API exposes all the Spark features available in the Scala version to Java.
-To learn the basics of Spark, we recommend reading through the
-[Scala programming guide](scala-programming-guide.html) first; it should be
-easy to follow even if you don't know Scala.
-This guide will show how to use the Spark features described there in Java.
-
-The Spark Java API is defined in the
-[`org.apache.spark.api.java`](api/java/index.html?org/apache/spark/api/java/package-summary.html) package, and includes
-a [`JavaSparkContext`](api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html) for
-initializing Spark and [`JavaRDD`](api/java/index.html?org/apache/spark/api/java/JavaRDD.html) classes,
-which support the same methods as their Scala counterparts but take Java functions and return
-Java data and collection types. The main differences have to do with passing functions to RDD
-operations (e.g. map) and handling RDDs of different types, as discussed next.
-
-# Key Differences in the Java API
-
-There are a few key differences between the Java and Scala APIs:
-
-* Java does not support anonymous or first-class functions, so functions are passed
-  using anonymous classes that implement the
-  [`org.apache.spark.api.java.function.Function`](api/java/index.html?org/apache/spark/api/java/function/Function.html),
-  [`Function2`](api/java/index.html?org/apache/spark/api/java/function/Function2.html), etc.
-  interfaces.
-* To maintain type safety, the Java API defines specialized Function and RDD
-  classes for key-value pairs and doubles. For example,
-  [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
-  stores key-value pairs.
-* Some methods are defined on the basis of the passed function's return type.
-  For example `mapToPair()` returns
-  [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html),
-  and `mapToDouble()` returns
-  [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html).
-* RDD methods like `collect()` and `countByKey()` return Java collection types,
-  such as `java.util.List` and `java.util.Map`.
-* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
-  by the `scala.Tuple2` class, and need to be created using `new Tuple2<K, V>(key, value)`.
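To make the first and last of these differences concrete, here is a minimal sketch (an illustration added to this write-up, not part of the original guide; `lines` is an assumed `JavaRDD<String>`):

{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;
import scala.Tuple2;

// Passing a function to an RDD operation as an anonymous class
// implementing Function<T, R>; `lines` is an assumed JavaRDD<String>.
JavaRDD<Integer> lengths = lines.map(
  new Function<String, Integer>() {
    @Override public Integer call(String s) { return s.length(); }
  }
);

// Key-value pairs are scala.Tuple2 values on the Java side; the
// elements are read back with the _1() and _2() accessors.
Tuple2<String, Integer> pair = new Tuple2<String, Integer>("spark", 1);
String key = pair._1();
Integer value = pair._2();
{% endhighlight %}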
-
-## RDD Classes
-
-Spark defines additional operations on RDDs of key-value pairs and doubles, such
-as `reduceByKey`, `join`, and `stdev`.
-
-In the Scala API, these methods are automatically added using Scala's
-[implicit conversions](http://www.scala-lang.org/node/130) mechanism.
-
-In the Java API, the extra methods are defined in the
-[`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
-and [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html)
-classes. RDD methods like `map` are overloaded by specialized `PairFunction`
-and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
-types. Common methods like `filter` and `sample` are implemented by
-each specialized RDD class, so filtering a `PairRDD` returns a new `PairRDD`,
-etc. (this achieves the "same-result-type" principle used by the [Scala collections
-framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-collections.html)).
-
-## Function Interfaces
-
-The following table lists the function interfaces used by the Java API, located in the
-[`org.apache.spark.api.java.function`](api/java/index.html?org/apache/spark/api/java/function/package-summary.html)
-package. Each interface has a single abstract method, `call()`.
-
-<table class="table">
-<tr><th>Class</th><th>Function Type</th></tr>
-
-<tr><td>Function<T, R></td><td>T => R</td></tr>
-<tr><td>DoubleFunction<T></td><td>T => Double</td></tr>
-<tr><td>PairFunction<T, K, V></td><td>T => Tuple2<K, V></td></tr>
-
-<tr><td>FlatMapFunction<T, R></td><td>T => Iterable<R></td></tr>
-<tr><td>DoubleFlatMapFunction<T></td><td>T => Iterable<Double></td></tr>
-<tr><td>PairFlatMapFunction<T, K, V></td><td>T => Iterable<Tuple2<K, V>></td></tr>
-
-<tr><td>Function2<T1, T2, R></td><td>T1, T2 => R (function of two arguments)</td></tr>
-</table>
-
-## Storage Levels
-
-RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
-declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
-define your own storage level, you can use StorageLevels.create(...).
-
-# Other Features
-
-The Java API supports other Spark features, including
-[accumulators](scala-programming-guide.html#accumulators),
-[broadcast variables](scala-programming-guide.html#broadcast-variables), and
-[caching](scala-programming-guide.html#rdd-persistence).
-
-# Upgrading From Pre-1.0 Versions of Spark
-
-In version 1.0 of Spark the Java API was refactored to better support Java 8
-lambda expressions. Users upgrading from older versions of Spark should note
-the following changes:
-
-* All `org.apache.spark.api.java.function.*` have been changed from abstract
-  classes to interfaces. This means that concrete implementations of these
-  `Function` classes will need to use `implements` rather than `extends`.
-* Certain transformation functions now have multiple versions depending
-  on the return type. In Spark core, the map functions (`map`, `flatMap`, and
-  `mapPartitions`) have type-specific versions, e.g.
-  [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
-  and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
-  Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
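To make the first of these changes concrete, a small sketch (an editorial illustration, not part of the original guide):

{% highlight java %}
import org.apache.spark.api.java.function.Function;

// Pre-1.0, the function types were abstract classes, so user code wrote
// `extends Function<...>`; from 1.0 on they are interfaces, so a named
// implementation uses `implements` instead:
class GetLength implements Function<String, Integer> {
  @Override public Integer call(String s) {
    return s.length();
  }
}
{% endhighlight %}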
-
-# Example
-
-As an example, we will implement word count using the Java API.
-
-{% highlight java %}
-import org.apache.spark.api.java.*;
-import org.apache.spark.api.java.function.*;
-
-JavaSparkContext jsc = new JavaSparkContext(...);
-JavaRDD<String> lines = jsc.textFile("hdfs://...");
-JavaRDD<String> words = lines.flatMap(
-  new FlatMapFunction<String, String>() {
-    @Override public Iterable<String> call(String s) {
-      return Arrays.asList(s.split(" "));
-    }
-  }
-);
-{% endhighlight %}
-
-The word count program starts by creating a `JavaSparkContext`, which accepts
-the same parameters as its Scala counterpart. `JavaSparkContext` supports the
-same data loading methods as the regular `SparkContext`; here, `textFile`
-loads lines from text files stored in HDFS.
-
-To split the lines into words, we use `flatMap` to split each line on
-whitespace. `flatMap` is passed a `FlatMapFunction` that accepts a string and
-returns a `java.lang.Iterable` of strings.
-
-Here, the `FlatMapFunction` was created inline; another option is to implement
-`FlatMapFunction` in a named class and pass an instance to `flatMap`:
-
-{% highlight java %}
-class Split implements FlatMapFunction<String, String> {
-  @Override public Iterable<String> call(String s) {
-    return Arrays.asList(s.split(" "));
-  }
-}
-JavaRDD<String> words = lines.flatMap(new Split());
-{% endhighlight %}
-
-Java 8+ users can also write the above `FlatMapFunction` in a more concise way using
-a lambda expression:
-
-{% highlight java %}
-JavaRDD<String> words = lines.flatMap(s -> Arrays.asList(s.split(" ")));
-{% endhighlight %}
-
-This lambda syntax can be applied to all anonymous classes in Java 8.
-
-Continuing with the word count example, we map each word to a `(word, 1)` pair:
-
-{% highlight java %}
-import scala.Tuple2;
-JavaPairRDD<String, Integer> ones = words.mapToPair(
-  new PairFunction<String, String, Integer>() {
-    @Override public Tuple2<String, Integer> call(String s) {
-      return new Tuple2<String, Integer>(s, 1);
-    }
-  }
-);
-{% endhighlight %}
-
-Note that `mapToPair` was passed a `PairFunction<String, String, Integer>` and
-returned a `JavaPairRDD<String, Integer>`.
-
-To finish the word count program, we will use `reduceByKey` to count the
-occurrences of each word:
-
-{% highlight java %}
-JavaPairRDD<String, Integer> counts = ones.reduceByKey(
-  new Function2<Integer, Integer, Integer>() {
-    @Override public Integer call(Integer i1, Integer i2) {
-      return i1 + i2;
-    }
-  }
-);
-{% endhighlight %}
-
-Here, `reduceByKey` is passed a `Function2`, which implements a function with
-two arguments. The resulting `JavaPairRDD` contains `(word, count)` pairs.
-
-In this example, we explicitly showed each intermediate RDD. It is also
-possible to chain the RDD transformations, so the word count example could also
-be written as:
-
-{% highlight java %}
-JavaPairRDD<String, Integer> counts = lines.flatMap(
-    ...
-  ).mapToPair(
-    ...
-  ).reduceByKey(
-    ...
-  );
-{% endhighlight %}
-
-There is no performance difference between these approaches; the choice is
-just a matter of style.
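For comparison, here is a sketch of the same chained pipeline written with Java 8 lambdas (an addition to this write-up, not from the original guide; it assumes the `lines` RDD from the example above and the pre-2.0 `FlatMapFunction` signature, which returns an `Iterable`):

{% highlight java %}
import java.util.Arrays;
import scala.Tuple2;

// The full word count pipeline, chained, with each anonymous class
// replaced by a lambda over the same single-abstract-method interfaces.
JavaPairRDD<String, Integer> counts = lines
  .flatMap(s -> Arrays.asList(s.split(" ")))
  .mapToPair(w -> new Tuple2<String, Integer>(w, 1))
  .reduceByKey((i1, i2) -> i1 + i2);
{% endhighlight %}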
-
-# API Docs
-
-[API documentation](api/java/index.html) for Spark in Java is available in Javadoc format.
-
-# Where to Go from Here
-
-Spark includes several sample programs using the Java API in
-[`examples/src/main/java`](https://github.com/apache/spark/tree/master/examples/src/main/java/org/apache/spark/examples).
-You can run them by passing the class name to the
-`bin/run-example` script included in Spark; for example:
-
-    ./bin/run-example JavaWordCount README.md
+This document has been merged into the [Spark programming guide](programming-guide.html).