From fc7838470465474f777bd17791c1bb5f9c348521 Mon Sep 17 00:00:00 2001 From: Matei Zaharia Date: Mon, 21 Apr 2014 21:57:40 -0700 Subject: [SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is a SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one. Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/ Author: Matei Zaharia This patch had conflicts when merged, resolved by Committer: Patrick Wendell Closes #457 from mateiz/better-docs and squashes the following commits: a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package 5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load f05abc0 [Matei Zaharia] Don't include java.lang package names 995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc a14a93c [Matei Zaharia] typo 76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced --- docs/_layouts/global.html | 105 ++++++-------------------------- docs/_plugins/copy_api_dirs.rb | 65 ++++++++++---------- docs/api.md | 13 ++-- docs/configuration.md | 8 +-- docs/graphx-programming-guide.md | 62 +++++++++---------- docs/index.md | 10 +-- docs/java-programming-guide.md | 55 ++++++++--------- docs/js/main.js | 18 ++++++ docs/mllib-classification-regression.md | 14 ++--- docs/mllib-clustering.md | 2 +- docs/mllib-collaborative-filtering.md | 2 +- docs/mllib-guide.md | 10 +-- docs/mllib-optimization.md | 8 +-- docs/python-programming-guide.md | 4 +- docs/quick-start.md | 6 +- docs/scala-programming-guide.md | 10 +-- docs/sql-programming-guide.md | 20 +++--- docs/streaming-custom-receivers.md | 4 +- docs/streaming-programming-guide.md | 56 ++++++++--------- docs/tuning.md | 4 +- 20 files changed, 207 insertions(+), 269 deletions(-) (limited to 'docs') diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index 5d4dbb7a9c..8b543de574 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -76,32 +76,9 @@ @@ -140,33 +117,6 @@

{{ page.title }}

{{ content }} - - - - - @@ -174,42 +124,23 @@ - + + + - - - - - diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb index 05f0bd47a8..2dbbbf6feb 100644 --- a/docs/_plugins/copy_api_dirs.rb +++ b/docs/_plugins/copy_api_dirs.rb @@ -20,47 +20,48 @@ include FileUtils if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1') # Build Scaladoc for Java/Scala - core_projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"] - external_projects = ["flume", "kafka", "mqtt", "twitter", "zeromq"] - sql_projects = ["catalyst", "core", "hive"] - projects = core_projects - projects = projects + external_projects.map { |project_name| "external/" + project_name } - projects = projects + sql_projects.map { |project_name| "sql/" + project_name } - - puts "Moving to project root and building scaladoc." + puts "Moving to project root and building API docs." curr_dir = pwd cd("..") - puts "Running 'sbt/sbt doc hive/doc' from " + pwd + "; this may take a few minutes..." - puts `sbt/sbt doc hive/doc` + puts "Running 'sbt/sbt compile unidoc' from " + pwd + "; this may take a few minutes..." + puts `sbt/sbt compile unidoc` puts "Moving back into docs dir." cd("docs") - # Copy over the scaladoc from each project into the docs directory. + # Copy over the unified ScalaDoc for all projects to api/scala. # This directory will be copied over to _site when `jekyll` command is run. - projects.each do |project_name| - source = "../" + project_name + "/target/scala-2.10/api" - dest = "api/" + project_name + source = "../target/scala-2.10/unidoc" + dest = "api/scala" + + puts "Making directory " + dest + mkdir_p dest + + # From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't. + puts "cp -r " + source + "/. " + dest + cp_r(source + "/.", dest) + + # Append custom JavaScript + js = File.readlines("./js/api-docs.js") + js_file = dest + "/lib/template.js" + File.open(js_file, 'a') { |f| f.write("\n" + js.join()) } - puts "making directory " + dest - mkdir_p dest + # Append custom CSS + css = File.readlines("./css/api-docs.css") + css_file = dest + "/lib/template.css" + File.open(css_file, 'a') { |f| f.write("\n" + css.join()) } - # From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't. - puts "cp -r " + source + "/. " + dest - cp_r(source + "/.", dest) + # Copy over the unified JavaDoc for all projects to api/java. + source = "../target/javaunidoc" + dest = "api/java" - # Append custom JavaScript - js = File.readlines("./js/api-docs.js") - js_file = dest + "/lib/template.js" - File.open(js_file, 'a') { |f| f.write("\n" + js.join()) } + puts "Making directory " + dest + mkdir_p dest - # Append custom CSS - css = File.readlines("./css/api-docs.css") - css_file = dest + "/lib/template.css" - File.open(css_file, 'a') { |f| f.write("\n" + css.join()) } - end + puts "cp -r " + source + "/. " + dest + cp_r(source + "/.", dest) # Build Epydoc for Python puts "Moving to python directory and building epydoc." @@ -70,11 +71,11 @@ if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1') puts "Moving back into docs dir." cd("../docs") - puts "echo making directory pyspark" - mkdir_p "pyspark" + puts "Making directory api/python" + mkdir_p "api/python" - puts "cp -r ../python/docs/. api/pyspark" - cp_r("../python/docs/.", "api/pyspark") + puts "cp -r ../python/docs/. 
api/python" + cp_r("../python/docs/.", "api/python") cd("..") end diff --git a/docs/api.md b/docs/api.md index 91c8e51d26..0346038333 100644 --- a/docs/api.md +++ b/docs/api.md @@ -1,13 +1,10 @@ --- layout: global -title: Spark API documentation (Scaladoc) +title: Spark API Documentation --- -Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory. +Here you can find API docs for Spark and its submodules. -- [Spark](api/core/index.html) -- [Spark Examples](api/examples/index.html) -- [Spark Streaming](api/streaming/index.html) -- [Bagel](api/bagel/index.html) -- [GraphX](api/graphx/index.html) -- [PySpark](api/pyspark/index.html) +- [Spark Scala API (Scaladoc)](api/scala/index.html) +- [Spark Java API (Javadoc)](api/java/index.html) +- [Spark Python API (Epydoc)](api/python/index.html) diff --git a/docs/configuration.md b/docs/configuration.md index 5a4abca264..e7e1dd56cf 100644 --- a/docs/configuration.md +++ b/docs/configuration.md @@ -6,7 +6,7 @@ title: Spark Configuration Spark provides three locations to configure the system: * [Spark properties](#spark-properties) control most application parameters and can be set by passing - a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java + a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java system properties. * [Environment variables](#environment-variables) can be used to set per-machine settings, such as the IP address, through the `conf/spark-env.sh` script on each node. @@ -16,7 +16,7 @@ Spark provides three locations to configure the system: # Spark Properties Spark properties control most application settings and are configured separately for each application. -The preferred way to set them is by passing a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) +The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) class to your SparkContext constructor. Alternatively, Spark will also load them from Java system properties, for compatibility with old versions of Spark. @@ -53,7 +53,7 @@ there are at least five properties that you will commonly want to control: in serialized form. The default of Java serialization works with any Serializable Java object but is quite slow, so we recommend using org.apache.spark.serializer.KryoSerializer and configuring Kryo serialization when speed is necessary. Can be any subclass of - org.apache.spark.Serializer. + org.apache.spark.Serializer. @@ -62,7 +62,7 @@ there are at least five properties that you will commonly want to control: If you use Kryo serialization, set this class to register your custom classes with Kryo. It should be set to a class that extends - KryoRegistrator. + KryoRegistrator. See the tuning guide for more details. diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md index 1238e3e0a4..07be8ba58e 100644 --- a/docs/graphx-programming-guide.md +++ b/docs/graphx-programming-guide.md @@ -17,7 +17,7 @@ title: GraphX Programming Guide # Overview GraphX is the new (alpha) Spark API for graphs and graph-parallel computation.
At a high-level, -GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the +GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing the [Resilient Distributed Property Graph](#property_graph): a directed multigraph with properties attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and @@ -82,7 +82,7 @@ Prior to the release of GraphX, graph computation in Spark was expressed using B implementation of Pregel. GraphX improves upon Bagel by exposing a richer property graph API, a more streamlined version of the Pregel abstraction, and system optimizations to improve performance and reduce memory overhead. While we plan to eventually deprecate Bagel, we will continue to -support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and +support the [Bagel API](api/scala/index.html#org.apache.spark.bagel.package) and [Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to explore the new GraphX API and comment on issues that may complicate the transition from Bagel. @@ -103,7 +103,7 @@ getting started with Spark refer to the [Spark Quick Start Guide](quick-start.ht # The Property Graph -The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed multigraph +The [property graph](api/scala/index.html#org.apache.spark.graphx.Graph) is a directed multigraph with user defined objects attached to each vertex and edge. A directed multigraph is a directed graph with potentially multiple parallel edges sharing the same source and destination vertex. The ability to support parallel edges simplifies modeling scenarios where there can be multiple @@ -179,7 +179,7 @@ val userGraph: Graph[(String, String), String] There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic generators and these are discussed in more detail in the section on [graph builders](#graph_builders). Probably the most general method is to use the -[Graph object](api/graphx/index.html#org.apache.spark.graphx.Graph$). For example the following +[Graph object](api/scala/index.html#org.apache.spark.graphx.Graph$). For example the following code constructs a graph from a collection of RDDs: {% highlight scala %} @@ -203,7 +203,7 @@ In the above example we make use of the [`Edge`][Edge] case class. Edges have a `dstId` corresponding to the source and destination vertex identifiers. In addition, the `Edge` class has an `attr` member which stores the edge property. -[Edge]: api/graphx/index.html#org.apache.spark.graphx.Edge +[Edge]: api/scala/index.html#org.apache.spark.graphx.Edge We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices` and `graph.edges` members respectively. @@ -229,7 +229,7 @@ The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD, ED]]` containing instances of the [`EdgeTriplet`][EdgeTriplet] class. This *join* can be expressed in the following SQL expression: -[EdgeTriplet]: api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet +[EdgeTriplet]: api/scala/index.html#org.apache.spark.graphx.EdgeTriplet {% highlight sql %} SELECT src.id, dst.id, src.attr, e.attr, dst.attr @@ -270,8 +270,8 @@ core operators are defined in [`GraphOps`][GraphOps]. 
However, thanks to Scala operators in `GraphOps` are automatically available as members of `Graph`. For example, we can compute the in-degree of each vertex (defined in `GraphOps`) by the following: -[Graph]: api/graphx/index.html#org.apache.spark.graphx.Graph -[GraphOps]: api/graphx/index.html#org.apache.spark.graphx.GraphOps +[Graph]: api/scala/index.html#org.apache.spark.graphx.Graph +[GraphOps]: api/scala/index.html#org.apache.spark.graphx.GraphOps {% highlight scala %} val graph: Graph[(String, String), String] @@ -382,7 +382,7 @@ val newGraph = Graph(newVertices, graph.edges) val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr)) {% endhighlight %} -[Graph.mapVertices]: api/graphx/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED] +[Graph.mapVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED] These operators are often used to initialize the graph for a particular computation or project away unnecessary properties. For example, given a graph with the out-degrees as the vertex properties @@ -419,7 +419,7 @@ This can be useful when, for example, trying to compute the inverse PageRank. B operation does not modify vertex or edge properties or change the number of edges, it can be implemented efficiently without data-movement or duplication. -[Graph.reverse]: api/graphx/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED] +[Graph.reverse]: api/scala/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED] The [`subgraph`][Graph.subgraph] operator takes vertex and edge predicates and returns the graph containing only the vertices that satisfy the vertex predicate (evaluate to true) and edges that @@ -427,7 +427,7 @@ satisfy the edge predicate *and connect vertices that satisfy the vertex predica operator can be used in number of situations to restrict the graph to the vertices and edges of interest or eliminate broken links. For example in the following code we remove broken links: -[Graph.subgraph]: api/graphx/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED] +[Graph.subgraph]: api/scala/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED] {% highlight scala %} // Create an RDD for the vertices @@ -467,7 +467,7 @@ vertices and edges that are also found in the input graph. This can be used in example, we might run connected components using the graph with missing vertices and then restrict the answer to the valid subgraph. -[Graph.mask]: api/graphx/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED] +[Graph.mask]: api/scala/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED] {% highlight scala %} // Run Connected Components @@ -482,7 +482,7 @@ The [`groupEdges`][Graph.groupEdges] operator merges parallel edges (i.e., dupli pairs of vertices) in the multigraph. In many numerical applications, parallel edges can be *added* (their weights combined) into a single edge thereby reducing the size of the graph. 
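For example, a minimal sketch of this merge (the sample edges and the choice of partition strategy are purely illustrative, and an existing SparkContext `sc` plus `import org.apache.spark.graphx._` are assumed). Because `groupEdges` only combines edges that live in the same partition, the graph is repartitioned first:

{% highlight scala %}
import org.apache.spark.rdd.RDD

// A small multigraph with two parallel (1 -> 2) edges carrying Int weights.
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 5), Edge(1L, 2L, 3), Edge(2L, 3L, 1)))
val multigraph = Graph.fromEdges(edges, "vertexProperty")

// Colocate identical edges, then sum their weights; (1 -> 2) is now a single edge of weight 8.
val merged = multigraph
  .partitionBy(PartitionStrategy.EdgePartition2D)
  .groupEdges((a, b) => a + b)
{% endhighlight %}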
-[Graph.groupEdges]: api/graphx/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED] +[Graph.groupEdges]: api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED] ## Join Operators @@ -506,7 +506,7 @@ returns a new graph with the vertex properties obtained by applying the user def to the result of the joined vertices. Vertices without a matching value in the RDD retain their original value. -[GraphOps.joinVertices]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED] +[GraphOps.joinVertices]: api/scala/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED] > Note that if the RDD contains more than one value for a given vertex only one will be used. It > is therefore recommended that the input RDD be first made unique using the following which will @@ -525,7 +525,7 @@ property type. Because not all vertices may have a matching value in the input function takes an `Option` type. For example, we can setup a graph for PageRank by initializing vertex properties with their `outDegree`. -[Graph.outerJoinVertices]: api/graphx/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED] +[Graph.outerJoinVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED] {% highlight scala %} @@ -559,7 +559,7 @@ PageRank Value, shortest path to the source, and smallest reachable vertex id). ### Map Reduce Triplets (mapReduceTriplets) -[Graph.mapReduceTriplets]: api/graphx/index.html#org.apache.spark.graphx.Graph@mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=>Iterator[(org.apache.spark.graphx.VertexId,A)],reduceFunc:(A,A)=>A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A] +[Graph.mapReduceTriplets]: api/scala/index.html#org.apache.spark.graphx.Graph@mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=>Iterator[(org.apache.spark.graphx.VertexId,A)],reduceFunc:(A,A)=>A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A] The core (heavily optimized) aggregation primitive in GraphX is the [`mapReduceTriplets`][Graph.mapReduceTriplets] operator: @@ -665,8 +665,8 @@ attributes at each vertex. This can be easily accomplished using the [`collectNeighborIds`][GraphOps.collectNeighborIds] and the [`collectNeighbors`][GraphOps.collectNeighbors] operators. 
-[GraphOps.collectNeighborIds]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]] -[GraphOps.collectNeighbors]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]] +[GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]] +[GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]] {% highlight scala %} @@ -685,7 +685,7 @@ class GraphOps[VD, ED] { In Spark, RDDs are not persisted in memory by default. To avoid recomputation, they must be explicitly cached when using them multiple times (see the [Spark Programming Guide][RDD Persistence]). Graphs in GraphX behave the same way. **When using a graph multiple times, make sure to call [`Graph.cache()`][Graph.cache] on it first.** [RDD Persistence]: scala-programming-guide.html#rdd-persistence -[Graph.cache]: api/graphx/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED] +[Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED] In iterative computations, *uncaching* may also be necessary for best performance. By default, cached RDDs and graphs will remain in memory until memory pressure forces them to be evicted in LRU order. For iterative computation, intermediate results from previous iterations will fill up the cache. Though they will eventually be evicted, the unnecessary data stored in memory will slow down garbage collection. It would be more efficient to uncache intermediate results as soon as they are no longer necessary. This involves materializing (caching and forcing) a graph or RDD every iteration, uncaching all other datasets, and only using the materialized dataset in future iterations. However, because graphs are composed of multiple RDDs, it can be difficult to unpersist them correctly. **For iterative computation we recommend using the Pregel API, which correctly unpersists intermediate results.** @@ -716,7 +716,7 @@ messages remaining. The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch* of its implementation (note calls to graph.cache have been removed): -[GraphOps.pregel]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED] +[GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED] {% highlight scala %} class GraphOps[VD, ED] { @@ -840,12 +840,12 @@ object Graph { [`Graph.fromEdgeTuples`][Graph.fromEdgeTuples] allows creating a graph from only an RDD of edge tuples, assigning the edges the value 1, and automatically creating any vertices mentioned by edges and assigning them the default value. It also supports deduplicating the edges; to deduplicate, pass `Some` of a [`PartitionStrategy`][PartitionStrategy] as the `uniqueEdges` parameter (for example, `uniqueEdges = Some(PartitionStrategy.RandomVertexCut)`). A partition strategy is necessary to colocate identical edges on the same partition so they can be deduplicated. 
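For example, a minimal sketch of the deduplicating form (the raw pairs are illustrative, and an existing SparkContext `sc` plus `import org.apache.spark.graphx._` are assumed):

{% highlight scala %}
import org.apache.spark.rdd.RDD

// Raw (srcId, dstId) pairs containing a duplicate (1 -> 2) edge.
val rawEdges: RDD[(VertexId, VertexId)] = sc.parallelize(Seq((1L, 2L), (1L, 2L), (2L, 3L)))

// Every vertex mentioned by an edge receives the default value 1, and the duplicate
// pair is merged because the supplied partition strategy colocates identical edges.
val dedupedGraph = Graph.fromEdgeTuples(rawEdges, 1, uniqueEdges = Some(PartitionStrategy.RandomVertexCut))
{% endhighlight %}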
-[PartitionStrategy]: api/graphx/index.html#org.apache.spark.graphx.PartitionStrategy$ +[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$ -[GraphLoader.edgeListFile]: api/graphx/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int] -[Graph.apply]: api/graphx/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED] -[Graph.fromEdgeTuples]: api/graphx/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int] -[Graph.fromEdges]: api/graphx/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED] +[GraphLoader.edgeListFile]: api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int] +[Graph.apply]: api/scala/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED] +[Graph.fromEdgeTuples]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int] +[Graph.fromEdges]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED] # Vertex and Edge RDDs @@ -913,7 +913,7 @@ of the various partitioning strategies defined in [`PartitionStrategy`][Partitio each partition, edge attributes and adjacency structure, are stored separately enabling maximum reuse when changing attribute values. -[PartitionStrategy]: api/graphx/index.html#org.apache.spark.graphx.PartitionStrategy +[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy The three additional functions exposed by the `EdgeRDD` are: {% highlight scala %} @@ -952,7 +952,7 @@ the [`Graph.partitionBy`][Graph.partitionBy] operator. The default partitioning the initial partitioning of the edges as provided on graph construction. However, users can easily switch to 2D-partitioning or other heuristics included in GraphX. -[Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph$@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED] +[Graph.partitionBy]: api/scala/index.html#org.apache.spark.graphx.Graph$@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED]
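For example, switching to the 2D heuristic mentioned above is a one-line call (an existing `graph`, such as one built in the earlier sketches, is assumed):

{% highlight scala %}
// Repartition the edges with the 2D strategy and cache the result, since a
// repartitioned graph is typically reused across many iterations.
val partitionedGraph = graph.partitionBy(PartitionStrategy.EdgePartition2D).cache()
{% endhighlight %}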

ClassFunction Type @@ -81,7 +81,7 @@ interface has a single abstract method, `call()`, that must be implemented. ## Storage Levels RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are -declared in the [org.apache.spark.api.java.StorageLevels](api/core/index.html#org.apache.spark.api.java.StorageLevels) class. To +declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To define your own storage level, you can use StorageLevels.create(...). # Other Features @@ -101,11 +101,11 @@ the following changes: classes to interfaces. This means that concrete implementations of these `Function` classes will need to use `implements` rather than `extends`. * Certain transformation functions now have multiple versions depending - on the return type. In Spark core, the map functions (map, flatMap, - mapPartitons) have type-specific versions, e.g. - [`mapToPair`](api/core/index.html#org.apache.spark.api.java.JavaRDD@mapToPair[K2,V2](f:org.apache.spark.api.java.function.PairFunction[T,K2,V2]):org.apache.spark.api.java.JavaPairRDD[K2,V2]) - and [`mapToDouble`](api/core/index.html#org.apache.spark.api.java.JavaRDD@mapToDouble[R](f:org.apache.spark.api.java.function.DoubleFunction[T]):org.apache.spark.api.java.JavaDoubleRDD). - Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaDStream@transformToPair[K2,V2](transformFunc:org.apache.spark.api.java.function.Function[R,org.apache.spark.api.java.JavaPairRDD[K2,V2]]):org.apache.spark.streaming.api.java.JavaPairDStream[K2,V2]). + on the return type. In Spark core, the map functions (`map`, `flatMap`, and + `mapPartitons`) have type-specific versions, e.g. + [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction)) + and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)). + Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)). # Example @@ -205,16 +205,9 @@ JavaPairRDD counts = lines.flatMapToPair( There is no performance difference between these approaches; the choice is just a matter of style. -# Javadoc - -We currently provide documentation for the Java API as Scaladoc, in the -[`org.apache.spark.api.java` package](api/core/index.html#org.apache.spark.api.java.package), because -some of the classes are implemented in Scala. It is important to note that the types and function -definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of -`T reduce(Function2 func)`). In addition, the Scala `trait` modifier is used for Java -interface classes. We hope to generate documentation with Java-style syntax in the future to -avoid these quirks. +# API Docs +[API documentation](api/java/index.html) for Spark in Java is available in Javadoc format. # Where to Go from Here diff --git a/docs/js/main.js b/docs/js/main.js index 0bd2286cce..5905546711 100755 --- a/docs/js/main.js +++ b/docs/js/main.js @@ -73,8 +73,26 @@ function viewSolution() { }); } +// A script to fix internal hash links because we have an overlapping top bar. 
+// Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510 +function maybeScrollToHash() { + if (window.location.hash && $(window.location.hash).length) { + var newTop = $(window.location.hash).offset().top - 57; + $(window).scrollTop(newTop); + } +} $(function() { codeTabs(); viewSolution(); + + $(window).bind('hashchange', function() { + maybeScrollToHash(); + }); + + // Scroll now too in case we had opened the page on a hash, but wait a bit because some browsers + // will try to do *their* initial scroll after running the onReady handler. + $(window).load(function() { setTimeout(function() { maybeScrollToHash(); }, 25); }); }); diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md index 2c42f60c2e..2e0fa093dc 100644 --- a/docs/mllib-classification-regression.md +++ b/docs/mllib-classification-regression.md @@ -316,26 +316,26 @@ For each of them, we support all 3 possible regularizations (none, L1 or L2). Available algorithms for binary classification: -* [SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD) -* [LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD) +* [SVMWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD) +* [LogisticRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD) Available algorithms for linear regression: -* [LinearRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD) -* [RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD) -* [LassoWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LassoWithSGD) +* [LinearRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD) +* [RidgeRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD) +* [LassoWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD) Behind the scenes, all above methods use the SGD implementation from the gradient descent primitive in MLlib, see the optimization part: -* [GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) +* [GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) #### Tree-based Methods The decision tree algorithm supports binary classification and regression: -* [DecisionTee](api/mllib/index.html#org.apache.spark.mllib.tree.DecisionTree) +* [DecisionTree](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree) # Usage in Scala diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md index 50a8671560..0359c67157 100644 --- a/docs/mllib-clustering.md +++ b/docs/mllib-clustering.md @@ -33,7 +33,7 @@ a given dataset, the algorithm returns the best clustering result). Available algorithms for clustering: -* [KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans) +* [KMeans](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans) diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md index aa22f67b30..2f1f5f3856 100644 --- a/docs/mllib-collaborative-filtering.md +++ b/docs/mllib-collaborative-filtering.md @@ -42,7 +42,7 @@ for an item.
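For example, a compact sketch of calling the ALS algorithm linked below (the ratings and the rank, iteration, and lambda values are illustrative, and an existing SparkContext `sc` is assumed):

{% highlight scala %}
import org.apache.spark.mllib.recommendation.{ALS, Rating}

// A tiny, made-up set of (user, product, rating) triples.
val ratings = sc.parallelize(Seq(Rating(1, 1, 5.0), Rating(1, 2, 1.0), Rating(2, 1, 4.0)))

// Train a matrix factorization model, then predict ratings for the observed (user, product) pairs.
val model = ALS.train(ratings, 10, 10, 0.01)
val predictions = model.predict(ratings.map(r => (r.user, r.product)))
{% endhighlight %}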
Available algorithms for collaborative filtering: -* [ALS](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS) +* [ALS](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) # Usage in Scala diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md index 4236b0c8b6..0963a99881 100644 --- a/docs/mllib-guide.md +++ b/docs/mllib-guide.md @@ -36,15 +36,15 @@ The following links provide a detailed explanation of the methods and usage exam # Data Types Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the -[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class is used to +[Vector](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) class is used to represent vectors. You can create either dense or sparse vectors using the -[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) factory. +[Vectors](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) factory. In Python, MLlib can take the following vector types: * [NumPy](http://www.numpy.org) arrays * Standard Python lists (e.g. `[1, 2, 3]`) -* The MLlib [SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class +* The MLlib [SparseVector](api/python/pyspark.mllib.linalg.SparseVector-class.html) class * [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html) For efficiency, we recommend using NumPy arrays over lists, and using the @@ -52,8 +52,8 @@ For efficiency, we recommend using NumPy arrays over lists, and using the for SciPy matrices, or MLlib's own SparseVector class. Several other simple data types are used throughout the library, e.g. the LabeledPoint -class ([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint), -[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data. +class ([Java/Scala](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint), +[Python](api/python/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data. # Dependencies MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md index 396b98d52a..c79cc3d944 100644 --- a/docs/mllib-optimization.md +++ b/docs/mllib-optimization.md @@ -95,12 +95,12 @@ As an alternative to just use the subgradient `$R'(\wv)$` of the regularizer in direction, an improved update for some cases can be obtained by using the proximal operator instead. For the L1-regularizer, the proximal operator is given by soft thresholding, as implemented in -[L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.L1Updater). +[L1Updater](api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater). ## Update Schemes for Distributed SGD The SGD implementation in -[GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) uses +[GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) uses a simple (distributed) sampling of the data examples. We recall that the loss part of the optimization problem `$\eqref{eq:regPrimal}$` is `$\frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)$`, and therefore `$\frac1n \sum_{i=1}^n L'_{\wv,i}$` would @@ -138,7 +138,7 @@ are developed, see the section for example. 
The SGD method -[GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) +[GradientDescent.runMiniBatchSGD](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) has the following parameters: * `gradient` is a class that computes the stochastic gradient of the function @@ -161,6 +161,6 @@ each iteration, to compute the gradient direction. Available algorithms for gradient descent: -* [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) +* [GradientDescent.runMiniBatchSGD](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md index 39de603b29..98233bf556 100644 --- a/docs/python-programming-guide.md +++ b/docs/python-programming-guide.md @@ -134,7 +134,7 @@ Files listed here will be added to the `PYTHONPATH` and shipped to remote worker Code dependencies can be added to an existing SparkContext using its `addPyFile()` method. You can set [configuration properties](configuration.html#spark-properties) by passing a -[SparkConf](api/pyspark/pyspark.conf.SparkConf-class.html) object to SparkContext: +[SparkConf](api/python/pyspark.conf.SparkConf-class.html) object to SparkContext: {% highlight python %} from pyspark import SparkConf, SparkContext @@ -147,7 +147,7 @@ sc = SparkContext(conf = conf) # API Docs -[API documentation](api/pyspark/index.html) for PySpark is available as Epydoc. +[API documentation](api/python/index.html) for PySpark is available as Epydoc. Many of the methods also contain [doctests](http://docs.python.org/2/library/doctest.html) that provide additional usage examples. # Libraries diff --git a/docs/quick-start.md b/docs/quick-start.md index 6b4f4ba425..68afa6e1bf 100644 --- a/docs/quick-start.md +++ b/docs/quick-start.md @@ -138,7 +138,9 @@ Spark README. Note that you'll need to replace YOUR_SPARK_HOME with the location installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext, we initialize a SparkContext as part of the program. -We pass the SparkContext constructor a SparkConf object which contains information about our +We pass the SparkContext constructor a +[SparkConf](api/scala/index.html#org.apache.spark.SparkConf) +object which contains information about our application. We also call sc.addJar to make sure that when our application is launched in cluster mode, the jar file containing it will be shipped automatically to worker nodes. @@ -327,4 +329,4 @@ Congratulations on running your first Spark application! * For an in-depth overview of the API see "Programming Guides" menu section. * For running applications on a cluster head to the [deployment overview](cluster-overview.html). -* For configuration options available to Spark applications see the [configuration page](configuration.html). \ No newline at end of file +* For configuration options available to Spark applications see the [configuration page](configuration.html). diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md index 4431da0721..a3171709ff 100644 --- a/docs/scala-programming-guide.md +++ b/docs/scala-programming-guide.md @@ -147,7 +147,7 @@ All transformations in Spark are lazy, in that they do not compute their By default, each transformed RDD is recomputed each time you run an action on it. 
However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options. -The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/core/index.html#org.apache.spark.rdd.RDD) for details): +The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD) for details): ### Transformations @@ -216,7 +216,7 @@ The following tables list the transformations and actions currently supported (s -A complete list of transformations is available in the [RDD API doc](api/core/index.html#org.apache.spark.rdd.RDD). +A complete list of transformations is available in the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD). ### Actions @@ -264,7 +264,7 @@ A complete list of transformations is available in the [RDD API doc](api/core/in -A complete list of actions is available in the [RDD API doc](api/core/index.html#org.apache.spark.rdd.RDD). +A complete list of actions is available in the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD). ## RDD Persistence @@ -283,7 +283,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space), or replicate it across nodes, or store the data in off-heap memory in [Tachyon](http://tachyon-project.org/). These levels are chosen by passing a -[`org.apache.spark.storage.StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel) +[`org.apache.spark.storage.StorageLevel`](api/scala/index.html#org.apache.spark.storage.StorageLevel) object to `persist()`. The `cache()` method is a shorthand for using the default storage level, which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of available storage levels is: @@ -355,7 +355,7 @@ waiting to recompute a lost partition. If you want to define your own storage level (say, with replication factor of 3 instead of 2), then use the function factor method `apply()` of the -[`StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel$) singleton object. +[`StorageLevel`](api/scala/index.html#org.apache.spark.storage.StorageLevel$) singleton object. Spark has a block manager inside the Executors that let you chose memory, disk, or off-heap. The latter is for storing RDDs off-heap outside the Executor JVM on top of the memory management system diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md index 8e98cc0c80..e25379bd76 100644 --- a/docs/sql-programming-guide.md +++ b/docs/sql-programming-guide.md @@ -14,8 +14,8 @@ title: Spark SQL Programming Guide Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. At the core of this component is a new type of RDD, -[SchemaRDD](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed -[Row](api/sql/catalyst/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with +[SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD). 
SchemaRDDs are composed +[Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/). @@ -27,8 +27,8 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac

Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using Spark. At the core of this component is a new type of RDD, -[JavaSchemaRDD](api/sql/core/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed -[Row](api/sql/catalyst/index.html#org.apache.spark.sql.api.java.Row) objects along with +[JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed +[Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects along with a schema that describes the data types of each column in the row. A JavaSchemaRDD is similar to a table in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, parquet file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/). @@ -38,8 +38,8 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using Spark. At the core of this component is a new type of RDD, -[SchemaRDD](api/pyspark/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed -[Row](api/pyspark/pyspark.sql.Row-class.html) objects along with +[SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed +[Row](api/python/pyspark.sql.Row-class.html) objects along with a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/). @@ -56,7 +56,7 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac
The entry point into all relational functionality in Spark is the -[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its +[SQLContext](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its descendants. To create a basic SQLContext, all you need is a SparkContext. {% highlight scala %} @@ -72,7 +72,7 @@ import sqlContext._
The entry point into all relational functionality in Spark is the -[JavaSQLContext](api/sql/core/index.html#org.apache.spark.sql.api.java.JavaSQLContext) class, or one +[JavaSQLContext](api/scala/index.html#org.apache.spark.sql.api.java.JavaSQLContext) class, or one of its descendants. To create a basic JavaSQLContext, all you need is a JavaSparkContext. {% highlight java %} @@ -85,7 +85,7 @@ JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx);
The entry point into all relational functionality in Spark is the -[SQLContext](api/pyspark/pyspark.sql.SQLContext-class.html) class, or one +[SQLContext](api/python/pyspark.sql.SQLContext-class.html) class, or one of its decedents. To create a basic SQLContext, all you need is a SparkContext. {% highlight python %} @@ -331,7 +331,7 @@ val teenagers = people.where('age >= 10).where('age <= 19).select('name) The DSL uses Scala symbols to represent columns in the underlying table, which are identifiers prefixed with a tick (`'`). Implicit conversions turn these symbols into expressions that are evaluated by the SQL execution engine. A full list of the functions supported can be found in the -[ScalaDoc](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD). +[ScalaDoc](api/scala/index.html#org.apache.spark.sql.SchemaRDD). diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md index 3fb540c9fb..3cfa4516cc 100644 --- a/docs/streaming-custom-receivers.md +++ b/docs/streaming-custom-receivers.md @@ -9,7 +9,7 @@ This guide shows the programming model and features by walking through a simple ### Writing a Simple Receiver -This starts with implementing [NetworkReceiver](api/streaming/index.html#org.apache.spark.streaming.dstream.NetworkReceiver). +This starts with implementing [NetworkReceiver](api/scala/index.html#org.apache.spark.streaming.dstream.NetworkReceiver). The following is a simple socket text-stream receiver. @@ -125,4 +125,4 @@ _A more comprehensive example is provided in the spark streaming examples_ ## References 1.[Akka Actor documentation](http://doc.akka.io/docs/akka/2.0.5/scala/actors.html) -2.[NetworkReceiver](api/streaming/index.html#org.apache.spark.streaming.dstream.NetworkReceiver) +2.[NetworkReceiver](api/scala/index.html#org.apache.spark.streaming.dstream.NetworkReceiver) diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md index f9904d4501..946d6c4879 100644 --- a/docs/streaming-programming-guide.md +++ b/docs/streaming-programming-guide.md @@ -40,7 +40,7 @@ Spark Streaming provides a high-level abstraction called *discretized stream* or which represents a continuous stream of data. DStreams can be created either from input data stream from sources such as Kafka and Flume, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of -[RDDs](api/core/index.html#org.apache.spark.rdd.RDD). +[RDDs](api/scala/index.html#org.apache.spark.rdd.RDD). This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala or Java, both of which are presented in this guide. You @@ -62,7 +62,7 @@ First, we import the names of the Spark Streaming classes, and some implicit conversions from StreamingContext into our environment, to add useful methods to other classes we need (like DStream). -[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) is the +[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) is the main entry point for all streaming functionality. {% highlight scala %} @@ -71,7 +71,7 @@ import org.apache.spark.streaming.StreamingContext._ {% endhighlight %} Then we create a -[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) object. +[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object. 
Besides Spark's configuration, we specify that any DStream will be processed in 1 second batches. @@ -132,7 +132,7 @@ The complete code can be found in the Spark Streaming example
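For reference, a condensed sketch of the Scala setup just described (the master URL, application name, and port are illustrative, and the Spark Streaming imports shown earlier are assumed):

{% highlight scala %}
// Create a StreamingContext with 1 second batches and count words arriving on a TCP socket.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
val wordCounts = lines.flatMap(_.split(" ")).map(word => (word, 1)).reduceByKey(_ + _)
wordCounts.print()

ssc.start()             // start the computation
ssc.awaitTermination()  // wait for it to terminate
{% endhighlight %}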
First, we create a -[JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object, +[JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object, which is the main entry point for all streaming functionality. Besides Spark's configuration, we specify that any DStream would be processed in 1 second batches. @@ -168,7 +168,7 @@ JavaDStream words = lines.flatMap( generating multiple new records from each record in the source DStream. In this case, each line will be split into multiple words and the stream of words is represented as the `words` DStream. Note that we defined the transformation using a -[FlatMapFunction](api/core/index.html#org.apache.spark.api.java.function.FlatMapFunction) object. +[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object. As we will discover along the way, there are a number of such convenience classes in the Java API that help define DStream transformations. @@ -192,9 +192,9 @@ wordCounts.print(); // Print a few of the counts to the console {% endhighlight %} The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word, -1)` pairs, using a [PairFunction](api/core/index.html#org.apache.spark.api.java.function.PairFunction) +1)` pairs, using a [PairFunction](api/scala/index.html#org.apache.spark.api.java.function.PairFunction) object. Then, it is reduced to get the frequency of words in each batch of data, -using a [Function2](api/core/index.html#org.apache.spark.api.java.function.Function2) object. +using a [Function2](api/scala/index.html#org.apache.spark.api.java.function.Function2) object. Finally, `wordCounts.print()` will print a few of the counts generated every second. Note that when these lines are executed, Spark Streaming only sets up the computation it @@ -333,7 +333,7 @@ for the full list of supported sources and artifacts.
To initialize a Spark Streaming program in Scala, a -[`StreamingContext`](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) +[`StreamingContext`](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object has to be created, which is the main entry point of all Spark Streaming functionality. A `StreamingContext` object can be created by using @@ -344,7 +344,7 @@ new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
To initialize a Spark Streaming program in Java, a -[`JavaStreamingContext`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) +[`JavaStreamingContext`](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object has to be created, which is the main entry point of all Spark Streaming functionality. A `JavaStreamingContext` object can be created by using @@ -431,8 +431,8 @@ and process any files created in that directory. Note that For more details on streams from files, Akka actors and sockets, see the API documentations of the relevant functions in -[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) for -Scala and [JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) +[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for +Scala and [JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) for Java. Additional functionality for creating DStreams from sources such as Kafka, Flume, and Twitter @@ -802,10 +802,10 @@ output operators are defined: The complete list of DStream operations is available in the API documentation. For the Scala API, -see [DStream](api/streaming/index.html#org.apache.spark.streaming.dstream.DStream) -and [PairDStreamFunctions](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions). -For the Java API, see [JavaDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.dstream.DStream) -and [JavaPairDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaPairDStream). +see [DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream) +and [PairDStreamFunctions](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions). +For the Java API, see [JavaDStream](api/scala/index.html#org.apache.spark.streaming.api.java.dstream.DStream) +and [JavaPairDStream](api/scala/index.html#org.apache.spark.streaming.api.java.JavaPairDStream). Specifically for the Java API, see [Spark's Java programming guide](java-programming-guide.html) for more information. @@ -881,7 +881,7 @@ Cluster resources maybe under-utilized if the number of parallel tasks used in a computation is not high enough. For example, for distributed reduce operations like `reduceByKey` and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of parallelism as an argument (see the -[`PairDStreamFunctions`](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions) +[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions) documentation), or set the [config property](configuration.html#spark-properties) `spark.default.parallelism` to change the default. @@ -925,7 +925,7 @@ A good approach to figure out the right batch size for your application is to te conservative batch size (say, 5-10 seconds) and a low data rate. To verify whether the system is able to keep up with data rate, you can check the value of the end-to-end delay experienced by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the -[StreamingListener](api/streaming/index.html#org.apache.spark.streaming.scheduler.StreamingListener) +[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener) interface). 
If the delay is maintained to be comparable to the batch size, then system is stable. Otherwise, if the delay is continuously increasing, it means that the system is unable to keep up and it @@ -952,7 +952,7 @@ exception saying so. ## Monitoring Besides Spark's in-built [monitoring capabilities](monitoring.html), the progress of a Spark Streaming program can also be monitored using the [StreamingListener] -(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface, +(api/scala/index.html#org.apache.spark.scheduler.StreamingListener) interface, which allows you to get statistics of batch processing times, queueing delays, and total end-to-end delays. Note that this is still an experimental API and it is likely to be improved upon (i.e., more information reported) in the future. @@ -965,9 +965,9 @@ in Spark Streaming applications and achieving more consistent batch processing t * **Default persistence level of DStreams**: Unlike RDDs, the default persistence level of DStreams serializes the data in memory (that is, -[StorageLevel.MEMORY_ONLY_SER](api/core/index.html#org.apache.spark.storage.StorageLevel$) for +[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for DStream compared to -[StorageLevel.MEMORY_ONLY](api/core/index.html#org.apache.spark.storage.StorageLevel$) for RDDs). +[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for RDDs). Even though keeping the data serialized incurs higher serialization/deserialization overheads, it significantly reduces GC pauses. @@ -1244,15 +1244,15 @@ and output 30 after recovery. # Where to Go from Here * API documentation - - Main docs of StreamingContext and DStreams in [Scala](api/streaming/index.html#org.apache.spark.streaming.package) - and [Java](api/streaming/index.html#org.apache.spark.streaming.api.java.package) + - Main docs of StreamingContext and DStreams in [Scala](api/scala/index.html#org.apache.spark.streaming.package) + and [Java](api/scala/index.html#org.apache.spark.streaming.api.java.package) - Additional docs for - [Kafka](api/external/kafka/index.html#org.apache.spark.streaming.kafka.KafkaUtils$), - [Flume](api/external/flume/index.html#org.apache.spark.streaming.flume.FlumeUtils$), - [Twitter](api/external/twitter/index.html#org.apache.spark.streaming.twitter.TwitterUtils$), - [ZeroMQ](api/external/zeromq/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and - [MQTT](api/external/mqtt/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$) + [Kafka](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$), + [Flume](api/scala/index.html#org.apache.spark.streaming.flume.FlumeUtils$), + [Twitter](api/scala/index.html#org.apache.spark.streaming.twitter.TwitterUtils$), + [ZeroMQ](api/scala/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and + [MQTT](api/scala/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$) * More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples) and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/streaming/examples) -* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming +* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming. 
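As a rough sketch of the listener hook mentioned in the monitoring section above (an existing StreamingContext `ssc` is assumed, and the exact fields reported should be treated as illustrative):

{% highlight scala %}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Log the end-to-end delay of every completed batch.
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    batchCompleted.batchInfo.totalDelay.foreach { delayMs =>
      println("Batch " + batchCompleted.batchInfo.batchTime + " total delay: " + delayMs + " ms")
    }
  }
}

ssc.addStreamingListener(new DelayLogger)
{% endhighlight %}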
diff --git a/docs/tuning.md b/docs/tuning.md index cc069f0e84..78e10770a8 100644 --- a/docs/tuning.md +++ b/docs/tuning.md @@ -48,7 +48,7 @@ Spark automatically includes Kryo serializers for the many commonly-used core Sc in the AllScalaRegistrar from the [Twitter chill](https://github.com/twitter/chill) library. To register your own custom classes with Kryo, create a public class that extends -[`org.apache.spark.serializer.KryoRegistrator`](api/core/index.html#org.apache.spark.serializer.KryoRegistrator) and set the +[`org.apache.spark.serializer.KryoRegistrator`](api/scala/index.html#org.apache.spark.serializer.KryoRegistrator) and set the `spark.kryo.registrator` config property to point to it, as follows: {% highlight scala %} @@ -222,7 +222,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a (though you can control it through optional parameters to `SparkContext.textFile`, etc), and for distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses the largest parent RDD's number of partitions. You can pass the level of parallelism as a second argument -(see the [`spark.PairRDDFunctions`](api/core/index.html#org.apache.spark.rdd.PairRDDFunctions) documentation), +(see the [`spark.PairRDDFunctions`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) documentation), or set the config property `spark.default.parallelism` to change the default. In general, we recommend 2-3 tasks per CPU core in your cluster. -- cgit v1.2.3