path: root/docs
author      Matei Zaharia <matei@databricks.com>        2014-04-21 21:57:40 -0700
committer   Patrick Wendell <pwendell@gmail.com>        2014-04-21 21:57:40 -0700
commit      fc7838470465474f777bd17791c1bb5f9c348521 (patch)
tree        6809dfd66ebafa6dced2018585a3a1f9ba270d53 /docs
parent      04c37b6f749dc2418cc28c89964cdc687dfcbd51 (diff)
download    spark-fc7838470465474f777bd17791c1bb5f9c348521.tar.gz
            spark-fc7838470465474f777bd17791c1bb5f9c348521.tar.bz2
            spark-fc7838470465474f777bd17791c1bb5f9c348521.zip
[SPARK-1439, SPARK-1440] Generate unified Scaladoc across projects and Javadocs
I used the sbt-unidoc plugin (https://github.com/sbt/sbt-unidoc) to create a unified Scaladoc of our public packages, and generate Javadocs as well. One limitation is that I haven't found an easy way to exclude packages in the Javadoc; there is a SBT task that identifies Java sources to run javadoc on, but it's been very difficult to modify it from outside to change what is set in the unidoc package. Some SBT-savvy people should help with this. The Javadoc site also lacks package-level descriptions and things like that, so we may want to look into that. We may decide not to post these right now if it's too limited compared to the Scala one.

Example of the built doc site: http://people.csail.mit.edu/matei/spark-unified-docs/

Author: Matei Zaharia <matei@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Patrick Wendell <pwendell@gmail.com>

Closes #457 from mateiz/better-docs and squashes the following commits:

a63d4a3 [Matei Zaharia] Skip Java/Scala API docs for Python package
5ea1f43 [Matei Zaharia] Fix links to Java classes in Java guide, fix some JS for scrolling to anchors on page load
f05abc0 [Matei Zaharia] Don't include java.lang package names
995e992 [Matei Zaharia] Skip internal packages and class names with $ in JavaDoc
a14a93c [Matei Zaharia] typo
76ce64d [Matei Zaharia] Add groups to Javadoc index page, and a first package-info.java
ed6f994 [Matei Zaharia] Generate JavaDoc as well, add titles, update doc site to use unified docs
acb993d [Matei Zaharia] Add Unidoc plugin for the projects we want Unidoced
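For context, the sbt side of this change is not visible in the docs-only diff below (it lives under project/, e.g. project/plugins.sbt and SparkBuild.scala). The following is a minimal sketch of how an sbt-unidoc setup of that era is typically wired up; the plugin version, the `core`/`examples` project names, and the exact settings Spark used are assumptions for illustration, not taken from this patch:

```scala
// project/plugins.sbt -- adds the sbt-unidoc plugin; the version pinned here is an
// assumption for illustration, not necessarily the one Spark used.
addSbtPlugin("com.eed3si9n" % "sbt-unidoc" % "0.3.0")

// build.sbt (sketch) -- `sbt compile unidoc`, as invoked by docs/_plugins/copy_api_dirs.rb
// in this patch, writes the unified Scaladoc to target/scala-2.10/unidoc and the unified
// Javadoc to target/javaunidoc.
import sbtunidoc.Plugin._
import sbtunidoc.Plugin.UnidocKeys._

lazy val core     = project.in(file("core"))       // hypothetical subprojects for the sketch
lazy val examples = project.in(file("examples"))

lazy val root = project.in(file("."))
  .aggregate(core, examples)
  .settings(unidocSettings: _*)  // adds the `unidoc` task; Javadoc output additionally
                                 // needs the plugin's Java/genjavadoc settings
  .settings(
    // Skip subprojects whose API we don't want published (e.g. examples); whole projects,
    // though not individual packages, can be excluded this way -- matching the limitation
    // noted in the commit message.
    unidocProjectFilter in (ScalaUnidoc, unidoc) := inAnyProject -- inProjects(examples)
  )
```

The Jekyll plugin change below then simply copies target/scala-2.10/unidoc and target/javaunidoc into api/scala and api/java before the site is built.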
Diffstat (limited to 'docs')
-rwxr-xr-x  docs/_layouts/global.html                 105
-rw-r--r--  docs/_plugins/copy_api_dirs.rb             65
-rw-r--r--  docs/api.md                                13
-rw-r--r--  docs/configuration.md                       8
-rw-r--r--  docs/graphx-programming-guide.md           62
-rw-r--r--  docs/index.md                              10
-rw-r--r--  docs/java-programming-guide.md             55
-rwxr-xr-x  docs/js/main.js                            18
-rw-r--r--  docs/mllib-classification-regression.md   14
-rw-r--r--  docs/mllib-clustering.md                    2
-rw-r--r--  docs/mllib-collaborative-filtering.md       2
-rw-r--r--  docs/mllib-guide.md                        10
-rw-r--r--  docs/mllib-optimization.md                  8
-rw-r--r--  docs/python-programming-guide.md            4
-rw-r--r--  docs/quick-start.md                         6
-rw-r--r--  docs/scala-programming-guide.md            10
-rw-r--r--  docs/sql-programming-guide.md              20
-rw-r--r--  docs/streaming-custom-receivers.md          4
-rw-r--r--  docs/streaming-programming-guide.md        56
-rw-r--r--  docs/tuning.md                              4
20 files changed, 207 insertions, 269 deletions
diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 5d4dbb7a9c..8b543de574 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -76,32 +76,9 @@
<li class="dropdown">
<a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
<ul class="dropdown-menu">
- <li><a href="api/core/index.html#org.apache.spark.package">Spark Core for Java/Scala</a></li>
- <li><a href="api/pyspark/index.html">Spark Core for Python</a></li>
- <li class="divider"></li>
- <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
- <li class="dropdown-submenu">
- <a tabindex="-1" href="#">Spark SQL</a>
- <ul class="dropdown-menu">
- <li><a href="api/sql/core/org/apache/spark/sql/SQLContext.html">Spark SQL Core</a></li>
- <li><a href="api/sql/hive/org/apache/spark/sql/hive/package.html">Hive Support</a></li>
- <li><a href="api/sql/catalyst/org/apache/spark/sql/catalyst/package.html">Catalyst (Optimization)</a></li>
- </ul>
- </li>
- <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
- <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
- <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph Processing)</a></li>
- <li class="divider"></li>
- <li class="dropdown-submenu">
- <a tabindex="-1" href="#">External Data Sources</a>
- <ul class="dropdown-menu">
- <li><a href="api/external/kafka/index.html#org.apache.spark.streaming.kafka.KafkaUtils$">Kafka</a></li>
- <li><a href="api/external/flume/index.html#org.apache.spark.streaming.flume.FlumeUtils$">Flume</a></li>
- <li><a href="api/external/twitter/index.html#org.apache.spark.streaming.twitter.TwitterUtils$">Twitter</a></li>
- <li><a href="api/external/zeromq/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$">ZeroMQ</a></li>
- <li><a href="api/external/mqtt/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$">MQTT</a></li>
- </ul>
- </li>
+ <li><a href="api/scala/index.html#org.apache.spark.package">Scaladoc</a></li>
+ <li><a href="api/java/index.html">Javadoc</a></li>
+ <li><a href="api/python/index.html">Python API</a></li>
</ul>
</li>
@@ -140,33 +117,6 @@
<h1 class="title">{{ page.title }}</h1>
{{ content }}
- <!-- Main hero unit for a primary marketing message or call to action -->
- <!--<div class="hero-unit">
- <h1>Hello, world!</h1>
- <p>This is a template for a simple marketing or informational website. It includes a large callout called the hero unit and three supporting pieces of content. Use it as a starting point to create something more unique.</p>
- <p><a class="btn btn-primary btn-large">Learn more &raquo;</a></p>
- </div>-->
-
- <!-- Example row of columns -->
- <!--<div class="row">
- <div class="span4">
- <h2>Heading</h2>
- <p>Donec id elit non mi porta gravida at eget metus. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. Etiam porta sem malesuada magna mollis euismod. Donec sed odio dui. </p>
- <p><a class="btn" href="#">View details &raquo;</a></p>
- </div>
- <div class="span4">
- <h2>Heading</h2>
- <p>Donec id elit non mi porta gravida at eget metus. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus. Etiam porta sem malesuada magna mollis euismod. Donec sed odio dui. </p>
- <p><a class="btn" href="#">View details &raquo;</a></p>
- </div>
- <div class="span4">
- <h2>Heading</h2>
- <p>Donec sed odio dui. Cras justo odio, dapibus ac facilisis in, egestas eget quam. Vestibulum id ligula porta felis euismod semper. Fusce dapibus, tellus ac cursus commodo, tortor mauris condimentum nibh, ut fermentum massa justo sit amet risus.</p>
- <p><a class="btn" href="#">View details &raquo;</a></p>
- </div>
- </div>
-
- <hr>-->
</div> <!-- /container -->
@@ -174,42 +124,23 @@
<script src="js/vendor/bootstrap.min.js"></script>
<script src="js/main.js"></script>
- <!-- A script to fix internal hash links because we have an overlapping top bar.
- Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510 -->
+ <!-- MathJax Section -->
+ <script type="text/x-mathjax-config">
+ MathJax.Hub.Config({
+ TeX: { equationNumbers: { autoNumber: "AMS" } }
+ });
+ </script>
+ <script type="text/javascript"
+ src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
<script>
- $(function() {
- function maybeScrollToHash() {
- if (window.location.hash && $(window.location.hash).length) {
- var newTop = $(window.location.hash).offset().top - $('#topbar').height() - 5;
- $(window).scrollTop(newTop);
- }
+ MathJax.Hub.Config({
+ tex2jax: {
+ inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
+ displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
+ processEscapes: true,
+ skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
}
- $(window).bind('hashchange', function() {
- maybeScrollToHash();
- });
- // Scroll now too in case we had opened the page on a hash, but wait 1 ms because some browsers
- // will try to do *their* initial scroll after running the onReady handler.
- setTimeout(function() { maybeScrollToHash(); }, 1)
- })
+ });
</script>
-
</body>
- <!-- MathJax Section -->
- <script type="text/x-mathjax-config">
- MathJax.Hub.Config({
- TeX: { equationNumbers: { autoNumber: "AMS" } }
- });
- </script>
- <script type="text/javascript"
- src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML"></script>
- <script>
- MathJax.Hub.Config({
- tex2jax: {
- inlineMath: [ ["$", "$"], ["\\\\(","\\\\)"] ],
- displayMath: [ ["$$","$$"], ["\\[", "\\]"] ],
- processEscapes: true,
- skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
- }
- });
- </script>
</html>
diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index 05f0bd47a8..2dbbbf6feb 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -20,47 +20,48 @@ include FileUtils
if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
# Build Scaladoc for Java/Scala
- core_projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"]
- external_projects = ["flume", "kafka", "mqtt", "twitter", "zeromq"]
- sql_projects = ["catalyst", "core", "hive"]
- projects = core_projects
- projects = projects + external_projects.map { |project_name| "external/" + project_name }
- projects = projects + sql_projects.map { |project_name| "sql/" + project_name }
-
- puts "Moving to project root and building scaladoc."
+ puts "Moving to project root and building API docs."
curr_dir = pwd
cd("..")
- puts "Running 'sbt/sbt doc hive/doc' from " + pwd + "; this may take a few minutes..."
- puts `sbt/sbt doc hive/doc`
+ puts "Running 'sbt/sbt compile unidoc' from " + pwd + "; this may take a few minutes..."
+ puts `sbt/sbt compile unidoc`
puts "Moving back into docs dir."
cd("docs")
- # Copy over the scaladoc from each project into the docs directory.
+ # Copy over the unified ScalaDoc for all projects to api/scala.
# This directory will be copied over to _site when `jekyll` command is run.
- projects.each do |project_name|
- source = "../" + project_name + "/target/scala-2.10/api"
- dest = "api/" + project_name
+ source = "../target/scala-2.10/unidoc"
+ dest = "api/scala"
+
+ puts "Making directory " + dest
+ mkdir_p dest
+
+ # From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't.
+ puts "cp -r " + source + "/. " + dest
+ cp_r(source + "/.", dest)
+
+ # Append custom JavaScript
+ js = File.readlines("./js/api-docs.js")
+ js_file = dest + "/lib/template.js"
+ File.open(js_file, 'a') { |f| f.write("\n" + js.join()) }
- puts "making directory " + dest
- mkdir_p dest
+ # Append custom CSS
+ css = File.readlines("./css/api-docs.css")
+ css_file = dest + "/lib/template.css"
+ File.open(css_file, 'a') { |f| f.write("\n" + css.join()) }
- # From the rubydoc: cp_r('src', 'dest') makes src/dest, but this doesn't.
- puts "cp -r " + source + "/. " + dest
- cp_r(source + "/.", dest)
+ # Copy over the unified JavaDoc for all projects to api/java.
+ source = "../target/javaunidoc"
+ dest = "api/java"
- # Append custom JavaScript
- js = File.readlines("./js/api-docs.js")
- js_file = dest + "/lib/template.js"
- File.open(js_file, 'a') { |f| f.write("\n" + js.join()) }
+ puts "Making directory " + dest
+ mkdir_p dest
- # Append custom CSS
- css = File.readlines("./css/api-docs.css")
- css_file = dest + "/lib/template.css"
- File.open(css_file, 'a') { |f| f.write("\n" + css.join()) }
- end
+ puts "cp -r " + source + "/. " + dest
+ cp_r(source + "/.", dest)
# Build Epydoc for Python
puts "Moving to python directory and building epydoc."
@@ -70,11 +71,11 @@ if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
puts "Moving back into docs dir."
cd("../docs")
- puts "echo making directory pyspark"
- mkdir_p "pyspark"
+ puts "Making directory api/python"
+ mkdir_p "api/python"
- puts "cp -r ../python/docs/. api/pyspark"
- cp_r("../python/docs/.", "api/pyspark")
+ puts "cp -r ../python/docs/. api/python"
+ cp_r("../python/docs/.", "api/python")
cd("..")
end
diff --git a/docs/api.md b/docs/api.md
index 91c8e51d26..0346038333 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -1,13 +1,10 @@
---
layout: global
-title: Spark API documentation (Scaladoc)
+title: Spark API Documentation
---
-Here you can find links to the Scaladoc generated for the Spark sbt subprojects. If the following links don't work, try running `sbt/sbt doc` from the Spark project home directory.
+Here you can find API docs for Spark and its submodules.
-- [Spark](api/core/index.html)
-- [Spark Examples](api/examples/index.html)
-- [Spark Streaming](api/streaming/index.html)
-- [Bagel](api/bagel/index.html)
-- [GraphX](api/graphx/index.html)
-- [PySpark](api/pyspark/index.html)
+- [Spark Scala API (Scaladoc)](api/scala/index.html)
+- [Spark Java API (Javadoc)](api/java/index.html)
+- [Spark Python API (Epydoc)](api/python/index.html)
diff --git a/docs/configuration.md b/docs/configuration.md
index 5a4abca264..e7e1dd56cf 100644
--- a/docs/configuration.md
+++ b/docs/configuration.md
@@ -6,7 +6,7 @@ title: Spark Configuration
Spark provides three locations to configure the system:
* [Spark properties](#spark-properties) control most application parameters and can be set by passing
- a [SparkConf](api/core/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
+ a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf) object to SparkContext, or through Java
system properties.
* [Environment variables](#environment-variables) can be used to set per-machine settings, such as
the IP address, through the `conf/spark-env.sh` script on each node.
@@ -16,7 +16,7 @@ Spark provides three locations to configure the system:
# Spark Properties
Spark properties control most application settings and are configured separately for each application.
-The preferred way to set them is by passing a [SparkConf](api/core/index.html#org.apache.spark.SparkConf)
+The preferred way to set them is by passing a [SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
class to your SparkContext constructor.
Alternatively, Spark will also load them from Java system properties, for compatibility with old versions
of Spark.
@@ -53,7 +53,7 @@ there are at least five properties that you will commonly want to control:
in serialized form. The default of Java serialization works with any Serializable Java object but is
quite slow, so we recommend <a href="tuning.html">using <code>org.apache.spark.serializer.KryoSerializer</code>
and configuring Kryo serialization</a> when speed is necessary. Can be any subclass of
- <a href="api/core/index.html#org.apache.spark.serializer.Serializer"><code>org.apache.spark.Serializer</code></a>.
+ <a href="api/scala/index.html#org.apache.spark.serializer.Serializer"><code>org.apache.spark.Serializer</code></a>.
</td>
</tr>
<tr>
@@ -62,7 +62,7 @@ there are at least five properties that you will commonly want to control:
<td>
If you use Kryo serialization, set this class to register your custom classes with Kryo.
It should be set to a class that extends
- <a href="api/core/index.html#org.apache.spark.serializer.KryoRegistrator"><code>KryoRegistrator</code></a>.
+ <a href="api/scala/index.html#org.apache.spark.serializer.KryoRegistrator"><code>KryoRegistrator</code></a>.
See the <a href="tuning.html#data-serialization">tuning guide</a> for more details.
</td>
</tr>
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 1238e3e0a4..07be8ba58e 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -17,7 +17,7 @@ title: GraphX Programming Guide
# Overview
GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high-level,
-GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
+GraphX extends the Spark [RDD](api/scala/index.html#org.apache.spark.rdd.RDD) by introducing the
[Resilient Distributed Property Graph](#property_graph): a directed multigraph with properties
attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental
operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
@@ -82,7 +82,7 @@ Prior to the release of GraphX, graph computation in Spark was expressed using B
implementation of Pregel. GraphX improves upon Bagel by exposing a richer property graph API, a
more streamlined version of the Pregel abstraction, and system optimizations to improve performance
and reduce memory overhead. While we plan to eventually deprecate Bagel, we will continue to
-support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
+support the [Bagel API](api/scala/index.html#org.apache.spark.bagel.package) and
[Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
explore the new GraphX API and comment on issues that may complicate the transition from Bagel.
@@ -103,7 +103,7 @@ getting started with Spark refer to the [Spark Quick Start Guide](quick-start.ht
# The Property Graph
<a name="property_graph"></a>
-The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
+The [property graph](api/scala/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
with user defined objects attached to each vertex and edge. A directed multigraph is a directed
graph with potentially multiple parallel edges sharing the same source and destination vertex. The
ability to support parallel edges simplifies modeling scenarios where there can be multiple
@@ -179,7 +179,7 @@ val userGraph: Graph[(String, String), String]
There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
generators and these are discussed in more detail in the section on
[graph builders](#graph_builders). Probably the most general method is to use the
-[Graph object](api/graphx/index.html#org.apache.spark.graphx.Graph$). For example the following
+[Graph object](api/scala/index.html#org.apache.spark.graphx.Graph$). For example the following
code constructs a graph from a collection of RDDs:
{% highlight scala %}
@@ -203,7 +203,7 @@ In the above example we make use of the [`Edge`][Edge] case class. Edges have a
`dstId` corresponding to the source and destination vertex identifiers. In addition, the `Edge`
class has an `attr` member which stores the edge property.
-[Edge]: api/graphx/index.html#org.apache.spark.graphx.Edge
+[Edge]: api/scala/index.html#org.apache.spark.graphx.Edge
We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices`
and `graph.edges` members respectively.
@@ -229,7 +229,7 @@ The triplet view logically joins the vertex and edge properties yielding an
`RDD[EdgeTriplet[VD, ED]]` containing instances of the [`EdgeTriplet`][EdgeTriplet] class. This
*join* can be expressed in the following SQL expression:
-[EdgeTriplet]: api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet
+[EdgeTriplet]: api/scala/index.html#org.apache.spark.graphx.EdgeTriplet
{% highlight sql %}
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
@@ -270,8 +270,8 @@ core operators are defined in [`GraphOps`][GraphOps]. However, thanks to Scala
operators in `GraphOps` are automatically available as members of `Graph`. For example, we can
compute the in-degree of each vertex (defined in `GraphOps`) by the following:
-[Graph]: api/graphx/index.html#org.apache.spark.graphx.Graph
-[GraphOps]: api/graphx/index.html#org.apache.spark.graphx.GraphOps
+[Graph]: api/scala/index.html#org.apache.spark.graphx.Graph
+[GraphOps]: api/scala/index.html#org.apache.spark.graphx.GraphOps
{% highlight scala %}
val graph: Graph[(String, String), String]
@@ -382,7 +382,7 @@ val newGraph = Graph(newVertices, graph.edges)
val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))
{% endhighlight %}
-[Graph.mapVertices]: api/graphx/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
+[Graph.mapVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
These operators are often used to initialize the graph for a particular computation or project away
unnecessary properties. For example, given a graph with the out-degrees as the vertex properties
@@ -419,7 +419,7 @@ This can be useful when, for example, trying to compute the inverse PageRank. B
operation does not modify vertex or edge properties or change the number of edges, it can be
implemented efficiently without data-movement or duplication.
-[Graph.reverse]: api/graphx/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]
+[Graph.reverse]: api/scala/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]
The [`subgraph`][Graph.subgraph] operator takes vertex and edge predicates and returns the graph
containing only the vertices that satisfy the vertex predicate (evaluate to true) and edges that
@@ -427,7 +427,7 @@ satisfy the edge predicate *and connect vertices that satisfy the vertex predica
operator can be used in number of situations to restrict the graph to the vertices and edges of
interest or eliminate broken links. For example in the following code we remove broken links:
-[Graph.subgraph]: api/graphx/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED]
+[Graph.subgraph]: api/scala/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexId,VD)⇒Boolean):Graph[VD,ED]
{% highlight scala %}
// Create an RDD for the vertices
@@ -467,7 +467,7 @@ vertices and edges that are also found in the input graph. This can be used in
example, we might run connected components using the graph with missing vertices and then restrict
the answer to the valid subgraph.
-[Graph.mask]: api/graphx/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
+[Graph.mask]: api/scala/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
{% highlight scala %}
// Run Connected Components
@@ -482,7 +482,7 @@ The [`groupEdges`][Graph.groupEdges] operator merges parallel edges (i.e., dupli
pairs of vertices) in the multigraph. In many numerical applications, parallel edges can be *added*
(their weights combined) into a single edge thereby reducing the size of the graph.
-[Graph.groupEdges]: api/graphx/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]
+[Graph.groupEdges]: api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]
## Join Operators
<a name="join_operators"></a>
@@ -506,7 +506,7 @@ returns a new graph with the vertex properties obtained by applying the user def
to the result of the joined vertices. Vertices without a matching value in the RDD retain their
original value.
-[GraphOps.joinVertices]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
+[GraphOps.joinVertices]: api/scala/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
> Note that if the RDD contains more than one value for a given vertex only one will be used. It
> is therefore recommended that the input RDD be first made unique using the following which will
@@ -525,7 +525,7 @@ property type. Because not all vertices may have a matching value in the input
function takes an `Option` type. For example, we can setup a graph for PageRank by initializing
vertex properties with their `outDegree`.
-[Graph.outerJoinVertices]: api/graphx/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
+[Graph.outerJoinVertices]: api/scala/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
{% highlight scala %}
@@ -559,7 +559,7 @@ PageRank Value, shortest path to the source, and smallest reachable vertex id).
### Map Reduce Triplets (mapReduceTriplets)
<a name="mrTriplets"></a>
-[Graph.mapReduceTriplets]: api/graphx/index.html#org.apache.spark.graphx.Graph@mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=&gt;Iterator[(org.apache.spark.graphx.VertexId,A)],reduceFunc:(A,A)=&gt;A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A]
+[Graph.mapReduceTriplets]: api/scala/index.html#org.apache.spark.graphx.Graph@mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=&gt;Iterator[(org.apache.spark.graphx.VertexId,A)],reduceFunc:(A,A)=&gt;A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A]
The core (heavily optimized) aggregation primitive in GraphX is the
[`mapReduceTriplets`][Graph.mapReduceTriplets] operator:
@@ -665,8 +665,8 @@ attributes at each vertex. This can be easily accomplished using the
[`collectNeighborIds`][GraphOps.collectNeighborIds] and the
[`collectNeighbors`][GraphOps.collectNeighbors] operators.
-[GraphOps.collectNeighborIds]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
-[GraphOps.collectNeighbors]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
+[GraphOps.collectNeighborIds]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexId]]
+[GraphOps.collectNeighbors]: api/scala/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexId,VD)]]
{% highlight scala %}
@@ -685,7 +685,7 @@ class GraphOps[VD, ED] {
In Spark, RDDs are not persisted in memory by default. To avoid recomputation, they must be explicitly cached when using them multiple times (see the [Spark Programming Guide][RDD Persistence]). Graphs in GraphX behave the same way. **When using a graph multiple times, make sure to call [`Graph.cache()`][Graph.cache] on it first.**
[RDD Persistence]: scala-programming-guide.html#rdd-persistence
-[Graph.cache]: api/graphx/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
+[Graph.cache]: api/scala/index.html#org.apache.spark.graphx.Graph@cache():Graph[VD,ED]
In iterative computations, *uncaching* may also be necessary for best performance. By default, cached RDDs and graphs will remain in memory until memory pressure forces them to be evicted in LRU order. For iterative computation, intermediate results from previous iterations will fill up the cache. Though they will eventually be evicted, the unnecessary data stored in memory will slow down garbage collection. It would be more efficient to uncache intermediate results as soon as they are no longer necessary. This involves materializing (caching and forcing) a graph or RDD every iteration, uncaching all other datasets, and only using the materialized dataset in future iterations. However, because graphs are composed of multiple RDDs, it can be difficult to unpersist them correctly. **For iterative computation we recommend using the Pregel API, which correctly unpersists intermediate results.**
@@ -716,7 +716,7 @@ messages remaining.
The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
of its implementation (note calls to graph.cache have been removed):
-[GraphOps.pregel]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
+[GraphOps.pregel]: api/scala/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexId,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexId,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
{% highlight scala %}
class GraphOps[VD, ED] {
@@ -840,12 +840,12 @@ object Graph {
[`Graph.fromEdgeTuples`][Graph.fromEdgeTuples] allows creating a graph from only an RDD of edge tuples, assigning the edges the value 1, and automatically creating any vertices mentioned by edges and assigning them the default value. It also supports deduplicating the edges; to deduplicate, pass `Some` of a [`PartitionStrategy`][PartitionStrategy] as the `uniqueEdges` parameter (for example, `uniqueEdges = Some(PartitionStrategy.RandomVertexCut)`). A partition strategy is necessary to colocate identical edges on the same partition so they can be deduplicated.
-[PartitionStrategy]: api/graphx/index.html#org.apache.spark.graphx.PartitionStrategy$
+[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy$
-[GraphLoader.edgeListFile]: api/graphx/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
-[Graph.apply]: api/graphx/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
-[Graph.fromEdgeTuples]: api/graphx/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
-[Graph.fromEdges]: api/graphx/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[GraphLoader.edgeListFile]: api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
+[Graph.apply]: api/scala/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[Graph.fromEdgeTuples]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexId,VertexId)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
+[Graph.fromEdges]: api/scala/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
# Vertex and Edge RDDs
<a name="vertex_and_edge_rdds"></a>
@@ -913,7 +913,7 @@ of the various partitioning strategies defined in [`PartitionStrategy`][Partitio
each partition, edge attributes and adjacency structure, are stored separately enabling maximum
reuse when changing attribute values.
-[PartitionStrategy]: api/graphx/index.html#org.apache.spark.graphx.PartitionStrategy
+[PartitionStrategy]: api/scala/index.html#org.apache.spark.graphx.PartitionStrategy
The three additional functions exposed by the `EdgeRDD` are:
{% highlight scala %}
@@ -952,7 +952,7 @@ the [`Graph.partitionBy`][Graph.partitionBy] operator. The default partitioning
the initial partitioning of the edges as provided on graph construction. However, users can easily
switch to 2D-partitioning or other heuristics included in GraphX.
-[Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph$@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED]
+[Graph.partitionBy]: api/scala/index.html#org.apache.spark.graphx.Graph$@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED]
<p style="text-align: center;">
<img src="img/vertex_routing_edge_tables.png"
@@ -983,7 +983,7 @@ GraphX comes with static and dynamic implementations of PageRank as methods on t
GraphX also includes an example social network dataset that we can run PageRank on. A set of users is given in `graphx/data/users.txt`, and a set of relationships between users is given in `graphx/data/followers.txt`. We compute the PageRank of each user as follows:
-[PageRank]: api/graphx/index.html#org.apache.spark.graphx.lib.PageRank$
+[PageRank]: api/scala/index.html#org.apache.spark.graphx.lib.PageRank$
{% highlight scala %}
// Load the edges as a graph
@@ -1006,7 +1006,7 @@ println(ranksByUsername.collect().mkString("\n"))
The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. GraphX contains an implementation of the algorithm in the [`ConnectedComponents` object][ConnectedComponents], and we compute the connected components of the example social network dataset from the [PageRank section](#pagerank) as follows:
-[ConnectedComponents]: api/graphx/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
+[ConnectedComponents]: api/scala/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
{% highlight scala %}
// Load the graph as in the PageRank example
@@ -1029,8 +1029,8 @@ println(ccByUsername.collect().mkString("\n"))
A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the [`TriangleCount` object][TriangleCount] that determines the number of triangles passing through each vertex, providing a measure of clustering. We compute the triangle count of the social network dataset from the [PageRank section](#pagerank). *Note that `TriangleCount` requires the edges to be in canonical orientation (`srcId < dstId`) and the graph to be partitioned using [`Graph.partitionBy`][Graph.partitionBy].*
-[TriangleCount]: api/graphx/index.html#org.apache.spark.graphx.lib.TriangleCount$
-[Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
+[TriangleCount]: api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$
+[Graph.partitionBy]: api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
{% highlight scala %}
// Load the edges in canonical order and partition the graph for triangle count
diff --git a/docs/index.md b/docs/index.md
index 89ec5b0548..6fc9a4f03b 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -83,13 +83,9 @@ Note that on Windows, you need to set the environment variables on separate line
**API Docs:**
-* [Spark for Java/Scala (Scaladoc)](api/core/index.html)
-* [Spark for Python (Epydoc)](api/pyspark/index.html)
-* [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
-* [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
-* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
-* [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)
-
+* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
+* [Spark Java API (Javadoc)](api/java/index.html)
+* [Spark Python API (Epydoc)](api/python/index.html)
**Deployment guides:**
diff --git a/docs/java-programming-guide.md b/docs/java-programming-guide.md
index 6632360f6e..07c8512bf9 100644
--- a/docs/java-programming-guide.md
+++ b/docs/java-programming-guide.md
@@ -10,9 +10,9 @@ easy to follow even if you don't know Scala.
This guide will show how to use the Spark features described there in Java.
The Spark Java API is defined in the
-[`org.apache.spark.api.java`](api/core/index.html#org.apache.spark.api.java.package) package, and includes
-a [`JavaSparkContext`](api/core/index.html#org.apache.spark.api.java.JavaSparkContext) for
-initializing Spark and [`JavaRDD`](api/core/index.html#org.apache.spark.api.java.JavaRDD) classes,
+[`org.apache.spark.api.java`](api/java/index.html?org/apache/spark/api/java/package-summary.html) package, and includes
+a [`JavaSparkContext`](api/java/index.html?org/apache/spark/api/java/JavaSparkContext.html) for
+initializing Spark and [`JavaRDD`](api/java/index.html?org/apache/spark/api/java/JavaRDD.html) classes,
which support the same methods as their Scala counterparts but take Java functions and return
Java data and collection types. The main differences have to do with passing functions to RDD
operations (e.g. map) and handling RDDs of different types, as discussed next.
@@ -23,19 +23,18 @@ There are a few key differences between the Java and Scala APIs:
* Java does not support anonymous or first-class functions, so functions are passed
using anonymous classes that implement the
- [`org.apache.spark.api.java.function.Function`](api/core/index.html#org.apache.spark.api.java.function.Function),
- [`Function2`](api/core/index.html#org.apache.spark.api.java.function.Function2), etc.
+ [`org.apache.spark.api.java.function.Function`](api/java/index.html?org/apache/spark/api/java/function/Function.html),
+ [`Function2`](api/java/index.html?org/apache/spark/api/java/function/Function2.html), etc.
interfaces.
* To maintain type safety, the Java API defines specialized Function and RDD
classes for key-value pairs and doubles. For example,
- [`JavaPairRDD`](api/core/index.html#org.apache.spark.api.java.JavaPairRDD)
+ [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
stores key-value pairs.
-* Some methods are defined on the basis of the passed anonymous function's
- (a.k.a lambda expression) return type,
- for example mapToPair(...) or flatMapToPair returns
- [`JavaPairRDD`](api/core/index.html#org.apache.spark.api.java.JavaPairRDD),
- similarly mapToDouble and flatMapToDouble returns
- [`JavaDoubleRDD`](api/core/index.html#org.apache.spark.api.java.JavaDoubleRDD).
+* Some methods are defined on the basis of the passed function's return type.
+ For example `mapToPair()` returns
+ [`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html),
+ and `mapToDouble()` returns
+ [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html).
* RDD methods like `collect()` and `countByKey()` return Java collections types,
such as `java.util.List` and `java.util.Map`.
* Key-value pairs, which are simply written as `(key, value)` in Scala, are represented
@@ -50,8 +49,8 @@ In the Scala API, these methods are automatically added using Scala's
[implicit conversions](http://www.scala-lang.org/node/130) mechanism.
In the Java API, the extra methods are defined in the
-[`JavaPairRDD`](api/core/index.html#org.apache.spark.api.java.JavaPairRDD)
-and [`JavaDoubleRDD`](api/core/index.html#org.apache.spark.api.java.JavaDoubleRDD)
+[`JavaPairRDD`](api/java/index.html?org/apache/spark/api/java/JavaPairRDD.html)
+and [`JavaDoubleRDD`](api/java/index.html?org/apache/spark/api/java/JavaDoubleRDD.html)
classes. RDD methods like `map` are overloaded by specialized `PairFunction`
and `DoubleFunction` classes, allowing them to return RDDs of the appropriate
types. Common methods like `filter` and `sample` are implemented by
@@ -61,8 +60,9 @@ framework](http://docs.scala-lang.org/overviews/core/architecture-of-scala-colle
## Function Interfaces
-The following table lists the function interfaces used by the Java API. Each
-interface has a single abstract method, `call()`, that must be implemented.
+The following table lists the function interfaces used by the Java API, located in the
+[`org.apache.spark.api.java.function`](api/java/index.html?org/apache/spark/api/java/function/package-summary.html)
+package. Each interface has a single abstract method, `call()`.
<table class="table">
<tr><th>Class</th><th>Function Type</th></tr>
@@ -81,7 +81,7 @@ interface has a single abstract method, `call()`, that must be implemented.
## Storage Levels
RDD [storage level](scala-programming-guide.html#rdd-persistence) constants, such as `MEMORY_AND_DISK`, are
-declared in the [org.apache.spark.api.java.StorageLevels](api/core/index.html#org.apache.spark.api.java.StorageLevels) class. To
+declared in the [org.apache.spark.api.java.StorageLevels](api/java/index.html?org/apache/spark/api/java/StorageLevels.html) class. To
define your own storage level, you can use StorageLevels.create(...).
# Other Features
@@ -101,11 +101,11 @@ the following changes:
classes to interfaces. This means that concrete implementations of these
`Function` classes will need to use `implements` rather than `extends`.
* Certain transformation functions now have multiple versions depending
- on the return type. In Spark core, the map functions (map, flatMap,
- mapPartitons) have type-specific versions, e.g.
- [`mapToPair`](api/core/index.html#org.apache.spark.api.java.JavaRDD@mapToPair[K2,V2](f:org.apache.spark.api.java.function.PairFunction[T,K2,V2]):org.apache.spark.api.java.JavaPairRDD[K2,V2])
- and [`mapToDouble`](api/core/index.html#org.apache.spark.api.java.JavaRDD@mapToDouble[R](f:org.apache.spark.api.java.function.DoubleFunction[T]):org.apache.spark.api.java.JavaDoubleRDD).
- Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaDStream@transformToPair[K2,V2](transformFunc:org.apache.spark.api.java.function.Function[R,org.apache.spark.api.java.JavaPairRDD[K2,V2]]):org.apache.spark.streaming.api.java.JavaPairDStream[K2,V2]).
+ on the return type. In Spark core, the map functions (`map`, `flatMap`, and
+ `mapPartitons`) have type-specific versions, e.g.
+ [`mapToPair`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToPair(org.apache.spark.api.java.function.PairFunction))
+ and [`mapToDouble`](api/java/org/apache/spark/api/java/JavaRDDLike.html#mapToDouble(org.apache.spark.api.java.function.DoubleFunction)).
+ Spark Streaming also uses the same approach, e.g. [`transformToPair`](api/java/org/apache/spark/streaming/api/java/JavaDStreamLike.html#transformToPair(org.apache.spark.api.java.function.Function)).
# Example
@@ -205,16 +205,9 @@ JavaPairRDD<String, Integer> counts = lines.flatMapToPair(
There is no performance difference between these approaches; the choice is
just a matter of style.
-# Javadoc
-
-We currently provide documentation for the Java API as Scaladoc, in the
-[`org.apache.spark.api.java` package](api/core/index.html#org.apache.spark.api.java.package), because
-some of the classes are implemented in Scala. It is important to note that the types and function
-definitions show Scala syntax (for example, `def reduce(func: Function2[T, T]): T` instead of
-`T reduce(Function2<T, T> func)`). In addition, the Scala `trait` modifier is used for Java
-interface classes. We hope to generate documentation with Java-style syntax in the future to
-avoid these quirks.
+# API Docs
+[API documentation](api/java/index.html) for Spark in Java is available in Javadoc format.
# Where to Go from Here
diff --git a/docs/js/main.js b/docs/js/main.js
index 0bd2286cce..5905546711 100755
--- a/docs/js/main.js
+++ b/docs/js/main.js
@@ -73,8 +73,26 @@ function viewSolution() {
});
}
+// A script to fix internal hash links because we have an overlapping top bar.
+// Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510
+function maybeScrollToHash() {
+ console.log("HERE");
+ if (window.location.hash && $(window.location.hash).length) {
+ console.log("HERE2", $(window.location.hash), $(window.location.hash).offset().top);
+ var newTop = $(window.location.hash).offset().top - 57;
+ $(window).scrollTop(newTop);
+ }
+}
$(function() {
codeTabs();
viewSolution();
+
+ $(window).bind('hashchange', function() {
+ maybeScrollToHash();
+ });
+
+ // Scroll now too in case we had opened the page on a hash, but wait a bit because some browsers
+ // will try to do *their* initial scroll after running the onReady handler.
+ $(window).load(function() { setTimeout(function() { maybeScrollToHash(); }, 25); });
});
diff --git a/docs/mllib-classification-regression.md b/docs/mllib-classification-regression.md
index 2c42f60c2e..2e0fa093dc 100644
--- a/docs/mllib-classification-regression.md
+++ b/docs/mllib-classification-regression.md
@@ -316,26 +316,26 @@ For each of them, we support all 3 possible regularizations (none, L1 or L2).
Available algorithms for binary classification:
-* [SVMWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
-* [LogisticRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
+* [SVMWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.SVMWithSGD)
+* [LogisticRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.classification.LogisticRegressionWithSGD)
Available algorithms for linear regression:
-* [LinearRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
-* [RidgeRegressionWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
-* [LassoWithSGD](api/mllib/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
+* [LinearRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LinearRegressionWithSGD)
+* [RidgeRegressionWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.RidgeRegressionWithSGD)
+* [LassoWithSGD](api/scala/index.html#org.apache.spark.mllib.regression.LassoWithSGD)
Behind the scenes, all above methods use the SGD implementation from the
gradient descent primitive in MLlib, see the
<a href="mllib-optimization.html">optimization</a> part:
-* [GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+* [GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
#### Tree-based Methods
The decision tree algorithm supports binary classification and regression:
-* [DecisionTee](api/mllib/index.html#org.apache.spark.mllib.tree.DecisionTree)
+* [DecisionTee](api/scala/index.html#org.apache.spark.mllib.tree.DecisionTree)
# Usage in Scala
diff --git a/docs/mllib-clustering.md b/docs/mllib-clustering.md
index 50a8671560..0359c67157 100644
--- a/docs/mllib-clustering.md
+++ b/docs/mllib-clustering.md
@@ -33,7 +33,7 @@ a given dataset, the algorithm returns the best clustering result).
Available algorithms for clustering:
-* [KMeans](api/mllib/index.html#org.apache.spark.mllib.clustering.KMeans)
+* [KMeans](api/scala/index.html#org.apache.spark.mllib.clustering.KMeans)
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index aa22f67b30..2f1f5f3856 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -42,7 +42,7 @@ for an item.
Available algorithms for collaborative filtering:
-* [ALS](api/mllib/index.html#org.apache.spark.mllib.recommendation.ALS)
+* [ALS](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS)
# Usage in Scala
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 4236b0c8b6..0963a99881 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -36,15 +36,15 @@ The following links provide a detailed explanation of the methods and usage exam
# Data Types
Most MLlib algorithms operate on RDDs containing vectors. In Java and Scala, the
-[Vector](api/mllib/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
+[Vector](api/scala/index.html#org.apache.spark.mllib.linalg.Vector) class is used to
represent vectors. You can create either dense or sparse vectors using the
-[Vectors](api/mllib/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
+[Vectors](api/scala/index.html#org.apache.spark.mllib.linalg.Vectors$) factory.
In Python, MLlib can take the following vector types:
* [NumPy](http://www.numpy.org) arrays
* Standard Python lists (e.g. `[1, 2, 3]`)
-* The MLlib [SparseVector](api/pyspark/pyspark.mllib.linalg.SparseVector-class.html) class
+* The MLlib [SparseVector](api/python/pyspark.mllib.linalg.SparseVector-class.html) class
* [SciPy sparse matrices](http://docs.scipy.org/doc/scipy/reference/sparse.html)
For efficiency, we recommend using NumPy arrays over lists, and using the
@@ -52,8 +52,8 @@ For efficiency, we recommend using NumPy arrays over lists, and using the
for SciPy matrices, or MLlib's own SparseVector class.
Several other simple data types are used throughout the library, e.g. the LabeledPoint
-class ([Java/Scala](api/mllib/index.html#org.apache.spark.mllib.regression.LabeledPoint),
-[Python](api/pyspark/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
+class ([Java/Scala](api/scala/index.html#org.apache.spark.mllib.regression.LabeledPoint),
+[Python](api/python/pyspark.mllib.regression.LabeledPoint-class.html)) for labeled data.
# Dependencies
MLlib uses the [jblas](https://github.com/mikiobraun/jblas) linear algebra library, which itself
diff --git a/docs/mllib-optimization.md b/docs/mllib-optimization.md
index 396b98d52a..c79cc3d944 100644
--- a/docs/mllib-optimization.md
+++ b/docs/mllib-optimization.md
@@ -95,12 +95,12 @@ As an alternative to just use the subgradient `$R'(\wv)$` of the regularizer in
direction, an improved update for some cases can be obtained by using the proximal operator
instead.
For the L1-regularizer, the proximal operator is given by soft thresholding, as implemented in
-[L1Updater](api/mllib/index.html#org.apache.spark.mllib.optimization.L1Updater).
+[L1Updater](api/scala/index.html#org.apache.spark.mllib.optimization.L1Updater).
## Update Schemes for Distributed SGD
The SGD implementation in
-[GradientDescent](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent) uses
+[GradientDescent](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent) uses
a simple (distributed) sampling of the data examples.
We recall that the loss part of the optimization problem `$\eqref{eq:regPrimal}$` is
`$\frac1n \sum_{i=1}^n L(\wv;\x_i,y_i)$`, and therefore `$\frac1n \sum_{i=1}^n L'_{\wv,i}$` would
@@ -138,7 +138,7 @@ are developed, see the
section for example.
The SGD method
-[GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+[GradientDescent.runMiniBatchSGD](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
has the following parameters:
* `gradient` is a class that computes the stochastic gradient of the function
@@ -161,6 +161,6 @@ each iteration, to compute the gradient direction.
Available algorithms for gradient descent:
-* [GradientDescent.runMiniBatchSGD](api/mllib/index.html#org.apache.spark.mllib.optimization.GradientDescent)
+* [GradientDescent.runMiniBatchSGD](api/scala/index.html#org.apache.spark.mllib.optimization.GradientDescent)
diff --git a/docs/python-programming-guide.md b/docs/python-programming-guide.md
index 39de603b29..98233bf556 100644
--- a/docs/python-programming-guide.md
+++ b/docs/python-programming-guide.md
@@ -134,7 +134,7 @@ Files listed here will be added to the `PYTHONPATH` and shipped to remote worker
Code dependencies can be added to an existing SparkContext using its `addPyFile()` method.
You can set [configuration properties](configuration.html#spark-properties) by passing a
-[SparkConf](api/pyspark/pyspark.conf.SparkConf-class.html) object to SparkContext:
+[SparkConf](api/python/pyspark.conf.SparkConf-class.html) object to SparkContext:
{% highlight python %}
from pyspark import SparkConf, SparkContext
@@ -147,7 +147,7 @@ sc = SparkContext(conf = conf)
# API Docs
-[API documentation](api/pyspark/index.html) for PySpark is available as Epydoc.
+[API documentation](api/python/index.html) for PySpark is available as Epydoc.
Many of the methods also contain [doctests](http://docs.python.org/2/library/doctest.html) that provide additional usage examples.
# Libraries
diff --git a/docs/quick-start.md b/docs/quick-start.md
index 6b4f4ba425..68afa6e1bf 100644
--- a/docs/quick-start.md
+++ b/docs/quick-start.md
@@ -138,7 +138,9 @@ Spark README. Note that you'll need to replace YOUR_SPARK_HOME with the location
installed. Unlike the earlier examples with the Spark shell, which initializes its own SparkContext,
we initialize a SparkContext as part of the program.
-We pass the SparkContext constructor a SparkConf object which contains information about our
+We pass the SparkContext constructor a
+[SparkConf](api/scala/index.html#org.apache.spark.SparkConf)
+object which contains information about our
application. We also call sc.addJar to make sure that when our application is launched in cluster
mode, the jar file containing it will be shipped automatically to worker nodes.
@@ -327,4 +329,4 @@ Congratulations on running your first Spark application!
* For an in-depth overview of the API see "Programming Guides" menu section.
* For running applications on a cluster head to the [deployment overview](cluster-overview.html).
-* For configuration options available to Spark applications see the [configuration page](configuration.html). \ No newline at end of file
+* For configuration options available to Spark applications see the [configuration page](configuration.html).
diff --git a/docs/scala-programming-guide.md b/docs/scala-programming-guide.md
index 4431da0721..a3171709ff 100644
--- a/docs/scala-programming-guide.md
+++ b/docs/scala-programming-guide.md
@@ -147,7 +147,7 @@ All transformations in Spark are <i>lazy</i>, in that they do not compute their
By default, each transformed RDD is recomputed each time you run an action on it. However, you may also *persist* an RDD in memory using the `persist` (or `cache`) method, in which case Spark will keep the elements around on the cluster for much faster access the next time you query it. There is also support for persisting datasets on disk, or replicated across the cluster. The next section in this document describes these options.
-The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/core/index.html#org.apache.spark.rdd.RDD) for details):
+The following tables list the transformations and actions currently supported (see also the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD) for details):
### Transformations
@@ -216,7 +216,7 @@ The following tables list the transformations and actions currently supported (s
</tr>
</table>
-A complete list of transformations is available in the [RDD API doc](api/core/index.html#org.apache.spark.rdd.RDD).
+A complete list of transformations is available in the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD).
### Actions
@@ -264,7 +264,7 @@ A complete list of transformations is available in the [RDD API doc](api/core/in
</tr>
</table>
-A complete list of actions is available in the [RDD API doc](api/core/index.html#org.apache.spark.rdd.RDD).
+A complete list of actions is available in the [RDD API doc](api/scala/index.html#org.apache.spark.rdd.RDD).
## RDD Persistence
@@ -283,7 +283,7 @@ In addition, each RDD can be stored using a different *storage level*, allowing
persist the dataset on disk, or persist it in memory but as serialized Java objects (to save space),
or replicate it across nodes, or store the data in off-heap memory in [Tachyon](http://tachyon-project.org/).
These levels are chosen by passing a
-[`org.apache.spark.storage.StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel)
+[`org.apache.spark.storage.StorageLevel`](api/scala/index.html#org.apache.spark.storage.StorageLevel)
object to `persist()`. The `cache()` method is a shorthand for using the default storage level,
which is `StorageLevel.MEMORY_ONLY` (store deserialized objects in memory). The complete set of
available storage levels is:
@@ -355,7 +355,7 @@ waiting to recompute a lost partition.
If you want to define your own storage level (say, with replication factor of 3 instead of 2), then
use the function factor method `apply()` of the
-[`StorageLevel`](api/core/index.html#org.apache.spark.storage.StorageLevel$) singleton object.
+[`StorageLevel`](api/scala/index.html#org.apache.spark.storage.StorageLevel$) singleton object.
Spark has a block manager inside the Executors that let you chose memory, disk, or off-heap. The
latter is for storing RDDs off-heap outside the Executor JVM on top of the memory management system
diff --git a/docs/sql-programming-guide.md b/docs/sql-programming-guide.md
index 8e98cc0c80..e25379bd76 100644
--- a/docs/sql-programming-guide.md
+++ b/docs/sql-programming-guide.md
@@ -14,8 +14,8 @@ title: Spark SQL Programming Guide
Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
Spark. At the core of this component is a new type of RDD,
-[SchemaRDD](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed
-[Row](api/sql/catalyst/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
+[SchemaRDD](api/scala/index.html#org.apache.spark.sql.SchemaRDD). SchemaRDDs are composed
+[Row](api/scala/index.html#org.apache.spark.sql.catalyst.expressions.Row) objects along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
@@ -27,8 +27,8 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac
<div data-lang="java" markdown="1">
Spark SQL allows relational queries expressed in SQL, HiveQL, or Scala to be executed using
Spark. At the core of this component is a new type of RDD,
-[JavaSchemaRDD](api/sql/core/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed
-[Row](api/sql/catalyst/index.html#org.apache.spark.sql.api.java.Row) objects along with
+[JavaSchemaRDD](api/scala/index.html#org.apache.spark.sql.api.java.JavaSchemaRDD). JavaSchemaRDDs are composed of
+[Row](api/scala/index.html#org.apache.spark.sql.api.java.Row) objects along with
a schema that describes the data types of each column in the row. A JavaSchemaRDD is similar to a table
in a traditional relational database. A JavaSchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
@@ -38,8 +38,8 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac
Spark SQL allows relational queries expressed in SQL or HiveQL to be executed using
Spark. At the core of this component is a new type of RDD,
-[SchemaRDD](api/pyspark/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed
-[Row](api/pyspark/pyspark.sql.Row-class.html) objects along with
+[SchemaRDD](api/python/pyspark.sql.SchemaRDD-class.html). SchemaRDDs are composed of
+[Row](api/python/pyspark.sql.Row-class.html) objects along with
a schema that describes the data types of each column in the row. A SchemaRDD is similar to a table
in a traditional relational database. A SchemaRDD can be created from an existing RDD, parquet
file, or by running HiveQL against data stored in [Apache Hive](http://hive.apache.org/).
@@ -56,7 +56,7 @@ file, or by running HiveQL against data stored in [Apache Hive](http://hive.apac
<div data-lang="scala" markdown="1">
The entry point into all relational functionality in Spark is the
-[SQLContext](api/sql/core/index.html#org.apache.spark.sql.SQLContext) class, or one of its
+[SQLContext](api/scala/index.html#org.apache.spark.sql.SQLContext) class, or one of its
descendants. To create a basic SQLContext, all you need is a SparkContext.
{% highlight scala %}
@@ -72,7 +72,7 @@ import sqlContext._
<div data-lang="java" markdown="1">
The entry point into all relational functionality in Spark is the
-[JavaSQLContext](api/sql/core/index.html#org.apache.spark.sql.api.java.JavaSQLContext) class, or one
+[JavaSQLContext](api/scala/index.html#org.apache.spark.sql.api.java.JavaSQLContext) class, or one
of its descendants. To create a basic JavaSQLContext, all you need is a JavaSparkContext.
{% highlight java %}
@@ -85,7 +85,7 @@ JavaSQLContext sqlCtx = new org.apache.spark.sql.api.java.JavaSQLContext(ctx);
<div data-lang="python" markdown="1">
The entry point into all relational functionality in Spark is the
-[SQLContext](api/pyspark/pyspark.sql.SQLContext-class.html) class, or one
+[SQLContext](api/python/pyspark.sql.SQLContext-class.html) class, or one
of its descendants. To create a basic SQLContext, all you need is a SparkContext.
{% highlight python %}
@@ -331,7 +331,7 @@ val teenagers = people.where('age >= 10).where('age <= 19).select('name)
The DSL uses Scala symbols to represent columns in the underlying table, which are identifiers
prefixed with a tick (`'`). Implicit conversions turn these symbols into expressions that are
evaluated by the SQL execution engine. A full list of the functions supported can be found in the
-[ScalaDoc](api/sql/core/index.html#org.apache.spark.sql.SchemaRDD).
+[ScalaDoc](api/scala/index.html#org.apache.spark.sql.SchemaRDD).
<!-- TODO: Include the table of operations here. -->
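As a short, hedged illustration of the point above (reusing the `people` SchemaRDD and the `sqlContext` implicits from the earlier examples):

{% highlight scala %}
// 'age and 'name are Scala symbols that the implicit conversions turn into column expressions.
val teenagers = people.where('age >= 13).where('age <= 19).select('name)
teenagers.map(row => "Name: " + row(0)).collect().foreach(println)
{% endhighlight %}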
diff --git a/docs/streaming-custom-receivers.md b/docs/streaming-custom-receivers.md
index 3fb540c9fb..3cfa4516cc 100644
--- a/docs/streaming-custom-receivers.md
+++ b/docs/streaming-custom-receivers.md
@@ -9,7 +9,7 @@ This guide shows the programming model and features by walking through a simple
### Writing a Simple Receiver
-This starts with implementing [NetworkReceiver](api/streaming/index.html#org.apache.spark.streaming.dstream.NetworkReceiver).
+This starts with implementing [NetworkReceiver](api/scala/index.html#org.apache.spark.streaming.dstream.NetworkReceiver).
The following is a simple socket text-stream receiver.
@@ -125,4 +125,4 @@ _A more comprehensive example is provided in the spark streaming examples_
## References
1. [Akka Actor documentation](http://doc.akka.io/docs/akka/2.0.5/scala/actors.html)
-2.[NetworkReceiver](api/streaming/index.html#org.apache.spark.streaming.dstream.NetworkReceiver)
+2. [NetworkReceiver](api/scala/index.html#org.apache.spark.streaming.dstream.NetworkReceiver)
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index f9904d4501..946d6c4879 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -40,7 +40,7 @@ Spark Streaming provides a high-level abstraction called *discretized stream* or
which represents a continuous stream of data. DStreams can be created either from input data
streams from sources such as Kafka and Flume, or by applying high-level
operations on other DStreams. Internally, a DStream is represented as a sequence of
-[RDDs](api/core/index.html#org.apache.spark.rdd.RDD).
+[RDDs](api/scala/index.html#org.apache.spark.rdd.RDD).
This guide shows you how to start writing Spark Streaming programs with DStreams. You can
write Spark Streaming programs in Scala or Java, both of which are presented in this guide. You
@@ -62,7 +62,7 @@ First, we import the names of the Spark Streaming classes, and some implicit
conversions from StreamingContext into our environment, to add useful methods to
other classes we need (like DStream).
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) is the
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) is the
main entry point for all streaming functionality.
{% highlight scala %}
@@ -71,7 +71,7 @@ import org.apache.spark.streaming.StreamingContext._
{% endhighlight %}
Then we create a
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) object.
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) object.
Besides Spark's configuration, we specify that any DStream will be processed
in 1 second batches.
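A minimal sketch of that step, along the lines of the guide's example (the master string and port are illustrative; the `master, appName, batchDuration` constructor shown later in this guide is assumed):

{% highlight scala %}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._

// Two local threads and 1-second batches; listen on an illustrative TCP port.
val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999)
{% endhighlight %}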
@@ -132,7 +132,7 @@ The complete code can be found in the Spark Streaming example
<div data-lang="java" markdown="1">
First, we create a
-[JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object,
+[JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext) object,
which is the main entry point for all streaming
functionality. Besides Spark's configuration, we specify that any DStream will be processed
in 1 second batches.
@@ -168,7 +168,7 @@ JavaDStream<String> words = lines.flatMap(
generating multiple new records from each record in the source DStream. In this case,
each line will be split into multiple words and the stream of words is represented as the
`words` DStream. Note that we defined the transformation using a
-[FlatMapFunction](api/core/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
+[FlatMapFunction](api/scala/index.html#org.apache.spark.api.java.function.FlatMapFunction) object.
As we will discover along the way, there are a number of such convenience classes in the Java API
that help define DStream transformations.
@@ -192,9 +192,9 @@ wordCounts.print(); // Print a few of the counts to the console
{% endhighlight %}
The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
-1)` pairs, using a [PairFunction](api/core/index.html#org.apache.spark.api.java.function.PairFunction)
+1)` pairs, using a [PairFunction](api/scala/index.html#org.apache.spark.api.java.function.PairFunction)
object. Then, it is reduced to get the frequency of words in each batch of data,
-using a [Function2](api/core/index.html#org.apache.spark.api.java.function.Function2) object.
+using a [Function2](api/scala/index.html#org.apache.spark.api.java.function.Function2) object.
Finally, `wordCounts.print()` will print a few of the counts generated every second.
Note that when these lines are executed, Spark Streaming only sets up the computation it
@@ -333,7 +333,7 @@ for the full list of supported sources and artifacts.
<div data-lang="scala" markdown="1">
To initialize a Spark Streaming program in Scala, a
-[`StreamingContext`](api/streaming/index.html#org.apache.spark.streaming.StreamingContext)
+[`StreamingContext`](api/scala/index.html#org.apache.spark.streaming.StreamingContext)
object has to be created, which is the main entry point of all Spark Streaming functionality.
A `StreamingContext` object can be created by using
@@ -344,7 +344,7 @@ new StreamingContext(master, appName, batchDuration, [sparkHome], [jars])
<div data-lang="java" markdown="1">
To initialize a Spark Streaming program in Java, a
-[`JavaStreamingContext`](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
+[`JavaStreamingContext`](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
object has to be created, which is the main entry point of all Spark Streaming functionality.
A `JavaStreamingContext` object can be created by using
@@ -431,8 +431,8 @@ and process any files created in that directory. Note that
For more details on streams from files, Akka actors and sockets,
see the API documentation of the relevant functions in
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) for
-Scala and [JavaStreamingContext](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
+[StreamingContext](api/scala/index.html#org.apache.spark.streaming.StreamingContext) for
+Scala and [JavaStreamingContext](api/scala/index.html#org.apache.spark.streaming.api.java.JavaStreamingContext)
for Java.
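For instance, a text-file stream only needs a directory to monitor; a one-line sketch (the path is illustrative, and only files newly created in that directory are picked up):

{% highlight scala %}
val logLines = ssc.textFileStream("hdfs://...")   // illustrative directory to monitor
{% endhighlight %}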
Additional functionality for creating DStreams from sources such as Kafka, Flume, and Twitter
@@ -802,10 +802,10 @@ output operators are defined:
The complete list of DStream operations is available in the API documentation. For the Scala API,
-see [DStream](api/streaming/index.html#org.apache.spark.streaming.dstream.DStream)
-and [PairDStreamFunctions](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
-For the Java API, see [JavaDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.dstream.DStream)
-and [JavaPairDStream](api/streaming/index.html#org.apache.spark.streaming.api.java.JavaPairDStream).
+see [DStream](api/scala/index.html#org.apache.spark.streaming.dstream.DStream)
+and [PairDStreamFunctions](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions).
+For the Java API, see [JavaDStream](api/scala/index.html#org.apache.spark.streaming.api.java.dstream.DStream)
+and [JavaPairDStream](api/scala/index.html#org.apache.spark.streaming.api.java.JavaPairDStream).
Specifically for the Java API, see [Spark's Java programming guide](java-programming-guide.html)
for more information.
@@ -881,7 +881,7 @@ Cluster resources maybe under-utilized if the number of parallel tasks used in a
computation is not high enough. For example, for distributed reduce operations like `reduceByKey`
and `reduceByKeyAndWindow`, the default number of parallel tasks is 8. You can pass the level of
parallelism as an argument (see the
-[`PairDStreamFunctions`](api/streaming/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
+[`PairDStreamFunctions`](api/scala/index.html#org.apache.spark.streaming.dstream.PairDStreamFunctions)
documentation), or set the [config property](configuration.html#spark-properties)
`spark.default.parallelism` to change the default.
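As a hedged example (assuming `pairs` is a DStream of `(String, Int)` pairs as in the earlier word-count walkthrough; 16 partitions is purely illustrative):

{% highlight scala %}
// Pass the number of reduce tasks explicitly instead of relying on the default of 8.
val windowedCounts = pairs.reduceByKeyAndWindow((a: Int, b: Int) => a + b, Seconds(30), Seconds(10), 16)
{% endhighlight %}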
@@ -925,7 +925,7 @@ A good approach to figure out the right batch size for your application is to te
conservative batch size (say, 5-10 seconds) and a low data rate. To verify whether the system
is able to keep up with the data rate, you can check the value of the end-to-end delay experienced
by each processed batch (either look for "Total delay" in Spark driver log4j logs, or use the
-[StreamingListener](api/streaming/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
+[StreamingListener](api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener)
interface).
If the delay remains comparable to the batch size, then the system is stable. Otherwise,
if the delay is continuously increasing, it means that the system is unable to keep up and it
@@ -952,7 +952,7 @@ exception saying so.
## Monitoring
Besides Spark's in-built [monitoring capabilities](monitoring.html),
the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+(api/scala/index.html#org.apache.spark.streaming.scheduler.StreamingListener) interface,
which allows you to get statistics of batch processing times, queueing delays,
and total end-to-end delays. Note that this is still an experimental API and it is likely to be
improved upon (i.e., more information reported) in the future.
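A hedged sketch of such a listener (the metric names follow `BatchInfo` in the `org.apache.spark.streaming.scheduler` package, and `ssc` is the StreamingContext):

{% highlight scala %}
import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Print the end-to-end delay of every completed batch.
class DelayLogger extends StreamingListener {
  override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
    val info = batchCompleted.batchInfo
    println("Batch total delay: " + info.totalDelay.getOrElse(-1L) + " ms")
  }
}

ssc.addStreamingListener(new DelayLogger)
{% endhighlight %}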
@@ -965,9 +965,9 @@ in Spark Streaming applications and achieving more consistent batch processing t
* **Default persistence level of DStreams**: Unlike RDDs, the default persistence level of DStreams
serializes the data in memory (that is,
-[StorageLevel.MEMORY_ONLY_SER](api/core/index.html#org.apache.spark.storage.StorageLevel$) for
+[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for
DStream compared to
-[StorageLevel.MEMORY_ONLY](api/core/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
+[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
Even though keeping the data serialized incurs higher serialization/deserialization overheads,
it significantly reduces GC pauses.
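A particular stream's level can also be overridden explicitly; a one-line sketch (assuming `wordCounts` is a DStream from the earlier example):

{% highlight scala %}
import org.apache.spark.storage.StorageLevel

wordCounts.persist(StorageLevel.MEMORY_ONLY)   // keep deserialized objects instead of the serialized default
{% endhighlight %}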
@@ -1244,15 +1244,15 @@ and output 30 after recovery.
# Where to Go from Here
* API documentation
- - Main docs of StreamingContext and DStreams in [Scala](api/streaming/index.html#org.apache.spark.streaming.package)
- and [Java](api/streaming/index.html#org.apache.spark.streaming.api.java.package)
+ - Main docs of StreamingContext and DStreams in [Scala](api/scala/index.html#org.apache.spark.streaming.package)
+ and [Java](api/scala/index.html#org.apache.spark.streaming.api.java.package)
- Additional docs for
- [Kafka](api/external/kafka/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
- [Flume](api/external/flume/index.html#org.apache.spark.streaming.flume.FlumeUtils$),
- [Twitter](api/external/twitter/index.html#org.apache.spark.streaming.twitter.TwitterUtils$),
- [ZeroMQ](api/external/zeromq/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and
- [MQTT](api/external/mqtt/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$)
+ [Kafka](api/scala/index.html#org.apache.spark.streaming.kafka.KafkaUtils$),
+ [Flume](api/scala/index.html#org.apache.spark.streaming.flume.FlumeUtils$),
+ [Twitter](api/scala/index.html#org.apache.spark.streaming.twitter.TwitterUtils$),
+ [ZeroMQ](api/scala/index.html#org.apache.spark.streaming.zeromq.ZeroMQUtils$), and
+ [MQTT](api/scala/index.html#org.apache.spark.streaming.mqtt.MQTTUtils$)
* More examples in [Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/streaming/examples)
and [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/streaming/examples)
-* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming
+* [Paper](http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-259.pdf) describing Spark Streaming.
diff --git a/docs/tuning.md b/docs/tuning.md
index cc069f0e84..78e10770a8 100644
--- a/docs/tuning.md
+++ b/docs/tuning.md
@@ -48,7 +48,7 @@ Spark automatically includes Kryo serializers for the many commonly-used core Sc
in the AllScalaRegistrar from the [Twitter chill](https://github.com/twitter/chill) library.
To register your own custom classes with Kryo, create a public class that extends
-[`org.apache.spark.serializer.KryoRegistrator`](api/core/index.html#org.apache.spark.serializer.KryoRegistrator) and set the
+[`org.apache.spark.serializer.KryoRegistrator`](api/scala/index.html#org.apache.spark.serializer.KryoRegistrator) and set the
`spark.kryo.registrator` config property to point to it, as follows:
{% highlight scala %}
@@ -222,7 +222,7 @@ enough. Spark automatically sets the number of "map" tasks to run on each file a
(though you can control it through optional parameters to `SparkContext.textFile`, etc), and for
distributed "reduce" operations, such as `groupByKey` and `reduceByKey`, it uses the largest
parent RDD's number of partitions. You can pass the level of parallelism as a second argument
-(see the [`spark.PairRDDFunctions`](api/core/index.html#org.apache.spark.rdd.PairRDDFunctions) documentation),
+(see the [`spark.PairRDDFunctions`](api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions) documentation),
or set the config property `spark.default.parallelism` to change the default.
In general, we recommend 2-3 tasks per CPU core in your cluster.
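As a hedged example (48 partitions is purely illustrative, and `pairs` stands for any RDD of key-value pairs):

{% highlight scala %}
// Pass the number of reduce tasks as a second argument...
val counts = pairs.reduceByKey(_ + _, 48)

// ...or raise the default for all shuffles through the config property.
val conf = new org.apache.spark.SparkConf().set("spark.default.parallelism", "48")
{% endhighlight %}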