From 41b312212094b2accd650813dd45e1767b5465fe Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Tue, 29 Oct 2013 20:57:55 -0700
Subject: Starting to improve README.

---
 docs/img/data_parallel_vs_graph_parallel.png | Bin 0 -> 199060 bytes
 docs/img/edge-cut.png                        | Bin 0 -> 12563 bytes
 docs/img/graph_parallel.png                  | Bin 0 -> 92288 bytes
 docs/img/tables_and_graphs.png               | Bin 0 -> 68905 bytes
 docs/img/vertex-cut.png                      | Bin 0 -> 12246 bytes
 5 files changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 docs/img/data_parallel_vs_graph_parallel.png
 create mode 100644 docs/img/edge-cut.png
 create mode 100644 docs/img/graph_parallel.png
 create mode 100644 docs/img/tables_and_graphs.png
 create mode 100644 docs/img/vertex-cut.png

(limited to 'docs')
diff --git a/docs/img/data_parallel_vs_graph_parallel.png b/docs/img/data_parallel_vs_graph_parallel.png
new file mode 100644
index 0000000000..d9aa811466
Binary files /dev/null and b/docs/img/data_parallel_vs_graph_parallel.png differ
diff --git a/docs/img/edge-cut.png b/docs/img/edge-cut.png
new file mode 100644
index 0000000000..698f4ff181
Binary files /dev/null and b/docs/img/edge-cut.png differ
diff --git a/docs/img/graph_parallel.png b/docs/img/graph_parallel.png
new file mode 100644
index 0000000000..330be5567c
Binary files /dev/null and b/docs/img/graph_parallel.png differ
diff --git a/docs/img/tables_and_graphs.png b/docs/img/tables_and_graphs.png
new file mode 100644
index 0000000000..9af07d3081
Binary files /dev/null and b/docs/img/tables_and_graphs.png differ
diff --git a/docs/img/vertex-cut.png b/docs/img/vertex-cut.png
new file mode 100644
index 0000000000..0a508dcee9
Binary files /dev/null and b/docs/img/vertex-cut.png differ
-- 
cgit v1.2.3


From e4483582fc59330af8a43e8a152959f927103c79 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Thu, 9 Jan 2014 10:23:35 -0800
Subject: Add docs/graphx-programming-guide.md from
 7210257ba3038d5e22d4b60fe9c3113dc45c3dff:README.md

---
 docs/graphx-programming-guide.md | 197 +++++++++++++++++++++++++++++++++++++++
 1 file changed, 197 insertions(+)
 create mode 100644 docs/graphx-programming-guide.md

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
new file mode 100644
index 0000000000..5b06d82225
--- /dev/null
+++ b/docs/graphx-programming-guide.md
@@ -0,0 +1,197 @@
+# GraphX: Unifying Graphs and Tables
+
+
+GraphX extends the distributed fault-tolerant collections API and
+interactive console of [Spark](http://spark.incubator.apache.org) with
+a new graph API which leverages recent advances in graph systems
+(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
+interactively build, transform, and reason about graph structured data
+at scale.
+
+
+## Motivation
+
+From social networks and targeted advertising to protein modeling and
+astrophysics, big graphs capture the structure in data and are central
+to the recent advances in machine learning and data mining. Directly
+applying existing *data-parallel* tools (e.g.,
+[Hadoop](http://hadoop.apache.org) and
+[Spark](http://spark.incubator.apache.org)) to graph computation tasks
+can be cumbersome and inefficient.  The need for intuitive, scalable
+tools for graph computation has lead to the development of new
+*graph-parallel* systems (e.g.,
+[Pregel](http://http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)) which are designed to efficiently
+execute graph algorithms.  Unfortunately, these systems do not address
+the challenges of graph construction and transformation and provide
+limited fault-tolerance and support for interactive analysis.
+
+<p align="center">
+  <img src="https://raw.github.com/amplab/graphx/master/docs/img/data_parallel_vs_graph_parallel.png" />
+</p>
+
+
+
+## Solution
+
+The GraphX project combines the advantages of both data-parallel and
+graph-parallel systems by efficiently expressing graph computation
+within the [Spark](http://spark.incubator.apache.org) framework.  We
+leverage new ideas in distributed graph representation to efficiently
+distribute graphs as tabular data-structures.  Similarly, we leverage
+advances in data-flow systems to exploit in-memory computation and
+fault-tolerance.  We provide powerful new operations to simplify graph
+construction and transformation.  Using these primitives we implement
+the PowerGraph and Pregel abstractions in less than 20 lines of code.
+Finally, by exploiting the Scala foundation of Spark, we enable users
+to interactively load, transform, and compute on massive graphs.
+
+<p align="center">
+  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
+</p>
+
+## Examples
+
+Suppose I want to build a graph from some text files, restrict the graph
+to important relationships and users, run page-rank on the sub-graph, and
+then finally return attributes associated with the top users.  I can do
+all of this in just a few lines with GraphX:
+
+```scala
+// Connect to the Spark cluster
+val sc = new SparkContext("spark://master.amplab.org", "research")
+
+// Load my user data and parse each tab-separated line into (user id, attribute list)
+val users = sc.textFile("hdfs://user_attributes.tsv")
+  .map(line => line.split("\t")).map(parts => (parts.head, parts.tail))
+
+// Parse the edge data which is already in userId -> userId format
+val followerGraph = Graph.textFile(sc, "hdfs://followers.tsv")
+
+// Attach the user attributes
+val graph = followerGraph.outerJoinVertices(users){
+  case (uid, deg, Some(attrList)) => attrList
+  // Some users may not have attributes so we set them as empty
+  case (uid, deg, None) => Array.empty[String]
+  }
+
+// Restrict the graph to users which have exactly two attributes
+val subgraph = graph.subgraph((vid, attr) => attr.size == 2)
+
+// Compute the PageRank
+val pagerankGraph = Analytics.pagerank(subgraph)
+
+// Get the attributes of the top pagerank users
+val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
+  case (uid, attrList, Some(pr)) => (pr, attrList)
+  case (uid, attrList, None) => (0.0, attrList) // no rank available; 0.0 is a placeholder
+  }
+
+println(userInfoWithPageRank.top(5))
+
+```
+
+
+## Online Documentation
+
+You can find the latest Spark documentation, including a programming
+guide, on the project webpage at
+<http://spark.incubator.apache.org/documentation.html>.  This README
+file only contains basic setup instructions.
+
+
+## Building
+
+Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
+project is built using Simple Build Tool (SBT), which is packaged with
+it. To build Spark and its example programs, run:
+
+    sbt/sbt assembly
+
+Once you've built Spark, the easiest way to start using it is the
+shell:
+
+    ./spark-shell
+
+Or, for the Python API, the Python shell (`./pyspark`).
+
+Spark also comes with several sample programs in the `examples`
+directory.  To run one of them, use `./run-example <class>
+<params>`. For example:
+
+    ./run-example org.apache.spark.examples.SparkLR local[2]
+
+will run the Logistic Regression example locally on 2 CPUs.
+
+Each of the example programs prints usage help if no params are given.
+
+All of the Spark samples take a `<master>` parameter that is the
+cluster URL to connect to. This can be a mesos:// or spark:// URL, or
+"local" to run locally with one thread, or "local[N]" to run locally
+with N threads.
+
+
+## A Note About Hadoop Versions
+
+Spark uses the Hadoop core library to talk to HDFS and other
+Hadoop-supported storage systems. Because the protocols have changed
+in different versions of Hadoop, you must build Spark against the same
+version that your cluster runs.  You can change the version by setting
+the `SPARK_HADOOP_VERSION` environment when building Spark.
+
+For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
+versions without YARN, use:
+
+    # Apache Hadoop 1.2.1
+    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
+
+    # Cloudera CDH 4.2.0 with MapReduce v1
+    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
+
+For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
+with YARN, also set `SPARK_YARN=true`:
+
+    # Apache Hadoop 2.0.5-alpha
+    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
+
+    # Cloudera CDH 4.2.0 with MapReduce v2
+    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
+
+For convenience, these variables may also be set through the
+`conf/spark-env.sh` file described below.
+
+When developing a Spark application, specify the Hadoop version by adding the
+"hadoop-client" artifact to your project's dependencies. For example, if you're
+using Hadoop 1.2.1 and build your application using SBT, add this entry to
+`libraryDependencies`:
+
+    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
+
+If your project is built with Maven, add this to your POM file's
+`<dependencies>` section:
+
+    <dependency>
+      <groupId>org.apache.hadoop</groupId>
+      <artifactId>hadoop-client</artifactId>
+      <version>1.2.1</version>
+    </dependency>
+
+
+## Configuration
+
+Please refer to the [Configuration
+guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
+in the online documentation for an overview on how to configure Spark.
+
+
+## Contributing to GraphX
+
+Contributions via GitHub pull requests are gladly accepted from their
+original author. Along with any pull requests, please state that the
+contribution is your original work and that you license the work to
+the project under the project's open source license. Whether or not
+you state this explicitly, by submitting any copyrighted material via
+pull request, email, or other means you agree to license the material
+under the project's open source license and warrant that you have the
+legal authority to do so.
+
-- 
cgit v1.2.3


From b5b0de2de53563c43e1c5844a52b4eeeb2542ea5 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Thu, 9 Jan 2014 13:24:25 -0800
Subject: Start fixing formatting of graphx-programming-guide

---
 docs/graphx-programming-guide.md | 13 ++++++-------
 1 file changed, 6 insertions(+), 7 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 5b06d82225..ebc47f5d1c 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -1,4 +1,7 @@
-# GraphX: Unifying Graphs and Tables
+---
+layout: global
+title: "GraphX: Unifying Graphs and Tables"
+---
 
 
 GraphX extends the distributed fault-tolerant collections API and
@@ -26,11 +29,8 @@ execute graph algorithms.  Unfortunately, these systems do not address
 the challenges of graph construction and transformation and provide
 limited fault-tolerance and support for interactive analysis.
 
-<p align="center">
-  <img src="https://raw.github.com/amplab/graphx/master/docs/img/data_parallel_vs_graph_parallel.png" />
-</p>
-
-
+{:.pagination-centered}
+![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)
 
 ## Solution
 
@@ -194,4 +194,3 @@ you state this explicitly, by submitting any copyrighted material via
 pull request, email, or other means you agree to license the material
 under the project's open source license and warrant that you have the
 legal authority to do so.
-
-- 
cgit v1.2.3


From b1eeefb4016d69aa0beadd302496c8250766d9b7 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Fri, 10 Jan 2014 00:39:08 -0800
Subject: WIP. Updating figures and cleaning up initial skeleton for GraphX
 Programming guide.

---
 docs/_layouts/global.html                    |  10 +-
 docs/graphx-programming-guide.md             | 277 ++++++++++++---------------
 docs/img/data_parallel_vs_graph_parallel.png | Bin 199060 -> 432725 bytes
 docs/img/edge_cut_vs_vertex_cut.png          | Bin 0 -> 79745 bytes
 docs/img/graph_analytics_pipeline.png        | Bin 0 -> 427220 bytes
 docs/img/graphx_figures.pptx                 | Bin 0 -> 1118035 bytes
 docs/img/graphx_logo.png                     | Bin 0 -> 40324 bytes
 docs/img/graphx_performance_comparison.png   | Bin 0 -> 166343 bytes
 docs/img/property_graph.png                  | Bin 0 -> 79056 bytes
 docs/img/tables_and_graphs.png               | Bin 68905 -> 166265 bytes
 docs/img/vertex_routing_edge_tables.png      | Bin 0 -> 570007 bytes
 docs/index.md                                |   6 +-
 12 files changed, 134 insertions(+), 159 deletions(-)
 create mode 100644 docs/img/edge_cut_vs_vertex_cut.png
 create mode 100644 docs/img/graph_analytics_pipeline.png
 create mode 100644 docs/img/graphx_figures.pptx
 create mode 100644 docs/img/graphx_logo.png
 create mode 100644 docs/img/graphx_performance_comparison.png
 create mode 100644 docs/img/property_graph.png
 create mode 100644 docs/img/vertex_routing_edge_tables.png

(limited to 'docs')

diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index ad7969d012..7721854685 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -21,7 +21,7 @@
         <link rel="stylesheet" href="css/main.css">
 
         <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
-        
+
         <link rel="stylesheet" href="css/pygments-default.css">
 
         <!-- Google analytics script -->
@@ -67,10 +67,10 @@
                                 <li class="divider"></li>
                                 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
                                 <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
-                                <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+                                <li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
                             </ul>
                         </li>
-                        
+
                         <li class="dropdown">
                             <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
                             <ul class="dropdown-menu">
@@ -79,7 +79,7 @@
                                 <li class="divider"></li>
                                 <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
                                 <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
-                                <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
+                                <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Paralle Spark)</a></li>
                             </ul>
                         </li>
 
@@ -161,7 +161,7 @@
         <script src="js/vendor/jquery-1.8.0.min.js"></script>
         <script src="js/vendor/bootstrap.min.js"></script>
         <script src="js/main.js"></script>
-        
+
         <!-- A script to fix internal hash links because we have an overlapping top bar.
              Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510 -->
         <script>
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index ebc47f5d1c..a551e4306d 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -1,63 +1,141 @@
 ---
 layout: global
-title: "GraphX: Unifying Graphs and Tables"
+title: GraphX Programming Guide
 ---
 
+* This will become a table of contents (this text will be scraped).
+{:toc}
 
-GraphX extends the distributed fault-tolerant collections API and
-interactive console of [Spark](http://spark.incubator.apache.org) with
-a new graph API which leverages recent advances in graph systems
-(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
-interactively build, transform, and reason about graph structured data
-at scale.
-
-
-## Motivation
-
-From social networks and targeted advertising to protein modeling and
-astrophysics, big graphs capture the structure in data and are central
-to the recent advances in machine learning and data mining. Directly
-applying existing *data-parallel* tools (e.g.,
-[Hadoop](http://hadoop.apache.org) and
-[Spark](http://spark.incubator.apache.org)) to graph computation tasks
-can be cumbersome and inefficient.  The need for intuitive, scalable
-tools for graph computation has lead to the development of new
-*graph-parallel* systems (e.g.,
-[Pregel](http://http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)) which are designed to efficiently
-execute graph algorithms.  Unfortunately, these systems do not address
-the challenges of graph construction and transformation and provide
-limited fault-tolerance and support for interactive analysis.
-
-{:.pagination-centered}
-![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)
-
-## Solution
-
-The GraphX project combines the advantages of both data-parallel and
-graph-parallel systems by efficiently expressing graph computation
-within the [Spark](http://spark.incubator.apache.org) framework.  We
-leverage new ideas in distributed graph representation to efficiently
-distribute graphs as tabular data-structures.  Similarly, we leverage
-advances in data-flow systems to exploit in-memory computation and
-fault-tolerance.  We provide powerful new operations to simplify graph
-construction and transformation.  Using these primitives we implement
-the PowerGraph and Pregel abstractions in less than 20 lines of code.
-Finally, by exploiting the Scala foundation of Spark, we enable users
-to interactively load, transform, and compute on massive graphs.
-
-<p align="center">
-  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
+<p style="text-align: center;">
+  <img src="img/graphx_logo.png"
+       title="GraphX Logo"
+       alt="GraphX"
+       width="65%" />
 </p>
 
-## Examples
+# Overview
+
+GraphX is the new (alpha) Spark API for graphs and graph-parallel
+computation. At a high level, GraphX extends the Spark
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
+introducing the [Resilient Distributed property Graph (RDG)](#property_graph):
+a directed graph with properties attached to each vertex and edge.
+To support graph computation, GraphX exposes a set of functions
+(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
+[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
+APIs. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
+graph analytics tasks.
+
+## Background on Graph-Parallel Computation
+
+From social networks to language modeling, the growing scale and importance of
+graph data has driven the development of numerous new *graph-parallel* systems
+(e.g., [Giraph](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)).  By restricting the types of computation that can be
+expressed and introducing new techniques to partition and distribute graphs,
+these systems can efficiently execute sophisticated graph algorithms orders of
+magnitude faster than more general *data-parallel* systems.
+
+<p style="text-align: center;">
+  <img src="img/data_parallel_vs_graph_parallel.png"
+       title="Data-Parallel vs. Graph-Parallel"
+       alt="Data-Parallel vs. Graph-Parallel"
+       width="50%" />
+</p>
+
+However, the same restrictions that enable these substantial performance gains
+also make it difficult to express many of the important stages in a typical graph-analytics pipeline:
+constructing the graph, modifying its structure, or expressing computation that
+spans multiple graphs.  As a consequence, existing graph analytics pipelines
+compose graph-parallel and data-parallel systems, leading to extensive data
+movement and duplication and a complicated programming model.
+
+<p style="text-align: center;">
+  <img src="img/graph_analytics_pipeline.png"
+       title="Graph Analytics Pipeline"
+       alt="Graph Analytics Pipeline"
+       width="50%" />
+</p>
+
+The goal of the GraphX project is to unify graph-parallel and data-parallel
+computation in one system with a single composable API. This goal is achieved
+through an API that enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication and by
+incorporating advances in graph-parallel systems to optimize the execution of
+operations on the graph view.  In preliminary experiments we find that the GraphX
+system is able to achieve performance comparable to state-of-the-art
+graph-parallel systems while easily expressing entire analytics pipelines.
+
+<p style="text-align: center;">
+  <img src="img/graphx_performance_comparison.png"
+       title="GraphX Performance Comparison"
+       alt="GraphX Performance Comparison"
+       width="50%" />
+</p>
+
+## GraphX Replaces the Spark Bagel API
+
+Prior to the release of GraphX, graph computation in Spark was expressed using
+Bagel, an implementation of the Pregel API.  GraphX improves upon Bagel by exposing
+a richer property graph API, a more streamlined version of the Pregel abstraction,
+and system optimizations to improve performance and reduce memory
+overhead.  While we plan to eventually deprecate Bagel, we will continue to
+support the API and the [Bagel programming guide](bagel-programming-guide.html). However,
+we encourage Bagel users to explore the new GraphX API and comment on issues that may
+complicate the transition from Bagel.
+
+# The Property Graph
+<a name="property_graph"></a>
+
+<p style="text-align: center;">
+  <img src="img/edge_cut_vs_vertex_cut.png"
+       title="Edge Cut vs. Vertex Cut"
+       alt="Edge Cut vs. Vertex Cut"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/property_graph.png"
+       title="The Property Graph"
+       alt="The Property Graph"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/vertex_routing_edge_tables.png"
+       title="RDD Graph Representation"
+       alt="RDD Graph Representation"
+       width="50%" />
+</p>
+
+
+# Graph Operators
+
+## Map Reduce Triplets (mapReduceTriplets)
+<a name="mrTriplets"></a>
+
+# Graph Algorithms
+<a name="graph_algorithms"></a>
+
+# Graph Builders
+<a name="graph_builders"></a>
+
+<p style="text-align: center;">
+  <img src="img/tables_and_graphs.png"
+       title="Tables and Graphs"
+       alt="Tables and Graphs"
+       width="50%" />
+</p>
+
+# Examples
 
 Suppose I want to build a graph from some text files, restrict the graph
 to important relationships and users, run page-rank on the sub-graph, and
 then finally return attributes associated with the top users.  I can do
 all of this in just a few lines with GraphX:
 
-```scala
+{% highlight scala %}
 // Connect to the Spark cluster
 val sc = new SparkContext("spark://master.amplab.org", "research")
 
@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
 
 println(userInfoWithPageRank.top(5))
 
-```
-
-
-## Online Documentation
-
-You can find the latest Spark documentation, including a programming
-guide, on the project webpage at
-<http://spark.incubator.apache.org/documentation.html>.  This README
-file only contains basic setup instructions.
-
-
-## Building
-
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
-project is built using Simple Build Tool (SBT), which is packaged with
-it. To build Spark and its example programs, run:
-
-    sbt/sbt assembly
-
-Once you've built Spark, the easiest way to start using it is the
-shell:
-
-    ./spark-shell
-
-Or, for the Python API, the Python shell (`./pyspark`).
-
-Spark also comes with several sample programs in the `examples`
-directory.  To run one of them, use `./run-example <class>
-<params>`. For example:
-
-    ./run-example org.apache.spark.examples.SparkLR local[2]
-
-will run the Logistic Regression example locally on 2 CPUs.
-
-Each of the example programs prints usage help if no params are given.
-
-All of the Spark samples take a `<master>` parameter that is the
-cluster URL to connect to. This can be a mesos:// or spark:// URL, or
-"local" to run locally with one thread, or "local[N]" to run locally
-with N threads.
-
-
-## A Note About Hadoop Versions
-
-Spark uses the Hadoop core library to talk to HDFS and other
-Hadoop-supported storage systems. Because the protocols have changed
-in different versions of Hadoop, you must build Spark against the same
-version that your cluster runs.  You can change the version by setting
-the `SPARK_HADOOP_VERSION` environment when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
-    # Apache Hadoop 1.2.1
-    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v1
-    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
-
-For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `SPARK_YARN=true`:
-
-    # Apache Hadoop 2.0.5-alpha
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v2
-    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-
-For convenience, these variables may also be set through the
-`conf/spark-env.sh` file described below.
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
-    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
-
-If your project is built with Maven, add this to your POM file's
-`<dependencies>` section:
-
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>1.2.1</version>
-    </dependency>
-
-
-## Configuration
-
-Please refer to the [Configuration
-guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
-in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to GraphX
+{% endhighlight %}
 
-Contributions via GitHub pull requests are gladly accepted from their
-original author. Along with any pull requests, please state that the
-contribution is your original work and that you license the work to
-the project under the project's open source license. Whether or not
-you state this explicitly, by submitting any copyrighted material via
-pull request, email, or other means you agree to license the material
-under the project's open source license and warrant that you have the
-legal authority to do so.
diff --git a/docs/img/data_parallel_vs_graph_parallel.png b/docs/img/data_parallel_vs_graph_parallel.png
index d9aa811466..d3918f01d8 100644
Binary files a/docs/img/data_parallel_vs_graph_parallel.png and b/docs/img/data_parallel_vs_graph_parallel.png differ
diff --git a/docs/img/edge_cut_vs_vertex_cut.png b/docs/img/edge_cut_vs_vertex_cut.png
new file mode 100644
index 0000000000..ae30396d3f
Binary files /dev/null and b/docs/img/edge_cut_vs_vertex_cut.png differ
diff --git a/docs/img/graph_analytics_pipeline.png b/docs/img/graph_analytics_pipeline.png
new file mode 100644
index 0000000000..6d606e0189
Binary files /dev/null and b/docs/img/graph_analytics_pipeline.png differ
diff --git a/docs/img/graphx_figures.pptx b/docs/img/graphx_figures.pptx
new file mode 100644
index 0000000000..c67ddb4876
Binary files /dev/null and b/docs/img/graphx_figures.pptx differ
diff --git a/docs/img/graphx_logo.png b/docs/img/graphx_logo.png
new file mode 100644
index 0000000000..9869ac148c
Binary files /dev/null and b/docs/img/graphx_logo.png differ
diff --git a/docs/img/graphx_performance_comparison.png b/docs/img/graphx_performance_comparison.png
new file mode 100644
index 0000000000..62dcf098c9
Binary files /dev/null and b/docs/img/graphx_performance_comparison.png differ
diff --git a/docs/img/property_graph.png b/docs/img/property_graph.png
new file mode 100644
index 0000000000..859d4013fb
Binary files /dev/null and b/docs/img/property_graph.png differ
diff --git a/docs/img/tables_and_graphs.png b/docs/img/tables_and_graphs.png
index 9af07d3081..ec37bb45a6 100644
Binary files a/docs/img/tables_and_graphs.png and b/docs/img/tables_and_graphs.png differ
diff --git a/docs/img/vertex_routing_edge_tables.png b/docs/img/vertex_routing_edge_tables.png
new file mode 100644
index 0000000000..4379becc87
Binary files /dev/null and b/docs/img/vertex_routing_edge_tables.png differ
diff --git a/docs/index.md b/docs/index.md
index 86d574daaa..7228809738 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -5,7 +5,7 @@ title: Spark Overview
 
 Apache Spark is a fast and general-purpose cluster computing system.
 It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
-It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [Bagel](bagel-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
+It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
 
 # Downloading
 
@@ -77,7 +77,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
   * [Python Programming Guide](python-programming-guide.html): using Spark from Python
 * [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
 * [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
-* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
+* [GraphX (Graphs on Spark)](graphx-programming-guide.html): simple graph processing model
 
 **API Docs:**
 
@@ -85,7 +85,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
 * [Spark for Python (Epydoc)](api/pyspark/index.html)
 * [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
 * [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
-* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
+* [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)
 
 
 **Deployment guides:**
-- 
cgit v1.2.3


From 6bd9a78e78d42dc5c216af4b6f59a71a002f82e5 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Fri, 10 Jan 2014 11:37:10 -0800
Subject: Add back Bagel links to docs, but mark them superseded

---
 docs/_layouts/global.html        |  4 +++-
 docs/api.md                      |  3 ++-
 docs/bagel-programming-guide.md  | 10 ++++++----
 docs/graphx-programming-guide.md | 14 +++++++-------
 docs/index.md                    |  4 +++-
 5 files changed, 21 insertions(+), 14 deletions(-)

(limited to 'docs')

diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 7721854685..36eb49df14 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -67,6 +67,7 @@
                                 <li class="divider"></li>
                                 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
                                 <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
+                                <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark, superseded by GraphX)</a></li>
                                 <li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
                             </ul>
                         </li>
@@ -79,7 +80,8 @@
                                 <li class="divider"></li>
                                 <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
                                 <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
-                                <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Paralle Spark)</a></li>
+                                <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark, superseded by GraphX)</a></li>
+                                <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
                             </ul>
                         </li>
 
diff --git a/docs/api.md b/docs/api.md
index e86d07770a..7639e58053 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -8,5 +8,6 @@ Here you can find links to the Scaladoc generated for the Spark sbt subprojects.
 - [Spark](api/core/index.html)
 - [Spark Examples](api/examples/index.html)
 - [Spark Streaming](api/streaming/index.html)
-- [Bagel](api/bagel/index.html)
+- [Bagel](api/bagel/index.html) *(superseded by GraphX)*
+- [GraphX](api/graphx/index.html)
 - [PySpark](api/pyspark/index.html)
diff --git a/docs/bagel-programming-guide.md b/docs/bagel-programming-guide.md
index c4f1f6d6ad..a1339ec735 100644
--- a/docs/bagel-programming-guide.md
+++ b/docs/bagel-programming-guide.md
@@ -3,6 +3,8 @@ layout: global
 title: Bagel Programming Guide
 ---
 
+**Bagel has been superseded by [GraphX](graphx-programming-guide.html) for graph processing. New users should use GraphX instead.**
+
 Bagel is a Spark implementation of Google's [Pregel](http://portal.acm.org/citation.cfm?id=1807184) graph processing framework. Bagel currently supports basic graph computation, combiners, and aggregators.
 
 In the Pregel programming model, jobs run as a sequence of iterations called _supersteps_. In each superstep, each vertex in the graph runs a user-specified function that can update state associated with the vertex and send messages to other vertices for use in the *next* iteration.
@@ -21,7 +23,7 @@ To use Bagel in your program, add the following SBT or Maven dependency:
 
 Bagel operates on a graph represented as a [distributed dataset](scala-programming-guide.html) of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex that takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
 
-For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to. 
+For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
 
 We first extend the default `Vertex` class to store a `Double`
 representing the current PageRank of the vertex, and similarly extend
@@ -38,7 +40,7 @@ import org.apache.spark.bagel.Bagel._
   val active: Boolean) extends Vertex
 
 @serializable class PRMessage(
-  val targetId: String, val rankShare: Double) extends Message             
+  val targetId: String, val rankShare: Double) extends Message
 {% endhighlight %}
 
 Next, we load a sample graph from a text file as a distributed dataset and package it into `PRVertex` objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.
@@ -114,7 +116,7 @@ Here are the actions and types in the Bagel API. See [Bagel.scala](https://githu
 /*** Full form ***/
 
 Bagel.run(sc, vertices, messages, combiner, aggregator, partitioner, numSplits)(compute)
-// where compute takes (vertex: V, combinedMessages: Option[C], aggregated: Option[A], superstep: Int) 
+// where compute takes (vertex: V, combinedMessages: Option[C], aggregated: Option[A], superstep: Int)
 // and returns (newVertex: V, outMessages: Array[M])
 
 /*** Abbreviated forms ***/
@@ -124,7 +126,7 @@ Bagel.run(sc, vertices, messages, combiner, partitioner, numSplits)(compute)
 // and returns (newVertex: V, outMessages: Array[M])
 
 Bagel.run(sc, vertices, messages, combiner, numSplits)(compute)
-// where compute takes (vertex: V, combinedMessages: Option[C], superstep: Int) 
+// where compute takes (vertex: V, combinedMessages: Option[C], superstep: Int)
 // and returns (newVertex: V, outMessages: Array[M])
 
 Bagel.run(sc, vertices, messages, numSplits)(compute)
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index a551e4306d..8ae5f17e12 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -16,7 +16,7 @@ title: GraphX Programming Guide
 # Overview
 
 GraphX is the new (alpha) Spark API for graphs and graph-parallel
-computation. At a high-level GraphX, extends the Spark
+computation. At a high-level, GraphX extends the Spark
 [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
 introducing the [Resilient Distributed property Graph (RDG)](#property_graph):
 a directed graph with properties attached to each vertex and edge.
@@ -77,12 +77,13 @@ graph-parallel systems while easily expressing the entire analytics pipelines.
 ## GraphX Replaces the Spark Bagel API
 
 Prior to the release of GraphX, graph computation in Spark was expressed using
-Bagel, an implementation of the Pregel API.  GraphX improves upon Bagel by exposing
-a richer property graph API, a more streamlined version of the Pregel abstraction,
-and system optimizations to improve performance and reduce memory
+Bagel, an implementation of the Pregel API.  GraphX improves upon Bagel by
+exposing a richer property graph API, a more streamlined version of the Pregel
+abstraction, and system optimizations to improve performance and reduce memory
 overhead.  While we plan to eventually deprecate Bagel, we will continue to
-support the API and [Bagel programming guide](bagel-programming-guide.html). However,
-we encourage Bagel to explore the new GraphX API and comment on issues that may
+support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
+[Bagel programming guide](bagel-programming-guide.html). However, we encourage
+Bagel users to explore the new GraphX API and comment on issues that may
 complicate the transition from Bagel.
 
 # The Property Graph
@@ -168,4 +169,3 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
 println(userInfoWithPageRank.top(5))
 
 {% endhighlight %}
-
diff --git a/docs/index.md b/docs/index.md
index 7228809738..c11dc38b0e 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -77,7 +77,8 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
   * [Python Programming Guide](python-programming-guide.html): using Spark from Python
 * [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
 * [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
-* [GraphX (Graphs on Spark)](graphx-programming-guide.html): simple graph processing model
+* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model *(superseded by GraphX)*
+* [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs
 
 **API Docs:**
 
@@ -85,6 +86,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
 * [Spark for Python (Epydoc)](api/pyspark/index.html)
 * [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
 * [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
+* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html) *(superseded by GraphX)*
 * [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)
 
 
-- 
cgit v1.2.3


From 3eb83191cb6da8b80f9a4fe30527a28eb1a7bff6 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Fri, 10 Jan 2014 11:37:28 -0800
Subject: Generate GraphX docs

---
 docs/_plugins/copy_api_dirs.rb | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

(limited to 'docs')

diff --git a/docs/_plugins/copy_api_dirs.rb b/docs/_plugins/copy_api_dirs.rb
index 431de909cb..acc6bf0816 100644
--- a/docs/_plugins/copy_api_dirs.rb
+++ b/docs/_plugins/copy_api_dirs.rb
@@ -20,7 +20,7 @@ include FileUtils
 
 if not (ENV['SKIP_API'] == '1' or ENV['SKIP_SCALADOC'] == '1')
   # Build Scaladoc for Java/Scala
-  projects = ["core", "examples", "repl", "bagel", "streaming", "mllib"]
+  projects = ["core", "examples", "repl", "bagel", "graphx", "streaming", "mllib"]
 
   puts "Moving to project root and building scaladoc."
   curr_dir = pwd
-- 
cgit v1.2.3


From 362b9422e45946e94beadd180cd3baa583b6ba23 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Fri, 10 Jan 2014 23:48:32 -0800
Subject: Soften wording about GraphX superseding Bagel

---
 docs/_layouts/global.html       | 4 ++--
 docs/api.md                     | 2 +-
 docs/bagel-programming-guide.md | 2 +-
 docs/index.md                   | 4 ++--
 4 files changed, 6 insertions(+), 6 deletions(-)

(limited to 'docs')

diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 36eb49df14..4287e7141d 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -67,7 +67,7 @@
                                 <li class="divider"></li>
                                 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
                                 <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
-                                <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark, superseded by GraphX)</a></li>
+                                <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
                                 <li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
                             </ul>
                         </li>
@@ -80,7 +80,7 @@
                                 <li class="divider"></li>
                                 <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
                                 <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
-                                <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark, superseded by GraphX)</a></li>
+                                <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
                                 <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
                             </ul>
                         </li>
diff --git a/docs/api.md b/docs/api.md
index 7639e58053..91c8e51d26 100644
--- a/docs/api.md
+++ b/docs/api.md
@@ -8,6 +8,6 @@ Here you can find links to the Scaladoc generated for the Spark sbt subprojects.
 - [Spark](api/core/index.html)
 - [Spark Examples](api/examples/index.html)
 - [Spark Streaming](api/streaming/index.html)
-- [Bagel](api/bagel/index.html) *(superseded by GraphX)*
+- [Bagel](api/bagel/index.html)
 - [GraphX](api/graphx/index.html)
 - [PySpark](api/pyspark/index.html)
diff --git a/docs/bagel-programming-guide.md b/docs/bagel-programming-guide.md
index a1339ec735..cffa55ee95 100644
--- a/docs/bagel-programming-guide.md
+++ b/docs/bagel-programming-guide.md
@@ -3,7 +3,7 @@ layout: global
 title: Bagel Programming Guide
 ---
 
-**Bagel has been superseded by [GraphX](graphx-programming-guide.html) for graph processing. New users should use GraphX instead.**
+**Bagel will soon be superseded by [GraphX](graphx-programming-guide.html); we recommend that new users try GraphX instead.**
 
 Bagel is a Spark implementation of Google's [Pregel](http://portal.acm.org/citation.cfm?id=1807184) graph processing framework. Bagel currently supports basic graph computation, combiners, and aggregators.
 
diff --git a/docs/index.md b/docs/index.md
index c11dc38b0e..debdb33108 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -77,7 +77,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
   * [Python Programming Guide](python-programming-guide.html): using Spark from Python
 * [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
 * [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
-* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model *(superseded by GraphX)*
+* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
 * [GraphX (Graphs on Spark)](graphx-programming-guide.html): Spark's new API for graphs
 
 **API Docs:**
@@ -86,7 +86,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
 * [Spark for Python (Epydoc)](api/pyspark/index.html)
 * [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
 * [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
-* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html) *(superseded by GraphX)*
+* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
 * [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)
 
 
-- 
cgit v1.2.3


From b8a44f12a58c336d3d296382dd53467c9538d1e9 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Fri, 10 Jan 2014 23:52:04 -0800
Subject: More edits.

---
 docs/graphx-programming-guide.md | 231 ++++++++++++++++++++++++++++++++++++---
 docs/img/graphx_figures.pptx     | Bin 1118035 -> 1123365 bytes
 docs/img/property_graph.png      | Bin 79056 -> 225151 bytes
 docs/img/triplet.png             | Bin 0 -> 31489 bytes
 4 files changed, 215 insertions(+), 16 deletions(-)
 create mode 100644 docs/img/triplet.png

(limited to 'docs')
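
Note on the triplet view documented in this patch: the guide describes it as a
join of each edge with the properties of its source and destination vertices.
The sketch below illustrates that join on plain Scala collections rather than
RDDs. The `TripletSketch` object and its `Edge` and `EdgeTriplet` case classes
are hypothetical stand-ins written for this note, not the GraphX classes, and
the use of a default property for missing endpoints is an assumption modeled on
the `defaultUser` argument in the guide's example.

```scala
// Plain-collection sketch of the triplet join. Everything in this object is
// an illustrative stand-in, not part of the GraphX API.
object TripletSketch {
  type VertexId = Long

  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  final case class EdgeTriplet[VD, ED](
      srcId: VertexId, dstId: VertexId, srcAttr: VD, attr: ED, dstAttr: VD)

  // Join each edge with the properties of its two endpoints, falling back
  // to `default` for any endpoint missing from the vertex collection.
  def triplets[VD, ED](
      vertices: Map[VertexId, VD],
      edges: Seq[Edge[ED]],
      default: VD): Seq[EdgeTriplet[VD, ED]] =
    edges.map { e =>
      EdgeTriplet(
        e.srcId,
        e.dstId,
        vertices.getOrElse(e.srcId, default),
        e.attr,
        vertices.getOrElse(e.dstId, default))
    }

  // The collaborator data used in the guide's example.
  val users: Map[VertexId, (String, String)] = Map(
    (3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
    (5L, ("franklin", "prof")), (2L, ("istoica", "prof")))

  val relationships: Seq[Edge[String]] = Seq(
    Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"),
    Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"))

  // One descriptive string per edge, built from the joined properties.
  val facts: Seq[String] =
    triplets(users, relationships, ("John Doe", "Missing"))
      .map(et => et.srcAttr._1 + " is the " + et.attr + " of " + et.dstAttr._1)
}
```

Under these assumptions `TripletSketch.facts` holds one string per edge, the
first being "rxin is the collab of jgonzal". The real triplet view is
distributed and reuses the graph's internal indices, which this sketch does not
attempt to model.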

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 8ae5f17e12..b46cc00d04 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -11,6 +11,7 @@ title: GraphX Programming Guide
        title="GraphX Logo"
        alt="GraphX"
        width="65%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
 # Overview
@@ -42,6 +43,7 @@ magnitude faster than more general *data-parallel* systems.
        title="Data-Parallel vs. Graph-Parallel"
        alt="Data-Parallel vs. Graph-Parallel"
        width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
 However, the same restrictions that enable these substantial performance gains
@@ -56,14 +58,15 @@ movement and duplication and a complicated programming model.
        title="Graph Analytics Pipeline"
        alt="Graph Analytics Pipeline"
        width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
 The goal of the GraphX project is to unify graph-parallel and data-parallel
-computation in one system with a single composable API. This goal is achieved
-through an API that enables users to view data both as a graph and as
-collections (i.e., RDDs) without data movement or duplication and by
-incorporating advances in graph-parallel systems to optimize the execution of
-operations on the graph view.  In preliminary experiments we find that the GraphX
+computation in one system with a single composable API. The GraphX API
+enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication. By
+incorporating recent advances in graph-parallel systems, GraphX is able to optimize
+the execution of graph operations. In preliminary experiments we find that the GraphX
 system is able to achieve performance comparable to state-of-the-art
 graph-parallel systems while easily expressing the entire analytics pipelines.
 
@@ -72,6 +75,7 @@ graph-parallel systems while easily expressing the entire analytics pipelines.
        title="GraphX Performance Comparison"
        alt="GraphX Performance Comparison"
        width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
 ## GraphX Replaces the Spark Bagel API
@@ -86,47 +90,242 @@ support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
 Bagel users to explore the new GraphX API and comment on issues that may
 complicate the transition from Bagel.
 
+# Getting Started
+
+To get started you first need to import Spark and GraphX into your project.  This can be done by
+importing the following:
+
+{% highlight scala %}
+import org.apache.spark._
+import org.apache.spark.graphx._
+{% endhighlight %}
+
+If you are not using the Spark shell you will also need a Spark context.
+
 # The Property Graph
 <a name="property_graph"></a>
 
-<p style="text-align: center;">
-  <img src="img/edge_cut_vs_vertex_cut.png"
-       title="Edge Cut vs. Vertex Cut"
-       alt="Edge Cut vs. Vertex Cut"
-       width="50%" />
-</p>
+The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with
+user-defined objects attached to each vertex and edge.  Like RDDs, property graphs are immutable,
+distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
+a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
+Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
+
+The property graph is parameterized over the vertex `VD` and edge `ED` types.  These are the types
+of the objects associated with each vertex and edge respectively.  In some cases it can be desirable
+to have vertices of different types; this can be accomplished through inheritance.
+
+> GraphX optimizes the representation of `VD` and `ED` when they are plain old data types (e.g.,
+> int, double, etc.), reducing the memory overhead of the graph representation.
+
+Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
+properties for each vertex and edge:
+
+{% highlight scala %}
+class Graph[VD: ClassTag, ED: ClassTag] {
+  val vertices: RDD[(VertexId, VD)]
+  val edges: RDD[Edge[ED]]
+  // ...
+}
+{% endhighlight %}
+> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and
+> `EdgeRDD[ED]` respectively. These types extend and are optimized versions of `RDD[(VertexId, VD)]`
+> and `RDD[Edge[ED]]`.
+
+For example, we might construct a property graph consisting of various collaborators on the GraphX
+project. The vertex property contains the username and occupation and the edge property contains
+a string describing the relationships between the users.
 
 <p style="text-align: center;">
   <img src="img/property_graph.png"
        title="The Property Graph"
        alt="The Property Graph"
        width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
+The resulting graph would have the type signature:
+
+{% highlight scala %}
+val userGraph: Graph[(String, String), String]
+{% endhighlight %}
+
+There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
+generators and these are discussed in more detail in the section on
+[graph builders](#graph_builders).  Probably the most general method is to use the
+[graph singleton](api/graphx/index.html#org.apache.spark.graphx.Graph$).
+For example the following code constructs a graph from a collection of RDDs:
+
+{% highlight scala %}
+// Assume the SparkContext has already been constructed
+val sc: SparkContext
+// Create an RDD for the vertices
+val users: RDD[(VertexId, (String, String))] =
+  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
+                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
+// Create an RDD for edges
+val relationships: RDD[Edge[String]] =
+  sc.parallelize(Array(Edge(3, 7, "collab"), Edge(5, 3, "advisor"),
+                       Edge(2, 5, "colleague"), Edge(5, 7, "pi")))
+// Define a default user in case there are relationships with missing users
+val defaultUser = ("John Doe", "Missing")
+// Build the initial Graph
+val graph = Graph(users, relationships, defaultUser)
+{% endhighlight %}
+
+In the above example we make use of the [`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge)
+case class. Edges have a `srcId` and a `dstId` corresponding to the source and destination vertex
+identifiers. In addition, the `Edge` class contains the `attr` member which contains the edge
+property.
+
+We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices`
+and `graph.edges` members respectively.
+
+{% highlight scala %}
+val graph: Graph[(String, String), String] // Constructed from above
+// Count all users which are postdocs
+graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
+// Count all the edges where src > dst
+graph.edges.filter(e => e.srcId > e.dstId).count
+{% endhighlight %}
+
+> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the
+> Scala `case` expression to deconstruct the tuple.  In contrast, `graph.edges` returns an `RDD`
+> containing `Edge[String]` objects.  We could have also used the case class type constructor as
+> in the following:
+> {% highlight scala %}
+graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count
+{% endhighlight %}
+
+In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
+The triplet view logically joins the vertex and edge properties, yielding an `RDD[EdgeTriplet[VD,
+ED]]` of [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) instances.
+This *join* can be expressed in the following SQL expression:
+
+{% highlight sql %}
+SELECT src.id, dst.id, src.attr, e.attr, dst.attr
+FROM edges AS e LEFT JOIN vertices AS src ON e.srcId = src.id
+                LEFT JOIN vertices AS dst ON e.dstId = dst.id
+{% endhighlight %}
+
+or graphically as:
+
 <p style="text-align: center;">
-  <img src="img/vertex_routing_edge_tables.png"
-       title="RDD Graph Representation"
-       alt="RDD Graph Representation"
+  <img src="img/triplet.png"
+       title="Edge Triplet"
+       alt="Edge Triplet"
        width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
+The [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class extends the
+[`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) class by adding the `srcAttr` and
+`dstAttr` members which contain the source and destination properties respectively. We can use the
+triplet view of a graph to render a collection of strings describing relationships between users.
+
+{% highlight scala %}
+val graph: Graph[(String, String), String] // Constructed from above
+// Use the triplets view to create an RDD of facts.
+val facts: RDD[String] =
+  graph.triplets.map(et => et.srcAttr._1 + " is the " + et.attr + " of " + et.dstAttr._1)
+{% endhighlight %}
 
 # Graph Operators
 
+Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
+have a collection of basic operators that take user-defined functions and produce new graphs with
+transformed properties and structure.
+
+## Property Operators
+
+In direct analogy to the RDD `map` operator, the property
+graph contains the following:
+
+{% highlight scala %}
+class Graph[VD, ED] {
+  def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
+  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
+  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
+}
+{% endhighlight %}
+
+Each of these operators yields a new graph with the vertex or edge properties modified by the
+user-defined `map` function.
+
+> Note that in all cases the graph structure is unaffected.  This is a key feature of these
+> operators which allows the resulting graph to reuse the structural indices and the unaffected
+> properties of the original graph.
+> While `graph.mapVertices(mapUdf)` is logically equivalent to the following, the following
+> does not preserve the structural indices and would not benefit from the substantial system
+> optimizations in GraphX.
+> {% highlight scala %}
+val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))}
+val newGraph = Graph(newVertices, graph.edges)
+{% endhighlight %}
+
+These operators are often used to initialize the graph for a particular computation or project away
+unnecessary properties.  For example, given a graph with the out-degrees as the vertex properties
+(we describe how to construct such a graph later) we initialize it for PageRank:
+
+{% highlight scala %}
+// Given a graph where the vertex property is the out-degree
+val inputGraph: Graph[Int, String]
+// Construct a graph where each edge contains the weight
+// and each vertex is the initial PageRank
+val outputGraph: Graph[Double, Double] =
+  inputGraph.mapTriplets(et => 1.0 / et.srcAttr).mapVertices((id, _) => 1.0)
+{% endhighlight %}
+
+## Structural Operators
+<a name="structural_operators"></a>
+
+
 ## Map Reduce Triplets (mapReduceTriplets)
 <a name="mrTriplets"></a>
 
-# Graph Algorithms
-<a name="graph_algorithms"></a>
 
 # Graph Builders
 <a name="graph_builders"></a>
 
+
+{% highlight scala %}
+val userGraph: Graph[(String, String), String]
+{% endhighlight %}
+
+
+# Optimized Representation
+
+The Property Graph is internally represented as a collection of RDDs
+
+<p style="text-align: center;">
+  <img src="img/edge_cut_vs_vertex_cut.png"
+       title="Edge Cut vs. Vertex Cut"
+       alt="Edge Cut vs. Vertex Cut"
+       width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
+</p>
+
+<p style="text-align: center;">
+  <img src="img/vertex_routing_edge_tables.png"
+       title="RDD Graph Representation"
+       alt="RDD Graph Representation"
+       width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
+</p>
+
+
+
+
+# Graph Algorithms
+<a name="graph_algorithms"></a>
+
+
 <p style="text-align: center;">
   <img src="img/tables_and_graphs.png"
        title="Tables and Graphs"
        alt="Tables and Graphs"
        width="50%" />
+  <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
 # Examples
diff --git a/docs/img/graphx_figures.pptx b/docs/img/graphx_figures.pptx
index c67ddb4876..ea4f82ce82 100644
Binary files a/docs/img/graphx_figures.pptx and b/docs/img/graphx_figures.pptx differ
diff --git a/docs/img/property_graph.png b/docs/img/property_graph.png
index 859d4013fb..6f3f89a010 100644
Binary files a/docs/img/property_graph.png and b/docs/img/property_graph.png differ
diff --git a/docs/img/triplet.png b/docs/img/triplet.png
new file mode 100644
index 0000000000..8b82a09bed
Binary files /dev/null and b/docs/img/triplet.png differ
-- 
cgit v1.2.3


From 0c9d39bbaa2a2b1b3ad6d91d2ffd864635b7f41e Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sat, 11 Jan 2014 00:08:52 -0800
Subject: More organizational changes and dropping the benchmark plot.

---
 docs/graphx-programming-guide.md | 32 ++++++++++++++++++++------------
 1 file changed, 20 insertions(+), 12 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index b46cc00d04..3138286385 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -66,17 +66,7 @@ computation in one system with a single composable API. The GraphX API
 enables users to view data both as a graph and as
 collections (i.e., RDDs) without data movement or duplication. By
 incorporating recent advances in graph-parallel systems, GraphX is able to optimize
-the execution of graph operations. In preliminary experiments we find that the GraphX
-system is able to achieve performance comparable to state-of-the-art
-graph-parallel systems while easily expressing the entire analytics pipelines.
-
-<p style="text-align: center;">
-  <img src="img/graphx_performance_comparison.png"
-       title="GraphX Performance Comparison"
-       alt="GraphX Performance Comparison"
-       width="50%" />
-  <!-- Images are downsized intentionally to improve quality on retina displays -->
-</p>
+the execution of graph operations.
 
 ## GraphX Replaces the Spark Bagel API
 
@@ -279,11 +269,15 @@ val outputGraph: Graph[Double, Double] =
 ## Structural Operators
 <a name="structural_operators"></a>
 
+## Join Operators
+<a name="join_operators"></a>
 
 ## Map Reduce Triplets (mapReduceTriplets)
 <a name="mrTriplets"></a>
 
 
+
+
 # Graph Builders
 <a name="graph_builders"></a>
 
@@ -295,7 +289,8 @@ val userGraph: Graph[(String, String), String]
 
 # Optimized Representation
 
-The Property Graph is internally represented as a collection of RDDs
+This section should give some intuition about how GraphX works and how that affects the user (e.g.,
+things to worry about.)
 
 <p style="text-align: center;">
   <img src="img/edge_cut_vs_vertex_cut.png"
@@ -319,6 +314,19 @@ The Property Graph is internally represented as a collection of RDDs
 # Graph Algorithms
 <a name="graph_algorithms"></a>
 
+This section should describe the various algorithms and how they are used.
+
+## PageRank
+
+## Connected Components
+
+## Shortest Path
+
+## Triangle Counting
+
+## K-Core
+
+## LDA
 
 <p style="text-align: center;">
   <img src="img/tables_and_graphs.png"
-- 
cgit v1.2.3


From 56a245c6bc5147c7f6b11d241bf518c9781cdbe1 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sat, 11 Jan 2014 00:20:54 -0800
Subject: Addressing comment about Graph Processing in docs.

---
 docs/_layouts/global.html | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

(limited to 'docs')

diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index 4287e7141d..c529d89ffd 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -68,7 +68,7 @@
                                 <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
                                 <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
                                 <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
-                                <li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
+                                <li><a href="graphx-programming-guide.html">GraphX (Graph Processing)</a></li>
                             </ul>
                         </li>
 
@@ -81,7 +81,7 @@
                                 <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
                                 <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
                                 <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
-                                <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
+                                <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph Processing)</a></li>
                             </ul>
                         </li>
 
-- 
cgit v1.2.3


From 1f45e4e572130e989cf1f91655c22352ac33b063 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sat, 11 Jan 2014 09:27:00 -0800
Subject: starting structural operator discussion.

---
 docs/graphx-programming-guide.md | 32 ++++++++++++++++++++++++++++++--
 1 file changed, 30 insertions(+), 2 deletions(-)

(limited to 'docs')
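
Reviewer note: the `reverse` and `subgraph` semantics documented in this patch can be sketched with a small in-memory model in plain Scala. This is only an illustration under stated assumptions: `StructuralSketch`, its `Edge` and `Graph` case classes, and the `example` value are hypothetical stand-ins rather than the GraphX classes, and the edge predicate here sees a bare edge instead of an `EdgeTriplet`.

```scala
// Simplified in-memory model of two structural operators.
// All names here are illustrative assumptions, not the GraphX API.
object StructuralSketch {
  final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

  final case class Graph[VD, ED](vertices: Map[Long, VD], edges: Seq[Edge[ED]]) {
    // Flip the direction of every edge; properties are left untouched.
    def reverse: Graph[VD, ED] =
      Graph(vertices, edges.map(e => Edge(e.dstId, e.srcId, e.attr)))

    // Keep vertices that pass vpred, and edges that pass epred
    // and whose two endpoints both survived the vertex filter.
    def subgraph(
        epred: Edge[ED] => Boolean = (_: Edge[ED]) => true,
        vpred: (Long, VD) => Boolean = (_: Long, _: VD) => true): Graph[VD, ED] = {
      val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
      val keptEdges = edges.filter { e =>
        keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId) && epred(e)
      }
      Graph(keptVertices, keptEdges)
    }
  }

  // Example: drop the placeholder vertex 3 and the edge that touches it.
  val example: Graph[String, String] =
    Graph(
      Map(1L -> "alice", 2L -> "bob", 3L -> "Missing"),
      Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"))
    ).subgraph(vpred = (_, name) => name != "Missing")
}
```

Because both predicates have defaults, the vertex predicate has to be passed by name (`vpred = ...`) when the edge predicate is omitted. The same applies to the signature documented in the patch.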

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 3138286385..9a65745930 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -232,11 +232,9 @@ In direct analogy to the RDD `map` operator, the property
 graph contains the following:
 
 {% highlight scala %}
-class Graph[VD, ED] {
   def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
   def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
   def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
-}
 {% endhighlight %}
 
 Each of these operators yields a new graph with the vertex or edge properties modified by the user
@@ -269,6 +267,36 @@ val outputGraph: Graph[Double, Double] =
 ## Structural Operators
 <a name="structural_operators"></a>
 
+Currently GraphX supports only a simple set of commonly used structural operators and we expect to
+add more in the future.  The following is a list of the basic structural operators.
+
+{% highlight scala %}
+  def reverse: Graph[VD, ED]
+
+  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
+    vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
+
+  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
+
+  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
+{% endhighlight %}
+
+The `rerverse` operator returns a new graph with all the edge directions reversed.  This can be
+useful when for example trying to compute the inverse PageRank.
+
+The `subgraph` operator takes vertex and edge predicates and returns the graph containing only the
+vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
+predicate *and connect vertices that satisfy the vertex predicate*.  The `subgraph` operator can be
+used in number of situations to restrict the graph to the vertices and edges of interest or
+eliminate broken links.
+
+The `mask` operators returns the subgraph containing vertices and edges that are found in the input
+graph.  Finish this description ...
+
+The `groupEdges` operator merges ...
+
+
+
 ## Join Operators
 <a name="join_operators"></a>
 
-- 
cgit v1.2.3


From fac44bbe2c10633e371cf30afa17c5e78290ca9c Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sat, 11 Jan 2014 11:28:01 -0800
Subject: Finished documenting structural operators and starting join
 operators.

---
 docs/graphx-programming-guide.md | 90 ++++++++++++++++++++++++++++++++--------
 1 file changed, 72 insertions(+), 18 deletions(-)

(limited to 'docs')
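
Reviewer note: the `mask` and `groupEdges` descriptions added here can be checked against a tiny in-memory model. The sketch below is plain Scala under stated assumptions: `MaskGroupSketch` and its `Edge` and `Graph` case classes are hypothetical, edges are matched by their (source, destination) pair only, and nothing here reflects how GraphX partitions or indexes data.

```scala
// Simplified in-memory model of mask and groupEdges.
// All names here are illustrative assumptions, not the GraphX API.
object MaskGroupSketch {
  final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

  final case class Graph[VD, ED](vertices: Map[Long, VD], edges: Seq[Edge[ED]]) {
    // Keep only the vertices and edges that also occur in `other`;
    // the surviving properties are taken from this graph.
    def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED] = {
      val otherPairs = other.edges.map(e => (e.srcId, e.dstId)).toSet
      Graph(
        vertices.filter { case (id, _) => other.vertices.contains(id) },
        edges.filter(e => otherPairs.contains((e.srcId, e.dstId))))
    }

    // Merge parallel edges (same source and destination) using `merge`.
    def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED] = {
      val merged = edges
        .groupBy(e => (e.srcId, e.dstId))
        .map { case ((src, dst), es) => Edge(src, dst, es.map(_.attr).reduce(merge)) }
        .toSeq
        .sortBy(e => (e.srcId, e.dstId)) // stable output order for the sketch
      Graph(vertices, merged)
    }
  }
}
```

With numeric weights, `groupEdges(_ + _)` collapses each bundle of parallel edges into one edge whose weight is the sum of the originals.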

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 9a65745930..a5e75e2cb0 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -95,8 +95,10 @@ If you are not using the Spark shell you will also need a Spark context.
 # The Property Graph
 <a name="property_graph"></a>
 
-The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with
-user defined objects attached to each vertex and edge.  Like RDDs, property graphs are immutable,
+The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
+graph with user defined objects attached to each vertex and edge.  As a multigraph it is possible
+for multiple edges to have the same source and destination vertex.  This can be useful when there
+are multiple relationships between the same vertices.  Like RDDs, property graphs are immutable,
 distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
 a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
 Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
@@ -106,7 +108,7 @@ of the objects associated with each vertex and edge respectively.  In some cases
 to have vertices of different types.  However, this can be accomplished through inheritance.
 
 > GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
-> int, double, etc...) reducing the memory overhead of the graph representation.
+> int, double, etc...) reducing the in-memory footprint.
 
 Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
 properties for each vertex and edge:
@@ -224,7 +226,22 @@ val facts: RDD[String] =
 
 Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
 have a collection of basic operators that take user defined function and produce new graphs with
-transformed properties and structure.
+transformed properties and structure.  The core operators that have optimized implementations are
+defined in [`Graph.scala`](api/graphx/index.html#org.apache.spark.graphx.Graph) and convenient
+operators that are expressed as compositions of the core operators are defined in
+[`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps).  However, thanks to
+Scala implicits the operators in `GraphOps.scala` are automatically available as members of
+`Graph`.  For example, we can compute the in-degree of each vertex (defined in
+`GraphOps.scala`) by the following:
+
+{% highlight scala %}
+val graph: Graph[(String, String), String]
+// Use the implicit GraphOps.inDegrees operator
+val inDegrees: VertexRDD[Int] = graph.inDegrees
+{% endhighlight %}
+
+The reason for differentiating between core graph operations and GraphOps is to be able to support
+various graph representations in the future.
 
 ## Property Operators
 
@@ -232,9 +249,9 @@ In direct analogy to the RDD `map` operator, the property
 graph contains the following:
 
 {% highlight scala %}
-  def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
-  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
-  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
+def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
+def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
+def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
 {% endhighlight %}
 
 Each of these operators yields a new graph with the vertex or edge properties modified by the user
@@ -271,35 +288,72 @@ Currently GraphX supports only a simple set of commonly used structural operator
 add more in the future.  The following is a list of the basic structural operators.
 
 {% highlight scala %}
-  def reverse: Graph[VD, ED]
+def reverse: Graph[VD, ED]
 
-  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
-    vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
+def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
+  vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
 
-  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
+def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
 
-  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
+def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
 {% endhighlight %}
 
-The `rerverse` operator returns a new graph with all the edge directions reversed.  This can be
-useful when for example trying to compute the inverse PageRank.
+The `reverse` operator returns a new graph with all the edge directions reversed.  This can be
+useful when, for example, trying to compute the inverse PageRank.  Because the reverse operation
+does not modify vertex or edge properties or change the number of edges, it can be implemented
+efficiently without data-movement or duplication.
 
 The `subgraph` operator takes vertex and edge predicates and returns the graph containing only the
 vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
 predicate *and connect vertices that satisfy the vertex predicate*.  The `subgraph` operator can be
 used in number of situations to restrict the graph to the vertices and edges of interest or
-eliminate broken links.
+eliminate broken links.  For example, in the following code we remove broken links:
 
-The `mask` operators returns the subgraph containing vertices and edges that are found in the input
-graph.  Finish this description ...
+{% highlight scala %}
+val users: RDD[(VertexId, (String, String))]
+val relationships: RDD[Edge[String]]
+// Define a default user in case there are relationships with missing users
+val defaultUser = ("John Doe", "Missing")
+// Build the initial Graph
+val graph = Graph(users, relationships, defaultUser)
+// Remove missing vertices as well as the edges connected to them
+val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
+{% endhighlight %}
 
-The `groupEdges` operator merges ...
+The `mask` operator returns the subgraph containing only the vertices and edges that are also found
+in the input graph.  This can be used in conjunction with the `subgraph` operator to restrict a graph
+based on the properties in another related graph.  For example, we might run connected components
+using the graph with missing vertices and then restrict the answer to the valid subgraph.
 
+{% highlight scala %}
+// Run Connected Components
+val ccGraph = graph.connectedComponents()
+// Remove missing vertices as well as the edges connected to them
+val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
+// Restrict the answer to the valid subgraph
+val validCCGraph = ccGraph.mask(validGraph)
+{% endhighlight %}
 
+The `groupEdges` operator merges parallel edges: duplicate edges between pairs of vertices.  In many
+numerical applications parallel edges can be *added* (their weights combined) into a single edge
+thereby reducing the graph size in memory as well as the cost of computation.
 
 ## Join Operators
 <a name="join_operators"></a>
 
+The ability to move between graph and collection views of data is a key part of GraphX.  In many
+cases it is necessary to bring data from external collections into the graph.  For example, we might
+have extra user properties that we want to merge with an existing graph or we might want to pull
+vertex properties from one graph into another.  These tasks can be accomplished using the *join*
+operators.  Below we list the key join operators:
+
+{% highlight scala %}
+def joinVertices[U](table: RDD[(VertexID, U)])(mapFunc: (VertexID, VD, U) => VD)
+  : Graph[VD, ED]
+def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(mapFunc: (VertexID, VD, Option[U]) => VD2)
+  : Graph[VD2, ED]
+{% endhighlight %}
+
 ## Map Reduce Triplets (mapReduceTriplets)
 <a name="mrTriplets"></a>
 
-- 
cgit v1.2.3


From 732333d78e46ee23025d81ca9fbe6d1e13e9f253 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sat, 11 Jan 2014 11:49:21 -0800
Subject: Remove GraphLab

---
 docs/graphx-programming-guide.md                   |  13 +-
 .../scala/org/apache/spark/graphx/GraphLab.scala   | 138 ---------------------
 .../scala/org/apache/spark/graphx/Pregel.scala     |   9 +-
 .../graphx/lib/StronglyConnectedComponents.scala   |  50 ++++----
 4 files changed, 40 insertions(+), 170 deletions(-)
 delete mode 100644 graphx/src/main/scala/org/apache/spark/graphx/GraphLab.scala

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index a5e75e2cb0..b19c6b69de 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -18,13 +18,12 @@ title: GraphX Programming Guide
 
 GraphX is the new (alpha) Spark API for graphs and graph-parallel
 computation. At a high-level, GraphX extends the Spark
-[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
-introducing the [Resilient Distributed property Graph (RDG)](#property_graph):
-a directed graph with properties attached to each vertex and edge.
-To support graph computation, GraphX exposes a set of functions
-(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
-[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
-APIs. In addition, GraphX includes a growing collection of graph
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
+[Resilient Distributed property Graph (RDG)](#property_graph): a directed graph
+with properties attached to each vertex and edge.  To support graph computation,
+GraphX exposes a set of functions (e.g., [mapReduceTriplets](#mrTriplets)) as
+well as an optimized variant of the [Pregel](http://giraph.apache.org) API. In
+addition, GraphX includes a growing collection of graph
 [algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
 graph analytics tasks.
 
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphLab.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphLab.scala
deleted file mode 100644
index 2f828ad807..0000000000
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphLab.scala
+++ /dev/null
@@ -1,138 +0,0 @@
-package org.apache.spark.graphx
-
-import scala.reflect.ClassTag
-
-import org.apache.spark.Logging
-import scala.collection.JavaConversions._
-import org.apache.spark.rdd.RDD
-
-/**
- * Implements the GraphLab gather-apply-scatter API.
- */
-object GraphLab extends Logging {
-
-  /**
-   * Executes the GraphLab Gather-Apply-Scatter API.
-   *
-   * @param graph the graph on which to execute the GraphLab API
-   * @param gatherFunc executed on each edge triplet
-   *                   adjacent to a vertex. Returns an accumulator which
-   *                   is then merged using the merge function.
-   * @param mergeFunc an accumulative associative operation on the result of
-   *                  the gather type.
-   * @param applyFunc takes a vertex and the final result of the merge operations
-   *                  on the adjacent edges and returns a new vertex value.
-   * @param scatterFunc executed after the apply function. Takes
-   *                    a triplet and signals whether the neighboring vertex program
-   *                    must be recomputed.
-   * @param startVertices a predicate to determine which vertices to start the computation on.
-   *                      These will be the active vertices in the first iteration.
-   * @param numIter the maximum number of iterations to run
-   * @param gatherDirection the direction of edges to consider during the gather phase
-   * @param scatterDirection the direction of edges to consider during the scatter phase
-   *
-   * @tparam VD the graph vertex attribute type
-   * @tparam ED the graph edge attribute type
-   * @tparam A the type accumulated during the gather phase
-   * @return the resulting graph after the algorithm converges
-   *
-   * @note Unlike [[Pregel]], this implementation of [[GraphLab]] does not unpersist RDDs from
-   * previous iterations. As a result, long-running iterative GraphLab programs will eventually fill
-   * the Spark cache. Though Spark will evict RDDs from old iterations eventually, garbage
-   * collection will take longer than necessary since it must examine the entire cache. This will be
-   * fixed in a future update.
-   */
-  def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
-    (graph: Graph[VD, ED], numIter: Int,
-     gatherDirection: EdgeDirection = EdgeDirection.In,
-     scatterDirection: EdgeDirection = EdgeDirection.Out)
-    (gatherFunc: (VertexID, EdgeTriplet[VD, ED]) => A,
-     mergeFunc: (A, A) => A,
-     applyFunc: (VertexID, VD, Option[A]) => VD,
-     scatterFunc: (VertexID, EdgeTriplet[VD, ED]) => Boolean,
-     startVertices: (VertexID, VD) => Boolean = (vid: VertexID, data: VD) => true)
-    : Graph[VD, ED] = {
-
-
-    // Add an active attribute to all vertices to track convergence.
-    var activeGraph: Graph[(Boolean, VD), ED] = graph.mapVertices {
-      case (id, data) => (startVertices(id, data), data)
-    }.cache()
-
-    // The gather function wrapper strips the active attribute and
-    // only invokes the gather function on active vertices
-    def gather(vid: VertexID, e: EdgeTriplet[(Boolean, VD), ED]): Option[A] = {
-      if (e.vertexAttr(vid)._1) {
-        val edgeTriplet = new EdgeTriplet[VD,ED]
-        edgeTriplet.set(e)
-        edgeTriplet.srcAttr = e.srcAttr._2
-        edgeTriplet.dstAttr = e.dstAttr._2
-        Some(gatherFunc(vid, edgeTriplet))
-      } else {
-        None
-      }
-    }
-
-    // The apply function wrapper strips the vertex of the active attribute
-    // and only invokes the apply function on active vertices
-    def apply(vid: VertexID, data: (Boolean, VD), accum: Option[A]): (Boolean, VD) = {
-      val (active, vData) = data
-      if (active) (true, applyFunc(vid, vData, accum))
-      else (false, vData)
-    }
-
-    // The scatter function wrapper strips the vertex of the active attribute
-    // and only invokes the scatter function on active vertices
-    def scatter(rawVertexID: VertexID, e: EdgeTriplet[(Boolean, VD), ED]): Option[Boolean] = {
-      val vid = e.otherVertexId(rawVertexID)
-      if (e.vertexAttr(vid)._1) {
-        val edgeTriplet = new EdgeTriplet[VD,ED]
-        edgeTriplet.set(e)
-        edgeTriplet.srcAttr = e.srcAttr._2
-        edgeTriplet.dstAttr = e.dstAttr._2
-        Some(scatterFunc(vid, edgeTriplet))
-      } else {
-        None
-      }
-    }
-
-    // Used to set the active status of vertices for the next round
-    def applyActive(
-        vid: VertexID, data: (Boolean, VD), newActiveOpt: Option[Boolean]): (Boolean, VD) = {
-      val (prevActive, vData) = data
-      (newActiveOpt.getOrElse(false), vData)
-    }
-
-    // Main Loop ---------------------------------------------------------------------
-    var i = 0
-    var numActive = activeGraph.numVertices
-    var prevActiveGraph: Graph[(Boolean, VD), ED] = null
-    while (i < numIter && numActive > 0) {
-
-      // Gather
-      val gathered: RDD[(VertexID, A)] =
-        activeGraph.aggregateNeighbors(gather, mergeFunc, gatherDirection)
-
-      // Apply
-      val applied = activeGraph.outerJoinVertices(gathered)(apply).cache()
-
-      // Scatter is basically a gather in the opposite direction so we reverse the edge direction
-      val scattered: RDD[(VertexID, Boolean)] =
-        applied.aggregateNeighbors(scatter, _ || _, scatterDirection.reverse)
-
-      prevActiveGraph = activeGraph
-      activeGraph = applied.outerJoinVertices(scattered)(applyActive).cache()
-
-      // Calculate the number of active vertices.
-      numActive = activeGraph.vertices.map{
-        case (vid, data) => if (data._1) 1 else 0
-        }.reduce(_ + _)
-      logInfo("Number active vertices: " + numActive)
-
-      i += 1
-    }
-
-    // Remove the active attribute from the vertex data before returning the graph
-    activeGraph.mapVertices{case (vid, data) => data._2 }
-  }
-}
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
index 2e6453484c..57b087213f 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
@@ -65,6 +65,10 @@ object Pregel {
    *
    * @param maxIterations the maximum number of iterations to run for
    *
+   * @param activeDirection the direction of edges incident to a vertex that received a message in
+   * the previous round on which to run `sendMsg`. For example, if this is `EdgeDirection.Out`, only
+   * out-edges of vertices that received a message in the previous round will run.
+   *
    * @param vprog the user-defined vertex program which runs on each
    * vertex and receives the inbound message and computes a new vertex
    * value.  On the first iteration the vertex program is invoked on
@@ -85,7 +89,8 @@ object Pregel {
    *
    */
   def apply[VD: ClassTag, ED: ClassTag, A: ClassTag]
-    (graph: Graph[VD, ED], initialMsg: A, maxIterations: Int = Int.MaxValue)(
+    (graph: Graph[VD, ED], initialMsg: A, maxIterations: Int = Int.MaxValue,
+      activeDirection: EdgeDirection = EdgeDirection.Out)(
       vprog: (VertexID, VD, A) => VD,
       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID,A)],
       mergeMsg: (A, A) => A)
@@ -110,7 +115,7 @@ object Pregel {
       // Send new messages. Vertices that didn't get any messages don't appear in newVerts, so don't
       // get to send messages. We must cache messages so it can be materialized on the next line,
       // allowing us to uncache the previous iteration.
-      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, EdgeDirection.Out))).cache()
+      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDirection))).cache()
       // The call to count() materializes `messages`, `newVerts`, and the vertices of `g`. This
       // hides oldMessages (depended on by newVerts), newVerts (depended on by messages), and the
       // vertices of prevG (depended on by newVerts, oldMessages, and the vertices of g).
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
index 9bd227309a..43c4b9cf2d 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/StronglyConnectedComponents.scala
@@ -53,34 +53,38 @@ object StronglyConnectedComponents {
 
       // collect min of all my neighbor's scc values, update if it's smaller than mine
       // then notify any neighbors with scc values larger than mine
-      sccWorkGraph = GraphLab[(VertexID, Boolean), ED, VertexID](sccWorkGraph, Integer.MAX_VALUE)(
-        (vid, e) => e.otherVertexAttr(vid)._1,
-        (vid1, vid2) => math.min(vid1, vid2),
-        (vid, scc, optScc) =>
-          (math.min(scc._1, optScc.getOrElse(scc._1)), scc._2),
-        (vid, e) => e.vertexAttr(vid)._1 < e.otherVertexAttr(vid)._1
-      )
+      sccWorkGraph = Pregel[(VertexID, Boolean), ED, VertexID](sccWorkGraph, Long.MaxValue)(
+        (vid, myScc, neighborScc) => (math.min(myScc._1, neighborScc), myScc._2),
+        e => {
+          if (e.srcId < e.dstId) {
+            Iterator((e.dstId, e.srcAttr._1))
+          } else {
+            Iterator()
+          }
+        },
+        (vid1, vid2) => math.min(vid1, vid2))
 
       // start at root of SCCs. Traverse values in reverse, notify all my neighbors
       // do not propagate if colors do not match!
-      sccWorkGraph = GraphLab[(VertexID, Boolean), ED, Boolean](
-        sccWorkGraph,
-        Integer.MAX_VALUE,
-        EdgeDirection.Out,
-        EdgeDirection.In
-      )(
+      sccWorkGraph = Pregel[(VertexID, Boolean), ED, Boolean](
+        sccWorkGraph, false, activeDirection = EdgeDirection.In)(
         // vertex is final if it is the root of a color
         // or it has the same color as a neighbor that is final
-        (vid, e) => (vid == e.vertexAttr(vid)._1) || (e.vertexAttr(vid)._1 == e.otherVertexAttr(vid)._1),
-        (final1, final2) => final1 || final2,
-        (vid, scc, optFinal) =>
-          (scc._1, scc._2 || optFinal.getOrElse(false)),
-       // activate neighbor if they are not final, you are, and you have the same color
-        (vid, e) => e.vertexAttr(vid)._2 &&
-            !e.otherVertexAttr(vid)._2 && (e.vertexAttr(vid)._1 == e.otherVertexAttr(vid)._1),
-        // start at root of colors
-        (vid, data) => vid == data._1
-      )
+        (vid, myScc, existsSameColorFinalNeighbor) => {
+          val isColorRoot = vid == myScc._1
+          (myScc._1, myScc._2 || isColorRoot || existsSameColorFinalNeighbor)
+        },
+        // activate neighbor if they are not final, you are, and you have the same color
+        e => {
+          val sameColor = e.dstAttr._1 == e.srcAttr._1
+          val onlyDstIsFinal = e.dstAttr._2 && !e.srcAttr._2
+          if (sameColor && onlyDstIsFinal) {
+            Iterator((e.srcId, e.dstAttr._2))
+          } else {
+            Iterator()
+          }
+        },
+        (final1, final2) => final1 || final2)
     }
     sccGraph
   }
-- 
cgit v1.2.3


From 64c4593586233409ff2c41607e7df33f3f13eb0a Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sat, 11 Jan 2014 13:48:24 -0800
Subject: Finished documenting join operators and revised some of the initial
 presentation.

---
 docs/graphx-programming-guide.md | 119 +++++++++++++++++++++++++++------------
 docs/img/graphx_figures.pptx     | Bin 1123365 -> 1123363 bytes
 2 files changed, 82 insertions(+), 37 deletions(-)

(limited to 'docs')
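
Reviewer note: the difference between `joinVertices` and `outerJoinVertices`, and the advice to make the input unique first, can be illustrated on ordinary Scala maps. This is a sketch under stated assumptions: `JoinSketch`, `sumByVertex`, and the `mapFunc` parameter name are hypothetical, vertex ids are plain `Long` values, and no RDD or index is involved.

```scala
// Simplified in-memory model of the two join operators.
// All names here are illustrative assumptions, not the GraphX API.
object JoinSketch {
  // Collapse duplicate (id, cost) pairs so each vertex has a single value.
  def sumByVertex(pairs: Seq[(Long, Double)]): Map[Long, Double] =
    pairs.groupBy(_._1).map { case (id, vs) => id -> vs.map(_._2).sum }

  // Inner-style join: vertices without a match keep their original value,
  // and the vertex property type cannot change.
  def joinVertices[VD, U](vertices: Map[Long, VD], table: Map[Long, U])(
      mapFunc: (Long, VD, U) => VD): Map[Long, VD] =
    vertices.map { case (id, attr) =>
      id -> table.get(id).map(u => mapFunc(id, attr, u)).getOrElse(attr)
    }

  // Outer join: mapFunc runs on every vertex, sees an Option for the
  // joined value, and may produce a new property type.
  def outerJoinVertices[VD, U, VD2](vertices: Map[Long, VD], table: Map[Long, U])(
      mapFunc: (Long, VD, Option[U]) => VD2): Map[Long, VD2] =
    vertices.map { case (id, attr) => id -> mapFunc(id, attr, table.get(id)) }
}
```

A typical use is to call `sumByVertex` on the raw pairs, then `joinVertices` to add the extra cost where one exists, or `outerJoinVertices` when unmatched vertices need a default or a different property type.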

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index b19c6b69de..5c9f1967cc 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -16,16 +16,14 @@ title: GraphX Programming Guide
 
 # Overview
 
-GraphX is the new (alpha) Spark API for graphs and graph-parallel
-computation. At a high-level, GraphX extends the Spark
-[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
-[Resilient Distributed property Graph (RDG)](#property_graph): a directed graph
-with properties attached to each vertex and edge.  To support graph computation,
-GraphX exposes a set of functions (e.g., [mapReduceTriplets](#mrTriplets)) as
-well as an optimized variant of the [Pregel](http://giraph.apache.org) API. In
-addition, GraphX includes a growing collection of graph
-[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
-graph analytics tasks.
+GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high level,
+GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
+[Resilient Distributed property Graph (RDG)](#property_graph): a directed multigraph with properties
+attached to each vertex and edge.  To support graph computation, GraphX exposes a set of functions
+(e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
+[mapReduceTriplets](#mrTriplets)) as well as an optimized variant of the
+[Pregel](#pregel) API. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify graph analytics tasks.
 
 ## Background on Graph-Parallel Computation
 
@@ -60,12 +58,10 @@ movement and duplication and a complicated programming model.
   <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
-The goal of the GraphX project is to unify graph-parallel and data-parallel
-computation in one system with a single composable API. The GraphX API
-enables users to view data both as a graph and as
-collection (i.e., RDDs) without data movement or duplication. By
-incorporating recent advances in graph-parallel systems, GraphX is able to optimize
-the execution of graph operations.
+The goal of the GraphX project is to unify graph-parallel and data-parallel computation in one
+system with a single composable API. The GraphX API enables users to view data both as a graph and
+as collections (i.e., RDDs) without data movement or duplication. By incorporating recent advances
+in graph-parallel systems, GraphX is able to optimize the execution of graph operations.
 
 ## GraphX Replaces the Spark Bagel API
 
@@ -95,12 +91,16 @@ If you are not using the Spark shell you will also need a Spark context.
 <a name="property_graph"></a>
 
 The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
-graph with user defined objects attached to each vertex and edge.  As a multigraph it is possible
-for multiple edges to have the same source and destination vertex.  This can be useful when there
-are multiple relationships between the same vertices.  Like RDDs, property graphs are immutable,
-distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
-a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
-Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
+with user defined objects attached to each vertex and edge.  A directed multigraph is a
+directed graph with potentially multiple parallel edges sharing the same source and destination
+vertex.  The ability to support parallel edges simplifies modeling scenarios where there can be
+multiple relationships (e.g., co-worker and friend) between the same vertices.  Note, however, that
+there can only be one instance of each vertex.
+
+Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Vertices are keyed by
+their vertex identifier (`VertexId`) which is a unique 64-bit long. Similarly, edges have
+corresponding source and destination vertex identifiers. GraphX does not impose any ordering or
+constraints on the vertex identifiers.
 
 The property graph is parameterized over the vertex `VD` and edge `ED` types.  These are the types
 of the objects associated with each vertex and edge respectively.  In some cases it can be desirable
@@ -119,9 +119,12 @@ class Graph[VD: ClassTag, ED: ClassTag] {
   // ...
 }
 {% endhighlight %}
+
 > Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and
-> `EdgeRDD[ED]` respectively. These types extend and are optimized versions of `RDD[(VertexId, VD)]`
-> and `RDD[Edge[ED]]`.
+> `EdgeRDD[ED]` respectively. These classes extend and are optimized versions of `RDD[(VertexId,
+> VD)]` and `RDD[Edge[ED]]` with additional functionality built around the internal index and
+> column-oriented representations.  We discuss the `VertexRDD` and `EdgeRDD` API in greater detail in
+> the section on [vertex and edge RDDs](#vertex_and_edge_rdds).
 
 For example, we might construct a property graph consisting of various collaborators on the GraphX
 project. The vertex property contains the username and occupation and the edge property contains
@@ -259,7 +262,7 @@ defined `map` function.
 > Note that in all cases the graph structure is unaffected.  This is a key feature of these
 > operators which allows the resulting graph to reuse the structural indicies and the unaffected
 > properties of the original graph.
-> While `graph.mapVertices(mapUDF)` is logically equivalent to the following, the following
+> While the following is logically equivalent to `graph.mapVertices(mapUDF)`, it
 > does not preserve the structural indicies and would not benefit from the substantial system
 > optimizations in GraphX.
 > {% highlight scala %}
@@ -340,32 +343,74 @@ thereby reducing the graph size in memory as well as the cost of computation.
 ## Join Operators
 <a name="join_operators"></a>
 
-The ability to move between graph and collection views of data is a key part of GraphX.  In many
-cases it is necessary to bring data from external collections into the graph.  For example, we might
-have extra user properties that we want to merge with an existing graph or we might want to pull
-vertex properties from one graph into another.  These tasks can be accomplished using the *join*
-operators.  Below we list the key join operators:
+The ability to move between graph and collection views is a key part of GraphX.  In many cases it is
+necessary to join data from external collections (RDDs) with graphs.  For example, we might have
+extra user properties that we want to merge with an existing graph or we might want to pull vertex
+properties from one graph into another.  These tasks can be accomplished using the *join* operators.
+Below we list the key join operators:
 
 {% highlight scala %}
-def joinVertices[U](table: RDD[(VertexID, U)])(mapFunc: (VertexID, VD, U) => VD)
+def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD)
   : Graph[VD, ED]
-def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(mapFunc: (VertexID, VD, Option[U]) => VD2)
+def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Option[U]) => VD2)
   : Graph[VD2, ED]
 {% endhighlight %}
 
-## Map Reduce Triplets (mapReduceTriplets)
-<a name="mrTriplets"></a>
+The `joinVertices` operators, defined in
+[`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps), joins the vertices with
+the input RDD and returns a new graph with the vertex properties obtained by applying the user
+defined `map` function to the result of the joined vertices.  Vertices without a matching value in
+the RDD retain their original value.
 
+> Note that if the RDD contains more than one value for a given vertex only one will be used.   It
+> is therefore recommended that the input RDD be first made unique using the following which will
+> also *pre-index* the resulting values to substantially accelerate the subsequent join.
+> {% highlight scala %}
+val nonUniqueCosts: RDD[(VertexId, Double)]
+val uniqueCosts: VertexRDD[Double] =
+  graph.vertices.aggregateUsingIndex(nonUniqueCosts, (a,b) => a + b)
+val joinedGraph = graph.joinVertices(uniqueCosts)(
+  (id, oldCost, extraCost) => oldCost + extraCost)
+{% endhighlight %}
 
+The more general `outerJoinVertices` behaves similarly to `joinVertices` except that the user
+defined `map` function is applied to all vertices and can change the vertex property type.  Because
+not all vertices may have a matching value in the input RDD the `map` function takes an `Option`
+type.  For example, we can set up a graph for PageRank by initializing vertex properties with their
+`outDegree`.
 
+{% highlight scala %}
+val outDegrees: VertexRDD[Int] = graph.outDegrees
+val degreeGraph = graph.outerJoinVertices(outDegrees) { (id, oldAttr, outDegOpt) =>
+  outDegOpt match {
+    case Some(outDeg) => outDeg
+    case None => 0 // No outDegree means zero outDegree
+  }
+}
+{% endhighlight %}
+
+> You may have noticed the curried function pattern with multiple parameter lists (e.g.,
+> `f(a)(b)`) used in the above examples.  While we could have equally written `f(a)(b)` as `f(a,b)`
+> this would mean that type inference on `b` would not depend on `a`.  As a consequence, the user
+> would need to provide type annotations for the user defined function:
+> {% highlight scala %}
+val joinedGraph = graph.joinVertices(uniqueCosts,
+  (id: VertexId, oldCost: Double, extraCost: Double) => oldCost + extraCost)
+{% endhighlight %}
+
+
+## Map Reduce Triplets (mapReduceTriplets)
+<a name="mrTriplets"></a>
+
+# Pregel API
+<a name="pregel"></a>
 
 # Graph Builders
 <a name="graph_builders"></a>
 
+# Vertex and Edge RDDs
+<a name="vertex_and_edge_rdds"></a>
 
-{% highlight scala %}
-val userGraph: Graph[(String, String), String]
-{% endhighlight %}
 
 
 # Optimized Representation
diff --git a/docs/img/graphx_figures.pptx b/docs/img/graphx_figures.pptx
index ea4f82ce82..e567bf08fe 100644
Binary files a/docs/img/graphx_figures.pptx and b/docs/img/graphx_figures.pptx differ
-- 
cgit v1.2.3


From cf57b1b0555b89953f1eb2a2d9819e20fcd17708 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sat, 11 Jan 2014 17:13:10 -0800
Subject: Correcting typos in documentation.

---
 docs/graphx-programming-guide.md | 145 +++++++++++++++++++++------------------
 1 file changed, 79 insertions(+), 66 deletions(-)
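
Note on the join operators whose description this patch rewords: the difference
between `joinVertices` and `outerJoinVertices` can be modelled on plain Scala
collections. The sketch below is not the GraphX implementation; it uses `Map`
in place of RDDs, and the `VertexId` alias and the object name are assumptions
made for illustration only.

```scala
object JoinSemanticsSketch {
  // Stand-in alias for illustration; GraphX defines its own vertex id type.
  type VertexId = Long

  // Models joinVertices: only vertices with a match in `table` are updated,
  // and the vertex property type stays the same.
  def joinVertices[VD, U](vertices: Map[VertexId, VD], table: Map[VertexId, U])(
      map: (VertexId, VD, U) => VD): Map[VertexId, VD] =
    vertices.map { case (id, attr) =>
      id -> table.get(id).map(u => map(id, attr, u)).getOrElse(attr)
    }

  // Models outerJoinVertices: `map` runs on every vertex, receives an Option,
  // and may change the vertex property type.
  def outerJoinVertices[VD, U, VD2](vertices: Map[VertexId, VD], table: Map[VertexId, U])(
      map: (VertexId, VD, Option[U]) => VD2): Map[VertexId, VD2] =
    vertices.map { case (id, attr) => id -> map(id, attr, table.get(id)) }

  def main(args: Array[String]): Unit = {
    val costs = Map(1L -> 1.0, 2L -> 2.0)
    val extra = Map(1L -> 0.5)
    // Vertex 2 has no match and keeps its original value.
    println(joinVertices(costs, extra)((_, old, add) => old + add))
    // Every vertex is visited; the result type changes to Boolean.
    println(outerJoinVertices(costs, extra)((_, _, opt) => opt.isDefined))
  }
}
```

The second parameter list lets the compiler infer the lambda's argument types
from the collections passed in the first list, which is the same reason the
guide gives for the curried signatures.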

(limited to 'docs')
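
Note on the inheritance example added by this patch: as written, the
`VertexProperty` snippet does not appear to compile, because the constructor
parameters follow the `extends` clause and the base case class has no parameter
list. A minimal sketch of one way to express the same idea is given below. The
use of a sealed trait, the `describe` helper and the demo object are
assumptions for illustration, not part of the guide.

```scala
// Hypothetical, compilable variant of the VertexProperty example.
// Type names mirror the patch; the sealed trait is an assumption.
sealed trait VertexProperty
final case class UserProperty(name: String) extends VertexProperty
final case class ProductProperty(name: String, price: Double) extends VertexProperty

object VertexPropertyDemo {
  // Pattern matching recovers the concrete property type of a vertex.
  def describe(p: VertexProperty): String = p match {
    case UserProperty(name)           => s"user $name"
    case ProductProperty(name, price) => s"product $name at $price"
  }

  def main(args: Array[String]): Unit = {
    val props: Seq[VertexProperty] =
      Seq(UserProperty("alice"), ProductProperty("widget", 9.5))
    props.map(describe).foreach(println)
  }
}
```

With this shape the graph could still be typed as `Graph[VertexProperty, String]`,
as in the guide, while each vertex carries either a user or a product property.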

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 5c9f1967cc..9a7c4ac179 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -19,11 +19,11 @@ title: GraphX Programming Guide
 GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high-level,
 GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
 [Resilient Distributed property Graph (RDG)](#property_graph): a directed multigraph with properties
-attached to each vertex and edge.  To support graph computation, GraphX exposes a set of functions
-(e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
-[mapReduceTriplets](#mrTriplets)) as well as an optimized variant of the
-[Pregel](#pregel) API. In addition, GraphX includes a growing collection of graph
-[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify graph analytics tasks.
+attached to each vertex and edge.  To support graph computation, GraphX exposes a set of fundamental
+operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
+[mapReduceTriplets](#mrTriplets)) as well as an optimized variant of the [Pregel](#pregel) API. In
+addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
+[builders](#graph_builders) to simplify graph analytics tasks.
 
 ## Background on Graph-Parallel Computation
 
@@ -65,15 +65,13 @@ in graph-parallel systems, GraphX is able to optimize the execution of graph ope
 
 ## GraphX Replaces the Spark Bagel API
 
-Prior to the release of GraphX, graph computation in Spark was expressed using
-Bagel, an implementation of the Pregel API.  GraphX improves upon Bagel by
-exposing a richer property graph API, a more streamlined version of the Pregel
-abstraction, and system optimizations to improve performance and reduce memory
-overhead.  While we plan to eventually deprecate the Bagel, we will continue to
-support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
-[Bagel programming guide](bagel-programming-guide.html). However, we encourage
-Bagel users to explore the new GraphX API and comment on issues that may
-complicate the transition from Bagel.
+Prior to the release of GraphX, graph computation in Spark was expressed using Bagel, an
+implementation of Pregel.  GraphX improves upon Bagel by exposing a richer property graph API, a
+more streamlined version of the Pregel abstraction, and system optimizations to improve performance
+and reduce memory overhead.  While we plan to eventually deprecate the Bagel, we will continue to
+support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and [Bagel programming
+guide](bagel-programming-guide.html). However, we encourage Bagel users to explore the new GraphX
+API and comment on issues that may complicate the transition from Bagel.
 
 # Getting Started
 
@@ -94,41 +92,55 @@ The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a d
 graph with user defined objects attached to each vertex and edge.  A directed multigraph is a
 directed graph with potentially multiple parallel edges sharing the same source and destination
 vertex.  The ability to support parallel edges simplifies modeling scenarios where there can be
-multiple relationships (e.g., co-worker and friend) between the same vertices.  Note, however there
-can only be one instance of each vertex.
-
-Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Vertices are keyed by
-their vertex identifier (`VertexId`) which is a unique 64-bit long. Similarly, edges have
-corresponding source and destination vertex identifiers. GraphX does not impose any ordering or
-constraints on the vertex identifiers.
-
-The property graph is parameterized over the vertex `VD` and edge `ED` types.  These are the types
-of the objects associated with each vertex and edge respectively.  In some cases it can be desirable
-to have vertices of different types.  However, this can be accomplished through inheritance.
+multiple relationships (e.g., co-worker and friend) between the same vertices.  Each vertex is keyed
+by a *unique* 64-bit long identifier (`VertexId`).  Similarly, edges have corresponding source and
+destination vertex identifiers. GraphX does not impose any ordering or constraints on the vertex
+identifiers.  The property graph is parameterized over the vertex `VD` and edge `ED` types.  These
+are the types of the objects associated with each vertex and edge respectively.
 
 > GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
 > int, double, etc...) reducing the in memory footprint.
 
+In some cases we may wish to have vertices with different property types in the same graph. This can
+be accomplished through inheritance.  For example to model users and products as a bipartie graph we
+might do the following:
+
+{% highlight scala %}
+case class VertexProperty
+case class UserProperty extends VertexProperty
+  (val name: String)
+case class ProductProperty extends VertexProperty
+  (val name: String, val price: Double)
+// The graph might then have the type:
+val graph: Graph[VertexProperty, String]
+{% endhighlight %}
+
+Like RDDs, property graphs are immutable, distributed, and fault-tolerant.  Changes to the values or
+structure of the graph are accomplished by producing a new graph with the desired changes. The graph
+is partitioned across the workers using a range of vertex-partitioning heuristics.  As with RDDs,
+each partition of the graph can be recreated on a different machine in the event of a failure.
+
 Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
-properties for each vertex and edge:
+properties for each vertex and edge.  As a consequence, the graph class contains members to access
+the vertices and edges of the graph:
 
 {% highlight scala %}
-class Graph[VD: ClassTag, ED: ClassTag] {
-  val vertices: RDD[(VertexId, VD)]
-  val edges: RDD[Edge[ED]]
-  // ...
-}
+val vertices: VertexRDD[VD]
+val edges: EdgeRDD[ED]
 {% endhighlight %}
 
-> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and
-> `EdgeRDD[ED]` respectively. These classes extend and are optimized versions of `RDD[(VertexId,
-> VD)]` and `RDD[Edge[ED]]` with additional functionality built around the internal index and column
-> oriented representations.  We discuss the `VertexRDD` and `EdgeRDD` API in greater detail in the
-> section on [vertex and edge RDDs](#vertex_and_edge_rdds)
+The classes `VertexRDD[VD]` and `EdgeRDD[ED]` extend and are optimized versions of `RDD[(VertexId,
+VD)]` and `RDD[Edge[ED]]` respectively.  Both `VertexRDD[VD]` and `EdgeRDD[ED]` provide additional
+functionality built around graph computation and leverage internal optimizations.  We discuss the
+`VertexRDD` and `EdgeRDD` API in greater detail in the section on [vertex and edge
+RDDs](#vertex_and_edge_rdds) but for now they can be thought of as simply RDDs of the form:
+`RDD[(VertexId, VD)]` and `RDD[Edge[ED]]`.
+
+### Example Property Graph
 
-For example, we might construct a property graph consisting of various collaborators on the GraphX
-project. The vertex property contains the username and occupation and the edge property contains
-a string describing the relationships between the users.
+Suppose we want to construct a property graph consisting of the various collaborators on the GraphX
+project. The vertex property might contain the username and occupation.  We could annotate edges
+with a string describing the relationships between collaborators:
 
 <p style="text-align: center;">
   <img src="img/property_graph.png"
@@ -183,18 +195,19 @@ graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
 graph.edges.filter(e => e.srcId > e.dstId).count
 {% endhighlight %}
 
-> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the
-> scala `case` expression to deconstruct the tuple.  Alternatively, `graph.edges` returns an `RDD`
-> containing `Edge[String]` objects.  We could have also used the case class type constructor as
-> in the following:
+> Note that `graph.vertices` returns an `VertexRDD[(String, String)]` which extends
+> `RDD[(VertexId, (String, String))]` and so we use the scala `case` expression to deconstruct
+> the tuple.  Alternatively, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects.
+> We could have also used the case class type constructor as in the following:
 > {% highlight scala %}
 graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
 {% endhighlight %}
 
 In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
 The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD,
-ED]]` consisting of [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet).
-This *join* can be expressed in the following SQL expression:
+ED]]` containing instances of the
+[`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class. This *join* can be
+expressed in the following SQL expression:
 
 {% highlight sql %}
 SELECT src.id, dst.id, src.attr, e.attr, dst.attr
@@ -266,7 +279,7 @@ defined `map` function.
 > does not preserve the structural indicies and would not benefit from the substantial system
 > optimizations in GraphX.
 > {% highlight scala %}
-val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))}
+val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
 val newGraph = Graph(newVertices, graph.edges)
 {% endhighlight %}
 
@@ -291,12 +304,9 @@ add more in the future.  The following is a list of the basic structural operato
 
 {% highlight scala %}
 def reverse: Graph[VD, ED]
-
-def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
-  vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
-
+def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
+             vpred: (VertexID, VD) => Boolean): Graph[VD, ED]
 def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
-
 def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
 {% endhighlight %}
 
@@ -309,7 +319,7 @@ The `subgraph` operator takes vertex and edge predicates and returns the graph c
 vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
 predicate *and connect vertices that satisfy the vertex predicate*.  The `subgraph` operator can be
 used in number of situations to restrict the graph to the vertices and edges of interest or
-eliminate broken links.  For example in the following code we remove broken links:
+eliminate broken links. For example in the following code we remove broken links:
 
 {% highlight scala %}
 val users: RDD[(VertexId, (String, String))]
@@ -322,32 +332,35 @@ val graph = Graph(users, relationships, defaultUser)
 val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
 {% endhighlight %}
 
-The `mask` operators returns the subgraph containing only the vertices and edges that are found in
-the input graph.  This can be used in conjunction with the `subgraph` operator to restrict a graph
-based on the properties in another related graph.  For example, we might run connected components
-using the graph with missing vertices and then restrict the answer to the valid subgraph.
+> Note in the above example only the vertex predicate is provided.  The `subgraph` operator defaults
+> to `true` if the vertex or edge predicates are not provided.
+
+The `mask` operator also constructs a subgraph by returning a graph that contains the vertices and
+edges that are also found in the input graph.  This can be used in conjunction with the `subgraph`
+operator to restrict a graph based on the properties in another related graph.  For example, we
+might run connected components using the graph with missing vertices and then restrict the answer to
+the valid subgraph.
 
 {% highlight scala %}
 // Run Connected Components
-val ccGraph = graph.connectedComponents()
+val ccGraph = graph.connectedComponents() // No longer contains missing field
 // Remove missing vertices as well as the edges to connected to them
 val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
 // Restrict the answer to the valid subgraph
 val validCCGraph = ccGraph.mask(validGraph)
 {% endhighlight %}
 
-The `groupEdges` operator merges parallel edges: duplicate edges between pairs of vertices.  In many
-numerical applications parallel edges can be *added* (their weights combined) into a single edge
-thereby reducing the graph size in memory as well as the cost of computation.
+The `groupEdges` operator merges parallel edges (i.e., duplicate edges between pairs of vertices) in
+the multigraph.  In many numerical applications, parallel edges can be *added* (their weights
+combined) into a single edge thereby reducing the size of the graph.
 
 ## Join Operators
 <a name="join_operators"></a>
 
-The ability to move between graph and collection views is a key part of GraphX.  In many cases it is
-necessary to join data from external collections (RDDs) with graphs.  For example, we might have
-extra user properties that we want to merge with an existing graph or we might want to pull vertex
-properties from one graph into another.  These tasks can be accomplished using the *join* operators.
-Below we list the key join operators:
+In many cases it is necessary to join data from external collections (RDDs) with graphs.  For
+example, we might have extra user properties that we want to merge with an existing graph or we
+might want to pull vertex properties from one graph into another.  These tasks can be accomplished
+using the *join* operators. Below we list the key join operators:
 
 {% highlight scala %}
 def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD)
@@ -356,7 +369,7 @@ def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Opt
   : Graph[VD2, ED]
 {% endhighlight %}
 
-The `joinVertices` operators, defined in
+The `joinVertices` operator, defined in
 [`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps), joins the vertices with
 the input RDD and returns a new graph with the vertex properties obtained by applying the user
 defined `map` function to the result of the joined vertices.  Vertices without a matching value in
-- 
cgit v1.2.3


From f096f4eaf1f8e936eafc2006ecd01faa2f208cf2 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sun, 12 Jan 2014 10:55:29 -0800
Subject: Link methods in programming guide; document VertexID

---
 docs/graphx-programming-guide.md                   | 155 ++++++++++++---------
 .../scala/org/apache/spark/graphx/package.scala    |   4 +
 2 files changed, 90 insertions(+), 69 deletions(-)

(limited to 'docs')
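
Note on the `subgraph` description touched by this patch: the rule that an edge
survives only if it satisfies the edge predicate and both of its endpoints
satisfy the vertex predicate can be modelled on plain Scala collections. This
is a sketch under stated assumptions, not the GraphX implementation: it uses
`Map` and `Seq` instead of RDDs, a local `Edge` case class, and an edge
predicate over `Edge` rather than `EdgeTriplet`.

```scala
object SubgraphSketch {
  // Stand-ins for illustration only; GraphX has its own VertexId and Edge.
  type VertexId = Long
  final case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

  // Keep vertices passing vpred, then keep edges that pass epred
  // and whose source and destination both survived.
  def subgraph[VD, ED](vertices: Map[VertexId, VD], edges: Seq[Edge[ED]])(
      epred: Edge[ED] => Boolean,
      vpred: (VertexId, VD) => Boolean): (Map[VertexId, VD], Seq[Edge[ED]]) = {
    val keptVertices = vertices.filter { case (id, attr) => vpred(id, attr) }
    val keptEdges = edges.filter { e =>
      epred(e) && keptVertices.contains(e.srcId) && keptVertices.contains(e.dstId)
    }
    (keptVertices, keptEdges)
  }

  def main(args: Array[String]): Unit = {
    val users = Map((1L, ("alice", "student")), (2L, ("John Doe", "Missing")))
    val rels = Seq(Edge(1L, 2L, "advisor"), Edge(1L, 1L, "self"))
    // Drop the placeholder vertex; the edge pointing at it goes with it.
    val (vs, es) = subgraph(users, rels)(_ => true, (_, attr) => attr._2 != "Missing")
    println(vs)
    println(es)
  }
}
```

The edge to the placeholder vertex is removed even though the edge predicate
accepts everything, which is the "broken link" case the guide describes.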

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 9a7c4ac179..7f93754edb 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -68,15 +68,14 @@ in graph-parallel systems, GraphX is able to optimize the execution of graph ope
 Prior to the release of GraphX, graph computation in Spark was expressed using Bagel, an
 implementation of Pregel.  GraphX improves upon Bagel by exposing a richer property graph API, a
 more streamlined version of the Pregel abstraction, and system optimizations to improve performance
-and reduce memory overhead.  While we plan to eventually deprecate the Bagel, we will continue to
-support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and [Bagel programming
-guide](bagel-programming-guide.html). However, we encourage Bagel users to explore the new GraphX
-API and comment on issues that may complicate the transition from Bagel.
+and reduce memory overhead.  While we plan to eventually deprecate Bagel, we will continue to
+support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
+[Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
+explore the new GraphX API and comment on issues that may complicate the transition from Bagel.
 
 # Getting Started
 
-To get started you first need to import Spark and GraphX into your project.  This can be done by
-importing the following:
+To get started you first need to import Spark and GraphX into your project, as follows:
 
 {% highlight scala %}
 import org.apache.spark._
@@ -89,11 +88,11 @@ If you are not using the Spark shell you will also need a Spark context.
 <a name="property_graph"></a>
 
 The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
-graph with user defined objects attached to each vertex and edge.  A directed multigraph is a
-directed graph with potentially multiple parallel edges sharing the same source and destination
-vertex.  The ability to support parallel edges simplifies modeling scenarios where there can be
-multiple relationships (e.g., co-worker and friend) between the same vertices.  Each vertex is keyed
-by a *unique* 64-bit long identifier (`VertexId`).  Similarly, edges have corresponding source and
+with user defined objects attached to each vertex and edge.  A directed multigraph is a directed
+graph with potentially multiple parallel edges sharing the same source and destination vertex.  The
+ability to support parallel edges simplifies modeling scenarios where there can be multiple
+relationships (e.g., co-worker and friend) between the same vertices.  Each vertex is keyed by a
+*unique* 64-bit long identifier (`VertexId`).  Similarly, edges have corresponding source and
 destination vertex identifiers. GraphX does not impose any ordering or constraints on the vertex
 identifiers.  The property graph is parameterized over the vertex `VD` and edge `ED` types.  These
 are the types of the objects associated with each vertex and edge respectively.
@@ -102,8 +101,8 @@ are the types of the objects associated with each vertex and edge respectively.
 > int, double, etc...) reducing the in memory footprint.
 
 In some cases we may wish to have vertices with different property types in the same graph. This can
-be accomplished through inheritance.  For example to model users and products as a bipartie graph we
-might do the following:
+be accomplished through inheritance.  For example to model users and products as a bipartite graph
+we might do the following:
 
 {% highlight scala %}
 case class VertexProperty
@@ -159,8 +158,8 @@ val userGraph: Graph[(String, String), String]
 There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
 generators and these are discussed in more detail in the section on
 [graph builders](#graph_builders).  Probably the most general method is to use the
-[graph singleton](api/graphx/index.html#org.apache.spark.graphx.Graph$).
-For example the following code constructs a graph from a collection of RDDs:
+[Graph object](api/graphx/index.html#org.apache.spark.graphx.Graph$).  For example the following
+code constructs a graph from a collection of RDDs:
 
 {% highlight scala %}
 // Assume the SparkContext has already been constructed
@@ -179,10 +178,11 @@ val defaultUser = ("John Doe", "Missing")
 val graph = Graph(users, relationships, defaultUser)
 {% endhighlight %}
 
-In the above example we make use of the [`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge)
-case class. Edges have a `srcId` and a `dstId` corresponding to the source and destination vertex
-identifiers. In addition, the `Edge` class contains the `attr` member which contains the edge
-property.
+In the above example we make use of the [`Edge`][Edge] case class. Edges have a `srcId` and a
+`dstId` corresponding to the source and destination vertex identifiers. In addition, the `Edge`
+class contains the `attr` member which contains the edge property.
+
+[Edge]: api/graphx/index.html#org.apache.spark.graphx.Edge
 
 We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices`
 and `graph.edges` members respectively.
@@ -196,18 +196,19 @@ graph.edges.filter(e => e.srcId > e.dstId).count
 {% endhighlight %}
 
 > Note that `graph.vertices` returns an `VertexRDD[(String, String)]` which extends
-> `RDD[(VertexId, (String, String))]` and so we use the scala `case` expression to deconstruct
-> the tuple.  Alternatively, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects.
+> `RDD[(VertexId, (String, String))]` and so we use the scala `case` expression to deconstruct the
+> tuple.  On the other hand, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects.
 > We could have also used the case class type constructor as in the following:
 > {% highlight scala %}
 graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
 {% endhighlight %}
 
 In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
-The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD,
-ED]]` containing instances of the
-[`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class. This *join* can be
-expressed in the following SQL expression:
+The triplet view logically joins the vertex and edge properties yielding an
+`RDD[EdgeTriplet[VD, ED]]` containing instances of the [`EdgeTriplet`][EdgeTriplet] class. This
+*join* can be expressed in the following SQL expression:
+
+[EdgeTriplet]: api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet
 
 {% highlight sql %}
 SELECT src.id, dst.id, src.attr, e.attr, dst.attr
@@ -225,8 +226,7 @@ or graphically as:
   <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
-The [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class extends the
-[`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) class by adding the `srcAttr` and
+The [`EdgeTriplet`][EdgeTriplet] class extends the [`Edge`][Edge] class by adding the `srcAttr` and
 `dstAttr` members which contain the source and destination properties respectively. We can use the
 triplet view of a graph to render a collection of strings describing relationships between users.
 
@@ -240,14 +240,15 @@ val facts: RDD[String] =
 # Graph Operators
 
 Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
-have a collection of basic operators that take user defined function and produce new graphs with
+have a collection of basic operators that take user defined functions and produce new graphs with
 transformed properties and structure.  The core operators that have optimized implementations are
-defined in [`Graph.scala`](api/graphx/index.html#org.apache.spark.graphx.Graph) and convenient
-operators that are expressed as a compositions of the core operators are defined in
-['GraphOps.scala'](api/graphx/index.html#org.apache.spark.graphx.GraphOps).  However, thanks to
-Scala implicits the operators in `GraphOps.scala` are automatically available as members of
-`Graph.scala`.  For example, we can compute the in-degree of each vertex (defined in
-'GraphOps.scala') by the following:
+defined in [`Graph`][Graph] and convenient operators that are expressed as compositions of the
+core operators are defined in [`GraphOps`][GraphOps].  However, thanks to Scala implicits the
+operators in `GraphOps` are automatically available as members of `Graph`.  For example, we can
+compute the in-degree of each vertex (defined in `GraphOps`) by the following:
+
+[Graph]: api/graphx/index.html#org.apache.spark.graphx.Graph
+[GraphOps]: api/graphx/index.html#org.apache.spark.graphx.GraphOps
 
 {% highlight scala %}
 val graph: Graph[(String, String), String]
@@ -272,20 +273,24 @@ def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
 Each of these operators yields a new graph with the vertex or edge properties modified by the user
 defined `map` function.
 
-> Note that in all cases the graph structure is unaffected.  This is a key feature of these
-> operators which allows the resulting graph to reuse the structural indicies and the unaffected
-> properties of the original graph.
-> While the following is logically equivalent to `graph.mapVertices(mapUDF)`, it
-> does not preserve the structural indicies and would not benefit from the substantial system
-> optimizations in GraphX.
+> Note that in all cases the graph structure is unaffected. This is a key feature of these operators
+> which allows the resulting graph to reuse the structural indices of the original graph. The
+> following snippets are logically equivalent, but the first one does not preserve the structural
+> indices and would not benefit from the GraphX system optimizations:
 > {% highlight scala %}
 val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
 val newGraph = Graph(newVertices, graph.edges)
 {% endhighlight %}
+> Instead, use [`mapVertices`][Graph.mapVertices] to preserve the indices:
+> {% highlight scala %}
+val newGraph = graph.mapVertices((id, attr) => mapUdf(id, attr))
+{% endhighlight %}
+
+[Graph.mapVertices]: api/graphx/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexID,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]
 
 These operators are often used to initialize the graph for a particular computation or project away
 unnecessary properties.  For example, given a graph with the out-degrees as the vertex properties
-(we describe how to construct such a graph later) we initialize for PageRank:
+(we describe how to construct such a graph later), we initialize it for PageRank:
 
 {% highlight scala %}
 // Given a graph where the vertex property is the out-degree
@@ -293,7 +298,7 @@ val inputGraph: Graph[Int, String]
 // Construct a graph where each edge contains the weight
 // and each vertex is the initial PageRank
 val outputGraph: Graph[Double, Double] =
-  inputGraph.mapTriplets(et => 1.0/et.srcAttr).mapVertices(v => 1.0)
+  inputGraph.mapTriplets(et => 1.0 / et.srcAttr).mapVertices(v => 1.0)
 {% endhighlight %}
 
 ## Structural Operators
@@ -310,16 +315,20 @@ def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
 def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
 {% endhighlight %}
 
-The `reverse` operator returns a new graph with all the edge directions reversed.  This can be
-useful when, for example, trying to compute the inverse PageRank.  Because the reverse operation
-does not modify vertex or edge properties or change the number of edges, it can be implemented
-efficiently without data-movement or duplication.
+The [`reverse`][Graph.reverse] operator returns a new graph with all the edge directions reversed.
+This can be useful when, for example, trying to compute the inverse PageRank.  Because the reverse
+operation does not modify vertex or edge properties or change the number of edges, it can be
+implemented efficiently without data-movement or duplication.
+
+[Graph.reverse]: api/graphx/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]
 
-The `subgraph` operator takes vertex and edge predicates and returns the graph containing only the
-vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
-predicate *and connect vertices that satisfy the vertex predicate*.  The `subgraph` operator can be
-used in number of situations to restrict the graph to the vertices and edges of interest or
-eliminate broken links. For example in the following code we remove broken links:
+The [`subgraph`][Graph.subgraph] operator takes vertex and edge predicates and returns the graph
+containing only the vertices that satisfy the vertex predicate (evaluate to true) and edges that
+satisfy the edge predicate *and connect vertices that satisfy the vertex predicate*.  The `subgraph`
+operator can be used in a number of situations to restrict the graph to the vertices and edges of
+interest or eliminate broken links. For example in the following code we remove broken links:
+
+[Graph.subgraph]: api/graphx/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexID,VD)⇒Boolean):Graph[VD,ED]
 
 {% highlight scala %}
 val users: RDD[(VertexId, (String, String))]
@@ -335,11 +344,13 @@ val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
 > Note in the above example only the vertex predicate is provided.  The `subgraph` operator defaults
 > to `true` if the vertex or edge predicates are not provided.
 
-The `mask` operator also constructs a subgraph by returning a graph that contains the vertices and
-edges that are also found in the input graph.  This can be used in conjunction with the `subgraph`
-operator to restrict a graph based on the properties in another related graph.  For example, we
-might run connected components using the graph with missing vertices and then restrict the answer to
-the valid subgraph.
+The [`mask`][Graph.mask] operator also constructs a subgraph by returning a graph that contains the
+vertices and edges that are also found in the input graph.  This can be used in conjunction with the
+`subgraph` operator to restrict a graph based on the properties in another related graph.  For
+example, we might run connected components using the graph with missing vertices and then restrict
+the answer to the valid subgraph.
+
+[Graph.mask]: api/graphx/index.html#org.apache.spark.graphx.Graph@mask[VD2,ED2](Graph[VD2,ED2])(ClassTag[VD2],ClassTag[ED2]):Graph[VD,ED]
 
 {% highlight scala %}
 // Run Connected Components
@@ -350,9 +361,11 @@ val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
 val validCCGraph = ccGraph.mask(validGraph)
 {% endhighlight %}
 
-The `groupEdges` operator merges parallel edges (i.e., duplicate edges between pairs of vertices) in
-the multigraph.  In many numerical applications, parallel edges can be *added* (their weights
-combined) into a single edge thereby reducing the size of the graph.
+The [`groupEdges`][Graph.groupEdges] operator merges parallel edges (i.e., duplicate edges between
+pairs of vertices) in the multigraph.  In many numerical applications, parallel edges can be *added*
+(their weights combined) into a single edge thereby reducing the size of the graph.
+
+[Graph.groupEdges]: api/graphx/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]
 
 ## Join Operators
 <a name="join_operators"></a>
@@ -369,11 +382,12 @@ def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Opt
   : Graph[VD2, ED]
 {% endhighlight %}
 
-The `joinVertices` operator, defined in
-[`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps), joins the vertices with
-the input RDD and returns a new graph with the vertex properties obtained by applying the user
-defined `map` function to the result of the joined vertices.  Vertices without a matching value in
-the RDD retain their original value.
+The [`joinVertices`][GraphOps.joinVertices] operator joins the vertices with the input RDD and
+returns a new graph with the vertex properties obtained by applying the user defined `map` function
+to the result of the joined vertices.  Vertices without a matching value in the RDD retain their
+original value.
+
+[GraphOps.joinVertices]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexID,U)])((VertexID,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]
 
 > Note that if the RDD contains more than one value for a given vertex only one will be used.   It
 > is therefore recommended that the input RDD be first made unique using the following which will
@@ -386,11 +400,14 @@ val joinedGraph = graph.joinVertices(uniqueCosts)(
   (id, oldCost, extraCost) => oldCost + extraCost)
 {% endhighlight %}
 
-The more general `outerJoinVertices` behaves similarly to `joinVertices` except that the user
-defined `map` function is applied to all vertices and can change the vertex property type.  Because
-not all vertices may have a matching value in the input RDD the `map` function takes an `Option`
-type.  For example, we can setup a graph for PageRank by initializing vertex properties with their
-`outDegree`.
+The more general [`outerJoinVertices`][Graph.outerJoinVertices] behaves similarly to `joinVertices`
+except that the user defined `map` function is applied to all vertices and can change the vertex
+property type.  Because not all vertices may have a matching value in the input RDD the `map`
+function takes an `Option` type.  For example, we can set up a graph for PageRank by initializing
+vertex properties with their `outDegree`.
+
+[Graph.outerJoinVertices]: api/graphx/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexID,U)])((VertexID,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]
+
 
 {% highlight scala %}
 val outDegrees: VertexRDD[Int] = graph.outDegrees
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/package.scala b/graphx/src/main/scala/org/apache/spark/graphx/package.scala
index 2501314ca8..e70d2fd09f 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/package.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/package.scala
@@ -3,6 +3,10 @@ package org.apache.spark
 import org.apache.spark.util.collection.OpenHashSet
 
 package object graphx {
+  /**
+   * A 64-bit vertex identifier that uniquely identifies a vertex within a graph. It does not need
+   * to follow any ordering or any constraints other than uniqueness.
+   */
   type VertexID = Long
 
   // TODO: Consider using Char.
-- 
cgit v1.2.3


From 5e35d39e0f26db3b669bc2318bd7b3f9f6c5fc50 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sun, 12 Jan 2014 13:10:53 -0800
Subject: Add PageRank example and data

---
 docs/graphx-programming-guide.md                   | 32 +++++++++++++++++++++-
 graphx/data/followers.txt                          | 12 ++++++++
 graphx/data/users.txt                              |  6 ++++
 .../org/apache/spark/graphx/lib/PageRank.scala     |  2 +-
 4 files changed, 50 insertions(+), 2 deletions(-)
 create mode 100644 graphx/data/followers.txt
 create mode 100644 graphx/data/users.txt

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 7f93754edb..52668b07c8 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -470,10 +470,40 @@ things to worry about.)
 # Graph Algorithms
 <a name="graph_algorithms"></a>
 
-This section should describe the various algorithms and how they are used.
+GraphX includes a set of graph algorithms in to simplify analytics. The algorithms are contained in the `org.apache.spark.graphx.lib` package and can be accessed directly as methods on `Graph` via an implicit conversion to [`Algorithms`][Algorithms]. This section describes the algorithms and how they are used.
+
+[Algorithms]: api/graphx/index.html#org.apache.spark.graphx.lib.Algorithms
 
 ## PageRank
 
+PageRank measures the importance of each vertex in a graph, assuming an edge from *u* to *v* represents an endorsement of *v*'s importance by *u*. For example, if a Twitter user is followed by many others, the user will be ranked highly.
+
+Spark includes an example social network dataset that we can run PageRank on. A set of users is given in `graphx/data/users.txt`, and a set of relationships between users is given in `graphx/data/followers.txt`. We can compute the PageRank of each user as follows:
+
+{% highlight scala %}
+// Load the implicit conversion to Algorithms
+import org.apache.spark.graphx.lib._
+// Load the datasets into a graph
+val users = sc.textFile("graphx/data/users.txt").map { line =>
+  val fields = line.split("\\s+")
+  (fields(0).toLong, fields(1))
+}
+val followers = sc.textFile("graphx/data/followers.txt").map { line =>
+  val fields = line.split("\\s+")
+  Edge(fields(0).toLong, fields(1).toLong, 1)
+}
+val graph = Graph(users, followers)
+// Run PageRank
+val ranks = graph.pageRank(0.0001).vertices
+// Join the ranks with the usernames
+val ranksByUsername = users.leftOuterJoin(ranks).map {
+  case (id, (username, rankOpt)) => (username, rankOpt.getOrElse(0.0))
+}
+// Print the result
+println(ranksByUsername.collect().mkString("\n"))
+{% endhighlight %}
+
+
 ## Connected Components
 
 ## Shortest Path
diff --git a/graphx/data/followers.txt b/graphx/data/followers.txt
new file mode 100644
index 0000000000..0f46d80806
--- /dev/null
+++ b/graphx/data/followers.txt
@@ -0,0 +1,12 @@
+2 1
+3 1
+4 1
+6 1
+3 2
+6 2
+7 2
+6 3
+7 3
+7 6
+6 7
+3 7
diff --git a/graphx/data/users.txt b/graphx/data/users.txt
new file mode 100644
index 0000000000..ce3d06c600
--- /dev/null
+++ b/graphx/data/users.txt
@@ -0,0 +1,6 @@
+1 BarackObama
+2 ericschmidt
+3 jeresig
+4 justinbieber
+6 matei_zaharia
+7 odersky
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
index 809b6d0855..cf95267e77 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/PageRank.scala
@@ -106,7 +106,7 @@ object PageRank extends Logging {
    * @tparam ED the original edge attribute (not used)
    *
    * @param graph the graph on which to compute PageRank
-   * @param tol the tolerance allowed at convergence (smaller => more * accurate).
+   * @param tol the tolerance allowed at convergence (smaller => more accurate).
    * @param resetProb the random reset probability (alpha)
    *
    * @return the graph containing with each vertex containing the PageRank and each edge
-- 
cgit v1.2.3
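
Reviewer note: the tolerance-based (dynamic) PageRank that the guide text above describes can be tried without a Spark cluster. The following is a minimal plain-Scala sketch, not GraphX's implementation. The object name `PageRankSketch`, the function `pageRank`, the initial rank of 1.0, and the update rule `resetProb + (1 - resetProb) * sum` are assumptions made for illustration. Only `tol` and `resetProb` mirror parameter names from the `PageRank.scala` scaladoc in this patch.

```scala
object PageRankSketch {
  // Edges are (follower, followed) pairs, the same orientation as followers.txt.
  // Hypothetical helper for illustration; this is not part of GraphX.
  def pageRank(
      edges: Seq[(Long, Long)],
      tol: Double,
      resetProb: Double = 0.15): Map[Long, Double] = {
    require(tol > 0, "tol must be positive, otherwise the loop may never stop")
    val vertices = edges.flatMap { case (s, d) => Seq(s, d) }.distinct
    if (vertices.isEmpty) return Map.empty
    val outDegree: Map[Long, Int] =
      edges.groupBy(_._1).map { case (s, es) => s -> es.size }
    // Assumption: every vertex starts with rank 1.0.
    var ranks: Map[Long, Double] = vertices.map(v => v -> 1.0).toMap
    var delta = Double.PositiveInfinity
    // Iterate until no rank changes by more than the tolerance.
    while (delta > tol) {
      // Each edge carries rank(src) / outDegree(src) to its destination.
      val contribs: Map[Long, Double] = edges
        .map { case (s, d) => d -> ranks(s) / outDegree(s) }
        .groupBy(_._1)
        .map { case (d, cs) => d -> cs.map(_._2).sum }
      val next = vertices.map { v =>
        v -> (resetProb + (1.0 - resetProb) * contribs.getOrElse(v, 0.0))
      }.toMap
      delta = vertices.map(v => math.abs(next(v) - ranks(v))).max
      ranks = next
    }
    ranks
  }
}
```

Calling `PageRankSketch.pageRank(edges, 0.0001)` with the pairs from `followers.txt` gives one rank per user ID. A vertex with no followers ends at exactly `resetProb`, because it never receives a contribution.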


From 7a4bb863c7c11e22332763081793e4989af8c526 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sun, 12 Jan 2014 16:58:18 -0800
Subject: Add connected components example to doc

---
 docs/graphx-programming-guide.md | 20 +++++++++++++++++++-
 graphx/data/followers.txt        |  6 +-----
 graphx/data/users.txt            |  2 +-
 3 files changed, 21 insertions(+), 7 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 52668b07c8..22feccb7ad 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -475,6 +475,7 @@ GraphX includes a set of graph algorithms in to simplify analytics. The algorith
 [Algorithms]: api/graphx/index.html#org.apache.spark.graphx.lib.Algorithms
 
 ## PageRank
+<a name="pagerank"></a>
 
 PageRank measures the importance of each vertex in a graph, assuming an edge from *u* to *v* represents an endorsement of *v*'s importance by *u*. For example, if a Twitter user is followed by many others, the user will be ranked highly.
 
@@ -503,9 +504,26 @@ val ranksByUsername = users.leftOuterJoin(ranks).map {
 println(ranksByUsername.collect().mkString("\n"))
 {% endhighlight %}
 
-
 ## Connected Components
 
+The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. We can compute the connected components of the example social network dataset from the [PageRank section](#pagerank) as follows:
+
+{% highlight scala %}
+// Load the implicit conversion and graph as in the PageRank example
+import org.apache.spark.graphx.lib._
+val users = ...
+val followers = ...
+val graph = Graph(users, followers)
+// Find the connected components
+val cc = graph.connectedComponents().vertices
+// Join the connected components with the usernames
+val ccByUsername = graph.vertices.innerJoin(cc) { (id, username, cc) =>
+  (username, cc)
+}
+// Print the result
+println(ccByUsername.collect().mkString("\n"))
+{% endhighlight %}
+
 ## Shortest Path
 
 ## Triangle Counting
diff --git a/graphx/data/followers.txt b/graphx/data/followers.txt
index 0f46d80806..7bb8e900e2 100644
--- a/graphx/data/followers.txt
+++ b/graphx/data/followers.txt
@@ -1,10 +1,6 @@
 2 1
-3 1
 4 1
-6 1
-3 2
-6 2
-7 2
+1 2
 6 3
 7 3
 7 6
diff --git a/graphx/data/users.txt b/graphx/data/users.txt
index ce3d06c600..26e3b3bb4d 100644
--- a/graphx/data/users.txt
+++ b/graphx/data/users.txt
@@ -1,5 +1,5 @@
 1 BarackObama
-2 ericschmidt
+2 ladygaga
 3 jeresig
 4 justinbieber
 6 matei_zaharia
-- 
cgit v1.2.3
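
Reviewer note: the labelling rule described in the guide text above (each component is labelled with the ID of its lowest-numbered vertex) can be checked by hand against the updated data files. The following is a minimal plain-Scala sketch under the assumption that edges are treated as undirected. `ConnectedComponentsSketch` and `connectedComponents` are hypothetical names, and the loop is a simple label propagation rather than the GraphX implementation.

```scala
object ConnectedComponentsSketch {
  // Hypothetical helper for illustration; this is not part of GraphX.
  // Returns a map from vertex ID to the lowest vertex ID in its component.
  def connectedComponents(
      vertices: Seq[Long],
      edges: Seq[(Long, Long)]): Map[Long, Long] = {
    // Include edge endpoints so that a vertex missing from `vertices` still gets a label.
    val all = (vertices ++ edges.flatMap { case (a, b) => Seq(a, b) }).distinct
    // Every vertex starts out labelled with its own ID.
    var labels: Map[Long, Long] = all.map(v => v -> v).toMap
    var changed = true
    // Labels only ever decrease, so the loop terminates.
    while (changed) {
      changed = false
      for ((a, b) <- edges) {
        val low = math.min(labels(a), labels(b))
        if (labels(a) != low || labels(b) != low) {
          labels = labels.updated(a, low).updated(b, low)
          changed = true
        }
      }
    }
    labels
  }
}
```

With the `followers.txt` contents as they stand after this commit, users 1, 2 and 4 share one label and users 3, 6 and 7 share another.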


From c787ff5640ad9d6f6dc3b744d73a1cb0c91eb90a Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sun, 12 Jan 2014 20:49:41 -0800
Subject: Documenting Pregel API

---
 docs/graphx-programming-guide.md | 199 ++++++++++++++++++++++++++++++++++++++-
 1 file changed, 198 insertions(+), 1 deletion(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 22feccb7ad..89759416f4 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -429,12 +429,209 @@ val joinedGraph = graph.joinVertices(uniqueCosts,
 {% endhighlight %}
 
 
-## Map Reduce Triplets (mapReduceTriplets)
+## Neighborhood Aggregation
+
+A key part of graph computation is aggregating information about the neighborhood of each vertex.
+For example, we might want to know the number of followers each user has or the average age of the
+followers of each user.  Many iterative graph algorithms (e.g., PageRank, Shortest Path, and
+connected components) repeatedly aggregate properties of neighboring vertices (e.g., current
+PageRank Value, shortest path to the source, and smallest reachable vertex id).
+
+### Map Reduce Triplets (mapReduceTriplets)
 <a name="mrTriplets"></a>
 
+[Graph.mapReduceTriplets]: api/graphx/index.html#mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=&gt;Iterator[(org.apache.spark.graphx.VertexID,A)],reduceFunc:(A,A)=&gt;A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A]
+
+The core (heavily optimized) aggregation primitive in GraphX is the
+[`mapReduceTriplets`][Graph.mapReduceTriplets] operator:
+
+{% highlight scala %}
+def mapReduceTriplets[A](
+    map: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
+    reduce: (A, A) => A)
+  : VertexRDD[A]
+{% endhighlight %}
+
+The [`mapReduceTriplets`][Graph.mapReduceTriplets] operator takes a user defined map function which
+is applied to each triplet and can yield *messages* destined to neither, either, or both vertices in
+the triplet.  We currently only support messages destined to the source or destination vertex of the
+triplet to enable optimized preaggregation.  The user defined `reduce` function combines the
+messages destined to each vertex.  The `mapReduceTriplets` operator returns a `VertexRDD[A]`
+containing the aggregate message to each vertex.  Vertices that do not receive a message are not
+included in the returned `VertexRDD`.
+
+> Note that `mapReduceTriplets` takes an additional optional `activeSet` (see API docs) which
+> restricts the map phase to edges adjacent to the vertices in the provided `VertexRDD`. Restricting
+> computation to triplets adjacent to a subset of the vertices is often necessary in incremental
+> iterative computation and is a key part of the GraphX implementation of Pregel.
+
+We can use the `mapReduceTriplets` operator to collect information about adjacent vertices.  For
+example, if we wanted to compute the average age of the followers who are older than each user, we
+could do the following.
+
+{% highlight scala %}
+// Graph with age as the vertex property
+val graph: Graph[Double, String] = getFromSomewhereElse()
+// Compute the number of older followers and their total age
+val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
+  triplet => { // Map Function
+    if (triplet.srcAttr > triplet.dstAttr) {
+      // Send message to destination vertex containing counter and age
+      Iterator((triplet.dstId, (1, triplet.srcAttr)))
+    } else {
+      // Don't send a message for this triplet
+      Iterator.empty
+    }
+  },
+  // Add counter and age
+  (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
+)
+// Divide total age by number of older followers to get average age of older followers
+val avgAgeOlderFollowers: VertexRDD[Double] =
+  olderFollowers.mapValues { case (count, totalAge) => totalAge / count }
+{% endhighlight %}
+
+> Note that the `mapReduceTriplets` operation performs optimally when the messages (and their sums)
+> are constant sized (e.g., floats and addition instead of lists and concatenation).  More
+> precisely, the result of `mapReduceTriplets` should be sub-linear in the degree of each vertex.
+
+Because it is often necessary to aggregate information about neighboring vertices we also provide an
+alternative interface defined in [`GraphOps`][GraphOps]:
+
+{% highlight scala %}
+def aggregateNeighbors[A](
+    map: (VertexID, EdgeTriplet[VD, ED]) => Option[A],
+    reduce: (A, A) => A,
+    edgeDir: EdgeDirection)
+  : VertexRDD[A]
+{% endhighlight %}
+
+The `aggregateNeighbors` operator is implemented directly on top of `mapReduceTriplets` but allows
+the user to define the logic in a more vertex centric manner.  Here the `map` function is provided
+the vertex to which the message is sent as well as one of the edges and returns the optional message
+value.  The `edgeDir` determines whether the `map` function is run on `In`, `Out`, or `All` edges
+adjacent to each vertex.
+
+### Computing Degree Information
+
+A common aggregation task is computing the degree of each vertex: the number of edges adjacent to
+each vertex.  In the context of directed graphs it is often necessary to know the in-degree,
+out-degree, and the total degree of each vertex.  The [`GraphOps`][GraphOps] class contains a
+collection of operators to compute the degrees of each vertex.  For example in the following we
+compute the max in, out, and total degrees:
+
+{% highlight scala %}
+// Define a reduce operation to compute the highest degree vertex
+def maxReduce(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
+  if (a._2 > b._2) a else b
+}
+// Compute the max degrees
+val maxInDegree: (VertexId, Int)  = graph.inDegrees.reduce(maxReduce)
+val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(maxReduce)
+val maxDegrees: (VertexId, Int)   = graph.degrees.reduce(maxReduce)
+{% endhighlight %}
+
+
+### Collecting Neighbors
+
+In some cases it may be easier to express computation by collecting neighboring vertices and their
+attributes at each vertex. This can be easily accomplished using the `collectNeighborIds` and the
+`collectNeighbors` operators.
+
+{% highlight scala %}
+def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexID]]
+def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexID, VD)]]
+{% endhighlight %}
+
+> Note that these operators can be quite costly as they duplicate information and require
+> substantial communication.  If possible try expressing the same computation using the
+> `mapReduceTriplets` operator directly.
+
 # Pregel API
 <a name="pregel"></a>
 
+Graphs are inherently recursive data-structures as properties of vertices depend on properties of
+their neighbors which in turn depend on properties of the neighbors of their neighbors.  As a
+consequence many important graph algorithms iteratively recompute the properties of each vertex
+until a fixed-point condition is reached.  A range of graph-parallel abstractions have been proposed
+to express these iterative algorithms.  GraphX exposes a Pregel operator which is a fusion of
+the widely used Pregel and GraphLab abstractions.
+
+At a high level, the GraphX variant of the Pregel abstraction is a bulk-synchronous parallel
+messaging abstraction constrained to the topology of the graph.  The Pregel operator executes in a
+series of super-steps in which vertices receive the sum of their inbound messages from the previous
+super-step, compute a new property value, and then send messages to neighboring vertices in the next
+super-step.  Vertices that do not receive a message are skipped within a super-step.  The Pregel
+operator terminates iteration and returns the final graph when there are no messages remaining.
+
+> Note, unlike more standard Pregel implementations, vertices in GraphX can only send messages to
+> neighboring vertices and the message construction is done in parallel using a user defined
+> messaging function.  These constraints allow additional optimization within GraphX.
+
+The following is the type signature of the Pregel operator as well as a *sketch* of its implementation
+(note calls to graph.cache have been removed):
+
+{% highlight scala %}
+def pregel[A]
+    (initialMsg: A,
+     maxIterations: Int = Int.MaxValue,
+     activeDir: EdgeDirection = EdgeDirection.Out)
+    (vprog: (VertexID, VD, A) => VD,
+     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID,A)],
+     mergeMsg: (A, A) => A)
+  : Graph[VD, ED] = {
+  // Receive the initial message at each vertex
+  var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
+  // compute the messages
+  var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
+  var activeMessages = messages.count()
+  // Loop until no messages remain or maxIterations is achieved
+  var i = 0
+  while (activeMessages > 0 && i < maxIterations) {
+    // Receive the messages: -----------------------------------------------------------------------
+    // Run the vertex program on all vertices that receive messages
+    val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
+    // Merge the new vertex values back into the graph
+    g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
+    // Send Messages: ------------------------------------------------------------------------------
+    // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
+    // get to send messages.  More precisely the map phase of mapReduceTriplets is only invoked
+    // on edges in the activeDir of vertices in newVerts
+    messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
+    activeMessages = messages.count()
+    i += 1
+  }
+  g
+}
+{% endhighlight %}
+
+Notice that Pregel takes two argument lists (i.e., `graph.pregel(list1)(list2)`).  The first
+argument list contains configuration parameters including the initial message, the maximum number of
+iterations, and the edge direction in which to send messages (by default along out edges).  The
+second argument list contains the user defined functions for receiving messages (the vertex program
+`vprog`), computing messages (`sendMsg`), and combining messages (`mergeMsg`).
+
+We can use the Pregel operator to express computation such as single source shortest path in the
+following example.
+
+{% highlight scala %}
+val graph: Graph[String, Double] // A graph with edge attributes containing distances
+val sourceId: VertexId = 42 // The ultimate source
+// Initialize the graph such that all vertices except the root have distance infinity.
+val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
+val sssp = initialGraph.pregel(Double.PositiveInfinity)(
+  (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
+  triplet => {  // Send Message
+    if(triplet.srcAttr + triplet.attr < triplet.dstAttr) {
+      Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
+    } else {
+      Iterator.empty
+    }
+  },
+  (a,b) => math.min(a,b) // Merge Message
+  )
+{% endhighlight %}
+
 # Graph Builders
 <a name="graph_builders"></a>
 
-- 
cgit v1.2.3
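
Reviewer note: the `mapReduceTriplets` and Pregel descriptions in the patch above can be exercised on in-memory collections. The following is a minimal plain-Scala sketch, assuming a graph small enough to fit in a `Map` and a `Seq`. `PregelSketch`, its `Triplet` and `Edge` case classes, and the two example methods are hypothetical names introduced here; none of this is the GraphX API. Unlike the documented operator, the sketch does not restrict the send phase to an active set, so it relies on `sendMsg` going quiet by itself.

```scala
object PregelSketch {
  final case class Edge[ED](srcId: Long, dstId: Long, attr: ED)
  // A triplet joins an edge with the attributes of both endpoints.
  final case class Triplet[VD, ED](
      srcId: Long, srcAttr: VD, dstId: Long, dstAttr: VD, attr: ED)

  // Map over every triplet, then reduce the messages addressed to each vertex.
  // Vertices that receive no message are absent from the result.
  def mapReduceTriplets[VD, ED, A](
      vertices: Map[Long, VD],
      edges: Seq[Edge[ED]])(
      mapFunc: Triplet[VD, ED] => Iterator[(Long, A)],
      reduceFunc: (A, A) => A): Map[Long, A] =
    edges
      .flatMap { e =>
        mapFunc(Triplet(e.srcId, vertices(e.srcId), e.dstId, vertices(e.dstId), e.attr))
      }
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceFunc) }

  // Bulk-synchronous loop: deliver merged messages, run vprog, send again.
  def pregel[VD, ED, A](
      vertices: Map[Long, VD],
      edges: Seq[Edge[ED]],
      initialMsg: A,
      maxIterations: Int = Int.MaxValue)(
      vprog: (Long, VD, A) => VD,
      sendMsg: Triplet[VD, ED] => Iterator[(Long, A)],
      mergeMsg: (A, A) => A): Map[Long, VD] = {
    // Every vertex receives the initial message first.
    var state = vertices.map { case (id, attr) => id -> vprog(id, attr, initialMsg) }
    var messages = mapReduceTriplets(state, edges)(sendMsg, mergeMsg)
    var i = 0
    while (messages.nonEmpty && i < maxIterations) {
      // Only vertices with a pending message run the vertex program.
      state = state.map { case (id, attr) =>
        id -> messages.get(id).map(m => vprog(id, attr, m)).getOrElse(attr)
      }
      messages = mapReduceTriplets(state, edges)(sendMsg, mergeMsg)
      i += 1
    }
    state
  }

  // Single source shortest path from vertex 1 on a three-vertex example graph.
  def ssspExample: Map[Long, Double] = {
    val edges = Seq(Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0))
    val init = Map(1L -> 0.0, 2L -> Double.PositiveInfinity, 3L -> Double.PositiveInfinity)
    pregel(init, edges, Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),
      t =>
        if (t.srcAttr + t.attr < t.dstAttr) Iterator((t.dstId, t.srcAttr + t.attr))
        else Iterator.empty,
      (a, b) => math.min(a, b))
  }

  // Average age of the followers who are older than the user they follow.
  def avgAgeOfOlderFollowers(
      ages: Map[Long, Double],
      follows: Seq[Edge[Unit]]): Map[Long, Double] =
    mapReduceTriplets[Double, Unit, (Int, Double)](ages, follows)(
      t => if (t.srcAttr > t.dstAttr) Iterator((t.dstId, (1, t.srcAttr))) else Iterator.empty,
      (a, b) => (a._1 + b._1, a._2 + b._2)
    ).map { case (id, (count, total)) => id -> total / count }
}
```

In `ssspExample` the direct edge from 1 to 3 costs 5.0, but the path through vertex 2 costs 3.0, so vertex 3 is updated twice before the messages run out.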


From 20c509b805dbfd0ebb11d2d7bd53a4379249a86f Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sun, 12 Jan 2014 21:41:21 -0800
Subject: Add TriangleCount example

---
 docs/graphx-programming-guide.md                   | 31 +++++++++++++++++++---
 .../apache/spark/graphx/lib/TriangleCount.scala    |  5 ++--
 2 files changed, 29 insertions(+), 7 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 89759416f4..0e228d8f28 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -676,7 +676,9 @@ GraphX includes a set of graph algorithms in to simplify analytics. The algorith
 
 PageRank measures the importance of each vertex in a graph, assuming an edge from *u* to *v* represents an endorsement of *v*'s importance by *u*. For example, if a Twitter user is followed by many others, the user will be ranked highly.
 
-Spark includes an example social network dataset that we can run PageRank on. A set of users is given in `graphx/data/users.txt`, and a set of relationships between users is given in `graphx/data/followers.txt`. We can compute the PageRank of each user as follows:
+GraphX comes with static and dynamic implementations of PageRank as methods on the [`PageRank` object][PageRank]. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphX also includes an example social network dataset that we can run PageRank on. A set of users is given in `graphx/data/users.txt`, and a set of relationships between users is given in `graphx/data/followers.txt`. We compute the PageRank of each user as follows:
+
+[PageRank]: api/graphx/index.html#org.apache.spark.graphx.lib.PageRank$
 
 {% highlight scala %}
 // Load the implicit conversion to Algorithms
@@ -703,7 +705,9 @@ println(ranksByUsername.collect().mkString("\n"))
 
 ## Connected Components
 
-The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. We can compute the connected components of the example social network dataset from the [PageRank section](#pagerank) as follows:
+The connected components algorithm labels each connected component of the graph with the ID of its lowest-numbered vertex. For example, in a social network, connected components can approximate clusters. GraphX contains an implementation of the algorithm in the [`ConnectedComponents` object][ConnectedComponents], and we compute the connected components of the example social network dataset from the [PageRank section](#pagerank) as follows:
+
+[ConnectedComponents]: api/graphx/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
 
 {% highlight scala %}
 // Load the implicit conversion and graph as in the PageRank example
@@ -721,10 +725,29 @@ val ccByUsername = graph.vertices.innerJoin(cc) { (id, username, cc) =>
 println(ccByUsername.collect().mkString("\n"))
 {% endhighlight %}
 
-## Shortest Path
-
 ## Triangle Counting
 
+A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the [`TriangleCount` object][TriangleCount] that determines the number of triangles passing through each vertex, providing a measure of clustering. We compute the triangle count of the social network dataset from the [PageRank section](#pagerank). *Note that `TriangleCount` requires the edges to be in canonical orientation (`srcId < dstId`) and the graph to be partitioned using [`Graph#partitionBy`][Graph.partitionBy].*
+
+[TriangleCount]: api/graphx/index.html#org.apache.spark.graphx.lib.TriangleCount$
+[Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
+
+{% highlight scala %}
+// Load the implicit conversion and graph as in the PageRank example
+import org.apache.spark.graphx.lib._
+val users = ...
+// Load the edges in canonical order and partition the graph for triangle count
+val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(RandomVertexCut)
+// Find the triangle count for each vertex
+val triCounts = graph.triangleCount().vertices
+// Join the triangle counts with the usernames
+val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
+  (username, tc)
+}
+// Print the result
+println(triCountByUsername.collect().mkString("\n"))
+{% endhighlight %}
+
 ## K-Core
 
 ## LDA
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala
index c6b1c736dd..58da9e3aed 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/TriangleCount.scala
@@ -19,9 +19,8 @@ object TriangleCount {
    *
    *
    * @param graph a graph with `sourceId` less than `destId`. The graph must have been partitioned
-   * using Graph.partitionBy.
-   *
-   * @return
+   * using [[org.apache.spark.graphx.Graph#partitionBy]], and its edges must be in canonical
+   * orientation (srcId < dstId).
    */
   def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD,ED]): Graph[Int, ED] = {
     // Remove redundant edges
-- 
cgit v1.2.3
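
Reviewer note: the definition used in the guide text above (a vertex is part of a triangle when two of its neighbours are themselves connected) can be checked on small inputs. The following is a minimal plain-Scala sketch, not the GraphX `TriangleCount` implementation. `TriangleCountSketch` and `triangleCount` are hypothetical names. The canonicalisation step stands in for the `srcId < dstId` requirement, and the sketch ignores partitioning entirely.

```scala
object TriangleCountSketch {
  // Hypothetical helper for illustration; this is not part of GraphX.
  // Returns the number of triangles passing through each vertex that has an edge.
  def triangleCount(edges: Seq[(Long, Long)]): Map[Long, Int] = {
    // Put edges in canonical orientation (src < dst); drop self-loops and duplicates.
    val canonical = edges.collect {
      case (a, b) if a != b => (math.min(a, b), math.max(a, b))
    }.distinct
    // Undirected neighbour set of every vertex.
    val neighbors: Map[Long, Set[Long]] = canonical
      .flatMap { case (a, b) => Seq(a -> b, b -> a) }
      .groupBy(_._1)
      .map { case (v, ns) => v -> ns.map(_._2).toSet }
    neighbors.map { case (v, ns) =>
      val sorted = ns.toSeq
      // Count unordered neighbour pairs (u, w) that are connected to each other.
      val closedPairs = for {
        u <- sorted
        w <- sorted
        if u < w && neighbors(u).contains(w)
      } yield (u, w)
      v -> closedPairs.size
    }
  }
}
```

On the current `followers.txt` contents the only triangle is formed by users 3, 6 and 7, so those three vertices get a count of one and the rest get zero.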


From d691e9f47ed9b43b422712047183142d01c5e8c2 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sun, 12 Jan 2014 21:47:16 -0800
Subject: Move algorithms to GraphOps

---
 docs/graphx-programming-guide.md                   | 12 +---
 .../main/scala/org/apache/spark/graphx/Graph.scala |  4 +-
 .../scala/org/apache/spark/graphx/GraphOps.scala   | 51 ++++++++++++++++-
 .../org/apache/spark/graphx/lib/Algorithms.scala   | 66 ----------------------
 .../org/apache/spark/graphx/lib/package.scala      |  8 ---
 5 files changed, 54 insertions(+), 87 deletions(-)
 delete mode 100644 graphx/src/main/scala/org/apache/spark/graphx/lib/Algorithms.scala
 delete mode 100644 graphx/src/main/scala/org/apache/spark/graphx/lib/package.scala

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 0e228d8f28..572afc101b 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -667,9 +667,7 @@ things to worry about.)
 # Graph Algorithms
 <a name="graph_algorithms"></a>
 
-GraphX includes a set of graph algorithms in to simplify analytics. The algorithms are contained in the `org.apache.spark.graphx.lib` package and can be accessed directly as methods on `Graph` via an implicit conversion to [`Algorithms`][Algorithms]. This section describes the algorithms and how they are used.
-
-[Algorithms]: api/graphx/index.html#org.apache.spark.graphx.lib.Algorithms
+GraphX includes a set of graph algorithms to simplify analytics. The algorithms are contained in the `org.apache.spark.graphx.lib` package and can be accessed directly as methods on `Graph` via [`GraphOps`][GraphOps]. This section describes the algorithms and how they are used.
 
 ## PageRank
 <a name="pagerank"></a>
@@ -681,8 +679,6 @@ GraphX comes with static and dynamic implementations of PageRank as methods on t
 [PageRank]: api/graphx/index.html#org.apache.spark.graphx.lib.PageRank$
 
 {% highlight scala %}
-// Load the implicit conversion to Algorithms
-import org.apache.spark.graphx.lib._
 // Load the datasets into a graph
 val users = sc.textFile("graphx/data/users.txt").map { line =>
   val fields = line.split("\\s+")
@@ -710,8 +706,7 @@ The connected components algorithm labels each connected component of the graph
 [ConnectedComponents]: api/graphx/index.html#org.apache.spark.graphx.lib.ConnectedComponents$
 
 {% highlight scala %}
-// Load the implicit conversion and graph as in the PageRank example
-import org.apache.spark.graphx.lib._
+// Load the graph as in the PageRank example
 val users = ...
 val followers = ...
 val graph = Graph(users, followers)
@@ -733,8 +728,7 @@ A vertex is part of a triangle when it has two adjacent vertices with an edge be
 [Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
 
 {% highlight scala %}
-// Load the implicit conversion and graph as in the PageRank example
-import org.apache.spark.graphx.lib._
+// Load the graph as in the PageRank example
 val users = ...
 // Load the edges in canonical order and partition the graph for triangle count
 val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(RandomVertexCut)
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala b/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala
index 56513cac20..7d4f0de3d6 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/Graph.scala
@@ -15,9 +15,7 @@ import org.apache.spark.storage.StorageLevel
  * RDDs, the graph is a functional data-structure in which mutating
  * operations return new graphs.
  *
- * @note [[GraphOps]] contains additional convenience operations.
- * [[lib.Algorithms]] contains graph algorithms; to access these,
- * import `org.apache.spark.graphx.lib._`.
+ * @note [[GraphOps]] contains additional convenience operations and graph algorithms.
  *
  * @tparam VD the vertex attribute type
  * @tparam ED the edge attribute type
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
index 4fdff29f5a..2b3b95e2ca 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
@@ -2,9 +2,10 @@ package org.apache.spark.graphx
 
 import scala.reflect.ClassTag
 
-import org.apache.spark.rdd.RDD
 import org.apache.spark.SparkContext._
 import org.apache.spark.SparkException
+import org.apache.spark.graphx.lib._
+import org.apache.spark.rdd.RDD
 
 /**
  * Contains additional functionality for [[Graph]]. All operations are expressed in terms of the
@@ -298,4 +299,52 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) {
     Pregel(graph, initialMsg, maxIterations, activeDirection)(vprog, sendMsg, mergeMsg)
   }
 
+  /**
+   * Run a dynamic version of PageRank returning a graph with vertex attributes containing the
+   * PageRank and edge attributes containing the normalized edge weight.
+   *
+   * @see [[org.apache.spark.graphx.lib.PageRank]], method `runUntilConvergence`.
+   */
+  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double] = {
+    PageRank.runUntilConvergence(graph, tol, resetProb)
+  }
+
+  /**
+   * Run PageRank for a fixed number of iterations returning a graph with vertex attributes
+   * containing the PageRank and edge attributes containing the normalized edge weight.
+   *
+   * @see [[org.apache.spark.graphx.lib.PageRank]], method `run`.
+   */
+  def staticPageRank(numIter: Int, resetProb: Double = 0.15): Graph[Double, Double] = {
+    PageRank.run(graph, numIter, resetProb)
+  }
+
+  /**
+   * Compute the connected component membership of each vertex and return a graph with the vertex
+   * value containing the lowest vertex id in the connected component containing that vertex.
+   *
+   * @see [[org.apache.spark.graphx.lib.ConnectedComponents]]
+   */
+  def connectedComponents(): Graph[VertexID, ED] = {
+    ConnectedComponents.run(graph)
+  }
+
+  /**
+   * Compute the number of triangles passing through each vertex.
+   *
+   * @see [[org.apache.spark.graphx.lib.TriangleCount]]
+   */
+  def triangleCount(): Graph[Int, ED] = {
+    TriangleCount.run(graph)
+  }
+
+  /**
+   * Compute the strongly connected component (SCC) of each vertex and return a graph with the
+   * vertex value containing the lowest vertex id in the SCC containing that vertex.
+   *
+   * @see [[org.apache.spark.graphx.lib.StronglyConnectedComponents]]
+   */
+  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED] = {
+    StronglyConnectedComponents.run(graph, numIter)
+  }
 } // end of GraphOps
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/Algorithms.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/Algorithms.scala
deleted file mode 100644
index cbcd9c24a0..0000000000
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/Algorithms.scala
+++ /dev/null
@@ -1,66 +0,0 @@
-package org.apache.spark.graphx.lib
-
-import scala.reflect.ClassTag
-
-import org.apache.spark.graphx._
-
-/**
- * Provides graph algorithms directly on [[org.apache.spark.graphx.Graph]] via an implicit
- * conversion.
- * @example
- * {{{
- * import org.apache.spark.graph.lib._
- * val graph: Graph[_, _] = loadGraph()
- * graph.connectedComponents()
- * }}}
- */
-class Algorithms[VD: ClassTag, ED: ClassTag](self: Graph[VD, ED]) {
-  /**
-   * Run a dynamic version of PageRank returning a graph with vertex attributes containing the
-   * PageRank and edge attributes containing the normalized edge weight.
-   *
-   * @see [[org.apache.spark.graphx.lib.PageRank]], method `runUntilConvergence`.
-   */
-  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double] = {
-    PageRank.runUntilConvergence(self, tol, resetProb)
-  }
-
-  /**
-   * Run PageRank for a fixed number of iterations returning a graph with vertex attributes
-   * containing the PageRank and edge attributes the normalized edge weight.
-   *
-   * @see [[org.apache.spark.graphx.lib.PageRank]], method `run`.
-   */
-  def staticPageRank(numIter: Int, resetProb: Double = 0.15): Graph[Double, Double] = {
-    PageRank.run(self, numIter, resetProb)
-  }
-
-  /**
-   * Compute the connected component membership of each vertex and return a graph with the vertex
-   * value containing the lowest vertex id in the connected component containing that vertex.
-   *
-   * @see [[org.apache.spark.graphx.lib.ConnectedComponents]]
-   */
-  def connectedComponents(): Graph[VertexID, ED] = {
-    ConnectedComponents.run(self)
-  }
-
-  /**
-   * Compute the number of triangles passing through each vertex.
-   *
-   * @see [[org.apache.spark.graphx.lib.TriangleCount]]
-   */
-  def triangleCount(): Graph[Int, ED] = {
-    TriangleCount.run(self)
-  }
-
-  /**
-   * Compute the strongly connected component (SCC) of each vertex and return a graph with the
-   * vertex value containing the lowest vertex id in the SCC containing that vertex.
-   *
-   * @see [[org.apache.spark.graphx.lib.StronglyConnectedComponents]]
-   */
-  def stronglyConnectedComponents(numIter: Int): Graph[VertexID, ED] = {
-    StronglyConnectedComponents.run(self, numIter)
-  }
-}
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/package.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/package.scala
deleted file mode 100644
index f6f2626c9d..0000000000
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/package.scala
+++ /dev/null
@@ -1,8 +0,0 @@
-package org.apache.spark.graphx
-
-import scala.reflect.ClassTag
-
-package object lib {
-  implicit def graphToAlgorithms[VD: ClassTag, ED: ClassTag](
-      graph: Graph[VD, ED]): Algorithms[VD, ED] = new Algorithms(graph)
-}
-- 
cgit v1.2.3


From 1efe78a1013af1aa97d07c18b27f1ccbb90c2790 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Sun, 12 Jan 2014 22:03:03 -0800
Subject: Use GraphLoader for algorithms examples in doc

---
 docs/graphx-programming-guide.md | 36 +++++++++++++++++++-----------------
 1 file changed, 19 insertions(+), 17 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 572afc101b..f2f5a88828 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -674,24 +674,22 @@ GraphX includes a set of graph algorithms in to simplify analytics. The algorith
 
 PageRank measures the importance of each vertex in a graph, assuming an edge from *u* to *v* represents an endorsement of *v*'s importance by *u*. For example, if a Twitter user is followed by many others, the user will be ranked highly.
 
-GraphX comes with static and dynamic implementations of PageRank as methods on the [`PageRank` object][PageRank]. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). GraphX also includes an example social network dataset that we can run PageRank on. A set of users is given in `graphx/data/users.txt`, and a set of relationships between users is given in `graphx/data/followers.txt`. We compute the PageRank of each user as follows:
+GraphX comes with static and dynamic implementations of PageRank as methods on the [`PageRank` object][PageRank]. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). [`GraphOps`][GraphOps] allows calling these algorithms directly as methods on `Graph`.
+
+GraphX also includes an example social network dataset that we can run PageRank on. A set of users is given in `graphx/data/users.txt`, and a set of relationships between users is given in `graphx/data/followers.txt`. We compute the PageRank of each user as follows:
 
 [PageRank]: api/graphx/index.html#org.apache.spark.graphx.lib.PageRank$
 
 {% highlight scala %}
-// Load the datasets into a graph
+// Load the edges as a graph
+val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
+// Run PageRank
+val ranks = graph.pageRank(0.0001).vertices
+// Join the ranks with the usernames
 val users = sc.textFile("graphx/data/users.txt").map { line =>
   val fields = line.split("\\s+")
   (fields(0).toLong, fields(1))
 }
-val followers = sc.textFile("graphx/data/followers.txt").map { line =>
-  val fields = line.split("\\s+")
-  Edge(fields(0).toLong, fields(1).toLong, 1)
-}
-val graph = Graph(users, followers)
-// Run PageRank
-val ranks = graph.pageRank(0.0001).vertices
-// Join the ranks with the usernames
 val ranksByUsername = users.leftOuterJoin(ranks).map {
   case (id, (username, rankOpt)) => (username, rankOpt.getOrElse(0.0))
 }
@@ -707,13 +705,15 @@ The connected components algorithm labels each connected component of the graph
 
 {% highlight scala %}
 // Load the graph as in the PageRank example
-val users = ...
-val followers = ...
-val graph = Graph(users, followers)
+val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
 // Find the connected components
 val cc = graph.connectedComponents().vertices
 // Join the connected components with the usernames
-val ccByUsername = graph.vertices.innerJoin(cc) { (id, username, cc) =>
+val users = sc.textFile("graphx/data/users.txt").map { line =>
+  val fields = line.split("\\s+")
+  (fields(0).toLong, fields(1))
+}
+val ccByUsername = users.join(cc).map { case (id, (username, cc)) =>
   (username, cc)
 }
 // Print the result
@@ -728,14 +728,16 @@ A vertex is part of a triangle when it has two adjacent vertices with an edge be
 [Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
 
 {% highlight scala %}
-// Load the graph as in the PageRank example
-val users = ...
 // Load the edges in canonical order and partition the graph for triangle count
 val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(RandomVertexCut)
 // Find the triangle count for each vertex
 val triCounts = graph.triangleCount().vertices
 // Join the triangle counts with the usernames
-val triCountByUsername = graph.vertices.innerJoin(triCounts) { (id, username, tc) =>
+val users = sc.textFile("graphx/data/users.txt").map { line =>
+  val fields = line.split("\\s+")
+  (fields(0).toLong, fields(1))
+}
+val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
   (username, tc)
 }
 // Print the result
-- 
cgit v1.2.3


From 66c9d0092ae28e07c4fae8b026cca6cf74f1c37a Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Sun, 12 Jan 2014 22:11:04 -0800
Subject: Tested and corrected all examples up to mask in the
 graphx-programming-guide.

---
 docs/graphx-programming-guide.md | 37 ++++++++++++++++++++-----------------
 1 file changed, 20 insertions(+), 17 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index f2f5a88828..2697b2def7 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -80,6 +80,8 @@ To get started you first need to import Spark and GraphX into your project, as f
 {% highlight scala %}
 import org.apache.spark._
 import org.apache.spark.graphx._
+// To make some of the examples work we will also need RDD
+import org.apache.spark.rdd.RDD
 {% endhighlight %}
 
 If you are not using the Spark shell you will also need a Spark context.
@@ -105,13 +107,11 @@ be accomplished through inheritance.  For example to model users and products as
 we might do the following:
 
 {% highlight scala %}
-case class VertexProperty
-case class UserProperty extends VertexProperty
-  (val name: String)
-case class ProductProperty extends VertexProperty
-  (val name: String, val price: Double)
+class VertexProperty()
+case class UserProperty(val name: String) extends VertexProperty
+case class ProductProperty(val name: String, val price: Double) extends VertexProperty
 // The graph might then have the type:
-val graph: Graph[VertexProperty, String]
+var graph: Graph[VertexProperty, String] = null
 {% endhighlight %}
 
 Like RDDs, property graphs are immutable, distributed, and fault-tolerant.  Changes to the values or
@@ -165,13 +165,13 @@ code constructs a graph from a collection of RDDs:
 // Assume the SparkContext has already been constructed
 val sc: SparkContext
 // Create an RDD for the vertices
-val users: RDD[(VertexId, (String, String))] =
-  sc.parallelize(Array((3, ("rxin", "student")), (7, ("jgonzal", "postdoc")),
-                       (5, ("franklin", "prof")), (2, ("istoica", "prof"))))
+val users: RDD[(VertexID, (String, String))] =
+  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
+                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof"))))
 // Create an RDD for edges
 val relationships: RDD[Edge[String]] =
-  sc.parallelize(Array(Edge(3, 7, "collab"), Edge(5, 3, "advisor"),
-                       Edge(2, 5, "colleague"), Edge(5, 7, "pi"))
+  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
+                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi")))
 // Define a default user in case there are relationship with missing user
 val defaultUser = ("John Doe", "Missing")
 // Build the initial Graph
@@ -200,7 +200,7 @@ graph.edges.filter(e => e.srcId > e.dstId).count
 > tuple.  On the other hand, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects.
 > We could have also used the case class type constructor as in the following:
 > {% highlight scala %}
-graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
+graph.edges.filter { case Edge(src, dst, prop) => src > dst }.count
 {% endhighlight %}
 
 In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
@@ -234,7 +234,9 @@ triplet view of a graph to render a collection of strings describing relationshi
 val graph: Graph[(String, String), String] // Constructed from above
 // Use the triplets view to create an RDD of facts.
 val facts: RDD[String] =
-  graph.triplets.map(et => et.srcAttr._1 + " is the " + et.attr + " of " et.dstAttr)
+  graph.triplets.map(triplet =>
+    triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1)
+facts.collect.foreach(println(_))
 {% endhighlight %}
 
 # Graph Operators
@@ -294,11 +296,12 @@ unnecessary properties.  For example, given a graph with the out-degrees as the
 
 {% highlight scala %}
 // Given a graph where the vertex property is the out-degree
-val inputGraph: Graph[Int, String]
+val inputGraph: Graph[Int, String] =
+  graph.outerJoinVertices(graph.outDegrees)((vid, _, degOpt) => degOpt.getOrElse(0))
 // Construct a graph where each edge contains the weight
 // and each vertex is the initial PageRank
 val outputGraph: Graph[Double, Double] =
-  inputGraph.mapTriplets(et => 1.0 / et.srcAttr).mapVertices(v => 1.0)
+  inputGraph.mapTriplets(triplet => 1.0 / triplet.srcAttr).mapVertices((id, _) => 1.0)
 {% endhighlight %}
 
 ## Structural Operators
@@ -338,7 +341,7 @@ val defaultUser = ("John Doe", "Missing")
 // Build the initial Graph
 val graph = Graph(users, relationships, defaultUser)
 // Remove missing vertices as well as the edges to connected to them
-val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
+val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
 {% endhighlight %}
 
 > Note in the above example only the vertex predicate is provided.  The `subgraph` operator defaults
@@ -356,7 +359,7 @@ the answer to the valid subgraph.
 // Run Connected Components
 val ccGraph = graph.connectedComponents() // No longer contains missing field
 // Remove missing vertices as well as the edges to connected to them
-val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
+val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
 // Restrict the answer to the valid subgraph
 val validCCGraph = ccGraph.mask(validGraph)
 {% endhighlight %}
-- 
cgit v1.2.3


From 80e4d98dc656e20dacbd8cdbf92d4912673b42ae Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Mon, 13 Jan 2014 13:40:16 -0800
Subject: Improving documentation and identifying potential bug in CC
 calculation.

---
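Note: the `undirected` flag introduced below changes which way the minimum
vertex id is allowed to travel along an edge. The following is a minimal
sketch of that idea in plain Scala, assuming a sequential fixpoint loop in
place of Pregel. `MinLabelSketch` and `minLabels` are hypothetical names used
only for illustration; they are not part of GraphX.

```scala
object MinLabelSketch {
  type VertexID = Long

  // Repeatedly push the smallest known vertex id along each edge until no
  // label changes. This mirrors the comparisons in sendMessage, but runs
  // sequentially on a local edge list instead of through Pregel.
  def minLabels(
      edges: Seq[(VertexID, VertexID)],
      undirected: Boolean): Map[VertexID, VertexID] = {
    val vertices = edges.flatMap { case (src, dst) => Seq(src, dst) }.distinct
    // Every vertex starts out labelled with its own id.
    var labels: Map[VertexID, VertexID] = vertices.map(v => v -> v).toMap
    var changed = true
    while (changed) {
      changed = false
      for ((src, dst) <- edges) {
        if (labels(src) < labels(dst)) {
          // Forward direction: always allowed.
          labels = labels.updated(dst, labels(src))
          changed = true
        } else if (undirected && labels(src) > labels(dst)) {
          // Reverse direction: only when edge direction is ignored.
          labels = labels.updated(src, labels(dst))
          changed = true
        }
      }
    }
    labels
  }

  def main(args: Array[String]): Unit = {
    // Edge list of the toy graph used in ConnectedComponentsSuite.
    val edges = Seq((3L, 7L), (5L, 3L), (2L, 5L), (5L, 7L), (4L, 0L), (5L, 0L))
    println(minLabels(edges, undirected = true))
    println(minLabels(edges, undirected = false))
  }
}
```

On the toy graph from the new test, the undirected run labels every vertex
with 0, which is what the test asserts. The directed run leaves 0 and 4 with
their own ids and labels 2, 3, 5 and 7 with 2, since labels then only flow
from source to destination.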
 docs/graphx-programming-guide.md                   | 33 +++++++++++++---
 .../scala/org/apache/spark/graphx/GraphOps.scala   |  4 +-
 .../spark/graphx/lib/ConnectedComponents.scala     | 44 +++++++++++++++-------
 .../graphx/lib/ConnectedComponentsSuite.scala      | 30 +++++++++++++++
 4 files changed, 89 insertions(+), 22 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 2697b2def7..ed976b8989 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -84,7 +84,8 @@ import org.apache.spark.graphx._
 import org.apache.spark.rdd.RDD
 {% endhighlight %}
 
-If you are not using the Spark shell you will also need a Spark context.
+If you are not using the Spark shell you will also need a `SparkContext`.  To learn more about
+getting started with Spark refer to the [Spark Quick Start Guide](quick-start.html).
 
 # The Property Graph
 <a name="property_graph"></a>
@@ -190,7 +191,7 @@ and `graph.edges` members respectively.
 {% highlight scala %}
 val graph: Graph[(String, String), String] // Constructed from above
 // Count all users which are postdocs
-graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
+graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc" }.count
 // Count all the edges where src > dst
 graph.edges.filter(e => e.srcId > e.dstId).count
 {% endhighlight %}
@@ -258,8 +259,10 @@ val graph: Graph[(String, String), String]
 val indDegrees: VertexRDD[Int] = graph.inDegrees
 {% endhighlight %}
 
-The reason for differentiating between core graph operations and GraphOps is to be able to support
-various graph representations in the future.
+The reason for differentiating between core graph operations and [`GraphOps`][GraphOps] is to be
+able to support different graph representations in the future.  Each graph representation must
+provide implementations of the core operations and can then reuse many of the useful operations
+defined in [`GraphOps`][GraphOps].
 
 ## Property Operators
 
@@ -334,14 +337,32 @@ interest or eliminate broken links. For example in the following code we remove
 [Graph.subgraph]: api/graphx/index.html#org.apache.spark.graphx.Graph@subgraph((EdgeTriplet[VD,ED])⇒Boolean,(VertexID,VD)⇒Boolean):Graph[VD,ED]
 
 {% highlight scala %}
-val users: RDD[(VertexId, (String, String))]
-val edges: RDD[Edge[String]]
+// Create an RDD for the vertices
+val users: RDD[(VertexID, (String, String))] =
+  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
+                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
+                       (4L, ("peter", "student"))))
+// Create an RDD for edges
+val relationships: RDD[Edge[String]] =
+  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
+                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
+                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))
 // Define a default user in case there are relationship with missing user
 val defaultUser = ("John Doe", "Missing")
 // Build the initial Graph
 val graph = Graph(users, relationships, defaultUser)
+// Notice that there is a user 0 (for which we have no information) connecting users
+// 4 (peter) and 5 (franklin).
+graph.triplets.map(
+    triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
+  ).collect.foreach(println(_))
 // Remove missing vertices as well as the edges to connected to them
 val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
+// The valid subgraph will disconnect users 4 and 5 by removing user 0
+validGraph.vertices.collect.foreach(println(_))
+validGraph.triplets.map(
+    triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
+  ).collect.foreach(println(_))
 {% endhighlight %}
 
 > Note in the above example only the vertex predicate is provided.  The `subgraph` operator defaults
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
index 2b3b95e2ca..a0a40e2d9a 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
@@ -325,8 +325,8 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) {
    *
    * @see [[org.apache.spark.graphx.lib.ConnectedComponents]]
    */
-  def connectedComponents(): Graph[VertexID, ED] = {
-    ConnectedComponents.run(graph)
+  def connectedComponents(undirected: Boolean = true): Graph[VertexID, ED] = {
+    ConnectedComponents.run(graph, undirected)
   }
 
   /**
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala b/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala
index 4a83e2dbb8..d078d2acdb 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/lib/ConnectedComponents.scala
@@ -14,26 +14,42 @@ object ConnectedComponents {
    * @tparam ED the edge attribute type (preserved in the computation)
    *
    * @param graph the graph for which to compute the connected components
+   * @param undirected compute reachability ignoring edge direction.
    *
    * @return a graph with vertex attributes containing the smallest vertex in each
    *         connected component
    */
-  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): Graph[VertexID, ED] = {
+  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED], undirected: Boolean = true):
+    Graph[VertexID, ED] = {
     val ccGraph = graph.mapVertices { case (vid, _) => vid }
-
-    def sendMessage(edge: EdgeTriplet[VertexID, ED]) = {
-      if (edge.srcAttr < edge.dstAttr) {
-        Iterator((edge.dstId, edge.srcAttr))
-      } else if (edge.srcAttr > edge.dstAttr) {
-        Iterator((edge.srcId, edge.dstAttr))
-      } else {
-        Iterator.empty
+    if (undirected) {
+      def sendMessage(edge: EdgeTriplet[VertexID, ED]) = {
+        if (edge.srcAttr < edge.dstAttr) {
+          Iterator((edge.dstId, edge.srcAttr))
+        } else if (edge.srcAttr > edge.dstAttr) {
+          Iterator((edge.srcId, edge.dstAttr))
+        } else {
+          Iterator.empty
+        }
+      }
+      val initialMessage = Long.MaxValue
+      Pregel(ccGraph, initialMessage, activeDirection = EdgeDirection.Both)(
+        vprog = (id, attr, msg) => math.min(attr, msg),
+        sendMsg = sendMessage,
+        mergeMsg = (a, b) => math.min(a, b))
+    } else {
+      def sendMessage(edge: EdgeTriplet[VertexID, ED]) = {
+        if (edge.srcAttr < edge.dstAttr) {
+          Iterator((edge.dstId, edge.srcAttr))
+        } else {
+          Iterator.empty
+        }
       }
+      val initialMessage = Long.MaxValue
+      Pregel(ccGraph, initialMessage, activeDirection = EdgeDirection.Out)(
+        vprog = (id, attr, msg) => math.min(attr, msg),
+        sendMsg = sendMessage,
+        mergeMsg = (a, b) => math.min(a, b))
     }
-    val initialMessage = Long.MaxValue
-    Pregel(ccGraph, initialMessage)(
-      vprog = (id, attr, msg) => math.min(attr, msg),
-      sendMsg = sendMessage,
-      mergeMsg = (a, b) => math.min(a, b))
   } // end of connectedComponents
 }
diff --git a/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala b/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
index 66612b381f..86da8f1b46 100644
--- a/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
+++ b/graphx/src/test/scala/org/apache/spark/graphx/lib/ConnectedComponentsSuite.scala
@@ -80,4 +80,34 @@ class ConnectedComponentsSuite extends FunSuite with LocalSparkContext {
     }
   } // end of reverse chain connected components
 
+  test("Connected Components on a Toy Connected Graph") {
+    withSpark { sc =>
+      // Create an RDD for the vertices
+      val users: RDD[(VertexID, (String, String))] =
+        sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
+                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
+                       (4L, ("peter", "student"))))
+      // Create an RDD for edges
+      val relationships: RDD[Edge[String]] =
+        sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
+                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
+                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))
+      // Edges are:
+      //   2 ---> 5 ---> 3
+      //          | \
+      //          V   \|
+      //   4 ---> 0    7
+      //
+      // Define a default user in case there are relationships with missing users
+      val defaultUser = ("John Doe", "Missing")
+      // Build the initial Graph
+      val graph = Graph(users, relationships, defaultUser)
+      val ccGraph = graph.connectedComponents(undirected = true)
+      val vertices = ccGraph.vertices.collect
+      for ( (id, cc) <- vertices ) {
+        assert(cc == 0)
+      }
+    }
+  } // end of toy connected components
+
 }
-- 
cgit v1.2.3


From 15ca89b11edbb2800efc992d6cf4eba787a00873 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 14:54:33 -0800
Subject: Fix mapReduceTriplets links in doc

---
 docs/graphx-programming-guide.md | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 2697b2def7..1832ded888 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -443,10 +443,10 @@ PageRank Value, shortest path to the source, and smallest reachable vertex id).
 ### Map Reduce Triplets (mapReduceTriplets)
 <a name="mrTriplets"></a>
 
-[Graph.mapReduceTriplets]: api/graphx/index.html#mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=&gt;Iterator[(org.apache.spark.graphx.VertexID,A)],reduceFunc:(A,A)=&gt;A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A]
+[Graph.mapReduceTriplets]: api/graphx/index.html#org.apache.spark.graphx.Graph@mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=&gt;Iterator[(org.apache.spark.graphx.VertexID,A)],reduceFunc:(A,A)=&gt;A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A]
 
-These core (heavily optimized) aggregation primitive in GraphX is the
-(`mapReduceTriplets`)[Graph.mapReduceTriplets] operator:
+The core (heavily optimized) aggregation primitive in GraphX is the
+[`mapReduceTriplets`][Graph.mapReduceTriplets] operator:
 
 {% highlight scala %}
 def mapReduceTriplets[A](
@@ -455,7 +455,7 @@ def mapReduceTriplets[A](
   : VertexRDD[A]
 {% endhighlight %}
 
-The (`mapReduceTriplets`)[Graph.mapReduceTriplets] operator takes a user defined map function which
+The [`mapReduceTriplets`][Graph.mapReduceTriplets] operator takes a user defined map function which
 is applied to each triplet and can yield *messages* destined to either (none or both) vertices in
 the triplet.  We currently only support messages destined to the source or destination vertex of the
 triplet to enable optimized preaggregation.  The user defined `reduce` function combines the
-- 
cgit v1.2.3


From 97cd27e31b18f4c41ef556aee2ab65350694f8b8 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 14:54:48 -0800
Subject: Add graph loader links to doc

---
 docs/graphx-programming-guide.md | 13 +++++++++++++
 1 file changed, 13 insertions(+)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 1832ded888..7f1559d1e2 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -638,6 +638,19 @@ val sssp = initialGraph.pregel(Double.PositiveInfinity)(
 # Graph Builders
 <a name="graph_builders"></a>
 
+[`GraphLoader.edgeListFile`][GraphLoader.edgeListFile]
+
+[`Graph.apply`][Graph.apply]
+
+[`Graph.fromEdgeTuples`][Graph.fromEdgeTuples]
+
+[`Graph.fromEdges`][Graph.fromEdges]
+
+[GraphLoader.edgeListFile]: api/graphx/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
+[Graph.apply]: api/graphx/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexID,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+[Graph.fromEdgeTuples]: api/graphx/index.html#org.apache.spark.graphx.Graph$@fromEdgeTuples[VD](RDD[(VertexID,VertexID)],VD,Option[PartitionStrategy])(ClassTag[VD]):Graph[VD,Int]
+[Graph.fromEdges]: api/graphx/index.html#org.apache.spark.graphx.Graph$@fromEdges[VD,ED](RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
+
 # Vertex and Edge RDDs
 <a name="vertex_and_edge_rdds"></a>
 
-- 
cgit v1.2.3


From 1bd5cefcae2769d48ad5ef4b8197193371c754da Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 16:15:10 -0800
Subject: Remove aggregateNeighbors

---
 docs/graphx-programming-guide.md                   | 17 ------
 .../scala/org/apache/spark/graphx/GraphOps.scala   | 64 ++--------------------
 .../org/apache/spark/graphx/GraphOpsSuite.scala    | 26 ---------
 3 files changed, 5 insertions(+), 102 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 002ba0cf73..e6afd092be 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -519,23 +519,6 @@ val avgAgeOlderFollowers: VertexRDD[Double] =
 > are constant sized (e.g., floats and addition instead of lists and concatenation).  More
 > precisely, the result of `mapReduceTriplets` should be sub-linear in the degree of each vertex.
 
-Because it is often necessary to aggregate information about neighboring vertices we also provide an
-alternative interface defined in [`GraphOps`][GraphOps]:
-
-{% highlight scala %}
-def aggregateNeighbors[A](
-    map: (VertexID, EdgeTriplet[VD, ED]) => Option[A],
-    reduce: (A, A) => A,
-    edgeDir: EdgeDirection)
-  : VertexRDD[A]
-{% endhighlight %}
-
-The `aggregateNeighbors` operator is implemented directly on top of `mapReduceTriplets` but allows
-the user to define the logic in a more vertex centric manner.  Here the `map` function is provided
-the vertex to which the message is sent as well as one of the edges and returns the optional message
-value.  The `edgeDir` determines whether the `map` function is run on `In`, `Out`, or `All` edges
-adjacent to each vertex.
-
 ### Computing Degree Information
 
 A common aggregation task is computing the degree of each vertex: the number of edges adjacent to
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala b/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
index a0a40e2d9a..578eb331c1 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala
@@ -55,60 +55,6 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) {
     }
   }
 
-  /**
-   * Computes a statistic for the neighborhood of each vertex.
-   *
-   * @param mapFunc the function applied to each edge adjacent to each vertex. The mapFunc can
-   * optionally return `None`, in which case it does not contribute to the final sum.
-   * @param reduceFunc the function used to merge the results of each map operation
-   * @param direction the direction of edges to consider (e.g., In, Out, Both).
-   * @tparam A the aggregation type
-   *
-   * @return an RDD containing tuples of vertex identifiers and
-   * their resulting value. Vertices with no neighbors will not appear in the RDD.
-   *
-   * @example We can use this function to compute the average follower
-   * age for each user:
-   *
-   * {{{
-   * val graph: Graph[Int,Int] = GraphLoader.edgeListFile(sc, "webgraph")
-   * val averageFollowerAge: RDD[(Int, Int)] =
-   *   graph.aggregateNeighbors[(Int,Double)](
-   *     (vid, edge) => Some((edge.otherVertex(vid).data, 1)),
-   *     (a, b) => (a._1 + b._1, a._2 + b._2),
-   *     -1,
-   *     EdgeDirection.In)
-   *     .mapValues{ case (sum,followers) => sum.toDouble / followers}
-   * }}}
-   */
-  def aggregateNeighbors[A: ClassTag](
-      mapFunc: (VertexID, EdgeTriplet[VD, ED]) => Option[A],
-      reduceFunc: (A, A) => A,
-      dir: EdgeDirection)
-    : VertexRDD[A] = {
-    // Define a new map function over edge triplets
-    val mf = (et: EdgeTriplet[VD,ED]) => {
-      // Compute the message to the dst vertex
-      val dst =
-        if (dir == EdgeDirection.In || dir == EdgeDirection.Both) {
-          mapFunc(et.dstId, et)
-        } else { Option.empty[A] }
-      // Compute the message to the source vertex
-      val src =
-        if (dir == EdgeDirection.Out || dir == EdgeDirection.Both) {
-          mapFunc(et.srcId, et)
-        } else { Option.empty[A] }
-      // construct the return array
-      (src, dst) match {
-        case (None, None) => Iterator.empty
-        case (Some(srcA),None) => Iterator((et.srcId, srcA))
-        case (None, Some(dstA)) => Iterator((et.dstId, dstA))
-        case (Some(srcA), Some(dstA)) => Iterator((et.srcId, srcA), (et.dstId, dstA))
-      }
-    }
-    graph.mapReduceTriplets(mf, reduceFunc)
-  } // end of aggregateNeighbors
-
   /**
    * Collect the neighbor vertex ids for each vertex.
    *
@@ -152,11 +98,11 @@ class GraphOps[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]) {
    *
    * @return the vertex set of neighboring vertex attributes for each vertex
    */
-  def collectNeighbors(edgeDirection: EdgeDirection) :
-    VertexRDD[ Array[(VertexID, VD)] ] = {
-    val nbrs = graph.aggregateNeighbors[Array[(VertexID,VD)]](
-      (vid, edge) =>
-        Some(Array( (edge.otherVertexId(vid), edge.otherVertexAttr(vid)) )),
+  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexID, VD)]] = {
+    val nbrs = graph.mapReduceTriplets[Array[(VertexID,VD)]](
+      edge => Iterator(
+        (edge.srcId, Array((edge.dstId, edge.dstAttr))),
+        (edge.dstId, Array((edge.srcId, edge.srcAttr)))),
       (a, b) => a ++ b,
       edgeDirection)
 
diff --git a/graphx/src/test/scala/org/apache/spark/graphx/GraphOpsSuite.scala b/graphx/src/test/scala/org/apache/spark/graphx/GraphOpsSuite.scala
index cd3c0bbd30..7a901409d5 100644
--- a/graphx/src/test/scala/org/apache/spark/graphx/GraphOpsSuite.scala
+++ b/graphx/src/test/scala/org/apache/spark/graphx/GraphOpsSuite.scala
@@ -8,32 +8,6 @@ import org.scalatest.FunSuite
 
 class GraphOpsSuite extends FunSuite with LocalSparkContext {
 
-  test("aggregateNeighbors") {
-    withSpark { sc =>
-      val n = 3
-      val star =
-        Graph.fromEdgeTuples(sc.parallelize((1 to n).map(x => (0: VertexID, x: VertexID))), 1)
-
-      val indegrees = star.aggregateNeighbors(
-        (vid, edge) => Some(1),
-        (a: Int, b: Int) => a + b,
-        EdgeDirection.In)
-      assert(indegrees.collect().toSet === (1 to n).map(x => (x, 1)).toSet)
-
-      val outdegrees = star.aggregateNeighbors(
-        (vid, edge) => Some(1),
-        (a: Int, b: Int) => a + b,
-        EdgeDirection.Out)
-      assert(outdegrees.collect().toSet === Set((0, n)))
-
-      val noVertexValues = star.aggregateNeighbors[Int](
-        (vid: VertexID, edge: EdgeTriplet[Int, Int]) => None,
-        (a: Int, b: Int) => throw new Exception("reduceFunc called unexpectedly"),
-        EdgeDirection.In)
-      assert(noVertexValues.collect().toSet === Set.empty[(VertexID, Int)])
-    }
-  }
-
   test("joinVertices") {
     withSpark { sc =>
       val vertices =
-- 
cgit v1.2.3


From cfe4a29dcb516ceae5f243ac3b5d0c3a505d7f5a Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Mon, 13 Jan 2014 17:15:21 -0800
Subject: Improvements in example code for the programming guide as well as
 adding serialization support for GraphImpl to address issues with failed
 closure capture.

---
 docs/graphx-programming-guide.md                   | 39 ++++++++++++----------
 .../org/apache/spark/graphx/impl/GraphImpl.scala   |  3 ++
 2 files changed, 25 insertions(+), 17 deletions(-)

(limited to 'docs')
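
The guide example touched by this patch computes, for each user, the average age of followers who are older than that user. As a rough illustration of the map and reduce phases, here is a plain-Scala model that needs no Spark context. The `Triplet` case class, the sample data, and the assumption that the map function sends `(1, followerAge)` to the followed user are all part of this sketch, not taken from the patch.

```scala
object OlderFollowersSketch {
  // Hypothetical triplet: src follows dst; the attributes are ages.
  final case class Triplet(srcId: Long, srcAge: Double, dstId: Long, dstAge: Double)

  val follows: Seq[Triplet] = Seq(
    Triplet(1L, 40.0, 3L, 30.0), // older follower of 3
    Triplet(2L, 50.0, 3L, 30.0), // older follower of 3
    Triplet(3L, 30.0, 1L, 40.0)  // younger follower of 1: no message
  )

  // Map phase: an older follower sends (1, age) to the user being followed.
  val messages: Seq[(Long, (Int, Double))] = follows.flatMap { t =>
    if (t.srcAge > t.dstAge) Some((t.dstId, (1, t.srcAge))) else None
  }

  // Reduce phase: sum counts and ages per receiving vertex.
  val olderFollowers: Map[Long, (Int, Double)] =
    messages.groupBy(_._1).map { case (id, ms) =>
      id -> ms.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
    }

  // Average age; vertices that received no message are simply absent.
  val avgAgeOfOlderFollowers: Map[Long, Double] =
    olderFollowers.map { case (id, (count, totalAge)) => id -> totalAge / count }
}
```

Because the message is a fixed-size pair and the reduce is an addition, the aggregate stays constant sized regardless of vertex degree, which is the property the guide's note recommends.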

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index e6afd092be..c82c3d7358 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -478,24 +478,26 @@ def mapReduceTriplets[A](
 
 The [`mapReduceTriplets`][Graph.mapReduceTriplets] operator takes a user defined map function which
 is applied to each triplet and can yield *messages* destined to either (none or both) vertices in
-the triplet.  We currently only support messages destined to the source or destination vertex of the
-triplet to enable optimized preaggregation.  The user defined `reduce` function combines the
+the triplet.  To facilitate optimized pre-aggregation, we currently only support messages destined
+to the source or destination vertex of the triplet.  The user defined `reduce` function combines the
 messages destined to each vertex.  The `mapReduceTriplets` operator returns a `VertexRDD[A]`
-containing the aggregate message to each vertex.  Vertices that do not receive a message are not
-included in the returned `VertexRDD`.
+containing the aggregate message (of type `A`) destined to each vertex.  Vertices that do not
+receive a message are not included in the returned `VertexRDD`.
 
-> Note that `mapReduceTriplets takes an additional optional `activeSet` (see API docs) which
+> Note that `mapReduceTriplets` takes an additional optional `activeSet` (see API docs) which
 > restricts the map phase to edges adjacent to the vertices in the provided `VertexRDD`. Restricting
 > computation to triplets adjacent to a subset of the vertices is often necessary in incremental
 > iterative computation and is a key part of the GraphX implementation of Pregel.
 
-We can use the `mapReduceTriplets` operator to collect information about adjacent vertices.  For
-example if we wanted to compute the average age of followers who are older that each user we could
-do the following.
+In the following example we use the `mapReduceTriplets` operator to compute the average age of the
+more senior followers of each user.
 
 {% highlight scala %}
-// Graph with age as the vertex property
-val graph: Graph[Double, String] = getFromSomewhereElse()
+// Import Random graph generation library
+import org.apache.spark.graphx.util.GraphGenerators
+// Create a graph with "age" as the vertex property.  Here we use a random graph for simplicity.
+val graph: Graph[Double, Int] =
+  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapVertices( (id, _) => id.toDouble )
 // Compute the number of older followers and their total age
 val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Double)](
   triplet => { // Map Function
@@ -511,13 +513,16 @@ val olderFollowers: VertexRDD[(Int, Double)] = graph.mapReduceTriplets[(Int, Dou
   (a, b) => (a._1 + b._1, a._2 + b._2) // Reduce Function
 )
 // Divide total age by number of older followers to get average age of older followers
-val avgAgeOlderFollowers: VertexRDD[Double] =
-  olderFollowers.mapValues { case (count, totalAge) => totalAge / count }
+val avgAgeOfOlderFollowers: VertexRDD[Double] =
+  olderFollowers.mapValues( (id, value) => value match { case (count, totalAge) => totalAge / count } )
+// Display the results
+avgAgeOfOlderFollowers.collect.foreach(println(_))
 {% endhighlight %}
 
 > Note that the `mapReduceTriplets` operation performs optimally when the messages (and their sums)
 > are constant sized (e.g., floats and addition instead of lists and concatenation).  More
-> precisely, the result of `mapReduceTriplets` should be sub-linear in the degree of each vertex.
+> precisely, the result of `mapReduceTriplets` should ideally be sub-linear in the degree of each
+> vertex.
 
 ### Computing Degree Information
 
@@ -529,13 +534,13 @@ compute the max in, out, and total degrees:
 
 {% highlight scala %}
 // Define a reduce operation to compute the highest degree vertex
-def maxReduce(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
+def max(a: (VertexID, Int), b: (VertexID, Int)): (VertexID, Int) = {
   if (a._2 > b._2) a else b
 }
 // Compute the max degrees
-val maxInDegree: (VertexId, Int)  = graph.inDegrees.reduce(maxReduce)
-val maxOutDegree: (VertexId, Int) = graph.outDegrees.reduce(maxReduce)
-val maxDegrees: (VertexId, Int)   = graph.degrees.reduce(maxReduce)
+val maxInDegree: (VertexID, Int)  = graph.inDegrees.reduce(max)
+val maxOutDegree: (VertexID, Int) = graph.outDegrees.reduce(max)
+val maxDegrees: (VertexID, Int)   = graph.degrees.reduce(max)
 {% endhighlight %}
 
 
diff --git a/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala b/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala
index c21f8935d9..916eb9763c 100644
--- a/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala
+++ b/graphx/src/main/scala/org/apache/spark/graphx/impl/GraphImpl.scala
@@ -32,6 +32,9 @@ class GraphImpl[VD: ClassTag, ED: ClassTag] protected (
     @transient val replicatedVertexView: ReplicatedVertexView[VD])
   extends Graph[VD, ED] with Serializable {
 
+  /** Default constructor is provided to support serialization */
+  protected def this() = this(null, null, null, null)
+
   /** Return a RDD that brings edges together with their source and destination vertices. */
   @transient override val triplets: RDD[EdgeTriplet[VD, ED]] = {
     val vdTag = classTag[VD]
-- 
cgit v1.2.3


From 622b7f7d391375cced8633e4a2546dbca60a3907 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Mon, 13 Jan 2014 17:46:47 -0800
Subject: Minor changes in graphx programming guide.

---
 docs/graphx-programming-guide.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index c82c3d7358..c6505d21f1 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -543,7 +543,6 @@ val maxOutDegree: (VertexID, Int) = graph.outDegrees.reduce(max)
 val maxDegrees: (VertexID, Int)   = graph.degrees.reduce(max)
 {% endhighlight %}
 
-
 ### Collecting Neighbors
 
 In some cases it may be easier to express computation by collecting neighboring vertices and their
@@ -562,8 +561,8 @@ def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[ Array[(VertexID,
 # Pregel API
 <a name="pregel"></a>
 
-Graphs are inherently recursive data-structures as properties of a vertices depend on properties of
-their neighbors which intern depend on properties of the neighbors of their neighbors.  As a
+Graphs are inherently recursive data-structures as properties of vertices depend on properties of
+their neighbors which in turn depend on properties of *their* neighbors.  As a
 consequence many important graph algorithms iteratively recompute the properties of each vertex
 until a fixed-point condition is reached.  A range of graph-parallel abstractions have been proposed
 to express these iterative algorithms.  GraphX exposes a Pregel operator which is a fusion of
-- 
cgit v1.2.3


From 552de5d42e395bad19f5d5fe6dcc1e678bb994a8 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Mon, 13 Jan 2014 18:40:35 -0800
Subject: Finished second pass on pregel docs.

---
 docs/graphx-programming-guide.md | 45 +++++++++++++++++++++++++++++-----------
 1 file changed, 33 insertions(+), 12 deletions(-)

(limited to 'docs')
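
The four active-set directions described in this patch amount to a predicate on each edge. The following sketch spells that predicate out on plain Scala tuples; the `Direction` type and the `Active*` names are hypothetical stand-ins for `EdgeDirection`, chosen so the example compiles without Spark.

```scala
object ActiveSetSketch {
  // Hypothetical stand-in for EdgeDirection.
  sealed trait Direction
  case object ActiveIn extends Direction
  case object ActiveOut extends Direction
  case object ActiveEither extends Direction
  case object ActiveBoth extends Direction

  // Decide whether the map function would run on an edge (src, dst).
  def isActive(edge: (Long, Long), active: Set[Long], dir: Direction): Boolean = {
    val (src, dst) = edge
    dir match {
      case ActiveIn     => active(dst)                // destination in the active set
      case ActiveOut    => active(src)                // source in the active set
      case ActiveEither => active(src) || active(dst) // at least one endpoint
      case ActiveBoth   => active(src) && active(dst) // both endpoints
    }
  }

  val edges: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L), (3L, 1L))
  val active: Set[Long] = Set(1L, 2L)

  // Edges on which the map phase would run for a given direction.
  def selected(dir: Direction): Seq[(Long, Long)] =
    edges.filter(e => isActive(e, active, dir))
}
```

With the active set `{1, 2}` on this three-edge cycle, `ActiveBoth` keeps only the edge `(1, 2)`, while `ActiveEither` keeps every edge.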

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index c6505d21f1..77d807874f 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -484,10 +484,28 @@ messages destined to each vertex.  The `mapReduceTriplets` operator returns a `V
 containing the aggregate message (of type `A`) destined to each vertex.  Vertices that do not
 receive a message are not included in the returned `VertexRDD`.
 
-> Note that `mapReduceTriplets` takes an additional optional `activeSet` (see API docs) which
-> restricts the map phase to edges adjacent to the vertices in the provided `VertexRDD`. Restricting
-> computation to triplets adjacent to a subset of the vertices is often necessary in incremental
-> iterative computation and is a key part of the GraphX implementation of Pregel.
+<blockquote>
+<p>
+Note that <code>mapReduceTriplets</code> takes an additional optional <code>activeSet</code>
+(see API docs) which restricts the map phase to edges adjacent to the vertices in the provided
+<code>VertexRDD</code>:
+</p>
+{% highlight scala %}
+  activeSetOpt: Option[(VertexRDD[_], EdgeDirection)] = None
+{% endhighlight %}
+<p>
+The <code>EdgeDirection</code> specifies which edges adjacent to the vertex set are included in the map phase. If
+the direction is <code>In</code>, <code>mapFunc</code> will be run only on edges with
+destination in the active set. If the direction is <code>Out</code>, <code>mapFunc</code> will
+be run only on edges originating from vertices in the active set.  If the direction is
+<code>Either</code>, <code>mapFunc</code> will be run only on edges with <i>either</i> vertex in the
+active set.  If the direction is <code>Both</code>, <code>mapFunc</code> will be run only on edges
+with both vertices in the active set.  The active set must be derived from the set of vertices in
+the graph. Restricting computation to triplets adjacent to a subset of the vertices is often
+necessary in incremental iterative computation and is a key part of the GraphX implementation of
+Pregel.
+</p>
+</blockquote>
 
 In the following example we use the `mapReduceTriplets` operator to compute the average age of the
 more senior followers of each user.
@@ -565,15 +583,18 @@ Graphs are inherently recursive data-structures as properties of vertices depend
 their neighbors which in turn depend on properties of *their* neighbors.  As a
 consequence many important graph algorithms iteratively recompute the properties of each vertex
 until a fixed-point condition is reached.  A range of graph-parallel abstractions have been proposed
-to express these iterative algorithms.  GraphX exposes a Pregel operator which is a fusion of
+to express these iterative algorithms.  GraphX exposes a Pregel-like operator which is a fusion of
 the widely used Pregel and GraphLab abstractions.
 
-At a high-level the GraphX variant of the Pregel abstraction is a bulk-synchronous parallel
-messaging abstraction constrained to the topology of the graph.  The Pregel operator executes in a
-series of super-steps in which vertices receive the sum of their inbound messages from the previous
-super-step, compute a new property value, and then send messages to neighboring vertices in the next
-super-step.  Vertices that do not receive a message are skipped within a super-step.  The Pregel
-operators terminates iteration and returns the final graph when there are no messages remaining.
+At a high level the Pregel operator in GraphX is a bulk-synchronous parallel messaging abstraction
+*constrained to the topology of the graph*.  The Pregel operator executes in a series of super-steps
+in which vertices receive the *sum* of their inbound messages from the previous super-step, compute
+a new value for the vertex property, and then send messages to neighboring vertices in the next
+super-step.  Unlike Pregel, and more like GraphLab, messages are computed in parallel as a
+function of the edge triplet and the message computation has access to both the source and
+destination vertex attributes.  Vertices that do not receive a message are skipped within a
+super-step.  The Pregel operator terminates iteration and returns the final graph when there are no
+messages remaining.
 
 > Note, unlike more standard Pregel implementations, vertices in GraphX can only send messages to
 > neighboring vertices and the message construction is done in parallel using a user defined
@@ -588,7 +609,7 @@ def pregel[A]
      maxIter: Int = Int.MaxValue,
      activeDir: EdgeDirection = EdgeDirection.Out)
     (vprog: (VertexID, VD, A) => VD,
-     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID,A)],
+     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
      mergeMsg: (A, A) => A)
   : Graph[VD, ED] = {
   // Receive the initial message at each vertex
-- 
cgit v1.2.3


From ee8931d2c6503716de640d6d1249c515e1fd85d3 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Mon, 13 Jan 2014 19:30:25 -0800
Subject: Finished documenting vertexrdd.

---
 docs/graphx-programming-guide.md | 53 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 53 insertions(+)

(limited to 'docs')
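
The `aggregateUsingIndex` operator documented in this patch can be pictured as a reduce-by-key restricted to the keys of an existing vertex set. Below is a minimal plain-Scala model of that idea, mirroring the `setA`/`rddB`/`setB`/`setC` example from the guide. It is a sketch under assumptions: the "index" is modeled as a plain `Set[Long]`, keys outside the index are dropped, and none of the performance characteristics of the real hash-map index are represented.

```scala
object AggregateUsingIndexSketch {
  // Hypothetical model: reduce messages by key, keeping only indexed keys.
  def aggregateUsingIndex[A](index: Set[Long], msgs: Seq[(Long, A)])(
      reduceFunc: (A, A) => A): Map[Long, A] =
    msgs
      .filter { case (id, _) => index.contains(id) }
      .groupBy(_._1)
      .map { case (id, ms) => id -> ms.map(_._2).reduce(reduceFunc) }

  // 100 vertices, each with attribute 1.
  val setA: Map[Long, Int] = (0L until 100L).map(id => id -> 1).toMap

  // Two messages per vertex, so 200 entries in total.
  val rddB: Seq[(Long, Double)] =
    (0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))

  // Aggregating against setA's key set leaves one entry per vertex.
  val setB: Map[Long, Double] = aggregateUsingIndex(setA.keySet, rddB)(_ + _)

  // Inner join on the shared key set.
  val setC: Map[Long, Double] =
    setA.collect { case (id, a) if setB.contains(id) => id -> (a + setB(id)) }
}
```

In this model each vertex in `setB` holds `1.0 + 2.0`, and joining with `setA` adds the attribute `1` on top.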

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 77d807874f..76de26c7cd 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -683,7 +683,60 @@ val sssp = initialGraph.pregel(Double.PositiveInfinity)(
 # Vertex and Edge RDDs
 <a name="vertex_and_edge_rdds"></a>
 
+GraphX exposes `RDD` views of the vertices and edges stored within the graph.  However, because
+GraphX maintains the vertices and edges in optimized data-structures and these data-structures
+provide additional functionality, the vertices and edges are returned as `VertexRDD` and `EdgeRDD`
+respectively.  In this section we review some of the additional useful functionality in these types.
 
+## VertexRDDs
+
+The `VertexRDD[A]` extends the more traditional `RDD[(VertexID, A)]` but adds the additional
+constraint that each `VertexID` occurs only *once*.  Moreover, `VertexRDD[A]` represents a *set* of
+vertices each with an attribute of type `A`.  Internally, this is achieved by storing the vertex
+attributes in a reusable hash-map data-structure.  As a consequence if two `VertexRDD`s are derived
+from the same base `VertexRDD` (e.g., by `filter` or `mapValues`) they can be joined in constant
+time without hash evaluations. To leverage this indexed data-structure, the `VertexRDD` exposes the
+following additional functionality:
+
+{% highlight scala %}
+// Filter the vertex set but preserve the internal index
+def filter(pred: Tuple2[VertexID, VD] => Boolean): VertexRDD[VD]
+// Transform the values without changing the ids (preserves the internal index)
+def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
+def mapValues[VD2](map: (VertexID, VD) => VD2): VertexRDD[VD2]
+// Remove vertices from this set that appear in the other set
+def diff(other: VertexRDD[VD]): VertexRDD[VD]
+// Join operators that take advantage of the internal indexing to accelerate joins (substantially)
+def leftJoin[VD2, VD3](other: RDD[(VertexID, VD2)])(f: (VertexID, VD, Option[VD2]) => VD3): VertexRDD[VD3]
+def innerJoin[U, VD2](other: RDD[(VertexID, U)])(f: (VertexID, VD, U) => VD2): VertexRDD[VD2]
+// Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
+def aggregateUsingIndex[VD2](other: RDD[(VertexID, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
+{% endhighlight %}
+
+Notice, for example, how the `filter` operator returns a `VertexRDD`.  Filter is actually
+implemented using a `BitSet` thereby reusing the index and preserving the ability to do fast joins
+with other `VertexRDD`s.  Likewise, the `mapValues` operators do not allow the `map` function to
+change the `VertexID` thereby enabling the same `HashMap` data-structures to be reused.  Both the
+`leftJoin` and `innerJoin` are able to identify when joining two `VertexRDD`s derived from the same
+`HashMap` and implement the join by linear scan rather than costly point lookups.
+
+The `aggregateUsingIndex` operator can be slightly confusing but is also useful for efficient
+construction of a new `VertexRDD` from an `RDD[(VertexID, A)]`.  Conceptually, if I have constructed
+a `VertexRDD[B]` over a set of vertices, *which is a super-set* of the vertices in some
+`RDD[(VertexID, A)]` then I can reuse the index to both aggregate and then subsequently index the
+RDD.  For example:
+
+{% highlight scala %}
+val setA: VertexRDD[Int] = VertexRDD(sc.parallelize(0L until 100L).map(id => (id, 1)))
+val rddB: RDD[(VertexID, Double)] = sc.parallelize(0L until 100L).flatMap(id => List((id, 1.0), (id, 2.0)))
+// There should be 200 entries in rddB
+rddB.count
+val setB: VertexRDD[Double] = setA.aggregateUsingIndex(rddB, _ + _)
+// There should be 100 entries in setB
+setB.count
+// Joining A and B should now be fast!
+val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
+{% endhighlight %}
 
 # Optimized Representation
 
-- 
cgit v1.2.3


From 59e4384e19b0d7390259fa42daae95ae6f12f793 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 21:02:09 -0800
Subject: Fix Pregel SSSP example in programming guide

---
 docs/graphx-programming-guide.md | 22 ++++++++++++++--------
 1 file changed, 14 insertions(+), 8 deletions(-)

(limited to 'docs')
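
The SSSP example fixed by this patch follows the usual Pregel loop: send a message along an edge when it offers a shorter distance, merge messages with `min`, and stop when no messages remain. The sketch below runs that loop on plain Scala collections so the super-step structure is visible without Spark. `Edge`, `sssp`, and the object name are hypothetical. The sketch assumes non-negative edge distances and that every edge endpoint is listed in `vertices`. Unlike the real operator, it rescans all edges in every super-step instead of using an active set.

```scala
object PregelSsspSketch {
  // Hypothetical edge with a distance attribute.
  final case class Edge(src: Long, dst: Long, dist: Double)

  def sssp(vertices: Set[Long], edges: Seq[Edge], source: Long): Map[Long, Double] = {
    // Initial state: 0 at the source, infinity elsewhere.
    var dists: Map[Long, Double] =
      vertices.map(v => v -> (if (v == source) 0.0 else Double.PositiveInfinity)).toMap
    var done = false
    while (!done) {
      // sendMsg + mergeMsg: offer a shorter distance along each edge,
      // keeping the minimum offer per destination vertex.
      val msgs: Map[Long, Double] = edges
        .filter(e => dists(e.src) + e.dist < dists(e.dst))
        .groupBy(_.dst)
        .map { case (dst, es) => dst -> es.map(e => dists(e.src) + e.dist).min }
      if (msgs.isEmpty) {
        done = true // no messages remaining: the computation has converged
      } else {
        // vprog: each vertex that received a message keeps the smaller distance.
        dists = dists ++ msgs.map { case (v, d) => v -> math.min(dists(v), d) }
      }
    }
    dists
  }
}
```

For example, with edges `1 -> 2` (1.0), `2 -> 3` (2.0), and `1 -> 3` (5.0), vertex 3 first receives 5.0 and then improves to 3.0 in the following super-step; an unreachable vertex stays at infinity.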

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 76de26c7cd..91cc5b69cc 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -511,7 +511,7 @@ In the following example we use the `mapReduceTriplets` operator to compute the
 more senior followers of each user.
 
 {% highlight scala %}
-// Import Random graph generation library
+// Import random graph generation library
 import org.apache.spark.graphx.util.GraphGenerators
 // Create a graph with "age" as the vertex property.  Here we use a random graph for simplicity.
 val graph: Graph[Double, Int] =
@@ -643,18 +643,23 @@ iterations, and the edge direction in which to send messages (by default along o
 second argument list contains the user defined functions for receiving messages (the vertex program
 `vprog`), computing messages (`sendMsg`), and combining messages `mergeMsg`.
 
-We can use the Pregel operator to express computation such single source shortest path in the
-following example.
+We can use the Pregel operator to express computation such as single source
+shortest path in the following example.
 
 {% highlight scala %}
-val graph: Graph[String, Double] // A graph with edge attributes containing distances
-val sourceId: VertexId = 42 // The ultimate source
+import org.apache.spark.graphx._
+// Import random graph generation library
+import org.apache.spark.graphx.util.GraphGenerators
+// A graph with edge attributes containing distances
+val graph: Graph[Int, Double] =
+  GraphGenerators.logNormalGraph(sc, numVertices = 100).mapEdges(e => e.attr.toDouble)
+val sourceId: VertexID = 42 // The ultimate source
 // Initialize the graph such that all vertices except the root have distance infinity.
-val initialGraph = graph.mapVertices((id, _) => if (id == shourceId) 0.0 else Double.PositiveInfinity)
+val initialGraph = graph.mapVertices((id, _) => if (id == sourceId) 0.0 else Double.PositiveInfinity)
 val sssp = initialGraph.pregel(Double.PositiveInfinity)(
-  (id, dist, newDist) => math.min(dist, newDist) // Vertex Program
+  (id, dist, newDist) => math.min(dist, newDist), // Vertex Program
   triplet => {  // Send Message
-    if(triplet.srcAttr + triplet.attr < triplet.dstAttr) {
+    if (triplet.srcAttr + triplet.attr < triplet.dstAttr) {
       Iterator((triplet.dstId, triplet.srcAttr + triplet.attr))
     } else {
       Iterator.empty
@@ -662,6 +667,7 @@ val sssp = initialGraph.pregel(Double.PositiveInfinity)(
   },
   (a,b) => math.min(a,b) // Merge Message
   )
+println(sssp.vertices.collect.mkString("\n"))
 {% endhighlight %}
 
 # Graph Builders
-- 
cgit v1.2.3


From e14a14bcde1637af04cc4c3bd708fed5670e4959 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 21:12:58 -0800
Subject: Remove K-Core and LDA sections from guide; they are unimplemented

---
 docs/graphx-programming-guide.md | 4 ----
 1 file changed, 4 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 91cc5b69cc..69cadc1e84 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -848,10 +848,6 @@ val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =
 println(triCountByUsername.collect().mkString("\n"))
 {% endhighlight %}
 
-## K-Core
-
-## LDA
-
 <p style="text-align: center;">
   <img src="img/tables_and_graphs.png"
        title="Tables and Graphs"
-- 
cgit v1.2.3


From 67795dbbfb3857e9677e3104b8bd6fd2cd5633a9 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 21:45:11 -0800
Subject: Write Graph Builders section in guide

---
 docs/graphx-programming-guide.md | 54 ++++++++++++++++++++++++++++++++++++----
 1 file changed, 49 insertions(+), 5 deletions(-)

(limited to 'docs')
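
The edge-list format described for `GraphLoader.edgeListFile` in this patch is simple enough to sketch by hand. The following plain-Scala model parses such lines, skips `#` comments, and optionally reorients edges so that `srcId < dstId`. `parseEdges` and `vertexIds` are hypothetical helpers, not part of GraphX. Skipping blank lines is an assumption of this sketch, and malformed lines with fewer than two fields would throw here.

```scala
object EdgeListSketch {
  // Parse "src dst" lines, skipping comments; optionally reorient so src < dst.
  def parseEdges(
      lines: Seq[String],
      canonicalOrientation: Boolean = false): Seq[(Long, Long)] =
    lines
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val fields = line.split("\\s+")
        val (src, dst) = (fields(0).toLong, fields(1).toLong)
        if (canonicalOrientation && src > dst) (dst, src) else (src, dst)
      }

  // Vertices are created implicitly from every ID an edge mentions.
  def vertexIds(edges: Seq[(Long, Long)]): Set[Long] =
    edges.flatMap { case (s, d) => Seq(s, d) }.toSet
}
```

Applied to the sample file from the guide, canonical orientation turns `2 1` and `1 2` into the same edge `(1, 2)`, which is why deduplication or `groupEdges` becomes relevant after reorientation.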

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 69cadc1e84..aadeb38960 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -673,13 +673,57 @@ println(sssp.vertices.collect.mkString("\n"))
 # Graph Builders
 <a name="graph_builders"></a>
 
-[`GraphLoader.edgeListFile`][GraphLoader.edgeListFile]
+GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. None of the graph builders repartitions the graph's edges by default; instead, edges are left in their default partitions (such as their original blocks in HDFS). [`Graph.groupEdges`][Graph.groupEdges] requires the graph to be repartitioned because it assumes identical edges will be colocated on the same partition, so you must call [`Graph.partitionBy`][Graph.partitionBy] before calling `groupEdges`.
 
-[`Graph.apply`][Graph.apply]
+{% highlight scala %}
+object GraphLoader {
+  def edgeListFile(
+      sc: SparkContext,
+      path: String,
+      canonicalOrientation: Boolean = false,
+      minEdgePartitions: Int = 1)
+    : Graph[Int, Int]
+}
+{% endhighlight %}
+
+[`GraphLoader.edgeListFile`][GraphLoader.edgeListFile] provides a way to load a graph from a list of edges on disk. It parses an edge list of (source vertex ID, destination vertex ID) pairs of the following form, skipping comment lines that begin with `#`:
+
+~~~
+# This is a comment
+2 1
+4 1
+1 2
+~~~
+
+It creates a `Graph` from the specified edges, automatically creating any vertices mentioned by edges. All vertex and edge attributes default to 1. The `canonicalOrientation` argument allows reorienting edges in the positive direction (`srcId < dstId`), which is required by the [connected components][ConnectedComponents] algorithm. The `minEdgePartitions` argument specifies the minimum number of edge partitions to generate; there may be more edge partitions than specified if, for example, the HDFS file has more blocks.
+
+{% highlight scala %}
+object Graph {
+  def apply[VD, ED](
+      vertices: RDD[(VertexID, VD)],
+      edges: RDD[Edge[ED]],
+      defaultVertexAttr: VD = null)
+    : Graph[VD, ED]
+
+  def fromEdges[VD, ED](
+      edges: RDD[Edge[ED]],
+      defaultValue: VD): Graph[VD, ED]
+
+  def fromEdgeTuples[VD](
+      rawEdges: RDD[(VertexID, VertexID)],
+      defaultValue: VD,
+      uniqueEdges: Option[PartitionStrategy] = None): Graph[VD, Int]
+
+}
+{% endhighlight %}
+
+[`Graph.apply`][Graph.apply] allows creating a graph from RDDs of vertices and edges. Duplicate vertices are picked arbitrarily and vertices found in the edge RDD but not the vertex RDD are assigned the default attribute.
+
+[`Graph.fromEdges`][Graph.fromEdges] allows creating a graph from only an RDD of edges, automatically creating any vertices mentioned by edges and assigning them the default value.
 
-[`Graph.fromEdgeTuples`][Graph.fromEdgeTuples]
+[`Graph.fromEdgeTuples`][Graph.fromEdgeTuples] allows creating a graph from only an RDD of edge tuples, assigning the edges the value 1, and automatically creating any vertices mentioned by edges and assigning them the default value. It also supports deduplicating the edges; to deduplicate, pass `Some` of a [`PartitionStrategy`][PartitionStrategy] as the `uniqueEdges` parameter (for example, `uniqueEdges = Some(PartitionStrategy.RandomVertexCut)`). A partition strategy is necessary to colocate identical edges on the same partition so they can be deduplicated.
 
-[`Graph.fromEdges`][Graph.fromEdges]
+[PartitionStrategy]: api/graphx/index.html#org.apache.spark.graphx.PartitionStrategy$
 
 [GraphLoader.edgeListFile]: api/graphx/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]
 [Graph.apply]: api/graphx/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexID,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]
@@ -826,7 +870,7 @@ println(ccByUsername.collect().mkString("\n"))
 
 ## Triangle Counting
 
-A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the [`TriangleCount` object][TriangleCount] that determines the number of triangles passing through each vertex, providing a measure of clustering. We compute the triangle count of the social network dataset from the [PageRank section](#pagerank). *Note that `TriangleCount` requires the edges to be in canonical orientation (`srcId < dstId`) and the graph to be partitioned using [`Graph#partitionBy`][Graph.partitionBy].*
+A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the [`TriangleCount` object][TriangleCount] that determines the number of triangles passing through each vertex, providing a measure of clustering. We compute the triangle count of the social network dataset from the [PageRank section](#pagerank). *Note that `TriangleCount` requires the edges to be in canonical orientation (`srcId < dstId`) and the graph to be partitioned using [`Graph.partitionBy`][Graph.partitionBy].*
 
 [TriangleCount]: api/graphx/index.html#org.apache.spark.graphx.lib.TriangleCount$
 [Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]
-- 
cgit v1.2.3


From 6f6f8c928ce493357d4d32e46971c5e401682ea8 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 21:55:35 -0800
Subject: Wrap methods in the appropriate class/object declaration

---
 docs/graphx-programming-guide.md | 149 ++++++++++++++++++++++-----------------
 1 file changed, 85 insertions(+), 64 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index aadeb38960..29d397c371 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -256,7 +256,7 @@ compute the in-degree of each vertex (defined in `GraphOps`) by the following:
 {% highlight scala %}
 val graph: Graph[(String, String), String]
 // Use the implicit GraphOps.inDegrees operator
-val indDegrees: VertexRDD[Int] = graph.inDegrees
+val inDegrees: VertexRDD[Int] = graph.inDegrees
 {% endhighlight %}
 
 The reason for differentiating between core graph operations and [`GraphOps`][GraphOps] is to be
@@ -270,9 +270,11 @@ In direct analogy to the RDD `map` operator, the property
 graph contains the following:
 
 {% highlight scala %}
-def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
-def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
-def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
+class Graph[VD, ED] {
+  def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
+  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
+  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
+}
 {% endhighlight %}
 
 Each of these operators yields a new graph with the vertex or edge properties modified by the user
@@ -314,11 +316,13 @@ Currently GraphX supports only a simple set of commonly used structural operator
 add more in the future.  The following is a list of the basic structural operators.
 
 {% highlight scala %}
-def reverse: Graph[VD, ED]
-def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
-             vpred: (VertexID, VD) => Boolean): Graph[VD, ED]
-def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
-def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
+class Graph[VD, ED] {
+  def reverse: Graph[VD, ED]
+  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
+               vpred: (VertexID, VD) => Boolean): Graph[VD, ED]
+  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
+  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
+}
 {% endhighlight %}
 
 The [`reverse`][Graph.reverse] operator returns a new graph with all the edge directions reversed.
@@ -400,10 +404,12 @@ might want to pull vertex properties from one graph into another.  These tasks c
 using the *join* operators. Below we list the key join operators:
 
 {% highlight scala %}
-def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD)
-  : Graph[VD, ED]
-def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Option[U]) => VD2)
-  : Graph[VD2, ED]
+class Graph[VD, ED] {
+  def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD)
+    : Graph[VD, ED]
+  def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Option[U]) => VD2)
+    : Graph[VD2, ED]
+}
 {% endhighlight %}
 
 The [`joinVertices`][GraphOps.joinVertices] operator joins the vertices with the input RDD and
@@ -470,10 +476,12 @@ The core (heavily optimized) aggregation primitive in GraphX is the
 [`mapReduceTriplets`][Graph.mapReduceTriplets] operator:
 
 {% highlight scala %}
-def mapReduceTriplets[A](
-    map: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
-    reduce: (A, A) => A)
-  : VertexRDD[A]
+class Graph[VD, ED] {
+  def mapReduceTriplets[A](
+      map: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
+      reduce: (A, A) => A)
+    : VertexRDD[A]
+}
 {% endhighlight %}
 
 The [`mapReduceTriplets`][Graph.mapReduceTriplets] operator takes a user defined map function which
@@ -564,12 +572,19 @@ val maxDegrees: (VertexID, Int)   = graph.degrees.reduce(max)
 ### Collecting Neighbors
 
 In some cases it may be easier to express computation by collecting neighboring vertices and their
-attributes at each vertex. This can be easily accomplished using the `collectNeighborIds` and the
-`collectNeighbors` operators.
+attributes at each vertex. This can be easily accomplished using the
+[`collectNeighborIds`][GraphOps.collectNeighborIds] and the
+[`collectNeighbors`][GraphOps.collectNeighbors] operators.
+
+[GraphOps.collectNeighborIds]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@collectNeighborIds(EdgeDirection):VertexRDD[Array[VertexID]]
+[GraphOps.collectNeighbors]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@collectNeighbors(EdgeDirection):VertexRDD[Array[(VertexID,VD)]]
+
 
 {% highlight scala %}
-def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexID]] =
-def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[ Array[(VertexID, VD)] ]
+class GraphOps[VD, ED] {
+  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexID]]
+  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[ Array[(VertexID, VD)] ]
+}
 {% endhighlight %}
 
 > Note that these operators can be quite costly as they duplicate information and require
@@ -600,40 +615,44 @@ messages remaining.
 > neighboring vertices and the message construction is done in parallel using a user defined
 > messaging function.  These constraints allow additional optimization within GraphX.
 
-The following is type signature of the Pregel operator as well as a *sketch* of its implementation
-(note calls to graph.cache have been removed):
+The following is the type signature of the [Pregel operator][GraphOps.pregel] as well as a *sketch*
+of its implementation (note calls to graph.cache have been removed):
+
+[GraphOps.pregel]: api/graphx/index.html#org.apache.spark.graphx.GraphOps@pregel[A](A,Int,EdgeDirection)((VertexID,VD,A)⇒VD,(EdgeTriplet[VD,ED])⇒Iterator[(VertexID,A)],(A,A)⇒A)(ClassTag[A]):Graph[VD,ED]
 
 {% highlight scala %}
-def pregel[A]
-    (initialMsg: A,
-     maxIter: Int = Int.MaxValue,
-     activeDir: EdgeDirection = EdgeDirection.Out)
-    (vprog: (VertexID, VD, A) => VD,
-     sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
-     mergeMsg: (A, A) => A)
-  : Graph[VD, ED] = {
-  // Receive the initial message at each vertex
-  var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
-  // compute the messages
-  var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
-  var activeMessages = messages.count()
-  // Loop until no messages remain or maxIterations is achieved
-  var i = 0
-  while (activeMessages > 0 && i < maxIterations) {
-    // Receive the messages: -----------------------------------------------------------------------
-    // Run the vertex program on all vertices that receive messages
-    val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
-    // Merge the new vertex values back into the graph
-    g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
-    // Send Messages: ------------------------------------------------------------------------------
-    // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
-    // get to send messages.  More precisely the map phase of mapReduceTriplets is only invoked
-    // on edges in the activeDir of vertices in newVerts
-    messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
-    activeMessages = messages.count()
-    i += 1
+class GraphOps[VD, ED] {
+  def pregel[A]
+      (initialMsg: A,
+       maxIter: Int = Int.MaxValue,
+       activeDir: EdgeDirection = EdgeDirection.Out)
+      (vprog: (VertexID, VD, A) => VD,
+       sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexID, A)],
+       mergeMsg: (A, A) => A)
+    : Graph[VD, ED] = {
+    // Receive the initial message at each vertex
+    var g = mapVertices( (vid, vdata) => vprog(vid, vdata, initialMsg) ).cache()
+    // compute the messages
+    var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
+    var activeMessages = messages.count()
+    // Loop until no messages remain or maxIter is reached
+    var i = 0
+    while (activeMessages > 0 && i < maxIter) {
+      // Receive the messages: -----------------------------------------------------------------------
+      // Run the vertex program on all vertices that receive messages
+      val newVerts = g.vertices.innerJoin(messages)(vprog).cache()
+      // Merge the new vertex values back into the graph
+      g = g.outerJoinVertices(newVerts) { (vid, old, newOpt) => newOpt.getOrElse(old) }.cache()
+      // Send Messages: ------------------------------------------------------------------------------
+      // Vertices that didn't receive a message above don't appear in newVerts and therefore don't
+      // get to send messages.  More precisely the map phase of mapReduceTriplets is only invoked
+      // on edges in the activeDir of vertices in newVerts
+      messages = g.mapReduceTriplets(sendMsg, mergeMsg, Some((newVerts, activeDir))).cache()
+      activeMessages = messages.count()
+      i += 1
+    }
+    g
   }
-  g
 }
 {% endhighlight %}
 
@@ -749,18 +768,20 @@ time without hash evaluations. To leverage this indexed data-structure, the `Ver
 following additional functionality:
 
 {% highlight scala %}
-// Filter the vertex set but preserves the internal index
-def filter(pred: Tuple2[VertexID, VD] => Boolean): VertexRDD[VD]
-// Transform the values without changing the ids (preserves the internal index)
-def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
-def mapValues[VD2](map: (VertexID, VD) => VD2): VertexRDD[VD2]
-// Remove vertices from this set that appear in the other set
-def diff(other: VertexRDD[VD]): VertexRDD[VD]
-// Join operators that take advantage of the internal indexing to accelerate joins (substantially)
-def leftJoin[VD2, VD3](other: RDD[(VertexID, VD2)])(f: (VertexID, VD, Option[VD2]) => VD3): VertexRDD[VD3]
-def innerJoin[U, VD2](other: RDD[(VertexID, U)])(f: (VertexID, VD, U) => VD2): VertexRDD[VD2]
-// Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
-def aggregateUsingIndex[VD2](other: RDD[(VertexID, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
+class VertexRDD[VD] {
+  // Filter the vertex set but preserve the internal index
+  def filter(pred: Tuple2[VertexID, VD] => Boolean): VertexRDD[VD]
+  // Transform the values without changing the ids (preserves the internal index)
+  def mapValues[VD2](map: VD => VD2): VertexRDD[VD2]
+  def mapValues[VD2](map: (VertexID, VD) => VD2): VertexRDD[VD2]
+  // Remove vertices from this set that appear in the other set
+  def diff(other: VertexRDD[VD]): VertexRDD[VD]
+  // Join operators that take advantage of the internal indexing to accelerate joins (substantially)
+  def leftJoin[VD2, VD3](other: RDD[(VertexID, VD2)])(f: (VertexID, VD, Option[VD2]) => VD3): VertexRDD[VD3]
+  def innerJoin[U, VD2](other: RDD[(VertexID, U)])(f: (VertexID, VD, U) => VD2): VertexRDD[VD2]
+  // Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.
+  def aggregateUsingIndex[VD2](other: RDD[(VertexID, VD2)], reduceFunc: (VD2, VD2) => VD2): VertexRDD[VD2]
+}
 {% endhighlight %}
 
 Notice, for example,  how the `filter` operator returns an `VertexRDD`.  Filter is actually
-- 
cgit v1.2.3


From 2cd9358ccf28186e88016b6542d7c0c90536651f Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 22:29:23 -0800
Subject: Finish 6f6f8c928ce493357d4d32e46971c5e401682ea8

---
 docs/graphx-programming-guide.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 29d397c371..226299a759 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -125,8 +125,10 @@ properties for each vertex and edge.  As a consequence, the graph class contains
 the vertices and edges of the graph:
 
 {% highlight scala %}
-val vertices: VertexRDD[VD]
-val edges: EdgeRDD[ED]
+class Graph[VD, ED] {
+  val vertices: VertexRDD[VD]
+  val edges: EdgeRDD[ED]
+}
 {% endhighlight %}
 
 The classes `VertexRDD[VD]` and `EdgeRDD[ED]` extend and are optimized versions of `RDD[(VertexId,
-- 
cgit v1.2.3


From af645be5b8d41d5a0fd4a529956c5ab438198db4 Mon Sep 17 00:00:00 2001
From: Ankur Dave <ankurdave@gmail.com>
Date: Mon, 13 Jan 2014 22:29:45 -0800
Subject: Fix all code examples in guide

---
 docs/graphx-programming-guide.md | 46 ++++++++++++++++++++--------------------
 graphx/data/users.txt            | 13 ++++++------
 2 files changed, 30 insertions(+), 29 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 226299a759..a7ab00306e 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -357,7 +357,7 @@ val relationships: RDD[Edge[String]] =
 val defaultUser = ("John Doe", "Missing")
 // Build the initial Graph
 val graph = Graph(users, relationships, defaultUser)
-// Notice that there is a user 0 (for which we have no information) connecting users
+// Notice that there is a user 0 (for which we have no information) connected to users
 // 4 (peter) and 5 (franklin).
 graph.triplets.map(
     triplet => triplet.srcAttr._1 + " is the " + triplet.attr + " of " + triplet.dstAttr._1
@@ -858,11 +858,11 @@ val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
 val ranks = graph.pageRank(0.0001).vertices
 // Join the ranks with the usernames
 val users = sc.textFile("graphx/data/users.txt").map { line =>
-  val fields = line.split("\\s+")
+  val fields = line.split(",")
   (fields(0).toLong, fields(1))
 }
-val ranksByUsername = users.leftOuterJoin(ranks).map {
-  case (id, (username, rankOpt)) => (username, rankOpt.getOrElse(0.0))
+val ranksByUsername = users.join(ranks).map {
+  case (id, (username, rank)) => (username, rank)
 }
 // Print the result
 println(ranksByUsername.collect().mkString("\n"))
@@ -881,11 +881,11 @@ val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
 val cc = graph.connectedComponents().vertices
 // Join the connected components with the usernames
 val users = sc.textFile("graphx/data/users.txt").map { line =>
-  val fields = line.split("\\s+")
+  val fields = line.split(",")
   (fields(0).toLong, fields(1))
 }
-val ccByUsername = users.join(cc).map { case (id, (username, cc)) =>
-  (username, cc)
+val ccByUsername = users.join(cc).map {
+  case (id, (username, cc)) => (username, cc)
 }
 // Print the result
 println(ccByUsername.collect().mkString("\n"))
@@ -900,12 +900,12 @@ A vertex is part of a triangle when it has two adjacent vertices with an edge be
 
 {% highlight scala %}
 // Load the edges in canonical order and partition the graph for triangle count
-val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(RandomVertexCut)
+val graph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt", true).partitionBy(PartitionStrategy.RandomVertexCut)
 // Find the triangle count for each vertex
 val triCounts = graph.triangleCount().vertices
 // Join the triangle counts with the usernames
 val users = sc.textFile("graphx/data/users.txt").map { line =>
-  val fields = line.split("\\s+")
+  val fields = line.split(",")
   (fields(0).toLong, fields(1))
 }
 val triCountByUsername = users.join(triCounts).map { case (id, (username, tc)) =>
@@ -934,32 +934,32 @@ all of this in just a few lines with GraphX:
 // Connect to the Spark cluster
 val sc = new SparkContext("spark://master.amplab.org", "research")
 
-// Load my user data and prase into tuples of user id and attribute list
-val users = sc.textFile("hdfs://user_attributes.tsv")
-  .map(line => line.split).map( parts => (parts.head, parts.tail) )
+// Load my user data and parse into tuples of user id and attribute list
+val users = (sc.textFile("graphx/data/users.txt")
+  .map(line => line.split(",")).map( parts => (parts.head.toLong, parts.tail) ))
 
 // Parse the edge data which is already in userId -> userId format
-val followerGraph = Graph.textFile(sc, "hdfs://followers.tsv")
+val followerGraph = GraphLoader.edgeListFile(sc, "graphx/data/followers.txt")
 
 // Attach the user attributes
-val graph = followerGraph.outerJoinVertices(users){
+val graph = followerGraph.outerJoinVertices(users) {
   case (uid, deg, Some(attrList)) => attrList
   // Some users may not have attributes so we set them as empty
   case (uid, deg, None) => Array.empty[String]
-  }
+}
 
-// Restrict the graph to users which have exactly two attributes
-val subgraph = graph.subgraph((vid, attr) => attr.size == 2)
+// Restrict the graph to users with usernames and names
+val subgraph = graph.subgraph(vpred = (vid, attr) => attr.size == 2)
 
 // Compute the PageRank
-val pagerankGraph = Analytics.pagerank(subgraph)
+val pagerankGraph = subgraph.pageRank(0.001)
 
 // Get the attributes of the top pagerank users
-val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
-  case (uid, attrList, Some(pr)) => (pr, attrList)
-  case (uid, attrList, None) => (pr, attrList)
-  }
+val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
+  case (uid, attrList, Some(pr)) => (pr, attrList.toList)
+  case (uid, attrList, None) => (0.0, attrList.toList)
+}
 
-println(userInfoWithPageRank.top(5))
+println(userInfoWithPageRank.vertices.top(5)(Ordering.by(_._2._1)).mkString("\n"))
 
 {% endhighlight %}
diff --git a/graphx/data/users.txt b/graphx/data/users.txt
index 26e3b3bb4d..982d19d50b 100644
--- a/graphx/data/users.txt
+++ b/graphx/data/users.txt
@@ -1,6 +1,7 @@
-1 BarackObama
-2 ladygaga
-3 jeresig
-4 justinbieber
-6 matei_zaharia
-7 odersky
+1,BarackObama,Barack Obama
+2,ladygaga,Goddess of Love
+3,jeresig,John Resig
+4,justinbieber,Justin Bieber
+6,matei_zaharia,Matei Zaharia
+7,odersky,Martin Odersky
+8,anonsys
-- 
cgit v1.2.3


From 4bafc4f41f5c5ed686c024d5f49cf31bbc08ce88 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez" <joseph.e.gonzalez@gmail.com>
Date: Mon, 13 Jan 2014 22:55:26 -0800
Subject: adding documentation about EdgeRDD

---
 docs/graphx-programming-guide.md | 42 ++++++++++++++++++++++++++++++++++++++--
 1 file changed, 40 insertions(+), 2 deletions(-)

(limited to 'docs')

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index a7ab00306e..9fbde4eb09 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -811,10 +811,34 @@ setB.count
 val setC: VertexRDD[Double] = setA.innerJoin(setB)((id, a, b) => a + b)
 {% endhighlight %}
 
+## EdgeRDDs
+
+The `EdgeRDD[ED]`, which extends `RDD[Edge[ED]]`, is considerably simpler than the `VertexRDD`.
+GraphX organizes the edges in blocks partitioned using one of the various partitioning strategies
+defined in [`PartitionStrategy`][PartitionStrategy].  Within each partition, edge attributes and
+adjacency structure are stored separately, enabling maximum reuse when changing attribute values.
+
+[PartitionStrategy]: api/graphx/index.html#org.apache.spark.graphx.PartitionStrategy
+
+The three additional functions exposed by the `EdgeRDD` are:
+{% highlight scala %}
+// Transform the edge attributes while preserving the structure
+def mapValues[ED2](f: Edge[ED] => ED2): EdgeRDD[ED2]
+// Reverse the edges reusing both attributes and structure
+def reverse: EdgeRDD[ED]
+// Join two `EdgeRDD`s partitioned using the same partitioning strategy.
+def innerJoin[ED2, ED3](other: EdgeRDD[ED2])(f: (VertexID, VertexID, ED, ED2) => ED3): EdgeRDD[ED3]
+{% endhighlight %}
+
+In most applications we have found that operations on the `EdgeRDD` are accomplished through the
+graph or rely on operations defined in the base `RDD` class.
+
 # Optimized Representation
 
-This section should give some intuition about how GraphX works and how that affects the user (e.g.,
-things to worry about.)
+While a detailed description of the optimizations used in the GraphX representation of distributed
+graphs is beyond the scope of this guide, some high-level understanding may aid in the design of
+scalable algorithms as well as optimal use of the API.  GraphX adopts a vertex-cut approach to
+distributed graph partitioning:
 
 <p style="text-align: center;">
   <img src="img/edge_cut_vs_vertex_cut.png"
@@ -824,6 +848,15 @@ things to worry about.)
   <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
+Rather than splitting graphs along edges, GraphX partitions the graph along vertices, which can
+reduce both the communication and storage overhead.  Logically, this corresponds to assigning edges
+to machines and allowing vertices to span multiple machines.  The exact method of assigning edges
+depends on the [`PartitionStrategy`][PartitionStrategy] and there are several tradeoffs to the
+various heuristics.  Users can choose between different strategies by repartitioning the graph with
+the [`Graph.partitionBy`][Graph.partitionBy] operator.
+
+[Graph.partitionBy]: api/graphx/index.html#org.apache.spark.graphx.Graph$@partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED]
+
 <p style="text-align: center;">
   <img src="img/vertex_routing_edge_tables.png"
        title="RDD Graph Representation"
@@ -832,6 +865,11 @@ things to worry about.)
   <!-- Images are downsized intentionally to improve quality on retina displays -->
 </p>
 
+Once the edges have been partitioned, the key challenge to efficient graph-parallel computation is
+efficiently joining vertex attributes with the edges.  Because real-world graphs typically have more
+edges than vertices, we move vertex attributes to the edges.
+
+
 
 
 
-- 
cgit v1.2.3