Finished documenting structural operators and starting join operators.

author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> 2014-01-11 11:28:01 -0800
committer: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> 2014-01-11 11:28:01 -0800
commit: fac44bbe2c10633e371cf30afa17c5e78290ca9c (patch)
tree: 21d4b929fb8570def98b2abbc1a2ef80b14f7925 /docs/graphx-programming-guide.md
parent: 1f45e4e572130e989cf1f91655c22352ac33b063 (diff)
download: spark-fac44bbe2c10633e371cf30afa17c5e78290ca9c.tar.gz
spark-fac44bbe2c10633e371cf30afa17c5e78290ca9c.tar.bz2
spark-fac44bbe2c10633e371cf30afa17c5e78290ca9c.zip
1 files changed, 72 insertions, 18 deletions
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 9a65745930..a5e75e2cb0 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -95,8 +95,10 @@ If you are not using the Spark shell you will also need a Spark context.
 # The Property Graph
 <a name="property_graph"></a>
 
-The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with
-user defined objects attached to each vertex and edge.  Like RDDs, property graphs are immutable,
+The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed multigraph
+graph with user defined objects attached to each vertex and edge.  As a multigraph it is possible
+for multiple edges to have the same source and destination vertex.  This can be useful when there
+are multiple relationships between the same vertices.  Like RDDs, property graphs are immutable,
 distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
 a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
 Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
@@ -106,7 +108,7 @@ of the objects associated with each vertex and edge respectively.  In some cases
 to have vertices of different types.  However, this can be accomplished through inheritance.
 
 > GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
-> int, double, etc...) reducing the memory overhead of the graph representation.
+> int, double, etc...) reducing the in memory footprint.
 
 Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
 properties for each vertex and edge:
@@ -224,7 +226,22 @@ val facts: RDD[String] =
 
 Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
 have a collection of basic operators that take user defined function and produce new graphs with
-transformed properties and structure.
+transformed properties and structure.  The core operators that have optimized implementations are
+defined in [`Graph.scala`](api/graphx/index.html#org.apache.spark.graphx.Graph) and convenient
+operators that are expressed as a compositions of the core operators are defined in
+['GraphOps.scala'](api/graphx/index.html#org.apache.spark.graphx.GraphOps).  However, thanks to
+Scala implicits the operators in `GraphOps.scala` are automatically available as members of
+`Graph.scala`.  For example, we can compute the in-degree of each vertex (defined in
+'GraphOps.scala') by the following:
+
+{% highlight scala %}
+val graph: Graph[(String, String), String]
+// Use the implicit GraphOps.inDegrees operator
+val indDegrees: VertexRDD[Int] = graph.inDegrees
+{% endhighlight %}
+
+The reason for differentiating between core graph operations and GraphOps is to be able to support
+various graph representations in the future.
 
 ## Property Operators
 
@@ -232,9 +249,9 @@ In direct analogy to the RDD `map` operator, the property
 graph contains the following:
 
 {% highlight scala %}
-  def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
-  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
-  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
+def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
+def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
+def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
 {% endhighlight %}
 
 Each of these operators yields a new graph with the vertex or edge properties modified by the user
@@ -271,35 +288,72 @@ Currently GraphX supports only a simple set of commonly used structural operator
 add more in the future.  The following is a list of the basic structural operators.
 
 {% highlight scala %}
-  def reverse: Graph[VD, ED]
+def reverse: Graph[VD, ED]
 
-  def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
-    vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
+def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
+  vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
 
-  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
+def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
 
-  def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
+def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
 {% endhighlight %}
 
-The `rerverse` operator returns a new graph with all the edge directions reversed.  This can be
-useful when for example trying to compute the inverse PageRank.
+The `reverse` operator returns a new graph with all the edge directions reversed.  This can be
+useful when, for example, trying to compute the inverse PageRank.  Because the reverse operation
+does not modify vertex or edge properties or change the number of edges, it can be implemented
+efficiently without data-movement or duplication.
 
 The `subgraph` operator takes vertex and edge predicates and returns the graph containing only the
 vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
 predicate *and connect vertices that satisfy the vertex predicate*.  The `subgraph` operator can be
 used in number of situations to restrict the graph to the vertices and edges of interest or
-eliminate broken links.
+eliminate broken links.  For example in the following code we remove broken links:
 
-The `mask` operators returns the subgraph containing vertices and edges that are found in the input
-graph.  Finish this description ...
+{% highlight scala %}
+val users: RDD[(VertexId, (String, String))]
+val edges: RDD[Edge[String]]
+// Define a default user in case there are relationship with missing user
+val defaultUser = ("John Doe", "Missing")
+// Build the initial Graph
+val graph = Graph(users, relationships, defaultUser)
+// Remove missing vertices as well as the edges to connected to them
+val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
+{% endhighlight %}
 
-The `groupEdges` operator merges ...
+The `mask` operators returns the subgraph containing only the vertices and edges that are found in
+the input graph.  This can be used in conjunction with the `subgraph` operator to restrict a graph
+based on the properties in another related graph.  For example, we might run connected components
+using the graph with missing vertices and then restrict the answer to the valid subgraph.
 
+{% highlight scala %}
+// Run Connected Components
+val ccGraph = graph.connectedComponents()
+// Remove missing vertices as well as the edges to connected to them
+val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
+// Restrict the answer to the valid subgraph
+val validCCGraph = ccGraph.mask(validGraph)
+{% endhighlight %}
 
+The `groupEdges` operator merges parallel edges: duplicate edges between pairs of vertices.  In many
+numerical applications parallel edges can be *added* (their weights combined) into a single edge
+thereby reducing the graph size in memory as well as the cost of computation.
 
 ## Join Operators
 <a name="join_operators"></a>
 
+The ability to move between graph and collection views of data is a key part of GraphX.  In many
+cases it is necessary to bring data from external collections into the graph.  For example, we might
+have extra user properties that we want to merge with an existing graph or we might want to pull
+vertex properties from one graph into another.  These tasks can be accomplished using the *join*
+operators.  Below we list the key join operators:
+
+{% highlight scala %}
+def joinVertices[U](table: RDD[(VertexID, U)])(mapFunc: (VertexID, VD, U) => VD)
+  : Graph[VD, ED]
+def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(mapFunc: (VertexID, VD, Option[U]) => VD2)
+  : Graph[VD2, ED]
+{% endhighlight %}
+
 ## Map Reduce Triplets (mapReduceTriplets)
 <a name="mrTriplets"></a>
author	Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>	2014-01-11 11:28:01 -0800
committer	Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>	2014-01-11 11:28:01 -0800
commit	fac44bbe2c10633e371cf30afa17c5e78290ca9c (patch)
tree	21d4b929fb8570def98b2abbc1a2ef80b14f7925 /docs/graphx-programming-guide.md
parent	1f45e4e572130e989cf1f91655c22352ac33b063 (diff)
download	spark-fac44bbe2c10633e371cf30afa17c5e78290ca9c.tar.gz spark-fac44bbe2c10633e371cf30afa17c5e78290ca9c.tar.bz2 spark-fac44bbe2c10633e371cf30afa17c5e78290ca9c.zip