aboutsummaryrefslogtreecommitdiff
path: root/docs
diff options
context:
space:
mode:
Diffstat (limited to 'docs')
-rw-r--r--docs/graphx-programming-guide.md231
-rw-r--r--docs/img/graphx_figures.pptxbin1118035 -> 1123365 bytes
-rw-r--r--docs/img/property_graph.pngbin79056 -> 225151 bytes
-rw-r--r--docs/img/triplet.pngbin0 -> 31489 bytes
4 files changed, 215 insertions, 16 deletions
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index 8ae5f17e12..b46cc00d04 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -11,6 +11,7 @@ title: GraphX Programming Guide
title="GraphX Logo"
alt="GraphX"
width="65%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
# Overview
@@ -42,6 +43,7 @@ magnitude faster than more general *data-parallel* systems.
title="Data-Parallel vs. Graph-Parallel"
alt="Data-Parallel vs. Graph-Parallel"
width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
However, the same restrictions that enable these substantial performance gains
@@ -56,14 +58,15 @@ movement and duplication and a complicated programming model.
title="Graph Analytics Pipeline"
alt="Graph Analytics Pipeline"
width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
The goal of the GraphX project is to unify graph-parallel and data-parallel
-computation in one system with a single composable API. This goal is achieved
-through an API that enables users to view data both as a graph and as
-collections (i.e., RDDs) without data movement or duplication and by
-incorporating advances in graph-parallel systems to optimize the execution of
-operations on the graph view. In preliminary experiments we find that the GraphX
+computation in one system with a single composable API. The GraphX API
+enables users to view data both as a graph and as
+collection (i.e., RDDs) without data movement or duplication. By
+incorporating recent advances in graph-parallel systems, GraphX is able to optimize
+the execution of graph operations. In preliminary experiments we find that the GraphX
system is able to achieve performance comparable to state-of-the-art
graph-parallel systems while easily expressing the entire analytics pipelines.
@@ -72,6 +75,7 @@ graph-parallel systems while easily expressing the entire analytics pipelines.
title="GraphX Performance Comparison"
alt="GraphX Performance Comparison"
width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
## GraphX Replaces the Spark Bagel API
@@ -86,47 +90,242 @@ support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
Bagel users to explore the new GraphX API and comment on issues that may
complicate the transition from Bagel.
+# Getting Started
+
+To get started you first need to import Spark and GraphX into your project. This can be done by
+importing the following:
+
+{% highlight scala %}
+import org.apache.spark._
+import org.apache.spark.graphx._
+{% endhighlight %}
+
+If you are not using the Spark shell you will also need a Spark context.
+
# The Property Graph
<a name="property_graph"></a>
-<p style="text-align: center;">
- <img src="img/edge_cut_vs_vertex_cut.png"
- title="Edge Cut vs. Vertex Cut"
- alt="Edge Cut vs. Vertex Cut"
- width="50%" />
-</p>
+The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with
+user defined objects attached to each vertex and edge. Like RDDs, property graphs are immutable,
+distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is
+a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers.
+Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers.
+
+The property graph is parameterized over the vertex `VD` and edge `ED` types. These are the types
+of the objects associated with each vertex and edge respectively. In some cases it can be desirable
+to have vertices of different types. However, this can be accomplished through inheritance.
+
+> GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
+> int, double, etc...) reducing the memory overhead of the graph representation.
+
+Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
+properties for each vertex and edge:
+
+{% highlight scala %}
+class Graph[VD: ClassTag, ED: ClassTag] {
+ val vertices: RDD[(VertexId, VD)]
+ val edges: RDD[Edge[ED]]
+ // ...
+}
+{% endhighlight %}
+> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and
+> `EdgeRDD[ED]` respectively. These types extend and are optimized versions of `RDD[(VertexId, VD)]`
+> and `RDD[Edge[ED]]`.
+
+For example, we might construct a property graph consisting of various collaborators on the GraphX
+project. The vertex property contains the username and occupation and the edge property contains
+a string describing the relationships between the users.
<p style="text-align: center;">
<img src="img/property_graph.png"
title="The Property Graph"
alt="The Property Graph"
width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
+The resulting graph would have the type signature:
+
+{% highlight scala %}
+val userGraph: Graph[(String, String), String]
+{% endhighlight %}
+
+There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
+generators and these are discussed in more detail in the section on
+[graph builders](#graph_builders). Probably the most general method is to use the
+[graph singleton](api/graphx/index.html#org.apache.spark.graphx.Graph$).
+For example the following code constructs a graph from a collection of RDDs:
+
+{% highlight scala %}
+// Assume the SparkContext has already been constructed
+val sc: SparkContext
+// Create an RDD for the vertices
+val users: RDD[(VertexId, (String, String))] =
+ sc.parallelize(Array((3, ("rxin", "student")), (7, ("jgonzal", "postdoc")),
+ (5, ("franklin", "prof")), (2, ("istoica", "prof"))))
+// Create an RDD for edges
+val relationships: RDD[Edge[String]] =
+ sc.parallelize(Array(Edge(3, 7, "collab"), Edge(5, 3, "advisor"),
+ Edge(2, 5, "colleague"), Edge(5, 7, "pi"))
+// Define a default user in case there are relationship with missing user
+val defaultUser = ("John Doe", "Missing")
+// Build the initial Graph
+val graph = Graph(users, relationships, defaultUser)
+{% endhighlight %}
+
+In the above example we make use of the [`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge)
+case class. Edges have a `srcId` and a `dstId` corresponding to the source and destination vertex
+identifiers. In addition, the `Edge` class contains the `attr` member which contains the edge
+property.
+
+We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices`
+and `graph.edges` members respectively.
+
+{% highlight scala %}
+val graph: Graph[(String, String), String] // Constructed from above
+// Count all users which are postdocs
+graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
+// Count all the edges where src > dst
+graph.edges.filter(e => e.srcId > e.dstId).count
+{% endhighlight %}
+
+> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the
+> scala `case` expression to deconstruct the tuple. Alternatively, `graph.edges` returns an `RDD`
+> containing `Edge[String]` objects. We could have also used the case class type constructor as
+> in the following:
+> {% highlight scala %}
+graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
+{% endhighlight %}
+
+In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
+The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD,
+ED]]` consisting of [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet).
+This *join* can be expressed in the following SQL expression:
+
+{% highlight sql %}
+SELECT src.id, dst.id, src.attr, e.attr, dst.attr
+FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst
+ON e.srcId = src.Id AND e.dstId = dst.Id
+{% endhighlight %}
+
+or graphically as:
+
<p style="text-align: center;">
- <img src="img/vertex_routing_edge_tables.png"
- title="RDD Graph Representation"
- alt="RDD Graph Representation"
+ <img src="img/triplet.png"
+ title="Edge Triplet"
+ alt="Edge Triplet"
width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
+The [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class extends the
+[`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) class by adding the `srcAttr` and
+`dstAttr` members which contain the source and destination properties respectively. We can use the
+triplet view of a graph to render a collection of strings describing relationships between users.
+
+{% highlight scala %}
+val graph: Graph[(String, String), String] // Constructed from above
+// Use the triplets view to create an RDD of facts.
+val facts: RDD[String] =
+ graph.triplets.map(et => et.srcAttr._1 + " is the " + et.attr + " of " et.dstAttr)
+{% endhighlight %}
# Graph Operators
+Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also
+have a collection of basic operators that take user defined function and produce new graphs with
+transformed properties and structure.
+
+## Property Operators
+
+In direct analogy to the RDD `map` operator, the property
+graph contains the following:
+
+{% highlight scala %}
+class Graph[VD, ED] {
+ def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED]
+ def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
+ def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
+}
+{% endhighlight %}
+
+Each of these operators yields a new graph with the vertex or edge properties modified by the user
+defined `map` function.
+
+> Note that in all cases the graph structure is unaffected. This is a key feature of these
+> operators which allows the resulting graph to reuse the structural indicies and the unaffected
+> properties of the original graph.
+> While `graph.mapVertices(mapUDF)` is logically equivalent to the following, the following
+> does not preserve the structural indicies and would not benefit from the substantial system
+> optimizations in GraphX.
+> {% highlight scala %}
+val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))}
+val newGraph = Graph(newVertices, graph.edges)
+{% endhighlight %}
+
+These operators are often used to initialize the graph for a particular computation or project away
+unnecessary properties. For example, given a graph with the out-degrees as the vertex properties
+(we describe how to construct such a graph later) we initialize for PageRank:
+
+{% highlight scala %}
+// Given a graph where the vertex property is the out-degree
+val inputGraph: Graph[Int, String]
+// Construct a graph where each edge contains the weight
+// and each vertex is the initial PageRank
+val outputGraph: Graph[Double, Double] =
+ inputGraph.mapTriplets(et => 1.0/et.srcAttr).mapVertices(v => 1.0)
+{% endhighlight %}
+
+## Structural Operators
+<a name="structural_operators"></a>
+
+
## Map Reduce Triplets (mapReduceTriplets)
<a name="mrTriplets"></a>
-# Graph Algorithms
-<a name="graph_algorithms"></a>
# Graph Builders
<a name="graph_builders"></a>
+
+{% highlight scala %}
+val userGraph: Graph[(String, String), String]
+{% endhighlight %}
+
+
+# Optimized Representation
+
+The Property Graph is internally represented as a collection of RDDs
+
+<p style="text-align: center;">
+ <img src="img/edge_cut_vs_vertex_cut.png"
+ title="Edge Cut vs. Vertex Cut"
+ alt="Edge Cut vs. Vertex Cut"
+ width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
+</p>
+
+<p style="text-align: center;">
+ <img src="img/vertex_routing_edge_tables.png"
+ title="RDD Graph Representation"
+ alt="RDD Graph Representation"
+ width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
+</p>
+
+
+
+
+# Graph Algorithms
+<a name="graph_algorithms"></a>
+
+
<p style="text-align: center;">
<img src="img/tables_and_graphs.png"
title="Tables and Graphs"
alt="Tables and Graphs"
width="50%" />
+ <!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
# Examples
diff --git a/docs/img/graphx_figures.pptx b/docs/img/graphx_figures.pptx
index c67ddb4876..ea4f82ce82 100644
--- a/docs/img/graphx_figures.pptx
+++ b/docs/img/graphx_figures.pptx
Binary files differ
diff --git a/docs/img/property_graph.png b/docs/img/property_graph.png
index 859d4013fb..6f3f89a010 100644
--- a/docs/img/property_graph.png
+++ b/docs/img/property_graph.png
Binary files differ
diff --git a/docs/img/triplet.png b/docs/img/triplet.png
new file mode 100644
index 0000000000..8b82a09bed
--- /dev/null
+++ b/docs/img/triplet.png
Binary files differ