diff options
-rw-r--r-- | docs/graphx-programming-guide.md | 231 | ||||
-rw-r--r-- | docs/img/graphx_figures.pptx | bin | 1118035 -> 1123365 bytes | |||
-rw-r--r-- | docs/img/property_graph.png | bin | 79056 -> 225151 bytes | |||
-rw-r--r-- | docs/img/triplet.png | bin | 0 -> 31489 bytes |
4 files changed, 215 insertions, 16 deletions
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md index 8ae5f17e12..b46cc00d04 100644 --- a/docs/graphx-programming-guide.md +++ b/docs/graphx-programming-guide.md @@ -11,6 +11,7 @@ title: GraphX Programming Guide title="GraphX Logo" alt="GraphX" width="65%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> # Overview @@ -42,6 +43,7 @@ magnitude faster than more general *data-parallel* systems. title="Data-Parallel vs. Graph-Parallel" alt="Data-Parallel vs. Graph-Parallel" width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> However, the same restrictions that enable these substantial performance gains @@ -56,14 +58,15 @@ movement and duplication and a complicated programming model. title="Graph Analytics Pipeline" alt="Graph Analytics Pipeline" width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> The goal of the GraphX project is to unify graph-parallel and data-parallel -computation in one system with a single composable API. This goal is achieved -through an API that enables users to view data both as a graph and as -collections (i.e., RDDs) without data movement or duplication and by -incorporating advances in graph-parallel systems to optimize the execution of -operations on the graph view. In preliminary experiments we find that the GraphX +computation in one system with a single composable API. The GraphX API +enables users to view data both as a graph and as +collection (i.e., RDDs) without data movement or duplication. By +incorporating recent advances in graph-parallel systems, GraphX is able to optimize +the execution of graph operations. In preliminary experiments we find that the GraphX system is able to achieve performance comparable to state-of-the-art graph-parallel systems while easily expressing the entire analytics pipelines. @@ -72,6 +75,7 @@ graph-parallel systems while easily expressing the entire analytics pipelines. title="GraphX Performance Comparison" alt="GraphX Performance Comparison" width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> ## GraphX Replaces the Spark Bagel API @@ -86,47 +90,242 @@ support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and Bagel users to explore the new GraphX API and comment on issues that may complicate the transition from Bagel. +# Getting Started + +To get started you first need to import Spark and GraphX into your project. This can be done by +importing the following: + +{% highlight scala %} +import org.apache.spark._ +import org.apache.spark.graphx._ +{% endhighlight %} + +If you are not using the Spark shell you will also need a Spark context. + # The Property Graph <a name="property_graph"></a> -<p style="text-align: center;"> - <img src="img/edge_cut_vs_vertex_cut.png" - title="Edge Cut vs. Vertex Cut" - alt="Edge Cut vs. Vertex Cut" - width="50%" /> -</p> +The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a directed graph with +user defined objects attached to each vertex and edge. Like RDDs, property graphs are immutable, +distributed, and fault-tolerant. Vertices are keyed by their vertex identifier (`VertexId`) which is +a unique 64-bit long. Similarly, edges have corresponding source and destination vertex identifiers. +Unlike other systems, GraphX does not impose any ordering or constraints on the vertex identifiers. + +The property graph is parameterized over the vertex `VD` and edge `ED` types. These are the types +of the objects associated with each vertex and edge respectively. In some cases it can be desirable +to have vertices of different types. However, this can be accomplished through inheritance. + +> GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g., +> int, double, etc...) reducing the memory overhead of the graph representation. + +Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the +properties for each vertex and edge: + +{% highlight scala %} +class Graph[VD: ClassTag, ED: ClassTag] { + val vertices: RDD[(VertexId, VD)] + val edges: RDD[Edge[ED]] + // ... +} +{% endhighlight %} +> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and +> `EdgeRDD[ED]` respectively. These types extend and are optimized versions of `RDD[(VertexId, VD)]` +> and `RDD[Edge[ED]]`. + +For example, we might construct a property graph consisting of various collaborators on the GraphX +project. The vertex property contains the username and occupation and the edge property contains +a string describing the relationships between the users. <p style="text-align: center;"> <img src="img/property_graph.png" title="The Property Graph" alt="The Property Graph" width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> +The resulting graph would have the type signature: + +{% highlight scala %} +val userGraph: Graph[(String, String), String] +{% endhighlight %} + +There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic +generators and these are discussed in more detail in the section on +[graph builders](#graph_builders). Probably the most general method is to use the +[graph singleton](api/graphx/index.html#org.apache.spark.graphx.Graph$). +For example the following code constructs a graph from a collection of RDDs: + +{% highlight scala %} +// Assume the SparkContext has already been constructed +val sc: SparkContext +// Create an RDD for the vertices +val users: RDD[(VertexId, (String, String))] = + sc.parallelize(Array((3, ("rxin", "student")), (7, ("jgonzal", "postdoc")), + (5, ("franklin", "prof")), (2, ("istoica", "prof")))) +// Create an RDD for edges +val relationships: RDD[Edge[String]] = + sc.parallelize(Array(Edge(3, 7, "collab"), Edge(5, 3, "advisor"), + Edge(2, 5, "colleague"), Edge(5, 7, "pi")) +// Define a default user in case there are relationship with missing user +val defaultUser = ("John Doe", "Missing") +// Build the initial Graph +val graph = Graph(users, relationships, defaultUser) +{% endhighlight %} + +In the above example we make use of the [`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) +case class. Edges have a `srcId` and a `dstId` corresponding to the source and destination vertex +identifiers. In addition, the `Edge` class contains the `attr` member which contains the edge +property. + +We can deconstruct a graph into the respective vertex and edge views by using the `graph.vertices` +and `graph.edges` members respectively. + +{% highlight scala %} +val graph: Graph[(String, String), String] // Constructed from above +// Count all users which are postdocs +graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count +// Count all the edges where src > dst +graph.edges.filter(e => e.srcId > e.dstId).count +{% endhighlight %} + +> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the +> scala `case` expression to deconstruct the tuple. Alternatively, `graph.edges` returns an `RDD` +> containing `Edge[String]` objects. We could have also used the case class type constructor as +> in the following: +> {% highlight scala %} +graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count +{% endhighlight %} + +In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. +The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD, +ED]]` consisting of [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet). +This *join* can be expressed in the following SQL expression: + +{% highlight sql %} +SELECT src.id, dst.id, src.attr, e.attr, dst.attr +FROM edges AS e LEFT JOIN vertices AS src, vertices AS dst +ON e.srcId = src.Id AND e.dstId = dst.Id +{% endhighlight %} + +or graphically as: + <p style="text-align: center;"> - <img src="img/vertex_routing_edge_tables.png" - title="RDD Graph Representation" - alt="RDD Graph Representation" + <img src="img/triplet.png" + title="Edge Triplet" + alt="Edge Triplet" width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> +The [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class extends the +[`Edge`](api/graphx/index.html#org.apache.spark.graphx.Edge) class by adding the `srcAttr` and +`dstAttr` members which contain the source and destination properties respectively. We can use the +triplet view of a graph to render a collection of strings describing relationships between users. + +{% highlight scala %} +val graph: Graph[(String, String), String] // Constructed from above +// Use the triplets view to create an RDD of facts. +val facts: RDD[String] = + graph.triplets.map(et => et.srcAttr._1 + " is the " + et.attr + " of " et.dstAttr) +{% endhighlight %} # Graph Operators +Just as RDDs have basic operations like `map`, `filter`, and `reduceByKey`, property graphs also +have a collection of basic operators that take user defined function and produce new graphs with +transformed properties and structure. + +## Property Operators + +In direct analogy to the RDD `map` operator, the property +graph contains the following: + +{% highlight scala %} +class Graph[VD, ED] { + def mapVertices[VD2](map: (VertexID, VD) => VD2): Graph[VD2, ED] + def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2] + def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2] +} +{% endhighlight %} + +Each of these operators yields a new graph with the vertex or edge properties modified by the user +defined `map` function. + +> Note that in all cases the graph structure is unaffected. This is a key feature of these +> operators which allows the resulting graph to reuse the structural indicies and the unaffected +> properties of the original graph. +> While `graph.mapVertices(mapUDF)` is logically equivalent to the following, the following +> does not preserve the structural indicies and would not benefit from the substantial system +> optimizations in GraphX. +> {% highlight scala %} +val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))} +val newGraph = Graph(newVertices, graph.edges) +{% endhighlight %} + +These operators are often used to initialize the graph for a particular computation or project away +unnecessary properties. For example, given a graph with the out-degrees as the vertex properties +(we describe how to construct such a graph later) we initialize for PageRank: + +{% highlight scala %} +// Given a graph where the vertex property is the out-degree +val inputGraph: Graph[Int, String] +// Construct a graph where each edge contains the weight +// and each vertex is the initial PageRank +val outputGraph: Graph[Double, Double] = + inputGraph.mapTriplets(et => 1.0/et.srcAttr).mapVertices(v => 1.0) +{% endhighlight %} + +## Structural Operators +<a name="structural_operators"></a> + + ## Map Reduce Triplets (mapReduceTriplets) <a name="mrTriplets"></a> -# Graph Algorithms -<a name="graph_algorithms"></a> # Graph Builders <a name="graph_builders"></a> + +{% highlight scala %} +val userGraph: Graph[(String, String), String] +{% endhighlight %} + + +# Optimized Representation + +The Property Graph is internally represented as a collection of RDDs + +<p style="text-align: center;"> + <img src="img/edge_cut_vs_vertex_cut.png" + title="Edge Cut vs. Vertex Cut" + alt="Edge Cut vs. Vertex Cut" + width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> +</p> + +<p style="text-align: center;"> + <img src="img/vertex_routing_edge_tables.png" + title="RDD Graph Representation" + alt="RDD Graph Representation" + width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> +</p> + + + + +# Graph Algorithms +<a name="graph_algorithms"></a> + + <p style="text-align: center;"> <img src="img/tables_and_graphs.png" title="Tables and Graphs" alt="Tables and Graphs" width="50%" /> + <!-- Images are downsized intentionally to improve quality on retina displays --> </p> # Examples diff --git a/docs/img/graphx_figures.pptx b/docs/img/graphx_figures.pptx Binary files differindex c67ddb4876..ea4f82ce82 100644 --- a/docs/img/graphx_figures.pptx +++ b/docs/img/graphx_figures.pptx diff --git a/docs/img/property_graph.png b/docs/img/property_graph.png Binary files differindex 859d4013fb..6f3f89a010 100644 --- a/docs/img/property_graph.png +++ b/docs/img/property_graph.png diff --git a/docs/img/triplet.png b/docs/img/triplet.png Binary files differnew file mode 100644 index 0000000000..8b82a09bed --- /dev/null +++ b/docs/img/triplet.png |