author    Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> 2014-01-10 00:39:08 -0800
committer Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> 2014-01-10 00:39:08 -0800
commit    b1eeefb4016d69aa0beadd302496c8250766d9b7 (patch)
tree      95e41f579fcc7ce0aa7e10870e20d7ce75456463 /docs/graphx-programming-guide.md
parent    b5b0de2de53563c43e1c5844a52b4eeeb2542ea5 (diff)
WIP. Updating figures and cleaning up initial skeleton for GraphX Programming guide.
Diffstat (limited to 'docs/graphx-programming-guide.md')
-rw-r--r--  docs/graphx-programming-guide.md | 277
1 file changed, 126 insertions(+), 151 deletions(-)
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index ebc47f5d1c..a551e4306d 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -1,63 +1,141 @@
---
layout: global
-title: "GraphX: Unifying Graphs and Tables"
+title: GraphX Programming Guide
---
+* This will become a table of contents (this text will be scraped).
+{:toc}
-GraphX extends the distributed fault-tolerant collections API and
-interactive console of [Spark](http://spark.incubator.apache.org) with
-a new graph API which leverages recent advances in graph systems
-(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
-interactively build, transform, and reason about graph structured data
-at scale.
-
-
-## Motivation
-
-From social networks and targeted advertising to protein modeling and
-astrophysics, big graphs capture the structure in data and are central
-to the recent advances in machine learning and data mining. Directly
-applying existing *data-parallel* tools (e.g.,
-[Hadoop](http://hadoop.apache.org) and
-[Spark](http://spark.incubator.apache.org)) to graph computation tasks
-can be cumbersome and inefficient. The need for intuitive, scalable
-tools for graph computation has lead to the development of new
-*graph-parallel* systems (e.g.,
-[Pregel](http://http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)) which are designed to efficiently
-execute graph algorithms. Unfortunately, these systems do not address
-the challenges of graph construction and transformation and provide
-limited fault-tolerance and support for interactive analysis.
-
-{:.pagination-centered}
-![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)
-
-## Solution
-
-The GraphX project combines the advantages of both data-parallel and
-graph-parallel systems by efficiently expressing graph computation
-within the [Spark](http://spark.incubator.apache.org) framework. We
-leverage new ideas in distributed graph representation to efficiently
-distribute graphs as tabular data-structures. Similarly, we leverage
-advances in data-flow systems to exploit in-memory computation and
-fault-tolerance. We provide powerful new operations to simplify graph
-construction and transformation. Using these primitives we implement
-the PowerGraph and Pregel abstractions in less than 20 lines of code.
-Finally, by exploiting the Scala foundation of Spark, we enable users
-to interactively load, transform, and compute on massive graphs.
-
-<p align="center">
- <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
+<p style="text-align: center;">
+ <img src="img/graphx_logo.png"
+ title="GraphX Logo"
+ alt="GraphX"
+ width="65%" />
</p>
-## Examples
+# Overview
+
+GraphX is the new (alpha) Spark API for graphs and graph-parallel
+computation. At a high level, GraphX extends the Spark
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
+introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
+a directed graph with properties attached to each vertex and edge.
+To support graph computation, GraphX exposes a set of functions
+(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
+[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
+APIs. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
+graph analytics tasks.
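+
+To ground the terms above, here is a minimal, hedged sketch of assembling a
+property graph from two RDDs. It assumes the package name
+`org.apache.spark.graphx`, the type names `VertexId`, `Edge`, and `Graph`,
+and an existing `SparkContext` named `sc`; treat it as an illustration, not
+the settled API.
+
+{% highlight scala %}
+import org.apache.spark.graphx._
+import org.apache.spark.rdd.RDD
+
+// Vertices: (64-bit id, vertex property) pairs
+val users: RDD[(VertexId, String)] =
+  sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol")))
+
+// Edges: directed, each carrying an edge property
+val follows: RDD[Edge[String]] =
+  sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
+
+// Assemble the property graph from the two collections
+val graph: Graph[String, String] = Graph(users, follows)
+{% endhighlight %}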
+
+## Background on Graph-Parallel Computation
+
+From social networks to language modeling, the growing scale and importance of
+graph data has driven the development of numerous new *graph-parallel* systems
+(e.g., [Giraph](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)). By restricting the types of computation that can be
+expressed and introducing new techniques to partition and distribute graphs,
+these systems can efficiently execute sophisticated graph algorithms orders of
+magnitude faster than more general *data-parallel* systems.
+
+<p style="text-align: center;">
+ <img src="img/data_parallel_vs_graph_parallel.png"
+ title="Data-Parallel vs. Graph-Parallel"
+ alt="Data-Parallel vs. Graph-Parallel"
+ width="50%" />
+</p>
+
+However, the same restrictions that enable these substantial performance gains
+also make it difficult to express many of the important stages in a typical graph-analytics pipeline:
+constructing the graph, modifying its structure, or expressing computation that
+spans multiple graphs. As a consequence, existing graph analytics pipelines
+must compose graph-parallel and data-parallel systems, leading to extensive
+data movement and duplication as well as a complicated programming model.
+
+<p style="text-align: center;">
+ <img src="img/graph_analytics_pipeline.png"
+ title="Graph Analytics Pipeline"
+ alt="Graph Analytics Pipeline"
+ width="50%" />
+</p>
+
+The goal of the GraphX project is to unify graph-parallel and data-parallel
+computation in one system with a single composable API. This goal is achieved
+through an API that enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication and by
+incorporating advances in graph-parallel systems to optimize the execution of
+operations on the graph view. In preliminary experiments we find that the GraphX
+system is able to achieve performance comparable to state-of-the-art
+graph-parallel systems while easily expressing the entire analytics pipeline.
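+
+Concretely (continuing the hypothetical `graph` built above; `vertices` and
+`subgraph` are assumed operators), the same data can be consumed as a
+collection or transformed as a graph, without copying it between systems:
+
+{% highlight scala %}
+// Collection view: the vertices as an RDD of (id, property) pairs
+val names = graph.vertices.map { case (id, name) => name }
+
+// Graph view: restrict the same graph to its "follows" edges in place
+val followsOnly = graph.subgraph(epred = triplet => triplet.attr == "follows")
+{% endhighlight %}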
+
+<p style="text-align: center;">
+ <img src="img/graphx_performance_comparison.png"
+ title="GraphX Performance Comparison"
+ alt="GraphX Performance Comparison"
+ width="50%" />
+</p>
+
+## GraphX Replaces the Spark Bagel API
+
+Prior to the release of GraphX, graph computation in Spark was expressed using
+Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by exposing
+a richer property graph API and a more streamlined version of the Pregel abstraction,
+and by introducing system optimizations that improve performance and reduce memory
+overhead. While we plan to eventually deprecate Bagel, we will continue to
+support the API and the [Bagel programming guide](bagel-programming-guide.html). However,
+we encourage Bagel users to explore the new GraphX API and to comment on issues that may
+complicate the transition from Bagel.
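+
+As a taste of the streamlined abstraction (a hedged sketch; the exact GraphX
+`Pregel` signature is an assumption here), single-source shortest paths over
+the hypothetical `graph` from above, treating every edge as having weight 1:
+
+{% highlight scala %}
+val sourceId: VertexId = 1L
+// Initialize distances: 0 at the source, infinity everywhere else
+val initialGraph = graph.mapVertices((id, _) =>
+  if (id == sourceId) 0.0 else Double.PositiveInfinity)
+val sssp = Pregel(initialGraph, Double.PositiveInfinity)(
+  (id, dist, newDist) => math.min(dist, newDist),  // vertex program
+  triplet =>                                       // send messages
+    if (triplet.srcAttr + 1.0 < triplet.dstAttr)
+      Iterator((triplet.dstId, triplet.srcAttr + 1.0))
+    else Iterator.empty,
+  (a, b) => math.min(a, b)                         // merge messages
+)
+{% endhighlight %}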
+
+# The Property Graph
+<a name="property_graph"></a>
+
+<p style="text-align: center;">
+ <img src="img/edge_cut_vs_vertex_cut.png"
+ title="Edge Cut vs. Vertex Cut"
+ alt="Edge Cut vs. Vertex Cut"
+ width="50%" />
+</p>
+
+<p style="text-align: center;">
+ <img src="img/property_graph.png"
+ title="The Property Graph"
+ alt="The Property Graph"
+ width="50%" />
+</p>
+
+<p style="text-align: center;">
+ <img src="img/vertex_routing_edge_tables.png"
+ title="RDD Graph Representation"
+ alt="RDD Graph Representation"
+ width="50%" />
+</p>
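+
+The figures above illustrate the property graph data model: a directed graph
+in which every vertex and edge carries a user-defined property. A hedged
+sketch, with purely illustrative property types:
+
+{% highlight scala %}
+// Illustrative, user-defined vertex and edge properties
+case class UserProperty(name: String, age: Int)
+case class Follows(since: Int)
+
+// A property graph is parameterized by both property types, e.g.:
+// val userGraph: Graph[UserProperty, Follows] = ...
+{% endhighlight %}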
+
+
+# Graph Operators
+
+## Map Reduce Triplets (mapReduceTriplets)
+<a name="mrTriplets"></a>
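+
+A hedged sketch of the pattern (the exact `mapReduceTriplets` signature is an
+assumption): compute each vertex's in-degree by emitting one message per edge
+and summing the messages destined for the same vertex.
+
+{% highlight scala %}
+val inDegrees = graph.mapReduceTriplets[Int](
+  triplet => Iterator((triplet.dstId, 1)), // map: one message per edge
+  (a, b) => a + b                          // reduce: sum per destination
+)
+{% endhighlight %}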
+
+# Graph Algorithms
+<a name="graph_algorithms"></a>
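+
+For example (a hypothetical invocation; the `PageRank` object, its package,
+and its signature are assumptions for illustration):
+
+{% highlight scala %}
+// Assumes something like: import org.apache.spark.graphx.lib.PageRank
+// Run 10 iterations of PageRank and keep the per-vertex ranks
+val ranks = PageRank.run(graph, numIter = 10).vertices
+{% endhighlight %}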
+
+# Graph Builders
+<a name="graph_builders"></a>
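+
+For example (a hedged sketch; `GraphLoader.edgeListFile` and its signature are
+assumptions, and the path is illustrative), building a graph from a
+whitespace-separated "srcId dstId" edge-list file:
+
+{% highlight scala %}
+val loadedGraph = GraphLoader.edgeListFile(sc, "hdfs://namenode/edges.txt")
+{% endhighlight %}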
+
+<p style="text-align: center;">
+ <img src="img/tables_and_graphs.png"
+ title="Tables and Graphs"
+ alt="Tables and Graphs"
+ width="50%" />
+</p>
+
+# Examples
Suppose I want to build a graph from some text files, restrict the graph
to important relationships and users, run PageRank on the subgraph, and
finally return the attributes associated with the top users. I can do
all of this in just a few lines with GraphX:
-```scala
+{% highlight scala %}
// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")
@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
println(userInfoWithPageRank.top(5))
-```
-
-
-## Online Documentation
-
-You can find the latest Spark documentation, including a programming
-guide, on the project webpage at
-<http://spark.incubator.apache.org/documentation.html>. This README
-file only contains basic setup instructions.
-
-
-## Building
-
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
-project is built using Simple Build Tool (SBT), which is packaged with
-it. To build Spark and its example programs, run:
-
- sbt/sbt assembly
-
-Once you've built Spark, the easiest way to start using it is the
-shell:
-
- ./spark-shell
-
-Or, for the Python API, the Python shell (`./pyspark`).
-
-Spark also comes with several sample programs in the `examples`
-directory. To run one of them, use `./run-example <class>
-<params>`. For example:
-
- ./run-example org.apache.spark.examples.SparkLR local[2]
-
-will run the Logistic Regression example locally on 2 CPUs.
-
-Each of the example programs prints usage help if no params are given.
-
-All of the Spark samples take a `<master>` parameter that is the
-cluster URL to connect to. This can be a mesos:// or spark:// URL, or
-"local" to run locally with one thread, or "local[N]" to run locally
-with N threads.
-
-
-## A Note About Hadoop Versions
-
-Spark uses the Hadoop core library to talk to HDFS and other
-Hadoop-supported storage systems. Because the protocols have changed
-in different versions of Hadoop, you must build Spark against the same
-version that your cluster runs. You can change the version by setting
-the `SPARK_HADOOP_VERSION` environment when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
- # Apache Hadoop 1.2.1
- $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
-
- # Cloudera CDH 4.2.0 with MapReduce v1
- $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
-
-For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `SPARK_YARN=true`:
-
- # Apache Hadoop 2.0.5-alpha
- $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
-
- # Cloudera CDH 4.2.0 with MapReduce v2
- $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-
-For convenience, these variables may also be set through the
-`conf/spark-env.sh` file described below.
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
- "org.apache.hadoop" % "hadoop-client" % "1.2.1"
-
-If your project is built with Maven, add this to your POM file's
-`<dependencies>` section:
-
- <dependency>
- <groupId>org.apache.hadoop</groupId>
- <artifactId>hadoop-client</artifactId>
- <version>1.2.1</version>
- </dependency>
-
-
-## Configuration
-
-Please refer to the [Configuration
-guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
-in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to GraphX
+{% endhighlight %}
-Contributions via GitHub pull requests are gladly accepted from their
-original author. Along with any pull requests, please state that the
-contribution is your original work and that you license the work to
-the project under the project's open source license. Whether or not
-you state this explicitly, by submitting any copyrighted material via
-pull request, email, or other means you agree to license the material
-under the project's open source license and warrant that you have the
-legal authority to do so.