From b1eeefb4016d69aa0beadd302496c8250766d9b7 Mon Sep 17 00:00:00 2001
From: "Joseph E. Gonzalez"
Date: Fri, 10 Jan 2014 00:39:08 -0800
Subject: WIP. Updating figures and cleaning up initial skeleton for GraphX
 Programming guide.

---
 docs/graphx-programming-guide.md | 277 ++++++++++++++++++---------------------
 1 file changed, 126 insertions(+), 151 deletions(-)

diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index ebc47f5d1c..a551e4306d 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -1,63 +1,141 @@
 ---
 layout: global
-title: "GraphX: Unifying Graphs and Tables"
+title: GraphX Programming Guide
 ---
+* This will become a table of contents (this text will be scraped).
+{:toc}
 
-GraphX extends the distributed fault-tolerant collections API and
-interactive console of [Spark](http://spark.incubator.apache.org) with
-a new graph API which leverages recent advances in graph systems
-(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
-interactively build, transform, and reason about graph structured data
-at scale.
-
-
-## Motivation
-
-From social networks and targeted advertising to protein modeling and
-astrophysics, big graphs capture the structure in data and are central
-to the recent advances in machine learning and data mining. Directly
-applying existing *data-parallel* tools (e.g.,
-[Hadoop](http://hadoop.apache.org) and
-[Spark](http://spark.incubator.apache.org)) to graph computation tasks
-can be cumbersome and inefficient. The need for intuitive, scalable
-tools for graph computation has lead to the development of new
-*graph-parallel* systems (e.g.,
-[Pregel](http://http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)) which are designed to efficiently
-execute graph algorithms.
-Unfortunately, these systems do not address
-the challenges of graph construction and transformation and provide
-limited fault-tolerance and support for interactive analysis.
-
-{:.pagination-centered}
-![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png)
-
-## Solution
-
-The GraphX project combines the advantages of both data-parallel and
-graph-parallel systems by efficiently expressing graph computation
-within the [Spark](http://spark.incubator.apache.org) framework. We
-leverage new ideas in distributed graph representation to efficiently
-distribute graphs as tabular data-structures. Similarly, we leverage
-advances in data-flow systems to exploit in-memory computation and
-fault-tolerance. We provide powerful new operations to simplify graph
-construction and transformation. Using these primitives we implement
-the PowerGraph and Pregel abstractions in less than 20 lines of code.
-Finally, by exploiting the Scala foundation of Spark, we enable users
-to interactively load, transform, and compute on massive graphs.
-
+*(figure: GraphX)*
-## Examples
+# Overview
+
+GraphX is the new (alpha) Spark API for graphs and graph-parallel
+computation. At a high level, GraphX extends the Spark
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
+introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
+a directed graph with properties attached to each vertex and edge.
+To support graph computation, GraphX exposes a set of functions
+(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
+[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
+APIs. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
+graph analytics tasks.
+
+## Background on Graph-Parallel Computation
+
+From social networks to language modeling, the growing scale and importance of
+graph data has driven the development of numerous new *graph-parallel* systems
+(e.g., [Giraph](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)). By restricting the types of computation that
+can be expressed and introducing new techniques to partition and distribute
+graphs, these systems can efficiently execute sophisticated graph algorithms
+orders of magnitude faster than more general *data-parallel* systems.
+
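To make the vertex-centric pattern concrete, here is a minimal single-machine sketch in plain Scala (no Spark; the tiny edge list is invented for illustration) of the kind of step graph-parallel systems restrict computation to: every edge sends a value to its destination vertex, and each vertex aggregates the messages it receives.

```scala
// A tiny directed graph as an edge list (vertex ids are Longs, as in GraphX).
val edges = Seq((1L, 2L), (1L, 3L), (2L, 3L))

// Graph-parallel step: every edge sends a message (here, the constant 1)
// to its destination, and each vertex sums what it receives. This computes
// in-degree, the send/aggregate pattern that algorithms such as PageRank
// repeat with richer messages.
val inDegree: Map[Long, Int] =
  edges
    .map { case (_, dst) => (dst, 1) }   // one message per edge
    .groupBy(_._1)                       // deliver messages to each vertex
    .map { case (v, msgs) => (v, msgs.map(_._2).sum) } // aggregate

// inDegree == Map(2L -> 1, 3L -> 2)
```

A distributed system executes the same logical send/aggregate step, but partitions the edges across machines and avoids materializing the intermediate messages.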
+*(figure: Data-Parallel vs. Graph-Parallel)*
+
+However, the same restrictions that enable these substantial performance gains
+also make it difficult to express many of the important stages in a typical
+graph-analytics pipeline: constructing the graph, modifying its structure, or
+expressing computation that spans multiple graphs. As a consequence, existing
+graph analytics pipelines compose graph-parallel and data-parallel systems,
+leading to extensive data movement and duplication and a complicated
+programming model.
+
+*(figure: Graph Analytics Pipeline)*
+
+The goal of the GraphX project is to unify graph-parallel and data-parallel
+computation in one system with a single composable API. This goal is achieved
+through an API that enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication, and by
+incorporating advances in graph-parallel systems to optimize the execution of
+operations on the graph view. In preliminary experiments we find that the
+GraphX system is able to achieve performance comparable to state-of-the-art
+graph-parallel systems while easily expressing entire analytics pipelines.
+
+*(figure: GraphX Performance Comparison)*
+
+## GraphX Replaces the Spark Bagel API
+
+Prior to the release of GraphX, graph computation in Spark was expressed using
+Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by
+exposing a richer property graph API, a more streamlined version of the Pregel
+abstraction, and system optimizations to improve performance and reduce memory
+overhead. While we plan to eventually deprecate Bagel, we will continue to
+support the API and [Bagel programming guide](bagel-programming-guide.html).
+However, we encourage Bagel users to explore the new GraphX API and comment on
+issues that may complicate the transition from Bagel.
+
+# The Property Graph
+
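This section of the guide is still being filled in. As a stop-gap, here is a toy, single-machine model in plain Scala (not the GraphX API itself, and the user data is invented) of what a property graph is: vertices and directed edges that each carry a user-defined attribute, plus the derived "triplet" view that joins an edge with the attributes of its two endpoints.

```scala
type VertexId = Long

// An edge carries the ids of its endpoints plus a user-defined attribute.
case class Edge[ED](srcId: VertexId, dstId: VertexId, attr: ED)

// Vertex attributes: (username, occupation). Sample data for illustration.
val vertices: Map[VertexId, (String, String)] = Map(
  3L -> ("rxin", "student"),
  5L -> ("franklin", "professor"),
  7L -> ("jgonzal", "postdoc"))

// Edge attributes: a relationship string.
val edges: Seq[Edge[String]] = Seq(
  Edge(3L, 7L, "collaborator"),
  Edge(5L, 3L, "advisor"))

// A "triplet" joins an edge with the attributes of both endpoints,
// here reduced to (source name, relationship, destination name).
val triplets: Seq[(String, String, String)] = edges.map { e =>
  (vertices(e.srcId)._1, e.attr, vertices(e.dstId)._1)
}
// triplets == Seq(("rxin", "collaborator", "jgonzal"),
//                 ("franklin", "advisor", "rxin"))
```

In GraphX the vertex and edge collections are distributed RDDs rather than in-memory maps, but the logical model, attributes on vertices and edges plus a triplet view, is the same.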
+*(figure: Edge Cut vs. Vertex Cut)*
+ +
+*(figure: The Property Graph)*
+ +
+*(figure: RDD Graph Representation)*
+
+# Graph Operators
+
+## Map Reduce Triplets (mapReduceTriplets)
+
+# Graph Algorithms
+
+# Graph Builders
+
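The builder section above is still a skeleton. As a hedged sketch of the job a graph builder automates, here is plain Scala (no Spark; the function name and file format are illustrative) that parses edge-list text, one `srcId dstId` pair per line with `#` starting a comment, into the edge pairs a loader would hand to a graph constructor.

```scala
// Parse edge-list text ("srcId dstId" per line, '#' starts a comment) into
// (source, destination) vertex-id pairs. Illustrative helper, not GraphX API.
def parseEdgeList(lines: Seq[String]): Seq[(Long, Long)] =
  lines
    .map(_.trim)
    .filter(line => line.nonEmpty && !line.startsWith("#"))
    .map { line =>
      val fields = line.split("\\s+")
      (fields(0).toLong, fields(1).toLong)
    }

val edges = parseEdgeList(Seq("# toy graph", "1 2", "2 3", "", "3 1"))
// edges == Seq((1L, 2L), (2L, 3L), (3L, 1L))
```

A distributed builder performs the same parse per partition of the input file and then constructs the distributed vertex and edge collections from the result.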
+*(figure: Tables and Graphs)*
+
+# Examples
+
 Suppose I want to build a graph from some text files, restrict the graph
 to important relationships and users, run page-rank on the sub-graph, and
 then finally return attributes associated with the top users. I can do
 all of this in just a few lines with GraphX:
 
-```scala
+{% highlight scala %}
 
 // Connect to the Spark cluster
 val sc = new SparkContext("spark://master.amplab.org", "research")
@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){
 
 println(userInfoWithPageRank.top(5))
 
-```
-
-
-## Online Documentation
-
-You can find the latest Spark documentation, including a programming
-guide, on the project webpage at
-<http://spark.incubator.apache.org/documentation.html>. This README
-file only contains basic setup instructions.
-
-
-## Building
-
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
-project is built using Simple Build Tool (SBT), which is packaged with
-it. To build Spark and its example programs, run:
-
-    sbt/sbt assembly
-
-Once you've built Spark, the easiest way to start using it is the
-shell:
-
-    ./spark-shell
-
-Or, for the Python API, the Python shell (`./pyspark`).
-
-Spark also comes with several sample programs in the `examples`
-directory. To run one of them, use `./run-example
-<class> <params>`. For example:
-
-    ./run-example org.apache.spark.examples.SparkLR local[2]
-
-will run the Logistic Regression example locally on 2 CPUs.
-
-Each of the example programs prints usage help if no params are given.
-
-All of the Spark samples take a `<master>` parameter that is the
-cluster URL to connect to. This can be a mesos:// or spark:// URL, or
-"local" to run locally with one thread, or "local[N]" to run locally
-with N threads.
-
-
-## A Note About Hadoop Versions
-
-Spark uses the Hadoop core library to talk to HDFS and other
-Hadoop-supported storage systems. Because the protocols have changed
-in different versions of Hadoop, you must build Spark against the same
-version that your cluster runs. You can change the version by setting
-the `SPARK_HADOOP_VERSION` environment when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
-    # Apache Hadoop 1.2.1
-    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v1
-    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
-
-For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `SPARK_YARN=true`:
-
-    # Apache Hadoop 2.0.5-alpha
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v2
-    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-
-For convenience, these variables may also be set through the
-`conf/spark-env.sh` file described below.
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
-    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
-
-If your project is built with Maven, add this to your POM file's
-`<dependencies>` section:
-
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>1.2.1</version>
-    </dependency>
-
-
-## Configuration
-
-Please refer to the [Configuration
-guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
-in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to GraphX
+{% endhighlight %}
-
-Contributions via GitHub pull requests are gladly accepted from their
-original author. Along with any pull requests, please state that the
-contribution is your original work and that you license the work to
-the project under the project's open source license. Whether or not
-you state this explicitly, by submitting any copyrighted material via
-pull request, email, or other means you agree to license the material
-under the project's open source license and warrant that you have the
-legal authority to do so.