From 41b312212094b2accd650813dd45e1767b5465fe Mon Sep 17 00:00:00 2001 From: "Joseph E. Gonzalez" Date: Tue, 29 Oct 2013 20:57:55 -0700 Subject: Strating to improve README. --- docs/img/data_parallel_vs_graph_parallel.png | Bin 0 -> 199060 bytes docs/img/edge-cut.png | Bin 0 -> 12563 bytes docs/img/graph_parallel.png | Bin 0 -> 92288 bytes docs/img/tables_and_graphs.png | Bin 0 -> 68905 bytes docs/img/vertex-cut.png | Bin 0 -> 12246 bytes 5 files changed, 0 insertions(+), 0 deletions(-) create mode 100644 docs/img/data_parallel_vs_graph_parallel.png create mode 100644 docs/img/edge-cut.png create mode 100644 docs/img/graph_parallel.png create mode 100644 docs/img/tables_and_graphs.png create mode 100644 docs/img/vertex-cut.png (limited to 'docs') diff --git a/docs/img/data_parallel_vs_graph_parallel.png b/docs/img/data_parallel_vs_graph_parallel.png new file mode 100644 index 0000000000..d9aa811466 Binary files /dev/null and b/docs/img/data_parallel_vs_graph_parallel.png differ diff --git a/docs/img/edge-cut.png b/docs/img/edge-cut.png new file mode 100644 index 0000000000..698f4ff181 Binary files /dev/null and b/docs/img/edge-cut.png differ diff --git a/docs/img/graph_parallel.png b/docs/img/graph_parallel.png new file mode 100644 index 0000000000..330be5567c Binary files /dev/null and b/docs/img/graph_parallel.png differ diff --git a/docs/img/tables_and_graphs.png b/docs/img/tables_and_graphs.png new file mode 100644 index 0000000000..9af07d3081 Binary files /dev/null and b/docs/img/tables_and_graphs.png differ diff --git a/docs/img/vertex-cut.png b/docs/img/vertex-cut.png new file mode 100644 index 0000000000..0a508dcee9 Binary files /dev/null and b/docs/img/vertex-cut.png differ -- cgit v1.2.3 From e4483582fc59330af8a43e8a152959f927103c79 Mon Sep 17 00:00:00 2001 From: Ankur Dave Date: Thu, 9 Jan 2014 10:23:35 -0800 Subject: Add docs/graphx-programming-guide.md from 7210257ba3038d5e22d4b60fe9c3113dc45c3dff:README.md --- docs/graphx-programming-guide.md | 197 +++++++++++++++++++++++++++++++++++++++ 1 file changed, 197 insertions(+) create mode 100644 docs/graphx-programming-guide.md (limited to 'docs') diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md new file mode 100644 index 0000000000..5b06d82225 --- /dev/null +++ b/docs/graphx-programming-guide.md @@ -0,0 +1,197 @@ +# GraphX: Unifying Graphs and Tables + + +GraphX extends the distributed fault-tolerant collections API and +interactive console of [Spark](http://spark.incubator.apache.org) with +a new graph API which leverages recent advances in graph systems +(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and +interactively build, transform, and reason about graph structured data +at scale. + + +## Motivation + +From social networks and targeted advertising to protein modeling and +astrophysics, big graphs capture the structure in data and are central +to the recent advances in machine learning and data mining. Directly +applying existing *data-parallel* tools (e.g., +[Hadoop](http://hadoop.apache.org) and +[Spark](http://spark.incubator.apache.org)) to graph computation tasks +can be cumbersome and inefficient. The need for intuitive, scalable +tools for graph computation has lead to the development of new +*graph-parallel* systems (e.g., +[Pregel](http://http://giraph.apache.org) and +[GraphLab](http://graphlab.org)) which are designed to efficiently +execute graph algorithms. Unfortunately, these systems do not address +the challenges of graph construction and transformation and provide +limited fault-tolerance and support for interactive analysis. + +

+ +

+ + + +## Solution + +The GraphX project combines the advantages of both data-parallel and +graph-parallel systems by efficiently expressing graph computation +within the [Spark](http://spark.incubator.apache.org) framework. We +leverage new ideas in distributed graph representation to efficiently +distribute graphs as tabular data-structures. Similarly, we leverage +advances in data-flow systems to exploit in-memory computation and +fault-tolerance. We provide powerful new operations to simplify graph +construction and transformation. Using these primitives we implement +the PowerGraph and Pregel abstractions in less than 20 lines of code. +Finally, by exploiting the Scala foundation of Spark, we enable users +to interactively load, transform, and compute on massive graphs. + +

+ +

+ +## Examples + +Suppose I want to build a graph from some text files, restrict the graph +to important relationships and users, run page-rank on the sub-graph, and +then finally return attributes associated with the top users. I can do +all of this in just a few lines with GraphX: + +```scala +// Connect to the Spark cluster +val sc = new SparkContext("spark://master.amplab.org", "research") + +// Load my user data and prase into tuples of user id and attribute list +val users = sc.textFile("hdfs://user_attributes.tsv") + .map(line => line.split).map( parts => (parts.head, parts.tail) ) + +// Parse the edge data which is already in userId -> userId format +val followerGraph = Graph.textFile(sc, "hdfs://followers.tsv") + +// Attach the user attributes +val graph = followerGraph.outerJoinVertices(users){ + case (uid, deg, Some(attrList)) => attrList + // Some users may not have attributes so we set them as empty + case (uid, deg, None) => Array.empty[String] + } + +// Restrict the graph to users which have exactly two attributes +val subgraph = graph.subgraph((vid, attr) => attr.size == 2) + +// Compute the PageRank +val pagerankGraph = Analytics.pagerank(subgraph) + +// Get the attributes of the top pagerank users +val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){ + case (uid, attrList, Some(pr)) => (pr, attrList) + case (uid, attrList, None) => (pr, attrList) + } + +println(userInfoWithPageRank.top(5)) + +``` + + +## Online Documentation + +You can find the latest Spark documentation, including a programming +guide, on the project webpage at +. This README +file only contains basic setup instructions. + + +## Building + +Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The +project is built using Simple Build Tool (SBT), which is packaged with +it. To build Spark and its example programs, run: + + sbt/sbt assembly + +Once you've built Spark, the easiest way to start using it is the +shell: + + ./spark-shell + +Or, for the Python API, the Python shell (`./pyspark`). + +Spark also comes with several sample programs in the `examples` +directory. To run one of them, use `./run-example +`. For example: + + ./run-example org.apache.spark.examples.SparkLR local[2] + +will run the Logistic Regression example locally on 2 CPUs. + +Each of the example programs prints usage help if no params are given. + +All of the Spark samples take a `` parameter that is the +cluster URL to connect to. This can be a mesos:// or spark:// URL, or +"local" to run locally with one thread, or "local[N]" to run locally +with N threads. + + +## A Note About Hadoop Versions + +Spark uses the Hadoop core library to talk to HDFS and other +Hadoop-supported storage systems. Because the protocols have changed +in different versions of Hadoop, you must build Spark against the same +version that your cluster runs. You can change the version by setting +the `SPARK_HADOOP_VERSION` environment when building Spark. + +For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop +versions without YARN, use: + + # Apache Hadoop 1.2.1 + $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly + + # Cloudera CDH 4.2.0 with MapReduce v1 + $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly + +For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions +with YARN, also set `SPARK_YARN=true`: + + # Apache Hadoop 2.0.5-alpha + $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly + + # Cloudera CDH 4.2.0 with MapReduce v2 + $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly + +For convenience, these variables may also be set through the +`conf/spark-env.sh` file described below. + +When developing a Spark application, specify the Hadoop version by adding the +"hadoop-client" artifact to your project's dependencies. For example, if you're +using Hadoop 1.2.1 and build your application using SBT, add this entry to +`libraryDependencies`: + + "org.apache.hadoop" % "hadoop-client" % "1.2.1" + +If your project is built with Maven, add this to your POM file's +`` section: + + + org.apache.hadoop + hadoop-client + 1.2.1 + + + +## Configuration + +Please refer to the [Configuration +guide](http://spark.incubator.apache.org/docs/latest/configuration.html) +in the online documentation for an overview on how to configure Spark. + + +## Contributing to GraphX + +Contributions via GitHub pull requests are gladly accepted from their +original author. Along with any pull requests, please state that the +contribution is your original work and that you license the work to +the project under the project's open source license. Whether or not +you state this explicitly, by submitting any copyrighted material via +pull request, email, or other means you agree to license the material +under the project's open source license and warrant that you have the +legal authority to do so. + -- cgit v1.2.3 From b5b0de2de53563c43e1c5844a52b4eeeb2542ea5 Mon Sep 17 00:00:00 2001 From: Ankur Dave Date: Thu, 9 Jan 2014 13:24:25 -0800 Subject: Start fixing formatting of graphx-programming-guide --- docs/graphx-programming-guide.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) (limited to 'docs') diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md index 5b06d82225..ebc47f5d1c 100644 --- a/docs/graphx-programming-guide.md +++ b/docs/graphx-programming-guide.md @@ -1,4 +1,7 @@ -# GraphX: Unifying Graphs and Tables +--- +layout: global +title: "GraphX: Unifying Graphs and Tables" +--- GraphX extends the distributed fault-tolerant collections API and @@ -26,11 +29,8 @@ execute graph algorithms. Unfortunately, these systems do not address the challenges of graph construction and transformation and provide limited fault-tolerance and support for interactive analysis. -

- -

- - +{:.pagination-centered} +![Data-parallel vs. graph-parallel]({{ site.url }}/img/data_parallel_vs_graph_parallel.png) ## Solution @@ -194,4 +194,3 @@ you state this explicitly, by submitting any copyrighted material via pull request, email, or other means you agree to license the material under the project's open source license and warrant that you have the legal authority to do so. - -- cgit v1.2.3 From b1eeefb4016d69aa0beadd302496c8250766d9b7 Mon Sep 17 00:00:00 2001 From: "Joseph E. Gonzalez" Date: Fri, 10 Jan 2014 00:39:08 -0800 Subject: WIP. Updating figures and cleaning up initial skeleton for GraphX Programming guide. --- docs/_layouts/global.html | 10 +- docs/graphx-programming-guide.md | 277 ++++++++++++--------------- docs/img/data_parallel_vs_graph_parallel.png | Bin 199060 -> 432725 bytes docs/img/edge_cut_vs_vertex_cut.png | Bin 0 -> 79745 bytes docs/img/graph_analytics_pipeline.png | Bin 0 -> 427220 bytes docs/img/graphx_figures.pptx | Bin 0 -> 1118035 bytes docs/img/graphx_logo.png | Bin 0 -> 40324 bytes docs/img/graphx_performance_comparison.png | Bin 0 -> 166343 bytes docs/img/property_graph.png | Bin 0 -> 79056 bytes docs/img/tables_and_graphs.png | Bin 68905 -> 166265 bytes docs/img/vertex_routing_edge_tables.png | Bin 0 -> 570007 bytes docs/index.md | 6 +- 12 files changed, 134 insertions(+), 159 deletions(-) create mode 100644 docs/img/edge_cut_vs_vertex_cut.png create mode 100644 docs/img/graph_analytics_pipeline.png create mode 100644 docs/img/graphx_figures.pptx create mode 100644 docs/img/graphx_logo.png create mode 100644 docs/img/graphx_performance_comparison.png create mode 100644 docs/img/property_graph.png create mode 100644 docs/img/vertex_routing_edge_tables.png (limited to 'docs') diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html index ad7969d012..7721854685 100755 --- a/docs/_layouts/global.html +++ b/docs/_layouts/global.html @@ -21,7 +21,7 @@ - + @@ -67,10 +67,10 @@
  • Spark Streaming
  • MLlib (Machine Learning)
  • -
  • Bagel (Pregel on Spark)
  • +
  • GraphX (Graph-Parallel Spark)
  • - + @@ -161,7 +161,7 @@ - +