authorJoseph E. Gonzalez <joseph.e.gonzalez@gmail.com>2013-10-29 20:57:55 -0700
committerJoseph E. Gonzalez <joseph.e.gonzalez@gmail.com>2013-10-29 20:57:55 -0700
commit41b312212094b2accd650813dd45e1767b5465fe (patch)
tree9f4666ae9e740a9faecfe1427bad0d1f14ea1902
parent38ec0baf5c9033a9e9e9bb015d95357d8176e022 (diff)
Starting to improve README.
-rw-r--r--  README.md                                     128
-rw-r--r--  docs/img/data_parallel_vs_graph_parallel.png  bin 0 -> 199060 bytes
-rw-r--r--  docs/img/edge-cut.png                         bin 0 -> 12563 bytes
-rw-r--r--  docs/img/graph_parallel.png                   bin 0 -> 92288 bytes
-rw-r--r--  docs/img/tables_and_graphs.png                bin 0 -> 68905 bytes
-rw-r--r--  docs/img/vertex-cut.png                       bin 0 -> 12246 bytes
6 files changed, 81 insertions, 47 deletions
diff --git a/README.md b/README.md
index 139bdc070c..a54d4ed587 100644
--- a/README.md
+++ b/README.md
@@ -1,35 +1,76 @@
-# GraphX Branch of Spark
+# GraphX: Unifying Graphs and Tables
-This is experimental code for the Apache spark project.
-# Apache Spark
+GraphX extends the distributed fault-tolerant collections API and
+interactive console of [Spark](http://spark.incubator.apache.org) with
+a new graph API which leverages recent advances in graph systems
+(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
+interactively build, transform, and reason about graph-structured data
+at scale.
+
+
+## Motivation
+
+From social networks and targeted advertising to protein modeling and
+astrophysics, big graphs capture the structure in data and are central
+to the recent advances in machine learning and data mining. Directly
+applying existing *data-parallel* tools (e.g.,
+[Hadoop](http://hadoop.apache.org) and
+[Spark](http://spark.incubator.apache.org)) to graph computation tasks
+can be cumbersome and inefficient. The need for intuitive, scalable
+tools for graph computation has led to the development of new
+*graph-parallel* systems (e.g.,
+[Pregel](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)) which are designed to efficiently
+execute graph algorithms. Unfortunately, these systems do not address
+the challenges of graph construction and transformation, and they
+provide limited fault-tolerance and support for interactive analysis.
+
+![image](docs/img/data_parallel_vs_graph_parallel.png)
+
+
+## Solution
+
+The GraphX project combines the advantages of both data-parallel and
+graph-parallel systems by efficiently expressing graph computation
+within the [Spark](http://spark.incubator.apache.org) framework. We
+leverage new ideas in distributed graph representation to efficiently
+distribute graphs as tabular data structures. Similarly, we leverage
+advances in data-flow systems to exploit in-memory computation and
+fault-tolerance. We provide powerful new operations to simplify graph
+construction and transformation. Using these primitives we implement
+the PowerGraph and Pregel abstractions in less than 20 lines of code.
+Finally, by exploiting the Scala foundation of Spark, we enable users
+to interactively load, transform, and compute on massive graphs.
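At its core, the Pregel abstraction mentioned above is iterated message passing with a per-vertex update rule, which can be sketched on a single machine in a few lines of Scala. This is an illustrative toy, not the GraphX API; the names `PregelSketch` and `shortestPaths` are invented for the example.

```scala
// A minimal, single-machine sketch of a Pregel-style computation:
// shortest-path distances from vertex 0, iterated until no vertex changes.
object PregelSketch {
  def shortestPaths(edges: Seq[(Int, Int, Double)], numVertices: Int): Array[Double] = {
    val dist = Array.fill(numVertices)(Double.PositiveInfinity)
    dist(0) = 0.0
    var changed = true
    while (changed) {                     // one iteration = one "superstep"
      changed = false
      for ((src, dst, w) <- edges) {      // each edge carries a message
        if (dist(src) + w < dist(dst)) {  // vertex program: take the min message
          dist(dst) = dist(src) + w
          changed = true
        }
      }
    }
    dist
  }

  def main(args: Array[String]): Unit = {
    val edges = Seq((0, 1, 1.0), (1, 2, 2.0), (0, 2, 5.0))
    println(shortestPaths(edges, 3).mkString(", "))  // 0.0, 1.0, 3.0
  }
}
```

GraphX's contribution is executing this kind of loop distributively and fault-tolerantly over a partitioned graph, rather than the sequential scan shown here.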
-Lightning-Fast Cluster Computing - <http://spark.incubator.apache.org/>
## Online Documentation
You can find the latest Spark documentation, including a programming
-guide, on the project webpage at <http://spark.incubator.apache.org/documentation.html>.
-This README file only contains basic setup instructions.
+guide, on the project webpage at
+<http://spark.incubator.apache.org/documentation.html>. This README
+file only contains basic setup instructions.
## Building
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The project is
-built using Simple Build Tool (SBT), which is packaged with it. To build
-Spark and its example programs, run:
+Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
+project is built using Simple Build Tool (SBT), which is packaged with
+it. To build Spark and its example programs, run:
sbt/sbt assembly
-Once you've built Spark, the easiest way to start using it is the shell:
+Once you've built Spark, the easiest way to start using it is the
+shell:
./spark-shell
Or, for the Python API, the Python shell (`./pyspark`).
-Spark also comes with several sample programs in the `examples` directory.
-To run one of them, use `./run-example <class> <params>`. For example:
+Spark also comes with several sample programs in the `examples`
+directory. To run one of them, use `./run-example <class>
+<params>`. For example:
./run-example org.apache.spark.examples.SparkLR local[2]
@@ -37,18 +78,19 @@ will run the Logistic Regression example locally on 2 CPUs.
Each of the example programs prints usage help if no params are given.
-All of the Spark samples take a `<master>` parameter that is the cluster URL
-to connect to. This can be a mesos:// or spark:// URL, or "local" to run
-locally with one thread, or "local[N]" to run locally with N threads.
+All of the Spark samples take a `<master>` parameter that is the
+cluster URL to connect to. This can be a mesos:// or spark:// URL, or
+"local" to run locally with one thread, or "local[N]" to run locally
+with N threads.
## A Note About Hadoop Versions
-Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
-storage systems. Because the protocols have changed in different versions of
-Hadoop, you must build Spark against the same version that your cluster runs.
-You can change the version by setting the `SPARK_HADOOP_VERSION` environment
-when building Spark.
+Spark uses the Hadoop core library to talk to HDFS and other
+Hadoop-supported storage systems. Because the protocols have changed
+in different versions of Hadoop, you must build Spark against the same
+version that your cluster runs. You can change the version by setting
+the `SPARK_HADOOP_VERSION` environment variable when building Spark.
For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:
@@ -68,17 +110,18 @@ with YARN, also set `SPARK_YARN=true`:
# Cloudera CDH 4.2.0 with MapReduce v2
$ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-For convenience, these variables may also be set through the `conf/spark-env.sh` file
-described below.
+For convenience, these variables may also be set through the
+`conf/spark-env.sh` file described below.
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.0.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
+When developing a Spark application, specify the Hadoop version by
+adding the "hadoop-client" artifact to your project's
+dependencies. For example, if you're using Hadoop 1.2.1 and build your
+application using SBT, add this entry to `libraryDependencies`:
"org.apache.hadoop" % "hadoop-client" % "1.2.1"
-If your project is built with Maven, add this to your POM file's `<dependencies>` section:
+If your project is built with Maven, add this to your POM file's
+`<dependencies>` section:
<dependency>
<groupId>org.apache.hadoop</groupId>
@@ -89,28 +132,19 @@ If your project is built with Maven, add this to your POM file's `<dependencies>
## Configuration
-Please refer to the [Configuration guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
+Please refer to the [Configuration
+guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
-## Apache Incubator Notice
-
-Apache Spark is an effort undergoing incubation at The Apache Software
-Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of
-all newly accepted projects until a further review indicates that the
-infrastructure, communications, and decision making process have stabilized in
-a manner consistent with other successful ASF projects. While incubation status
-is not necessarily a reflection of the completeness or stability of the code,
-it does indicate that the project has yet to be fully endorsed by the ASF.
-
-
-## Contributing to Spark
+## Contributing to GraphX
-Contributions via GitHub pull requests are gladly accepted from their original
-author. Along with any pull requests, please state that the contribution is
-your original work and that you license the work to the project under the
-project's open source license. Whether or not you state this explicitly, by
-submitting any copyrighted material via pull request, email, or other means
-you agree to license the material under the project's open source license and
-warrant that you have the legal authority to do so.
+Contributions via GitHub pull requests are gladly accepted from their
+original author. Along with any pull requests, please state that the
+contribution is your original work and that you license the work to
+the project under the project's open source license. Whether or not
+you state this explicitly, by submitting any copyrighted material via
+pull request, email, or other means you agree to license the material
+under the project's open source license and warrant that you have the
+legal authority to do so.
diff --git a/docs/img/data_parallel_vs_graph_parallel.png b/docs/img/data_parallel_vs_graph_parallel.png
new file mode 100644
index 0000000000..d9aa811466
--- /dev/null
+++ b/docs/img/data_parallel_vs_graph_parallel.png
Binary files differ
diff --git a/docs/img/edge-cut.png b/docs/img/edge-cut.png
new file mode 100644
index 0000000000..698f4ff181
--- /dev/null
+++ b/docs/img/edge-cut.png
Binary files differ
diff --git a/docs/img/graph_parallel.png b/docs/img/graph_parallel.png
new file mode 100644
index 0000000000..330be5567c
--- /dev/null
+++ b/docs/img/graph_parallel.png
Binary files differ
diff --git a/docs/img/tables_and_graphs.png b/docs/img/tables_and_graphs.png
new file mode 100644
index 0000000000..9af07d3081
--- /dev/null
+++ b/docs/img/tables_and_graphs.png
Binary files differ
diff --git a/docs/img/vertex-cut.png b/docs/img/vertex-cut.png
new file mode 100644
index 0000000000..0a508dcee9
--- /dev/null
+++ b/docs/img/vertex-cut.png
Binary files differ