author     Ankur Dave <ankurdave@apache.org>  2014-06-03 02:28:23 +0000
committer  Ankur Dave <ankurdave@apache.org>  2014-06-03 02:28:23 +0000
commit     638088923dbfe94215c4e0edfac8beb2e7b483f8 (patch)
tree       10237647e54dfa6329e7c54d459fb0464d4db9eb /site/docs
parent     43588164336771f787d0d2cdf79f0d50ac828af4 (diff)
Suggest workarounds for partitionBy in Spark 1.0.0 due to SPARK-1931
Applied PR #908 to the generated docs: https://github.com/apache/spark/pull/908
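
For reference, the workaround this patch documents, restated as a plain Scala sketch with the generated syntax-highlighting markup stripped. The imports, the example vertices/edges values, and the pre-existing SparkContext named sc are assumptions added to make the snippet self-contained; the partitionBy helper itself comes from the patch below.

import org.apache.spark.HashPartitioner
import org.apache.spark.SparkContext._   // pair-RDD implicits; needed outside the Spark shell in 1.0
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx._

// Work around SPARK-1931 (Graph.partitionBy is broken in Spark 1.0.0):
// repartition the edge RDD *before* constructing the graph.
def partitionBy[ED](edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = {
  val numPartitions = edges.partitions.size
  edges
    // Key each edge by the partition the strategy assigns to it
    .map(e => (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e))
    // Shuffle the edges to those partitions
    .partitionBy(new HashPartitioner(numPartitions))
    // Drop the keys; the partitioning is unchanged
    .mapPartitions(_.map(_._2), preservesPartitioning = true)
}

// Hypothetical inputs, assuming an existing SparkContext named sc.
val vertices: RDD[(Long, String)] = sc.parallelize(Seq((1L, "Alice"), (2L, "Bob")))
val edges: RDD[Edge[Int]] = sc.parallelize(Seq(Edge(1L, 2L, 1)))

// Instead of Graph(vertices, edges).partitionBy(PartitionStrategy.EdgePartition2D),
// which triggers SPARK-1931 in 1.0.0, partition the edges first:
val g = Graph(vertices, partitionBy(edges, PartitionStrategy.EdgePartition2D))

This reproduces the placement Graph.partitionBy would have chosen, because an Int key k hashes to itself, so HashPartitioner sends each keyed edge straight to partition k.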
Diffstat (limited to 'site/docs')
-rw-r--r--  site/docs/1.0.0/api/scala/org/apache/spark/graphx/Graph.html   25
-rw-r--r--  site/docs/1.0.0/graphx-programming-guide.html                  174
2 files changed, 111 insertions(+), 88 deletions(-)
diff --git a/site/docs/1.0.0/api/scala/org/apache/spark/graphx/Graph.html b/site/docs/1.0.0/api/scala/org/apache/spark/graphx/Graph.html
index 0c261ea00..c04282e7f 100644
--- a/site/docs/1.0.0/api/scala/org/apache/spark/graphx/Graph.html
+++ b/site/docs/1.0.0/api/scala/org/apache/spark/graphx/Graph.html
@@ -316,7 +316,7 @@ provided for a particular vertex in the graph, the map function receives <code>N
(vid, data, optDeg) <span class="kw">=&gt;</span> optDeg.getOrElse(<span class="num">0</span>)
}</pre></li></ol>
</div></dl></div>
- </li><li name="org.apache.spark.graphx.Graph#partitionBy" visbl="pub" data-isabs="true" fullComment="no" group="Ungrouped">
+ </li><li name="org.apache.spark.graphx.Graph#partitionBy" visbl="pub" data-isabs="true" fullComment="yes" group="Ungrouped">
<a id="partitionBy(partitionStrategy:org.apache.spark.graphx.PartitionStrategy):org.apache.spark.graphx.Graph[VD,ED]"></a>
<a id="partitionBy(PartitionStrategy):Graph[VD,ED]"></a>
<h4 class="signature">
@@ -328,7 +328,28 @@ provided for a particular vertex in the graph, the map function receives <code>N
<span class="name">partitionBy</span><span class="params">(<span name="partitionStrategy">partitionStrategy: <a href="PartitionStrategy.html" class="extype" name="org.apache.spark.graphx.PartitionStrategy">PartitionStrategy</a></span>)</span><span class="result">: <a href="" class="extype" name="org.apache.spark.graphx.Graph">Graph</a>[<span class="extype" name="org.apache.spark.graphx.Graph.VD">VD</span>, <span class="extype" name="org.apache.spark.graphx.Graph.ED">ED</span>]</span>
</span>
</h4>
- <p class="shortcomment cmt">Repartitions the edges in the graph according to <code>partitionStrategy</code>.</p>
+ <p class="shortcomment cmt">Repartitions the edges in the graph according to <code>partitionStrategy</code> (WARNING: broken in
+Spark 1.0.0).</p><div class="fullcomment"><div class="comment cmt"><p>Repartitions the edges in the graph according to <code>partitionStrategy</code> (WARNING: broken in
+Spark 1.0.0).</p><p>To use this function in Spark 1.0.0, either build the latest version of Spark from the master
+branch, or apply the following workaround:</p><pre><span class="cmt">// Define our own version of partitionBy to work around SPARK-1931</span>
+<span class="kw">import</span> org.apache.spark.HashPartitioner
+<span class="kw">def</span> partitionBy[ED](
+ edges: RDD[Edge[ED]], partitionStrategy: PartitionStrategy): RDD[Edge[ED]] = {
+ <span class="kw">val</span> numPartitions = edges.partitions.size
+ edges.map(e <span class="kw">=&gt;</span> (partitionStrategy.getPartition(e.srcId, e.dstId, numPartitions), e))
+ .partitionBy(<span class="kw">new</span> HashPartitioner(numPartitions))
+ .mapPartitions(_.map(_._2), preservesPartitioning = <span class="kw">true</span>)
+}
+
+<span class="kw">val</span> vertices = ...
+<span class="kw">val</span> edges = ...
+
+<span class="cmt">// Instead of:</span>
+<span class="kw">val</span> g = Graph(vertices, edges)
+ .partitionBy(PartitionStrategy.EdgePartition2D) <span class="cmt">// broken in Spark 1.0.0</span>
+
+<span class="cmt">// Use:</span>
+<span class="kw">val</span> g = Graph(vertices, partitionBy(edges, PartitionStrategy.EdgePartition2D))</pre></div></div>
</li><li name="org.apache.spark.graphx.Graph#persist" visbl="pub" data-isabs="true" fullComment="yes" group="Ungrouped">
<a id="persist(newLevel:org.apache.spark.storage.StorageLevel):org.apache.spark.graphx.Graph[VD,ED]"></a>
<a id="persist(StorageLevel):Graph[VD,ED]"></a>
diff --git a/site/docs/1.0.0/graphx-programming-guide.html b/site/docs/1.0.0/graphx-programming-guide.html
index d77c4b229..8fea9a5d5 100644
--- a/site/docs/1.0.0/graphx-programming-guide.html
+++ b/site/docs/1.0.0/graphx-programming-guide.html
@@ -126,6 +126,7 @@
<li><a href="#background-on-graph-parallel-computation">Background on Graph-Parallel Computation</a></li>
<li><a href="#graphx-replaces-the-spark-bagel-api">GraphX Replaces the Spark Bagel API</a></li>
<li><a href="#migrating-from-spark-091">Migrating from Spark 0.9.1</a></li>
+ <li><a href="#workaround-for-graphpartitionby-in-spark-100">Workaround for <code>Graph.partitionBy</code> in Spark 1.0.0</a></li>
</ul>
</li>
<li><a href="#getting-started">Getting Started</a></li>
@@ -238,6 +239,29 @@ explore the new GraphX API and comment on issues that may complicate the transit
<p>GraphX in Spark 1.0.0 contains one user-facing interface change from Spark 0.9.1. <a href="api/scala/index.html#org.apache.spark.graphx.EdgeRDD"><code>EdgeRDD</code></a> may now store adjacent vertex attributes to construct the triplets, so it has gained a type parameter. The edges of a graph of type <code>Graph[VD, ED]</code> are of type <code>EdgeRDD[ED, VD]</code> rather than <code>EdgeRDD[ED]</code>.</p>
+<h2 id="workaround-for-graphpartitionby-in-spark-100">Workaround for <code>Graph.partitionBy</code> in Spark 1.0.0</h2>
+<p><a name="partitionBy_workaround"></a></p>
+
+<p>The <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a> operator allows users to choose the graph partitioning strategy, but due to <a href="https://issues.apache.org/jira/browse/SPARK-1931">SPARK-1931</a>, this method is broken in Spark 1.0.0. We encourage users to build the latest version of Spark from the master branch, which contains a fix. Alternatively, a workaround is to partition the edges before constructing the graph, as follows:</p>
+
+<div class="highlight"><pre><code class="scala"><span class="c1">// Define our own version of partitionBy to work around SPARK-1931</span>
+<span class="k">import</span> <span class="nn">org.apache.spark.HashPartitioner</span>
+<span class="k">def</span> <span class="n">partitionBy</span><span class="o">[</span><span class="kt">ED</span><span class="o">](</span><span class="n">edges</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Edge</span><span class="o">[</span><span class="kt">ED</span><span class="o">]],</span> <span class="n">partitionStrategy</span><span class="k">:</span> <span class="kt">PartitionStrategy</span><span class="o">)</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Edge</span><span class="o">[</span><span class="kt">ED</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span>
+ <span class="k">val</span> <span class="n">numPartitions</span> <span class="k">=</span> <span class="n">edges</span><span class="o">.</span><span class="n">partitions</span><span class="o">.</span><span class="n">size</span>
+ <span class="n">edges</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="n">partitionStrategy</span><span class="o">.</span><span class="n">getPartition</span><span class="o">(</span><span class="n">e</span><span class="o">.</span><span class="n">srcId</span><span class="o">,</span> <span class="n">e</span><span class="o">.</span><span class="n">dstId</span><span class="o">,</span> <span class="n">numPartitions</span><span class="o">),</span> <span class="n">e</span><span class="o">))</span>
+ <span class="o">.</span><span class="n">partitionBy</span><span class="o">(</span><span class="k">new</span> <span class="nc">HashPartitioner</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">))</span>
+ <span class="o">.</span><span class="n">mapPartitions</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">_2</span><span class="o">),</span> <span class="n">preservesPartitioning</span> <span class="k">=</span> <span class="kc">true</span><span class="o">)</span>
+<span class="o">}</span>
+
+<span class="k">val</span> <span class="n">vertices</span> <span class="k">=</span> <span class="o">...</span>
+<span class="k">val</span> <span class="n">edges</span> <span class="k">=</span> <span class="o">...</span>
+
+<span class="c1">// Instead of:</span>
+<span class="k">val</span> <span class="n">g</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span><span class="n">vertices</span><span class="o">,</span> <span class="n">edges</span><span class="o">).</span><span class="n">partitionBy</span><span class="o">(</span><span class="nc">PartitionStrategy</span><span class="o">.</span><span class="nc">EdgePartition2D</span><span class="o">)</span> <span class="c1">// broken in Spark 1.0.0</span>
+
+<span class="c1">// Use:</span>
+<span class="k">val</span> <span class="n">g</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span><span class="n">vertices</span><span class="o">,</span> <span class="n">partitionBy</span><span class="o">(</span><span class="n">edges</span><span class="o">,</span> <span class="nc">PartitionStrategy</span><span class="o">.</span><span class="nc">EdgePartition2D</span><span class="o">))</span></code></pre></div>
+
<h1 id="getting-started">Getting Started</h1>
<p>To get started you first need to import Spark and GraphX into your project, as follows:</p>
@@ -245,8 +269,7 @@ explore the new GraphX API and comment on issues that may complicate the transit
<div class="highlight"><pre><code class="scala"><span class="k">import</span> <span class="nn">org.apache.spark._</span>
<span class="k">import</span> <span class="nn">org.apache.spark.graphx._</span>
<span class="c1">// To make some of the examples work we will also need RDD</span>
-<span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span>
-</code></pre></div>
+<span class="k">import</span> <span class="nn">org.apache.spark.rdd.RDD</span></code></pre></div>
<p>If you are not using the Spark shell you will also need a <code>SparkContext</code>. To learn more about
getting started with Spark refer to the <a href="quick-start.html">Spark Quick Start Guide</a>.</p>
@@ -268,7 +291,7 @@ are the types of the objects associated with each vertex and edge respectively.<
<blockquote>
<p>GraphX optimizes the representation of vertex and edge types when they are plain old data-types
-(e.g., int, double, etc&#8230;) reducing the in memory footprint by storing them in specialized
+(e.g., int, double, etc…) reducing the in memory footprint by storing them in specialized
arrays.</p>
</blockquote>
@@ -280,8 +303,7 @@ bipartite graph we might do the following:</p>
<span class="k">case</span> <span class="k">class</span> <span class="nc">UserProperty</span><span class="o">(</span><span class="k">val</span> <span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">VertexProperty</span>
<span class="k">case</span> <span class="k">class</span> <span class="nc">ProductProperty</span><span class="o">(</span><span class="k">val</span> <span class="n">name</span><span class="k">:</span> <span class="kt">String</span><span class="o">,</span> <span class="k">val</span> <span class="n">price</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span> <span class="k">extends</span> <span class="nc">VertexProperty</span>
<span class="c1">// The graph might then have the type:</span>
-<span class="k">var</span> <span class="n">graph</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VertexProperty</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="kc">null</span>
-</code></pre></div>
+<span class="k">var</span> <span class="n">graph</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VertexProperty</span>, <span class="kt">String</span><span class="o">]</span> <span class="k">=</span> <span class="kc">null</span></code></pre></div>
<p>Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Changes to the values or
structure of the graph are accomplished by producing a new graph with the desired changes. Note
@@ -297,8 +319,7 @@ the vertices and edges of the graph:</p>
<div class="highlight"><pre><code class="scala"><span class="k">class</span> <span class="nc">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span> <span class="o">{</span>
<span class="k">val</span> <span class="n">vertices</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">VD</span><span class="o">]</span>
<span class="k">val</span> <span class="n">edges</span><span class="k">:</span> <span class="kt">EdgeRDD</span><span class="o">[</span><span class="kt">ED</span>, <span class="kt">VD</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>The classes <code>VertexRDD[VD]</code> and <code>EdgeRDD[ED, VD]</code> extend and are optimized versions of <code>RDD[(VertexID,
VD)]</code> and <code>RDD[Edge[ED]]</code> respectively. Both <code>VertexRDD[VD]</code> and <code>EdgeRDD[ED, VD]</code> provide additional
@@ -320,8 +341,7 @@ with a string describing the relationships between collaborators:</p>
<p>The resulting graph would have the type signature:</p>
-<div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">userGraph</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">)</span>, <span class="kt">String</span><span class="o">]</span>
-</code></pre></div>
+<div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">userGraph</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">)</span>, <span class="kt">String</span><span class="o">]</span></code></pre></div>
<p>There are numerous ways to construct a property graph from raw files, RDDs, and even synthetic
generators and these are discussed in more detail in the section on
@@ -342,8 +362,7 @@ code constructs a graph from a collection of RDDs:</p>
<span class="c1">// Define a default user in case there are relationship with missing user</span>
<span class="k">val</span> <span class="n">defaultUser</span> <span class="k">=</span> <span class="o">(</span><span class="s">&quot;John Doe&quot;</span><span class="o">,</span> <span class="s">&quot;Missing&quot;</span><span class="o">)</span>
<span class="c1">// Build the initial Graph</span>
-<span class="k">val</span> <span class="n">graph</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span><span class="n">users</span><span class="o">,</span> <span class="n">relationships</span><span class="o">,</span> <span class="n">defaultUser</span><span class="o">)</span>
-</code></pre></div>
+<span class="k">val</span> <span class="n">graph</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span><span class="n">users</span><span class="o">,</span> <span class="n">relationships</span><span class="o">,</span> <span class="n">defaultUser</span><span class="o">)</span></code></pre></div>
<p>In the above example we make use of the <a href="api/scala/index.html#org.apache.spark.graphx.Edge"><code>Edge</code></a> case class. Edges have a <code>srcId</code> and a
<code>dstId</code> corresponding to the source and destination vertex identifiers. In addition, the <code>Edge</code>
@@ -356,8 +375,7 @@ and <code>graph.edges</code> members respectively.</p>
<span class="c1">// Count all users which are postdocs</span>
<span class="n">graph</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">filter</span> <span class="o">{</span> <span class="k">case</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="o">(</span><span class="n">name</span><span class="o">,</span> <span class="n">pos</span><span class="o">))</span> <span class="k">=&gt;</span> <span class="n">pos</span> <span class="o">==</span> <span class="s">&quot;postdoc&quot;</span> <span class="o">}.</span><span class="n">count</span>
<span class="c1">// Count all the edges where src &gt; dst</span>
-<span class="n">graph</span><span class="o">.</span><span class="n">edges</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span class="n">e</span><span class="o">.</span><span class="n">srcId</span> <span class="o">&gt;</span> <span class="n">e</span><span class="o">.</span><span class="n">dstId</span><span class="o">).</span><span class="n">count</span>
-</code></pre></div>
+<span class="n">graph</span><span class="o">.</span><span class="n">edges</span><span class="o">.</span><span class="n">filter</span><span class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span class="n">e</span><span class="o">.</span><span class="n">srcId</span> <span class="o">&gt;</span> <span class="n">e</span><span class="o">.</span><span class="n">dstId</span><span class="o">).</span><span class="n">count</span></code></pre></div>
<blockquote>
<p>Note that <code>graph.vertices</code> returns an <code>VertexRDD[(String, String)]</code> which extends
@@ -365,8 +383,7 @@ and <code>graph.edges</code> members respectively.</p>
tuple. On the other hand, <code>graph.edges</code> returns an <code>EdgeRDD</code> containing <code>Edge[String]</code> objects.
We could have also used the case class type constructor as in the following:</p>
- <div class="highlight"><pre><code class="scala"><span class="n">graph</span><span class="o">.</span><span class="n">edges</span><span class="o">.</span><span class="n">filter</span> <span class="o">{</span> <span class="k">case</span> <span class="nc">Edge</span><span class="o">(</span><span class="n">src</span><span class="o">,</span> <span class="n">dst</span><span class="o">,</span> <span class="n">prop</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">src</span> <span class="o">&gt;</span> <span class="n">dst</span> <span class="o">}.</span><span class="n">count</span>
-</code></pre></div>
+ <div class="highlight"><pre><code class="scala"><span class="n">graph</span><span class="o">.</span><span class="n">edges</span><span class="o">.</span><span class="n">filter</span> <span class="o">{</span> <span class="k">case</span> <span class="nc">Edge</span><span class="o">(</span><span class="n">src</span><span class="o">,</span> <span class="n">dst</span><span class="o">,</span> <span class="n">prop</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">src</span> <span class="o">&gt;</span> <span class="n">dst</span> <span class="o">}.</span><span class="n">count</span></code></pre></div>
</blockquote>
<p>In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
@@ -376,8 +393,7 @@ The triplet view logically joins the vertex and edge properties yielding an
<div class="highlight"><pre><code class="sql"><span class="k">SELECT</span> <span class="n">src</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">dst</span><span class="p">.</span><span class="n">id</span><span class="p">,</span> <span class="n">src</span><span class="p">.</span><span class="n">attr</span><span class="p">,</span> <span class="n">e</span><span class="p">.</span><span class="n">attr</span><span class="p">,</span> <span class="n">dst</span><span class="p">.</span><span class="n">attr</span>
<span class="k">FROM</span> <span class="n">edges</span> <span class="k">AS</span> <span class="n">e</span> <span class="k">LEFT</span> <span class="k">JOIN</span> <span class="n">vertices</span> <span class="k">AS</span> <span class="n">src</span><span class="p">,</span> <span class="n">vertices</span> <span class="k">AS</span> <span class="n">dst</span>
-<span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">srcId</span> <span class="o">=</span> <span class="n">src</span><span class="p">.</span><span class="n">Id</span> <span class="k">AND</span> <span class="n">e</span><span class="p">.</span><span class="n">dstId</span> <span class="o">=</span> <span class="n">dst</span><span class="p">.</span><span class="n">Id</span>
-</code></pre></div>
+<span class="k">ON</span> <span class="n">e</span><span class="p">.</span><span class="n">srcId</span> <span class="o">=</span> <span class="n">src</span><span class="p">.</span><span class="n">Id</span> <span class="k">AND</span> <span class="n">e</span><span class="p">.</span><span class="n">dstId</span> <span class="o">=</span> <span class="n">dst</span><span class="p">.</span><span class="n">Id</span></code></pre></div>
<p>or graphically as:</p>
@@ -395,8 +411,7 @@ triplet view of a graph to render a collection of strings describing relationshi
<span class="k">val</span> <span class="n">facts</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">String</span><span class="o">]</span> <span class="k">=</span>
<span class="n">graph</span><span class="o">.</span><span class="n">triplets</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">triplet</span> <span class="k">=&gt;</span>
<span class="n">triplet</span><span class="o">.</span><span class="n">srcAttr</span><span class="o">.</span><span class="n">_1</span> <span class="o">+</span> <span class="s">&quot; is the &quot;</span> <span class="o">+</span> <span class="n">triplet</span><span class="o">.</span><span class="n">attr</span> <span class="o">+</span> <span class="s">&quot; of &quot;</span> <span class="o">+</span> <span class="n">triplet</span><span class="o">.</span><span class="n">dstAttr</span><span class="o">.</span><span class="n">_1</span><span class="o">)</span>
-<span class="n">facts</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">facts</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span></code></pre></div>
<h1 id="graph-operators">Graph Operators</h1>
@@ -410,8 +425,7 @@ compute the in-degree of each vertex (defined in <code>GraphOps</code>) by the f
<div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">graph</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[(</span><span class="kt">String</span>, <span class="kt">String</span><span class="o">)</span>, <span class="kt">String</span><span class="o">]</span>
<span class="c1">// Use the implicit GraphOps.inDegrees operator</span>
-<span class="k">val</span> <span class="n">inDegrees</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">inDegrees</span>
-</code></pre></div>
+<span class="k">val</span> <span class="n">inDegrees</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Int</span><span class="o">]</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">inDegrees</span></code></pre></div>
<p>The reason for differentiating between core graph operations and <a href="api/scala/index.html#org.apache.spark.graphx.GraphOps"><code>GraphOps</code></a> is to be
able to support different graph representations in the future. Each graph representation must
@@ -442,6 +456,8 @@ operations.</p>
<span class="k">def</span> <span class="n">cache</span><span class="o">()</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">unpersistVertices</span><span class="o">(</span><span class="n">blocking</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">true</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="c1">// Change the partitioning heuristic ============================================================</span>
+ <span class="c1">// - WARNING: partitionBy is broken in Spark 1.0.0 due to SPARK-1931.</span>
+ <span class="c1">// See the section &quot;Workaround for Graph.partitionBy in Spark 1.0.0&quot; above.</span>
<span class="k">def</span> <span class="n">partitionBy</span><span class="o">(</span><span class="n">partitionStrategy</span><span class="k">:</span> <span class="kt">PartitionStrategy</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="c1">// Transform vertex and edge attributes ==========================================================</span>
<span class="k">def</span> <span class="n">mapVertices</span><span class="o">[</span><span class="kt">VD2</span><span class="o">](</span><span class="n">map</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexID</span><span class="o">,</span> <span class="kt">VD</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">VD2</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD2</span>, <span class="kt">ED</span><span class="o">]</span>
@@ -482,8 +498,7 @@ operations.</p>
<span class="k">def</span> <span class="n">connectedComponents</span><span class="o">()</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VertexID</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">triangleCount</span><span class="o">()</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">Int</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">stronglyConnectedComponents</span><span class="o">(</span><span class="n">numIter</span><span class="k">:</span> <span class="kt">Int</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VertexID</span>, <span class="kt">ED</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<h2 id="property-operators">Property Operators</h2>
@@ -494,8 +509,7 @@ graph contains the following:</p>
<span class="k">def</span> <span class="n">mapVertices</span><span class="o">[</span><span class="kt">VD2</span><span class="o">](</span><span class="n">map</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">VD</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">VD2</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD2</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">mapEdges</span><span class="o">[</span><span class="kt">ED2</span><span class="o">](</span><span class="n">map</span><span class="k">:</span> <span class="kt">Edge</span><span class="o">[</span><span class="kt">ED</span><span class="o">]</span> <span class="k">=&gt;</span> <span class="nc">ED2</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED2</span><span class="o">]</span>
<span class="k">def</span> <span class="n">mapTriplets</span><span class="o">[</span><span class="kt">ED2</span><span class="o">](</span><span class="n">map</span><span class="k">:</span> <span class="kt">EdgeTriplet</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span> <span class="k">=&gt;</span> <span class="nc">ED2</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED2</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>Each of these operators yields a new graph with the vertex or edge properties modified by the user
defined <code>map</code> function.</p>
@@ -507,15 +521,13 @@ following snippets are logically equivalent, but the first one does not preserve
indices and would not benefit from the GraphX system optimizations:</p>
<div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">newVertices</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">map</span> <span class="o">{</span> <span class="k">case</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">mapUdf</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">))</span> <span class="o">}</span>
-<span class="k">val</span> <span class="n">newGraph</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span><span class="n">newVertices</span><span class="o">,</span> <span class="n">graph</span><span class="o">.</span><span class="n">edges</span><span class="o">)</span>
-</code></pre></div>
+<span class="k">val</span> <span class="n">newGraph</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span><span class="n">newVertices</span><span class="o">,</span> <span class="n">graph</span><span class="o">.</span><span class="n">edges</span><span class="o">)</span></code></pre></div>
</blockquote>
<blockquote>
<p>Instead, use <a href="api/scala/index.html#org.apache.spark.graphx.Graph@mapVertices[VD2]((VertexId,VD)⇒VD2)(ClassTag[VD2]):Graph[VD2,ED]"><code>mapVertices</code></a> to preserve the indices:</p>
- <div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">newGraph</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">mapVertices</span><span class="o">((</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">mapUdf</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">))</span>
-</code></pre></div>
+ <div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">newGraph</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">mapVertices</span><span class="o">((</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">mapUdf</span><span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">))</span></code></pre></div>
</blockquote>
<p>These operators are often used to initialize the graph for a particular computation or project away
@@ -528,8 +540,7 @@ unnecessary properties. For example, given a graph with the out-degrees as the
<span class="c1">// Construct a graph where each edge contains the weight</span>
<span class="c1">// and each vertex is the initial PageRank</span>
<span class="k">val</span> <span class="n">outputGraph</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">Double</span>, <span class="kt">Double</span><span class="o">]</span> <span class="k">=</span>
- <span class="n">inputGraph</span><span class="o">.</span><span class="n">mapTriplets</span><span class="o">(</span><span class="n">triplet</span> <span class="k">=&gt;</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">triplet</span><span class="o">.</span><span class="n">srcAttr</span><span class="o">).</span><span class="n">mapVertices</span><span class="o">((</span><span class="n">id</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="mf">1.0</span><span class="o">)</span>
-</code></pre></div>
+ <span class="n">inputGraph</span><span class="o">.</span><span class="n">mapTriplets</span><span class="o">(</span><span class="n">triplet</span> <span class="k">=&gt;</span> <span class="mf">1.0</span> <span class="o">/</span> <span class="n">triplet</span><span class="o">.</span><span class="n">srcAttr</span><span class="o">).</span><span class="n">mapVertices</span><span class="o">((</span><span class="n">id</span><span class="o">,</span> <span class="k">_</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="mf">1.0</span><span class="o">)</span></code></pre></div>
<h2 id="structural-operators">Structural Operators</h2>
<p><a name="structural_operators"></a></p>
@@ -543,8 +554,7 @@ add more in the future. The following is a list of the basic structural operato
<span class="n">vpred</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">VD</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">Boolean</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">mask</span><span class="o">[</span><span class="kt">VD2</span>, <span class="kt">ED2</span><span class="o">](</span><span class="n">other</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD2</span>, <span class="kt">ED2</span><span class="o">])</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">groupEdges</span><span class="o">(</span><span class="n">merge</span><span class="k">:</span> <span class="o">(</span><span class="kt">ED</span><span class="o">,</span> <span class="kt">ED</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">ED</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>,<span class="kt">ED</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>The <a href="api/scala/index.html#org.apache.spark.graphx.Graph@reverse:Graph[VD,ED]"><code>reverse</code></a> operator returns a new graph with all the edge directions reversed.
This can be useful when, for example, trying to compute the inverse PageRank. Because the reverse
@@ -582,8 +592,7 @@ interest or eliminate broken links. For example in the following code we remove
<span class="n">validGraph</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span>
<span class="n">validGraph</span><span class="o">.</span><span class="n">triplets</span><span class="o">.</span><span class="n">map</span><span class="o">(</span>
<span class="n">triplet</span> <span class="k">=&gt;</span> <span class="n">triplet</span><span class="o">.</span><span class="n">srcAttr</span><span class="o">.</span><span class="n">_1</span> <span class="o">+</span> <span class="s">&quot; is the &quot;</span> <span class="o">+</span> <span class="n">triplet</span><span class="o">.</span><span class="n">attr</span> <span class="o">+</span> <span class="s">&quot; of &quot;</span> <span class="o">+</span> <span class="n">triplet</span><span class="o">.</span><span class="n">dstAttr</span><span class="o">.</span><span class="n">_1</span>
- <span class="o">).</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span>
-</code></pre></div>
+ <span class="o">).</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span></code></pre></div>
<blockquote>
<p>Note in the above example only the vertex predicate is provided. The <code>subgraph</code> operator defaults
@@ -601,8 +610,7 @@ the answer to the valid subgraph.</p>
<span class="c1">// Remove missing vertices as well as the edges to connected to them</span>
<span class="k">val</span> <span class="n">validGraph</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">subgraph</span><span class="o">(</span><span class="n">vpred</span> <span class="k">=</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">attr</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">attr</span><span class="o">.</span><span class="n">_2</span> <span class="o">!=</span> <span class="s">&quot;Missing&quot;</span><span class="o">)</span>
<span class="c1">// Restrict the answer to the valid subgraph</span>
-<span class="k">val</span> <span class="n">validCCGraph</span> <span class="k">=</span> <span class="n">ccGraph</span><span class="o">.</span><span class="n">mask</span><span class="o">(</span><span class="n">validGraph</span><span class="o">)</span>
-</code></pre></div>
+<span class="k">val</span> <span class="n">validCCGraph</span> <span class="k">=</span> <span class="n">ccGraph</span><span class="o">.</span><span class="n">mask</span><span class="o">(</span><span class="n">validGraph</span><span class="o">)</span></code></pre></div>
<p>The <a href="api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]"><code>groupEdges</code></a> operator merges parallel edges (i.e., duplicate edges between
pairs of vertices) in the multigraph. In many numerical applications, parallel edges can be <em>added</em>
@@ -621,8 +629,7 @@ using the <em>join</em> operators. Below we list the key join operators:</p>
<span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span>
<span class="k">def</span> <span class="n">outerJoinVertices</span><span class="o">[</span><span class="kt">U</span>, <span class="kt">VD2</span><span class="o">](</span><span class="n">table</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[(</span><span class="kt">VertexId</span>, <span class="kt">U</span><span class="o">)])(</span><span class="n">map</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">VD</span><span class="o">,</span> <span class="nc">Option</span><span class="o">[</span><span class="kt">U</span><span class="o">])</span> <span class="k">=&gt;</span> <span class="nc">VD2</span><span class="o">)</span>
<span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD2</span>, <span class="kt">ED</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>The <a href="api/scala/index.html#org.apache.spark.graphx.GraphOps@joinVertices[U](RDD[(VertexId,U)])((VertexId,VD,U)⇒VD)(ClassTag[U]):Graph[VD,ED]"><code>joinVertices</code></a> operator joins the vertices with the input RDD and
returns a new graph with the vertex properties obtained by applying the user defined <code>map</code> function
@@ -638,8 +645,7 @@ also <em>pre-index</em> the resulting values to substantially accelerate the sub
<span class="k">val</span> <span class="n">uniqueCosts</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span>
<span class="n">graph</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">aggregateUsingIndex</span><span class="o">(</span><span class="n">nonUnique</span><span class="o">,</span> <span class="o">(</span><span class="n">a</span><span class="o">,</span><span class="n">b</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span>
<span class="k">val</span> <span class="n">joinedGraph</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">joinVertices</span><span class="o">(</span><span class="n">uniqueCosts</span><span class="o">)(</span>
- <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">oldCost</span><span class="o">,</span> <span class="n">extraCost</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">oldCost</span> <span class="o">+</span> <span class="n">extraCost</span><span class="o">)</span>
-</code></pre></div>
+ <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">oldCost</span><span class="o">,</span> <span class="n">extraCost</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">oldCost</span> <span class="o">+</span> <span class="n">extraCost</span><span class="o">)</span></code></pre></div>
</blockquote>
<p>The more general <a href="api/scala/index.html#org.apache.spark.graphx.Graph@outerJoinVertices[U,VD2](RDD[(VertexId,U)])((VertexId,VD,Option[U])⇒VD2)(ClassTag[U],ClassTag[VD2]):Graph[VD2,ED]"><code>outerJoinVertices</code></a> behaves similarly to <code>joinVertices</code>
@@ -654,8 +660,7 @@ vertex properties with their <code>outDegree</code>.</p>
<span class="k">case</span> <span class="nc">Some</span><span class="o">(</span><span class="n">outDeg</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">outDeg</span>
<span class="k">case</span> <span class="nc">None</span> <span class="k">=&gt;</span> <span class="mi">0</span> <span class="c1">// No outDegree means zero outDegree</span>
<span class="o">}</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<blockquote>
<p>You may have noticed the multiple parameter lists (e.g., <code>f(a)(b)</code>) curried function pattern used
@@ -664,8 +669,7 @@ that type inference on <code>b</code> would not depend on <code>a</code>. As a
provide type annotation for the user defined function:</p>
<div class="highlight"><pre><code class="scala"><span class="k">val</span> <span class="n">joinedGraph</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">joinVertices</span><span class="o">(</span><span class="n">uniqueCosts</span><span class="o">,</span>
- <span class="o">(</span><span class="n">id</span><span class="k">:</span> <span class="kt">VertexID</span><span class="o">,</span> <span class="n">oldCost</span><span class="k">:</span> <span class="kt">Double</span><span class="o">,</span> <span class="n">extraCost</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">oldCost</span> <span class="o">+</span> <span class="n">extraCost</span><span class="o">)</span>
-</code></pre></div>
+ <span class="o">(</span><span class="n">id</span><span class="k">:</span> <span class="kt">VertexID</span><span class="o">,</span> <span class="n">oldCost</span><span class="k">:</span> <span class="kt">Double</span><span class="o">,</span> <span class="n">extraCost</span><span class="k">:</span> <span class="kt">Double</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">oldCost</span> <span class="o">+</span> <span class="n">extraCost</span><span class="o">)</span></code></pre></div>
</blockquote>
<h2 id="neighborhood-aggregation">Neighborhood Aggregation</h2>
@@ -687,8 +691,7 @@ PageRank Value, shortest path to the source, and smallest reachable vertex id).<
<span class="n">map</span><span class="k">:</span> <span class="kt">EdgeTriplet</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span> <span class="k">=&gt;</span> <span class="nc">Iterator</span><span class="o">[(</span><span class="kt">VertexId</span>, <span class="kt">A</span><span class="o">)],</span>
<span class="n">reduce</span><span class="k">:</span> <span class="o">(</span><span class="kt">A</span><span class="o">,</span> <span class="kt">A</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">A</span><span class="o">)</span>
<span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">A</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>The <a href="api/scala/index.html#org.apache.spark.graphx.Graph@mapReduceTriplets[A](mapFunc:org.apache.spark.graphx.EdgeTriplet[VD,ED]=&gt;Iterator[(org.apache.spark.graphx.VertexId,A)],reduceFunc:(A,A)=&gt;A,activeSetOpt:Option[(org.apache.spark.graphx.VertexRDD[_],org.apache.spark.graphx.EdgeDirection)])(implicitevidence$10:scala.reflect.ClassTag[A]):org.apache.spark.graphx.VertexRDD[A]"><code>mapReduceTriplets</code></a> operator takes a user defined map function which
is applied to each triplet and can yield <em>messages</em> destined to either (none or both) vertices in
@@ -705,8 +708,7 @@ receive a message are not included in the returned <code>VertexRDD</code>.</p>
vertices in the provided <code>VertexRDD</code>: </p>
-<div class="highlight"><pre><code class="scala"> <span class="n">activeSetOpt</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[(</span><span class="kt">VertexRDD</span><span class="o">[</span><span class="k">_</span><span class="o">]</span>, <span class="kt">EdgeDirection</span><span class="o">)]</span> <span class="k">=</span> <span class="nc">None</span>
-</code></pre></div>
+<div class="highlight"><pre><code class="scala"><span class="n">activeSetOpt</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[(</span><span class="kt">VertexRDD</span><span class="o">[</span><span class="k">_</span><span class="o">]</span>, <span class="kt">EdgeDirection</span><span class="o">)]</span> <span class="k">=</span> <span class="nc">None</span></code></pre></div>
<p>The EdgeDirection specifies which edges adjacent to the vertex set are included in the map
@@ -748,8 +750,7 @@ more senior followers of each user.</p>
<span class="k">val</span> <span class="n">avgAgeOfOlderFollowers</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span>
<span class="n">olderFollowers</span><span class="o">.</span><span class="n">mapValues</span><span class="o">(</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="n">value</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">value</span> <span class="k">match</span> <span class="o">{</span> <span class="k">case</span> <span class="o">(</span><span class="n">count</span><span class="o">,</span> <span class="n">totalAge</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">totalAge</span> <span class="o">/</span> <span class="n">count</span> <span class="o">}</span> <span class="o">)</span>
<span class="c1">// Display the results</span>
-<span class="n">avgAgeOfOlderFollowers</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">avgAgeOfOlderFollowers</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">foreach</span><span class="o">(</span><span class="n">println</span><span class="o">(</span><span class="k">_</span><span class="o">))</span></code></pre></div>
<blockquote>
<p>Note that the <code>mapReduceTriplets</code> operation performs optimally when the messages (and the sums of
@@ -773,8 +774,7 @@ compute the max in, out, and total degrees:</p>
<span class="c1">// Compute the max degrees</span>
<span class="k">val</span> <span class="n">maxInDegree</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">inDegrees</span><span class="o">.</span><span class="n">reduce</span><span class="o">(</span><span class="n">max</span><span class="o">)</span>
<span class="k">val</span> <span class="n">maxOutDegree</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">outDegrees</span><span class="o">.</span><span class="n">reduce</span><span class="o">(</span><span class="n">max</span><span class="o">)</span>
-<span class="k">val</span> <span class="n">maxDegrees</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">degrees</span><span class="o">.</span><span class="n">reduce</span><span class="o">(</span><span class="n">max</span><span class="o">)</span>
-</code></pre></div>
+<span class="k">val</span> <span class="n">maxDegrees</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">Int</span><span class="o">)</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">degrees</span><span class="o">.</span><span class="n">reduce</span><span class="o">(</span><span class="n">max</span><span class="o">)</span></code></pre></div>
<h3 id="collecting-neighbors">Collecting Neighbors</h3>
@@ -786,8 +786,7 @@ attributes at each vertex. This can be easily accomplished using the
<div class="highlight"><pre><code class="scala"><span class="k">class</span> <span class="nc">GraphOps</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">ED</span><span class="o">]</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">collectNeighborIds</span><span class="o">(</span><span class="n">edgeDirection</span><span class="k">:</span> <span class="kt">EdgeDirection</span><span class="o">)</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Array</span><span class="o">[</span><span class="kt">VertexId</span><span class="o">]]</span>
<span class="k">def</span> <span class="n">collectNeighbors</span><span class="o">(</span><span class="n">edgeDirection</span><span class="k">:</span> <span class="kt">EdgeDirection</span><span class="o">)</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span> <span class="kt">Array</span><span class="o">[(</span><span class="kt">VertexId</span>, <span class="kt">VD</span><span class="o">)]</span> <span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<blockquote>
<p>Note that these operators can be quite costly as they duplicate information and require
@@ -862,8 +861,7 @@ of its implementation (note calls to graph.cache have been removed):</p>
<span class="o">}</span>
<span class="n">g</span>
<span class="o">}</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>Notice that Pregel takes two argument lists (i.e., <code>graph.pregel(list1)(list2)</code>). The first
argument list contains configuration parameters including the initial message, the maximum number of
@@ -894,13 +892,12 @@ shortest path in the following example.</p>
<span class="o">},</span>
<span class="o">(</span><span class="n">a</span><span class="o">,</span><span class="n">b</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">math</span><span class="o">.</span><span class="n">min</span><span class="o">(</span><span class="n">a</span><span class="o">,</span><span class="n">b</span><span class="o">)</span> <span class="c1">// Merge Message</span>
<span class="o">)</span>
-<span class="n">println</span><span class="o">(</span><span class="n">sssp</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">println</span><span class="o">(</span><span class="n">sssp</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">collect</span><span class="o">.</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span></code></pre></div>
<h1 id="graph-builders">Graph Builders</h1>
<p><a name="graph_builders"></a></p>
-<p>GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. None of the graph builders repartitions the graph&#8217;s edges by default; instead, edges are left in their default partitions (such as their original blocks in HDFS). <a href="api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]"><code>Graph.groupEdges</code></a> requires the graph to be repartitioned because it assumes identical edges will be colocated on the same partition, so you must call <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a> before calling <code>groupEdges</code>.</p>
+<p>GraphX provides several ways of building a graph from a collection of vertices and edges in an RDD or on disk. None of the graph builders repartitions the graph’s edges by default; instead, edges are left in their default partitions (such as their original blocks in HDFS). <a href="api/scala/index.html#org.apache.spark.graphx.Graph@groupEdges((ED,ED)⇒ED):Graph[VD,ED]"><code>Graph.groupEdges</code></a> requires the graph to be repartitioned because it assumes identical edges will be colocated on the same partition, so you must call <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a> before calling <code>groupEdges</code>. However, note that <code>Graph.partitionBy</code> is broken in Spark 1.0.0 due to <a href="https://issues.apache.org/jira/browse/SPARK-1931">SPARK-1931</a>; see the <a href="#partitionBy_workaround">suggested workarounds</a> above.</p>
<div class="highlight"><pre><code class="scala"><span class="k">object</span> <span class="nc">GraphLoader</span> <span class="o">{</span>
<span class="k">def</span> <span class="n">edgeListFile</span><span class="o">(</span>
@@ -909,8 +906,7 @@ shortest path in the following example.</p>
<span class="n">canonicalOrientation</span><span class="k">:</span> <span class="kt">Boolean</span> <span class="o">=</span> <span class="kc">false</span><span class="o">,</span>
<span class="n">minEdgePartitions</span><span class="k">:</span> <span class="kt">Int</span> <span class="o">=</span> <span class="mi">1</span><span class="o">)</span>
<span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">Int</span>, <span class="kt">Int</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p><a href="api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]"><code>GraphLoader.edgeListFile</code></a> provides a way to load a graph from a list of edges on disk. It parses an adjacency list of (source vertex ID, destination vertex ID) pairs of the following form, skipping comment lines that begin with <code>#</code>:</p>
@@ -938,8 +934,7 @@ shortest path in the following example.</p>
<span class="n">defaultValue</span><span class="k">:</span> <span class="kt">VD</span><span class="o">,</span>
<span class="n">uniqueEdges</span><span class="k">:</span> <span class="kt">Option</span><span class="o">[</span><span class="kt">PartitionStrategy</span><span class="o">]</span> <span class="k">=</span> <span class="nc">None</span><span class="o">)</span><span class="k">:</span> <span class="kt">Graph</span><span class="o">[</span><span class="kt">VD</span>, <span class="kt">Int</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p><a href="api/scala/index.html#org.apache.spark.graphx.Graph$@apply[VD,ED](RDD[(VertexId,VD)],RDD[Edge[ED]],VD)(ClassTag[VD],ClassTag[ED]):Graph[VD,ED]"><code>Graph.apply</code></a> allows creating a graph from RDDs of vertices and edges. Duplicate vertices are picked arbitrarily and vertices found in the edge RDD but not the vertex RDD are assigned the default attribute.</p>
@@ -978,8 +973,7 @@ additional functionality:</p>
<span class="k">def</span> <span class="n">innerJoin</span><span class="o">[</span><span class="kt">U</span>, <span class="kt">VD2</span><span class="o">](</span><span class="n">other</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[(</span><span class="kt">VertexId</span>, <span class="kt">U</span><span class="o">)])(</span><span class="n">f</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">VD</span><span class="o">,</span> <span class="n">U</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">VD2</span><span class="o">)</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">VD2</span><span class="o">]</span>
<span class="c1">// Use the index on this RDD to accelerate a `reduceByKey` operation on the input RDD.</span>
<span class="k">def</span> <span class="n">aggregateUsingIndex</span><span class="o">[</span><span class="kt">VD2</span><span class="o">](</span><span class="n">other</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[(</span><span class="kt">VertexId</span>, <span class="kt">VD2</span><span class="o">)],</span> <span class="n">reduceFunc</span><span class="k">:</span> <span class="o">(</span><span class="kt">VD2</span><span class="o">,</span> <span class="kt">VD2</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">VD2</span><span class="o">)</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">VD2</span><span class="o">]</span>
-<span class="o">}</span>
-</code></pre></div>
+<span class="o">}</span></code></pre></div>
<p>Notice, for example, how the <code>filter</code> operator returns a <code>VertexRDD</code>. Filter is actually
implemented using a <code>BitSet</code>, thereby reusing the index and preserving the ability to do fast joins
@@ -1001,8 +995,7 @@ both aggregate and then subsequently index the <code>RDD[(VertexID, A)]</code>.
<span class="c1">// There should be 100 entries in setB</span>
<span class="n">setB</span><span class="o">.</span><span class="n">count</span>
<span class="c1">// Joining A and B should now be fast!</span>
-<span class="k">val</span> <span class="n">setC</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span class="n">setA</span><span class="o">.</span><span class="n">innerJoin</span><span class="o">(</span><span class="n">setB</span><span class="o">)((</span><span class="n">id</span><span class="o">,</span> <span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span>
-</code></pre></div>
+<span class="k">val</span> <span class="n">setC</span><span class="k">:</span> <span class="kt">VertexRDD</span><span class="o">[</span><span class="kt">Double</span><span class="o">]</span> <span class="k">=</span> <span class="n">setA</span><span class="o">.</span><span class="n">innerJoin</span><span class="o">(</span><span class="n">setB</span><span class="o">)((</span><span class="n">id</span><span class="o">,</span> <span class="n">a</span><span class="o">,</span> <span class="n">b</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span><span class="o">)</span></code></pre></div>
<h2 id="edgerdds">EdgeRDDs</h2>
@@ -1018,8 +1011,7 @@ reuse when changing attribute values.</p>
<span class="c1">// Revere the edges reusing both attributes and structure</span>
<span class="k">def</span> <span class="n">reverse</span><span class="k">:</span> <span class="kt">EdgeRDD</span><span class="o">[</span><span class="kt">ED</span>, <span class="kt">VD</span><span class="o">]</span>
<span class="c1">// Join two `EdgeRDD`s partitioned using the same partitioning strategy.</span>
-<span class="k">def</span> <span class="n">innerJoin</span><span class="o">[</span><span class="kt">ED2</span>, <span class="kt">ED3</span><span class="o">](</span><span class="n">other</span><span class="k">:</span> <span class="kt">EdgeRDD</span><span class="o">[</span><span class="kt">ED2</span>, <span class="kt">VD</span><span class="o">])(</span><span class="n">f</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">VertexId</span><span class="o">,</span> <span class="nc">ED</span><span class="o">,</span> <span class="nc">ED2</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">ED3</span><span class="o">)</span><span class="k">:</span> <span class="kt">EdgeRDD</span><span class="o">[</span><span class="kt">ED3</span>, <span class="kt">VD</span><span class="o">]</span>
-</code></pre></div>
+<span class="k">def</span> <span class="n">innerJoin</span><span class="o">[</span><span class="kt">ED2</span>, <span class="kt">ED3</span><span class="o">](</span><span class="n">other</span><span class="k">:</span> <span class="kt">EdgeRDD</span><span class="o">[</span><span class="kt">ED2</span>, <span class="kt">VD</span><span class="o">])(</span><span class="n">f</span><span class="k">:</span> <span class="o">(</span><span class="kt">VertexId</span><span class="o">,</span> <span class="kt">VertexId</span><span class="o">,</span> <span class="nc">ED</span><span class="o">,</span> <span class="nc">ED2</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="nc">ED3</span><span class="o">)</span><span class="k">:</span> <span class="kt">EdgeRDD</span><span class="o">[</span><span class="kt">ED3</span>, <span class="kt">VD</span><span class="o">]</span></code></pre></div>
<p>In most applications, we have found that operations on the <code>EdgeRDD</code> are either accomplished through
the graph operators or rely on operations defined in the base <code>RDD</code> class.</p>
@@ -1040,10 +1032,12 @@ distributed graph partitioning:</p>
reduce both the communication and storage overhead. Logically, this corresponds to assigning edges
to machines and allowing vertices to span multiple machines. The exact method of assigning edges
depends on the <a href="api/scala/index.html#org.apache.spark.graphx.PartitionStrategy"><code>PartitionStrategy</code></a> and there are several tradeoffs to the
-various heuristics. Users can choose between different strategies by repartitioning the graph with
-the <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a> operator. The default partitioning strategy is to use
-the initial partitioning of the edges as provided on graph construction. However, users can easily
-switch to 2D-partitioning or other heuristics included in GraphX.</p>
+various heuristics. The default partitioning strategy is to use the initial partitioning of the
+edges as provided on graph construction. However, users can easily switch to 2D-partitioning or
+other heuristics included in GraphX.</p>
+
+<p>Users can choose between different strategies by repartitioning the graph with
+the <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a> operator. However, note that <code>Graph.partitionBy</code> is broken in Spark 1.0.0 due to <a href="https://issues.apache.org/jira/browse/SPARK-1931">SPARK-1931</a>; see the <a href="#partitionBy_workaround">suggested workarounds</a> above.</p>
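+
+<p>For instance, switching to the 2D edge-partitioning heuristic is a single call; this is only a sketch and assumes a Spark build where <code>partitionBy</code> works (or the workaround above):</p>
+
+<div class="highlight"><pre><code class="scala">// Sketch: repartition the edges with the 2D partitioning heuristic
+// ("graph" is any Graph[VD, ED] already in scope)
+val partitionedGraph = graph.partitionBy(PartitionStrategy.EdgePartition2D)
+</code></pre></div>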
<p style="text-align: center;">
<img src="img/vertex_routing_edge_tables.png" title="RDD Graph Representation" alt="RDD Graph Representation" width="50%" />
@@ -1065,7 +1059,7 @@ to broadcast vertices when implementing the join required for operations like <c
<h2 id="pagerank">PageRank</h2>
<p><a name="pagerank"></a></p>
-<p>PageRank measures the importance of each vertex in a graph, assuming an edge from <em>u</em> to <em>v</em> represents an endorsement of <em>v</em>&#8217;s importance by <em>u</em>. For example, if a Twitter user is followed by many others, the user will be ranked highly.</p>
+<p>PageRank measures the importance of each vertex in a graph, assuming an edge from <em>u</em> to <em>v</em> represents an endorsement of <em>v</em>’s importance by <em>u</em>. For example, if a Twitter user is followed by many others, the user will be ranked highly.</p>
<p>GraphX comes with static and dynamic implementations of PageRank as methods on the <a href="api/scala/index.html#org.apache.spark.graphx.lib.PageRank$"><code>PageRank</code> object</a>. Static PageRank runs for a fixed number of iterations, while dynamic PageRank runs until the ranks converge (i.e., stop changing by more than a specified tolerance). <a href="api/scala/index.html#org.apache.spark.graphx.GraphOps"><code>GraphOps</code></a> allows calling these algorithms directly as methods on <code>Graph</code>.</p>
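+
+<p>The two variants differ only in their stopping condition; as a minimal sketch (the tolerance and iteration count below are arbitrary illustrative values):</p>
+
+<div class="highlight"><pre><code class="scala">// Sketch: dynamic PageRank runs to a tolerance, static PageRank for a fixed number of iterations
+val dynamicRanks = graph.pageRank(0.0001).vertices  // iterate until ranks change by less than 0.0001
+val staticRanks = graph.staticPageRank(10).vertices // run exactly 10 iterations
+</code></pre></div>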
@@ -1084,8 +1078,7 @@ to broadcast vertices when implementing the join required for operations like <c
<span class="k">case</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="o">(</span><span class="n">username</span><span class="o">,</span> <span class="n">rank</span><span class="o">))</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="n">username</span><span class="o">,</span> <span class="n">rank</span><span class="o">)</span>
<span class="o">}</span>
<span class="c1">// Print the result</span>
-<span class="n">println</span><span class="o">(</span><span class="n">ranksByUsername</span><span class="o">.</span><span class="n">collect</span><span class="o">().</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">println</span><span class="o">(</span><span class="n">ranksByUsername</span><span class="o">.</span><span class="n">collect</span><span class="o">().</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span></code></pre></div>
<h2 id="connected-components">Connected Components</h2>
@@ -1104,15 +1097,26 @@ to broadcast vertices when implementing the join required for operations like <c
<span class="k">case</span> <span class="o">(</span><span class="n">id</span><span class="o">,</span> <span class="o">(</span><span class="n">username</span><span class="o">,</span> <span class="n">cc</span><span class="o">))</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="n">username</span><span class="o">,</span> <span class="n">cc</span><span class="o">)</span>
<span class="o">}</span>
<span class="c1">// Print the result</span>
-<span class="n">println</span><span class="o">(</span><span class="n">ccByUsername</span><span class="o">.</span><span class="n">collect</span><span class="o">().</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">println</span><span class="o">(</span><span class="n">ccByUsername</span><span class="o">.</span><span class="n">collect</span><span class="o">().</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span></code></pre></div>
<h2 id="triangle-counting">Triangle Counting</h2>
-<p>A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the <a href="api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$"><code>TriangleCount</code> object</a> that determines the number of triangles passing through each vertex, providing a measure of clustering. We compute the triangle count of the social network dataset from the <a href="#pagerank">PageRank section</a>. <em>Note that <code>TriangleCount</code> requires the edges to be in canonical orientation (<code>srcId &lt; dstId</code>) and the graph to be partitioned using <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a>.</em></p>
+<p>A vertex is part of a triangle when it has two adjacent vertices with an edge between them. GraphX implements a triangle counting algorithm in the <a href="api/scala/index.html#org.apache.spark.graphx.lib.TriangleCount$"><code>TriangleCount</code> object</a> that determines the number of triangles passing through each vertex, providing a measure of clustering. We compute the triangle count of the social network dataset from the <a href="#pagerank">PageRank section</a>. <em>Note that <code>TriangleCount</code> requires the edges to be in canonical orientation (<code>srcId &lt; dstId</code>) and the graph to be partitioned using <a href="api/scala/index.html#org.apache.spark.graphx.Graph@partitionBy(PartitionStrategy):Graph[VD,ED]"><code>Graph.partitionBy</code></a>.</em> Also note that <code>Graph.partitionBy</code> is broken in Spark 1.0.0 due to <a href="https://issues.apache.org/jira/browse/SPARK-1931">SPARK-1931</a>; see the <a href="#partitionBy_workaround">suggested workarounds</a> above.</p>
+
+<div class="highlight"><pre><code class="scala"><span class="c1">// Define our own version of partitionBy to work around SPARK-1931</span>
+<span class="k">import</span> <span class="nn">org.apache.spark.HashPartitioner</span>
+<span class="k">def</span> <span class="n">partitionBy</span><span class="o">[</span><span class="kt">ED</span><span class="o">](</span><span class="n">edges</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Edge</span><span class="o">[</span><span class="kt">ED</span><span class="o">]],</span> <span class="n">partitionStrategy</span><span class="k">:</span> <span class="kt">PartitionStrategy</span><span class="o">)</span><span class="k">:</span> <span class="kt">RDD</span><span class="o">[</span><span class="kt">Edge</span><span class="o">[</span><span class="kt">ED</span><span class="o">]]</span> <span class="k">=</span> <span class="o">{</span>
+ <span class="k">val</span> <span class="n">numPartitions</span> <span class="k">=</span> <span class="n">edges</span><span class="o">.</span><span class="n">partitions</span><span class="o">.</span><span class="n">size</span>
+ <span class="n">edges</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="n">e</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="n">partitionStrategy</span><span class="o">.</span><span class="n">getPartition</span><span class="o">(</span><span class="n">e</span><span class="o">.</span><span class="n">srcId</span><span class="o">,</span> <span class="n">e</span><span class="o">.</span><span class="n">dstId</span><span class="o">,</span> <span class="n">numPartitions</span><span class="o">),</span> <span class="n">e</span><span class="o">))</span>
+ <span class="o">.</span><span class="n">partitionBy</span><span class="o">(</span><span class="k">new</span> <span class="nc">HashPartitioner</span><span class="o">(</span><span class="n">numPartitions</span><span class="o">))</span>
+ <span class="o">.</span><span class="n">mapPartitions</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">map</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">_2</span><span class="o">),</span> <span class="n">preservesPartitioning</span> <span class="k">=</span> <span class="kc">true</span><span class="o">)</span>
+<span class="o">}</span>
-<div class="highlight"><pre><code class="scala"><span class="c1">// Load the edges in canonical order and partition the graph for triangle count</span>
-<span class="k">val</span> <span class="n">graph</span> <span class="k">=</span> <span class="nc">GraphLoader</span><span class="o">.</span><span class="n">edgeListFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">&quot;graphx/data/followers.txt&quot;</span><span class="o">,</span> <span class="kc">true</span><span class="o">).</span><span class="n">partitionBy</span><span class="o">(</span><span class="nc">PartitionStrategy</span><span class="o">.</span><span class="nc">RandomVertexCut</span><span class="o">)</span>
+<span class="c1">// Load the edges in canonical order and partition the graph for triangle count</span>
+<span class="k">val</span> <span class="n">unpartitionedGraph</span> <span class="k">=</span> <span class="nc">GraphLoader</span><span class="o">.</span><span class="n">edgeListFile</span><span class="o">(</span><span class="n">sc</span><span class="o">,</span> <span class="s">&quot;graphx/data/followers.txt&quot;</span><span class="o">,</span> <span class="kc">true</span><span class="o">)</span>
+<span class="k">val</span> <span class="n">graph</span> <span class="k">=</span> <span class="nc">Graph</span><span class="o">(</span>
+ <span class="n">partitionBy</span><span class="o">(</span><span class="n">unpartitionedGraph</span><span class="o">.</span><span class="n">edges</span><span class="o">,</span> <span class="nc">PartitionStrategy</span><span class="o">.</span><span class="nc">RandomVertexCut</span><span class="o">),</span>
+ <span class="n">unpartitionedGraph</span><span class="o">.</span><span class="n">vertices</span><span class="o">)</span>
<span class="c1">// Find the triangle count for each vertex</span>
<span class="k">val</span> <span class="n">triCounts</span> <span class="k">=</span> <span class="n">graph</span><span class="o">.</span><span class="n">triangleCount</span><span class="o">().</span><span class="n">vertices</span>
<span class="c1">// Join the triangle counts with the usernames</span>
@@ -1124,8 +1128,7 @@ to broadcast vertices when implementing the join required for operations like <c
<span class="o">(</span><span class="n">username</span><span class="o">,</span> <span class="n">tc</span><span class="o">)</span>
<span class="o">}</span>
<span class="c1">// Print the result</span>
-<span class="n">println</span><span class="o">(</span><span class="n">triCountByUsername</span><span class="o">.</span><span class="n">collect</span><span class="o">().</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">println</span><span class="o">(</span><span class="n">triCountByUsername</span><span class="o">.</span><span class="n">collect</span><span class="o">().</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span></code></pre></div>
<h1 id="examples">Examples</h1>
@@ -1163,8 +1166,7 @@ all of this in just a few lines with GraphX:</p>
<span class="k">case</span> <span class="o">(</span><span class="n">uid</span><span class="o">,</span> <span class="n">attrList</span><span class="o">,</span> <span class="nc">None</span><span class="o">)</span> <span class="k">=&gt;</span> <span class="o">(</span><span class="mf">0.0</span><span class="o">,</span> <span class="n">attrList</span><span class="o">.</span><span class="n">toList</span><span class="o">)</span>
<span class="o">}</span>
-<span class="n">println</span><span class="o">(</span><span class="n">userInfoWithPageRank</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">top</span><span class="o">(</span><span class="mi">5</span><span class="o">)(</span><span class="nc">Ordering</span><span class="o">.</span><span class="n">by</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">_2</span><span class="o">.</span><span class="n">_1</span><span class="o">)).</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span>
-</code></pre></div>
+<span class="n">println</span><span class="o">(</span><span class="n">userInfoWithPageRank</span><span class="o">.</span><span class="n">vertices</span><span class="o">.</span><span class="n">top</span><span class="o">(</span><span class="mi">5</span><span class="o">)(</span><span class="nc">Ordering</span><span class="o">.</span><span class="n">by</span><span class="o">(</span><span class="k">_</span><span class="o">.</span><span class="n">_2</span><span class="o">.</span><span class="n">_1</span><span class="o">)).</span><span class="n">mkString</span><span class="o">(</span><span class="s">&quot;\n&quot;</span><span class="o">))</span></code></pre></div>