| Commit message (Collapse) | Author | Age | Files | Lines |
| |
This reverts commit adfed7086f10fa8db4eeac7996c84cf98f625e9a.
| |
partitioner of EdgeRDDImp...
If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition (EdgeRDDImpl),
the following error occurs in ReplicatedVertexView.scala:72:
object GraphTest extends Logging {
  def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
    graph.aggregateMessages(
      ctx => {
        ctx.sendToSrc(1)
        ctx.sendToDst(2)
      },
      _ + _)
  }
}

val g = GraphLoader.edgeListFile(sc, "graph.txt")
val rdd = GraphTest.run(g)
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
...
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits:
0cd8942 [Ankur Dave] Use more concise getOrElse
aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions
0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl
(cherry picked from commit e224dbb011789297cd6c6ba095f702c042869ed6)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
generator to prevent infinite loop
I looked into GraphGenerators#chooseCell and found that chooseCell can't generate more than pow(2, (2 * (log2(numVertices)-1))) edges when building a power-law graph. (Ex. numVertices:4 upperbound:4, numVertices:8 upperbound:16, numVertices:16 upperbound:64)
If we request more edges than this upper bound, rmatGraph falls into an infinite loop. So, how about adding an argument validation?
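The bound quoted above can be checked with a few lines of Scala. This is a minimal sketch, not the patch itself: the object name `RmatBounds` and its helpers are hypothetical, and rounding `numVertices` up to the next power of two is an assumption about how the generator sizes its grid.

```scala
object RmatBounds {
  /** ceil(log2(n)) for n >= 1, using integer arithmetic to avoid floating-point rounding. */
  def ceilLog2(n: Int): Int = {
    require(n >= 1, s"n must be positive, got $n")
    32 - Integer.numberOfLeadingZeros(n - 1)
  }

  /** Upper bound from the message: 2^(2 * (log2(numVertices) - 1)). */
  def edgeUpperBound(numVertices: Int): Long = {
    require(numVertices >= 2, s"numVertices must be at least 2, got $numVertices")
    1L << (2 * (ceilLog2(numVertices) - 1))
  }

  /** Fails fast (hypothetical check) instead of letting edge generation loop forever. */
  def validate(numVertices: Int, numEdges: Long): Unit =
    require(numEdges <= edgeUpperBound(numVertices),
      s"numEdges must be <= ${edgeUpperBound(numVertices)} but was $numEdges")
}
```

For example, `RmatBounds.validate(8, 17)` throws an `IllegalArgumentException`, because the bound for 8 vertices is 16.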
Author: Kenji Kikushima <kikushima.kenji@lab.ntt.co.jp>
Closes #3950 from kj-ki/SPARK-5064 and squashes the following commits:
4ee18c7 [Ankur Dave] Reword error message and add unit test
d760bc7 [Kenji Kikushima] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop.
(cherry picked from commit 3ee3ab592eee831d759c940eb68231817ad6d083)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
This reverts commit 2b72c569a674cccf79ebbe8d067b8dbaaf78007f.
| |
This reverts commit bc05df8a23ba7ad485f6844f28f96551b13ba461.
| |
Add an interface (IF) to uncache both the vertices and the edges of Graph/GraphImpl.
This interface is useful when iterative graph operations build a new graph in each iteration and the vertices and edges of previous iterations are no longer needed in subsequent iterations.
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
This patch had conflicts when merged, resolved by
Committer: Ankur Dave <ankurdave@gmail.com>
Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:
77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
(cherry picked from commit 8817fc7fe8785d7b11138ca744f22f7e70f1f0a0)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
This patch replaces the native quicksort (scala.util.Sorting.quickSort) with Spark's Sorter (TimSort).
In my quick experiments, this gave a performance gain of about 8%.
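The swap can be illustrated outside Spark. The sketch below is an assumption-laden stand-in: Spark's own Sorter is internal, so `java.util.Arrays.sort` on an object array (which the JDK implements with TimSort) plays its role here, and `EdgeSortDemo`, `Edge`, and both helper names are hypothetical.

```scala
import java.util.Arrays
import scala.util.Sorting

object EdgeSortDemo {
  final case class Edge(srcId: Long, dstId: Long, attr: Int)

  // Lexicographic order on (srcId, dstId).
  val bySrcThenDst: Ordering[Edge] = Ordering.by((e: Edge) => (e.srcId, e.dstId))

  /** Before: quicksort from the Scala standard library (not stable). */
  def sortWithQuickSort(edges: Array[Edge]): Array[Edge] = {
    val copy = edges.clone()
    Sorting.quickSort(copy)(bySrcThenDst)
    copy
  }

  /** After (stand-in): the JDK's TimSort for object arrays; Ordering is also a Comparator. */
  def sortWithTimSort(edges: Array[Edge]): Array[Edge] = {
    val copy = edges.clone()
    Arrays.sort[Edge](copy, bySrcThenDst)
    copy
  }
}
```

Both helpers return the same order whenever the (srcId, dstId) keys are distinct; with duplicate keys only the TimSort variant is guaranteed to keep equal elements in their input order.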
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:
8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
(cherry picked from commit 2e6b736b0e6e5920d0523533c87832a53211db42)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
Author: GuoQiang Li <witgo@qq.com>
Closes #2631 from witgo/SPARK-3623 and squashes the following commits:
a70c500 [GuoQiang Li] Remove java related
4d1e249 [GuoQiang Li] Add comments
e682724 [GuoQiang Li] Graph should support the checkpoint operation
(cherry picked from commit e895e0cbecbbec1b412ff21321e57826d2d0a982)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
This reverts commit 1056e9ec13203d0c51564265e94d77a054498fdb.
| |
This reverts commit 00316cc87983b844f6603f351a8f0b84fe1f6035.
| |
error
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672
In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl is non-transient, the serialization chain can become very long in iterative algorithms and eventually lead to a stack overflow error. More details and explanation can be found in the JIRA.
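The effect of `@transient` on a serialization chain can be shown with plain Java serialization. This is a toy model under stated assumptions, not GraphX code: `PlainLink` and `TransientLink` are hypothetical stand-ins for a wrapper that references its predecessor, and the chains are kept short so that nothing actually overflows.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

object TransientDemo {
  // The parent reference is serialized, so the whole chain is walked recursively.
  final class PlainLink(val parent: PlainLink, val id: Int) extends Serializable
  // The parent reference is skipped, so only this one object is written.
  final class TransientLink(@transient val parent: TransientLink, val id: Int) extends Serializable

  /** Number of bytes Java serialization writes for `obj`. */
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(obj) finally out.close()
    bytes.size()
  }

  def plainChain(depth: Int): PlainLink =
    (1 to depth).foldLeft(null: PlainLink)((parent, i) => new PlainLink(parent, i))

  def transientChain(depth: Int): TransientLink =
    (1 to depth).foldLeft(null: TransientLink)((parent, i) => new TransientLink(parent, i))
}
```

The serialized size of a `PlainLink` chain grows with its depth, while a `TransientLink` serializes to the same size at any depth. The recursive walk over a very deep plain chain is what exhausts the stack.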
Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>
Closes #3544 from JerryLead/my_graphX and squashes the following commits:
628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
(cherry picked from commit 17c162f6682520e6e2790626e37da3a074471793)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672
Iterative GraphX applications always have long lineage, and calling checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten it. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, existing operations in this code such as cache() are performed on the PartitionsRDD, so checkpoint() should behave the same way. More details and explanation can be found in the JIRA.
Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>
Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:
d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
(cherry picked from commit fc0a1475ef7c8b33363d88adfe8e8f28def5afc7)
Signed-off-by: Ankur Dave <ankurdave@gmail.com>
| |
This reverts commit 39c7d1c1f9a7785285cf4c20dfbffd96f72d5634.
| |
This reverts commit fc7bff00ac731d2632213a98cd92dc5e84ce7dcd.
| |
After additional discussion with rxin, I think having all the possible `TripletField` options is confusing. This pull request reduces the triplet fields to:
```java
/**
* None of the triplet fields are exposed.
*/
public static final TripletFields None = new TripletFields(false, false, false);
/**
* Expose only the edge field and not the source or destination field.
*/
public static final TripletFields EdgeOnly = new TripletFields(false, false, true);
/**
* Expose the source and edge fields but not the destination field. (Same as Src)
*/
public static final TripletFields Src = new TripletFields(true, false, true);
/**
* Expose the destination and edge fields but not the source field. (Same as Dst)
*/
public static final TripletFields Dst = new TripletFields(false, true, true);
/**
* Expose all the fields (source, edge, and destination).
*/
public static final TripletFields All = new TripletFields(true, true, true);
```
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits:
91796b5 [Joseph E. Gonzalez] removing confusing triplet fields
(cherry picked from commit 288ce583b05004a8c71dcd836fab23caff5d4ba7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
This reverts commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2.
| |
This reverts commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa.
| |
This reverts commit 5247dd859b95a440baa562b9827bdeb26aa6530e.
| |
This reverts commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f.
| |
This reverts commit db7f4a898af22a02b36428507f8ef2b429d78dc1.
| |
This reverts commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b.
| |
This reverts commit 38c1fbd9694430cefd962c90bc36b0d108c6124b.
| |
This reverts commit d7ac6013483e83caff8ea54c228f37aeca159db8.
| |
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
4421964 [Joseph E. Gonzalez] updating documentation for graphx
(cherry picked from commit 377b06820934cab6d67f3a9182528c7f417a7d98)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter.
This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits:
38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public
fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD
(cherry picked from commit 9ac2bb18ede2e9f73c255fa33445af89aaf8a000)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
This reverts commit bc09875799aa373f4320d38b02618173ffa4c96f.
| |
This reverts commit 6c6fd218c83a049c874b8a0ea737333c1899c94a.
| |
This discourages users from calling the VertexRDD and EdgeRDD constructors directly and makes it easier for future changes to preserve backward compatibility.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2530 from ankurdave/SPARK-3666 and squashes the following commits:
d681f45 [Ankur Dave] Define getPartitions and compute in abstract class for MIMA
1472390 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
24201d4 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
cbe15f2 [Ankur Dave] Remove specialized annotation from VertexRDD and EdgeRDD
931b587 [Ankur Dave] Use abstract class instead of trait for binary compatibility
9ba4ec4 [Ankur Dave] Mark (Vertex|Edge)RDDImpl constructors package-private
620e603 [Ankur Dave] Extract VertexRDD interface and move implementation to VertexRDDImpl
55b6398 [Ankur Dave] Extract EdgeRDD interface and move implementation to EdgeRDDImpl
(cherry picked from commit a5ef58113667ff73562ce6db381cff96a0b354b0)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
1. Add EdgeActiveness enum to represent activeness criteria more cleanly than using booleans.
2. Comments and whitespace.
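The idea in item 1 can be sketched as follows. This is an illustration only: the message names the enum but not its members, so the five case names and the `edgeIsActive` helper are assumptions.

```scala
object ActivenessDemo {
  /** Criterion an edge must meet, replacing a pair of boolean flags. */
  sealed trait EdgeActiveness
  object EdgeActiveness {
    case object Neither extends EdgeActiveness // no requirement: every edge qualifies
    case object SrcOnly extends EdgeActiveness // the source vertex must be active
    case object DstOnly extends EdgeActiveness // the destination vertex must be active
    case object Both extends EdgeActiveness    // both endpoints must be active
    case object Either extends EdgeActiveness  // at least one endpoint must be active
  }

  /** Decides whether an edge is processed, given which endpoints are in the active set. */
  def edgeIsActive(srcActive: Boolean, dstActive: Boolean, criterion: EdgeActiveness): Boolean =
    criterion match {
      case EdgeActiveness.Neither => true
      case EdgeActiveness.SrcOnly => srcActive
      case EdgeActiveness.DstOnly => dstActive
      case EdgeActiveness.Both    => srcActive && dstActive
      case EdgeActiveness.Either  => srcActive || dstActive
    }
}
```

Because the trait is sealed, the compiler can warn about a missing case in the `match`, which a pair of booleans cannot offer.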
Author: Ankur Dave <ankurdave@gmail.com>
Closes #3231 from ankurdave/aggregateMessages-followup and squashes the following commits:
3d485c3 [Ankur Dave] Internal cleanup for aggregateMessages
(cherry picked from commit 0402be90f7af82c8404cafbca79f5f9fb8e2bbed)
Signed-off-by: Reynold Xin <rxin@databricks.com>
| |
aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:
1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.
2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.
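The difference between the two styles can be seen in a toy in-memory model. This sketch does not use GraphX: `MessagingDemo`, `Ctx`, `mapReduceStyle`, and `aggregateStyle` are hypothetical names, and the model ignores partitioning and triplet fields entirely.

```scala
import scala.collection.mutable

object MessagingDemo {
  final case class Edge(srcId: Long, dstId: Long, attr: Int)

  /** Imperative context handed to the user function, loosely modeled on EdgeContext. */
  final class Ctx[A](val edge: Edge, merge: (A, A) => A, acc: mutable.Map[Long, A]) {
    // Merge the message into the running aggregate for the target vertex.
    private def send(id: Long, msg: A): Unit =
      acc.update(id, acc.get(id).map(merge(_, msg)).getOrElse(msg))
    def sendToSrc(msg: A): Unit = send(edge.srcId, msg)
    def sendToDst(msg: A): Unit = send(edge.dstId, msg)
  }

  /** Iterator style: the user function returns (vertexId, message) pairs per edge. */
  def mapReduceStyle[A](edges: Seq[Edge])(
      mapFn: Edge => Iterator[(Long, A)], reduceFn: (A, A) => A): Map[Long, A] =
    edges.iterator.flatMap(mapFn).toSeq
      .groupBy(_._1)
      .map { case (id, msgs) => id -> msgs.map(_._2).reduce(reduceFn) }

  /** Imperative style: the user function calls sendToSrc / sendToDst on a context. */
  def aggregateStyle[A](edges: Seq[Edge])(
      sendMsg: Ctx[A] => Unit, merge: (A, A) => A): Map[Long, A] = {
    val acc = mutable.Map.empty[Long, A]
    edges.foreach(e => sendMsg(new Ctx[A](e, merge, acc)))
    acc.toMap
  }
}
```

In this model the imperative style merges each message as it is sent, so no per-edge iterator or intermediate list of pairs is allocated.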
Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:
1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
2. Internal iterators in aggregateMessages are inlined into a while loop.
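The first optimization above, storing local rather than global vertex ids, can be sketched like this. It is a simplified model with hypothetical names (`LocalIdDemo`, `Partition`, `build`); the real EdgePartition layout is not shown in this message and is likely more specialized.

```scala
import scala.collection.mutable

object LocalIdDemo {
  /** Edges reference vertices by dense local index; global ids are stored once. */
  final class Partition(
      val localSrc: Array[Int],
      val localDst: Array[Int],
      val local2global: Array[Long],
      val vertexAttrs: Array[Double]) {
    // Attribute lookup is a plain array access, with no hash lookup per edge.
    def srcAttr(edge: Int): Double = vertexAttrs(localSrc(edge))
    def dstAttr(edge: Int): Double = vertexAttrs(localDst(edge))
  }

  def build(edges: Seq[(Long, Long)], attrOf: Long => Double): Partition = {
    // The global-to-local map is only needed while building the partition.
    val global2local = mutable.LinkedHashMap.empty[Long, Int]
    def localId(g: Long): Int = global2local.getOrElseUpdate(g, global2local.size)

    val src = new Array[Int](edges.size)
    val dst = new Array[Int](edges.size)
    for (((s, d), i) <- edges.zipWithIndex) {
      src(i) = localId(s)
      dst(i) = localId(d)
    }

    val local2global = new Array[Long](global2local.size)
    for ((g, l) <- global2local) local2global(l) = g
    new Partition(src, dst, local2global, local2global.map(attrOf))
  }
}
```

A message aggregator can use the same local index to address a plain array of partial results, which is where the saved hash lookups come from.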
In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).
Subsumes apache/spark#2815. Also fixes SPARK-4173.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #3100 from ankurdave/aggregateMessages and squashes the following commits:
f5b65d0 [Ankur Dave] Address @rxin comments on apache/spark#3054 and apache/spark#3100
1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
c85076d [Ankur Dave] Readability improvements
b567be2 [Ankur Dave] iter.foreach -> while loop
4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition
(cherry picked from commit faeb41de215d3ac567ce72a43ab242ad433ca93e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|