spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	Revert "Preparing development version 1.2.2-SNAPSHOT"	Patrick Wendell	2015-01-26	1	-1/+1
	This reverts commit adfed7086f10fa8db4eeac7996c84cf98f625e9a.
*	Preparing development version 1.2.2-SNAPSHOT	Ubuntu	2015-01-27	1	-1/+1
*	Preparing Spark release v1.2.1-rc1	Ubuntu	2015-01-27	1	-1/+1
*	[SPARK-5351][GraphX] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImp...	Takeshi Yamamuro	2015-01-23	2	-2/+22
	If the value of 'spark.default.parallelism' does not match the number of partitions in EdgePartition(EdgeRDDImpl), the following error occurs in ReplicatedVertexView.scala:72:
	    object GraphTest extends Logging {
	      def run[VD: ClassTag, ED: ClassTag](graph: Graph[VD, ED]): VertexRDD[Int] = {
	        graph.aggregateMessages(
	          ctx => {
	            ctx.sendToSrc(1)
	            ctx.sendToDst(2)
	          },
	          _ + _)
	      }
	    }
	    val g = GraphLoader.edgeListFile(sc, "graph.txt")
	    val rdd = GraphTest.run(g)
	    java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
	        at org.apache.spark.rdd.ZippedPartitionsBaseRDD.getPartitions(ZippedPartitionsRDD.scala:57)
	        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
	        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	        at scala.Option.getOrElse(Option.scala:120)
	        at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
	        at org.apache.spark.rdd.MapPartitionsRDD.getPartitions(MapPartitionsRDD.scala:32)
	        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:206)
	        at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:204)
	        at scala.Option.getOrElse(Option.scala:120)
	        at org.apache.spark.rdd.RDD.partitions(RDD.scala:204)
	        at org.apache.spark.ShuffleDependency.<init>(Dependency.scala:82)
	        at org.apache.spark.rdd.ShuffledRDD.getDependencies(ShuffledRDD.scala:80)
	        at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:193)
	        at org.apache.spark.rdd.RDD$$anonfun$dependencies$2.apply(RDD.scala:191)
	        ...
	Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
	Closes #4136 from maropu/EdgePartitionBugFix and squashes the following commits:
	0cd8942 [Ankur Dave] Use more concise getOrElse
	aad4a2c [Ankur Dave] Add unit test for non-default number of edge partitions
	0a2f32b [Takeshi Yamamuro] Do not use Partitioner.defaultPartitioner as a partitioner of EdgeRDDImpl
	(cherry picked from commit e224dbb011789297cd6c6ba095f702c042869ed6)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	[SPARK-5064][GraphX] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop	Kenji Kikushima	2015-01-21	2	-0/+16
	I looked into GraphGenerators#chooseCell and found that chooseCell can't generate more edges than pow(2, (2 * (log2(numVertices)-1))) to make a power-law graph (e.g. numVertices:4 upperbound:4, numVertices:8 upperbound:16, numVertices:16 upperbound:64). If we request more edges than the upper bound, rmatGraph falls into an infinite loop. So, how about adding an argument validation?
	Author: Kenji Kikushima <kikushima.kenji@lab.ntt.co.jp>
	Closes #3950 from kj-ki/SPARK-5064 and squashes the following commits:
	4ee18c7 [Ankur Dave] Reword error message and add unit test
	d760bc7 [Kenji Kikushima] Add numEdges upperbound validation for R-MAT graph generator to prevent infinite loop.
	(cherry picked from commit 3ee3ab592eee831d759c940eb68231817ad6d083)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
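The bound quoted in this commit message can be checked with a small standalone sketch. It is not the code added by the patch: `RmatEdgeBound`, `maxEdges`, and `validate` are hypothetical names, and the sketch assumes `numVertices` is already a power of two.

```scala
object RmatEdgeBound {
  // Hypothetical helper (not from the patch): evaluates
  // pow(2, 2 * (log2(numVertices) - 1)) using integer arithmetic.
  def maxEdges(numVertices: Int): Long = {
    require(numVertices >= 2 && (numVertices & (numVertices - 1)) == 0,
      "this sketch assumes numVertices is a power of two, at least 2")
    // For a power of two, log2 is the index of its single set bit.
    val log2 = Integer.numberOfTrailingZeros(numVertices)
    1L << (2 * (log2 - 1))
  }

  // Argument check in the spirit of the commit: reject edge counts above the bound
  // instead of letting the generator loop forever.
  def validate(numVertices: Int, numEdges: Long): Unit =
    require(numEdges <= maxEdges(numVertices),
      s"numEdges must be <= ${maxEdges(numVertices)} but was $numEdges")
}
```

For example, `RmatEdgeBound.maxEdges(8)` evaluates to 16, which matches the second example in the message.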
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-12-10	1	-1/+1
*	Preparing Spark release v1.2.0-rc2 (tag: v1.2.0)	Patrick Wendell	2014-12-10	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-rc2"	Patrick Wendell	2014-12-10	1	-1/+1
	This reverts commit 2b72c569a674cccf79ebbe8d067b8dbaaf78007f.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-12-10	1	-1/+1
	This reverts commit bc05df8a23ba7ad485f6844f28f96551b13ba461.
*	[SPARK-4620] Add unpersist in Graph and GraphImpl	Takeshi Yamamuro	2014-12-07	2	-0/+12
	Add an interface to uncache both the vertices and edges of Graph/GraphImpl. This interface is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed for the following iterations.
	Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
	This patch had conflicts when merged, resolved by
	Committer: Ankur Dave <ankurdave@gmail.com>
	Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:
	77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
	(cherry picked from commit 8817fc7fe8785d7b11138ca744f22f7e70f1f0a0)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	[SPARK-4646] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark	Takeshi Yamamuro	2014-12-07	2	-5/+64
	This patch just replaces a native quick sorter with Sorter(TimSort) in Spark. It could get performance gains by ~8% in my quick experiments.
	Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
	Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:
	8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
	3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
	(cherry picked from commit 2e6b736b0e6e5920d0523533c87832a53211db42)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	[SPARK-3623][GraphX] GraphX should support the checkpoint operation	GuoQiang Li	2014-12-06	3	-0/+34
	Author: GuoQiang Li <witgo@qq.com>
	Closes #2631 from witgo/SPARK-3623 and squashes the following commits:
	a70c500 [GuoQiang Li] Remove java related
	4d1e249 [GuoQiang Li] Add comments
	e682724 [GuoQiang Li] Graph should support the checkpoint operation
	(cherry picked from commit e895e0cbecbbec1b412ff21321e57826d2d0a982)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-12-04	1	-1/+1
*	Preparing Spark release v1.2.0-rc2	Patrick Wendell	2014-12-04	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-rc1"	Patrick Wendell	2014-12-04	1	-1/+1
	This reverts commit 1056e9ec13203d0c51564265e94d77a054498fdb.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-12-04	1	-1/+1
	This reverts commit 00316cc87983b844f6603f351a8f0b84fe1f6035.
*	[SPARK-4672][GraphX]Non-transient PartitionsRDDs will lead to StackOverflow error	JerryLead	2014-12-02	2	-2/+2
	The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672
	In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl are non-transient, the serialization chain can become very long in iterative algorithms and finally lead to the StackOverflow error. More details and explanation can be found in the JIRA.
	Author: JerryLead <JerryLead@163.com>
	Author: Lijie Xu <csxulijie@gmail.com>
	Closes #3544 from JerryLead/my_graphX and squashes the following commits:
	628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
	c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
	52799e3 [Lijie Xu] Merge pull request #1 from apache/master
	(cherry picked from commit 17c162f6682520e6e2790626e37da3a074471793)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	[SPARK-4672][GraphX]Perform checkpoint() on PartitionsRDD to shorten the lineage	JerryLead	2014-12-02	2	-0/+8
	The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672
	Iterative GraphX applications always have long lineage, while checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten the lineage. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, existing operations such as cache() in this code are performed on the PartitionsRDD, so checkpoint() should work the same way. More details and explanation can be found in the JIRA.
	Author: JerryLead <JerryLead@163.com>
	Author: Lijie Xu <csxulijie@gmail.com>
	Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:
	d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
	ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
	c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
	52799e3 [Lijie Xu] Merge pull request #1 from apache/master
	(cherry picked from commit fc0a1475ef7c8b33363d88adfe8e8f28def5afc7)
	Signed-off-by: Ankur Dave <ankurdave@gmail.com>
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-11-28	1	-1/+1
*	Preparing Spark release v1.2.0-rc1	Patrick Wendell	2014-11-28	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-rc1"	Patrick Wendell	2014-11-28	1	-1/+1
	This reverts commit 39c7d1c1f9a7785285cf4c20dfbffd96f72d5634.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-11-28	1	-1/+1
	This reverts commit fc7bff00ac731d2632213a98cd92dc5e84ce7dcd.
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-11-28	1	-1/+1
*	Preparing Spark release v1.2.0-rc1	Patrick Wendell	2014-11-28	1	-1/+1
*	Removing confusing TripletFields	Joseph E. Gonzalez	2014-11-26	4	-33/+8
	After additional discussion with rxin, I think having all the possible `TripletField` options is confusing. This pull request reduces the triplet fields to:
	```java
	/**
	 * None of the triplet fields are exposed.
	 */
	public static final TripletFields None = new TripletFields(false, false, false);
	/**
	 * Expose only the edge field and not the source or destination field.
	 */
	public static final TripletFields EdgeOnly = new TripletFields(false, false, true);
	/**
	 * Expose the source and edge fields but not the destination field. (Same as Src)
	 */
	public static final TripletFields Src = new TripletFields(true, false, true);
	/**
	 * Expose the destination and edge fields but not the source field. (Same as Dst)
	 */
	public static final TripletFields Dst = new TripletFields(false, true, true);
	/**
	 * Expose all the fields (source, edge, and destination).
	 */
	public static final TripletFields All = new TripletFields(true, true, true);
	```
	Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
	Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits:
	91796b5 [Joseph E. Gonzalez] removing confusing triplet fields
	(cherry picked from commit 288ce583b05004a8c71dcd836fab23caff5d4ba7)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	Revert "Preparing Spark release v1.2.0-rc1"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit cc2c05e4ee81d2f34873a2ebb9a5272867cb65c2.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit 380eba5f49eca1dbd4084e6c84e19866fffd4efa.
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-11-26	1	-1/+1
*	Preparing Spark release v1.2.0-rc1	Patrick Wendell	2014-11-26	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-rc1"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit 5247dd859b95a440baa562b9827bdeb26aa6530e.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit 79df6b43ae762263a8120f423ddb4a0811dd4b6f.
*	Preparing development version 1.2.1-SNAPSHOT	Patrick Wendell	2014-11-26	1	-1/+1
*	Preparing Spark release v1.2.0-rc1	Patrick Wendell	2014-11-26	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-rc1"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit db7f4a898af22a02b36428507f8ef2b429d78dc1.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit d7b1ecb25676d228deb6fe05efdb4e2ab9c3e30b.
*	Preparing development version 1.2.1-SNAPSHOT	Ubuntu	2014-11-26	1	-1/+1
*	Preparing Spark release v1.2.0-rc1	Ubuntu	2014-11-26	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-snapshot1"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit 38c1fbd9694430cefd962c90bc36b0d108c6124b.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-11-26	1	-1/+1
	This reverts commit d7ac6013483e83caff8ea54c228f37aeca159db8.
*	Updating GraphX programming guide and documentation	Joseph E. Gonzalez	2014-11-19	1	-0/+46
	This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
	Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
	Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
	4421964 [Joseph E. Gonzalez] updating documentation for graphx
	(cherry picked from commit 377b06820934cab6d67f3a9182528c7f417a7d98)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	[SPARK-4444] Drop VD type parameter from EdgeRDD	Ankur Dave	2014-11-17	7	-50/+40
	Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter.
	This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching.
	Author: Ankur Dave <ankurdave@gmail.com>
	Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits:
	38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public
	fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD
	(cherry picked from commit 9ac2bb18ede2e9f73c255fa33445af89aaf8a000)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	Preparing development version 1.2.1-SNAPSHOT	Ubuntu	2014-11-17	1	-1/+1
*	Preparing Spark release v1.2.0-snapshot1	Ubuntu	2014-11-17	1	-1/+1
*	Revert "Preparing Spark release v1.2.0-snapshot0"	Patrick Wendell	2014-11-16	1	-1/+1
	This reverts commit bc09875799aa373f4320d38b02618173ffa4c96f.
*	Revert "Preparing development version 1.2.1-SNAPSHOT"	Patrick Wendell	2014-11-16	1	-2/+2
	This reverts commit 6c6fd218c83a049c874b8a0ea737333c1899c94a.
*	Preparing development version 1.2.1-SNAPSHOT	Ubuntu	2014-11-17	1	-2/+2
*	Preparing Spark release v1.2.0-snapshot0	Ubuntu	2014-11-17	1	-1/+1
*	[SPARK-3666] Extract interfaces for EdgeRDD and VertexRDD	Ankur Dave	2014-11-12	4	-244/+386
	This discourages users from calling the VertexRDD and EdgeRDD constructor and makes it easier for future changes to ensure backward compatibility.
	Author: Ankur Dave <ankurdave@gmail.com>
	Closes #2530 from ankurdave/SPARK-3666 and squashes the following commits:
	d681f45 [Ankur Dave] Define getPartitions and compute in abstract class for MIMA
	1472390 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
	24201d4 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
	cbe15f2 [Ankur Dave] Remove specialized annotation from VertexRDD and EdgeRDD
	931b587 [Ankur Dave] Use abstract class instead of trait for binary compatibility
	9ba4ec4 [Ankur Dave] Mark (Vertex|Edge)RDDImpl constructors package-private
	620e603 [Ankur Dave] Extract VertexRDD interface and move implementation to VertexRDDImpl
	55b6398 [Ankur Dave] Extract EdgeRDD interface and move implementation to EdgeRDDImpl
	(cherry picked from commit a5ef58113667ff73562ce6db381cff96a0b354b0)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	Internal cleanup for aggregateMessages	Ankur Dave	2014-11-12	4	-34/+69
	1. Add EdgeActiveness enum to represent activeness criteria more cleanly than using booleans.
	2. Comments and whitespace.
	Author: Ankur Dave <ankurdave@gmail.com>
	Closes #3231 from ankurdave/aggregateMessages-followup and squashes the following commits:
	3d485c3 [Ankur Dave] Internal cleanup for aggregateMessages
	(cherry picked from commit 0402be90f7af82c8404cafbca79f5f9fb8e2bbed)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
*	[SPARK-3936] Add aggregateMessages, which supersedes mapReduceTriplets	Ankur Dave	2014-11-11	15	-376/+766
	aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:
	1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.
	2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.
	Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:
	1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
	2. Internal iterators in aggregateMessages are inlined into a while loop.
	In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).
	Subsumes apache/spark#2815. Also fixes SPARK-4173.
	Author: Ankur Dave <ankurdave@gmail.com>
	Closes #3100 from ankurdave/aggregateMessages and squashes the following commits:
	f5b65d0 [Ankur Dave] Address @rxin comments on apache/spark#3054 and apache/spark#3100
	1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
	194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
	e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
	c85076d [Ankur Dave] Readability improvements
	b567be2 [Ankur Dave] iter.foreach -> while loop
	4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition
	(cherry picked from commit faeb41de215d3ac567ce72a43ab242ad433ca93e)
	Signed-off-by: Reynold Xin <rxin@databricks.com>
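The imperative send/merge style described in this commit can be illustrated with a toy model that does not depend on Spark. This is only a sketch of the idea: `Edge`, `Ctx`, and this `aggregateMessages` are simplified stand-ins written for illustration, not the GraphX classes, and they ignore partitioning, vertex attributes, and TripletFields.

```scala
import scala.collection.mutable

object AggregateMessagesSketch {
  // Hypothetical minimal edge type: only the endpoint ids.
  final case class Edge(src: Long, dst: Long)

  // Toy stand-in for an edge context: messages are sent imperatively and
  // merged per destination vertex as soon as they arrive.
  final class Ctx[A](val edge: Edge, acc: mutable.Map[Long, A], merge: (A, A) => A) {
    private def send(id: Long, msg: A): Unit =
      acc(id) = acc.get(id).map(merge(_, msg)).getOrElse(msg)
    def sendToSrc(msg: A): Unit = send(edge.src, msg)
    def sendToDst(msg: A): Unit = send(edge.dst, msg)
  }

  // Visits every edge once, lets sendMsg emit messages through the context,
  // and returns the merged message for each vertex that received one.
  def aggregateMessages[A](edges: Seq[Edge])(
      sendMsg: Ctx[A] => Unit, mergeMsg: (A, A) => A): Map[Long, A] = {
    val acc = mutable.Map.empty[Long, A]
    edges.foreach(e => sendMsg(new Ctx(e, acc, mergeMsg)))
    acc.toMap
  }

  // Usage: count the degree of each vertex in a three-edge triangle.
  val degrees: Map[Long, Int] =
    aggregateMessages[Int](Seq(Edge(1, 2), Edge(1, 3), Edge(2, 3)))(
      ctx => { ctx.sendToSrc(1); ctx.sendToDst(1) }, _ + _)
}
```

In this toy graph every vertex touches two edges, so `degrees` maps each of the ids 1, 2, and 3 to 2.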