aboutsummaryrefslogtreecommitdiff
path: root/graphx
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-3397] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOTGuoQiang Li2014-09-061-1/+1
| | | | | | | | Author: GuoQiang Li <witgo@qq.com> Closes #2268 from witgo/SPARK-3397 and squashes the following commits: eaf913f [GuoQiang Li] Bump pom.xml version number of master branch to 1.2.0-SNAPSHOT
* [HOTFIX] [SPARK-3400] Revert 9b225ac "fix GraphX EdgeRDD zipPartitions"Ankur Dave2014-09-032-18/+2
| | | | | | | | | | | 9b225ac3072de522b40b46aba6df1f1c231f13ef has been causing GraphX tests to fail nondeterministically, which is blocking development for others. Author: Ankur Dave <ankurdave@gmail.com> Closes #2271 from ankurdave/SPARK-3400 and squashes the following commits: 10c2a97 [Ankur Dave] [HOTFIX] [SPARK-3400] Revert 9b225ac "fix GraphX EdgeRDD zipPartitions"
* [SPARK-3263][GraphX] Fix changes made to GraphGenerator.logNormalGraph in PR ↵RJ Nowling2014-09-032-29/+146
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | #720 PR #720 made multiple changes to GraphGenerator.logNormalGraph including: * Replacing the call to functions for generating random vertices and edges with in-line implementations with different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect. * Hard-coding of RNG seeds so that method now generates the same graph for a given number of vertices, edges, mu, and sigma -- user is not able to override seed or specify that seed should be randomly generated. * Backwards-incompatible change to logNormalGraph signature with introduction of new required parameter. * Failed to update scala docs and programming guide for API changes * Added a Synthetic Benchmark in the examples. This PR: * Removes the in-line calls and calls original vertex / edge generation functions again * Adds an optional seed parameter for deterministic behavior (when desired) * Keeps the number of partitions parameter that was added. * Keeps compatibility with the synthetic benchmark example * Maintains backwards-compatible API Author: RJ Nowling <rnowling@gmail.com> Author: Ankur Dave <ankurdave@gmail.com> Closes #2168 from rnowling/graphgenrand and squashes the following commits: f1cd79f [Ankur Dave] Style fixes e11918e [RJ Nowling] Fix bad comparisons in unit tests 785ac70 [RJ Nowling] Fix style error c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed 41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed 799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal 43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges 2faf75f [RJ Nowling] Added unit test for logNormalGraph 82f22397 [RJ Nowling] Add unit test for sampleLogNormal b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing 6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges 1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed 1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate 98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters d40141a [RJ Nowling] Fix style error 684804d [RJ Nowling] revert PR #720 which introduce errors in logNormalGraph and messed up seeding of RNGs. Add user-defined optional seed for deterministic behavior c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed. 015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit 894ecde04
* [SPARK-2823][GraphX]fix GraphX EdgeRDD zipPartitionsluluorta2014-09-022-2/+18
| | | | | | | | | | | If the users set “spark.default.parallelism” and the value is different with the EdgeRDD partition number, GraphX jobs will throw: java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions Author: luluorta <luluorta@gmail.com> Closes #1763 from luluorta/fix-graph-zip and squashes the following commits: 8338961 [luluorta] fix GraphX EdgeRDD zipPartitions
* [SPARK-2981][GraphX] EdgePartition1D Int overflowLarry Xiao2014-09-021-1/+1
| | | | | | | | | | | minor fix detail is here: https://issues.apache.org/jira/browse/SPARK-2981 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1902 from larryxiao/2981 and squashes the following commits: 88059a2 [Larry Xiao] [SPARK-2981][GraphX] EdgePartition1D Int overflow
* [SPARK-3123][GraphX]: override the "setName" function to set EdgeRDD's name ↵uncleGen2014-09-021-1/+9
| | | | | | | | | | manually just as VertexRDD does. Author: uncleGen <hustyugm@gmail.com> Closes #2033 from uncleGen/master_origin and squashes the following commits: 801994b [uncleGen] Update EdgeRDD.scala
* [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examplesLarry Xiao2014-09-021-161/+0
| | | | | | | | | | | to support ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt --numEPart=256 Author: Larry Xiao <xiaodi@sjtu.edu.cn> Closes #1766 from larryxiao/1986 and squashes the following commits: bb77cd9 [Larry Xiao] [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples
* [SPARK-3190] Avoid overflow in VertexRDD.count()Ankur Dave2014-08-281-1/+1
| | | | | | | | | | | | | | | | | | VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them. The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion). ```scala val pairs = sc.parallelize(0L until 500L).map(_ * 10000000) .flatMap(start => start until (start + 10000000)).map(x => (x, x)) VertexRDD(pairs).count() ``` Author: Ankur Dave <ankurdave@gmail.com> Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits: 641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
* SPARK-2045 Sort-based shuffleMatei Zaharia2014-07-302-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.) The main TODOs still left are: - [x] enabling ExternalSorter to merge across spilled files - [x] with an Ordering - [x] without an Ordering, using the keys' hash codes - [x] adding more tests (e.g. a version of our shuffle suite that runs on this) - [x] rebasing on top of the size-tracking refactoring in #1165 when that is merged - [x] disabling spilling if spark.shuffle.spill is set to false Despite this though, this seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create much fewer files for the OS to keep track of). So I'm posting it to get some early feedback. After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`. Author: Matei Zaharia <matei@databricks.com> Closes #1499 from mateiz/sort-based-shuffle and squashes the following commits: bd841f9 [Matei Zaharia] Various review comments d1c137fd [Matei Zaharia] Various review comments a611159 [Matei Zaharia] Compile fixes due to rebase 62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s. f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic) 9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase 0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test 03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle 3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer 44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes 5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data: 5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition) e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it) c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark] 4988d16 [Matei Zaharia] tweak c1b7572 [Matei Zaharia] Small optimization ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering 4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given e1f84be [Matei Zaharia] Fix disk block manager test 5a40a1c [Matei Zaharia] More tests 614f1b4 [Matei Zaharia] Add spill metrics to map tasks cc52caf [Matei Zaharia] Add more error handling and tests for error cases bbf359d [Matei Zaharia] More work 3a56341 [Matei Zaharia] More partial work towards sort-based shuffle 7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle b615476 [Matei Zaharia] Scaffolding for sort-based shuffle
* SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to ↵Sean Owen2014-07-301-2/+4
| | | | | | | | | | | | | | | | | | Math.exp, Math.log In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However the correct answer is very near `p`. This is why `Math.log1p` exists. Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method. While the errors occur only for very small arguments, given their use in machine learning algorithms, this is entirely possible. Also note the related PR for Python: https://github.com/apache/spark/pull/1652 Author: Sean Owen <srowen@gmail.com> Closes #1659 from srowen/SPARK-2748 and squashes the following commits: c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
* [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix)Cheng Lian2014-07-281-1/+1
| | | | | | | | | | | | | | | JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410) Another try for #1399 & #1600. Those two PR breaks Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module is defined outside the `hive-thriftserver` profile. Thus every time a pull request that doesn't touch SQL code will also execute test suites defined in `hive-thriftserver`, but tests fail because related .class files are not included in the assembly jar. In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits: 629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
* Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"Patrick Wendell2014-07-271-1/+1
| | | | This reverts commit f6ff2a61d00d12481bfb211ae13d6992daacdcc2.
* [SPARK-2410][SQL] Merging Hive Thrift/JDBC serverCheng Lian2014-07-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.) JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410) Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc). Thanks chenghao-intel for his initial contribution of the Spark SQL CLI. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1600 from liancheng/jdbc and squashes the following commits: ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds 090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR 21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd] 199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver 1083e9d [Cheng Lian] Fixed failed test suites 7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic 9cc0f06 [Cheng Lian] Starts beeline with spark-submit cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile 061880f [Cheng Lian] Addressed all comments by @pwendell 7755062 [Cheng Lian] Adapts test suites to spark-submit settings 40bafef [Cheng Lian] Fixed more license header issues e214aab [Cheng Lian] Added missing license headers b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft 3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit 61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit 2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
* Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"Michael Armbrust2014-07-251-1/+1
| | | | | | | | | | | | This reverts commit 06dc0d2c6b69c5d59b4d194ced2ac85bfe2e05e2. #1399 is making Jenkins fail. We should investigate and put this back after its passing tests. Author: Michael Armbrust <michael@databricks.com> Closes #1594 from marmbrus/revertJDBC and squashes the following commits: 59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
* [SPARK-2410][SQL] Merging Hive Thrift/JDBC serverCheng Lian2014-07-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA issue: - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410) - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678) Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc). (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.) TODO - [x] Use `spark-submit` to launch the server, the CLI and beeline - [x] Migration guideline draft for Shark users ---- Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example: ```bash $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help ``` This actually shows usage information of `SparkSubmit` rather than `BeeLine`. ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~ **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes to this bug since it involves more subtle considerations and worth a separate PR. Author: Cheng Lian <lian.cs.zju@gmail.com> Closes #1399 from liancheng/thriftserver and squashes the following commits: 090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR 21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd] 199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver 1083e9d [Cheng Lian] Fixed failed test suites 7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic 9cc0f06 [Cheng Lian] Starts beeline with spark-submit cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile 061880f [Cheng Lian] Addressed all comments by @pwendell 7755062 [Cheng Lian] Adapts test suites to spark-submit settings 40bafef [Cheng Lian] Fixed more license header issues e214aab [Cheng Lian] Added missing license headers b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft 3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit 61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit 2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
* Replace RoutingTableMessage with pairAnkur Dave2014-07-234-30/+36
| | | | | | | | | | | | | | | | | | | | | | | RoutingTableMessage was used to construct routing tables to enable joining VertexRDDs with partitioned edges. It stored three elements: the destination vertex ID, the source edge partition, and a byte specifying the position in which the edge partition referenced the vertex to enable join elimination. However, this was incompatible with sort-based shuffle (SPARK-2045). It was also slightly wasteful, because partition IDs are usually much smaller than 2^32, though this was mitigated by a custom serializer that used variable-length encoding. This commit replaces RoutingTableMessage with a pair of (VertexId, Int) where the Int encodes both the source partition ID (in the lower 30 bits) and the position (in the top 2 bits). Author: Ankur Dave <ankurdave@gmail.com> Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits: 697e17b [Ankur Dave] Replace RoutingTableMessage with pair
* Remove GraphX MessageToPartition for compatibility with sort-based shuffleAnkur Dave2014-07-225-228/+2
| | | | | | | | | | | MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB). Author: Ankur Dave <ankurdave@gmail.com> Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits: f9d0054 [Ankur Dave] Remove MessageToPartition ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
* Graphx exampleCrazyJvm2014-07-222-5/+5
| | | | | | | | | | | fix examples Author: CrazyJvm <crazyjvm@gmail.com> Closes #1523 from CrazyJvm/graphx-example and squashes the following commits: 663457a [CrazyJvm] outDegrees does not take parameters 7cfff1d [CrazyJvm] fix example for joinVertices
* [SPARK-2455] Mark (Shippable)VertexPartition serializableAnkur Dave2014-07-125-14/+37
| | | | | | | | | | | | | VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization. The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization. Author: Ankur Dave <ankurdave@gmail.com> Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits: ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable 1fd42c5 [Ankur Dave] Add failing tests for Java serialization
* fix Graph partitionStrategy commentCrazyJvm2014-07-112-4/+6
| | | | | | | | | Author: CrazyJvm <crazyjvm@gmail.com> Closes #1368 from CrazyJvm/graph-comment-1 and squashes the following commits: d47f3c5 [CrazyJvm] fix style e190d6f [CrazyJvm] fix Graph partitionStrategy comment
* [SPARK-1776] Have Spark's SBT build read dependencies from Maven.Prashant Sharma2014-07-101-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch introduces the new way of working also retaining the existing ways of doing things. For example build instruction for yarn in maven is `mvn -Pyarn -PHadoop2.2 clean package -DskipTests` in sbt it can become `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly` Also supports `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly` Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #772 from ScrapCodes/sbt-maven and squashes the following commits: a8ac951 [Prashant Sharma] Updated sbt version. 62b09bb [Prashant Sharma] Improvements. fa6221d [Prashant Sharma] Excluding sql from mima 4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default. 72651ca [Prashant Sharma] Addresses code reivew comments. acab73d [Prashant Sharma] Revert "Small fix to run-examples script." ac4312c [Prashant Sharma] Revert "minor fix" 6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit. 65cf06c [Prashant Sharma] Servelet API jars mess up with the other servlet jars on the class path. 446768e [Prashant Sharma] minor fix 89b9777 [Prashant Sharma] Merge conflicts d0a02f2 [Prashant Sharma] Bumped up pom versions, Since the build now depends on pom it is better updated there. + general cleanups. dccc8ac [Prashant Sharma] updated mima to check against 1.0 a49c61b [Prashant Sharma] Fix for tools jar a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies. cf88758 [Prashant Sharma] cleanup 9439ea3 [Prashant Sharma] Small fix to run-examples script. 96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven. 36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins. 4973dbd [Patrick Wendell] Example build using pom reader.
* [SPARK-2124] Move aggregation into shuffle implementationsjerryshao2014-06-232-5/+6
| | | | | | | | | | | | | | | | | | This PR is a sub-task of SPARK-2044 to move the execution of aggregation into shuffle implementations. I leave `CoGoupedRDD` and `SubtractedRDD` unchanged because they have their implementations of aggregation. I'm not sure is it suitable to change these two RDDs. Also I do not move sort related code of `OrderedRDDFunctions` into shuffle, this will be solved in another sub-task. Author: jerryshao <saisai.shao@intel.com> Closes #1064 from jerryshao/SPARK-2124 and squashes the following commits: 4a05a40 [jerryshao] Modify according to comments 1f7dcc8 [jerryshao] Style changes 50a2fd6 [jerryshao] Fix test suite issue after moving aggregator to Shuffle reader and writer 1a96190 [jerryshao] Code modification related to the ShuffledRDD 308f635 [jerryshao] initial works of move combiner to ShuffleManager's reader and writer
* [SPARK-1552] Fix type comparison bug in {map,outerJoin}VerticesAnkur Dave2014-06-055-8/+40
| | | | | | | | | | | | | | | | | | | | | In GraphImpl, mapVertices and outerJoinVertices use a more efficient implementation when the map function conserves vertex attribute types. This is implemented by comparing the ClassTags of the old and new vertex attribute types. However, ClassTags store erased types, so the comparison will return a false positive for types with different type parameters, such as Option[Int] and Option[Double]. This PR resolves the problem by requesting that the compiler generate evidence of equality between the old and new vertex attribute types, and providing a default value for the evidence parameter if the two types are not equal. The methods can then check the value of the evidence parameter to see whether the types are equal. It also adds a test called "mapVertices changing type with same erased type" that failed before the PR and succeeds now. Callers of mapVertices and outerJoinVertices can no longer use a wildcard for a graph's VD type. To avoid "Error occurred in an application involving default arguments," they must bind VD to a type parameter, as this PR does for ShortestPaths and LabelPropagation. Author: Ankur Dave <ankurdave@gmail.com> Closes #967 from ankurdave/SPARK-1552 and squashes the following commits: 68a4fff [Ankur Dave] Undo conserve naming 7388705 [Ankur Dave] Remove unnecessary ClassTag for VD parameters a704e5f [Ankur Dave] Use type equality constraint with default argument 29a5ab7 [Ankur Dave] Add failing test f458c83 [Ankur Dave] Revert "[SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices" 16d6af8 [Ankur Dave] [SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices
* [SPARK-2025] Unpersist edges of previous graph in PregelAnkur Dave2014-06-051-0/+1
| | | | | | | | | | | | | | Due to a bug introduced by apache/spark#497, Pregel does not unpersist replicated vertices from previous iterations. As a result, they stay cached until memory is full, wasting GC time. This PR corrects the problem by unpersisting both the edges and the replicated vertices of previous iterations. This is safe because the edges and replicated vertices of the current iteration are cached by the call to `g.cache()` and then materialized by the call to `messages.count()`. Therefore no unmaterialized RDDs depend on `prevG.edges`. I verified that no recomputation occurs by running PageRank with a custom patch to Spark that warns when a partition is recomputed. Thanks to Tim Weninger for reporting this bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #972 from ankurdave/SPARK-2025 and squashes the following commits: 13d5b07 [Ankur Dave] Unpersist edges of previous graph in Pregel
* [SPARK-2029] Bump pom.xml version number of master branch to 1.1.0-SNAPSHOT.Takuya UESHIN2014-06-051-1/+1
| | | | | | | | Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #974 from ueshin/issues/SPARK-2029 and squashes the following commits: e19e8f4 [Takuya UESHIN] Bump version number to 1.1.0-SNAPSHOT.
* Minor: Fix documentation error from apache/spark#946Ankur Dave2014-06-041-2/+2
| | | | | | | | Author: Ankur Dave <ankurdave@gmail.com> Closes #970 from ankurdave/SPARK-1991_docfix and squashes the following commits: 6d07343 [Ankur Dave] Minor: Fix documentation error from apache/spark#946
* Enable repartitioning of graph over different number of partitionsJoseph E. Gonzalez2014-06-033-4/+20
| | | | | | | | | | It is currently very difficult to repartition a graph over a different number of partitions. This PR adds an additional `partitionBy` function that takes the number of partitions. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #719 from jegonzal/graph_partitioning_options and squashes the following commits: 730b405 [Joseph E. Gonzalez] adding an additional number of partitions option to partitionBy
* [SPARK-1991] Support custom storage levels for vertices and edgesAnkur Dave2014-06-037-97/+208
| | | | | | | | | | | | | | | | | | | | | This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed. The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels. In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods. I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed. Author: Ankur Dave <ankurdave@gmail.com> Closes #946 from ankurdave/SPARK-1991 and squashes the following commits: ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0 c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks" 34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks 6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
* Synthetic GraphX BenchmarkJoseph E. Gonzalez2014-06-032-9/+41
| | | | | | | | | | | | | | | | | | | | This PR accomplishes two things: 1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile GraphX system on arbitrary clusters without access to large graph datasets 2. This PR improves the implementation of the log-normal graph generator. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Author: Ankur Dave <ankurdave@gmail.com> Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits: e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0 bccccad [Ankur Dave] Fix long lines 374678a [Ankur Dave] Bugfix and style changes 1bdf39a [Joseph E. Gonzalez] updating options d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder. f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
* [SPARK-1942] Stop clearing spark.driver.port in unit testsSyed Hashmi2014-06-031-2/+0
| | | | | | | | | | | | | | | | | | | stop resetting spark.driver.port in unit tests (scala, java and python). Author: Syed Hashmi <shashmi@cloudera.com> Author: CodingCat <zhunansjtu@gmail.com> Closes #943 from syedhashmi/master and squashes the following commits: 885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool) b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master' b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner" 57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner" 1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests 4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread" fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner 6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread 4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
* Add landmark-based Shortest Path algorithm to graphx.libAnkur Dave2014-06-022-0/+120
| | | | | | | | | | | | | | | This is a modified version of apache/spark#10. Author: Ankur Dave <ankurdave@gmail.com> Author: Andres Perez <andres@tresata.com> Closes #933 from ankurdave/shortestpaths and squashes the following commits: 03a103c [Ankur Dave] Style fixes 7a1ff48 [Ankur Dave] Improve ShortestPaths documentation d75c8fc [Ankur Dave] Remove unnecessary VD type param, and pass through ED d983fb4 [Ankur Dave] Fix style errors 60ed8e6 [Andres Perez] Add Shortest-path computations to graphx.lib with unit tests.
* initial version of LPAAnkur Dave2014-05-292-0/+111
| | | | | | | | | | | | | | | | | A straightforward implementation of LPA algorithm for detecting graph communities using the Pregel framework. Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach. Author: Ankur Dave <ankurdave@gmail.com> Author: haroldsultan <haroldsultan@gmail.com> Author: Harold Sultan <haroldsultan@gmail.com> Closes #905 from haroldsultan/master and squashes the following commits: 327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation 227a4d0 [Ankur Dave] Untabify 0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation 0e24303 [Ankur Dave] Add LabelPropagationSuite 84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA 9830342 [Harold Sultan] initial version of LPA
* [SPARK-1931] Reconstruct routing tables in Graph.partitionByAnkur Dave2014-05-263-4/+31
| | | | | | | | | | | | | 905173df57b90f90ebafb22e43f55164445330e6 introduced a bug in partitionBy where, after repartitioning the edges, it reuses the VertexRDD without updating the routing tables to reflect the new edge layout. Subsequent accesses of the triplets contain nulls for many vertex properties. This commit adds a test for this bug and fixes it by introducing `VertexRDD#withEdges` and calling it in `partitionBy`. Author: Ankur Dave <ankurdave@gmail.com> Closes #885 from ankurdave/SPARK-1931 and squashes the following commits: 3930cdd [Ankur Dave] Note how to set up VertexRDD for efficient joins 9bdbaa4 [Ankur Dave] [SPARK-1931] Reconstruct routing tables in Graph.partitionBy
* bugfix: overflow of graphx Edge compare functionZhen Peng2014-05-162-2/+47
| | | | | | | | | Author: Zhen Peng <zhenpeng01@baidu.com> Closes #769 from zhpengg/bugfix-graphx-edge-compare and squashes the following commits: 8a978ff [Zhen Peng] add ut for graphx Edge.lexicographicOrdering.compare 413c258 [Zhen Peng] there maybe a overflow for two Long's substraction
* Fixes a misplaced comment.Prashant Sharma2014-05-151-2/+2
| | | | | | | | | | | | | Fixes a misplaced comment from #785. @pwendell Author: Prashant Sharma <prashant.s@imaginea.com> Closes #788 from ScrapCodes/patch-1 and squashes the following commits: 3ef6a69 [Prashant Sharma] Update package-info.java 67d9461 [Prashant Sharma] Update package-info.java
* Package docsPrashant Sharma2014-05-145-0/+110
| | | | | | | | | | | | | | This is a few changes based on the original patch by @scrapcodes. Author: Prashant Sharma <prashant.s@imaginea.com> Author: Patrick Wendell <pwendell@gmail.com> Closes #785 from pwendell/package-docs and squashes the following commits: c32b731 [Patrick Wendell] Changes based on Prashant's patch c0463d3 [Prashant Sharma] added eof new line ce8bf73 [Prashant Sharma] Added eof new line to all files. 4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
* SPARK-1798. Tests should clean up temp filesSean Owen2014-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent. Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of former. The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules. Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method. _If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._ Author: Sean Owen <sowen@cloudera.com> Closes #732 from srowen/SPARK-1798 and squashes the following commits: 5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
* SPARK-1786: Reopening PR 724Ankur Dave2014-05-1211-23/+44
| | | | | | | | | | | | | | | Addressing issue in MimaBuild.scala. Author: Ankur Dave <ankurdave@gmail.com> Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #742 from jegonzal/edge_partition_serialization and squashes the following commits: 8ba6e0d [Ankur Dave] Add concatenation operators to MimaBuild.scala cb2ed3a [Joseph E. Gonzalez] addressing missing exclusion in MimaBuild.scala 5d27824 [Ankur Dave] Disable reference tracking to fix serialization test c0a9ae5 [Ankur Dave] Add failing test for EdgePartition Kryo serialization a4a3faa [Joseph E. Gonzalez] Making EdgePartition serializable.
* Revert "SPARK-1786: Edge Partition Serialization"Patrick Wendell2014-05-1211-44/+23
| | | | This reverts commit a6b02fb7486356493474c7f42bb714c9cce215ca.
* SPARK-1786: Edge Partition SerializationAnkur Dave2014-05-1111-23/+44
| | | | | | | | | | | | | This appears to address the issue with edge partition serialization. The solution appears to be just registering the `PrimitiveKeyOpenHashMap`. However I noticed that we appear to have forked that code in GraphX but retained the same name (which is confusing). I also renamed our local copy to `GraphXPrimitiveKeyOpenHashMap`. We should consider dropping that and using the one in Spark if possible. Author: Ankur Dave <ankurdave@gmail.com> Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #724 from jegonzal/edge_partition_serialization and squashes the following commits: b0a525a [Ankur Dave] Disable reference tracking to fix serialization test bb7f548 [Ankur Dave] Add failing test for EdgePartition Kryo serialization 67dac22 [Joseph E. Gonzalez] Making EdgePartition serializable.
* Fix error in 2d Graph PartitionerJoseph E. Gonzalez2014-05-111-2/+2
| | | | | | | | | | Their was a minor bug in which negative partition ids could be generated when constructing a 2D partitioning of a graph. This could lead to an inefficient 2D partition for large vertex id values. Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com> Closes #709 from jegonzal/fix_2d_partitioning and squashes the following commits: 937c562 [Joseph E. Gonzalez] fixing bug in 2d partitioning algorithm where negative partition ids could be generated.
* Unify GraphImpl RDDs + other graph load optimizationsAnkur Dave2014-05-1026-843/+1337
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR makes the following changes, primarily in e4fbd329aef85fe2c38b0167255d2a712893d683: 1. *Unify RDDs to avoid zipPartitions.* A graph used to be four RDDs: vertices, edges, routing table, and triplet view. This commit merges them down to two: vertices (with routing table), and edges (with replicated vertices). 2. *Avoid duplicate shuffle in graph building.* We used to do two shuffles when building a graph: one to extract routing information from the edges and move it to the vertices, and another to find nonexistent vertices referred to by edges. With this commit, the latter is done as a side effect of the former. 3. *Avoid no-op shuffle when joins are fully eliminated.* This is a side effect of unifying the edges and the triplet view. 4. *Join elimination for mapTriplets.* 5. *Ship only the needed vertex attributes when upgrading the triplet view.* If the triplet view already contains source attributes, and we now need both attributes, only ship destination attributes rather than re-shipping both. This is done in `ReplicatedVertexView#upgrade`. Author: Ankur Dave <ankurdave@gmail.com> Closes #497 from ankurdave/unify-rdds and squashes the following commits: 332ab43 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds 4933e2e [Ankur Dave] Exclude RoutingTable from binary compatibility check 5ba8789 [Ankur Dave] Add GraphX upgrade guide from Spark 0.9.1 13ac845 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into unify-rdds a04765c [Ankur Dave] Remove unnecessary toOps call 57202e8 [Ankur Dave] Replace case with pair parameter 75af062 [Ankur Dave] Add explicit return types 04d3ae5 [Ankur Dave] Convert implicit parameter to context bound c88b269 [Ankur Dave] Revert upgradeIterator to if-in-a-loop 0d3584c [Ankur Dave] EdgePartition.size should be val 2a928b2 [Ankur Dave] Set locality wait 10b3596 [Ankur Dave] Clean up public API ae36110 [Ankur Dave] Fix style errors e4fbd32 [Ankur Dave] Unify GraphImpl RDDs + other graph load optimizations d6d60e2 [Ankur Dave] In GraphLoader, coalesce to minEdgePartitions 62c7b78 [Ankur Dave] In Analytics, take PageRank numIter d64e8d4 [Ankur Dave] Log current Pregel iteration
* SPARK-1708. Add a ClassTag on Serializer and things that depend on itMatei Zaharia2014-05-102-23/+27
| | | | | | | | | | | | | | | | | This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility. One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly. CC @rxin, @pwendell, @heathermiller Author: Matei Zaharia <matei@databricks.com> Closes #700 from mateiz/spark-1708 and squashes the following commits: 1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java 3b449ed [Matei Zaharia] test fix 2209a27 [Matei Zaharia] Code style fixes 9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
* SPARK-1565, update examples to be used with spark-submit script.Prashant Sharma2014-05-081-7/+11
| | | | | | | | | | | | | | | | Commit for initial feedback, basically I am curious if we should prompt user for providing args esp. when its mandatory. And can we skip if they are not ? Also few other things that did not work like `bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2` Not all the args get passed properly, may be I have messed up something will try to sort it out hopefully. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits: 669dd23 [Prashant Sharma] Review comments 2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
* [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...Kan Zhang2014-05-072-16/+4
| | | | | | | | | | | | ... that do not change schema Author: Kan Zhang <kzhang@apache.org> Closes #448 from kanzhang/SPARK-1460 and squashes the following commits: 111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD 91dc787 [Kan Zhang] Taking into account newly added Ordering param 79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
* Improved build configurationwitgo2014-04-281-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x 2, Fix SPARK-1491: maven hadoop-provided profile fails to build 3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency 4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces) Author: witgo <witgo@qq.com> Closes #480 from witgo/format_pom and squashes the following commits: 03f652f [witgo] review commit b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence 7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence 0da4bc3 [witgo] merge master d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom e345919 [witgo] add avro dependency to yarn-alpha 77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency 1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 934f24d [witgo] review commit cf46edc [witgo] exclude jruby 06e7328 [witgo] Merge branch 'SparkBuild' into format_pom 99464d2 [witgo] fix maven hadoop-provided profile fails to build 0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x 6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
* Fix Scala StyleSandeep2014-04-241-2/+5
| | | | | | | | | | | | | | Any comments are welcome Author: Sandeep <sandeep@techaddict.me> Closes #531 from techaddict/stylefix-1 and squashes the following commits: 7492730 [Sandeep] Pass 4 98b2428 [Sandeep] fix rxin suggestions b5e2e6f [Sandeep] Pass 3 05932d7 [Sandeep] fix if else styling 2 08690e5 [Sandeep] fix if else styling
* Mark all fields of EdgePartition, Graph, and GraphOps transientAnkur Dave2014-04-233-12/+12
| | | | | | | | | | | | These classes are only serializable to work around closure capture, so their fields should all be marked `@transient` to avoid wasteful serialization. This PR supersedes apache/spark#519 and fixes the same bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #520 from ankurdave/graphx-transient and squashes the following commits: 6431760 [Ankur Dave] Mark all fields of EdgePartition, Graph, and GraphOps `@transient`
* SPARK-1329: Create pid2vid with correct number of partitionsAnkur Dave2014-04-162-2/+14
| | | | | | | | | | | | | | Each vertex partition is co-located with a pid2vid array created in RoutingTable.scala. This array maps edge partition IDs to the list of vertices in the current vertex partition that are mentioned by edges in that partition. Therefore the pid2vid array should have one entry per edge partition. GraphX currently creates one entry per *vertex* partition, which is a bug that leads to an ArrayIndexOutOfBoundsException when there are more edge partitions than vertex partitions. This commit fixes the bug and adds a test for this case. Resolves SPARK-1329. Thanks to Daniel Darabos for reporting this bug. Author: Ankur Dave <ankurdave@gmail.com> Closes #368 from ankurdave/fix-pid2vid-size and squashes the following commits: 5a5c52a [Ankur Dave] SPARK-1329: Create pid2vid with correct number of partitions
* Rebuild routing table after Graph.reverseAnkur Dave2014-04-162-1/+11
| | | | | | | | | | | | | | | | GraphImpl.reverse used to reverse edges in each partition of the edge RDD but preserve the routing table and replicated vertex view, since reversing should not affect partitioning. However, the old routing table would then have incorrect information for srcAttrOnly and dstAttrOnly. These RDDs should be switched. A simple fix is for Graph.reverse to rebuild the routing table and replicated vertex view. Thanks to Bogdan Ghidireac for reporting this issue on the [mailing list](http://apache-spark-user-list.1001560.n3.nabble.com/graph-reverse-amp-Pregel-API-td4338.html). Author: Ankur Dave <ankurdave@gmail.com> Closes #431 from ankurdave/fix-reverse-bug and squashes the following commits: 75d63cb [Ankur Dave] Rebuild routing table after Graph.reverse