| Commit message | Author | Age | Files | Lines |
This PR:
- Re-enables `surefire` and copies its config from `scalatest` (which is itself an old fork of `surefire`, so the two are similar)
- Tells `surefire` to run only the Java tests
- Enables `surefire` and `scalatest` for all children, which in turn eliminates some duplication.
For me this causes the Scala and Java tests each to be run once, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala and Java tests are being run.
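As a rough sketch of the kind of centralized test-plugin configuration described above (the include patterns and placement are assumptions for illustration, not the PR's actual diff), the parent POM might tell surefire to pick up only Java tests:

```xml
<!-- illustrative parent-pom fragment; patterns are assumptions, not the PR's diff -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-surefire-plugin</artifactId>
  <configuration>
    <!-- surefire runs only Java tests; scalatest handles the Scala suites -->
    <includes>
      <include>**/Test*.java</include>
      <include>**/*Test.java</include>
      <include>**/*Suite.java</include>
    </includes>
  </configuration>
</plugin>
```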
Author: Sean Owen <sowen@cloudera.com>
Closes #3651 from srowen/SPARK-4159 and squashes the following commits:
2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
The total number of vertices in the matrix (v0 to v11) is 12, and I think the same block overlaps in this strategy.
This is a minor PR, so I didn't file a JIRA.
Author: kj-ki <kikushima.kenji@lab.ntt.co.jp>
Closes #3904 from kj-ki/fix-partitionstrategy-comments and squashes the following commits:
79829d9 [kj-ki] Fix comments for 2D partitioning.
As we learned in #3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.
This is a follow-up PR for the rest of Spark (outside Spark SQL). The original PR for Spark SQL can be found at https://github.com/apache/spark/pull/3859
Author: Reynold Xin <rxin@databricks.com>
Closes #3860 from rxin/implicit and squashes the following commits:
73702f9 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions.
Adds an interface to uncache both the vertices and edges of Graph/GraphImpl.
This interface is useful when iterative graph operations build a new graph in each iteration, and the vertices and edges of previous iterations are no longer needed in later iterations.
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
This patch had conflicts when merged, resolved by
Committer: Ankur Dave <ankurdave@gmail.com>
Closes #3476 from maropu/UnpersistInGraphSpike and squashes the following commits:
77a006a [Takeshi Yamamuro] Add unpersist in Graph and GraphImpl
This patch replaces the native quicksort with Sorter (TimSort) in Spark.
It yields performance gains of roughly 8% in my quick experiments.
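For readers unfamiliar with the trade-off: TimSort is the stable, run-detecting sort the JVM uses for object arrays, and it exploits pre-sorted runs that a plain quicksort cannot. A minimal Java illustration (the `Edge` class and method names are invented for the example, not GraphX's `EdgePartitionBuilder`):

```java
import java.util.Arrays;
import java.util.Comparator;

public class EdgeSort {
    public static class Edge {
        public final long srcId, dstId;
        public Edge(long srcId, long dstId) { this.srcId = srcId; this.dstId = dstId; }
    }

    public static Edge[] sortBySrc(Edge[] edges) {
        // Arrays.sort on object arrays is TimSort: stable, and fast on
        // partially sorted input such as edges grouped by source.
        Arrays.sort(edges, Comparator.comparingLong((Edge e) -> e.srcId));
        return edges;
    }

    public static void main(String[] args) {
        Edge[] edges = { new Edge(3, 1), new Edge(1, 2), new Edge(2, 3), new Edge(1, 9) };
        sortBySrc(edges);
        for (Edge e : edges) System.out.println(e.srcId + " -> " + e.dstId);
    }
}
```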
Author: Takeshi Yamamuro <linguin.m.s@gmail.com>
Closes #3507 from maropu/TimSortInEdgePartitionBuilderSpike and squashes the following commits:
8d4e5d2 [Takeshi Yamamuro] Remove a wildcard import
3527e00 [Takeshi Yamamuro] Replace Scala.util.Sorting.quickSort with Sorter(TimSort) in Spark
Author: GuoQiang Li <witgo@qq.com>
Closes #2631 from witgo/SPARK-3623 and squashes the following commits:
a70c500 [GuoQiang Li] Remove java related
4d1e249 [GuoQiang Li] Add comments
e682724 [GuoQiang Li] Graph should support the checkpoint operation
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672
In a nutshell, if `val partitionsRDD` in EdgeRDDImpl and VertexRDDImpl is non-transient, the serialization chain can become very long in iterative algorithms and eventually lead to a StackOverflow error. More details and explanation can be found in the JIRA.
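The mechanism is plain Java serialization semantics: a `transient` field is skipped, so nothing it references is pulled into the object graph being written. A minimal Java sketch (the `Node` chain below is a stand-in for the RDD lineage, not Spark code):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

public class TransientDemo {
    public static class Node implements Serializable {
        public final int value;
        public final transient Node previous; // transient: excluded from serialization

        public Node(int value, Node previous) {
            this.value = value;
            this.previous = previous;
        }
    }

    public static Node roundTrip(Node n) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            ObjectOutputStream out = new ObjectOutputStream(bytes);
            out.writeObject(n);
            out.flush();
            return (Node) new ObjectInputStream(
                new ByteArrayInputStream(bytes.toByteArray())).readObject();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        Node chain = null;
        for (int i = 0; i < 100_000; i++) chain = new Node(i, chain); // very deep chain
        // Serializing succeeds because the transient link cuts the chain;
        // a non-transient link would recurse 100,000 levels deep.
        Node copy = roundTrip(chain);
        System.out.println(copy.value);            // 99999
        System.out.println(copy.previous == null); // true: transient link dropped
    }
}
```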
Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>
Closes #3544 from JerryLead/my_graphX and squashes the following commits:
628f33c [JerryLead] set PartitionsRDD to be transient in EdgeRDDImpl and VertexRDDImpl
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
The related JIRA is https://issues.apache.org/jira/browse/SPARK-4672
Iterative GraphX applications always have long lineage, and performing checkpoint() on EdgeRDD and VertexRDD themselves cannot shorten it. In contrast, if we perform checkpoint() on their PartitionsRDD, the long lineage can be cut off. Moreover, existing operations such as cache() in this code are performed on the PartitionsRDD, so checkpoint() should work the same way. More details and explanation can be found in the JIRA.
Author: JerryLead <JerryLead@163.com>
Author: Lijie Xu <csxulijie@gmail.com>
Closes #3549 from JerryLead/my_graphX_checkpoint and squashes the following commits:
d1aa8d8 [JerryLead] Perform checkpoint() on PartitionsRDD not VertexRDD and EdgeRDD themselves
ff08ed4 [JerryLead] Merge branch 'master' of https://github.com/apache/spark
c0169da [JerryLead] Merge branch 'master' of https://github.com/apache/spark
52799e3 [Lijie Xu] Merge pull request #1 from apache/master
After additional discussion with rxin, I think having all the possible `TripletField` options is confusing. This pull request reduces the triplet fields to:
```java
/**
* None of the triplet fields are exposed.
*/
public static final TripletFields None = new TripletFields(false, false, false);
/**
* Expose only the edge field and not the source or destination field.
*/
public static final TripletFields EdgeOnly = new TripletFields(false, false, true);
/**
* Expose the source and edge fields but not the destination field. (Same as Src)
*/
public static final TripletFields Src = new TripletFields(true, false, true);
/**
* Expose the destination and edge fields but not the source field. (Same as Dst)
*/
public static final TripletFields Dst = new TripletFields(false, true, true);
/**
* Expose all the fields (source, edge, and destination).
*/
public static final TripletFields All = new TripletFields(true, true, true);
```
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3472 from jegonzal/SimplifyTripletFields and squashes the following commits:
91796b5 [Joseph E. Gonzalez] removing confusing triplet fields
This pull request revises the programming guide to reflect changes in the GraphX API as well as the deprecated mapReduceTriplets operator.
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3359 from jegonzal/GraphXProgrammingGuide and squashes the following commits:
4421964 [Joseph E. Gonzalez] updating documentation for graphx
Due to vertex attribute caching, EdgeRDD previously took two type parameters: ED and VD. However, this is an implementation detail that should not be exposed in the interface, so this PR drops the VD type parameter.
This requires removing the `filter` method from the EdgeRDD interface, because it depends on vertex attribute caching.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #3303 from ankurdave/edgerdd-drop-tparam and squashes the following commits:
38dca9b [Ankur Dave] Leave EdgeRDD.fromEdges public
fafeb51 [Ankur Dave] Drop VD type parameter from EdgeRDD
This discourages users from calling the VertexRDD and EdgeRDD constructor and makes it easier for future changes to ensure backward compatibility.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2530 from ankurdave/SPARK-3666 and squashes the following commits:
d681f45 [Ankur Dave] Define getPartitions and compute in abstract class for MIMA
1472390 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
24201d4 [Ankur Dave] Merge remote-tracking branch 'apache-spark/master' into SPARK-3666
cbe15f2 [Ankur Dave] Remove specialized annotation from VertexRDD and EdgeRDD
931b587 [Ankur Dave] Use abstract class instead of trait for binary compatibility
9ba4ec4 [Ankur Dave] Mark (Vertex|Edge)RDDImpl constructors package-private
620e603 [Ankur Dave] Extract VertexRDD interface and move implementation to VertexRDDImpl
55b6398 [Ankur Dave] Extract EdgeRDD interface and move implementation to EdgeRDDImpl
1. Add EdgeActiveness enum to represent activeness criteria more cleanly than using booleans.
2. Comments and whitespace.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #3231 from ankurdave/aggregateMessages-followup and squashes the following commits:
3d485c3 [Ankur Dave] Internal cleanup for aggregateMessages
aggregateMessages enables neighborhood computation similarly to mapReduceTriplets, but it introduces two API improvements:
1. Messages are sent using an imperative interface based on EdgeContext rather than by returning an iterator of messages.
2. Rather than attempting bytecode inspection, the required triplet fields must be explicitly specified by the user by passing a TripletFields object. This fixes SPARK-3936.
Additionally, this PR includes the following optimizations for aggregateMessages and EdgePartition:
1. EdgePartition now stores local vertex ids instead of global ids. This avoids hash lookups when looking up vertex attributes and aggregating messages.
2. Internal iterators in aggregateMessages are inlined into a while loop.
In total, these optimizations were tested to provide a 37% speedup on PageRank (uk-2007-05 graph, 10 iterations, 16 r3.2xlarge machines, sped up from 513 s to 322 s).
Subsumes apache/spark#2815. Also fixes SPARK-4173.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #3100 from ankurdave/aggregateMessages and squashes the following commits:
f5b65d0 [Ankur Dave] Address @rxin comments on apache/spark#3054 and apache/spark#3100
1e80aca [Ankur Dave] Add aggregateMessages, which supersedes mapReduceTriplets
194a2df [Ankur Dave] Test triplet iterator in EdgePartition serialization test
e0f8ecc [Ankur Dave] Take activeSet in ExistingEdgePartitionBuilder
c85076d [Ankur Dave] Readability improvements
b567be2 [Ankur Dave] iter.foreach -> while loop
4a566dc [Ankur Dave] Optimizations for mapReduceTriplets and EdgePartition
As [reported][1] on the mailing list, GraphX throws
```
java.lang.ClassCastException: java.lang.Long cannot be cast to scala.Tuple2
at org.apache.spark.graphx.impl.RoutingTableMessageSerializer$$anon$1$$anon$2.writeObject(Serializers.scala:39)
at org.apache.spark.storage.DiskBlockObjectWriter.write(BlockObjectWriter.scala:195)
at org.apache.spark.util.collection.ExternalSorter.spillToMergeableFile(ExternalSorter.scala:329)
```
when sort-based shuffle attempts to spill to disk. This is because GraphX defines custom serializers for shuffling pair RDDs that assume Spark will always serialize the entire pair object rather than breaking it up into its components. However, the spill code path in sort-based shuffle [violates this assumption][2].
GraphX uses the custom serializers to compress vertex ID keys using variable-length integer encoding. However, since the serializer can no longer rely on the key and value being serialized and deserialized together, performing such encoding would either require writing a tag byte (costly) or maintaining state in the serializer and assuming that serialization calls will alternate between key and value (fragile).
Instead, this PR simply removes the custom serializers. This causes a **10% slowdown** (494 s to 543 s) and **16% increase in per-iteration communication** (2176 MB to 2518 MB) for PageRank (averages across 3 trials, 10 iterations per trial, uk-2007-05 graph, 16 r3.2xlarge nodes).
[1]: http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassCastException-java-lang-Long-cannot-be-cast-to-scala-Tuple2-td13926.html#a14501
[2]: https://github.com/apache/spark/blob/f9d6220c792b779be385f3022d146911a22c2130/core/src/main/scala/org/apache/spark/util/collection/ExternalSorter.scala#L329
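For context, variable-length integer encoding of the kind mentioned above stores 7 payload bits per byte and uses the high bit as a continuation flag, so small vertex IDs take one or two bytes instead of eight. A generic Java sketch of the technique (not the removed GraphX serializer):

```java
import java.io.ByteArrayOutputStream;

public class VarInt {
    // 7 payload bits per byte; a set high bit means "more bytes follow".
    public static byte[] encode(long v) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        while ((v & ~0x7FL) != 0) {
            out.write((int) ((v & 0x7F) | 0x80));
            v >>>= 7;
        }
        out.write((int) v);
        return out.toByteArray();
    }

    public static long decode(byte[] bytes) {
        long v = 0;
        int shift = 0;
        for (byte b : bytes) {
            v |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return v;
    }

    public static void main(String[] args) {
        System.out.println(encode(42L).length);   // 1 byte instead of 8
        System.out.println(encode(300L).length);  // 2 bytes
        System.out.println(decode(encode(300L))); // 300
    }
}
```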
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2503 from ankurdave/SPARK-3649 and squashes the following commits:
a49c2ad [Ankur Dave] [SPARK-3649] Remove GraphX custom serializers
Initially, srcIds is not initialized and all entries are 0, so we use edgeArray(0).srcId to initialize currSrcId.
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #3138 from lianhuiwang/SPARK-4249 and squashes the following commits:
3f4e503 [lianhuiwang] fix a problem of EdgePartitionBuilder in Graphx
Accumulate the sizes of all the EdgePartitions, just as VertexRDD does.
Author: luluorta <luluorta@gmail.com>
Closes #2975 from luluorta/graph-edge-count and squashes the following commits:
86ef0e5 [luluorta] Add overrided count for edge counting of EdgeRDD.
Changing the default number of edge partitions to match spark parallelism.
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #3006 from jegonzal/default_partitions and squashes the following commits:
a9a5c4f [Joseph E. Gonzalez] Changing the default number of edge partitions to match spark parallelism
Author: Sandy Ryza <sandy@cloudera.com>
Closes #789 from sryza/sandy-spark-1813 and squashes the following commits:
48b05e9 [Sandy Ryza] Simplify
b824932 [Sandy Ryza] Allow both spark.kryo.classesToRegister and spark.kryo.registrator at the same time
6a15bb7 [Sandy Ryza] Small fix
a2278c0 [Sandy Ryza] Respond to review comments
6ef592e [Sandy Ryza] SPARK-1813. Add a utility to SparkConf that makes using Kryo really easy
Thread names are useful for correlating failures.
Author: Reynold Xin <rxin@apache.org>
Closes #2600 from rxin/log4j and squashes the following commits:
83ffe88 [Reynold Xin] [SPARK-3748] Log thread name in unit test logs
Author: oded <oded@HP-DV6.c4internal.c4-security.com>
Closes #2486 from odedz/master and squashes the following commits:
dd7890a [oded] Fixed the condition in StronglyConnectedComponents Issue: SPARK-3635
When `numVertices > 50`, the probability is set to 0, which causes an infinite loop.
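This is the classic integer-division pitfall: if the acceptance probability is computed as `50 / numVertices` in integer arithmetic, it is 0 for any `numVertices > 50`, so the rejection loop never accepts a vertex. A minimal Java illustration (the method names are invented for the example):

```java
public class PickProbability {
    // Buggy shape: integer division happens first, then the 0 is widened to 0.0.
    public static double buggy(long numVertices) {
        return 50 / numVertices;
    }

    // Fixed shape: floating-point division gives the intended probability.
    public static double fixed(long numVertices) {
        return 50.0 / numVertices;
    }

    public static void main(String[] args) {
        System.out.println(buggy(100)); // 0.0 — the retry loop would never accept
        System.out.println(fixed(100)); // 0.5
    }
}
```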
Author: yingjieMiao <yingjie@42go.com>
Closes #2553 from yingjieMiao/graphx and squashes the following commits:
6adf3c8 [yingjieMiao] [graphX] GraphOps: random pick vertex bug
GraphGenerators.sampleLogNormal is supposed to return an integer strictly less than maxVal. However, it violates this guarantee. It generates its return value as follows:
```scala
var X: Double = maxVal
while (X >= maxVal) {
val Z = rand.nextGaussian()
X = math.exp(mu + sigma*Z)
}
math.round(X.toFloat)
```
When X is sampled to be close to (but less than) maxVal, then it will pass the while loop condition, but the rounded result will be equal to maxVal, which will violate the guarantee. For example, if maxVal is 5 and X is 4.9, then X < maxVal, but `math.round(X.toFloat)` is 5.
This PR instead rounds X before checking the loop condition, guaranteeing that the condition will hold for the return value.
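A Java transcription of the fixed loop (names follow the Scala snippet above; this is a sketch, not Spark's exact code): rounding down before testing the condition makes the bound check and the returned value agree.

```java
import java.util.Random;

public class SampleLogNormal {
    // Round down *inside* the loop so the value tested against maxVal
    // is the value actually returned.
    public static int sampleLogNormal(double mu, double sigma, int maxVal, Random rand) {
        double x = maxVal; // sentinel so the loop body runs at least once
        while (x >= maxVal) {
            double z = rand.nextGaussian();
            x = Math.floor(Math.exp(mu + sigma * z)); // round before the check
        }
        return (int) x; // strictly less than maxVal by construction
    }

    public static void main(String[] args) {
        Random rand = new Random(42);
        for (int i = 0; i < 1000; i++) {
            if (sampleLogNormal(1.0, 1.0, 5, rand) >= 5) {
                throw new AssertionError("upper bound violated");
            }
        }
        System.out.println("all samples < maxVal");
    }
}
```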
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2439 from ankurdave/SPARK-3578 and squashes the following commits:
f6655e5 [Ankur Dave] Go back to math.floor
5900c22 [Ankur Dave] Round X in loop condition
6fd5fb1 [Ankur Dave] Run sampleLogNormal bounds check 1000 times
1638598 [Ankur Dave] Round down in sampleLogNormal to guarantee upper bound
VertexRDD.apply had a bug where it ignored the merge function for
duplicate vertices and instead used whichever vertex attribute occurred
first. This commit fixes the bug by passing the merge function through
to ShippableVertexPartition.apply, which merges any duplicates using the
merge function and then fills in missing vertices using the specified
default vertex attribute. This commit also adds a unit test for
VertexRDD.apply.
Author: Larry Xiao <xiaodi@sjtu.edu.cn>
Author: Blie Arkansol <xiaodi@sjtu.edu.cn>
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1903 from larryxiao/2062 and squashes the following commits:
625aa9d [Blie Arkansol] Merge pull request #1 from ankurdave/SPARK-2062
476770b [Ankur Dave] ShippableVertexPartition.initFrom: Don't run mergeFunc on default values
614059f [Larry Xiao] doc update: note about the default null value vertices construction
dfdb3c9 [Larry Xiao] minor fix
1c70366 [Larry Xiao] scalastyle check: wrap line, parameter list indent 4 spaces
e4ca697 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
6a35ea8 [Larry Xiao] [TEST] VertexRDD.apply mergeFunc
4fbc29c [Blie Arkansol] undo unnecessary change
efae765 [Larry Xiao] fix mistakes: should be able to call with or without mergeFunc
b2422f9 [Larry Xiao] Merge branch '2062' of github.com:larryxiao/spark into 2062
52dc7f7 [Larry Xiao] pass mergeFunc to VertexPartitionBase, where merge is handled
581e9ee [Larry Xiao] TODO: VertexRDDSuite
20d80a3 [Larry Xiao] [SPARK-2062][GraphX] VertexRDD.apply does not use the mergeFunc
GraphX's current implementation of static (fixed iteration count) PageRank uses the Pregel API. This unnecessarily tracks active vertices, even though in static PageRank all vertices are always active. Active vertex tracking incurs the following costs:
1. A shuffle per iteration to ship the active sets to the edge partitions.
2. A hash table creation per iteration at each partition to index the active sets for lookup.
3. A hash lookup per edge to check whether the source vertex is active.
I reimplemented static PageRank using the lower-level GraphX API instead of the Pregel API. In benchmarks on a 16-node m2.4xlarge cluster, this provided a 23% speedup (from 514 s to 397 s, mean over 3 trials) for 10 iterations of PageRank on a synthetic graph with 10M vertices and 1.27B edges.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2308 from ankurdave/SPARK-3427 and squashes the following commits:
449996a [Ankur Dave] Avoid unnecessary active vertex tracking in static PageRank
9b225ac3072de522b40b46aba6df1f1c231f13ef has been causing GraphX tests
to fail nondeterministically, which is blocking development for others.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2271 from ankurdave/SPARK-3400 and squashes the following commits:
10c2a97 [Ankur Dave] [HOTFIX] [SPARK-3400] Revert 9b225ac "fix GraphX EdgeRDD zipPartitions"
#720
PR #720 made multiple changes to GraphGenerator.logNormalGraph including:
* Replacing the calls to the functions for generating random vertices and edges with in-line implementations using different equations. Based on reading the Pregel paper, I believe the in-line functions are incorrect.
* Hard-coding the RNG seeds, so the method now generates the same graph for a given number of vertices, edges, mu, and sigma -- the user is not able to override the seed or specify that it should be randomly generated.
* A backwards-incompatible change to the logNormalGraph signature with the introduction of a new required parameter.
* Failing to update the Scala docs and programming guide for the API changes.
* Adding a synthetic benchmark in the examples.
This PR:
* Removes the in-line calls and calls original vertex / edge generation functions again
* Adds an optional seed parameter for deterministic behavior (when desired)
* Keeps the number of partitions parameter that was added.
* Keeps compatibility with the synthetic benchmark example
* Maintains backwards-compatible API
Author: RJ Nowling <rnowling@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2168 from rnowling/graphgenrand and squashes the following commits:
f1cd79f [Ankur Dave] Style fixes
e11918e [RJ Nowling] Fix bad comparisons in unit tests
785ac70 [RJ Nowling] Fix style error
c70868d [RJ Nowling] Fix logNormalGraph scala doc for seed
41fd1f8 [RJ Nowling] Fix logNormalGraph scala doc for seed
799f002 [RJ Nowling] Added test for different seeds for sampleLogNormal
43949ad [RJ Nowling] Added test for different seeds for generateRandomEdges
2faf75f [RJ Nowling] Added unit test for logNormalGraph
82f22397 [RJ Nowling] Add unit test for sampleLogNormal
b99cba9 [RJ Nowling] Make sampleLogNormal private to Spark (vs private) for unit testing
6803da1 [RJ Nowling] Add GraphGeneratorsSuite with test for generateRandomEdges
1c8fc44 [RJ Nowling] Connected components part of SynthBenchmark was failing to call count on RDD before printing
dfbb6dd [RJ Nowling] Fix parameter name in SynthBenchmark docs
b5eeb80 [RJ Nowling] Add optional seed parameter to SynthBenchmark and set default to randomly generate a seed
1ff8d30 [RJ Nowling] Fix bug in generateRandomEdges where numVertices instead of numEdges was used to control number of edges to generate
98bb73c [RJ Nowling] Add documentation for logNormalGraph parameters
d40141a [RJ Nowling] Fix style error
684804d [RJ Nowling] revert PR #720 which introduce errors in logNormalGraph and messed up seeding of RNGs. Add user-defined optional seed for deterministic behavior
c183136 [RJ Nowling] Fix to deterministic GraphGenerators.logNormalGraph that allows generating graphs randomly using optional seed.
015010c [RJ Nowling] Fixed GraphGenerator logNormalGraph API to make backward-incompatible change in commit 894ecde04
If the user sets `spark.default.parallelism` and the value differs from the EdgeRDD partition number, GraphX jobs will throw:
java.lang.IllegalArgumentException: Can't zip RDDs with unequal numbers of partitions
Author: luluorta <luluorta@gmail.com>
Closes #1763 from luluorta/fix-graph-zip and squashes the following commits:
8338961 [luluorta] fix GraphX EdgeRDD zipPartitions
Minor fix.
Details are here: https://issues.apache.org/jira/browse/SPARK-2981
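Based on the JIRA, the overflow class here is truncating a 64-bit hash to `int` before taking the modulo, which can yield a negative partition index; keeping the arithmetic in 64 bits until after the modulo avoids it. A Java sketch (the mixing-prime constant matches GraphX's `EdgePartition1D`, but the method names are illustrative):

```java
public class Partition1D {
    static final long MIXING_PRIME = 1125899906842597L; // GraphX's mixing prime

    // Buggy shape: truncate the 64-bit product to int *before* the modulo;
    // the truncated value can be negative, giving a negative partition index.
    public static int buggyPartition(long vertexId, int numParts) {
        return (int) (Math.abs(vertexId) * MIXING_PRIME) % numParts;
    }

    // Fixed shape: keep the arithmetic in 64 bits until after the modulo.
    public static int fixedPartition(long vertexId, int numParts) {
        return (int) (Math.abs(vertexId * MIXING_PRIME) % numParts);
    }

    public static void main(String[] args) {
        System.out.println(buggyPartition(1L, 100)); // negative: invalid index
        System.out.println(fixedPartition(1L, 100)); // valid index in [0, 100)
    }
}
```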
Author: Larry Xiao <xiaodi@sjtu.edu.cn>
Closes #1902 from larryxiao/2981 and squashes the following commits:
88059a2 [Larry Xiao] [SPARK-2981][GraphX] EdgePartition1D Int overflow
manually just as VertexRDD does.
Author: uncleGen <hustyugm@gmail.com>
Closes #2033 from uncleGen/master_origin and squashes the following commits:
801994b [uncleGen] Update EdgeRDD.scala
to support ~/spark/bin/run-example GraphXAnalytics triangles /soc-LiveJournal1.txt --numEPart=256
Author: Larry Xiao <xiaodi@sjtu.edu.cn>
Closes #1766 from larryxiao/1986 and squashes the following commits:
bb77cd9 [Larry Xiao] [SPARK-1986][GraphX]move lib.Analytics to org.apache.spark.examples
VertexRDDs with more than 4 billion elements are counted incorrectly due to integer overflow when summing partition sizes. This PR fixes the issue by converting partition sizes to Longs before summing them.
The following code previously returned -10000000. After applying this PR, it returns the correct answer of 5000000000 (5 billion).
```scala
val pairs = sc.parallelize(0L until 500L).map(_ * 10000000)
.flatMap(start => start until (start + 10000000)).map(x => (x, x))
VertexRDD(pairs).count()
```
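The same overflow can be reproduced without Spark: summing per-partition sizes in a 32-bit accumulator wraps past 2^31, while widening each size to a 64-bit `long` before summing (as the fix does) stays correct. A self-contained Java sketch, not Spark's actual code:

```java
import java.util.Arrays;

public class CountOverflow {
    // Buggy variant: accumulates in 32 bits and wraps past Integer.MAX_VALUE.
    public static int countBuggy(int[] partitionSizes) {
        int total = 0;
        for (int s : partitionSizes) total += s;
        return total;
    }

    // Fixed variant: widen each size to long before summing.
    public static long countCorrect(int[] partitionSizes) {
        long total = 0L;
        for (int s : partitionSizes) total += s;
        return total;
    }

    public static void main(String[] args) {
        int[] sizes = new int[50];
        Arrays.fill(sizes, 100_000_000); // 5 billion elements in total
        System.out.println(countBuggy(sizes));   // wrong: wrapped around
        System.out.println(countCorrect(sizes)); // 5000000000
    }
}
```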
Author: Ankur Dave <ankurdave@gmail.com>
Closes #2106 from ankurdave/SPARK-3190 and squashes the following commits:
641f468 [Ankur Dave] Avoid overflow in VertexRDD.count()
This adds a new ShuffleManager based on sorting, as described in https://issues.apache.org/jira/browse/SPARK-2045. The bulk of the code is in an ExternalSorter class that is similar to ExternalAppendOnlyMap, but sorts key-value pairs by partition ID and can be used to create a single sorted file with a map task's output. (Longer-term I think this can take on the remaining functionality in ExternalAppendOnlyMap and replace it so we don't have code duplication.)
The main TODOs still left are:
- [x] enabling ExternalSorter to merge across spilled files
- [x] with an Ordering
- [x] without an Ordering, using the keys' hash codes
- [x] adding more tests (e.g. a version of our shuffle suite that runs on this)
- [x] rebasing on top of the size-tracking refactoring in #1165 when that is merged
- [x] disabling spilling if spark.shuffle.spill is set to false
Despite this, though, it seems to work pretty well (running successfully in cases where the hash shuffle would OOM, such as 1000 reduce tasks on executors with only 1G memory), and it seems to be comparable in speed or faster than hash-based shuffle (it will create far fewer files for the OS to keep track of). So I'm posting it to get some early feedback.
After these TODOs are done, I'd also like to enable ExternalSorter to sort data within each partition by a key as well, which will allow us to use it to implement external spilling in reduce tasks in `sortByKey`.
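The core idea of sorting by partition ID can be sketched without Spark: hash each key to a partition, then sort records by that ID so one sequential pass can emit a single file with contiguous per-partition sections. The helper names below are invented for the example, and the spill/merge machinery of ExternalSorter is elided:

```java
import java.util.Arrays;
import java.util.Comparator;

public class SortByPartition {
    // Hash a key to a non-negative partition ID.
    public static int partitionOf(Object key, int numPartitions) {
        return Math.floorMod(key.hashCode(), numPartitions);
    }

    // Sort records by partition ID so each partition's records are contiguous
    // and a single output file can be written in one sequential pass.
    public static String[] sortByPartition(String[] keys, int numPartitions) {
        String[] copy = keys.clone();
        Arrays.sort(copy, Comparator.comparingInt((String k) -> partitionOf(k, numPartitions)));
        return copy;
    }

    public static void main(String[] args) {
        String[] sorted = sortByPartition(new String[]{"apple", "pear", "plum", "fig", "kiwi"}, 3);
        for (String k : sorted) {
            System.out.println(partitionOf(k, 3) + "\t" + k);
        }
    }
}
```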
Author: Matei Zaharia <matei@databricks.com>
Closes #1499 from mateiz/sort-based-shuffle and squashes the following commits:
bd841f9 [Matei Zaharia] Various review comments
d1c137fd [Matei Zaharia] Various review comments
a611159 [Matei Zaharia] Compile fixes due to rebase
62c56c8 [Matei Zaharia] Fix ShuffledRDD sometimes not returning Tuple2s.
f617432 [Matei Zaharia] Fix a failing test (seems to be due to change in SizeTracker logic)
9464d5f [Matei Zaharia] Simplify code and fix conflicts after latest rebase
0174149 [Matei Zaharia] Add cleanup behavior and cleanup tests for sort-based shuffle
eb4ee0d [Matei Zaharia] Remove customizable element type in ShuffledRDD
fa2e8db [Matei Zaharia] Allow nextBatchStream to be called after we're done looking at all streams
a34b352 [Matei Zaharia] Fix tracking of indices within a partition in SpillReader, and add test
03e1006 [Matei Zaharia] Add a SortShuffleSuite that runs ShuffleSuite with sort-based shuffle
3c7ff1f [Matei Zaharia] Obey the spark.shuffle.spill setting in ExternalSorter
ad65fbd [Matei Zaharia] Rebase on top of Aaron's Sorter change, and use Sorter in our buffer
44d2a93 [Matei Zaharia] Use estimateSize instead of atGrowThreshold to test collection sizes
5686f71 [Matei Zaharia] Optimize merging phase for in-memory only data:
5461cbb [Matei Zaharia] Review comments and more tests (e.g. tests with 1 element per partition)
e9ad356 [Matei Zaharia] Update ContextCleanerSuite to make sure shuffle cleanup tests use hash shuffle (since they were written for it)
c72362a [Matei Zaharia] Added bug fix and test for when iterators are empty
de1fb40 [Matei Zaharia] Make trait SizeTrackingCollection private[spark]
4988d16 [Matei Zaharia] tweak
c1b7572 [Matei Zaharia] Small optimization
ba7db7f [Matei Zaharia] Handle null keys in hash-based comparator, and add tests for collisions
ef4e397 [Matei Zaharia] Support for partial aggregation even without an Ordering
4b7a5ce [Matei Zaharia] More tests, and ability to sort data if a total ordering is given
e1f84be [Matei Zaharia] Fix disk block manager test
5a40a1c [Matei Zaharia] More tests
614f1b4 [Matei Zaharia] Add spill metrics to map tasks
cc52caf [Matei Zaharia] Add more error handling and tests for error cases
bbf359d [Matei Zaharia] More work
3a56341 [Matei Zaharia] More partial work towards sort-based shuffle
7a0895d [Matei Zaharia] Some more partial work towards sort-based shuffle
b615476 [Matei Zaharia] Scaffolding for sort-based shuffle
In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However the correct answer is very near `p`. This is why `Math.log1p` exists.
Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.
While the errors occur only for very small arguments, such arguments are entirely possible given these functions' use in machine learning algorithms.
Also note the related PR for Python: https://github.com/apache/spark/pull/1652
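The precision loss is easy to demonstrate: for p below the double epsilon, `1.0 + p == 1.0` exactly, so `log(1.0 + p)` is 0.0, while `log1p(p)` returns approximately p (and similarly `expm1` versus `exp(m) - 1`). In Java:

```java
public class Precision {
    public static void main(String[] args) {
        double p = 1e-18; // far below the double epsilon (~2.2e-16)
        System.out.println(Math.log(1.0 + p)); // 0.0 — the addend is lost
        System.out.println(Math.log1p(p));     // ~1.0e-18, the correct answer
        System.out.println(Math.exp(p) - 1.0); // 0.0 — same loss for exp(m) - 1
        System.out.println(Math.expm1(p));     // ~1.0e-18
    }
}
```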
Author: Sean Owen <srowen@gmail.com>
Closes #1659 from srowen/SPARK-2748 and squashes the following commits:
c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
RoutingTableMessage was used to construct routing tables to enable
joining VertexRDDs with partitioned edges. It stored three elements: the
destination vertex ID, the source edge partition, and a byte specifying
the position in which the edge partition referenced the vertex to enable
join elimination.
However, this was incompatible with sort-based shuffle (SPARK-2045). It
was also slightly wasteful, because partition IDs are usually much
smaller than 2^32, though this was mitigated by a custom serializer that
used variable-length encoding.
This commit replaces RoutingTableMessage with a pair of (VertexId, Int)
where the Int encodes both the source partition ID (in the lower 30
bits) and the position (in the top 2 bits).
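A sketch of that packing in Java (method names are illustrative, not GraphX's actual API): the position goes in the top 2 bits and the partition ID in the lower 30, so both fit in one 32-bit int.

```java
public class RoutingEncoding {
    private static final int PARTITION_MASK = (1 << 30) - 1; // lower 30 bits

    public static int encode(int partitionId, int position) {
        // position occupies the top 2 bits, partition ID the lower 30
        return (position << 30) | (partitionId & PARTITION_MASK);
    }

    public static int partitionId(int encoded) {
        return encoded & PARTITION_MASK;
    }

    public static int position(int encoded) {
        return encoded >>> 30; // unsigned shift: top 2 bits only
    }

    public static void main(String[] args) {
        int packed = encode(123_456_789, 2);
        System.out.println(partitionId(packed)); // 123456789
        System.out.println(position(packed));    // 2
    }
}
```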
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1553 from ankurdave/remove-RoutingTableMessage and squashes the following commits:
697e17b [Ankur Dave] Replace RoutingTableMessage with pair
MessageToPartition was used in `Graph#partitionBy`. Unlike a Tuple2, it marked the key as transient to avoid sending it over the network. However, it was incompatible with sort-based shuffle (SPARK-2045) and represented only a minor optimization: for partitionBy, it improved performance by 6.3% (30.4 s to 28.5 s) and reduced communication by 5.6% (114.2 MB to 107.8 MB).
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1537 from ankurdave/remove-MessageToPartition and squashes the following commits:
f9d0054 [Ankur Dave] Remove MessageToPartition
ab71364 [Ankur Dave] Remove unused VertexBroadcastMsg
fix examples
Author: CrazyJvm <crazyjvm@gmail.com>
Closes #1523 from CrazyJvm/graphx-example and squashes the following commits:
663457a [CrazyJvm] outDegrees does not take parameters
7cfff1d [CrazyJvm] fix example for joinVertices
VertexPartition and ShippableVertexPartition are contained in RDDs but are not marked Serializable, leading to NotSerializableExceptions when using Java serialization.
The fix is simply to mark them as Serializable. This PR does that and adds a test for serializing them using Java and Kryo serialization.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #1376 from ankurdave/SPARK-2455 and squashes the following commits:
ed4a51b [Ankur Dave] Make (Shippable)VertexPartition serializable
1fd42c5 [Ankur Dave] Add failing tests for Java serialization
Author: CrazyJvm <crazyjvm@gmail.com>
Closes #1368 from CrazyJvm/graph-comment-1 and squashes the following commits:
d47f3c5 [CrazyJvm] fix style
e190d6f [CrazyJvm] fix Graph partitionStrategy comment
This PR is a sub-task of SPARK-2044, moving the execution of aggregation into the shuffle implementations.
I leave `CoGroupedRDD` and `SubtractedRDD` unchanged because they have their own implementations of aggregation, and I'm not sure whether it is suitable to change these two RDDs.
Also, I do not move the sort-related code of `OrderedRDDFunctions` into shuffle; this will be addressed in another sub-task.
Author: jerryshao <saisai.shao@intel.com>
Closes #1064 from jerryshao/SPARK-2124 and squashes the following commits:
4a05a40 [jerryshao] Modify according to comments
1f7dcc8 [jerryshao] Style changes
50a2fd6 [jerryshao] Fix test suite issue after moving aggregator to Shuffle reader and writer
1a96190 [jerryshao] Code modification related to the ShuffledRDD
308f635 [jerryshao] initial works of move combiner to ShuffleManager's reader and writer
In GraphImpl, mapVertices and outerJoinVertices use a more efficient implementation when the map function conserves vertex attribute types. This is implemented by comparing the ClassTags of the old and new vertex attribute types. However, ClassTags store erased types, so the comparison will return a false positive for types with different type parameters, such as Option[Int] and Option[Double].
This PR resolves the problem by requesting that the compiler generate evidence of equality between the old and new vertex attribute types, and providing a default value for the evidence parameter if the two types are not equal. The methods can then check the value of the evidence parameter to see whether the types are equal.
It also adds a test called "mapVertices changing type with same erased type" that failed before the PR and succeeds now.
Callers of mapVertices and outerJoinVertices can no longer use a wildcard for a graph's VD type. To avoid "Error occurred in an application involving default arguments," they must bind VD to a type parameter, as this PR does for ShortestPaths and LabelPropagation.
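The evidence-with-default trick described above can be shown in a small standalone sketch. The method name and the printed strings are illustrative, but the implicit `=:=` parameter with a `null` default is the actual mechanism: when `VD` and `VD2` are statically the same type, the compiler supplies the evidence; otherwise the default is used and the slow path is taken.

```scala
// If VD and VD2 are the same (unerased) type, the compiler generates `eq`;
// for different types, even ones with the same erasure such as Option[Int]
// and Option[Double], the default null is used instead.
def mapVerticesSketch[VD, VD2](f: VD => VD2)(implicit eq: VD =:= VD2 = null): String =
  if (eq != null) "fast path: attribute type conserved"
  else "slow path: rebuild vertex partitions"
```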
Author: Ankur Dave <ankurdave@gmail.com>
Closes #967 from ankurdave/SPARK-1552 and squashes the following commits:
68a4fff [Ankur Dave] Undo conserve naming
7388705 [Ankur Dave] Remove unnecessary ClassTag for VD parameters
a704e5f [Ankur Dave] Use type equality constraint with default argument
29a5ab7 [Ankur Dave] Add failing test
f458c83 [Ankur Dave] Revert "[SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices"
16d6af8 [Ankur Dave] [SPARK-1552] Fix type comparison bug in mapVertices and outerJoinVertices
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Due to a bug introduced by apache/spark#497, Pregel does not unpersist replicated vertices from previous iterations. As a result, they stay cached until memory is full, wasting GC time.
This PR corrects the problem by unpersisting both the edges and the replicated vertices of previous iterations. This is safe because the edges and replicated vertices of the current iteration are cached by the call to `g.cache()` and then materialized by the call to `messages.count()`. Therefore no unmaterialized RDDs depend on `prevG.edges`. I verified that no recomputation occurs by running PageRank with a custom patch to Spark that warns when a partition is recomputed.
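The shape of the fixed iteration loop can be sketched as follows. This is simplified from `Pregel.apply`, not the exact code: only the unpersist placement matters, and the last line is the one this PR adds.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._

// Simplified Pregel loop showing where the fix lands.
def pregelLoop[VD: ClassTag, ED: ClassTag, A: ClassTag](
    graph: Graph[VD, ED], maxIter: Int,
    vprog: (VertexId, VD, A) => VD,
    sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
    mergeMsg: (A, A) => A): Graph[VD, ED] = {
  var g = graph.cache()
  var messages = g.mapReduceTriplets(sendMsg, mergeMsg)
  var active = messages.count()
  var i = 0
  while (active > 0 && i < maxIter) {
    val prevG = g
    g = g.joinVertices(messages)(vprog).cache()
    messages = g.mapReduceTriplets(sendMsg, mergeMsg).cache()
    active = messages.count()                  // materializes the new iteration
    prevG.unpersistVertices(blocking = false)  // previously done
    prevG.edges.unpersist(blocking = false)    // the fix: release old edges too
    i += 1
  }
  g
}
```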
Thanks to Tim Weninger for reporting this bug.
Author: Ankur Dave <ankurdave@gmail.com>
Closes #972 from ankurdave/SPARK-2025 and squashes the following commits:
13d5b07 [Ankur Dave] Unpersist edges of previous graph in Pregel
|
|
|
|
|
|
|
|
| |
Author: Ankur Dave <ankurdave@gmail.com>
Closes #970 from ankurdave/SPARK-1991_docfix and squashes the following commits:
6d07343 [Ankur Dave] Minor: Fix documentation error from apache/spark#946
|
|
|
|
|
|
|
|
|
|
| |
It is currently very difficult to repartition a graph over a different number of partitions. This PR adds an additional `partitionBy` function that takes the number of partitions.
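Usage of the new overload might look like the following sketch; the partition strategy and count are illustrative choices, not prescribed by the PR.

```scala
import org.apache.spark.graphx._

// Repartition a graph over a different number of partitions in one call,
// using the new partitionBy overload that takes numPartitions.
def repartition[VD, ED](graph: Graph[VD, ED], n: Int): Graph[VD, ED] =
  graph.partitionBy(PartitionStrategy.EdgePartition2D, n)
```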
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Closes #719 from jegonzal/graph_partitioning_options and squashes the following commits:
730b405 [Joseph E. Gonzalez] adding an additional number of partitions option to partitionBy
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed.
The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels.
In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods.
I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed.
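The workflow described above might look like this sketch, assuming an existing `SparkContext` `sc`; the edge-list path and partition count are placeholders.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.storage.StorageLevel

// Build a graph whose vertices and edges spill to disk when memory is tight,
// then repartition into many small partitions so each fits in memory.
def loadLargeGraph(sc: SparkContext, path: String): Graph[Int, Int] =
  GraphLoader.edgeListFile(sc, path,
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
    .partitionBy(PartitionStrategy.EdgePartition2D, numPartitions = 1024)
```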
Author: Ankur Dave <ankurdave@gmail.com>
Closes #946 from ankurdave/SPARK-1991 and squashes the following commits:
ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString
ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores
c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks"
34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks
6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR accomplishes two things:
1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
2. This PR improves the implementation of the log-normal graph generator.
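The core of a log-normal degree generator can be sketched in plain Scala: out-degrees are drawn as `exp(mu + sigma * Z)` with `Z` standard normal. The function below is illustrative; `GraphGenerators.logNormalGraph` in `graphx.util` is the real entry point.

```scala
import scala.util.Random

// Sample n out-degrees from a log-normal distribution with the given
// location (mu) and scale (sigma) parameters, seeded for reproducibility.
def sampleLogNormalDegrees(n: Int, mu: Double, sigma: Double, seed: Long): Array[Int] = {
  val rnd = new Random(seed)
  Array.fill(n)(math.exp(mu + sigma * rnd.nextGaussian()).round.toInt)
}
```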
Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
Author: Ankur Dave <ankurdave@gmail.com>
Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:
e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
bccccad [Ankur Dave] Fix long lines
374678a [Ankur Dave] Bugfix and style changes
1bdf39a [Joseph E. Gonzalez] updating options
d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Stop resetting `spark.driver.port` in unit tests (Scala, Java, and Python).
Author: Syed Hashmi <shashmi@cloudera.com>
Author: CodingCat <zhunansjtu@gmail.com>
Closes #943 from syedhashmi/master and squashes the following commits:
885f210 [Syed Hashmi] Removing unnecessary file (created by mergetool)
b8bd4b5 [Syed Hashmi] Merge remote-tracking branch 'upstream/master'
b895e59 [Syed Hashmi] Revert "[SPARK-1784] Add a new partitioner"
57b6587 [Syed Hashmi] Revert "[SPARK-1784] Add a balanced partitioner"
1574769 [Syed Hashmi] [SPARK-1942] Stop clearing spark.driver.port in unit tests
4354836 [Syed Hashmi] Revert "SPARK-1686: keep schedule() calling in the main thread"
fd36542 [Syed Hashmi] [SPARK-1784] Add a balanced partitioner
6668015 [CodingCat] SPARK-1686: keep schedule() calling in the main thread
4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a modified version of apache/spark#10.
Author: Ankur Dave <ankurdave@gmail.com>
Author: Andres Perez <andres@tresata.com>
Closes #933 from ankurdave/shortestpaths and squashes the following commits:
03a103c [Ankur Dave] Style fixes
7a1ff48 [Ankur Dave] Improve ShortestPaths documentation
d75c8fc [Ankur Dave] Remove unnecessary VD type param, and pass through ED
d983fb4 [Ankur Dave] Fix style errors
60ed8e6 [Andres Perez] Add Shortest-path computations to graphx.lib with unit tests.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A straightforward implementation of LPA algorithm for detecting graph communities using the Pregel framework. Amongst the growing literature on community detection algorithms in networks, LPA is perhaps the most elementary, and despite its flaws it remains a nice and simple approach.
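Usage might look like the following sketch, assuming an existing `Graph[VD, ED]`; the result assigns each vertex the `VertexId` label of the community it settles into after the given number of Pregel supersteps.

```scala
import scala.reflect.ClassTag
import org.apache.spark.graphx._
import org.apache.spark.graphx.lib.LabelPropagation

// Run label propagation for a fixed number of supersteps; LPA is not
// guaranteed to converge, so maxSteps bounds the iteration.
def communities[VD, ED: ClassTag](graph: Graph[VD, ED], maxSteps: Int): Graph[VertexId, ED] =
  LabelPropagation.run(graph, maxSteps)
```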
Author: Ankur Dave <ankurdave@gmail.com>
Author: haroldsultan <haroldsultan@gmail.com>
Author: Harold Sultan <haroldsultan@gmail.com>
Closes #905 from haroldsultan/master and squashes the following commits:
327aee0 [haroldsultan] Merge pull request #2 from ankurdave/label-propagation
227a4d0 [Ankur Dave] Untabify
0ac574c [haroldsultan] Merge pull request #1 from ankurdave/label-propagation
0e24303 [Ankur Dave] Add LabelPropagationSuite
84aa061 [Ankur Dave] LabelPropagation: Fix compile errors and style; rename from LPA
9830342 [Harold Sultan] initial version of LPA
|