aboutsummaryrefslogtreecommitdiff
path: root/graphx/src
Commit message (Collapse)AuthorAgeFilesLines
...
* SPARK-1501: Ensure assertions in Graph.apply are asserted.William Benton2014-04-151-1/+1
| | | | | | | | | | | | | | | | | | The Graph.apply test in GraphSuite had some assertions in a closure in a graph transformation. As a consequence, these assertions never actually executed. Furthermore, these closures had a reference to (non-serializable) test harness classes because they called assert(), which could be a problem if we proactively check closure serializability in the future. This commit simply changes the Graph.apply test to collect the graph triplets so it can assert about each triplet from a map method. Author: William Benton <willb@redhat.com> Closes #415 from willb/graphsuite-nop-fix and squashes the following commits: 0b63658 [William Benton] Ensure assertions in Graph.apply are asserted.
* SPARK-1488. Resolve scalac feature warnings during buildSean Owen2014-04-144-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For your consideration: scalac currently notes a number of feature warnings during compilation: ``` [warn] there were 65 feature warning(s); re-run with -feature for details ``` Warnings are like: ``` [warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled [warn] by making the implicit value scala.language.implicitConversions visible. [warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions' [warn] or by setting the compiler option -language:implicitConversions. [warn] See the Scala docs for value scala.language.implicitConversions for a discussion [warn] why the feature should be explicitly enabled. [warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) = [warn] ^ ``` scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build. Author: Sean Owen <sowen@cloudera.com> Closes #404 from srowen/SPARK-1488 and squashes the following commits: 8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features. 39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings
* Remove Unnecessary Whitespace'sSandeep2014-04-102-2/+2
| | | | | | | | | | stack these together in a commit else they show up chunk by chunk in different commits. Author: Sandeep <sandeep@techaddict.me> Closes #380 from techaddict/white_space and squashes the following commits: b58f294 [Sandeep] Remove Unnecessary Whitespace's
* Revert "SPARK-729: Closures not always serialized at capture time"Patrick Wendell2014-04-101-1/+1
| | | | This reverts commit 8ca3b2bc90a63b23a03f339e390174cd7a672b40.
* SPARK-729: Closures not always serialized at capture timeWilliam Benton2014-04-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured. Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created). There are a few possible approaches to solving this problem and this PR will discuss some of them. The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing. ## What I did The basic approach I took depends on the work I did for #143, and so this PR is based atop that. Specifically: #143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for an job involving that closure to be scheduled). Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than triggered where the closure is used. Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in #143 is creating a serialized closure as part of `ClosureCleaner.clean`. `clean` currently modifies its argument, but the method in `SparkContext` that wraps it to return a value (a reference to the modified-in-place argument). This branch modifies `ClosureCleaner.clean` so that it returns a value: if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument; therefore it is returning a closure with an environment captured at cleaning time. `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument. I've added tests for this behavior (777a1bc). The pull request as it stands, given the changes in #143, is nearly trivial. There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important. I think this is a fine way to solve this problem, but it's not perfect. ## What we might want to do The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure. If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later. By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches. Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like: ```scala package org.apache.spark.util import java.nio.ByteBuffer import scala.reflect.ClassManifest sealed abstract class ClosureBox[T] { def func: T } final case class RawClosure[T](func: T) extends ClosureBox[T] {} final case class CleanedClosure[T](func: T) extends ClosureBox[T] {} final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {} object ClosureBoxImplicits { implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun) } ``` With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`. A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer). Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time). Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate. Tracking this sort of state in the type system seems like the right thing to do to me. ### Why we might not want to do that _It's pretty invasive._ Every function type used by every `RDD` subclass would have to change to reflect that they expected a `ClosureBox[T=>U]` instead of a `T=>U`. This obscures what's going on and is not a little ugly. Although I really like the idea of using the type system to enforce the clean-or-serialize once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions). _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._ It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future. Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal. Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once. ## Other possibilities I felt like the serialize-and-deserialize approach was best due to its obvious simplicity. But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`. It might also be possible for `clean` to modify its argument in a way so that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness. I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh the advantage, as well as of approaches to this issue and to closure handling in general. Author: William Benton <willb@redhat.com> Closes #189 from willb/spark-729 and squashes the following commits: f4cafa0 [William Benton] Stylistic changes and cleanups b3d9c86 [William Benton] Fixed style issues in tests 9b56ce0 [William Benton] Added array-element capture test 97e9d91 [William Benton] Split closure-serializability failure tests 12ef6e3 [William Benton] Skip proactive closure capture for runJob 8ee3ee7 [William Benton] Predictable closure environment capture 12c63a7 [William Benton] Added tests for variable capture in closures d6e8dd6 [William Benton] Don't check serializability of DStream transforms. 4ecf841 [William Benton] Make proactive serializability checking optional. d8df3db [William Benton] Adds proactive closure-serializablilty checking 21b4b06 [William Benton] Test cases for SPARK-897. d5947b3 [William Benton] Ensure assertions in Graph.apply are asserted.
* SPARK-1093: Annotate developer and experimental API'sPatrick Wendell2014-04-091-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch marks some existing classes as private[spark] and adds two types of API annotations: - `EXPERIMENTAL API` = experimental user-facing module - `DEVELOPER API - UNSTABLE` = developer-facing API that might change There is some discussion of the different mechanisms for doing this here: https://issues.apache.org/jira/browse/SPARK-1081 I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility. A few notes here: - In the past we've been inconsistent with the visiblity of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly creat them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is users can't subclass them. - Noted that compression and serialization formats don't have to be wire compatible across versions. - Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly. - Metrics sources are made private - user only interacts with them through Spark's reflection Author: Patrick Wendell <pwendell@gmail.com> Author: Andrew Or <andrewor14@gmail.com> Closes #274 from pwendell/private-apis and squashes the following commits: 44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis 042c803 [Patrick Wendell] spark.annotations -> spark.annotation bfe7b52 [Patrick Wendell] Adding experimental for approximate counts 8d0c873 [Patrick Wendell] Warning in SparkEnv 99b223a [Patrick Wendell] Cleaning up annotations e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations 982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi 0d48908 [Andrew Or] Comments and new lines (minor) f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug 99192ef [Andrew Or] Dynamically add badges based on annotations 824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs 037755c [Patrick Wendell] Some changes after working with andrew or f7d124f [Patrick Wendell] Small fixes c318b24 [Patrick Wendell] Use CSS styles e4c76b9 [Patrick Wendell] Logging f390b13 [Patrick Wendell] Better visibility for workaround constructors d6b0afd [Patrick Wendell] Small chang to existing constructor 403ba52 [Patrick Wendell] Style fix 870a7ba [Patrick Wendell] Work around for SI-8479 7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD 4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL c581dce [Patrick Wendell] Changes after building against Shark. 8452309 [Patrick Wendell] Style fixes 1ed27d2 [Patrick Wendell] Formatting and coloring of badges cd7a465 [Patrick Wendell] Code review feedback 2f706f1 [Patrick Wendell] Don't use floats 542a736 [Patrick Wendell] Small fixes cf23ec6 [Patrick Wendell] Marking GraphX as alpha d86818e [Patrick Wendell] Another naming change 5a76ed6 [Patrick Wendell] More visiblity clean-up 42c1f09 [Patrick Wendell] Using better labels 9d48cbf [Patrick Wendell] Initial pass
* Do not re-use objects in the EdgePartition/EdgeTriplet iterators.Daniel Darabos2014-04-024-10/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This avoids a silent data corruption issue (https://spark-project.atlassian.net/browse/SPARK-1188) and has no performance impact by my measurements. It also simplifies the code. As far as I can tell the object re-use was nothing but premature optimization. I did actual benchmarks for all the included changes, and there is no performance difference. I am not sure where to put the benchmarks. Does Spark not have a benchmark suite? This is an example benchmark I did: test("benchmark") { val builder = new EdgePartitionBuilder[Int] for (i <- (1 to 10000000)) { builder.add(i.toLong, i.toLong, i) } val p = builder.toEdgePartition p.map(_.attr + 1).iterator.toList } It ran for 10 seconds both before and after this change. Author: Daniel Darabos <darabos.daniel@gmail.com> Closes #276 from darabos/spark-1188 and squashes the following commits: 574302b [Daniel Darabos] Restore "manual" copying in EdgePartition.map(Iterator). Add comment to discourage novices like myself from trying to simplify the code. 4117a64 [Daniel Darabos] Revert EdgePartitionSuite. 4955697 [Daniel Darabos] Create a copy of the Edge objects in EdgeRDD.compute(). This avoids exposing the object re-use, while still enables the more efficient behavior for internal code. 4ec77f8 [Daniel Darabos] Add comments about object re-use to the affected functions. 2da5e87 [Daniel Darabos] Restore object re-use in EdgePartition. 0182f2b [Daniel Darabos] Do not re-use objects in the EdgePartition/EdgeTriplet iterators. This avoids a silent data corruption issue (SPARK-1188) and has no performance impact in my measurements. It also simplifies the code. c55f52f [Daniel Darabos] Tests that reproduce the problems from SPARK-1188.
* SPARK-1352 - Comment style single space before ending */ check.Prashant Sharma2014-03-301-1/+1
| | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #261 from ScrapCodes/comment-style-check2 and squashes the following commits: 6cde61e [Prashant Sharma] comment style space before ending */ check.
* SPARK-1096, a space after comment start style checker.Prashant Sharma2014-03-286-9/+7
| | | | | | | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #124 from ScrapCodes/SPARK-1096/scalastyle-comment-check and squashes the following commits: 214135a [Prashant Sharma] Review feedback. 5eba88c [Prashant Sharma] Fixed style checks for ///+ comments. e54b2f8 [Prashant Sharma] improved message, work around. 83e7144 [Prashant Sharma] removed dependency on scalastyle in plugin, since scalastyle sbt plugin already depends on the right version. Incase we update the plugin we will have to adjust our spark-style project to depend on right scalastyle version. 810a1d6 [Prashant Sharma] SPARK-1096, a space after comment style checker. ba33193 [Prashant Sharma] scala style as a project
* Spark 1095 : Adding explicit return types to all public methodsNirmalReddy2014-03-263-3/+4
| | | | | | | | | | | | | | Excluded those that are self-evident and the cases that are discussed in the mailing list. Author: NirmalReddy <nirmal_reddy2000@yahoo.com> Author: NirmalReddy <nirmal.reddy@imaginea.com> Closes #168 from NirmalReddy/Spark-1095 and squashes the following commits: ac54b29 [NirmalReddy] import misplaced 8c5ff3e [NirmalReddy] Changed syntax of unit returning methods 02d0778 [NirmalReddy] fixed explicit types in all the other packages 1c17773 [NirmalReddy] fixed explicit types in core package
* SPARK-1255: Allow user to pass Serializer object instead of class name for ↵Reynold Xin2014-03-164-32/+26
| | | | | | | | | | | | | | | | | | shuffle. This is more general than simply passing a string name and leaves more room for performance optimizations. Note that this is technically an API breaking change in the following two ways: 1. The shuffle serializer specification in ShuffleDependency now require an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark. 2. Serializer's in Spark from now on are required to be serializable. Author: Reynold Xin <rxin@apache.org> Closes #149 from rxin/serializer and squashes the following commits: 5acaccd [Reynold Xin] Properly call serializer's constructors. 2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency. 7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.
* SPARK-782 Clean up for ASM dependency.Patrick Wendell2014-03-091-2/+2
| | | | | | | | | | | | | | | | This makes two changes. 1) Spark uses the shaded version of asm that is (conveniently) published with Kryo. 2) Existing exclude rules around asm are updated to reflect the new groupId of `org.ow2.asm`. This made all of the old rules not work with newer Hadoop versions that pull in new asm versions. Author: Patrick Wendell <pwendell@gmail.com> Closes #100 from pwendell/asm and squashes the following commits: 9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.
* Graph primitives2Semih Salihoglu2014-02-242-10/+183
| | | | | | | | | | | | | | | | | | Hi guys, I'm following Joey and Ankur's suggestions to add collectEdges and pickRandomVertex. I'm also adding the tests for collectEdges and refactoring one method getCycleGraph in GraphOpsSuite.scala. Thank you, semih Author: Semih Salihoglu <semihsalihoglu@gmail.com> Closes #580 from semihsalihoglu/GraphPrimitives2 and squashes the following commits: 937d3ec [Semih Salihoglu] - Fixed the scalastyle errors. a69a152 [Semih Salihoglu] - Adding collectEdges and pickRandomVertices. - Adding tests for collectEdges. - Refactoring a getCycle utility function for GraphOpsSuite.scala. 41265a6 [Semih Salihoglu] - Adding collectEdges and pickRandomVertex. - Adding tests for collectEdges. - Recycling a getCycle utility test file.
* Merge pull request #567 from ScrapCodes/style2.Prashant Sharma2014-02-091-2/+1
| | | | | | | | | | | | | | | | SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Pt 2 Continuation of PR #557 With this all scala style errors are fixed across the code base !! The reason for creating a separate PR was to not interrupt an already reviewed and ready to merge PR. Hope this gets reviewed soon and merged too. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #567 and squashes the following commits: 3b1ec30 [Prashant Sharma] scala style fixes
* Merge pull request #557 from ScrapCodes/style. Closes #557.Patrick Wendell2014-02-096-44/+52
| | | | | | | | | | | | | | | | | | | | | SPARK-1058, Fix Style Errors and Add Scala Style to Spark Build. Author: Patrick Wendell <pwendell@gmail.com> Author: Prashant Sharma <scrapcodes@gmail.com> == Merge branch commits == commit 1a8bd1c059b842cb95cc246aaea74a79fec684f4 Author: Prashant Sharma <scrapcodes@gmail.com> Date: Sun Feb 9 17:39:07 2014 +0530 scala style fixes commit f91709887a8e0b608c5c2b282db19b8a44d53a43 Author: Patrick Wendell <pwendell@gmail.com> Date: Fri Jan 24 11:22:53 2014 -0800 Adding scalastyle snapshot
* Replace commons-math with jblasJianping J Wang2014-01-231-32/+36
|
* Depend on Commons Math explicitly instead of accidentally getting it from ↵Sean Owen2014-01-221-1/+1
| | | | Hadoop (which stops working in 2.2.x) and also use the newer commons-math3
* Merge pull request #436 from ankurdave/VertexId-caseReynold Xin2014-01-1432-209/+209
|\ | | | | | | Rename VertexID -> VertexId in GraphX
| * VertexID -> VertexIdAnkur Dave2014-01-1432-209/+209
| |
* | Fixed SVDPlusPlusSuite in Maven build.Reynold Xin2014-01-142-7/+19
|/
* Add missing header filesPatrick Wendell2014-01-1443-0/+731
|
* Adding minimal additional functionality to EdgeRDDJoseph E. Gonzalez2014-01-131-0/+17
|
* Fix bug in GraphLoader.edgeListFile that caused srcId > dstIdAnkur Dave2014-01-131-1/+1
|
* Edge object must be public for Edge case classAnkur Dave2014-01-131-2/+2
|
* Improve scaladoc linksAnkur Dave2014-01-132-6/+6
|
* Fix infinite loop in GraphGenerators.generateRandomEdgesAnkur Dave2014-01-131-8/+1
| | | | | The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph.
* Make Graph{,Impl,Ops} serializable to work around captureAnkur Dave2014-01-133-3/+3
|
* Remove Graph.statistics and GraphImpl.printLineageAnkur Dave2014-01-133-77/+1
|
* Updated doc for PageRank.Reynold Xin2014-01-131-47/+39
|
* More cleanup.Reynold Xin2014-01-134-9/+10
|
* Moved SVDPlusPlusConf into SVDPlusPlus object itself.Reynold Xin2014-01-132-15/+17
|
* Moved PartitionStrategy's into an object.Reynold Xin2014-01-134-81/+85
|
* Updated GraphGenerator.Reynold Xin2014-01-131-30/+30
|
* Made more things private.Reynold Xin2014-01-1310-12/+26
|
* Merge branch 'graphx' of github.com:ankurdave/incubator-spark into graphxReynold Xin2014-01-1312-137/+70
|\ | | | | | | | | Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
| * Improvements in example code for the programming guide as well as adding ↵Joseph E. Gonzalez2014-01-131-0/+3
| | | | | | | | serialization support for GraphImpl to address issues with failed closure capture.
| * Add EdgeDirection.Either and use it to fix CC bugAnkur Dave2014-01-1312-54/+64
| | | | | | | | | | | | | | | | | | The bug was due to a misunderstanding of the activeSetOpt parameter to Graph.mapReduceTriplets. Passing EdgeDirection.Both causes mapReduceTriplets to run only on edges with *both* vertices in the active set. This commit adds EdgeDirection.Either, which causes mapReduceTriplets to run on edges with *either* vertex in the active set. This is what connected components needed.
| * Remove aggregateNeighborsAnkur Dave2014-01-132-85/+5
| |
* | Miscel doc update.Reynold Xin2014-01-1317-143/+158
|/
* Merge pull request #2 from jegonzal/GraphXCCIssueAnkur Dave2014-01-133-16/+62
|\ | | | | Improving documentation and identifying potential bug in CC calculation.
| * Improving documentation and identifying potential bug in CC calculation.Joseph E. Gonzalez2014-01-133-16/+62
| |
* | Improve EdgeRDD scaladocAnkur Dave2014-01-131-2/+11
| |
* | Further improve VertexRDD scaladocsAnkur Dave2014-01-131-14/+25
|/
* Move algorithms to GraphOpsAnkur Dave2014-01-124-78/+51
|
* Add TriangleCount exampleAnkur Dave2014-01-121-3/+2
|
* adding Pregel as an operator in GraphOps and cleaning up documentation of ↵Joseph E. Gonzalez2014-01-122-22/+74
| | | | GraphOps
* Add PageRank example and dataAnkur Dave2014-01-121-1/+1
|
* Link methods in programming guide; document VertexIDAnkur Dave2014-01-121-0/+4
|
* Make EdgeDirection val instead of case object for Java compat.Ankur Dave2014-01-113-5/+15
|
* Use SparkConf in GraphX tests (via LocalSparkContext)Ankur Dave2014-01-111-5/+5
|