path: root/core
Commit message | Author | Age | Files | Lines
* Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."Patrick Wendell2014-03-182-41/+38
| | | | | | | | | | | | This reverts commit ca4bf8c572c2f70b484830f1db414b5073744ab6. Jetty 9 requires JDK7 which is probably not a dependency we want to bump right now. Before Spark 1.0 we should consider upgrading to Jetty 8. However, in the mean time to ease some pain let's revert this. Sorry for not catching this during the initial review. cc/ @rxin Author: Patrick Wendell <pwendell@gmail.com> Closes #167 from pwendell/jetty-revert and squashes the following commits: 811b1c5 [Patrick Wendell] Revert "SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225."
* Spark 1246 add min max to stat counter | Dan McClary | 2014-03-18 | 5 | -2/+52

  Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.

  Author: Dan McClary <dan.mcclary@gmail.com>

  Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:
  fd3fd4b [Dan McClary] fixed error, updated test
  82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
  5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
  21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
  1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
  a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
  ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
  1e7056d [Dan McClary] added underscore to getBucket
  37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
  29981f2 [Dan McClary] fixed indentation on doctest comment
  eaf89d9 [Dan McClary] added correct doctest for histogram
  4916016 [Dan McClary] added histogram method, added max and min to statscounter
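  A quick sketch of the new surface in Scala, assuming local mode and the RDD implicits of this era; stats() returns the StatCounter the commit extends (output values are illustrative):

      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.SparkContext._  // RDD[Double] implicits in this era

      val sc = new SparkContext(new SparkConf().setMaster("local").setAppName("stats-demo"))
      val rdd = sc.parallelize(Seq(1.0, 5.0, 3.0))
      val stats = rdd.stats()   // StatCounter: count, mean, stdev, and now max/min
      println(stats.max)        // 5.0
      println(stats.min)        // 1.0
      sc.stop()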
* SPARK-1244: Throw exception if map output status exceeds frame size | Patrick Wendell | 2014-03-17 | 6 | -20/+84

  This is a very small change on top of @andrewor14's patch in #147.

  Author: Patrick Wendell <pwendell@gmail.com>
  Author: Andrew Or <andrewor14@gmail.com>

  Closes #152 from pwendell/akka-frame and squashes the following commits:
  e5fb3ff [Patrick Wendell] Reversing test order
  393af4c [Patrick Wendell] Small improvement suggested by Andrew Or
  8045103 [Patrick Wendell] Breaking out into two tests
  2b4e085 [Patrick Wendell] Consolidate Executor use of akka frame size
  c9b6109 [Andrew Or] Simplify test + make access to akka frame size more modular
  281d7c9 [Andrew Or] Throw exception on spark.akka.frameSize exceeded + Unit tests
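  The limit involved is spark.akka.frameSize; a hedged sketch of raising it (the value is illustrative, and in this era the setting was interpreted in megabytes):

      import org.apache.spark.SparkConf

      // With this patch an oversized map output status fails loudly instead of
      // being silently dropped; raising the frame size is the workaround.
      val conf = new SparkConf()
        .setAppName("frame-size-demo")
        .set("spark.akka.frameSize", "64")  // illustrative value, in MB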
* SPARK-1240: handle the case of empty RDD when takeSample | CodingCat | 2014-03-16 | 2 | -1/+13

  https://spark-project.atlassian.net/browse/SPARK-1240

  It seems that the current implementation does not handle the empty-RDD case when running takeSample. In this patch, before calling sample() inside the takeSample API, I add a check for this case and return an empty Array when the RDD is empty; also in sample(), I add a check for invalid fraction values. In the test case, I also add several lines for this case.

  Author: CodingCat <zhunansjtu@gmail.com>

  Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:
  fef57d4 [CodingCat] fix the same problem in PySpark
  36db06b [CodingCat] create new test cases for takeSample from an empty red
  810948d [CodingCat] further fix
  a40e8fb [CodingCat] replace if with require
  ad483fd [CodingCat] handle the case with empty RDD when take sample
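  A minimal sketch of the guaranteed behavior, assuming a SparkContext `sc` (the seed argument follows the takeSample signature of this era):

      val empty = sc.parallelize(Seq.empty[Int])
      // Before the patch this could fail inside sample(); now it returns Array()
      val sampled = empty.takeSample(false, 5, 42)
      assert(sampled.isEmpty)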
* SPARK-1255: Allow user to pass Serializer object instead of class name for shuffle. | Reynold Xin | 2014-03-16 | 14 | -139/+99

  This is more general than simply passing a string name and leaves more room for performance optimizations.

  Note that this is technically an API-breaking change in the following two ways:
  1. The shuffle serializer specification in ShuffleDependency now requires an object instead of a String (of the class name), but I suspect nobody else in this world has used this API other than me in GraphX and Shark.
  2. Serializers in Spark are from now on required to be serializable.

  Author: Reynold Xin <rxin@apache.org>

  Closes #149 from rxin/serializer and squashes the following commits:
  5acaccd [Reynold Xin] Properly call serializer's constructors.
  2a8d75a [Reynold Xin] Added more documentation for the serializer option in ShuffleDependency.
  7420185 [Reynold Xin] Allow user to pass Serializer object instead of class name for shuffle.
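  A hypothetical sketch of why an instance is more flexible than a class-name string: the caller can configure the serializer before handing it over. The ShuffleDependency constructor shape shown in the comment is assumed, not quoted from the patch:

      import org.apache.spark.SparkConf
      import org.apache.spark.serializer.KryoSerializer

      val conf = new SparkConf().set("spark.kryoserializer.buffer.mb", "8")
      val shuffleSerializer = new KryoSerializer(conf)  // pre-configured instance
      // e.g. new ShuffleDependency(pairs, partitioner, shuffleSerializer)  // assumed shape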
* Fix serialization of MutablePair. Also provide an interface for easy updating. | Michael Armbrust | 2014-03-14 | 1 | -1/+11

  Author: Michael Armbrust <michael@databricks.com>

  Closes #141 from marmbrus/mutablePair and squashes the following commits:
  f5c4783 [Michael Armbrust] Change function name to update
  8bfd973 [Michael Armbrust] Fix serialization of MutablePair. Also provide an interface for easy updating.
* SPARK-1236 - Upgrade Jetty to 9.1.3.v20140225. | Reynold Xin | 2014-03-13 | 2 | -38/+41

  Author: Reynold Xin <rxin@apache.org>

  Closes #113 from rxin/jetty9 and squashes the following commits:
  867a2ce [Reynold Xin] Updated Jetty version to 9.1.3.v20140225 in Maven build file.
  d7c97ca [Reynold Xin] Return the correctly bound port.
  d14706f [Reynold Xin] Upgrade Jetty to 9.1.3.v20140225.
* SPARK-1019: pyspark RDD take() throws an NPE | Patrick Wendell | 2014-03-12 | 2 | -1/+10

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #112 from pwendell/pyspark-take and squashes the following commits:
  daae80e [Patrick Wendell] SPARK-1019: pyspark RDD take() throws an NPE
* hot fix for PR105 - change to Java annotation | CodingCat | 2014-03-12 | 1 | -1/+2

  Author: CodingCat <zhunansjtu@gmail.com>

  Closes #133 from CodingCat/SPARK-1160-2 and squashes the following commits:
  6607155 [CodingCat] hot fix for PR105 - change to Java annotation
* SPARK-1160: Deprecate toArray in RDD | CodingCat | 2014-03-12 | 4 | -3/+5

  https://spark-project.atlassian.net/browse/SPARK-1160

  Reported by @mateiz: "It's redundant with collect() and the name doesn't make sense in Java, where we return a List (we can't return an array due to the way Java generics work). It's also missing in Python."

  In this patch, I deprecated the method and changed the source files that use it by replacing toArray with collect() directly.

  Author: CodingCat <zhunansjtu@gmail.com>

  Closes #105 from CodingCat/SPARK-1060 and squashes the following commits:
  286f163 [CodingCat] deprecate in JavaRDDLike
  ee17b4e [CodingCat] add message and since
  2ff7319 [CodingCat] deprecate toArray in RDD
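  The migration is mechanical; assuming a SparkContext `sc`:

      val doubled = sc.parallelize(1 to 3).map(_ * 2)
      // collect() returns the RDD's elements as a local array,
      // and is preferred over the deprecated doubled.toArray()
      val asArray: Array[Int] = doubled.collect()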
* Fix #SPARK-1149 Bad partitioners can cause Spark to hang | liguoqiang | 2014-03-12 | 2 | -0/+22

  Author: liguoqiang <liguoqiang@rd.tuan800.com>

  Closes #44 from witgo/SPARK-1149 and squashes the following commits:
  3dcdcaf [liguoqiang] Merge branch 'master' into SPARK-1149
  8425395 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
  3dad595 [liguoqiang] review comment
  e3e56aa [liguoqiang] Merge branch 'master' into SPARK-1149
  b0d5c07 [liguoqiang] review comment
  d0a6005 [liguoqiang] review comment
  3395ee7 [liguoqiang] Merge remote-tracking branch 'upstream/master' into SPARK-1149
  ac006a3 [liguoqiang] code Formatting
  3feb3a8 [liguoqiang] Merge branch 'master' into SPARK-1149
  adc443e [liguoqiang] partitions check bugfix
  928e1e3 [liguoqiang] Added a unit test for PairRDDFunctions.lookup with bad partitioner
  db6ecc5 [liguoqiang] Merge branch 'master' into SPARK-1149
  1e3331e [liguoqiang] Merge branch 'master' into SPARK-1149
  3348619 [liguoqiang] Optimize performance for partitions check
  61e5a87 [liguoqiang] Merge branch 'master' into SPARK-1149
  e68210a [liguoqiang] add partition index check to submitJob
  3a65903 [liguoqiang] make the code more readable
  6bb725e [liguoqiang] fix #SPARK-1149 Bad partitioners can cause Spark to hang
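  An illustrative "bad" partitioner of the kind the new check guards against (hypothetical example, not taken from the patch):

      import org.apache.spark.Partitioner

      class OffByFivePartitioner extends Partitioner {
        override def numPartitions: Int = 2
        // Out of range on purpose: indices must fall in [0, numPartitions).
        // Previously this could hang the job; it should now be rejected up front.
        override def getPartition(key: Any): Int = 5
      }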
* [SPARK-1232] Fix the hadoop 0.23 yarn build | Thomas Graves | 2014-03-12 | 1 | -0/+12

  Author: Thomas Graves <tgraves@apache.org>

  Closes #127 from tgravescs/SPARK-1232 and squashes the following commits:
  c05cfd4 [Thomas Graves] Fix the hadoop 0.23 yarn build
* SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues... | Patrick Wendell | 2014-03-11 | 2 | -88/+0

  This patch removes Ganglia integration from the default build. It allows users willing to link against LGPL code to use Ganglia by adding build flags or linking against a new Spark artifact called spark-ganglia-lgpl.

  This brings Spark in line with the Apache policy on LGPL code enumerated here: https://www.apache.org/legal/3party.html#options-optional

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #108 from pwendell/ganglia and squashes the following commits:
  326712a [Patrick Wendell] Responding to review feedback
  5f28ee4 [Patrick Wendell] SPARK-1167: Remove metrics-ganglia from default build due to LGPL issues.
* SPARK-1205: Clean up callSite/origin/generator. | Patrick Wendell | 2014-03-10 | 8 | -38/+16

  This patch removes the `generator` field and simplifies + documents the tracking of callsites.

  There are two places where we care about call sites: when a job is run and when an RDD is created. This patch retains both of those features but does a slight refactoring and renaming to make things less confusing.

  There was another feature of an RDD called the `generator`, which was by default the user class in which the RDD was created. This is used exclusively in the JobLogger. It has been subsumed by the ability to name a job group. The job logger can later be refactored to read the job group directly (this will require some work), but for now this just preserves the default logged value of the user class. I'm not sure any users ever used the ability to override this.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #106 from pwendell/callsite and squashes the following commits:
  fc1d009 [Patrick Wendell] Compile fix
  e17fb76 [Patrick Wendell] Review feedback: callSite -> creationSite
  62e77ef [Patrick Wendell] Review feedback
  576e60b [Patrick Wendell] SPARK-1205: Clean up callSite/origin/generator.
* SPARK-782 Clean up for ASM dependency. | Patrick Wendell | 2014-03-09 | 2 | -6/+2

  This makes two changes.

  1) Spark uses the shaded version of asm that is (conveniently) published with Kryo.
  2) Existing exclude rules around asm are updated to reflect the new groupId of `org.ow2.asm`; the groupId change had made all of the old rules ineffective with newer Hadoop versions that pull in new asm versions.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #100 from pwendell/asm and squashes the following commits:
  9235f3f [Patrick Wendell] SPARK-782 Clean up for ASM dependency.
* Add timeout for fetch file | Jiacheng Guo | 2014-03-09 | 1 | -0/+4

  Currently, when fetching a file, the connection's connect timeout and read timeout are based on the default JVM settings; this change makes them use spark.worker.timeout instead. This can be useful when the connection status between workers is not perfect, and prevents task sets from being removed prematurely.

  Author: Jiacheng Guo <guojc03@gmail.com>

  Closes #98 from guojc/master and squashes the following commits:
  abfe698 [Jiacheng Guo] add space according request
  2a37c34 [Jiacheng Guo] Add timeout for fetch file
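  A sketch of opting into the new behavior; the key is the existing spark.worker.timeout setting, in seconds (the value shown is illustrative):

      import org.apache.spark.SparkConf

      // Fetch connections now use this timeout instead of the JVM defaults.
      val conf = new SparkConf().set("spark.worker.timeout", "60")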
* SPARK-929: Fully deprecate usage of SPARK_MEM | Aaron Davidson | 2014-03-09 | 2 | -11/+11

  (Continued from old repo; prior discussion at https://github.com/apache/incubator-spark/pull/615)

  This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables: SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

  The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.

  SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.

  SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory for jobs launched by spark-class, without possibly affecting executor memory.

  Other memory considerations:
  - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
  - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class).

  This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #99 from aarondav/sparkmem and squashes the following commits:
  9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
* SPARK-1190: Do not initialize log4j if slf4j log4j backend is not being used | Patrick Wendell | 2014-03-08 | 1 | -2/+5

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #107 from pwendell/logging and squashes the following commits:
  be21c11 [Patrick Wendell] Logging fix
* [SPARK-1194] Fix the same-RDD rule for cache replacement | Cheng Lian | 2014-03-07 | 2 | -6/+19

  SPARK-1194: https://spark-project.atlassian.net/browse/SPARK-1194

  In the current implementation, when selecting candidate blocks to be swapped out, once we find a block from the same RDD that the block to be stored belongs to, cache eviction fails and aborts.

  In this PR, we keep selecting blocks *not* from the RDD that the block to be stored belongs to until either enough free space can be ensured (cache eviction succeeds) or all such blocks are checked (cache eviction fails).

  Author: Cheng Lian <lian.cs.zju@gmail.com>

  Closes #96 from liancheng/fix-spark-1194 and squashes the following commits:
  2524ab9 [Cheng Lian] Added regression test case for SPARK-1194
  6e40c22 [Cheng Lian] Remove redundant comments
  40cdcb2 [Cheng Lian] Bug fix, and addressed PR comments from @mridulm
  62c92ac [Cheng Lian] Fixed SPARK-1194 https://spark-project.atlassian.net/browse/SPARK-1194
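  A minimal sketch of the fixed selection rule, with all names hypothetical: keep scanning candidates and merely skip blocks from the incoming block's RDD, instead of aborting on the first one:

      // candidates: (blockId, size) pairs; sameRdd: does a block belong to the
      // incoming block's RDD; needed: bytes of free space required.
      def selectBlocksToEvict(candidates: Seq[(String, Long)],
                              sameRdd: String => Boolean,
                              needed: Long): Option[Seq[String]] = {
        var freed = 0L
        val chosen = scala.collection.mutable.ArrayBuffer.empty[String]
        val it = candidates.iterator
        while (freed < needed && it.hasNext) {
          val (id, size) = it.next()
          if (!sameRdd(id)) {   // the old code gave up here; the fix just skips
            chosen += id
            freed += size
          }
        }
        if (freed >= needed) Some(chosen.toSeq) else None  // None = eviction fails
      }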
* SPARK-1193. Fix indentation in pom.xmls | Sandy Ryza | 2014-03-07 | 1 | -267/+253

  Author: Sandy Ryza <sandy@cloudera.com>

  Closes #91 from sryza/sandy-spark-1193 and squashes the following commits:
  a878124 [Sandy Ryza] SPARK-1193. Fix indentation in pom.xmls
* Spark 1165 rdd.intersection in python and java | Prashant Sharma | 2014-03-07 | 4 | -0/+58

  Author: Prashant Sharma <prashant.s@imaginea.com>
  Author: Prashant Sharma <scrapcodes@gmail.com>

  Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:
  9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
  1fea813 [Prashant Sharma] correct the lines wrapping
  d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
  d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
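  Usage sketch, assuming a SparkContext `sc`; per the note in the squashed commits, intersection requires a shuffle, so it is not a cheap operation:

      val a = sc.parallelize(Seq(1, 2, 3, 4))
      val b = sc.parallelize(Seq(3, 4, 5))
      val common = a.intersection(b)  // distinct elements present in both RDDs
      println(common.collect().sorted.mkString(","))  // 3,4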
* SPARK-1195: set map_input_file environment variable in PipedRDD | Thomas Graves | 2014-03-07 | 3 | -53/+158

  Hadoop uses the config mapreduce.map.input.file to indicate the input filename to the map when the input split is of type FileSplit. Some of the Hadoop input and output formats set or use this config, and it can also be used by user code. PipedRDD runs an external process, and these configs aren't available to that process. Hadoop Streaming does something very similar, and the way it makes configs available is by exporting them into the environment with '.' replaced by '_'. Spark should also export this variable when launching the pipe command so the user code has access to that config.

  Note that the config mapreduce.map.input.file is the new one; the old one, which is deprecated but not yet removed, is map.input.file, so we should handle both.

  Perhaps it would be better to abstract this out somehow so it goes into the HadoopPartition code?

  Author: Thomas Graves <tgraves@apache.org>

  Closes #94 from tgravescs/map_input_file and squashes the following commits:
  cc97a6a [Thomas Graves] Update test to check for existence of command, add a getPipeEnvVars function to HadoopRDD
  e3401dc [Thomas Graves] Merge remote-tracking branch 'upstream/master' into map_input_file
  2ba805e [Thomas Graves] set map_input_file environment variable in PipedRDD
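  A sketch of how a piped process could consume the exported variable (the script name is hypothetical); following the Hadoop Streaming convention described above, '.' becomes '_' in the environment:

      // Assuming a SparkContext `sc`; the piped script can read
      // $map_input_file or $mapreduce_map_input_file from its environment.
      val lines = sc.textFile("hdfs:///data/*.txt")
      val tagged = lines.pipe("./tag-by-file.sh")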
* SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1 | Aaron Davidson | 2014-03-07 | 3 | -9/+56

  This patch allows the FaultToleranceTest to work in newer versions of Docker. See https://spark-project.atlassian.net/browse/SPARK-1136 for more details.

  Besides changing the Docker and FaultToleranceTest internals, this patch also changes the behavior of Master to accept new Workers which share an address with a Worker that we are currently trying to recover. This can only happen when the Worker itself was restarted and got the same IP address/port at the same time as a Master recovery occurs.

  Finally, this adds a good bit of ASCII art to the test to make failures, successes, and actions more apparent. This is very much needed.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #5 from aarondav/zookeeper and squashes the following commits:
  5d7a72a [Aaron Davidson] SPARK-1136: Fix FaultToleranceTest for Docker 0.8.1
* Small clean-up to flatmap tests | Patrick Wendell | 2014-03-06 | 1 | -8/+3
* SPARK-1197. Change yarn-standalone to yarn-cluster and fix up running on YARN docs | Sandy Ryza | 2014-03-06 | 2 | -4/+14

  This patch changes "yarn-standalone" to "yarn-cluster" (but still supports the former). It also cleans up the Running on YARN docs and adds a section on how to view logs.

  Author: Sandy Ryza <sandy@cloudera.com>

  Closes #95 from sryza/sandy-spark-1197 and squashes the following commits:
  563ef3a [Sandy Ryza] Review feedback
  6ad06d4 [Sandy Ryza] Change yarn-standalone to yarn-cluster and fix up running on YARN docs
* SPARK-1189: Add Security to Spark - Akka, Http, ConnectionManager, UI use servlets | Thomas Graves | 2014-03-06 | 57 | -241/+2043

  Resubmitted pull request; was https://github.com/apache/incubator-spark/pull/332.

  Author: Thomas Graves <tgraves@apache.org>

  Closes #33 from tgravescs/security-branch-0.9-with-client-rebase and squashes the following commits:
  dfe3918 [Thomas Graves] Fix merge conflict since startUserClass now using runAsUser
  05eebed [Thomas Graves] Fix dependency lost in upmerge
  d1040ec [Thomas Graves] Fix up various imports
  05ff5e0 [Thomas Graves] Fix up imports after upmerging to master
  ac046b3 [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase
  13733e1 [Thomas Graves] Pass securityManager and SparkConf around where we can. Switch to use sparkConf for reading config whereever possible. Added ConnectionManagerSuite unit tests.
  4a57acc [Thomas Graves] Change UI createHandler routines to createServlet since they now return servlets
  2f77147 [Thomas Graves] Rework from comments
  50dd9f2 [Thomas Graves] fix header in SecurityManager
  ecbfb65 [Thomas Graves] Fix spacing and formatting
  b514bec [Thomas Graves] Fix reference to config
  ed3d1c1 [Thomas Graves] Add security.md
  6f7ddf3 [Thomas Graves] Convert SaslClient and SaslServer to scala, change spark.authenticate.ui to spark.ui.acls.enable, and fix up various other things from review comments
  2d9e23e [Thomas Graves] Merge remote-tracking branch 'upstream/master' into security-branch-0.9-with-client-rebase_rework
  5721c5a [Thomas Graves] update AkkaUtilsSuite test for the actorSelection changes, fix typos based on comments, and remove extra lines I missed in rebase from AkkaUtils
  f351763 [Thomas Graves] Add Security to Spark - Akka, Http, ConnectionManager, UI to use servlets
* SPARK-942: Do not materialize partitions when DISK_ONLY storage level is used | Kyle Ellrott | 2014-03-06 | 7 | -53/+215

  This is a port of a pull request originally targeted at incubator-spark: https://github.com/apache/incubator-spark/pull/180

  Essentially, if a user returns a generative iterator (from a flatMap operation), when trying to persist the data Spark would first unroll the iterator into an ArrayBuffer and then try to figure out if it could store the data. In cases where the user provided an iterator that generated more data than available memory, this would cause a crash. With this patch, if the user requests a persist with StorageLevel.DISK_ONLY, the iterator will be unrolled as it is fed into the serializer.

  To do this, two changes were made:
  1) The type of the 'values' argument in the putValues method of the BlockStore interface was changed from ArrayBuffer to Iterator (and all code interfacing with this method was modified accordingly).
  2) The JavaSerializer now calls the ObjectOutputStream 'reset' method every 1000 objects. This was done because the ObjectOutputStream caches objects (thus preventing them from being GC'd) to write more compact serialization. If reset is never called, eventually the memory fills up; if it is called too often, the serialization streams become much larger because of redundant class descriptions.

  Author: Kyle Ellrott <kellrott@gmail.com>

  Closes #50 from kellrott/iterator-to-disk and squashes the following commits:
  9ef7cb8 [Kyle Ellrott] Fixing formatting issues.
  60e0c57 [Kyle Ellrott] Fixing issues (formatting, variable names, etc.) from review comments
  8aa31cd [Kyle Ellrott] Merge ../incubator-spark into iterator-to-disk
  33ac390 [Kyle Ellrott] Merge branch 'iterator-to-disk' of github.com:kellrott/incubator-spark into iterator-to-disk
  2f684ea [Kyle Ellrott] Refactoring the BlockManager to replace the Either[Either[A,B]] usage. Now using trait 'Values'. Also modified BlockStore.putBytes call to return PutResult, so that it behaves like putValues.
  f70d069 [Kyle Ellrott] Adding docs for spark.serializer.objectStreamReset configuration
  7ccc74b [Kyle Ellrott] Moving the 'LargeIteratorSuite' to simply test persistance of iterators. It doesn't try to invoke a OOM error any more
  16a4cea [Kyle Ellrott] Streamlined the LargeIteratorSuite unit test. It should now run in ~25 seconds. Confirmed that it still crashes an unpatched copy of Spark.
  c2fb430 [Kyle Ellrott] Removing more un-needed array-buffer to iterator conversions
  627a8b7 [Kyle Ellrott] Wrapping a few long lines
  0f28ec7 [Kyle Ellrott] Adding second putValues to BlockStore interface that accepts an ArrayBuffer (rather then an Iterator). This will allow BlockStores to have slightly different behaviors dependent on whether they get an Iterator or ArrayBuffer. In the case of the MemoryStore, it needs to duplicate and cache an Iterator into an ArrayBuffer, but if handed a ArrayBuffer, it can skip the duplication.
  656c33e [Kyle Ellrott] Fixing the JavaSerializer to read from the SparkConf rather then the System property.
  8644ee8 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
  00c98e0 [Kyle Ellrott] Making the Java ObjectStreamSerializer reset rate configurable by the system variable 'spark.serializer.objectStreamReset', default is not 10000.
  40fe1d7 [Kyle Ellrott] Removing rouge space
  31fe08e [Kyle Ellrott] Removing un-needed semi-colons
  9df0276 [Kyle Ellrott] Added check to make sure that streamed-to-dist RDD actually returns good data in the LargeIteratorSuite
  a6424ba [Kyle Ellrott] Wrapping long line
  2eeda75 [Kyle Ellrott] Fixing dumb mistake ("||" instead of "&&")
  0e6f808 [Kyle Ellrott] Deleting temp output directory when done
  95c7f67 [Kyle Ellrott] Simplifying StorageLevel checks
  56f71cd [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
  44ec35a [Kyle Ellrott] Adding some comments.
  5eb2b7e [Kyle Ellrott] Changing the JavaSerializer reset to occur every 1000 objects.
  f403826 [Kyle Ellrott] Merge branch 'master' into iterator-to-disk
  81d670c [Kyle Ellrott] Adding unit test for straight to disk iterator methods.
  d32992f [Kyle Ellrott] Merge remote-tracking branch 'origin/master' into iterator-to-disk
  cac1fad [Kyle Ellrott] Fixing MemoryStore, so that it converts incoming iterators to ArrayBuffer objects. This was previously done higher up the stack.
  efe1102 [Kyle Ellrott] Changing CacheManager and BlockManager to pass iterators directly to the serializer when a 'DISK_ONLY' persist is called. This is in response to SPARK-942.
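  A sketch tying the two changes together (the reset-rate value is illustrative; the config key comes from the squashed commits above):

      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel

      val conf = new SparkConf().set("spark.serializer.objectStreamReset", "1000")
      // ...build the SparkContext from `conf`, then a DISK_ONLY persist streams
      // the iterator through the serializer instead of materializing it first:
      // hugeRdd.flatMap(generate).persist(StorageLevel.DISK_ONLY)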
* SPARK-1171: when executor is removed, we should minus totalCores instead of just freeCores on that executor | CodingCat | 2014-03-05 | 2 | -3/+7

  https://spark-project.atlassian.net/browse/SPARK-1171

  When an executor is removed, the current implementation only subtracts the freeCores of that executor. Actually we should subtract the totalCores...

  Author: CodingCat <zhunansjtu@gmail.com>
  Author: Nan Zhu <CodingCat@users.noreply.github.com>

  Closes #63 from CodingCat/simplify_CoarseGrainedSchedulerBackend and squashes the following commits:
  f6bf93f [Nan Zhu] code clean
  19c2bb4 [CodingCat] use copy idiom to reconstruct the workerOffers
  43c13e9 [CodingCat] keep WorkerOffer immutable
  af470d3 [CodingCat] style fix
  0c0e409 [CodingCat] simplify the implementation of CoarseGrainedSchedulerBackend
* SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally | Prashant Sharma | 2014-03-04 | 1 | -0/+1

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #72 from ScrapCodes/SPARK-1164/deprecate-reducebykeytodriver and squashes the following commits:
  ee521cd [Prashant Sharma] SPARK-1164 Deprecated reduceByKeyToDriver as it is an alias for reduceByKeyLocally
* [java8API] SPARK-964 Investigate the potential for using JDK 8 lambda expressions for the Java/Scala APIs | Prashant Sharma | 2014-03-03 | 22 | -354/+241

  Author: Prashant Sharma <prashant.s@imaginea.com>
  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #17 from ScrapCodes/java8-lambdas and squashes the following commits:
  95850e6 [Patrick Wendell] Some doc improvements and build changes to the Java 8 patch.
  85a954e [Prashant Sharma] Nit. import orderings.
  673f7ac [Prashant Sharma] Added support for -java-home as well
  80a13e8 [Prashant Sharma] Used fake class tag syntax
  26eb3f6 [Prashant Sharma] Patrick's comments on PR.
  35d8d79 [Prashant Sharma] Specified java 8 building in the docs
  31d4cd6 [Prashant Sharma] Maven build to support -Pjava8-tests flag.
  4ab87d3 [Prashant Sharma] Review feedback on the pr
  c33dc2c [Prashant Sharma] SPARK-964, Java 8 API Support.
* Remove broken/unused Connection.getChunkFIFO method. | Kay Ousterhout | 2014-03-03 | 1 | -34/+2

  This method appears to be broken -- since it never removes anything from messages, and it adds new messages to it, the while loop is an infinite loop. The method also does not appear to have ever been used since the code was added in 2012, so this commit removes it.

  cc @mateiz who originally added this method in case there's a reason it should be here! (https://github.com/apache/spark/commit/63051dd2bcc4bf09d413ff7cf89a37967edc33ba)

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #69 from kayousterhout/remove_get_fifo and squashes the following commits:
  053bc59 [Kay Ousterhout] Remove broken/unused Connection.getChunkFIFO method.
* Added a unit test for PairRDDFunctions.lookup | Bryn Keller | 2014-03-03 | 1 | -0/+26

  Lookup didn't have a unit test. Added two tests: one with a partitioner and one without.

  Author: Bryn Keller <bryn.keller@intel.com>

  Closes #36 from xoltar/lookup and squashes the following commits:
  3bc0d44 [Bryn Keller] Added a unit test for PairRDDFunctions.lookup
* Remove the remoteFetchTime metric. | Kay Ousterhout | 2014-03-03 | 5 | -14/+0

  This metric is confusing: it adds up all of the time to fetch shuffle inputs, but fetches often happen in parallel, so remoteFetchTime can be much longer than the task execution time.

  @squito it looks like you added this metric -- do you have a use case for it?

  cc @shivaram -- I know you've looked at the shuffle performance a lot so chime in here if this metric has turned out to be useful for you!

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #62 from kayousterhout/remove_fetch_variable and squashes the following commits:
  43341eb [Kay Ousterhout] Remote the remoteFetchTime metric.
* Removed accidentally checked in comment | Kay Ousterhout | 2014-03-03 | 1 | -3/+0

  It looks like this comment was added a while ago by @mridulm as part of a merge and was accidentally checked in. We should remove it.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #61 from kayousterhout/remove_comment and squashes the following commits:
  0b2b3f2 [Kay Ousterhout] Removed accidentally checked in comment
* SPARK-1121: Include avro for yarn-alpha builds | Patrick Wendell | 2014-03-02 | 1 | -0/+14

  This lets us explicitly include Avro based on a profile for 0.23.X builds. It makes me sad how convoluted it is to express this logic in Maven. @tgraves and @sryza curious if this works for you.

  I'm also considering just reverting to how it was before. The only real problem was that Spark advertised a dependency on Avro even though it only really depends transitively on Avro through other deps.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #49 from pwendell/avro-build-fix and squashes the following commits:
  8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
* SPARK-1084.2 (resubmitted) | Sean Owen | 2014-03-02 | 1 | -0/+9

  (Ported from https://github.com/apache/incubator-spark/pull/650.)

  This adds one more change, though, to fix the Scala version warning introduced by json4s recently.

  Author: Sean Owen <sowen@cloudera.com>

  Closes #32 from srowen/SPARK-1084.2 and squashes the following commits:
  9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency
  1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
* SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID | Aaron Davidson | 2014-03-02 | 1 | -5/+13

  Previously, ZooKeeperPersistenceEngine would crash the whole Master process if there was stored data from a prior Spark version. Now, we just delete these files.

  Author: Aaron Davidson <aaron@databricks.com>

  Closes #4 from aarondav/zookeeper2 and squashes the following commits:
  fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
* Remove remaining references to incubation | Patrick Wendell | 2014-03-02 | 1 | -2/+2

  This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier.

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #51 from pwendell/tlp and squashes the following commits:
  d553b1b [Patrick Wendell] Remove remaining references to incubation
* [SPARK-1100] prevent Spark from overwriting directory silently | CodingCat | 2014-03-01 | 2 | -10/+59

  Thanks to Diana Carroll for reporting this issue (https://spark-project.atlassian.net/browse/SPARK-1100).

  The current saveAsTextFile/SequenceFile will overwrite the output directory silently if the directory already exists. This behaviour is not desirable because overwriting data silently is not user-friendly, and if the partition number of two write operations changed, the output directory will contain the results generated by both runs.

  My fix includes:
  - Add some new APIs with a flag for users to define whether they want to overwrite the directory: if the flag is set to true, then the output directory is deleted first and then the new data is written, to prevent the output directory from containing results from multiple runs; if the flag is set to false, Spark will throw an exception if the output directory already exists.
  - Changed the Java API part.
  - The default behaviour is overwriting.

  Two questions: should we deprecate the old APIs without such a flag? I noticed that Spark Streaming also calls these APIs; I thought we don't need to change the related part in streaming? @tdas

  Author: CodingCat <zhunansjtu@gmail.com>

  Closes #11 from CodingCat/SPARK-1100 and squashes the following commits:
  6a4e3a3 [CodingCat] code clean
  ef2d43f [CodingCat] add new test cases and code clean
  ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat
  ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
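  A hypothetical helper showing the semantics described above; the flagged API itself is part of the patch, so this wrapper is only an illustration of the intended behavior:

      import org.apache.hadoop.fs.{FileSystem, Path}
      import org.apache.spark.rdd.RDD

      def saveText(rdd: RDD[String], dir: String, overwrite: Boolean): Unit = {
        val path = new Path(dir)
        val fs = FileSystem.get(rdd.context.hadoopConfiguration)
        if (fs.exists(path)) {
          if (overwrite) fs.delete(path, true)  // wipe stale results first
          else throw new IllegalStateException(s"Output directory $dir already exists")
        }
        rdd.saveAsTextFile(dir)
      }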
* Revert "[SPARK-1150] fix repo location in create script"Patrick Wendell2014-03-011-8/+2
| | | | This reverts commit 9aa095711858ce8670e51488f66a3d7c1a821c30.
* [SPARK-1150] fix repo location in create script | Mark Grover | 2014-03-01 | 1 | -2/+8

  https://spark-project.atlassian.net/browse/SPARK-1150

  Fix the repo location in the create_release script.

  Author: Mark Grover <mark@apache.org>

  Closes #48 from CodingCat/script_fixes and squashes the following commits:
  01f4bf7 [Mark Grover] Fixing some nitpicks
  d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY
* [SPARK-979] Randomize order of offers. | Kay Ousterhout | 2014-03-01 | 4 | -41/+75

  This commit randomizes the order of resource offers to avoid scheduling all tasks on the same small set of machines. This is a much simpler solution to SPARK-979 than #7.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #27 from kayousterhout/randomize and squashes the following commits:
  435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.
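  The idea in miniature (names hypothetical): shuffle the offers before walking them, so tasks don't pile onto the first few hosts:

      import scala.util.Random

      case class Offer(executorId: String, host: String, cores: Int)

      val offers = Seq(Offer("0", "host-a", 4), Offer("1", "host-b", 4), Offer("2", "host-c", 4))
      val shuffledOffers = Random.shuffle(offers)  // randomized assignment order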
* SPARK-1051. On YARN, executors don't doAs submitting user | Sandy Ryza | 2014-02-28 | 1 | -8/+10

  This reopens https://github.com/apache/incubator-spark/pull/538 against the new repo.

  Author: Sandy Ryza <sandy@cloudera.com>

  Closes #29 from sryza/sandy-spark-1051 and squashes the following commits:
  708ce49 [Sandy Ryza] SPARK-1051. doAs submitting user in YARN
* Remove BlockFetchTracker trait | Kay Ousterhout | 2014-02-27 | 2 | -38/+17

  This trait seems to have been created a while ago when there were multiple implementations; now that there's just one, I think it makes sense to merge it into the BlockFetcherIterator trait.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #39 from kayousterhout/remove_tracker and squashes the following commits:
  8173939 [Kay Ousterhout] Remote BlockFetchTracker.
* [HOTFIX] Patching maven build after #6 (SPARK-1121). | Patrick Wendell | 2014-02-27 | 1 | -8/+0

  That patch removed the Maven avro declaration but didn't remove the actual dependency in core. /cc @scrapcodes

  Author: Patrick Wendell <pwendell@gmail.com>

  Closes #37 from pwendell/master and squashes the following commits:
  0ef3008 [Patrick Wendell] [HOTFIX] Patching maven build after #6 (SPARK-1121).
* SPARK 1084.1 (resubmitted) | Sean Owen | 2014-02-27 | 7 | -20/+35

  (Ported from https://github.com/apache/incubator-spark/pull/637.)

  Author: Sean Owen <sowen@cloudera.com>

  Closes #31 from srowen/SPARK-1084.1 and squashes the following commits:
  6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it
  f35b833 [Sean Owen] Fix two misc javadoc problems
  254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit
  5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates
  007762b [Sean Owen] Remove dead scaladoc links
  b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>
* Show Master status on UI page | Raymond Liu | 2014-02-26 | 1 | -0/+1

  For standalone HA mode, a status field is useful to identify the current master; it is already in JSON format too.

  Author: Raymond Liu <raymond.liu@intel.com>

  Closes #24 from colorant/status and squashes the following commits:
  df630b3 [Raymond Liu] Show Master status on UI page
* SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom | Xiangrui Meng | 2014-02-26 | 2 | -3/+16

  If the seed is zero, XORShift generates all zeros, which would produce unexpected results.

  JIRA: https://spark-project.atlassian.net/browse/SPARK-1129

  Author: Xiangrui Meng <meng@databricks.com>

  Closes #645 from mengxr/xor and squashes the following commits:
  1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom
  45c6f16 [Xiangrui Meng] minor style change
  51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom
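  Why a zero seed is fatal for XORShift: the state only ever XORs shifted copies of itself, so a state of 0 stays 0 forever. A minimal demonstration, plus a sketch of the hashing fix named in the squashed commits (not the exact Spark code):

      // Marsaglia's 64-bit xorshift step: 0 maps to 0, so every draw would be 0.
      var x: Long = 0L
      x ^= x << 21; x ^= x >>> 35; x ^= x << 4
      assert(x == 0L)

      // The fix: hash the user seed (e.g. with MurmurHash3) so 0 becomes a
      // usable nonzero state.
      import scala.util.hashing.MurmurHash3
      def hashSeed(seed: Long): Long = {
        val bytes = java.nio.ByteBuffer.allocate(8).putLong(seed).array()
        MurmurHash3.bytesHash(bytes).toLong
      }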
* Remove references to ClusterScheduler (SPARK-1140) | Kay Ousterhout | 2014-02-26 | 8 | -46/+47

  ClusterScheduler was renamed to TaskSchedulerImpl; this commit updates comments and tests accordingly.

  Author: Kay Ousterhout <kayousterhout@gmail.com>

  Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits:
  d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.
* Deprecated and added a few java api methods for corresponding scala api. | Prashant Sharma | 2014-02-26 | 5 | -6/+32

  PR [402](https://github.com/apache/incubator-spark/pull/402) from incubator repo.

  Author: Prashant Sharma <prashant.s@imaginea.com>

  Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits:
  11d0c2b [Prashant Sharma] Integer -> java.lang.Integer
  737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs.
  3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs