Commit message (author, date, files changed, lines deleted/added)
...
* Remove broken/unused Connection.getChunkFIFO method. (Kay Ousterhout, 2014-03-03, 1 file, -34/+2)
| | | | | | | | | | | | | | | | This method appears to be broken -- since it never removes anything from messages, and it adds new messages to it, the while loop is an infinite loop. The method also does not appear to have ever been used since the code was added in 2012, so this commit removes it. cc @mateiz who originally added this method in case there's a reason it should be here! (https://github.com/apache/spark/commit/63051dd2bcc4bf09d413ff7cf89a37967edc33ba) Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #69 from kayousterhout/remove_get_fifo and squashes the following commits: 053bc59 [Kay Ousterhout] Remove broken/unused Connection.getChunkFIFO method.
* SPARK-1158: Fix flaky RateLimitedOutputStreamSuite. (Reynold Xin, 2014-03-03, 2 files, -20/+32)
| | | | | | | | | | | | There was actually a problem with the RateLimitedOutputStream implementation where the first second doesn't write anything because of integer rounding. So RateLimitedOutputStream was overly aggressive in throttling. Author: Reynold Xin <rxin@apache.org> Closes #55 from rxin/ratelimitest and squashes the following commits: 52ce1b7 [Reynold Xin] SPARK-1158: Fix flaky RateLimitedOutputStreamSuite.
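
A minimal sketch of the rounding problem this commit describes (illustrative helper names, not the actual RateLimitedOutputStream code): dividing the elapsed time by 1000 before multiplying by the target rate rounds the first second's byte budget down to zero, so every write is throttled until a full second has passed.

```scala
def bytesAllowedRounded(elapsedMs: Long, bytesPerSec: Long): Long =
  (elapsedMs / 1000) * bytesPerSec      // 999 ms elapsed -> 0 bytes allowed

def bytesAllowedProRated(elapsedMs: Long, bytesPerSec: Long): Long =
  elapsedMs * bytesPerSec / 1000        // 999 ms elapsed -> 99 bytes allowed at 100 B/s

assert(bytesAllowedRounded(999, 100) == 0)    // nothing may be written during the first second
assert(bytesAllowedProRated(999, 100) == 99)  // pro-rating removes the over-throttling
```
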
* Added a unit test for PairRDDFunctions.lookup (Bryn Keller, 2014-03-03, 1 file, -0/+26)
| | | | | | | | | | Lookup didn't have a unit test. Added two tests, one for with a partitioner, and one for without. Author: Bryn Keller <bryn.keller@intel.com> Closes #36 from xoltar/lookup and squashes the following commits: 3bc0d44 [Bryn Keller] Added a unit test for PairRDDFunctions.lookup
* Remove the remoteFetchTime metric. (Kay Ousterhout, 2014-03-03, 5 files, -14/+0)
| | | | | | | | | | | | | | | | This metric is confusing: it adds up all of the time to fetch shuffle inputs, but fetches often happen in parallel, so remoteFetchTime can be much longer than the task execution time. @squito it looks like you added this metric -- do you have a use case for it? cc @shivaram -- I know you've looked at the shuffle performance a lot so chime in here if this metric has turned out to be useful for you! Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #62 from kayousterhout/remove_fetch_variable and squashes the following commits: 43341eb [Kay Ousterhout] Remote the remoteFetchTime metric.
* update proportion of memory (Chen Chao, 2014-03-03, 1 file, -2/+2)
| | | | | | | | | | The default value of "spark.storage.memoryFraction" has been changed from 0.66 to 0.6 . So it should be 60% of the memory to cache while 40% used for task execution. Author: Chen Chao <crazyjvm@gmail.com> Closes #66 from CrazyJvm/master and squashes the following commits: 0f84d86 [Chen Chao] update proportion of memory
* Removed accidentally checked in comment (Kay Ousterhout, 2014-03-03, 1 file, -3/+0)
| | | | | | | | | | It looks like this comment was added a while ago by @mridulm as part of a merge and was accidentally checked in. We should remove it. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #61 from kayousterhout/remove_comment and squashes the following commits: 0b2b3f2 [Kay Ousterhout] Removed accidentally checked in comment
* SPARK-1173. (#2) Fix typo in Java streaming example. (Aaron Kimball, 2014-03-02, 1 file, -1/+1)
| | | | | | | | | | Companion commit to pull request #64, fix the typo on the Java side of the docs. Author: Aaron Kimball <aaron@magnify.io> Closes #65 from kimballa/spark-1173-java-doc-update and squashes the following commits: 8ce11d3 [Aaron Kimball] SPARK-1173. (#2) Fix typo in Java streaming example.
* SPARK-1173. Improve scala streaming docs. (Aaron Kimball, 2014-03-02, 1 file, -5/+33)
| | | | | | | | | | | | | Clarify imports to add implicit conversions to DStream and fix other small typos in the streaming intro documentation. Tested by inspecting output via a local jekyll server, c&p'ing the scala commands into a spark terminal. Author: Aaron Kimball <aaron@magnify.io> Closes #64 from kimballa/spark-1173-streaming-docs and squashes the following commits: 6fbff0e [Aaron Kimball] SPARK-1173. Improve scala streaming docs.
* Add Jekyll tag to isolate "production-only" doc components. (Patrick Wendell, 2014-03-02, 4 files, -6/+33)
| | | | | | | | Author: Patrick Wendell <pwendell@gmail.com> Closes #56 from pwendell/jekyll-prod and squashes the following commits: 1bdc3a8 [Patrick Wendell] Add Jekyll tag to isolate "production-only" doc components.
* SPARK-1121: Include avro for yarn-alpha builds (Patrick Wendell, 2014-03-02, 18 files, -12/+234)
| | | | | | | | | | | | | | | | | This lets us explicitly include Avro based on a profile for 0.23.X builds. It makes me sad how convoluted it is to express this logic in Maven. @tgraves and @sryza curious if this works for you. I'm also considering just reverting to how it was before. The only real problem was that Spark advertised a dependency on Avro even though it only really depends transitively on Avro through other deps. Author: Patrick Wendell <pwendell@gmail.com> Closes #49 from pwendell/avro-build-fix and squashes the following commits: 8d6ee92 [Patrick Wendell] SPARK-1121: Add avro to yarn-alpha profile
* SPARK-1084.2 (resubmitted) (Sean Owen, 2014-03-02, 6 files, -61/+22)
| | | | | | | | | | | | | (Ported from https://github.com/apache/incubator-spark/pull/650 ) This adds one more change though, to fix the scala version warning introduced by json4s recently. Author: Sean Owen <sowen@cloudera.com> Closes #32 from srowen/SPARK-1084.2 and squashes the following commits: 9240abd [Sean Owen] Avoid scala version conflict in scalap induced by json4s dependency 1561cec [Sean Owen] Remove "exclude *" dependencies that are causing Maven warnings, and that are apparently unneeded anyway
* Ignore RateLimitedOutputStreamSuite for now. (Reynold Xin, 2014-03-02, 1 file, -1/+1)
| | | | | | | | | | This test has been flaky. We can re-enable it after @tdas has a chance to look at it. Author: Reynold Xin <rxin@apache.org> Closes #54 from rxin/ratelimit and squashes the following commits: 1a12198 [Reynold Xin] Ignore RateLimitedOutputStreamSuite for now.
* SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID (Aaron Davidson, 2014-03-02, 1 file, -5/+13)
| | | | | | | | | | | Previously, ZooKeeperPersistenceEngine would crash the whole Master process if there was stored data from a prior Spark version. Now, we just delete these files. Author: Aaron Davidson <aaron@databricks.com> Closes #4 from aarondav/zookeeper2 and squashes the following commits: fa8b40f [Aaron Davidson] SPARK-1137: Make ZK PersistenceEngine not crash for wrong serialVersionUID
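
A hedged sketch of the recovery behaviour described above (not the actual ZooKeeperPersistenceEngine code): a serialVersionUID mismatch surfaces as a java.io.InvalidClassException from readObject, which the Master can treat as stale state to discard rather than a fatal error.

```scala
import java.io.{ByteArrayInputStream, IOException, ObjectInputStream}

// Returns None for data written by an incompatible Spark version; the caller
// can then delete the offending entry and continue with a fresh state.
def deserializeOrNone[T](bytes: Array[Byte]): Option[T] =
  try {
    val in = new ObjectInputStream(new ByteArrayInputStream(bytes))
    try Some(in.readObject().asInstanceOf[T]) finally in.close()
  } catch {
    case _: IOException => None   // InvalidClassException is an IOException
  }
```
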
* Remove remaining references to incubation (Patrick Wendell, 2014-03-02, 22 files, -58/+54)
| | | | | | | | | | This removes some loose ends not caught by the other (incubating -> tlp) patches. @markhamstra this updates the version as you mentioned earlier. Author: Patrick Wendell <pwendell@gmail.com> Closes #51 from pwendell/tlp and squashes the following commits: d553b1b [Patrick Wendell] Remove remaining references to incubation
* Update io.netty from 4.0.13.Final to 4.0.17.Final (Binh Nguyen, 2014-03-02, 2 files, -2/+2)
| | | | | | | | | | | | | | | This update contains a lot of bug fixes and some new perf improvements. It is also binary compatible with the current 4.0.13.Final For more information: http://netty.io/news/2014/02/25/4-0-17-Final.html Author: Binh Nguyen <ngbinh@gmail.com> Author: Binh Nguyen <ngbinh@gmail.com> Closes #41 from ngbinh/master and squashes the following commits: a9498f4 [Binh Nguyen] update io.netty to 4.0.17.Final
* Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic. (Michael Armbrust, 2014-03-02, 4 files, -51/+315)
| | | | | | | | | | | | This allows developers to pass options (such as -D) to sbt. I also modified the SparkBuild to ensure spark specific properties are propagated to forked test JVMs. Author: Michael Armbrust <michael@databricks.com> Closes #14 from marmbrus/sbtScripts and squashes the following commits: c008b18 [Michael Armbrust] Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
* Initialized the regVal for first iteration in SGD optimizer (DB Tsai, 2014-03-02, 3 files, -1/+50)
| | | | | | | | | | | | | | | | | | Ported from https://github.com/apache/incubator-spark/pull/633 In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed. It maybe not be important here for SGD since the updater doesn't take the loss as parameter to find the new weights. But it will give us the correct history of loss. However, for LBFGS optimizer we implemented, the correct loss with regVal is crucial to find the new weights. Author: DB Tsai <dbtsai@alpinenow.com> Closes #40 from dbtsai/dbtsai-smallRegValFix and squashes the following commits: 77d47da [DB Tsai] In runMiniBatchSGD, the regVal (for 1st iter) should be initialized as sum of sqrt of weights if it's L2 update; for L1 update, the same logic is followed.
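
A self-contained sketch of the idea (hypothetical helpers, not MLlib's Updater API): compute the regularization term for the initial weights up front so the loss recorded for the first iteration already includes regVal rather than a hard-coded zero.

```scala
// Illustrative regularization terms; the real values come from the configured updater.
def l2RegVal(weights: Array[Double], regParam: Double): Double =
  0.5 * regParam * weights.map(w => w * w).sum

def l1RegVal(weights: Array[Double], regParam: Double): Double =
  regParam * weights.map(math.abs).sum

val initialWeights = Array(0.5, -1.0, 2.0)
// Seed the loss history with a regVal consistent with the initial weights.
val regVal = l2RegVal(initialWeights, regParam = 0.1)   // 0.5 * 0.1 * (0.25 + 1.0 + 4.0) = 0.2625
```
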
* [SPARK-1100] prevent Spark from overwriting directory silently (CodingCat, 2014-03-01, 2 files, -10/+59)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Thanks for Diana Carroll to report this issue (https://spark-project.atlassian.net/browse/SPARK-1100) the current saveAsTextFile/SequenceFile will overwrite the output directory silently if the directory already exists, this behaviour is not desirable because overwriting the data silently is not user-friendly if the partition number of two writing operation changed, then the output directory will contain the results generated by two runnings My fix includes: add some new APIs with a flag for users to define whether he/she wants to overwrite the directory: if the flag is set to true, then the output directory is deleted first and then written into the new data to prevent the output directory contains results from multiple rounds of running; if the flag is set to false, Spark will throw an exception if the output directory already exists changed JavaAPI part default behaviour is overwriting Two questions should we deprecate the old APIs without such a flag? I noticed that Spark Streaming also called these APIs, I thought we don't need to change the related part in streaming? @tdas Author: CodingCat <zhunansjtu@gmail.com> Closes #11 from CodingCat/SPARK-1100 and squashes the following commits: 6a4e3a3 [CodingCat] code clean ef2d43f [CodingCat] add new test cases and code clean ac63136 [CodingCat] checkOutputSpecs not applicable to FSOutputFormat ec490e8 [CodingCat] prevent Spark from overwriting directory silently and leaving dirty directory
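
A hedged sketch of the behaviour the new flag is described as adding (the helper below is hypothetical; the PR adds flags to the save APIs themselves): check the output directory up front and either clear it or fail fast, instead of silently mixing results from different runs.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path

def prepareOutputDir(dir: String, overwrite: Boolean, conf: Configuration): Unit = {
  val path = new Path(dir)
  val fs = path.getFileSystem(conf)
  if (fs.exists(path)) {
    if (overwrite) {
      fs.delete(path, true)   // drop results left over from a previous run
    } else {
      throw new IllegalArgumentException(s"Output directory $dir already exists")
    }
  }
}
```
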
* [SPARK-1150] fix repo location in create script (re-open) (CodingCat, 2014-03-01, 1 file, -8/+8)
| | | | | | | | | | reopen for https://spark-project.atlassian.net/browse/SPARK-1150 Author: CodingCat <zhunansjtu@gmail.com> Closes #52 from CodingCat/script_fixes and squashes the following commits: fc05a71 [CodingCat] fix repo location in create script
* Revert "[SPARK-1150] fix repo location in create script" (Patrick Wendell, 2014-03-01, 3 files, -11/+5)
| | | | This reverts commit 9aa095711858ce8670e51488f66a3d7c1a821c30.
* [SPARK-1150] fix repo location in create script (Mark Grover, 2014-03-01, 3 files, -5/+11)
| | | | | | | | | | | | | https://spark-project.atlassian.net/browse/SPARK-1150 fix the repo location in create_release script Author: Mark Grover <mark@apache.org> Closes #48 from CodingCat/script_fixes and squashes the following commits: 01f4bf7 [Mark Grover] Fixing some nitpicks d2244d4 [Mark Grover] SPARK-676: Abbreviation in SPARK_MEM but not in SPARK_WORKER_MEMORY
* [SPARK-979] Randomize order of offers. (Kay Ousterhout, 2014-03-01, 4 files, -41/+75)
| | | | | | | | | | | | | This commit randomizes the order of resource offers to avoid scheduling all tasks on the same small set of machines. This is a much simpler solution to SPARK-979 than #7. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #27 from kayousterhout/randomize and squashes the following commits: 435d817 [Kay Ousterhout] [SPARK-979] Randomize order of offers.
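
A hedged sketch of the scheduling change (illustrative types, not the actual TaskSchedulerImpl code): shuffling the resource offers before allocating tasks spreads work beyond whichever machines happen to appear first in the list.

```scala
import scala.util.Random

// Stand-in for Spark's internal resource-offer type.
case class WorkerOffer(executorId: String, host: String, cores: Int)

val offers = Seq(
  WorkerOffer("exec-1", "host-a", 4),
  WorkerOffer("exec-2", "host-b", 4),
  WorkerOffer("exec-3", "host-c", 4))

// Randomizing the order keeps tasks from piling onto the same small set of hosts.
val shuffledOffers = Random.shuffle(offers)
```
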
* SPARK-1151: Update dev merge script to use spark.git instead of incubator-spark (Thomas Graves, 2014-02-28, 1 file, -1/+1)
| | | | | | | | Author: Thomas Graves <tgraves@apache.org> Closes #47 from tgravescs/fix_merge_script and squashes the following commits: 8209ab1 [Thomas Graves] Update dev merge script to use spark.git instead of incubator-spark
* SPARK-1051. On YARN, executors don't doAs submitting user (Sandy Ryza, 2014-02-28, 5 files, -10/+25)
| | | | | | | | | | This reopens https://github.com/apache/incubator-spark/pull/538 against the new repo Author: Sandy Ryza <sandy@cloudera.com> Closes #29 from sryza/sandy-spark-1051 and squashes the following commits: 708ce49 [Sandy Ryza] SPARK-1051. doAs submitting user in YARN
* SPARK-1032. If Yarn app fails before registering, app master stays around long after (Sandy Ryza, 2014-02-28, 2 files, -18/+38)
| | | | | | | | | | | | This reopens https://github.com/apache/incubator-spark/pull/648 against the new repo. Author: Sandy Ryza <sandy@cloudera.com> Closes #28 from sryza/sandy-spark-1032 and squashes the following commits: 5953f50 [Sandy Ryza] SPARK-1032. If Yarn app fails before registering, app master stays around long after
* Remove BlockFetchTracker trait (Kay Ousterhout, 2014-02-27, 2 files, -38/+17)
| | | | | | | | | | | | This trait seems to have been created a while ago when there were multiple implementations; now that there's just one, I think it makes sense to merge it into the BlockFetcherIterator trait. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #39 from kayousterhout/remove_tracker and squashes the following commits: 8173939 [Kay Ousterhout] Remote BlockFetchTracker.
* Removed reference to incubation in Spark user docs. (Reynold Xin, 2014-02-27, 8 files, -24/+14)
| | | | | | | | Author: Reynold Xin <rxin@apache.org> Closes #2 from rxin/docs and squashes the following commits: 08bbd5f [Reynold Xin] Removed reference to incubation in Spark user docs.
* [HOTFIX] Patching maven build after #6 (SPARK-1121). (Patrick Wendell, 2014-02-27, 1 file, -8/+0)
| | | | | | | | | | | That patch removed the Maven avro declaration but didn't remove the actual dependency in core. /cc @scrapcodes Author: Patrick Wendell <pwendell@gmail.com> Closes #37 from pwendell/master and squashes the following commits: 0ef3008 [Patrick Wendell] [HOTFIX] Patching maven build after #6 (SPARK-1121).
* SPARK-1084.1 (resubmitted) (Sean Owen, 2014-02-27, 15 files, -88/+154)
| | | | | | | | | | | | | | | (Ported from https://github.com/apache/incubator-spark/pull/637 ) Author: Sean Owen <sowen@cloudera.com> Closes #31 from srowen/SPARK-1084.1 and squashes the following commits: 6c4a32c [Sean Owen] Suppress warnings about legitimate unchecked array creations, or change code to avoid it f35b833 [Sean Owen] Fix two misc javadoc problems 254e8ef [Sean Owen] Fix one new style error introduced in scaladoc warning commit 5b2fce2 [Sean Owen] Fix scaladoc invocation warning, and enable javac warnings properly, with plugin config updates 007762b [Sean Owen] Remove dead scaladoc links b8ff8cb [Sean Owen] Replace deprecated Ant <tasks> with <target>
* Show Master status on UI page (Raymond Liu, 2014-02-26, 1 file, -0/+1)
| | | | | | | | | | For standalone HA mode, A status is useful to identify the current master, already in json format too. Author: Raymond Liu <raymond.liu@intel.com> Closes #24 from colorant/status and squashes the following commits: df630b3 [Raymond Liu] Show Master status on UI page
* [SPARK-1089] fix the regression problem on ADD_JARS in 0.9 (CodingCat, 2014-02-26, 1 file, -2/+7)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://spark-project.atlassian.net/browse/SPARK-1089 copied from JIRA, reported by @ash211 "Using the ADD_JARS environment variable with spark-shell used to add the jar to both the shell and the various workers. Now it only adds to the workers and importing a custom class in the shell is broken. The workaround is to add custom jars to both ADD_JARS and SPARK_CLASSPATH. We should fix ADD_JARS so it works properly again. See various threads on the user list: https://mail-archives.apache.org/mod_mbox/incubator-spark-user/201402.mbox/%3CCAJbo4neMLiTrnm1XbyqomWmp0m+EUcg4yE-txuRGSVKOb5KLeA@mail.gmail.com%3E (another one that doesn't appear in the archives yet titled "ADD_JARS not working on 0.9")" The reason of this bug is two-folds in the current implementation of SparkILoop.scala, the settings.classpath is not set properly when the process() method is invoked the weird behaviour of Scala 2.10, (I personally thought it is a bug) if we simply set value of a PathSettings object (like settings.classpath), the isDefault is not set to true (this is a flag showing if the variable is modified), so it makes the PathResolver loads the default CLASSPATH environment variable value to calculated the path (see https://github.com/scala/scala/blob/2.10.x/src/compiler/scala/tools/util/PathResolver.scala#L215) what we have to do is to manually make this flag set, (https://github.com/CodingCat/incubator-spark/blob/e3991d97ddc33e77645e4559b13bf78b9e68239a/repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala#L884) Author: CodingCat <zhunansjtu@gmail.com> Closes #13 from CodingCat/SPARK-1089 and squashes the following commits: 8af81e7 [CodingCat] impose non-null settings 9aa2125 [CodingCat] code cleaning ce36676 [CodingCat] code cleaning e045582 [CodingCat] fix the regression problem on ADD_JARS in 0.9
* SPARK-1121 Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set (Prashant Sharma, 2014-02-26, 3 files, -55/+39)
| | | | | | | | | Author: Prashant Sharma <prashant.s@imaginea.com> Closes #6 from ScrapCodes/SPARK-1121/avro-dep-fix and squashes the following commits: 9b29e34 [Prashant Sharma] Review feedback on PR 46ed2ad [Prashant Sharma] SPARK-1121-Only add avro if the build is for Hadoop 0.23.X and SPARK_YARN is set
* SPARK-1129: use a predefined seed when seed is zero in XORShiftRandom (Xiangrui Meng, 2014-02-26, 2 files, -3/+16)
| | | | | | | | | | | | | | If the seed is zero, XORShift generates all zeros, which would create unexpected result. JIRA: https://spark-project.atlassian.net/browse/SPARK-1129 Author: Xiangrui Meng <meng@databricks.com> Closes #645 from mengxr/xor and squashes the following commits: 1b086ab [Xiangrui Meng] use MurmurHash3 to set seed in XORShiftRandom 45c6f16 [Xiangrui Meng] minor style change 51f4050 [Xiangrui Meng] use a predefined seed when seed is zero in XORShiftRandom
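
A short sketch of why a zero seed breaks an xorshift-style generator, plus the hashing workaround the commit describes (illustrative code, not the exact XORShiftRandom implementation).

```scala
import scala.util.hashing.MurmurHash3

// One xorshift step: zero is a fixed point, so a zero state emits 0 forever.
def xorShiftNext(x0: Long): Long = {
  var x = x0
  x ^= (x << 21)
  x ^= (x >>> 35)
  x ^= (x << 4)
  x
}

assert(xorShiftNext(0L) == 0L)

// Derive the initial state by hashing the user-supplied seed, so seed == 0
// no longer produces the all-zero state.
def initialState(seed: Long): Long = {
  val hashed = MurmurHash3.stringHash(seed.toString)
  if (hashed != 0) hashed.toLong else 1L   // never start the generator at 0
}
```
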
* Remove references to ClusterScheduler (SPARK-1140) (Kay Ousterhout, 2014-02-26, 8 files, -46/+47)
| | | | | | | | | | | ClusterScheduler was renamed to TaskSchedulerImpl; this commit updates comments and tests accordingly. Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9 from kayousterhout/cluster_scheduler_death and squashes the following commits: d6fd119 [Kay Ousterhout] Remove references to ClusterScheduler.
* Updated link for pyspark examples in docs (Jyotiska NK, 2014-02-26, 1 file, -1/+1)
| | | | | | | | Author: Jyotiska NK <jyotiska123@gmail.com> Closes #22 from jyotiska/pyspark_docs and squashes the following commits: 426136c [Jyotiska NK] Updated link for pyspark examples
* Deprecated and added a few java api methods for corresponding scala api. (Prashant Sharma, 2014-02-26, 5 files, -6/+32)
| | | | | | | | | | | | PR [402](https://github.com/apache/incubator-spark/pull/402) from incubator repo. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #19 from ScrapCodes/java-api-completeness and squashes the following commits: 11d0c2b [Prashant Sharma] Integer -> java.lang.Integer 737819a [Prashant Sharma] SPARK-1095 add explicit return types to APIs. 3ddc8bb [Prashant Sharma] Deprected *With functions in scala and added a few missing Java APIs
* Removed reference to incubation in README.md. (Reynold Xin, 2014-02-26, 1 file, -14/+3)
| | | | | | | | Author: Reynold Xin <rxin@apache.org> Closes #1 from rxin/readme and squashes the following commits: b3a77cd [Reynold Xin] Removed reference to incubation in README.md.
* SPARK-1115: Catch depickling errors (Bouke van der Bijl, 2014-02-26, 1 file, -24/+24)
| | | | | | | | | | | | | This surroungs the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason @JoshRosen Author: Bouke van der Bijl <boukevanderbijl@gmail.com> Closes #644 from bouk/catch-depickling-errors and squashes the following commits: f0f67cc [Bouke van der Bijl] Lol indentation 0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
* SPARK-1135: fix broken anchors in docs (Matei Zaharia, 2014-02-26, 1 file, -28/+1)
| | | | | | | | | | | | | | A recent PR that added Java vs Scala tabs for streaming also inadvertently added some bad code to a document.ready handler, breaking our other handler that manages scrolling to anchors correctly with the floating top bar. As a result the section title ended up always being hidden below the top bar. This removes the unnecessary JavaScript code. Author: Matei Zaharia <matei@databricks.com> Closes #3 from mateiz/doc-links and squashes the following commits: e2a3488 [Matei Zaharia] SPARK-1135: fix broken anchors in docs
* SPARK-1078: Replace lift-json with json4s-jackson. (William Benton, 2014-02-26, 9 files, -24/+32)
| | | | | | | | | | | | | | The aim of the Json4s project is to provide a common API for Scala JSON libraries. It is Apache-licensed, easier for downstream distributions to package, and mostly API-compatible with lift-json. Furthermore, the Jackson-backed implementation parses faster than lift-json on all but the smallest inputs. Author: William Benton <willb@redhat.com> Closes #582 from willb/json4s and squashes the following commits: 7ca62c4 [William Benton] Replace lift-json with json4s-jackson.
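
A hedged sketch of typical json4s-jackson usage (values are illustrative): the AST and DSL largely mirror lift-json, so call sites mostly change only their imports.

```scala
import org.json4s._
import org.json4s.JsonDSL._
import org.json4s.jackson.JsonMethods._

// Build, render, and re-parse a small JSON value with the json4s AST.
val status: JValue = ("name" -> "worker-1") ~ ("cores" -> 8)
val rendered: String = compact(render(status))   // {"name":"worker-1","cores":8}
val parsedBack: JValue = parse(rendered)
```
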
* SPARK-1053. Don't require SPARK_YARN_APP_JAR (Sandy Ryza, 2014-02-26, 4 files, -12/+7)
| | | | | | | | | | | | It looks this just requires taking out the checks. I verified that, with the patch, I was able to run spark-shell through yarn without setting the environment variable. Author: Sandy Ryza <sandy@cloudera.com> Closes #553 from sryza/sandy-spark-1053 and squashes the following commits: b037676 [Sandy Ryza] SPARK-1053. Don't require SPARK_YARN_APP_JAR
* For SPARK-1082, Use Curator for ZK interaction in standalone cluster (Raymond Liu, 2014-02-24, 9 files, -300/+99)
| | | | | | | | | | | Author: Raymond Liu <raymond.liu@intel.com> Closes #611 from colorant/curator and squashes the following commits: 7556aa1 [Raymond Liu] Address review comments af92e1f [Raymond Liu] Fix coding style 964f3c2 [Raymond Liu] Ignore NodeExists exception 6df2966 [Raymond Liu] Rewrite zookeeper client code with curator
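
A hedged sketch of what the Curator rewrite buys (connection string, retry settings, and paths are illustrative): connection management and retries come from the framework instead of hand-rolled ZooKeeper client code.

```scala
import org.apache.curator.framework.CuratorFrameworkFactory
import org.apache.curator.retry.ExponentialBackoffRetry

val zkUrl = "localhost:2181"
// Retry up to 3 times, starting with a 1 second backoff.
val client = CuratorFrameworkFactory.newClient(zkUrl, new ExponentialBackoffRetry(1000, 3))
client.start()

// Fluent create call; parent znodes are created on demand.
client.create().creatingParentsIfNeeded().forPath("/spark/example")
```
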
* Graph primitives2 (Semih Salihoglu, 2014-02-24, 2 files, -10/+183)
| | | | | | | | | | | | | | | | | | Hi guys, I'm following Joey and Ankur's suggestions to add collectEdges and pickRandomVertex. I'm also adding the tests for collectEdges and refactoring one method getCycleGraph in GraphOpsSuite.scala. Thank you, semih Author: Semih Salihoglu <semihsalihoglu@gmail.com> Closes #580 from semihsalihoglu/GraphPrimitives2 and squashes the following commits: 937d3ec [Semih Salihoglu] - Fixed the scalastyle errors. a69a152 [Semih Salihoglu] - Adding collectEdges and pickRandomVertices. - Adding tests for collectEdges. - Refactoring a getCycle utility function for GraphOpsSuite.scala. 41265a6 [Semih Salihoglu] - Adding collectEdges and pickRandomVertex. - Adding tests for collectEdges. - Recycling a getCycle utility test file.
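
A hedged usage sketch for the two new GraphOps methods named in this commit (assumes a running spark-shell, where `sc` is the SparkContext).

```scala
import org.apache.spark.graphx._

val vertices = sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph(vertices, edges)

// All edges incident to each vertex, regardless of direction.
val edgesPerVertex = graph.collectEdges(EdgeDirection.Either)

// A uniformly chosen vertex id, handy for seeding traversals in tests.
val someVertex: VertexId = graph.pickRandomVertex()
```
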
* Include reference to twitter/chill in tuning docs (Andrew Ash, 2014-02-24, 1 file, -3/+6)
| | | | | | | | Author: Andrew Ash <andrew@andrewash.com> Closes #647 from ash211/doc-tuning and squashes the following commits: b87de0a [Andrew Ash] Include reference to twitter/chill in tuning docs
* For outputformats that are Configurable, call setConf before sending data to them. (Bryn Keller, 2014-02-24, 2 files, -1/+80)
| | | | | | | | | | | | | | | [SPARK-1108] This allows us to use, e.g. HBase's TableOutputFormat with PairRDDFunctions.saveAsNewAPIHadoopFile, which otherwise would throw NullPointerException because the output table name hasn't been configured. Note this bug also affects branch-0.9 Author: Bryn Keller <bryn.keller@intel.com> Closes #638 from xoltar/SPARK-1108 and squashes the following commits: 7e94e7d [Bryn Keller] Import, comment, and format cleanup per code review 7cbcaa1 [Bryn Keller] For outputformats that are Configurable, call setConf before sending data to them. This allows us to use, e.g. HBase TableOutputFormat, which otherwise would throw NullPointerException because the output table name hasn't been configured
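
A hedged sketch of the pattern this commit describes (a hypothetical helper, not the exact PairRDDFunctions change): after instantiating the OutputFormat, check whether it is also Configurable and hand it the job configuration before any records are written.

```scala
import org.apache.hadoop.conf.{Configurable, Configuration}

def configureIfNeeded(outputFormat: AnyRef, conf: Configuration): Unit =
  outputFormat match {
    case c: Configurable => c.setConf(conf)  // e.g. TableOutputFormat reads its table name from conf here
    case _ =>                                // plain OutputFormats need no extra wiring
  }
```
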
* Merge pull request #641 from mateiz/spark-1124-master (Matei Zaharia, 2014-02-24, 2 files, -14/+30)
|\ | | | | | | | | | | | | | | SPARK-1124: Fix infinite retries of reduce stage when a map stage failed In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null". See https://spark-project.atlassian.net/browse/SPARK-1124 for an example. This PR also cleans up code style slightly where there was a variable named "s" and some weird map manipulation.
| * Fix removal from shuffleToMapStage to search for a key-value pair with (Matei Zaharia, 2014-02-24, 1 file, -2/+2)
| | | | | | | | our stage instead of using our shuffleID.
| * SPARK-1124: Fix infinite retries of reduce stage when a map stage failed (Matei Zaharia, 2014-02-23, 2 files, -14/+30)
|/ | | | | | | | In the previous code, if you had a failing map stage and then tried to run reduce stages on it repeatedly, the first reduce stage would fail correctly, but the later ones would mistakenly believe that all map outputs are available and start failing infinitely with fetch failures from "null".
* SPARK-1071: Tidy logging strategy and use of log4j (Sean Owen, 2014-02-23, 11 files, -69/+57)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prompted by a recent thread on the mailing list, I tried and failed to see if Spark can be made independent of log4j. There are a few cases where control of the underlying logging is pretty useful, and to do that, you have to bind to a specific logger. Instead I propose some tidying that leaves Spark's use of log4j, but gets rid of warnings and should still enable downstream users to switch. The idea is to pipe everything (except log4j) through SLF4J, and have Spark use SLF4J directly when logging, and where Spark needs to output info (REPL and tests), bind from SLF4J to log4j. This leaves the same behavior in Spark. It means that downstream users who want to use something except log4j should: - Exclude dependencies on log4j, slf4j-log4j12 from Spark - Include dependency on log4j-over-slf4j - Include dependency on another logger X, and another slf4j-X - Recreate any log config that Spark does, that is needed, in the other logger's config That sounds about right. Here are the key changes: - Include the jcl-over-slf4j shim everywhere by depending on it in core. - Exclude dependencies on commons-logging from third-party libraries. - Include the jul-to-slf4j shim everywhere by depending on it in core. - Exclude slf4j-* dependencies from third-party libraries to prevent collision or warnings - Added missing slf4j-log4j12 binding to GraphX, Bagel module tests And minor/incidental changes: - Update to SLF4J 1.7.5, which happily matches Hadoop 2’s version and is a recommended update over 1.7.2 - (Remove a duplicate HBase dependency declaration in SparkBuild.scala) - (Remove a duplicate mockito dependency declaration that was causing warnings and bugging me) Author: Sean Owen <sowen@cloudera.com> Closes #570 from srowen/SPARK-1071 and squashes the following commits: 52eac9f [Sean Owen] Add slf4j-over-log4j12 dependency to core (non-test) and remove it from things that depend on core. 77a7fa9 [Sean Owen] SPARK-1071: Tidy logging strategy and use of log4j
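
A hedged sbt sketch of the downstream recipe listed above (coordinates and versions are illustrative): bridge a third-party dependency's commons-logging and java.util.logging output through SLF4J and keep slf4j-log4j12 out of its dependency tree.

```scala
// build.sbt fragment
libraryDependencies ++= Seq(
  "org.slf4j" % "jcl-over-slf4j" % "1.7.5",   // route commons-logging calls through SLF4J
  "org.slf4j" % "jul-to-slf4j"   % "1.7.5",   // route java.util.logging calls through SLF4J
  ("org.apache.hadoop" % "hadoop-client" % "2.2.0")
    .exclude("commons-logging", "commons-logging")
    .exclude("org.slf4j", "slf4j-log4j12")
)
```
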
* [SPARK-1041] remove dead code in start script, remind user to set that in spark-env.sh (CodingCat, 2014-02-22, 3 files, -18/+1)
| | | | | | | | | | | | the lines in start-master.sh and start-slave.sh no longer work in ec2, the host name has changed, e.g. ubuntu@ip-172-31-36-93:~$ hostname ip-172-31-36-93 also, the URL to fetch public DNS name also changed, e.g. ubuntu@ip-172-31-36-93:~$ wget -q -O - http://instance-data.ec2.internal/latest/meta-data/public-hostname ubuntu@ip-172-31-36-93:~$ (returns nothing) since we have spark-ec2 project, we don't need to have such ec2-specific lines here, instead, user only need to set in spark-env.sh Author: CodingCat <zhunansjtu@gmail.com> Closes #588 from CodingCat/deadcode_in_sbin and squashes the following commits: e4236e0 [CodingCat] remove dead code in start script, remind user set that in spark-env.sh