path: root/yarn/src/main
Commit message (Author, Date, Files changed, Lines -/+)
* [SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two params, have both in error msg (Holden Karau, 2015-11-17, 1 file, -1/+2)
  When we exceed the max memory, tell users to increase both params instead of just the one. Author: Holden Karau <holden@us.ibm.com> Closes #9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.
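  The two knobs the new error message points at are the executor heap and the YARN memory overhead. A minimal sketch of raising both, assuming the era-appropriate property names spark.executor.memory and spark.yarn.executor.memoryOverhead (the values are illustrative only):
  ```scala
  import org.apache.spark.SparkConf

  // Raise both settings that bound a YARN container's memory request;
  // the container size is roughly the executor heap plus the overhead.
  val conf = new SparkConf()
    .set("spark.executor.memory", "4g")                // executor heap
    .set("spark.yarn.executor.memoryOverhead", "768")  // off-heap overhead, in MB
  ```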
* [SPARK-11718][YARN][CORE] Fix explicitly killed executor dies silently issue (jerryshao, 2015-11-16, 1 file, -6/+24)
  Currently, if dynamic allocation is enabled, explicitly killing an executor gets no response, so the executor metadata is wrong on the driver side, which makes dynamic allocation on Yarn fail to work. The problem is that `disableExecutor` returns false for pending killed executors when `onDisconnect` is detected, so nothing further is done. One solution is to bypass these explicitly killed executors and use `super.onDisconnect` to remove the executor; this is simple. Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may get killed and reported in the same AM-RM communication, the current way of adding a pending loss reason request does not work (container complete is already processed), so here we should store this loss reason for a later query. This PR chooses solution 2. Please help to review. vanzin I think this part was changed by you previously, would you please help to review? Thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #9684 from jerryshao/SPARK-11718.
* [SPARK-11555] spark on yarn spark-class --num-workers doesn't work (Thomas Graves, 2015-11-06, 2 files, -3/+6)
  I tested the various options with both spark-submit and spark-class for specifying the number of executors, in both client and cluster mode where it applied: --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, and the default with nothing supplied. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Closes #9523 from tgravescs/SPARK-11555.
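  The options exercised above all resolve to the same executor count. A hedged sketch of the SparkConf route, with the command-line and environment equivalents noted in comments (the value is illustrative):
  ```scala
  import org.apache.spark.SparkConf

  // Fixed executor count via configuration; equivalent to
  //   spark-submit --num-executors 4        (command line)
  //   export SPARK_EXECUTOR_INSTANCES=4     (environment variable)
  // and to the deprecated --num-workers flag this fix covers.
  val conf = new SparkConf().set("spark.executor.instances", "4")
  ```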
* [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors. (Marcelo Vanzin, 2015-11-04, 2 files, -34/+50)
  In YARN mode, when preemption is enabled, we may leave executors in a zombie state while we wait to retrieve the reason for which the executor exited. This is so that we don't account for failed tasks that were running on a preempted executor. The issue is that while we wait for this information, the scheduler might decide to schedule tasks on the executor, which will never be able to run them. Other side effects include the block manager still considering the executor available to cache blocks, for example. So, when we know that an executor went down but we don't know why, stop everything related to the executor, except its running tasks. Only when we know the reason for the exit (or give up waiting for it) do we update the running tasks. This is achieved by a new `disableExecutor()` method in the `Schedulable` interface. For managers that do not behave like this (i.e. every one but YARN), the existing `executorLost()` method will behave the same way it did before. On top of that change, a few minor changes that made debugging easier and fixed some other minor issues:
  - The cluster-mode AM was printing a misleading log message every time an executor disconnected from the driver (because the akka actor system was shared between driver and AM).
  - Avoid sending unnecessary requests for an executor's exit reason when we already know it was explicitly disabled / killed. This avoids both multiple requests, and unnecessary requests that would just cause warning messages on the AM (in the explicit kill case).
  - Tone down a log message about the executor being lost when it exited normally (e.g. preemption).
  - Wake up the AM monitor thread when requests for executor loss reasons arrive too, so that we can more quickly remove executors from this zombie state.
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8887 from vanzin/SPARK-10622.
* [SPARK-10997][CORE] Add "client mode" to netty rpc env. (Marcelo Vanzin, 2015-11-02, 1 file, -4/+2)
  "Client mode" means the RPC env will not listen for incoming connections. This allows certain processes in the Spark stack (such as Executors or the YARN client-mode AM) to act as pure clients when using the netty-based RPC backend, reducing the number of sockets needed by the app and also the number of open ports. Client connections are also preferred when endpoints that actually have a listening socket are involved; so, for example, if a Worker connects to a Master and the Master needs to send a message to a Worker endpoint, that client connection will be used, even though the Worker is also listening for incoming connections. With this change, the workaround for SPARK-10987 isn't necessary anymore, and is removed. The AM connects to the driver in "client mode", and that connection is used for all driver <-> AM communication, so the AM is properly notified when the connection goes down. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9210 from vanzin/SPARK-10997.
* [SPARK-9817][YARN] Improve the locality calculation of containers by taking pending container requests into consideration (jerryshao, 2015-11-02, 3 files, -22/+113)
  This is a follow-up PR to further improve the locality calculation by considering pending container requests. Since the locality preferences of tasks may shift from time to time, the current localities of pending container requests may not fully match the new preferences; this PR improves that by removing outdated, unmatched container requests and replacing them with new requests. sryza please help to review, thanks a lot. Author: jerryshao <sshao@hortonworks.com> Closes #8100 from jerryshao/SPARK-9817.
* [SPARK-11073][CORE][YARN] Remove akka dependency in secret key generation. (Marcelo Vanzin, 2015-11-01, 1 file, -1/+2)
  Use standard JDK APIs for that (with a little help from Guava). Most of the changes here are in test code, since there were no tests specific to that part of the code. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9257 from vanzin/SPARK-11073.
* [SPARK-11265][YARN] YarnClient can't get tokens to talk to Hive 1.2.1 in a secure cluster (Steve Loughran, 2015-10-31, 2 files, -49/+75)
  This is a fix for SPARK-11265; the introspection code to get Hive delegation tokens was failing on Spark 1.5.1+ due to changes in the Hive codebase. Author: Steve Loughran <stevel@hortonworks.com> Closes #9232 from steveloughran/stevel/patches/SPARK-11265-hive-tokens.
* [SPARK-11178] Improving naming around task failures. (Kay Ousterhout, 2015-10-27, 1 file, -13/+14)
  Commit af3bc59d1f5d9d952c2d7ad1af599c49f1dbdaf0 introduced new functionality so that if an executor dies for a reason that's not caused by one of the tasks running on the executor (e.g., due to pre-emption), Spark doesn't count the failure towards the maximum number of failures for the task. That commit introduced some vague naming that this commit attempts to fix; in particular:
  (1) The variable "isNormalExit", which was used to refer to cases where the executor died for a reason unrelated to the tasks running on the machine, has been renamed (and reversed) to "exitCausedByApp". The problem with the existing name is that it's not clear (at least to me!) what it means for an exit to be "normal"; the new name is intended to make the purpose of this variable more clear.
  (2) The variable "shouldEventuallyFailJob" has been renamed to "countTowardsTaskFailures". This variable is used to determine whether a task's failure should be counted towards the maximum number of failures allowed for a task before the associated Stage is aborted. The problem with the existing name is that it can be confused with implying that the task's failure should immediately cause the stage to fail because it is somehow fatal (this is the case for a fetch failure, for example: if a task fails because of a fetch failure, there's no point in retrying, and the whole stage should be failed).
  Author: Kay Ousterhout <kayousterhout@gmail.com> Closes #9164 from kayousterhout/SPARK-11178.
* [SPARK-11105] [YARN] Distribute log4j.properties to executors (vundela, 2015-10-20, 1 file, -0/+13)
  Currently the log4j.properties file is not uploaded to executors, which leads them to use the default values. This fix makes sure that the file is always uploaded to the distributed cache so that executors use the latest settings. If the user specifies log configurations through --files, then executors pick configs from --files instead of $SPARK_CONF_DIR/log4j.properties. Author: vundela <vsr@cloudera.com> Author: Srinivasa Reddy Vundela <vsr@cloudera.com> Closes #9118 from vundela/master.
* [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9 (Holden Karau, 2015-10-20, 1 file, -2/+2)
  Upgrade to Py4j 0.9. Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
* [SPARK-11120] Allow sane default number of executor failures when dynamically allocating in YARN (Ryan Williams, 2015-10-19, 2 files, -12/+26)
  I also added some information to container-failure error msgs about what host they failed on, which would have helped me identify the problem that led me to this JIRA and PR sooner. Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #9147 from ryan-williams/dyn-exec-failures.
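  A hedged sketch of how a saner failure cap might scale with the executor count under dynamic allocation; the property names are the usual ones from this era of Spark, but the exact default logic lives in the YARN ApplicationMaster and may differ from this:
  ```scala
  import org.apache.spark.SparkConf

  // If the user set spark.yarn.max.executor.failures, honor it; otherwise derive a
  // default that grows with however many executors the app can actually have.
  def maxExecutorFailures(conf: SparkConf): Int = {
    val effectiveExecutors =
      if (conf.getBoolean("spark.dynamicAllocation.enabled", false)) {
        conf.getInt("spark.dynamicAllocation.maxExecutors", 2)
      } else {
        conf.getInt("spark.executor.instances", 2)
      }
    conf.getInt("spark.yarn.max.executor.failures", math.max(3, 2 * effectiveExecutors))
  }
  ```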
* [SPARK-10921][YARN] Completely remove the use of SparkContext.preferredNodeLocationData (Jacek Laskowski, 2015-10-19, 2 files, -3/+0)
  Author: Jacek Laskowski <jacek.laskowski@deepsense.io> Closes #8976 from jaceklaskowski/SPARK-10921.
* [SPARK-11000] [YARN] Load `metadata.Hive` class only when `hive.metastore.uris` was set to avoid booting the database twice (huangzhaowei, 2015-10-17, 1 file, -4/+4)
  Author: huangzhaowei <carlmartinmax@gmail.com> Closes #9026 from SaintBacchus/SPARK-11000.
* [SPARK-11026] [YARN] spark.yarn.user.classpath.first doesn't work for 'spark-submit --jars hdfs://user/foo.jar' (Lianhui Wang, 2015-10-13, 1 file, -8/+15)
  When spark.yarn.user.classpath.first=true and using 'spark-submit --jars hdfs://user/foo.jar', it cannot put foo.jar on the system classpath, so we need to put YARN's linkNames of jars on the system classpath. vanzin tgravescs Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #9045 from lianhuiwang/spark-11026.
* [SPARK-10739] [YARN] Add application attempt window for Spark on Yarn (jerryshao, 2015-10-12, 1 file, -0/+14)
  Add an application attempt window for Spark on Yarn to ignore old, out-of-window failures; this is useful for long running applications recovering from failures. Author: jerryshao <sshao@hortonworks.com> Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits: 36eabdc [jerryshao] change the doc 7f9b77d [jerryshao] Style change 1c9afd0 [jerryshao] Address the comments caca695 [jerryshao] Add application attempt window for Spark on Yarn
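  A minimal sketch of opting into the failure window, assuming the setting this change introduces is spark.yarn.am.attemptFailuresValidityInterval (verify the name against the docs for your Spark version); the interval value is illustrative:
  ```scala
  import org.apache.spark.SparkConf

  // Only AM failures within the last hour count against spark.yarn.maxAppAttempts,
  // so a long-running app is not killed by failures accumulated over weeks.
  val conf = new SparkConf()
    .set("spark.yarn.maxAppAttempts", "4")
    .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
  ```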
* [SPARK-11023] [YARN] Avoid creating URIs from local paths directly. (Marcelo Vanzin, 2015-10-12, 1 file, -5/+6)
  The issue is that local paths on Windows, when provided with drive letters or backslashes, are not valid URIs. Instead of trying to figure out whether paths are URIs or not, use Utils.resolveURI() which does that for us. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9049 from vanzin/SPARK-11023 and squashes the following commits: 77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.
* [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps. (Marcelo Vanzin, 2015-10-09, 2 files, -5/+48)
  This change adds an API that encapsulates information about an app launched using the library. It also creates a socket-based communication layer for apps that are launched as child processes; the launching application listens for connections from launched apps, and once communication is established, the channel can be used to send updates to the launching app, or to send commands to the child app. The change also includes hooks for local, standalone/client and yarn masters. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7052 from vanzin/SPARK-8673.
* [SPARK-10987] [YARN] Workaround for missing netty rpc disconnection event. (Marcelo Vanzin, 2015-10-08, 1 file, -0/+3)
  In YARN client mode, when the AM connects to the driver, it may be the case that the driver never needs to send a message back to the AM (i.e., no dynamic allocation or preemption). This triggers an issue in the netty rpc backend where no disconnection event is sent to endpoints, and the AM never exits after the driver shuts down. The real fix is too complicated, so this is a quick hack to unblock YARN client mode until we can work on the real fix. It forces the driver to send a message to the AM when the AM registers, thus establishing that connection and enabling the disconnection event when the driver goes away. Also, a minor side issue: when the executor is shutting down, it needs to send an "ack" back to the driver when using the netty rpc backend; but that "ack" wasn't being sent because the handler was shutting down the rpc env before returning. So added a change to delay the shutdown a little bit, allowing the ack to be sent back. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9021 from vanzin/SPARK-10987.
* [SPARK-10964] [YARN] Correctly register the AM with the driver. (Marcelo Vanzin, 2015-10-07, 1 file, -1/+3)
  The `self` method returns null when called from the constructor; instead, registration should happen in the `onStart` method, at which point the `self` reference has already been initialized. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #9005 from vanzin/SPARK-10964.
* [SPARK-10812] [YARN] Fix shutdown of token renewer. (Marcelo Vanzin, 2015-10-07, 1 file, -1/+1)
  A recent change to fix the referenced bug caused this exception in the `SparkContext.stop()` path:
  org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode!
  at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:167)
  at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:182)
  at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:440)
  at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1579)
  at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1730)
  at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
  at org.apache.spark.SparkContext.stop(SparkContext.scala:1729)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8996 from vanzin/SPARK-10812.
* [SPARK-10901] [YARN] spark.yarn.user.classpath.first doesn't work (Thomas Graves, 2015-10-06, 1 file, -12/+27)
  This should go into 1.5.2 also. The issue is we were no longer adding the __app__.jar to the system classpath. Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com> Author: Tom Graves <tgraves@yahoo-inc.com> Closes #8959 from tgravescs/SPARK-10901.
* [SPARK-10916] [YARN] Set perm gen size when launching containers on YARN. (Marcelo Vanzin, 2015-10-06, 3 files, -3/+24)
  This makes YARN containers behave like all other processes launched by Spark, which launch with a default perm gen size of 256m unless overridden by the user (or not needed by the vm). Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8970 from vanzin/SPARK-10916.
* [SPARK-6028] [CORE] Remerge #6457: new RPC implementation and also pick #8905 (zsxwing, 2015-10-03, 1 file, -4/+1)
  This PR just reverted https://github.com/apache/spark/commit/02144d6745ec0a6d8877d969feb82139bd22437f to remerge #6457 and also included the commits in #8905. Author: zsxwing <zsxwing@gmail.com> Closes #8944 from zsxwing/SPARK-6028.
* [SPARK-10871] include number of executor failures in error msg (Ryan Williams, 2015-09-29, 1 file, -1/+1)
  Author: Ryan Williams <ryan.blake.williams@gmail.com> Closes #8939 from ryan-williams/errmsg.
* [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes (jerryshao, 2015-09-28, 4 files, -40/+27)
  This bug was introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092): `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead hits the problem mentioned in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790). Also consolidate and simplify some similar code snippets to keep the semantics consistent. Author: jerryshao <sshao@hortonworks.com> Closes #8910 from jerryshao/SPARK-10790.
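  A hedged sketch of the corrected fallback chain for the initial executor target when dynamic allocation is enabled; the property names are the standard dynamic-allocation settings, and the actual helper in Spark may be organized differently:
  ```scala
  import org.apache.spark.SparkConf

  // Fall back from initialExecutors to minExecutors rather than to a bare 0.
  def initialTargetExecutors(conf: SparkConf): Int = {
    val minExecutors = conf.getInt("spark.dynamicAllocation.minExecutors", 0)
    conf.getInt("spark.dynamicAllocation.initialExecutors", minExecutors)
  }
  ```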
* [SPARK-10812] [YARN] Spark hadoop util support switching to yarn (Holden Karau, 2015-09-28, 1 file, -1/+5)
  While this is likely not a huge issue for real production systems, for test systems which may set up a Spark Context, tear it down, and stand up a Spark Context with a different master (e.g. some local mode & some yarn mode tests) this can be an issue. Discovered during work on spark-testing-base on Spark 1.4.1, but it seems like the logic that triggers it is present in master (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM, however this can be heavyweight.
  ```
  [info] SampleMiniClusterTest:
  [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
  [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
  [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
  [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
  [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
  [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
  [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
  [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
  [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
  [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
  [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
  [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
  ```
  Author: Holden Karau <holden@pigscanfly.ca> Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
* Revert "[SPARK-6028][Core]A new RPC implemetation based on the network module"Xiangrui Meng2015-09-241-1/+4
| | | | This reverts commit 084e4e126211d74a79e8dbd2d0e604dd3c650822.
* [SPARK-6028][Core] A new RPC implementation based on the network module (zsxwing, 2015-09-23, 1 file, -4/+1)
  Design doc: https://docs.google.com/document/d/1CF5G6rGVQMKSyV_QKo4D2M-x6rxz5x1Ew7aK3Uq6u8c/edit?usp=sharing Author: zsxwing <zsxwing@gmail.com> Closes #6457 from zsxwing/new-rpc.
* [SPARK-10594] [YARN] Remove reference to --num-executors, add --properties-file (Erick Tryzelaar, 2015-09-14, 1 file, -1/+1)
  `ApplicationMaster` no longer has the `--num-executors` flag, and had an undocumented `--properties-file` configuration option. cc srowen Author: Erick Tryzelaar <erick.tryzelaar@gmail.com> Closes #8754 from erickt/master.
* [SPARK-8167] Make tasks that fail from YARN preemption not fail job (mcheah, 2015-09-10, 2 files, -24/+75)
  The architecture is that, in YARN mode, if the driver detects that an executor has disconnected, it asks the ApplicationMaster why the executor died. If the ApplicationMaster is aware that the executor died because of preemption, all tasks associated with that executor are not marked as failed. The executor is still removed from the driver's list of available executors, however. There are a few open questions:
  1. Should standalone mode have a similar "get executor loss reason" as well? I localized this change as much as possible to affect only YARN, but there could be a valid case to differentiate executor losses in standalone mode as well.
  2. I make a pretty strong assumption in YarnAllocator that getExecutorLossReason(executorId) will only be called once per executor id; I do this so that I can remove the metadata from the in-memory map to avoid object accumulation. It's not clear if I'm being overly zealous to save space, however.
  cc vanzin specifically for review because it collided with some earlier YARN scheduling work. cc JoshRosen because it's similar to output commit coordination we did in the past. cc andrewor14 for our discussion on how to get executor exit codes and loss reasons. Author: mcheah <mcheah@palantir.com> Closes #8007 from mccheah/feature/preemption-handling.
* [SPARK-10481] [YARN] SPARK_PREPEND_CLASSES make spark-yarn related jar could n… (Jeff Zhang, 2015-09-09, 1 file, -1/+4)
  Throw a more readable exception. Please help review. Thanks. Author: Jeff Zhang <zjffdu@apache.org> Closes #8649 from zjffdu/SPARK-10481.
* [SPARK-10332] [CORE] Fix yarn spark executor validation (Holden Karau, 2015-09-03, 1 file, -0/+3)
  From Jira: Running spark-submit with yarn with number-executors equal to 0 when not using dynamic allocation should error out. In spark 1.5.0 it continues and ends up hanging. yarn.ClientArguments still has the check so something else must have changed.
  spark-submit --master yarn --deploy-mode cluster --class org.apache.spark.examples.SparkPi --num-executors 0 ....
  spark 1.4.1 errors with: java.lang.IllegalArgumentException: Number of executors was 0, but must be at least 1 (or 0 if dynamic executor allocation is enabled).
  Author: Holden Karau <holden@pigscanfly.ca> Closes #8580 from holdenk/SPARK-10332-spark-submit-to-yarn-executors-0-message.
* [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters (Sean Owen, 2015-08-25, 4 files, -32/+32)
  Replace `JavaConversions` implicits with `JavaConverters`. Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
* [SPARK-9833] [YARN] Add options to disable delegation token retrieval. (Marcelo Vanzin, 2015-08-19, 1 file, -6/+21)
  This allows skipping the code that tries to talk to Hive and HBase to fetch delegation tokens, in case that somehow conflicts with the application being run. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #8134 from vanzin/SPARK-9833.
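  A minimal sketch of opting out of token fetching, assuming the switches added here are the per-service spark.yarn.security.tokens.&lt;service&gt;.enabled flags; treat the property names as an assumption and check the change itself for your version:
  ```scala
  import org.apache.spark.SparkConf

  // Skip the Hive and HBase delegation-token code paths entirely.
  val conf = new SparkConf()
    .set("spark.yarn.security.tokens.hive.enabled", "false")
    .set("spark.yarn.security.tokens.hbase.enabled", "false")
  ```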
* [SPARK-5754] [YARN] Spark/Yarn/Windows driver/executor escaping Fix (Carsten Blank, 2015-08-19, 4 files, -12/+75)
  This is my retry to suggest a fix for using Spark on Yarn on Windows. The former request lacked coding style, which I hope to have learned to do better, and wasn't a true solution as I didn't really understand where the problem came from. Albeit still being a bit obscure, I can name the "players" and have come up with a better explanation of why I am suggesting this fix. I also used vanzin's and srowen's input to *try* to give a more elegant solution. I am not so sure if that worked out though. I still hope that this PR is a lot more useful than the last. I also hope that this is a _solution_ to the problem that Spark doesn't work on Yarn on Windows. With these changes it works (and I can also explain why!). I still believe that a Unit Test should be included, kind of like the one I committed the last time. But that was premature, as I want to get the principal 'Go' from vanzin and srowen. Thanks for your time, both of you. Author: Carsten Blank <blank@cncengine.com> Author: cbvoxel <blank@cncengine.com> Closes #8053 from cbvoxel/master.
* [SPARK-9969] [YARN] Remove old MR classpath API support (jerryshao, 2015-08-18, 1 file, -11/+1)
  Here I propose to remove the old MRJobConfig#DEFAULT_APPLICATION_CLASSPATH support, since we have now moved to the Yarn stable API. vanzin and sryza, any opinion on this? If we still want to support the old API, I can close it. But as far as I know, all major Hadoop releases have now moved to the stable API. Author: jerryshao <sshao@hortonworks.com> Closes #8192 from jerryshao/SPARK-9969.
* [SPARK-9782] [YARN] Support YARN application tags via SparkConf (Dennis Huo, 2015-08-18, 1 file, -0/+21)
  Add a new test case in yarn/ClientSuite which checks how the various SparkConf and ClientArguments propagate into the ApplicationSubmissionContext. Author: Dennis Huo <dhuo@google.com> Closes #8072 from dennishuo/dhuo-yarn-application-tags.
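  A hedged usage sketch, assuming the property wired into the ApplicationSubmissionContext here is a comma-separated spark.yarn.tags; the tag values are illustrative:
  ```scala
  import org.apache.spark.SparkConf

  // Tags are attached to the YARN application and can be used to filter apps in the RM.
  val conf = new SparkConf()
    .set("spark.yarn.tags", "nightly-etl,team-data")
  ```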
* [SPARK-7736] [CORE] [YARN] Make pyspark fail YARN app on failure. (Marcelo Vanzin, 2015-08-17, 1 file, -2/+6)
  The YARN backend doesn't like when user code calls `System.exit`, since it cannot know the exit status and thus cannot set an appropriate final status for the application. So, for pyspark, avoid that call and instead throw an exception with the exit code. SparkSubmit handles that exception and exits with the given exit code, while YARN uses the exit code as the failure code for the Spark app. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7751 from vanzin/SPARK-9416.
* [SPARK-9826] [CORE] Fix cannot use custom classes in log4j.properties (Michel Lemay, 2015-08-12, 1 file, -2/+3)
  Refactor Utils class and create ShutdownHookManager. NOTE: Wasn't able to run /dev/run-tests on a windows machine. Manual tests were conducted locally using a custom log4j.properties file with a Redis appender and logstash formatter (bundled in the fat-jar submitted to spark), e.g.:
  log4j.rootCategory=WARN,console,redis
  log4j.appender.console=org.apache.log4j.ConsoleAppender
  log4j.appender.console.target=System.err
  log4j.appender.console.layout=org.apache.log4j.PatternLayout
  log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
  log4j.logger.org.eclipse.jetty=WARN
  log4j.logger.org.eclipse.jetty.util.component.AbstractLifeCycle=ERROR
  log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
  log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
  log4j.logger.org.apache.spark.graphx.Pregel=INFO
  log4j.appender.redis=com.ryantenney.log4j.FailoverRedisAppender
  log4j.appender.redis.endpoints=hostname:port
  log4j.appender.redis.key=mykey
  log4j.appender.redis.alwaysBatch=false
  log4j.appender.redis.layout=net.logstash.log4j.JSONEventLayoutV1
  Author: michellemay <mlemay@gmail.com> Closes #8109 from michellemay/SPARK-9826.
* [SPARK-9092] Fixed incompatibility when both num-executors and dynamic allocation are set. (Niranjan Padmanabhan, 2015-08-12, 6 files, -19/+15)
  Now, dynamic allocation is set to false when num-executors is explicitly specified as an argument. Consequently, the executorAllocationManager is not initialized in the SparkContext. Author: Niranjan Padmanabhan <niranjan.padmanabhan@cloudera.com> Closes #7657 from neurons/SPARK-9092.
* [SPARK-9737] [YARN] Add the suggested configuration when required executor memory is above the max threshold of this cluster on YARN mode (Yadong Qi, 2015-08-09, 1 file, -2/+4)
  Author: Yadong Qi <qiyadong2010@gmail.com> Closes #8028 from watermen/SPARK-9737 and squashes the following commits: 48bdf3d [Yadong Qi] Add suggested configuration.
* [SPARK-9519] [YARN] Confirm stop sc successfully when application was killed (linweizhong, 2015-08-05, 1 file, -15/+32)
  Currently, when we kill an application on Yarn, sc.stop() is called from the Yarn application state monitor thread; YarnClientSchedulerBackend.stop() then calls interrupt, which causes the SparkContext not to stop fully because we wait for executors to exit. Author: linweizhong <linweizhong@huawei.com> Closes #7846 from Sephiroth-Lin/SPARK-9519 and squashes the following commits: 1ae736d [linweizhong] Update comments 2e8e365 [linweizhong] Add comment explaining the code ad0e23b [linweizhong] Update 243d2c7 [linweizhong] Confirm stop sc successfully when application was killed
* [SPARK-9491] Avoid fetching HBase tokens when not needed. (Marcelo Vanzin, 2015-08-01, 1 file, -5/+6)
  Look at HBase's configuration to make sure it's configured for Kerberos. If the HBase configuration is missing, or if HBase is configured for non-kerberos authentication, then skip getting tokens. Reference: http://hbase.apache.org/book.html#security.prerequisites Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7810 from vanzin/SPARK-9491 and squashes the following commits: a57c776 [Marcelo Vanzin] [SPARK-9491] Avoid fetching HBase tokens when not needed.
* [SPARK-9388] [YARN] Make executor info log messages easier to read. (Marcelo Vanzin, 2015-07-30, 2 files, -4/+11)
  Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7706 from vanzin/SPARK-9388 and squashes the following commits: 028b990 [Marcelo Vanzin] Single log statement. 3c5fb6a [Marcelo Vanzin] YARN not Yarn. 5bcd7a0 [Marcelo Vanzin] [SPARK-9388] [yarn] Make executor info log messages easier to read.
* [SPARK-8297] [YARN] Scheduler backend is not notified in case node fails in YARN (Mridul Muralidharan, 2015-07-30, 3 files, -14/+45)
  This change adds code to notify the scheduler backend when a container dies in YARN. Author: Mridul Muralidharan <mridulm@yahoo-inc.com> Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #7431 from vanzin/SPARK-8297 and squashes the following commits: 471e4a0 [Marcelo Vanzin] Fix unit test after merge. d4adf4e [Marcelo Vanzin] Merge branch 'master' into SPARK-8297 3b262e8 [Marcelo Vanzin] Merge branch 'master' into SPARK-8297 537da6f [Marcelo Vanzin] Make an expected log less scary. 04dc112 [Marcelo Vanzin] Use driver <-> AM communication to send "remove executor" request. 8855b97 [Marcelo Vanzin] Merge remote-tracking branch 'mridul/fix_yarn_scheduler_bug' into SPARK-8297 687790f [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug e1b0067 [Mridul Muralidharan] Fix failing testcase, fix merge issue from our 1.3 -> master 9218fcc [Mridul Muralidharan] Fix failing testcase 362d64a [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug 62ad0cc [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug bbf8811 [Mridul Muralidharan] Merge branch 'fix_yarn_scheduler_bug' of github.com:mridulm/spark into fix_yarn_scheduler_bug 9ee1307 [Mridul Muralidharan] Fix SPARK-8297 a3a0f01 [Mridul Muralidharan] Fix SPARK-8297
* [SPARK-4352] [YARN] [WIP] Incorporate locality preferences in dynamic allocation requests (jerryshao, 2015-07-27, 3 files, -11/+223)
  Currently there is no locality preference for container requests in YARN mode, which will affect performance if data is fetched remotely, so here it is proposed to add locality awareness in Yarn dynamic allocation mode. Ping sryza, please help to review, thanks a lot. Author: jerryshao <saisai.shao@intel.com> Closes #6394 from jerryshao/SPARK-4352 and squashes the following commits: d45fecb [jerryshao] Add documents 6c3fe5c [jerryshao] Fix bug 8db6c0e [jerryshao] Further address the comments 2e2b2cb [jerryshao] Fix rebase compiling problem ce5f096 [jerryshao] Fix style issue 7f7df95 [jerryshao] Fix rebase issue 9ca9e07 [jerryshao] Code refactor according to comments d3e4236 [jerryshao] Further address the comments 5e7a593 [jerryshao] Fix bug introduced code rebase 9ca7783 [jerryshao] Style changes 08317f9 [jerryshao] code and comment refines 65b2423 [jerryshao] Further address the comments a27c587 [jerryshao] address the comment 27faabc [jerryshao] redundant code remove 9ce06a1 [jerryshao] refactor the code f5ba27b [jerryshao] Style fix 2c6cc8a [jerryshao] Fix bug and add unit tests 0757335 [jerryshao] Consider the distribution of existed containers to recalculate the new container requests 0ad66ff [jerryshao] Fix compile bugs 1c20381 [jerryshao] Minor fix 5ef2dc8 [jerryshao] Add docs and improve the code 3359814 [jerryshao] Fix rebase and test bugs 0398539 [jerryshao] reinitialize the new implementation 67596d6 [jerryshao] Still fix the code 654e1d2 [jerryshao] Fix some bugs 45b1c89 [jerryshao] Further polish the algorithm dea0152 [jerryshao] Enable node locality information in YarnAllocator 74bbcc6 [jerryshao] Support node locality for dynamic allocation initial commit
* [SPARK-8988] [YARN] Make sure driver log links appear in secure cluster mode. (Hari Shreedharan, 2015-07-27, 1 file, -54/+17)
  The NodeReports API currently used does not work in secure mode since we do not get RM tokens. Instead this patch just uses environment vars exported by YARN to create the log links. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #7624 from harishreedharan/driver-logs-env and squashes the following commits: 7368c7e [Hari Shreedharan] [SPARK-8988][YARN] Make sure driver log links appear in secure cluster mode.
* [SPARK-8851] [YARN] In Client mode, make sure the client logs in and updates tokens (Hari Shreedharan, 2015-07-17, 2 files, -13/+30)
  On the client side, the flow is SparkSubmit -> SparkContext -> yarn/Client. Since the yarn Client only gets a cloned config and the staging dir is set here, it is not really possible to do re-logins in the SparkContext. So, do the initial logins in SparkSubmit and do re-logins as we do now in the AM, but the Client behaves like an executor in this specific context and reads the credentials file to update the tokens. This way, even if the streaming context is started up from a checkpoint, it is fine since we have logged in from SparkSubmit itself. Author: Hari Shreedharan <hshreedharan@apache.org> Closes #7394 from harishreedharan/yarn-client-login and squashes the following commits: 9a2166f [Hari Shreedharan] make it possible to use command line args and config parameters together. de08f57 [Hari Shreedharan] Fix import order. 5c4fa63 [Hari Shreedharan] Add a comment explaining what is being done in YarnClientSchedulerBackend. c872caa [Hari Shreedharan] Fix typo in log message. 2c80540 [Hari Shreedharan] Move token renewal to YarnClientSchedulerBackend. 0c48ac2 [Hari Shreedharan] Remove direct use of ExecutorDelegationTokenUpdater in Client. 26f8bfa [Hari Shreedharan] [SPARK-8851][YARN] In Client mode, make sure the client logs in and updates tokens. 58b1969 [Hari Shreedharan] Simple attempt 1.
* [SPARK-8646] PySpark does not run on YARN if master not provided in command line (Lianhui Wang, 2015-07-16, 1 file, -1/+1)
  andrewor14 davies vanzin can you take a look at this? thanks Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #7438 from lianhuiwang/SPARK-8646 and squashes the following commits: cb3f12d [Lianhui Wang] add whitespace 6d874a6 [Lianhui Wang] support pyspark for yarn-client