path: root/core
Each entry: commit message (author, date, files changed, lines -/+)
* [SPARK-16637] Unified containerizer (Michael Gummelt, 2016-07-29, 10 files, -72/+132)
  ## What changes were proposed in this pull request?
  New config var: spark.mesos.docker.containerizer={"mesos","docker" (default)}
  This adds support for running docker containers via the Mesos unified containerizer: http://mesos.apache.org/documentation/latest/container-image/ The benefit is losing the dependency on `dockerd` and all the costs it incurs. I've also updated the supported Mesos version to 0.28.2 for support of the required protobufs. This is blocked on: https://github.com/apache/spark/pull/14167
  ## How was this patch tested?
  - manually testing jobs submitted with both "mesos" and "docker" settings for the new config var
  - spark/mesos integration test suite
  Author: Michael Gummelt <mgummelt@mesosphere.io>
  Closes #14275 from mgummelt/unified-containerizer.
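A minimal sketch of a submission opting into the unified containerizer, using the config key exactly as quoted in the commit message above; the master URL, app name, and image are hypothetical:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch only: the key below is taken verbatim from the commit message;
// "mesos" selects the unified containerizer, "docker" (the default)
// keeps the dockerd-based path.
val conf = new SparkConf()
  .setMaster("mesos://zk://zk1:2181/mesos")                        // hypothetical master
  .setAppName("unified-containerizer-demo")                        // hypothetical app
  .set("spark.mesos.executor.docker.image", "example/spark:2.0.0") // hypothetical image
  .set("spark.mesos.docker.containerizer", "mesos")

val sc = new SparkContext(conf)
```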
* [SPARK-5847][CORE] Allow for configuring MetricsSystem's use of app ID to namespace all metrics (Mark Grover, 2016-07-27, 5 files, -26/+158)
  ## What changes were proposed in this pull request?
  Adds a new property to SparkConf called spark.metrics.namespace that allows users to set a custom namespace for executor and driver metrics in the metrics systems. By default, the root namespace used for driver or executor metrics is the value of `spark.app.id`. However, users often want to track driver and executor metrics across apps, which is hard to do with the application ID (i.e. `spark.app.id`) since it changes with every invocation of the app. For such use cases, users can set the `spark.metrics.namespace` property to another Spark configuration key like `spark.app.name`, which is then used to populate the root namespace of the metrics system (with the app name in our example). The `spark.metrics.namespace` property can be set to any arbitrary Spark property key, whose value is used as the root namespace of the metrics system. Metrics other than driver and executor metrics are never prefixed with `spark.app.id`, nor does the `spark.metrics.namespace` property have any effect on them.
  ## How was this patch tested?
  Added new unit tests, modified existing unit tests.
  Author: Mark Grover <mark@apache.org>
  Closes #14270 from markgrover/spark-5847.
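A sketch of the use case this commit describes: pointing the metrics root namespace at a stable key so dashboards line up across app invocations. Per the commit message, the value names another Spark config key; the app name is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch: per the commit message, the value of spark.metrics.namespace
// names another Spark config key (here spark.app.name) whose value then
// becomes the metrics root namespace instead of the per-run app ID.
val conf = new SparkConf()
  .setAppName("nightly-etl") // hypothetical app name
  .set("spark.metrics.namespace", "spark.app.name")
```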
* [SPARK-15703][SCHEDULER][CORE][WEBUI] Make ListenerBus event queue size configurable (Dhruve Ashar, 2016-07-26, 8 files, -36/+56)
  ## What changes were proposed in this pull request?
  This change adds a new configuration entry to specify the size of the Spark listener bus event queue. The value for this config ("spark.scheduler.listenerbus.eventqueue.size") defaults to 10000. Note: I haven't currently documented the configuration entry. We can decide whether it would be appropriate to make it a public configuration or keep it as an undocumented one. Refer to the JIRA for more details.
  ## How was this patch tested?
  Ran existing jobs and verified the event queue size with debug logs and from the Spark WebUI Environment tab.
  Author: Dhruve Ashar <dhruveashar@gmail.com>
  Closes #14269 from dhruve/bug/SPARK-15703.
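A minimal sketch of raising the queue size for event-heavy applications, using the key quoted in the commit message; the larger value is illustrative, not a recommendation:

```scala
import org.apache.spark.SparkConf

// Sketch: enlarge the listener bus queue so bursts of task events are
// less likely to be dropped; 20000 is illustrative (the default per the
// commit is 10000).
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.size", "20000")
```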
* [SPARK-15271][MESOS] Allow force pulling executor docker images (Philipp Hoffmann, 2016-07-26, 6 files, -22/+91)
  ## What changes were proposed in this pull request?
  Mesos agents by default will not pull docker images which are already cached locally. In order to run Spark executors from mutable tags like `:latest`, this commit introduces a Spark setting (`spark.mesos.executor.docker.forcePullImage`). Setting this flag to true will tell the Mesos agent to force pull the docker image (the default is `false`, which is consistent with the previous implementation and Mesos' default behaviour).
  Author: Philipp Hoffmann <mail@philipphoffmann.de>
  Closes #14348 from philipphoffmann/force-pull-image.
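A sketch of the setting in use; the image name is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch: force the Mesos agent to re-pull the executor image so a
// mutable tag such as :latest is refreshed on every executor launch.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "example/spark:latest") // hypothetical image
  .set("spark.mesos.executor.docker.forcePullImage", "true")
```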
* [SPARK-15590][WEBUI] Paginate Job Table in Jobs tab (Tao Lin, 2016-07-25, 2 files, -62/+312)
  ## What changes were proposed in this pull request?
  This patch adds pagination support for the Job Tables in the Jobs tab. Pagination is provided for all three Job Tables (active, completed, and failed). Interactions (jumping, sorting, and setting page size) for paged tables are also included. The diff didn't keep track of some lines based on the original ones. The function `makeRow` of the original `AllJobsPage.scala` is reused. They are separated at the beginning of the function `jobRow` (L427-439) and the function `row` (L594-618) in the new `AllJobsPage.scala`.
  ## How was this patch tested?
  Tested manually by checking the Web UI after completing and failing hundreds of jobs. Generate completed jobs by:
  ```scala
  val d = sc.parallelize(Array(1,2,3,4,5))
  for (i <- 1 to 255) {
    var b = d.collect()
  }
  ```
  Generate failed jobs by calling the following code multiple times:
  ```scala
  var b = d.map(_/0).collect()
  ```
  Interactions like jumping, sorting, and setting page size are all tested.
  This shows the pagination for completed jobs:
  ![paginate success jobs](https://cloud.githubusercontent.com/assets/5558370/15986498/efa12ef6-303b-11e6-8b1d-c3382aeb9ad0.png)
  This shows the sorting works in job tables:
  ![sorting](https://cloud.githubusercontent.com/assets/5558370/15986539/98c8a81a-303c-11e6-86f2-8d2bc7924ee9.png)
  This shows the pagination for failed jobs and the effect of jumping and setting page size:
  ![paginate failed jobs](https://cloud.githubusercontent.com/assets/5558370/15986556/d8c1323e-303c-11e6-8e4b-7bdb030ea42b.png)
  Author: Tao Lin <nblintao@gmail.com>
  Closes #13620 from nblintao/dev.
* [SPARK-16166][CORE] Also take off-heap memory usage into consideration in log and webui display (jerryshao, 2016-07-25, 7 files, -9/+26)
  ## What changes were proposed in this pull request?
  Currently in the log and UI display, only on-heap storage memory is calculated and displayed:
  ```
  16/06/27 13:41:52 INFO MemoryStore: Block rdd_5_0 stored as values in memory (estimated size 17.8 KB, free 665.9 MB)
  ```
  <img width="1232" alt="untitled" src="https://cloud.githubusercontent.com/assets/850797/16369960/53fb614e-3c6e-11e6-8fa3-7ffe65abcb49.png">
  With [SPARK-13992](https://issues.apache.org/jira/browse/SPARK-13992), off-heap memory is supported for data persistence, so this change also takes off-heap storage memory into consideration.
  ## How was this patch tested?
  Unit test and local verification.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #13920 from jerryshao/SPARK-16166.
* Revert "[SPARK-15271][MESOS] Allow force pulling executor docker images"Josh Rosen2016-07-256-91/+22
| | | | This reverts commit 978cd5f125eb5a410bad2e60bf8385b11cf1b978.
* [SPARK-15271][MESOS] Allow force pulling executor docker images (Philipp Hoffmann, 2016-07-25, 6 files, -22/+91)
  ## What changes were proposed in this pull request?
  Mesos agents by default will not pull docker images which are already cached locally. In order to run Spark executors from mutable tags like `:latest`, this commit introduces a Spark setting `spark.mesos.executor.docker.forcePullImage`. Setting this flag to true will tell the Mesos agent to force pull the docker image (the default is `false`, which is consistent with the previous implementation and Mesos' default behaviour).
  ## How was this patch tested?
  I ran a sample application including this change on a Mesos cluster and verified the correct behaviour both with and without force pulling the executor image. As expected, the image is force pulled if the flag is set.
  Author: Philipp Hoffmann <mail@philipphoffmann.de>
  Closes #13051 from philipphoffmann/force-pull-image.
* [SPARK-5581][CORE] When writing sorted map output file, avoid open / close between each partition (Brian Cho, 2016-07-24, 7 files, -156/+192)
  ## What changes were proposed in this pull request?
  Replace commitAndClose with separate commit and close to avoid opening and closing the file between partitions.
  ## How was this patch tested?
  Run existing unit tests, add a few unit tests regarding reverts. Observed a ~20% reduction in total time in tasks on stages with shuffle writes to many partitions. JoshRosen
  Author: Brian Cho <bcho@fb.com>
  Closes #13382 from dafrista/separatecommit-master.
* [SPARK-16416][CORE] Force eager creation of loggers to avoid shutdown hook conflicts (Mikael Ståldal, 2016-07-24, 7 files, -0/+9)
  ## What changes were proposed in this pull request?
  Force eager creation of loggers to avoid shutdown hook conflicts.
  ## How was this patch tested?
  Manually tested with a project using Log4j 2; verified that the shutdown hook conflict issue was solved.
  Author: Mikael Ståldal <mikael.staldal@magine.com>
  Closes #14320 from mikaelstaldal/shutdown-hook-logging.
* [SPARK-16194] Mesos Driver env vars (Michael Gummelt, 2016-07-21, 4 files, -106/+191)
  ## What changes were proposed in this pull request?
  Added new configuration namespace: spark.mesos.env.*
  This allows a user submitting a job in cluster mode to set arbitrary environment variables on the driver. spark.mesos.driverEnv.KEY=VAL will result in the env var "KEY" being set to "VAL". I've also refactored the tests a bit so we can re-use code in MesosClusterScheduler, and I've refactored the command building logic in `buildDriverCommand`. Command builder values were very intertwined before, and now it's easier to determine exactly how each variable is set.
  ## How was this patch tested?
  unit tests
  Author: Michael Gummelt <mgummelt@mesosphere.io>
  Closes #14167 from mgummelt/driver-env-vars.
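A sketch of the namespace in use, following the `spark.mesos.driverEnv.KEY=VAL` form from the commit message; the keys and values are hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch: each spark.mesos.driverEnv.* entry becomes an environment
// variable in the cluster-mode driver's process.
val conf = new SparkConf()
  .set("spark.mesos.driverEnv.DATA_DIR", "/mnt/data")  // hypothetical env var
  .set("spark.mesos.driverEnv.LOG_LEVEL", "INFO")      // hypothetical env var
```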
* [SPARK-16272][CORE] Allow config values to reference conf, env, system props. (Marcelo Vanzin, 2016-07-20, 4 files, -32/+233)
  This allows configuration to be more flexible, for example, when the cluster does not have a homogeneous configuration (e.g. packages are installed on different paths in different nodes). By allowing one to reference the environment from the conf, it becomes possible to work around those in certain cases.
  As part of the implementation, ConfigEntry now keeps track of all "known" configs (i.e. those created through the use of ConfigBuilder), since that list is used by the resolution code. This duplicates some code in SQLConf, which could potentially be merged with this now. It will also make it simpler to implement some missing features such as filtering which configs show up in the UI or in event logs - which are not part of this change.
  Another change is in the way ConfigEntry reads config data; it now takes a string map and a function that reads env variables, so that it can be called both from SparkConf and SQLConf. This makes it so both places follow the same read path, instead of having to replicate certain logic in SQLConf. There are still a couple of methods in SQLConf that peek into fields of ConfigEntry directly, though.
  Tested via unit tests, and by using the new variable expansion functionality in a shell session with a custom spark.sql.hive.metastore.jars value.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #14022 from vanzin/SPARK-16272.
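A sketch of what such a reference might look like; the `${env:...}` reference syntax is an assumption based on this change's description (the commit message does not quote it), and the env var name is hypothetical:

```scala
import org.apache.spark.SparkConf

// Sketch only: assumes a ${env:VAR} reference form for pulling a value
// from the local environment, so nodes with packages installed at
// different paths can each resolve their own location.
val conf = new SparkConf()
  .set("spark.sql.hive.metastore.jars", "${env:HIVE_METASTORE_JARS}") // hypothetical env var
```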
* [SPARK-15951] Change Executors Page to use datatables to support sorting columns and searching (Kishor Patil, 2016-07-20, 9 files, -301/+693)
  1. Create the executorspage-template.html for displaying application information in datatables.
  2. Added REST API endpoint "allexecutors" to be able to see all executors created for a particular job.
  3. The executorspage.js uses jQuery to access the data from the /api/v1/applications/appid/allexecutors REST API and uses DataTable to display executors for the application. It also generates a summary of dead/live and total executors created during the life of the application.
  4. Similar changes are applicable to the Executors Page on the history server for a given application.
  Snapshots of how it looks now:
  <img width="938" alt="screen shot 2016-06-14 at 2 45 44 pm" src="https://cloud.githubusercontent.com/assets/6090397/16060092/ad1de03a-324b-11e6-8469-9eaa3f2548b5.png">
  The new Executors Page looks like this:
  <img width="1436" alt="screen shot 2016-06-15 at 10 12 01 am" src="https://cloud.githubusercontent.com/assets/6090397/16085514/ee7004f0-32e1-11e6-9340-33d91e407f2b.png">
  Author: Kishor Patil <kpatil@yahoo-inc.com>
  Closes #13670 from kishorvpatil/execTemplates.
* [SPARK-16613][CORE] RDD.pipe returns values for empty partitions (Sean Owen, 2016-07-20, 2 files, -1/+15)
  ## What changes were proposed in this pull request?
  Document RDD.pipe semantics; don't execute the process for empty input partitions. Note this includes the fix in https://github.com/apache/spark/pull/14256 because it's necessary to even test this. One or the other will merge the fix.
  ## How was this patch tested?
  Jenkins tests including the new test.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #14260 from srowen/SPARK-16613.
* [SPARK-10683][SPARK-16510][SPARKR] Move SparkR include jar test to SparkSubmitSuite (Shivaram Venkataraman, 2016-07-19, 2 files, -0/+47)
  ## What changes were proposed in this pull request?
  This change moves the include jar test from R to SparkSubmitSuite and uses a dynamically compiled jar. This helps us remove the binary jar from the R package and solves both the CRAN warnings and the lack of source being available for this jar.
  ## How was this patch tested?
  SparkR unit tests, SparkSubmitSuite, check-cran.sh
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #14243 from shivaram/sparkr-jar-move.
* [SPARK-14702] Make environment of SparkLauncher launched process more configurable (Andrew Duffy, 2016-07-19, 1 file, -4/+63)
  ## What changes were proposed in this pull request?
  Adds a few public methods to `SparkLauncher` to allow configuring some extra features of the `ProcessBuilder`, including the working directory and output and error stream redirection.
  ## How was this patch tested?
  Unit testing + simple Spark driver programs
  Author: Andrew Duffy <root@aduffy.org>
  Closes #14201 from andreweduffy/feature/launcher.
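A sketch of configuring those `ProcessBuilder` features through `SparkLauncher`; the method names mirror `ProcessBuilder`'s conventions, and the paths and class names are hypothetical:

```scala
import java.io.File
import java.lang.ProcessBuilder.Redirect
import org.apache.spark.launcher.SparkLauncher

// Sketch: set the child process's working directory and redirect its
// output/error streams via the ProcessBuilder-style hooks this PR adds.
val process = new SparkLauncher()
  .setAppResource("/path/to/app.jar")                          // hypothetical jar
  .setMainClass("com.example.Main")                            // hypothetical class
  .directory(new File("/tmp/launcher-wd"))                     // working directory
  .redirectOutput(Redirect.appendTo(new File("/tmp/out.log"))) // stdout to file
  .redirectError(Redirect.INHERIT)                             // stderr to parent
  .launch()
```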
* [SPARK-16620][CORE] Add back the tokenization process in `RDD.pipe(command: String)` (Liwei Lin, 2016-07-19, 2 files, -2/+22)
  ## What changes were proposed in this pull request?
  Currently `RDD.pipe(command: String)`:
  - works only when the command is specified without any options, such as `RDD.pipe("wc")`
  - does NOT work when the command is specified with some options, such as `RDD.pipe("wc -l")`
  This is a regression from Spark 1.6. This patch adds back the tokenization process in `RDD.pipe(command: String)` to fix this regression.
  ## How was this patch tested?
  Added a test which:
  - would pass in `1.6`
  - _[prior to this patch]_ would fail in `master`
  - _[after this patch]_ would pass in `master`
  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #14256 from lw-lin/rdd-pipe.
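A small sketch of the fixed behavior: a command string with options is tokenized again before being executed:

```scala
import org.apache.spark.SparkContext

// Sketch: with tokenization restored, the single-string overload accepts
// options and behaves like pipe(Seq("wc", "-l")).
def lineCounts(sc: SparkContext): Array[String] = {
  val words = sc.parallelize(Seq("a", "b", "c"), 2)
  words.pipe("wc -l").collect() // one count per partition
}
```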
* [SPARK-16535][BUILD] In pom.xml, remove groupId, which is a redundant definition inherited from the parent (Xin Ren, 2016-07-19, 1 file, -1/+0)
  https://issues.apache.org/jira/browse/SPARK-16535
  ## What changes were proposed in this pull request?
  When I scanned through the pom.xml of the sub projects, I found this warning (screenshot attached):
  ```
  Definition of groupId is redundant, because it's inherited from the parent
  ```
  ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png)
  I've tried to remove some of the lines with the groupId definition, and the build on my local machine is still ok.
  ```
  <groupId>org.apache.spark</groupId>
  ```
  As I just found, `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven 3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. This is great (in Maven 3.1). ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762
  ## How was this patch tested?
  I've tested by re-building the project, and the build succeeded.
  Author: Xin Ren <iamshrek@126.com>
  Closes #14189 from keypointt/SPARK-16535.
* [SPARK-16230][CORE] CoarseGrainedExecutorBackend to self kill if there is an exception while creating an Executor (Tejas Patil, 2016-07-15, 1 file, -12/+20)
  ## What changes were proposed in this pull request?
  With the fix from SPARK-13112, I see that `LaunchTask` is always processed after `RegisteredExecutor` is done, so it gets a chance to do all retries to start up an executor. There is still a problem: if `Executor` creation itself fails with an exception, it goes unnoticed, and the executor is killed when it tries to process the `LaunchTask` because `executor` is null: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L88 So if one looks at the logs, they do not tell that there was a problem during `Executor` creation and that's why it was killed. This PR explicitly catches exceptions in `Executor` creation, logs a proper message, and then exits the JVM. Also, I have changed the `exitExecutor` method to accept a `reason` so that backends can use that reason and do things like logging to a DB to get an aggregate of such exits at a cluster level.
  ## How was this patch tested?
  I am relying on existing tests.
  Author: Tejas Patil <tejasp@fb.com>
  Closes #14202 from tejasapatil/exit_executor_failure.
* [SPARK-16540][YARN][CORE] Avoid adding jars twice for Spark running on yarn (jerryshao, 2016-07-14, 1 file, -2/+2)
  ## What changes were proposed in this pull request?
  Currently when running Spark on YARN, jars specified with --jars or --packages are added twice: once to Spark's own file server and once to YARN's distributed cache. This can be seen from the log. For example:
  ```
  ./bin/spark-shell --master yarn-client --jars examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar
  ```
  If the jar to be added is the scopt jar, it is added twice:
  ```
  ...
  16/07/14 15:06:48 INFO Server: Started 5603ms
  16/07/14 15:06:48 INFO Utils: Successfully started service 'SparkUI' on port 4040.
  16/07/14 15:06:48 INFO SparkUI: Bound SparkUI to 0.0.0.0, and started at http://192.168.0.102:4040
  16/07/14 15:06:48 INFO SparkContext: Added JAR file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar at spark://192.168.0.102:63996/jars/scopt_2.11-3.3.0.jar with timestamp 1468480008637
  16/07/14 15:06:49 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
  16/07/14 15:06:49 INFO Client: Requesting a new application from cluster with 1 NodeManagers
  16/07/14 15:06:49 INFO Client: Verifying our application has not requested more than the maximum memory capability of the cluster (8192 MB per container)
  16/07/14 15:06:49 INFO Client: Will allocate AM container, with 896 MB memory including 384 MB overhead
  16/07/14 15:06:49 INFO Client: Setting up container launch context for our AM
  16/07/14 15:06:49 INFO Client: Setting up the launch environment for our AM container
  16/07/14 15:06:49 INFO Client: Preparing resources for our AM container
  16/07/14 15:06:49 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.
  16/07/14 15:06:50 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_libs__6486179704064718817.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_libs__6486179704064718817.zip
  16/07/14 15:06:51 INFO Client: Uploading resource file:/Users/sshao/projects/apache-spark/examples/target/scala-2.11/jars/scopt_2.11-3.3.0.jar -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/scopt_2.11-3.3.0.jar
  16/07/14 15:06:51 INFO Client: Uploading resource file:/private/var/folders/tb/8pw1511s2q78mj7plnq8p9g40000gn/T/spark-a446300b-84bf-43ff-bfb1-3adfb0571a42/__spark_conf__326416236462420861.zip -> hdfs://localhost:8020/user/sshao/.sparkStaging/application_1468468348998_0009/__spark_conf__.zip
  ...
  ```
  So this change avoids adding jars to Spark's file server unnecessarily.
  ## How was this patch tested?
  Manually verified in both yarn client and cluster mode, and also in standalone mode.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #14196 from jerryshao/SPARK-16540.
* [SPARK-16435][YARN][MINOR] Add warning log if initialExecutors is less than minExecutors (jerryshao, 2016-07-13, 2 files, -1/+21)
  ## What changes were proposed in this pull request?
  Currently if `spark.dynamicAllocation.initialExecutors` is less than `spark.dynamicAllocation.minExecutors`, Spark will automatically pick minExecutors without any warning, while in 1.6 Spark would throw an exception if configured like this. So here I propose to add a warning log if these parameters are configured with invalid values.
  ## How was this patch tested?
  Unit test added to verify the scenario.
  Author: jerryshao <sshao@hortonworks.com>
  Closes #14149 from jerryshao/SPARK-16435.
* [SPARK-16375][WEB UI] Fixed misassigned var: numCompletedTasks was assigned to numSkippedTasks (Alex Bozarth, 2016-07-13, 7 files, -12/+12)
  ## What changes were proposed in this pull request?
  I fixed a misassigned var: numCompletedTasks was assigned to numSkippedTasks in the convertJobData method.
  ## How was this patch tested?
  dev/run-tests
  Author: Alex Bozarth <ajbozart@us.ibm.com>
  Closes #14141 from ajbozarth/spark16375.
* [SPARK-16405] Add metrics and source for external shuffle service (Yangyang Liu, 2016-07-12, 2 files, -0/+45)
  ## What changes were proposed in this pull request?
  Since the external shuffle service is essential for Spark, better monitoring for the shuffle service is necessary. In order to do so, we added various metrics in the shuffle service and imported them into ExternalShuffleServiceSource for the metrics system. Metrics added in the shuffle service:
  * registeredExecutorsSize
  * openBlockRequestLatencyMillis
  * registerExecutorRequestLatencyMillis
  * blockTransferRateBytes
  JIRA Issue: https://issues.apache.org/jira/browse/SPARK-16405
  ## How was this patch tested?
  Some test cases were added to verify metrics behave as expected in the metrics system. Those unit test cases are shown in `ExternalShuffleBlockHandlerSuite`.
  Author: Yangyang Liu <yangyangliu@fb.com>
  Closes #14080 from lovexi/yangyang-metrics.
* [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT (Reynold Xin, 2016-07-11, 1 file, -1/+1)
  ## What changes were proposed in this pull request?
  After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
  ## How was this patch tested?
  N/A
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14130 from rxin/SPARK-16477.
* [SPARK-16432] Empty blocks fail to serialize due to assert in ChunkedByteBuffer (Eric Liang, 2016-07-08, 2 files, -13/+8)
  ## What changes were proposed in this pull request?
  It's possible to also change the callers to not pass in empty chunks, but it seems cleaner to just allow `ChunkedByteBuffer` to handle empty arrays. cc JoshRosen
  ## How was this patch tested?
  Unit tests; also checked that the original reproduction case in https://github.com/apache/spark/pull/11748#issuecomment-230760283 is resolved.
  Author: Eric Liang <ekl@databricks.com>
  Closes #14099 from ericl/spark-16432.
* [SPARK-16376][WEBUI][SPARK WEB UI][APP-ID] HTTP ERROR 500 when using rest api "/applications//jobs" if array "stageIds" is empty (Sean Owen, 2016-07-08, 1 file, -1/+6)
  ## What changes were proposed in this pull request?
  Avoid the error of finding the max of an empty Seq when stageIds is empty. It does fix the immediate problem; I don't know if it results in meaningful output, but it is not an error at least.
  ## How was this patch tested?
  Jenkins tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #14105 from srowen/SPARK-16376.
* [SPARK-16420] Ensure compression streams are closed. (Ryan Blue, 2016-07-08, 3 files, -11/+34)
  ## What changes were proposed in this pull request?
  This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was causing a problem with codecs that use off-heap memory.
  ## How was this patch tested?
  Current tests are sufficient. This should not change behavior.
  Author: Ryan Blue <blue@apache.org>
  Closes #14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak.
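A minimal sketch of the try/finally pattern the commit describes; the helper and its signature are illustrative, not the actual `UnsafeShuffleWriter` code:

```scala
import java.io.OutputStream

// Sketch: always close the compression stream, even when the write
// throws, so codec resources (including off-heap buffers) are released
// promptly instead of waiting for garbage collection.
def writeCompressed(
    sink: OutputStream,
    bytes: Array[Byte],
    compress: OutputStream => OutputStream): Unit = {
  val compressed = compress(sink)
  try {
    compressed.write(bytes)
  } finally {
    compressed.close()
  }
}
```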
* [SPARK-15885][WEB UI] Provide links to executor logs from stage details page in UI (Tom Magrino, 2016-07-07, 4 files, -8/+42)
  ## What changes were proposed in this pull request?
  This moves over old PR https://github.com/apache/spark/pull/13664 to target master rather than branch-1.6. Added links to logs (or an indication that there are no logs) for entries which list an executor in the stage details page of the UI. This helps streamline the workflow where a user views a stage details page and determines that they would like to see the associated executor log for further examination. Previously, a user would have to cross reference the executor id listed on the stage details page with the corresponding entry on the executors tab. Link to the JIRA: https://issues.apache.org/jira/browse/SPARK-15885
  ## How was this patch tested?
  Ran existing unit tests. Ran test queries on a platform which did not record executor logs and again on a platform which did record executor logs, and verified that the new table column was empty and that the links to the logs (which were verified as linking to the appropriate files) were present, respectively. Attached is a screenshot of the UI page with no links, with the new columns highlighted, and an additional screenshot of these columns with the populated links.
  Without links:
  ![updated without logs](https://cloud.githubusercontent.com/assets/1450821/16059721/2b69dbaa-3239-11e6-9eed-e539764ca159.png)
  With links:
  ![updated with logs](https://cloud.githubusercontent.com/assets/1450821/16059725/32c6e316-3239-11e6-90bd-2553f43f7779.png)
  This contribution is my original work and I license the work to the project under the Apache Spark project's open source license.
  Author: Tom Magrino <tmagrino@fb.com>
  Closes #13861 from tmagrino/uilogstweak.
* [SPARK-16398][CORE] Make cancelJob and cancelStage APIs public (MasterDDT, 2016-07-06, 1 file, -4/+14)
  ## What changes were proposed in this pull request?
  Make SparkContext `cancelJob` and `cancelStage` APIs public. This allows applications to use `SparkListener` to do their own management of jobs via events, but without using the REST API.
  ## How was this patch tested?
  Existing tests (dev/run-tests)
  Author: MasterDDT <miteshp@live.com>
  Closes #14072 from MasterDDT/SPARK-16398.
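A sketch of the listener-driven use case the commit mentions; the cancellation policy and the job-group property lookup here are hypothetical:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobStart}

// Sketch: with cancelJob now public, an application listener can cancel
// jobs directly from scheduler events instead of calling the REST API.
class JobReaper(sc: SparkContext, allowedGroups: Set[String]) extends SparkListener {
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = {
    val group = Option(jobStart.properties)
      .flatMap(p => Option(p.getProperty("spark.jobGroup.id")))
    if (!group.exists(allowedGroups.contains)) {
      sc.cancelJob(jobStart.jobId) // hypothetical policy: cancel jobs outside allowed groups
    }
  }
}
```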
* [SPARK-16379][CORE][MESOS] Spark on mesos is broken due to race condition in Logging (Sean Owen, 2016-07-06, 2 files, -5/+10)
  ## What changes were proposed in this pull request?
  The commit https://github.com/apache/spark/commit/044971eca0ff3c2ce62afa665dbd3072d52cbbec introduced a lazy val to simplify code in Logging. Simple enough, though one side effect is that accessing log now means grabbing the instance's lock. This in turn turned up a form of deadlock in the Mesos code. It was arguably a bit of a problem in how this code is structured, but, in any event, the safest thing to do seems to be to revert the commit, and that's 90% of the change here; it's just not worth the risk of similar more subtle issues.
  What I didn't revert here was the removal of this odd override of log in the Mesos code. In retrospect it might have been put in place at some stage as a defense against this type of problem. After all, the Logging code still involved a lock at initialization before the change in question. Even after the revert, it doesn't seem like it does anything, given how Logging works now, so I left it removed. However, I also removed the particular log message that ended up playing a part in this problem anyway, maybe being paranoid, to make sure this type of problem can't happen even with how the current locking works in logging initialization.
  ## How was this patch tested?
  Jenkins tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #14069 from srowen/SPARK-16379.
* [SPARK-16304] LinkageError should not crash Spark executor (petermaxlee, 2016-07-06, 2 files, -1/+14)
  ## What changes were proposed in this pull request?
  This patch updates the failure handling logic so the Spark executor does not crash when seeing a LinkageError.
  ## How was this patch tested?
  Added an end-to-end test in FailureSuite.
  Author: petermaxlee <petermaxlee@gmail.com>
  Closes #13982 from petermaxlee/SPARK-16304.
* [SPARK-15591][WEBUI] Paginate Stage Table in Stages tab (Tao Lin, 2016-07-06, 5 files, -141/+441)
  ## What changes were proposed in this pull request?
  This patch adds pagination support for the Stage Tables in the Stages tab. Pagination is provided for all four Stage Tables (active, pending, completed, and failed). Besides, the paged stage tables are also used in JobPage (the detail page for one job) and PoolPage. Interactions (jumping, sorting, and setting page size) for paged tables are also included.
  ## How was this patch tested?
  Tested manually by checking the Web UI after completing and failing hundreds of jobs. Same as the testing for [Paginate Job Table in Jobs tab](https://github.com/apache/spark/pull/13620). This shows the pagination for completed stages:
  ![paged stage table](https://cloud.githubusercontent.com/assets/5558370/16125696/5804e35e-3427-11e6-8923-5c5948982648.png)
  Author: Tao Lin <nblintao@gmail.com>
  Closes #13708 from nblintao/stageTable.
* [SPARK-16385][CORE] Catch correct exception when calling method via reflection. (Marcelo Vanzin, 2016-07-05, 1 file, -1/+1)
  Using "Method.invoke" causes an exception to be thrown, not an error, so Utils.waitForProcess() was always throwing an exception when run on Java 7.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #14056 from vanzin/SPARK-16385.
* [MINOR][DOCS] Remove unused images; crush PNGs that could use it for good measure (Sean Owen, 2016-07-04, 1 file, -0/+0)
  ## What changes were proposed in this pull request?
  Coincidentally, I discovered that a couple of images were unused in `docs/`, and then searched and found more, and then realized some PNGs were pretty big and could be crushed, and before I knew it, had done the same for the ASF site (not committed yet). No functional change at all, just less superfluous image data.
  ## How was this patch tested?
  `jekyll serve`
  Author: Sean Owen <sowen@cloudera.com>
  Closes #14029 from srowen/RemoveCompressImages.
* [MINOR][BUILD] Fix Java linter errors (Dongjoon Hyun, 2016-07-02, 2 files, -7/+8)
  ## What changes were proposed in this pull request?
  This PR fixes minor Java linter errors like the following.
  ```
  - public int read(char cbuf[], int off, int len) throws IOException {
  + public int read(char[] cbuf, int off, int len) throws IOException {
  ```
  ## How was this patch tested?
  Manual.
  ```
  $ build/mvn -T 4 -q -DskipTests -Pyarn -Phadoop-2.3 -Pkinesis-asl -Phive -Phive-thriftserver install
  $ dev/lint-java
  Using `mvn` from path: /usr/local/bin/mvn
  Checkstyle checks passed.
  ```
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #14017 from dongjoon-hyun/minor_build_java_linter_error.
* [SPARK-16335][SQL] Structured streaming should fail if source directory does not exist (Reynold Xin, 2016-07-01, 1 file, -5/+5)
  ## What changes were proposed in this pull request?
  In structured streaming, Spark does not report errors when the specified directory does not exist. This is a behavior different from the batch mode. This patch changes the behavior to fail if the directory does not exist (when the path is not a glob pattern).
  ## How was this patch tested?
  Updated unit tests to reflect the new behavior.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #14002 from rxin/SPARK-16335.
* [SPARK-16182][CORE] Utils.scala -- terminateProcess() should call Process.destroyForcibly() if and only if Process.destroy() fails (Sean Owen, 2016-07-01, 2 files, -31/+47)
  ## What changes were proposed in this pull request?
  Utils.terminateProcess should `destroy()` first and only fall back to `destroyForcibly()` if it fails. It's kind of bad that we're force-killing executors -- and only in Java 8. See the JIRA for an example of the impact: no shutdown. While here: `Utils.waitForProcess` should use the Java 8 method if available instead of a custom implementation.
  ## How was this patch tested?
  Existing tests, which cover the force-kill case, and Amplab tests, which will cover both Java 7 and Java 8 eventually. However, I tested locally on Java 8, and the PR builder will try Java 7 here.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #13973 from srowen/SPARK-16182.
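A minimal sketch of the graceful-then-forcible ordering described above; this is not the actual Utils.scala code, and the grace period is illustrative:

```scala
import java.util.concurrent.TimeUnit

// Sketch: request a polite shutdown first, and escalate to a forcible
// kill only if the process has not exited within the grace period.
def terminate(process: Process, graceMillis: Long): Int = {
  process.destroy()
  if (process.waitFor(graceMillis, TimeUnit.MILLISECONDS)) {
    process.exitValue()                  // exited on its own
  } else {
    process.destroyForcibly().waitFor()  // Java 8 API: escalate on timeout
  }
}
```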
* [SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors (Imran Rashid, 2016-06-30, 9 files, -15/+194)
  ## What changes were proposed in this pull request?
  Before this change, when you turn on blacklisting with `spark.scheduler.executorTaskBlacklistTime`, but you have fewer than `spark.task.maxFailures` executors, you can end up with a job "hung" after some task failures. Whenever a taskset is unable to schedule anything on resourceOfferSingleTaskSet, we check whether the last pending task can be scheduled on *any* known executor. If not, the taskset (and any corresponding jobs) are failed.
  * Worst case, this is O(maxTaskFailures + numTasks). But unless many executors are bad, this should be small.
  * This does not fail as fast as possible -- when a task becomes unschedulable, we keep scheduling other tasks. This is to avoid an O(numPendingTasks * numExecutors) operation.
  * Also, it is conceivable this fails too quickly. You may be 1 millisecond away from unblacklisting a place for a task to run, or acquiring a new executor.
  ## How was this patch tested?
  Added a unit test which failed before the change, ran the new test 5k times manually, ran all scheduler tests manually, and the full suite via jenkins.
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #13603 from squito/progress_w_few_execs_and_blacklist.
* [SPARK-13850] Force the sorter to Spill when number of elements in the pointer array reach a certain size (Sital Kedia, 2016-06-30, 3 files, -6/+30)
  ## What changes were proposed in this pull request?
  Force the sorter to Spill when the number of elements in the pointer array reaches a certain size. This is to work around the issue of timSort failing on large buffer sizes.
  ## How was this patch tested?
  Tested by running a job which was failing without this change due to the TimSort bug.
  Author: Sital Kedia <skedia@fb.com>
  Closes #13107 from sitalkedia/fix_TimSort.
* [SPARK-16238] Metrics for generated method and class bytecode size (Eric Liang, 2016-06-29, 1 file, -0/+12)
  ## What changes were proposed in this pull request?
  This extends SPARK-15860 to include metrics for the actual bytecode size of janino-generated methods. They can be accessed in the same way as any other codahale metric, e.g.
  ```
  scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_CLASS_BYTECODE_SIZE.getSnapshot().getValues()
  res7: Array[Long] = Array(532, 532, 532, 542, 1479, 2670, 3585, 3585)

  scala> org.apache.spark.metrics.source.CodegenMetrics.METRIC_GENERATED_METHOD_BYTECODE_SIZE.getSnapshot().getValues()
  res8: Array[Long] = Array(5, 5, 5, 5, 10, 10, 10, 10, 15, 15, 15, 38, 63, 79, 88, 94, 94, 94, 132, 132, 165, 165, 220, 220)
  ```
  ## How was this patch tested?
  Small unit test; also verified manually that the performance impact is minimal (<10%). hvanhovell
  Author: Eric Liang <ekl@databricks.com>
  Closes #13934 from ericl/spark-16238.
* [SPARK-16148][SCHEDULER] Allow for underscores in TaskLocation in the Executor ID (Tom Magrino, 2016-06-28, 2 files, -7/+9)
  ## What changes were proposed in this pull request?
  Previously, the TaskLocation implementation would not allow for executor ids which include underscores. This tweaks the string split used to get the hostname and executor id, allowing for underscores in the executor id. This addresses the JIRA found here: https://issues.apache.org/jira/browse/SPARK-16148 This is moved over from a previous PR against branch-1.6: https://github.com/apache/spark/pull/13857
  ## How was this patch tested?
  Ran existing unit tests for core and streaming. Manually ran a simple streaming job with an executor whose id contained underscores and confirmed that the job ran successfully.
  This is my original work and I license the work to the project under the project's open source license.
  Author: Tom Magrino <tmagrino@fb.com>
  Closes #13858 from tmagrino/fixtasklocation.
* [SPARK-16106][CORE] TaskSchedulerImpl should properly track executors added to existing hosts (Imran Rashid, 2016-06-27, 2 files, -65/+111)
  ## What changes were proposed in this pull request?
  TaskSchedulerImpl used to only set `newExecAvailable` when a new *host* was added, not when a new executor was added to an existing host. It also didn't update some internal state tracking live executors until a task was scheduled on the executor. This patch changes it to properly update as soon as it knows about a new executor.
  ## How was this patch tested?
  Added a unit test, ran everything via jenkins.
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #13826 from squito/SPARK-16106_executorByHosts.
* [SPARK-16136][CORE] Fix flaky TaskManagerSuite (Imran Rashid, 2016-06-27, 1 file, -25/+43)
  ## What changes were proposed in this pull request?
  TaskManagerSuite "Kill other task attempts when one attempt belonging to the same task succeeds" was flaky. When checking whether a task is speculatable, at least one millisecond must pass since the task was submitted. Use a manual clock to avoid the problem. I noticed these tests were leaving lots of threads lying around as well (which prevented me from running the test repeatedly), so I fixed that too.
  ## How was this patch tested?
  Ran the test 1k times on my laptop, passed every time (it failed about 20% of the time before this).
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #13848 from squito/fix_flaky_taskmanagersuite.
* [MINOR][CORE] Fix displaying the wrong free memory size in the log (jerryshao, 2016-06-27, 1 file, -1/+2)
  ## What changes were proposed in this pull request?
  The free memory size displayed in the log is wrong (it shows the used memory); fix it to report the correct value.
  ## How was this patch tested?
  N/A
  Author: jerryshao <sshao@hortonworks.com>
  Closes #13804 from jerryshao/memory-log-fix.
* [SPARK-16193][TESTS] Address flaky ExternalAppendOnlyMapSuite spilling tests (Sean Owen, 2016-06-25, 1 file, -1/+12)
  ## What changes were proposed in this pull request?
  Make spill tests wait until the job has completed before returning the number of stages that spilled.
  ## How was this patch tested?
  Existing Jenkins tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #13896 from srowen/SPARK-16193.
* [SPARK-1301][WEB UI] Added anchor links to Accumulators and Tasks on StagePage (Alex Bozarth, 2016-06-25, 4 files, -4/+64)
  ## What changes were proposed in this pull request?
  Sometimes the "Aggregated Metrics by Executor" table on the Stage page can get very long, so anchor links to the Accumulators and Tasks tables below it have been added to the summary at the top of the page. This has been done in the same way as on the Jobs and Stages pages. Note: the Accumulators link only displays when the table exists.
  ## How was this patch tested?
  Manually tested and dev/run-tests
  ![justtasks](https://cloud.githubusercontent.com/assets/13952758/15165269/6e8efe8c-16c9-11e6-9784-cffe966fdcf0.png)
  ![withaccumulators](https://cloud.githubusercontent.com/assets/13952758/15165270/7019ec9e-16c9-11e6-8649-db69ed7a317d.png)
  Author: Alex Bozarth <ajbozart@us.ibm.com>
  Closes #13037 from ajbozarth/spark1301.
* [SPARK-15958] Make initial buffer size for the Sorter configurable (Sital Kedia, 2016-06-25, 2 files, -4/+7)
  ## What changes were proposed in this pull request?
  Currently the initial buffer size in the sorter is hard coded inside the code and is too small for large workloads. As a result, the sorter spends significant time expanding the buffer size and copying the data. It would be useful to have it configurable.
  ## How was this patch tested?
  Tested by running a job on the cluster.
  Author: Sital Kedia <skedia@fb.com>
  Closes #13699 from sitalkedia/config_sort_buffer_upstream.
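A sketch of how the new knob might be set; the config key name is an assumption based on this change (the commit message does not quote it), and the value is illustrative:

```scala
import org.apache.spark.SparkConf

// Sketch only: key name assumed from SPARK-15958; start the shuffle
// sorter with a larger buffer to avoid repeated grow-and-copy cycles
// on large workloads.
val conf = new SparkConf()
  .set("spark.shuffle.sort.initialBufferSize", "1048576") // illustrative value
```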
* [SPARK-15963][CORE] Catch `TaskKilledException` correctly in Executor.TaskRunner (Liwei Lin, 2016-06-24, 2 files, -1/+145)
  ## The problem
  Before this change, if either of the following cases happened to a task, the task would be marked as `FAILED` instead of `KILLED`:
  - the task was killed before it was deserialized
  - `executor.kill()` marked `taskRunner.killed`, but before calling `task.killed()` the worker thread threw the `TaskKilledException`
  The reason is, in the `catch` block of the current [Executor.TaskRunner](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L362)'s implementation, we are mistakenly catching:
  ```scala
  case _: TaskKilledException | _: InterruptedException if task.killed => ...
  ```
  the semantics of which is:
  - **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
  Then when `TaskKilledException` is thrown but `task.killed` is not marked, we would mark the task as `FAILED` (which should really be `KILLED`).
  ## What changes were proposed in this pull request?
  This patch alters the catch condition's semantics from:
  - **(**`TaskKilledException` **OR** `InterruptedException`**)** **AND** `task.killed`
  to:
  - `TaskKilledException` **OR** **(**`InterruptedException` **AND** `task.killed`**)**
  so that we can catch `TaskKilledException` correctly and mark the task as `KILLED` correctly.
  ## How was this patch tested?
  Added a unit test which failed before the change; ran the new test 1000 times manually.
  Author: Liwei Lin <lwlin7@gmail.com>
  Closes #13685 from lw-lin/fix-task-killed.
* [SPARK-16129][CORE][SQL] Eliminate direct use of commons-lang classes in favor of commons-lang3 (Sean Owen, 2016-06-24, 1 file, -3/+2)
  ## What changes were proposed in this pull request?
  Replace use of `commons-lang` in favor of `commons-lang3` and forbid the former via scalastyle; remove `NotImplementedException` from `commons-lang` in favor of JDK `UnsupportedOperationException`.
  ## How was this patch tested?
  Jenkins tests
  Author: Sean Owen <sowen@cloudera.com>
  Closes #13843 from srowen/SPARK-16129.
* [SPARK-16125][YARN] Fix not testing yarn cluster mode correctly in YarnClusterSuite (peng.zhang, 2016-06-24, 1 file, -1/+2)
  ## What changes were proposed in this pull request?
  Since SPARK-13220 (Deprecate "yarn-client" and "yarn-cluster"), YarnClusterSuite doesn't test "yarn cluster" mode correctly. This pull request fixes it.
  ## How was this patch tested?
  Unit test
  Author: peng.zhang <peng.zhang@xiaomi.com>
  Closes #13836 from renozhang/SPARK-16125-test-yarn-cluster-mode.