aboutsummaryrefslogtreecommitdiff
path: root/core
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-17465][SPARK CORE] Inappropriate memory management in ↵Xing SHI2016-09-141-3/+3
| | | | | | | | | | | | | | `org.apache.spark.storage.MemoryStore` may lead to memory leak The expression like `if (memoryMap(taskAttemptId) == 0) memoryMap.remove(taskAttemptId)` in method `releaseUnrollMemoryForThisTask` and `releasePendingUnrollMemoryForThisTask` should be called after release memory operation, whatever `memoryToRelease` is > 0 or not. If the memory of a task has been set to 0 when calling a `releaseUnrollMemoryForThisTask` or a `releasePendingUnrollMemoryForThisTask` method, the key in the memory map corresponding to that task will never be removed from the hash map. See the details in [SPARK-17465](https://issues.apache.org/jira/browse/SPARK-17465). Author: Xing SHI <shi-kou@indetail.co.jp> Closes #15022 from saturday-shi/SPARK-17465.
* [SPARK-17463][CORE] Make CollectionAccumulator and SetAccumulator's value ↵Shixiong Zhu2016-09-144-23/+39
| | | | | | | | | | | | | | | | can be read thread-safely ## What changes were proposed in this pull request? Make CollectionAccumulator and SetAccumulator's value can be read thread-safely to fix the ConcurrentModificationException reported in [JIRA](https://issues.apache.org/jira/browse/SPARK-17463). ## How was this patch tested? Existing tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #15063 from zsxwing/SPARK-17463.
* [SPARK-10747][SQL] Support NULLS FIRST|LAST clause in ORDER BYXin Wu2016-09-143-15/+81
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently, ORDER BY clause returns nulls value according to sorting order (ASC|DESC), considering null value is always smaller than non-null values. However, SQL2003 standard support NULLS FIRST or NULLS LAST to allow users to specify whether null values should be returned first or last, regardless of sorting order (ASC|DESC). This PR is to support this new feature. ## How was this patch tested? New test cases are added to test NULLS FIRST|LAST for regular select queries and windowing queries. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Xin Wu <xinwu@us.ibm.com> Closes #14842 from xwu0226/SPARK-10747.
* [CORE][DOC] remove redundant commentwm624@hotmail.com2016-09-141-9/+9
| | | | | | | | | | | ## What changes were proposed in this pull request? In the comment, there is redundant `the estimated`. This PR simply remove the redundant comment and adjusts format. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15091 from wangmiao1981/comment.
* [SPARK-17449][DOCUMENTATION] Relation between heartbeatInterval and…Jagadeesan2016-09-141-0/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? The relation between spark.network.timeout and spark.executor.heartbeatInterval should be mentioned in the document. … network timeout] Author: Jagadeesan <as2@us.ibm.com> Closes #15042 from jagadeesanas2/SPARK-17449.
* [SPARK-17485] Prevent failed remote reads of cached blocks from failing ↵Josh Rosen2016-09-123-33/+6
| | | | | | | | | | | | | | | | | | | | | | entire job ## What changes were proposed in this pull request? In Spark's `RDD.getOrCompute` we first try to read a local copy of a cached RDD block, then a remote copy, and only fall back to recomputing the block if no cached copy (local or remote) can be read. This logic works correctly in the case where no remote copies of the block exist, but if there _are_ remote copies and reads of those copies fail (due to network issues or internal Spark bugs) then the BlockManager will throw a `BlockFetchException` that will fail the task (and which could possibly fail the whole job if the read failures keep occurring). In the cases of TorrentBroadcast and task result fetching we really do want to fail the entire job in case no remote blocks can be fetched, but this logic is inappropriate for reads of cached RDD blocks because those can/should be recomputed in case cached blocks are unavailable. Therefore, I think that the `BlockManager.getRemoteBytes()` method should never throw on remote fetch errors and, instead, should handle failures by returning `None`. ## How was this patch tested? Block manager changes should be covered by modified tests in `BlockManagerSuite`: the old tests expected exceptions to be thrown on failed remote reads, while the modified tests now expect `None` to be returned from the `getRemote*` method. I also manually inspected all usages of `BlockManager.getRemoteValues()`, `getRemoteBytes()`, and `get()` to verify that they correctly pattern-match on the result and handle `None`. Note that these `None` branches are already exercised because the old `getRemoteBytes` returned `None` when no remote locations for the block could be found (which could occur if an executor died and its block manager de-registered with the master). Author: Josh Rosen <joshrosen@databricks.com> Closes #15037 from JoshRosen/SPARK-17485.
* [SPARK-17483] Refactoring in BlockManager status reporting and block removalJosh Rosen2016-09-121-45/+42
| | | | | | | | | | | | | | This patch makes three minor refactorings to the BlockManager: - Move the `if (info.tellMaster)` check out of `reportBlockStatus`; this fixes an issue where a debug logging message would incorrectly claim to have reported a block status to the master even though no message had been sent (in case `info.tellMaster == false`). This also makes it easier to write code which unconditionally sends block statuses to the master (which is necessary in another patch of mine). - Split `removeBlock()` into two methods, the existing method and an internal `removeBlockInternal()` method which is designed to be called by internal code that already holds a write lock on the block. This is also needed by a followup patch. - Instead of calling `getCurrentBlockStatus()` in `removeBlock()`, just pass `BlockStatus.empty`; the block status should always be empty following complete removal of a block. These changes were originally authored as part of a bug fix patch which is targeted at branch-2.0 and master; I've split them out here into their own separate PR in order to make them easier to review and so that the behavior-changing parts of my other patch can be isolated to their own PR. Author: Josh Rosen <joshrosen@databricks.com> Closes #15036 from JoshRosen/cache-failure-race-conditions-refactorings-only.
* [SPARK-17503][CORE] Fix memory leak in Memory store when unable to cache the ↵Sean Zhong2016-09-122-14/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | whole RDD in memory ## What changes were proposed in this pull request? MemoryStore may throw OutOfMemoryError when trying to cache a super big RDD that cannot fit in memory. ``` scala> sc.parallelize(1 to 1000000000, 100).map(x => new Array[Long](1000)).cache().count() java.lang.OutOfMemoryError: Java heap space at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:24) at $line14.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$1.apply(<console>:23) at scala.collection.Iterator$$anon$11.next(Iterator.scala:409) at scala.collection.Iterator$JoinIterator.next(Iterator.scala:232) at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:683) at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:43) at org.apache.spark.util.Utils$.getIteratorSize(Utils.scala:1684) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) at org.apache.spark.rdd.RDD$$anonfun$count$1.apply(RDD.scala:1134) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915) at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:1915) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:86) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) ``` Spark MemoryStore uses SizeTrackingVector as a temporary unrolling buffer to store all input values that it has read so far before transferring the values to storage memory cache. The problem is that when the input RDD is too big for caching in memory, the temporary unrolling memory SizeTrackingVector is not garbage collected in time. As SizeTrackingVector can occupy all available storage memory, it may cause the executor JVM to run out of memory quickly. More info can be found at https://issues.apache.org/jira/browse/SPARK-17503 ## How was this patch tested? Unit test and manual test. ### Before change Heap memory consumption <img width="702" alt="screen shot 2016-09-12 at 4 16 15 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429524/60d73a26-7906-11e6-9768-6f286f5c58c8.png"> Heap dump <img width="1402" alt="screen shot 2016-09-12 at 4 34 19 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429577/cbc1ef20-7906-11e6-847b-b5903f450b3b.png"> ### After change Heap memory consumption <img width="706" alt="screen shot 2016-09-12 at 4 29 10 pm" src="https://cloud.githubusercontent.com/assets/2595532/18429503/4abe9342-7906-11e6-844a-b2f815072624.png"> Author: Sean Zhong <seanzhong@databricks.com> Closes #15056 from clockfly/memory_store_leak.
* [SPARK CORE][MINOR] fix "default partitioner cannot partition array keys" ↵WeichenXu2016-09-121-5/+5
| | | | | | | | | | | | | | | | | | | | | | | error message in PairRDDfunctions ## What changes were proposed in this pull request? In order to avoid confusing user, error message in `PairRDDfunctions` `Default partitioner cannot partition array keys.` is updated, the one in `partitionBy` is replaced with `Specified partitioner cannot partition array keys.` other is replaced with `Specified or default partitioner cannot partition array keys.` ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #15045 from WeichenXu123/fix_partitionBy_error_message.
* [SPARK-17447] Performance improvement in Partitioner.defaultPartitioner ↵codlife2016-09-121-7/+9
| | | | | | | | | | | | | | | | without sortBy ## What changes were proposed in this pull request? if there are many rdds in some situations,the sort will loss he performance servely,actually we needn't sort the rdds , we can just scan the rdds one time to gain the same goal. ## How was this patch tested? manual tests Author: codlife <1004910847@qq.com> Closes #15039 from codlife/master.
* [SPARK-17171][WEB UI] DAG will list all partitions in the graphcenyuhai2016-09-122-8/+33
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? DAG will list all partitions in the graph, it is too slow and hard to see all graph. Always we don't want to see all partitions,we just want to see the relations of DAG graph. So I just show 2 root nodes for Rdds. Before this PR, the DAG graph looks like [dag1.png](https://issues.apache.org/jira/secure/attachment/12824702/dag1.png), [dag3.png](https://issues.apache.org/jira/secure/attachment/12825456/dag3.png), after this PR, the DAG graph looks like [dag2.png](https://issues.apache.org/jira/secure/attachment/12824703/dag2.png),[dag4.png](https://issues.apache.org/jira/secure/attachment/12825457/dag4.png) Author: cenyuhai <cenyuhai@didichuxing.com> Author: 岑玉海 <261810726@qq.com> Closes #14737 from cenyuhai/SPARK-17171.
* [SPARK-17486] Remove unused TaskMetricsUIData.updatedBlockStatuses fieldJosh Rosen2016-09-111-3/+0
| | | | | | | | The `TaskMetricsUIData.updatedBlockStatuses` field is assigned to but never read, increasing the memory consumption of the web UI. We should remove this field. Author: Josh Rosen <joshrosen@databricks.com> Closes #15038 from JoshRosen/remove-updated-block-statuses-from-TaskMetricsUIData.
* [SPARK-17396][CORE] Share the task support between UnionRDD instances.Ryan Blue2016-09-101-5/+7
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Share the ForkJoinTaskSupport between UnionRDD instances to avoid creating a huge number of threads if lots of RDDs are created at the same time. ## How was this patch tested? This uses existing UnionRDD tests. Author: Ryan Blue <blue@apache.org> Closes #14985 from rdblue/SPARK-17396-use-shared-pool.
* [SPARK-17456][CORE] Utility for parsing Spark versionsJoseph K. Bradley2016-09-092-0/+128
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds methods for extracting major and minor versions as Int types in Scala from a Spark version string. Motivation: There are many hacks within Spark's codebase to identify and compare Spark versions. We should add a simple utility to standardize these code paths, especially since there have been mistakes made in the past. This will let us add unit tests as well. Currently, I want this functionality to check Spark versions to provide backwards compatibility for ML model persistence. ## How was this patch tested? Unit tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #15017 from jkbradley/version-parsing.
* [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and ↵Gurvinder Singh2016-09-0812-9/+235
| | | | | | | | | | | | | | | | | | | | | | | Workers UI ## What changes were proposed in this pull request? This pull request adds the functionality to enable accessing worker and application UI through master UI itself. Thus helps in accessing SparkUI when running spark cluster in closed networks e.g. Kubernetes. Cluster admin needs to expose only spark master UI and rest of the UIs can be in the private network, master UI will reverse proxy the connection request to corresponding resource. It adds the path for workers/application UIs as WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/ ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/ This makes it easy for users to easily protect the Spark master cluster access by putting some reverse proxy e.g. https://github.com/bitly/oauth2_proxy ## How was this patch tested? The functionality has been tested manually and there is a unit test too for testing access to worker UI with reverse proxy address. pwendell bomeng BryanCutler can you please review it, thanks. Author: Gurvinder Singh <gurvinder.singh@uninett.no> Closes #13950 from gurvindersingh/rproxy.
* [SPARK-17370] Shuffle service files not invalidated when a slave is lostEric Liang2016-09-0712-31/+92
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? DAGScheduler invalidates shuffle files when an executor loss event occurs, but not when the external shuffle service is enabled. This is because when shuffle service is on, the shuffle file lifetime can exceed the executor lifetime. However, it also doesn't invalidate shuffle files when the shuffle service itself is lost (due to whole slave loss). This can cause long hangs when slaves are lost since the file loss is not detected until a subsequent stage attempts to read the shuffle files. The proposed fix is to also invalidate shuffle files when an executor is lost due to a `SlaveLost` event. ## How was this patch tested? Unit tests, also verified on an actual cluster that slave loss invalidates shuffle files immediately as expected. cc mateiz Author: Eric Liang <ekl@databricks.com> Closes #14931 from ericl/sc-4439.
* [SPARK-17339][SPARKR][CORE] Fix some R tests and use Path.toUri in ↵hyukjinkwon2016-09-071-2/+2
| | | | | | | | | | | | | | | | | | | | SparkContext for Windows paths in SparkR ## What changes were proposed in this pull request? This PR fixes the Windows path issues in several APIs. Please refer https://issues.apache.org/jira/browse/SPARK-17339 for more details. ## How was this patch tested? Tests via AppVeyor CI - https://ci.appveyor.com/project/HyukjinKwon/spark/build/82-SPARK-17339-fix-r Also, manually, ![2016-09-06 3 14 38](https://cloud.githubusercontent.com/assets/6477701/18263406/b93a98be-7444-11e6-9521-b28ee65a4771.png) Author: hyukjinkwon <gurwls223@gmail.com> Closes #14960 from HyukjinKwon/SPARK-17339.
* [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ↵Liwei Lin2016-09-078-14/+13
| | | | | | | | | | | | | | | | ArrayBuffer.append(A) in performance critical paths ## What changes were proposed in this pull request? We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing. ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #14914 from lw-lin/append_to_plus_eq_v2.
* [SPARK-17371] Resubmitted shuffle outputs can get deleted by zombie map tasksEric Liang2016-09-063-5/+0
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? It seems that old shuffle map tasks hanging around after a stage resubmit will delete intended shuffle output files on stop(), causing downstream stages to fail even after successful resubmit completion. This can happen easily if the prior map task is waiting for a network timeout when its stage is resubmitted. This can cause unnecessary stage resubmits, sometimes multiple times as fetch fails cause a cascade of shuffle file invalidations, and confusing FetchFailure messages that report shuffle index files missing from the local disk. Given that IndexShuffleBlockResolver commits data atomically, it seems unnecessary to ever delete committed task output: even in the rare case that a task is failed after it finishes committing shuffle output, it should be safe to retain that output. ## How was this patch tested? Prior to the fix proposed in https://github.com/apache/spark/pull/14931, I was able to reproduce this behavior by killing slaves in the middle of a large shuffle. After this patch, stages were no longer resubmitted multiple times due to shuffle index loss. cc JoshRosen vanzin Author: Eric Liang <ekl@databricks.com> Closes #14932 from ericl/dont-remove-committed-files.
* [SPARK-17316][CORE] Fix the 'ask' type parameter in 'removeExecutor'Shixiong Zhu2016-09-061-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix the 'ask' type parameter in 'removeExecutor' to eliminate a lot of error logs `Cannot cast java.lang.Boolean to scala.runtime.Nothing$` ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #14983 from zsxwing/SPARK-17316-3.
* [SPARK-17110] Fix StreamCorruptionException in BlockManager.getRemoteValues()Josh Rosen2016-09-064-13/+17
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch fixes a `java.io.StreamCorruptedException` error affecting remote reads of cached values when certain data types are used. The problem stems from #11801 / SPARK-13990, a patch to have Spark automatically pick the "best" serializer when caching RDDs. If PySpark cached a PythonRDD, then this would be cached as an `RDD[Array[Byte]]` and the automatic serializer selection would pick KryoSerializer for replication and block transfer. However, the `getRemoteValues()` / `getRemoteBytes()` code path did not pass proper class tags in order to enable the same serializer to be used during deserialization, causing Java to be inappropriately used instead of Kryo, leading to the StreamCorruptedException. We already fixed a similar bug in #14311, which dealt with similar issues in block replication. Prior to that patch, it seems that we had no tests to ensure that block replication actually succeeded. Similarly, prior to this bug fix patch it looks like we had no tests to perform remote reads of cached data, which is why this bug was able to remain latent for so long. This patch addresses the bug by modifying `BlockManager`'s `get()` and `getRemoteValues()` methods to accept ClassTags, allowing the proper class tag to be threaded in the `getOrElseUpdate` code path (which is used by `rdd.iterator`) ## How was this patch tested? Extended the caching tests in `DistributedSuite` to exercise the `getRemoteValues` path, plus manual testing to verify that the PySpark bug reproduction in SPARK-17110 is fixed. Author: Josh Rosen <joshrosen@databricks.com> Closes #14952 from JoshRosen/SPARK-17110.
* [SPARK-17308] Improved the spark core code by replacing all pattern match on ↵Shivansh2016-09-041-21/+20
| | | | | | | | | | | | | | | boolean value by if/else block. ## What changes were proposed in this pull request? Improved the code quality of spark by replacing all pattern match on boolean value by if/else block. ## How was this patch tested? By running the tests Author: Shivansh <shiv4nsh@gmail.com> Closes #14873 from shiv4nsh/SPARK-17308.
* [SPARK-16829][SPARKR] sparkR sc.setLogLevel doesn't workwm624@hotmail.com2016-09-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | (Please fill in changes proposed in this fix) ./bin/sparkR Launching java with spark-submit command /Users/mwang/spark_ws_0904/bin/spark-submit "sparkr-shell" /var/folders/s_/83b0sgvj2kl2kwq4stvft_pm0000gn/T//RtmpQxJGiZ/backend_porte9474603ed1e Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties Setting default log level to "WARN". To adjust logging level use sc.setLogLevel(newLevel). > sc.setLogLevel("INFO") Error: could not find function "sc.setLogLevel" sc.setLogLevel doesn't exist. R has a function setLogLevel. I rename the setLogLevel function to sc.setLogLevel. (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Change unit test. Run unit tests. Manually tested it in sparkR shell. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14433 from wangmiao1981/sc.
* [SPARK-16984][SQL] don't try whole dataset immediately when first partition ↵Robert Kruszewski2016-09-021-3/+4
| | | | | | | | | | | | | | | | doesn't have… ## What changes were proposed in this pull request? Try increase number of partitions to try so we don't revert to all. ## How was this patch tested? Empirically. This is common case optimization. Author: Robert Kruszewski <robertk@palantir.com> Closes #14573 from robert3005/robertk/execute-take-backoff.
* [SPARK-17352][WEBUI] Executor computing time can be negative-number because ↵Kousuke Saruta2016-09-021-2/+2
| | | | | | | | | | | | | | | | | | | | of calculation error ## What changes were proposed in this pull request? In StagePage, executor-computing-time is calculated but calculation error can occur potentially because it's calculated by subtraction of floating numbers. Following capture is an example. <img width="949" alt="capture-timeline" src="https://cloud.githubusercontent.com/assets/4736016/18152359/43f07a28-7030-11e6-8cbd-8e73bf4c4c67.png"> ## How was this patch tested? Manual tests. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #14908 from sarutak/SPARK-17352.
* [SPARK-17342][WEBUI] Style of event timeline is brokenKousuke Saruta2016-09-022-32/+31
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? SPARK-15373 (#13158) updated the version of vis.js to 4.16.1. As of 4.0.0, some class was renamed like 'timeline to vis-timeline' but that ticket didn't care and now style is broken. In this PR, I've restored the style by modifying `timeline-view.css` and `timeline-view.js`. ## How was this patch tested? manual tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) * Before <img width="1258" alt="2016-09-01 1 38 31" src="https://cloud.githubusercontent.com/assets/4736016/18141311/fddf1bac-6ff3-11e6-935f-28b389073b39.png"> * After <img width="1256" alt="2016-09-01 3 30 19" src="https://cloud.githubusercontent.com/assets/4736016/18141394/49af65dc-6ff4-11e6-8640-70e20300f3c3.png"> Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #14900 from sarutak/SPARK-17342.
* [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arraysSean Owen2016-09-012-3/+3
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]() ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14895 from srowen/SPARK-17331.
* [SPARK-16533][CORE] resolve deadlocking in driver when executors dieAngus Gerry2016-09-016-79/+115
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach: * Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention. * Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention. ## How was this patch tested? Existing tests, and manual tests under yarn-client mode. Author: Angus Gerry <angolon@gmail.com> Closes #14710 from angolon/SPARK-16533.
* [SPARK-17332][CORE] Make Java Loggers static membersSean Owen2016-08-316-6/+6
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Make all Java Loggers static members ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14896 from srowen/SPARK-17332.
* [SPARK-17316][CORE] Make CoarseGrainedSchedulerBackend.removeExecutor ↵Shixiong Zhu2016-08-311-8/+9
| | | | | | | | | | | | | | | | | | non-blocking ## What changes were proposed in this pull request? StandaloneSchedulerBackend.executorRemoved is a blocking call right now. It may cause some deadlock since it's called inside StandaloneAppClient.ClientEndpoint. This PR just changed CoarseGrainedSchedulerBackend.removeExecutor to be non-blocking. It's safe since the only two usages (StandaloneSchedulerBackend and YarnSchedulerEndpoint) don't need the return value). ## How was this patch tested? Jenkins unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #14882 from zsxwing/SPARK-17316.
* [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large ↵Alex Bozarth2016-08-307-4/+93
| | | | | | | | | | | | | | | | | | application history ## What changes were proposed in this pull request? With the new History Server the summary page loads the application list via the the REST API, this makes it very slow to impossible to load with large (10K+) application history. This pr fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` api. (Note this only applies to what the summary page displays, all the Application UI's are still accessible if the user knows the App ID and goes to the Application UI directly.) I've also added a new test for the `limit` param in `HistoryServerSuite.scala` ## How was this patch tested? Manual testing and dev/run-tests Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14835 from ajbozarth/spark17243.
* [SPARK-17304] Fix perf. issue caused by ↵Josh Rosen2016-08-301-9/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TaskSetManager.abortIfCompletelyBlacklisted This patch addresses a minor scheduler performance issue that was introduced in #13603. If you run ``` sc.parallelize(1 to 100000, 100000).map(identity).count() ``` then most of the time ends up being spent in `TaskSetManager.abortIfCompletelyBlacklisted()`: ![image](https://cloud.githubusercontent.com/assets/50748/18071032/428732b0-6e07-11e6-88b2-c9423cd61f53.png) When processing resource offers, the scheduler uses a nested loop which considers every task set at multiple locality levels: ```scala for (taskSet <- sortedTaskSets; maxLocality <- taskSet.myLocalityLevels) { do { launchedTask = resourceOfferSingleTaskSet( taskSet, maxLocality, shuffledOffers, availableCpus, tasks) } while (launchedTask) } ``` In order to prevent jobs with globally blacklisted tasks from hanging, #13603 added a `taskSet.abortIfCompletelyBlacklisted` call inside of `resourceOfferSingleTaskSet`; if a call to `resourceOfferSingleTaskSet` fails to schedule any tasks, then `abortIfCompletelyBlacklisted` checks whether the tasks are completely blacklisted in order to figure out whether they will ever be schedulable. The problem with this placement of the call is that the last call to `resourceOfferSingleTaskSet` in the `while` loop will return `false`, implying that `resourceOfferSingleTaskSet` will call `abortIfCompletelyBlacklisted`, so almost every call to `resourceOffers` will trigger the `abortIfCompletelyBlacklisted` check for every task set. Instead, I think that this call should be moved out of the innermost loop and should be called _at most_ once per task set in case none of the task set's tasks can be scheduled at any locality level. Before this patch's changes, the microbenchmark example that I posted above took 35 seconds to run, but it now only takes 15 seconds after this change. /cc squito and kayousterhout for review. Author: Josh Rosen <joshrosen@databricks.com> Closes #14871 from JoshRosen/bail-early-if-no-cpus.
* [SPARK-5682][CORE] Add encrypted shuffle in sparkFerdinand Xu2016-08-3018-28/+326
| | | | | | | | | This patch is using Apache Commons Crypto library to enable shuffle encryption support. Author: Ferdinand Xu <cheng.a.xu@intel.com> Author: kellyzly <kellyzly@126.com> Closes #8880 from winningsix/SPARK-10771.
* [MINOR][MLLIB][SQL] Clean up unused variables and unused importXin Ren2016-08-301-2/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Clean up unused variables and unused import statements, unnecessary `return` and `toArray`, and some more style improvement, when I walk through the code examples. ## How was this patch tested? Testet manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #14836 from keypointt/codeWalkThroughML.
* [SPARK-17276][CORE][TEST] Stop env params output on Jenkins job pageXin Ren2016-08-301-1/+1
| | | | | | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-17276 ## What changes were proposed in this pull request? When trying to find error msg in a failed Jenkins build job, I'm annoyed by the huge env output. The env parameter output should be muted. ![screen shot 2016-08-26 at 10 52 07 pm](https://cloud.githubusercontent.com/assets/3925641/18025581/b8d567ba-6be2-11e6-9eeb-6aec223f1730.png) ## How was this patch tested? Tested manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #14848 from keypointt/SPARK-17276.
* [SPARK-17216][UI] fix event timeline bars lengthRobert Kruszewski2016-08-271-0/+4
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Make event timeline bar expand to full length of the bar (which is total time) This issue occurs only on chrome, firefox looks fine. Haven't tested other browsers. ## How was this patch tested? Inspection in browsers Before ![screen shot 2016-08-24 at 3 38 24 pm](https://cloud.githubusercontent.com/assets/512084/17935104/0d6cda74-6a12-11e6-9c66-e00cfa855606.png) After ![screen shot 2016-08-24 at 3 36 39 pm](https://cloud.githubusercontent.com/assets/512084/17935114/15740ea4-6a12-11e6-83a1-7c06eef6abb8.png) Author: Robert Kruszewski <robertk@palantir.com> Closes #14791 from robert3005/robertk/event-timeline.
* [SPARK-17266][TEST] Add empty strings to the regressionTests of ↵Yin Huai2016-08-261-1/+2
| | | | | | | | | | | | | PrefixComparatorsSuite ## What changes were proposed in this pull request? This PR adds a regression test to PrefixComparatorsSuite's "String prefix comparator" because this test failed on jenkins once (https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1620/testReport/junit/org.apache.spark.util.collection.unsafe.sort/PrefixComparatorsSuite/String_prefix_comparator/). I could not reproduce it locally. But, let's this test case in the regressionTests. Author: Yin Huai <yhuai@databricks.com> Closes #14837 from yhuai/SPARK-17266.
* [SPARK-16967] move mesos to moduleMichael Gummelt2016-08-2627-5357/+1
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Move Mesos code into a mvn module ## How was this patch tested? unit tests manually submitting a client mode and cluster mode job spark/mesos integration test suite Author: Michael Gummelt <mgummelt@mesosphere.io> Closes #14637 from mgummelt/mesos-module.
* [SPARK-17240][CORE] Make SparkConf serializable again.Marcelo Vanzin2016-08-252-5/+28
| | | | | | | | | | | | | Make the config reader transient, and initialize it lazily so that serialization works with both java and kryo (and hopefully any other custom serializer). Added unit test to make sure SparkConf remains serializable and the reader works with both built-in serializers. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #14813 from vanzin/SPARK-17240.
* [SPARK-17193][CORE] HadoopRDD NPE at DEBUG log level when getLocationInfo == ↵Sean Owen2016-08-252-15/+13
| | | | | | | | | | | | | | | | null ## What changes were proposed in this pull request? Handle null from Hadoop getLocationInfo directly instead of catching (and logging) exception ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14760 from srowen/SPARK-17193.
* [SPARK-15083][WEB UI] History Server can OOM due to unlimited TaskUIDataAlex Bozarth2016-08-248-228/+248
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Based on #12990 by tankkyo Since the History Server currently loads all application's data it can OOM if too many applications have a significant task count. `spark.ui.trimTasks` (default: false) can be set to true to trim tasks by `spark.ui.retainedTasks` (default: 10000) (This is a "quick fix" to help those running into the problem until a update of how the history server loads app data can be done) ## How was this patch tested? Manual testing and dev/run-tests ![spark-15083](https://cloud.githubusercontent.com/assets/13952758/17713694/fe82d246-63b0-11e6-9697-b87ea75ff4ef.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14673 from ajbozarth/spark15083.
* [SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the ↵Sean Owen2016-08-242-2/+2
| | | | | | | | | | | | | | | | same java used in the spark environment ## What changes were proposed in this pull request? Update to py4j 0.10.3 to enable JAVA_HOME support ## How was this patch tested? Pyspark tests Author: Sean Owen <sowen@cloudera.com> Closes #14748 from srowen/SPARK-16781.
* [MINOR][BUILD] Fix Java CheckStyle ErrorWeiqing Yang2016-08-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As Spark 2.0.1 will be released soon (mentioned in the spark dev mailing list), besides the critical bugs, it's better to fix the code style errors before the release. Before: ``` ./dev/lint-java Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[525] (sizes) LineLength: Line is longer than 100 characters (found 119). [ERROR] src/main/java/org/apache/spark/examples/sql/streaming/JavaStructuredNetworkWordCount.java:[64] (sizes) LineLength: Line is longer than 100 characters (found 103). ``` After: ``` ./dev/lint-java Using `mvn` from path: /usr/local/bin/mvn Checkstyle checks passed. ``` ## How was this patch tested? Manual. Author: Weiqing Yang <yangweiqing001@gmail.com> Closes #14768 from Sherry302/fixjavastyle.
* [SPARK-16862] Configurable buffer size in `UnsafeSorterSpillReader`Tejas Patil2016-08-231-1/+21
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-16862 `BufferedInputStream` used in `UnsafeSorterSpillReader` uses the default 8k buffer to read data off disk. This PR makes it configurable to improve on disk reads. I have made the default value to be 1 MB as with that value I observed improved performance. ## How was this patch tested? I am relying on the existing unit tests. ## Performance After deploying this change to prod and setting the config to 1 mb, there was a 12% reduction in the CPU time and 19.5% reduction in CPU reservation time. Author: Tejas Patil <tejasp@fb.com> Closes #14726 from tejasapatil/spill_buffer_2.
* [SPARK-16550][SPARK-17042][CORE] Certain classes fail to deserialize in ↵Eric Liang2016-08-223-58/+46
| | | | | | | | | | | | | | | | block manager replication ## What changes were proposed in this pull request? This is a straightforward clone of JoshRosen 's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be flaking tests so I'm going to leave that for SPARK-17042 ## How was this patch tested? End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch). Author: Eric Liang <ekl@databricks.com> Closes #14311 from ericl/spark-16550.
* [SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSLwm624@hotmail.com2016-08-211-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? `spark.ssl.enabled`=true, but failing to set `spark.ssl.protocol` will fail and throw meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled`. Improvement: require `spark.ssl.protocol` when initializing SSLContext, otherwise throws an exception to indicate that. Remove the OrElse("default"). Document this requirement in configure.md ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Manual tests: Build document and check document Configure `spark.ssl.enabled` only, it throws exception below: 6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(mwang); groups with view permissions: Set(); users with modify permissions: Set(mwang); groups with modify permissions: Set() Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections. at scala.Predef$.require(Predef.scala:224) at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285) at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026) at org.apache.spark.deploy.master.Master$.main(Master.scala:1011) at org.apache.spark.deploy.master.Master.main(Master.scala) Configure `spark.ssl.protocol` and `spark.ssl.protocol` It works fine. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14674 from wangmiao1981/ssl.
* [SPARK-12666][CORE] SparkSubmit packages fix for when 'default' conf doesn't ↵Bryan Cutler2016-08-201-4/+7
| | | | | | | | | | | | | | | exist in dependent module ## What changes were proposed in this pull request? Adding a "(runtime)" to the dependency configuration will set a fallback configuration to be used if the requested one is not found. E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module ivy file and if not found will look for the conf "runtime". This can help with the case when using "sbt publishLocal" which does not write a "default" conf in the published ivy.xml file. ## How was this patch tested? used spark-submit with --packages option for a package published locally with no default conf, and a package resolved from Maven central. Author: Bryan Cutler <cutlerb@gmail.com> Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.
* [SPARK-17113] [SHUFFLE] Job failure due to Executor OOM in offheap modeSital Kedia2016-08-192-1/+8
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes executor OOM in offheap mode due to bug in Cooperative Memory Management for UnsafeExternSorter. UnsafeExternalSorter was checking if memory page is being used by upstream by comparing the base object address of the current page with the base object address of upstream. However, in case of offheap memory allocation, the base object addresses are always null, so there was no spilling happening and eventually the operator would OOM. Following is the stack trace this issue addresses - java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0 at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341) at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93) at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170) ## How was this patch tested? Tested by running the failing job. Author: Sital Kedia <skedia@fb.com> Closes #14693 from sitalkedia/fix_offheap_oom.
* [SPARK-11227][CORE] UnknownHostException can be thrown when NameNode HA is ↵Kousuke Saruta2016-08-191-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | enabled. ## What changes were proposed in this pull request? If the following conditions are satisfied, executors don't load properties in `hdfs-site.xml` and UnknownHostException can be thrown. (1) NameNode HA is enabled (2) spark.eventLogging is disabled or logging path is NOT on HDFS (3) Using Standalone or Mesos for the cluster manager (4) There are no code to load `HdfsCondition` class in the driver regardless of directly or indirectly. (5) The tasks access to HDFS (There might be some more conditions...) For example, following code causes UnknownHostException when the conditions above are satisfied. ``` sc.textFile("<path on HDFS>").collect ``` ``` java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378) at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310) at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678) at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92) at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170) at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:438) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411) at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986) at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177) at scala.Option.map(Option.scala:146) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177) at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:213) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209) at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38) at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318) at org.apache.spark.rdd.RDD.iterator(RDD.scala:282) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70) at org.apache.spark.scheduler.Task.run(Task.scala:85) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) at java.lang.Thread.run(Thread.java:745) Caused by: java.net.UnknownHostException: hacluster ``` But following code doesn't cause the Exception because `textFile` method loads `HdfsConfiguration` indirectly. ``` sc.textFile("<path on HDFS>").collect ``` When a job includes some operations which access to HDFS, the object of `org.apache.hadoop.Configuration` is wrapped by `SerializableConfiguration`, serialized and broadcasted from driver to executors and each executor deserialize the object with `loadDefaults` false so HDFS related properties should be set before broadcasted. ## How was this patch tested? Tested manually on my standalone cluster. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #13738 from sarutak/SPARK-11227.
* [SPARK-16673][WEB UI] New Executor Page removed conditional for Logs and ↵Alex Bozarth2016-08-192-11/+34
| | | | | | | | | | | | | | | | | | | | Thread Dump columns ## What changes were proposed in this pull request? When #13670 switched `ExecutorsPage` to use JQuery DataTables it incidentally removed the conditional for the Logs and Thread Dump columns. I reimplemented the conditional display of the Logs and Thread dump columns as it was before the switch. ## How was this patch tested? Manually tested and dev/run-tests ![both](https://cloud.githubusercontent.com/assets/13952758/17186879/da8dd1a8-53eb-11e6-8b0c-d0ff0156a9a7.png) ![dump](https://cloud.githubusercontent.com/assets/13952758/17186881/dab08a04-53eb-11e6-8b1c-50ffd0bf2ae8.png) ![logs](https://cloud.githubusercontent.com/assets/13952758/17186880/dab04d00-53eb-11e6-8754-68dd64d6d9f4.png) Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #14382 from ajbozarth/spark16673.