| Commit message | Author | Age | Files | Lines |
| |
This reverts commit adfed7086f10fa8db4eeac7996c84cf98f625e9a.
| |
Defer use of a log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework. The only change is to push one half of the check in the original `if` condition inside it. This is a trivial change that may or may not actually solve a problem, but I think it's all that makes sense to do for SPARK-4147.
Author: Sean Owen <sowen@cloudera.com>
Closes #4190 from srowen/SPARK-4147 and squashes the following commits:
4e99942 [Sean Owen] Defer use of log4j class until it's known that log4j 1.2 is being used. This may avoid dealing with log4j dependencies for callers that reroute slf4j to another logging framework.
(cherry picked from commit 54e7b456dd56c9e52132154e699abca87563465b)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
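The shape of the change can be pictured with a small sketch (illustrative only — the real code lives in Spark's `Logging` trait): consult the slf4j binding first, and only reference `org.apache.log4j` classes inside that branch, so the log4j classes are never loaded when slf4j is routed elsewhere.

```Scala
import org.slf4j.impl.StaticLoggerBinder

// Sketch: check which slf4j binding is active *before* touching any
// org.apache.log4j class, so callers that route slf4j to logback etc.
// never trigger log4j class loading.
def initializeIfUsingLog4j12(): Unit = {
  val binder = StaticLoggerBinder.getSingleton.getLoggerFactoryClassStr
  if (binder == "org.slf4j.impl.Log4jLoggerFactory") {
    // Only here is it safe to reference org.apache.log4j.LogManager etc.
  }
}
```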
| |
j.u.c.ConcurrentHashMap is more battle tested.
cc rxin JoshRosen pwendell
Author: Davies Liu <davies@databricks.com>
Closes #4208 from davies/safe-conf and squashes the following commits:
c2182dc [Davies Liu] address comments, fix tests
3a1d821 [Davies Liu] fix test
da14ced [Davies Liu] Merge branch 'master' of github.com:apache/spark into safe-conf
ae4d305 [Davies Liu] change to j.u.c.ConcurrentMap
f8fa1cf [Davies Liu] change to TrieMap
a1d769a [Davies Liu] make SparkConf thread-safe
(cherry picked from commit 142093179a4c40bdd90744191034de7b94a963ff)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
file was renamed to completed file"
This reverts commit 8f55beeb51e6ea72e63af3f276497f61dd24d09b.
| |
renamed to completed file
`FsHistoryProvider` tries to update application status, but if `checkForLogs` is called before the `.inprogress` file is renamed to a completed file, the file is not recognized as completed.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #4132 from sarutak/SPARK-5344 and squashes the following commits:
9658008 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-5344
d2c72b6 [Kousuke Saruta] Fixed update issue of FsHistoryProvider
(cherry picked from commit 8f5c827b01026bf45fc774ed7387f11a941abea8)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/test/scala/org/apache/spark/deploy/history/FsHistoryProviderSuite.scala
| |
also rename "slaveHostname" to "executorHostname"
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #4195 from ryan-williams/exec and squashes the following commits:
e60a7bb [Ryan Williams] log executor ID at executor-construction time
(cherry picked from commit aea25482c370fbcf712a464501605bc16ee4ed5d)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/main/scala/org/apache/spark/executor/Executor.scala
| |
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #4194 from ryan-williams/metrics and squashes the following commits:
7c5a33f [Ryan Williams] set executor ID before creating MetricsSystem
| |
This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts. Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow.
In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them).
In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext. In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer).
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits:
a38774b [Josh Rosen] Fix spelling typo
a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases.
2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility
3f0ea0c [Josh Rosen] Revert two unintentional formatting changes
8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2
8cff41a [Josh Rosen] IllegalStateException fix
6ef68d0 [Josh Rosen] Fix Python line length issues.
9f6a0b8 [Josh Rosen] Add improved error messages to PySpark.
13afd0f [Josh Rosen] SparkException -> IllegalStateException
8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063
b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD
99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts.
34833e8 [Josh Rosen] Add more descriptive error message.
57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD.
15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
(cherry picked from commit cef1f092a628ac20709857b4388bb10e0b5143b0)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
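The guard described above can be sketched in miniature (hypothetical class, not Spark's actual `SparkContext`): a volatile flag set by `stop()`, checked at the top of methods that would otherwise fail with a confusing `NullPointerException`.

```Scala
// Minimal sketch of the assertNotStopped() pattern.
class MiniContext {
  @volatile private var stopped = false

  private def assertNotStopped(): Unit = {
    if (stopped) {
      throw new IllegalStateException("Cannot call methods on a stopped context")
    }
  }

  def stop(): Unit = { stopped = true }

  // Guarded method: fails fast with a clear message instead of an NPE
  // from dereferencing an already-nulled internal field.
  def parallelize(data: Seq[Int]): Seq[Int] = {
    assertNotStopped()
    data
  }
}
```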
| |
SparkConf is not thread-safe, but it is accessed by many threads. `getAll()` could return only part of the configs if another thread is modifying it concurrently.
This PR changes SparkConf.settings to a thread-safe TrieMap.
Author: Davies Liu <davies@databricks.com>
Closes #4143 from davies/safe-conf and squashes the following commits:
f8fa1cf [Davies Liu] change to TrieMap
a1d769a [Davies Liu] make SparkConf thread-safe
(cherry picked from commit 9bad062268676aaa66dcbddd1e0ab7f2d7742425)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
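The idea can be sketched as follows (hypothetical class, not Spark's actual `SparkConf`): `TrieMap` supports lock-free concurrent updates, and its iterators work on a consistent snapshot, so `getAll` never observes a half-written set of configs.

```Scala
import scala.collection.concurrent.TrieMap

// Sketch of a thread-safe config backed by a TrieMap.
class SafeConf {
  private val settings = TrieMap[String, String]()

  def set(key: String, value: String): this.type = {
    settings(key) = value  // lock-free concurrent update
    this
  }

  def get(key: String): Option[String] = settings.get(key)

  // Iteration over a TrieMap uses a snapshot, so this is a
  // consistent view even while other threads keep writing.
  def getAll: Array[(String, String)] = settings.toArray
}
```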
| |
Whenever a directory is created by the utility method, immediately restrict
its permissions so that only the owner has access to its contents.
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
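The "restrict to owner" step can be sketched with plain `java.io.File` permission setters (illustrative helper; the actual change is in Spark's utility code): clear each permission for everyone, then re-grant it to the owner only — the JDK equivalent of `chmod 700`.

```Scala
import java.io.File

// Sketch: create a directory readable/writable/executable only by its owner.
def createOwnerOnlyDir(parent: File, name: String): File = {
  val dir = new File(parent, name)
  dir.mkdirs()
  dir.setReadable(false, false)   // clear read for everyone...
  dir.setReadable(true, true)     // ...then re-grant to owner only
  dir.setWritable(false, false)
  dir.setWritable(true, true)
  dir.setExecutable(false, false)
  dir.setExecutable(true, true)
  dir
}
```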
| |
https://issues.apache.org/jira/browse/SPARK-5006
I think the issue is produced in https://github.com/apache/spark/pull/1777.
I haven't dug into the Mesos backend yet; the same logic may need to be added there as well.
Author: WangTaoTheTonic <barneystinson@aliyun.com>
Author: WangTao <barneystinson@aliyun.com>
Closes #3841 from WangTaoTheTonic/SPARK-5006 and squashes the following commits:
8cdf96d [WangTao] indent thing
2d86d65 [WangTaoTheTonic] fix line length
7cdfd98 [WangTaoTheTonic] fit for new HttpServer constructor
61a370d [WangTaoTheTonic] some minor fixes
bc6e1ec [WangTaoTheTonic] rebase
67bcb46 [WangTaoTheTonic] put conf at 3rd position, modify suite class, add comments
f450cd1 [WangTaoTheTonic] startServiceOnPort will use a SparkConf arg
29b751b [WangTaoTheTonic] rebase as ExecutorRunnableUtil changed to ExecutorRunnable
396c226 [WangTaoTheTonic] make the grammar more like scala
191face [WangTaoTheTonic] invalid value name
62ec336 [WangTaoTheTonic] spark.port.maxRetries doesn't work
Conflicts:
external/mqtt/src/test/scala/org/apache/spark/streaming/mqtt/MQTTStreamSuite.scala
| |
Hi all - I've renamed the unhelpfully named variable and added a comment clarifying what's actually happening.
Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
Closes #3666 from ilganeli/SPARK-4569B and squashes the following commits:
1810394 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
e2d2092 [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
d7cefec [Ilya Ganelin] [SPARK-4569] Rename 'externalSorting' in Aggregator
5b3f39c [Ilya Ganelin] [SPARK-4569] Rename in Aggregator
| |
The driver hangs sometimes when we coalesce RDD partitions. See JIRA for more details and reproduction.
This is because our use of empty string as default preferred location in `CoalescedRDDPartition` causes the `TaskSetManager` to schedule the corresponding task on host `""` (empty string). The intended semantics here, however, is that the partition does not have a preferred location, and the TSM should schedule the corresponding task accordingly.
Author: Andrew Or <andrew@databricks.com>
Closes #3633 from andrewor14/coalesce-preferred-loc and squashes the following commits:
e520d6b [Andrew Or] Oops
3ebf8bd [Andrew Or] A few comments
f370a4e [Andrew Or] Fix tests
2f7dfb6 [Andrew Or] Avoid using empty string as default preferred location
(cherry picked from commit 4f93d0cabe5d1fc7c0fd0a33d992fd85df1fecb4)
Signed-off-by: Andrew Or <andrew@databricks.com>
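The underlying modeling point — "no preference" should be a distinct value, not a sentinel empty string — can be sketched like this (hypothetical types, not Spark's actual `CoalescedRDDPartition`):

```Scala
// Sketch: model "no preferred location" explicitly with Option
// instead of overloading "" as a sentinel the scheduler misreads.
case class PartitionLoc(preferredLocation: Option[String])

val noPref = PartitionLoc(None)            // scheduler: place anywhere
val onHost = PartitionLoc(Some("host-1"))  // scheduler: prefer host-1
```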
| |
... by Piotr Kolaczkowski)
Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>
Closes #4113 from jacek-lewandowski/SPARK-4660-master and squashes the following commits:
a5e84ca [Jacek Lewandowski] SPARK-4660: Use correct class loader in JavaSerializer (copy of PR #3840 by Piotr Kolaczkowski)
(cherry picked from commit c93a57f0d6dc32b127aa68dbe4092ab0b22a9667)
Signed-off-by: Patrick Wendell <patrick@databricks.com>
| |
method
There is an int overflow in the ParallelCollectionRDD.slice method. This was originally reported by SaintBacchus.
```
sc.makeRDD(1 to (Int.MaxValue)).count // result = 0
sc.makeRDD(1 to (Int.MaxValue - 1)).count // result = 2147483646 = Int.MaxValue - 1
sc.makeRDD(1 until (Int.MaxValue)).count // result = 2147483646 = Int.MaxValue - 1
```
see https://github.com/apache/spark/pull/2874 for more details.
This PR tries to fix the overflow. However, there's another issue it doesn't address.
```
val largeRange = Int.MinValue to Int.MaxValue
largeRange.length // throws java.lang.IllegalArgumentException: -2147483648 to 2147483647 by 1: seqs cannot contain more than Int.MaxValue elements.
```
So the range we feed to sc.makeRDD cannot contain more than Int.MaxValue elements. This is a limitation of Scala. We may want to support that kind of range eventually, but the fix is beyond this PR.
srowen andrewor14 would you mind taking a look at this PR?
Author: Ye Xianjin <advancedxy@gmail.com>
Closes #4002 from advancedxy/SPARk-5201 and squashes the following commits:
96265a1 [Ye Xianjin] Update slice method comment and some responding docs.
e143d7a [Ye Xianjin] Update inclusive range check for splitting inclusive range.
b3f5577 [Ye Xianjin] We can include the last element in the last slice in general for inclusive range, hence eliminate the need to check Int.MaxValue or Int.MinValue.
7d39b9e [Ye Xianjin] Convert the two cases pattern matching to one case.
651c959 [Ye Xianjin] rename sign to needsInclusiveRange. add some comments
196f8a8 [Ye Xianjin] Add test cases for ranges end with Int.MaxValue or Int.MinValue
e66e60a [Ye Xianjin] Deal with inclusive and exclusive ranges in one case. If the range is inclusive and the end of the range is (Int.MaxValue or Int.MinValue), we should use inclusive range instead of exclusive
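The usual way to dodge this class of overflow — and roughly what the fix does — is to compute slice boundaries in `Long` arithmetic rather than counting range elements with `Int`s (a sketch under that assumption, not the exact Spark code):

```Scala
// Sketch: overflow-safe slice boundaries. `i * length` promotes to Long,
// so boundaries stay correct even when length is close to Int.MaxValue.
def positions(length: Long, numSlices: Int): Iterator[(Long, Long)] =
  (0 until numSlices).iterator.map { i =>
    val start = (i * length) / numSlices
    val end = ((i + 1) * length) / numSlices
    (start, end)
  }
```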
| |
Currently Spark lets you set the IP address using SPARK_LOCAL_IP, but this is given to Akka only after a reverse DNS lookup. That makes it difficult to run Spark in Docker. You can already change the hostname that is used programmatically, but it would be nice to be able to do this with an environment variable as well.
Author: Michael Armbrust <michael@databricks.com>
Closes #3893 from marmbrus/localHostnameEnv and squashes the following commits:
85045b6 [Michael Armbrust] Optionally read from SPARK_LOCAL_HOSTNAME
(cherry picked from commit a3978f3e156e0ca67e978f1795b238ddd69ff9a6)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
| |
CompressedMapStatus and HighlyCompressedMapStatus need to be registered with Kryo, because they are subclasses of MapStatus.
Author: lianhuiwang <lianhuiwang09@gmail.com>
Closes #4007 from lianhuiwang/SPARK-5102 and squashes the following commits:
9d2238a [lianhuiwang] remove register of MapStatus
05a285d [lianhuiwang] subclass of MapStatus needs to be registered with Kryo
(cherry picked from commit ef9224e08010420b570c21a0b9208d22792a24fe)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
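The gotcha generalizes: Kryo registration is per concrete class, not per supertype, so registering a parent type does not cover its subclasses. A generic sketch (illustrative class names, not Spark's):

```Scala
import com.esotericsoftware.kryo.Kryo

sealed trait Status
class Compressed extends Status
class HighlyCompressed extends Status

// Registering Status alone would NOT cover its subclasses; each class
// that can actually appear on the wire must be registered itself.
def registerClasses(kryo: Kryo): Unit = {
  kryo.register(classOf[Compressed])
  kryo.register(classOf[HighlyCompressed])
}
```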
| |
A few changes to fix this issue:
1. Handle the case that receiving `SparkListenerTaskStart` before `SparkListenerBlockManagerAdded`.
2. Don't add `executorId` to `removeTimes` when the executor is busy.
3. Use `HashMap.retain` to safely traverse the HashMap and remove items.
4. Use the same lock in ExecutorAllocationManager and ExecutorAllocationListener to fix the race condition in `totalPendingTasks`.
5. Move the blocking codes out of the message processing code in YarnSchedulerActor.
Author: zsxwing <zsxwing@gmail.com>
Closes #3783 from zsxwing/SPARK-4951 and squashes the following commits:
d51fa0d [zsxwing] Add comments
2e365ce [zsxwing] Remove expired executors from 'removeTimes' and add idle executors back when a new executor joins
49f61a9 [zsxwing] Eliminate duplicate executor registered warnings
d4c4e9a [zsxwing] Minor fixes for the code style
05f6238 [zsxwing] Move the blocking codes out of the message processing code
105ba3a [zsxwing] Fix the race condition in totalPendingTasks
d5c615d [zsxwing] Fix the issue that a busy executor may be killed
(cherry picked from commit 6942b974adad396cba2799eac1fa90448cea4da7)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
remaining even if application finished when external shuffle is enabled
When the external shuffle service is enabled, the local directories of a client-mode driver remain even after the application has finished.
I think the driver's local directories should be deleted.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #3811 from sarutak/SPARK-4973 and squashes the following commits:
ad944ab [Kousuke Saruta] Fixed DiskBlockManager to cleanup local directory if it's the driver
43770da [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
88feecd [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-4973
d99718e [Kousuke Saruta] Fixed SparkSubmit.scala and DiskBlockManager.scala in order to delete local directories of the driver of local-mode when external shuffle service is enabled
(cherry picked from commit a00af6bec57b8df8b286aaa5897232475aef441c)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
SPARK-5132:
stageInfoToJson: Stage Attempt Id
stageInfoFromJson: Attempt Id
Author: hushan[胡珊] <hushan@xiaomi.com>
Closes #3932 from suyanNone/json-stage and squashes the following commits:
41419ab [hushan[胡珊]] Correct stage Attempt Id key in stageInfofromJson
(cherry picked from commit d345ebebd554ac3faa4e870bd7800ed02e89da58)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
...nt at all.
- Narrowed the scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask
- See the Jira Issue for more details.
Author: Jongyoul Lee <jongyoul@gmail.com>
Closes #3741 from jongyoul/SPARK-4465 and squashes the following commits:
46ad71e [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed unused import
3d6631f [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - Removed comments and adjusted indentations
2343f13 [Jongyoul Lee] [SPARK-4465] runAsSparkUser doesn't affect TaskRunner in Mesos environment at all. - fixed a scope of runAsSparkUser from MesosExecutorDriver.run to MesosExecutorBackend.launchTask
(cherry picked from commit 1c0e7ce056c79e1db96f85b8c56a479b8b043970)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.
Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists. SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.
In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times. In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions. When output spec. validation is enabled, the second calls to these actions will fail due to existing output.
This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler. This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:
36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
(cherry picked from commit 939ba1f8f6e32fef9026cc43fce55b36e4b9bfd1)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
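The propagation mechanism is worth a sketch (hypothetical object name; Spark's real flag lives elsewhere): `scala.util.DynamicVariable` gives a thread-local, dynamically scoped value, so the scheduler can disable validation for exactly the jobs it submits, without mutating SparkConf or any global state.

```Scala
import scala.util.DynamicVariable

// Sketch of dynamically scoped bypass of output-spec validation.
object OutputSpec {
  val disableValidation = new DynamicVariable[Boolean](false)

  def validationEnabled: Boolean = !disableValidation.value
}

// The streaming scheduler would wrap its job submission like this;
// code outside the withValue block still sees validation enabled.
def submitWithBypass(job: () => Unit): Unit =
  OutputSpec.disableValidation.withValue(true) {
    job()  // checkOutputSpecs is skipped for work submitted here
  }
```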
| |
Author: Dale <tigerquoll@outlook.com>
Closes #3809 from tigerquoll/SPARK-4787 and squashes the following commits:
5661e01 [Dale] [SPARK-4787] Ensure that call to stop() doesn't lose the exception by using a finally block.
2172578 [Dale] [SPARK-4787] Stop context properly if an exception occurs during DAGScheduler initialization.
(cherry picked from commit 3fddc9468fa50e7683caa973fec6c52e1132268d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
The job launched by DriverSuite should bind the web UI to an ephemeral port, since it looks like port contention in this test has caused a large number of Jenkins failures when many builds are started simultaneously. Our tests already disable the web UI, but this doesn't affect subprocesses launched by our tests. In this case, I've opted to bind to an ephemeral port instead of disabling the UI because disabling features in this test may mask its ability to catch certain bugs.
See also: e24d3a9
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3873 from JoshRosen/driversuite-webui-port and squashes the following commits:
48cd05c [Josh Rosen] [HOTFIX] Bind web UI to ephemeral port in DriverSuite.
(cherry picked from commit 012839807c3dc6e7c8c41ac6e956d52a550bb031)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
This should fix a major cause of build breaks when running many parallel tests.
| |
Resolves a bug where the `Main-Class` from a .jar file wasn't being read in properly. This was caused by the fact that the `primaryResource` object was a URI and needed to be normalized through a call to `.getPath` before it could be passed into the `JarFile` object.
Author: Brennon York <brennon.york@capitalone.com>
Closes #3561 from brennonyork/SPARK-4298 and squashes the following commits:
5e0fce1 [Brennon York] Use string interpolation for error messages, moved comment line from original code to above its necessary code segment
14daa20 [Brennon York] pushed mainClass assignment into match statement, removed spurious spaces, removed { } from case statements, removed return values
c6dad68 [Brennon York] Set case statement to support multiple jar URI's and enabled the 'file' URI to load the main-class
8d20936 [Brennon York] updated to reset the error message back to the default
a043039 [Brennon York] updated to split the uri and jar vals
8da7cbf [Brennon York] fixes SPARK-4298
(cherry picked from commit 8e14c5eb551ab06c94859c7f6d8c6b62b4d00d59)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
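The URI-vs-path distinction at the heart of the bug is easy to see in isolation (path shown is a made-up example):

```Scala
import java.net.URI

// JarFile wants a filesystem path, not a URI string.
val primaryResource = new URI("file:/opt/app/main.jar")
val asString = primaryResource.toString  // "file:/opt/app/main.jar" — JarFile can't open this
val asPath = primaryResource.getPath     // "/opt/app/main.jar" — this is what JarFile needs
```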
| |
Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.
This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself).
For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and to automatically restores them on test completion / failure. See the block comment at the top of the ResetSystemProperties class for more details.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits:
0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools
3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext
4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties
4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering.
0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite.
7a3d224 [Josh Rosen] Fix trait ordering
3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite
bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite
655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite
3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite
cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite
8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait.
633a84a [Josh Rosen] Remove use of system properties in FileServerSuite
25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite
1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite
dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite
b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite
e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite
5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite
0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite
c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite
51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite
60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite
14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite
628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite
9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite.
4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
(cherry picked from commit 352ed6bbe3c3b67e52e298e7c535ae414d96beca)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
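The snapshot/restore idea behind the mixin can be sketched without ScalaTest (names here are illustrative; the real trait hooks into `beforeEach`/`afterEach`):

```Scala
import java.util.Properties

// Sketch: snapshot system properties before a test, restore after,
// so no test can leak property changes into later tests.
trait ResetSystemProperties {
  private var saved: Properties = _

  def snapshotProperties(): Unit = {
    val copy = new Properties()
    copy.putAll(System.getProperties)  // clone — don't alias the live object
    saved = copy
  }

  def restoreProperties(): Unit = System.setProperties(saved)
}
```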
| |
KryoSerializer
This PR fixes an issue where PySpark broadcast variables caused NullPointerExceptions if KryoSerializer was used. The fix is to register PythonBroadcast with Kryo so that it's deserialized with a KryoJavaSerializer.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #3831 from JoshRosen/SPARK-4882 and squashes the following commits:
0466c7a [Josh Rosen] Register PythonBroadcast with Kryo.
d5b409f [Josh Rosen] Enable registrationRequired, which would have caught this bug.
069d8a7 [Josh Rosen] Add failing test for SPARK-4882
(cherry picked from commit efa80a531ecd485f6cf0cdc24ffa42ba17eea46d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Author: Zhang, Liye <liye.zhang@intel.com>
Closes #3769 from liyezhang556520/spark-4920_WebVersion and squashes the following commits:
3bb7e0d [Zhang, Liye] add version on master and worker page
(cherry picked from commit 9077e721cd36adfecd50cbd1fd7735d28e5be8b5)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
takeOrdered should skip the reduce step in case the mapped RDDs have no partitions. This prevents the exception below, hit at step 4 of the JIRA reproduction when running the query:
SELECT * FROM testTable WHERE market = 'market2' ORDER BY End_Time DESC LIMIT 100;
Error trace:
java.lang.UnsupportedOperationException: empty collection
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:863)
at scala.Option.getOrElse(Option.scala:120)
at org.apache.spark.rdd.RDD.reduce(RDD.scala:863)
at org.apache.spark.rdd.RDD.takeOrdered(RDD.scala:1136)
Author: Yash Datta <Yash.Datta@guavus.com>
Closes #3830 from saucam/fix_takeorder and squashes the following commits:
5974d10 [Yash Datta] SPARK-4968: takeOrdered to skip reduce step in case mappers return no partitions
(cherry picked from commit 9bc0df6804f241aff24520d9c6ec54d9b11f5785)
Signed-off-by: Reynold Xin <rxin@databricks.com>
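In simplified form (plain collections standing in for RDD partitions — not the actual RDD code), the fix amounts to guarding the reduce against an empty input:

```Scala
// Sketch: per-partition top-N, then merge — but return empty instead of
// calling reduce on an empty collection (which throws
// UnsupportedOperationException: empty collection).
def takeOrdered[T: Ordering](partitions: Seq[Seq[T]], num: Int): Seq[T] = {
  val perPartition = partitions.map(_.sorted.take(num)).filter(_.nonEmpty)
  if (perPartition.isEmpty) Seq.empty
  else perPartition.reduce((a, b) => (a ++ b).sorted.take(num))
}
```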
| |
SparkEnv.environmentDetails
Author: GuoQiang Li <witgo@qq.com>
Closes #3788 from witgo/SPARK-4952 and squashes the following commits:
d903529 [GuoQiang Li] Handle ConcurrentModificationExceptions in SparkEnv.environmentDetails
(cherry picked from commit 080ceb771a1e6b9f844cfd4f1baa01133c106888)
Signed-off-by: Patrick Wendell <pwendell@gmail.com>
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #3460 from vanzin/SPARK-4606 and squashes the following commits:
031207d [Marcelo Vanzin] [SPARK-4606] Send EOF to child JVM when there's no more data to read.
(cherry picked from commit 7e2deb71c4239564631b19c748e95c3d1aa1c77d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
See https://issues.apache.org/jira/browse/SPARK-4730.
Author: Andrew Or <andrew@databricks.com>
Closes #3590 from andrewor14/yarn-settings and squashes the following commits:
36e0753 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-settings
dcd1316 [Andrew Or] Warn against deprecated YARN settings
(cherry picked from commit 27c5399f4dd542e36ea579956b8cb0613de25c6d)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
Commit 7aacb7bfa added support for sharing downloaded files among multiple
executors of the same app. That works great in Yarn, since the app's directory
is cleaned up after the app is done.
But Spark standalone mode didn't do that, so the lock/cache files created
by that change were left around and could eventually fill up the disk hosting
/tmp.
To solve that, create app-specific directories under the local dirs when
launching executors. Multiple executors launched by the same Worker will
use the same app directories, so they should be able to share the downloaded
files. When the application finishes, a new message is sent to all workers
telling them the application has finished; once that message has been received,
and all executors registered for the application shut down, then those
directories will be cleaned up by the Worker.
Note: Unit testing this is hard (if even possible), since local-cluster mode
doesn't seem to leave the Master/Worker daemons running long enough after
`sc.stop()` is called for the clean up protocol to take effect.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #3705 from vanzin/SPARK-4834 and squashes the following commits:
b430534 [Marcelo Vanzin] Remove seemingly unnecessary synchronization.
50eb4b9 [Marcelo Vanzin] Review feedback.
c0e5ea5 [Marcelo Vanzin] [SPARK-4834] [standalone] Clean up application files after app finishes.
(cherry picked from commit dd155369a04d7dfbf6a5745cbb243e22218367dc)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
In Scala, `map` and `flatMap` on an `Iterable` eagerly copy the contents into a new `Seq`. For example:
```Scala
val iterable = Seq(1, 2, 3).map(v => {
println(v)
v
})
println("Iterable map done")
val iterator = Seq(1, 2, 3).iterator.map(v => {
println(v)
v
})
println("Iterator map done")
```
outputs
```
1
2
3
Iterable map done
Iterator map done
```
So we should use `iterator` to reduce the memory consumed by join.
Found by Johannes Simon in http://mail-archives.apache.org/mod_mbox/spark-user/201412.mbox/%3C5BE70814-9D03-4F61-AE2C-0D63F2DE4446%40mail.de%3E
Author: zsxwing <zsxwing@gmail.com>
Closes #3671 from zsxwing/SPARK-4824 and squashes the following commits:
48ee7b9 [zsxwing] Remove the explicit types
95d59d6 [zsxwing] Add 'iterator' to reduce memory consumed by join
(cherry picked from commit c233ab3d8d75a33495298964fe73dbf7dd8fe305)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
| |
It is not convenient to see the Spark version. We can keep the same style as the Spark website.
![spark_version](https://cloud.githubusercontent.com/assets/7402327/5527025/1c8c721c-8a35-11e4-8d6a-2734f3c6bdf8.jpg)
Author: genmao.ygm <genmao.ygm@alibaba-inc.com>
Closes #3763 from uncleGen/master-clean-141222 and squashes the following commits:
0dcb9a9 [genmao.ygm] [SPARK-4920][UI]:current spark version in UI is not striking.
(cherry picked from commit de9d7d2b5b6d80963505571700e83779fd98f850)
Signed-off-by: Andrew Or <andrew@databricks.com>
| |
backport #3740 for branch-1.2
Author: zsxwing <zsxwing@gmail.com>
Closes #3758 from zsxwing/SPARK-2075-branch-1.2 and squashes the following commits:
b57d440 [zsxwing] SPARK-2075 backport for branch-1.2
| |
and Hadoop 2.+
`NullWritable` is a `Comparable` rather than `Comparable[NullWritable]` in Hadoop 1.+, so the compiler cannot find an implicit Ordering for it. It will generate different anonymous classes for `saveAsTextFile` in Hadoop 1.+ and Hadoop 2.+. Therefore, here we provide an Ordering for NullWritable so that the compiler will generate same codes.
I used the following commands to confirm the generated byte codes are the same.
```
mvn -Dhadoop.version=1.2.1 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop1.txt
mvn -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 -DskipTests clean package -pl core -am
javap -private -c -classpath core/target/scala-2.10/classes org.apache.spark.rdd.RDD > ~/hadoop2.txt
diff ~/hadoop1.txt ~/hadoop2.txt
```
However, the compiler will generate different codes for the classes which call methods of `JobContext/TaskAttemptContext`. `JobContext/TaskAttemptContext` is a class in Hadoop 1.+, and calling its method will use `invokevirtual`, while it's an interface in Hadoop 2.+, and will use `invokeinterface`.
To fix it, we can use reflection to call `JobContext/TaskAttemptContext.getConfiguration`.
Author: zsxwing <zsxwing@gmail.com>
Closes #3740 from zsxwing/SPARK-2075 and squashes the following commits:
39d9df2 [zsxwing] Fix the code style
e4ad8b5 [zsxwing] Use null for the implicit Ordering
734bac9 [zsxwing] Explicitly set the implicit parameters
ca03559 [zsxwing] Use reflection to access JobContext/TaskAttemptContext.getConfiguration
fa40db0 [zsxwing] Add an Ordering for NullWritable to make the compiler generate same byte codes for RDD
(cherry picked from commit 6ee6aa70b7d52408cc66bd1434cbeae3212e3f01)
Signed-off-by: Reynold Xin <rxin@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Mvn build failed: value defaultProperties not found. Maybe related to this PR:
https://github.com/apache/spark/commit/1d648123a77bbcd9b7a34cc0d66c14fa85edfecd
andrewor14 can you look at this problem?
Author: huangzhaowei <carlmartinmax@gmail.com>
Closes #3749 from SaintBacchus/Mvn-Build-Fail and squashes the following commits:
8e2917c [huangzhaowei] Build Failed: value defaultProperties not found
(cherry picked from commit a764960b3b6d842eef7fa4777c8fa99d3f60fa1e)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Since we can set the spark executor memory and executor cores via the properties file, we should also be allowed to set the number of executor instances.
Author: Kanwaljit Singh <kanwaljit.singh@guavus.com>
Closes #1657 from kjsingh/branch-1.0 and squashes the following commits:
d8a5a12 [Kanwaljit Singh] SPARK-2641: Fixing how spark arguments are loaded from properties file for num executors
Conflicts:
core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala
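With this fix, the executor count can sit alongside the other executor settings in the properties file (e.g. `spark-defaults.conf`); the values below are illustrative:

```
spark.executor.memory     4g
spark.executor.cores      2
spark.executor.instances  8
```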
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #2848 from ryan-williams/fetch-file and squashes the following commits:
c14daff [Ryan Williams] Fix copy that was changed to a move inadvertently
8e39c16 [Ryan Williams] code review feedback
788ed41 [Ryan Williams] don’t redundantly overwrite executor JAR deps
(cherry picked from commit 7981f969762e77f1752ef8f86c546d4fc32a1a4f)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Ryan Williams <ryan.blake.williams@gmail.com>
Closes #3736 from ryan-williams/hist and squashes the following commits:
421d8ff [Ryan Williams] add another random typo fix
76d6a4c [Ryan Williams] remove hdfs example
a2d0f82 [Ryan Williams] code review feedback
9ca7629 [Ryan Williams] [SPARK-4889] update history server example cmds
(cherry picked from commit cdb2c645ab769a8678dd81cff44a809fcfa4420b)
Signed-off-by: Andrew Or <andrew@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
| |
Author: Sandy Ryza <sandy@cloudera.com>
Closes #3684 from sryza/sandy-spark-3428 and squashes the following commits:
cb827fe [Sandy Ryza] SPARK-3428. TaskMetrics for running tasks is missing GC time metrics
(cherry picked from commit 283263ffaa941e7e9ba147cf0ad377d9202d3761)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is such that the `ExecutorAllocationManager` does not take the `SparkContext`, with all of its dependencies, as an argument. This prevents future developers from tying this class down further to the `SparkContext`, which has really become quite a monstrous object.
cc'ing pwendell who originally suggested this, and JoshRosen who may have thoughts about the trait mix-in style of `SparkContext`.
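The shape of the refactor, sketched with a stand-in context class (the real trait's method signatures may differ):

```scala
// The narrow interface the manager is allowed to see.
trait ExecutorAllocationClient {
  def requestExecutors(numAdditionalExecutors: Int): Boolean
  def killExecutors(executorIds: Seq[String]): Boolean
}

// Stand-in for SparkContext: it mixes the trait in and keeps its many
// other responsibilities out of the manager's reach.
class DemoSparkContext extends ExecutorAllocationClient {
  override def requestExecutors(numAdditionalExecutors: Int): Boolean = true
  override def killExecutors(executorIds: Seq[String]): Boolean = true
}

// The manager depends only on the trait, not on the whole context.
class ExecutorAllocationManager(client: ExecutorAllocationClient) {
  def addExecutors(n: Int): Boolean = client.requestExecutors(n)
}
```

Because the constructor accepts any `ExecutorAllocationClient`, the manager can also be unit-tested against a stub without standing up a real context.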
Author: Andrew Or <andrew@databricks.com>
Closes #3614 from andrewor14/dynamic-allocation-sc and squashes the following commits:
187070d [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
59baf6c [Andrew Or] Merge branch 'master' of github.com:apache/spark into dynamic-allocation-sc
347a348 [Andrew Or] Refactor SparkContext into ExecutorAllocationClient
(cherry picked from commit 9804a759b68f56eceb8a2f4ea90f76a92b5f9f67)
Signed-off-by: Andrew Or <andrew@databricks.com>
Conflicts:
core/src/main/scala/org/apache/spark/SparkContext.scala
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is used in NioBlockTransferService here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/network/nio/NioBlockTransferService.scala#L66
Author: Aaron Davidson <aaron@databricks.com>
Closes #3688 from aarondav/SPARK-4837 and squashes the following commits:
ebd2007 [Aaron Davidson] [SPARK-4837] NettyBlockTransferService should use spark.blockManager.port config
(cherry picked from commit 105293a7d06b26e7b179a0447eb802074ee9c218)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Rewording was based on this discussion: http://apache-spark-developers-list.1001551.n3.nabble.com/RDD-data-flow-td9804.html
This is the associated JIRA ticket: https://issues.apache.org/jira/browse/SPARK-4884
Author: Madhu Siddalingaiah <madhu@madhu.com>
Closes #3722 from msiddalingaiah/master and squashes the following commits:
79e679f [Madhu Siddalingaiah] [DOC]: improve documentation
51d14b9 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
38faca4 [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
cbccbfe [Madhu Siddalingaiah] Documentation: replace <b> with <code> (again)
332f7a2 [Madhu Siddalingaiah] Documentation: replace <b> with <code>
cd2b05a [Madhu Siddalingaiah] Merge remote-tracking branch 'upstream/master'
0fc12d7 [Madhu Siddalingaiah] Documentation: add description for repartitionAndSortWithinPartitions
(cherry picked from commit d5a596d4188bfa85ff49ee85039f54255c19a4de)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This looked like perhaps a simple and important one. `combineByKey` looks like it should clean its arguments' closures, and that in turn covers apparently all remaining functions in `PairRDDFunctions` which delegate to it.
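Why cleaning matters, in a self-contained sketch with no Spark types: `Holder` is a made-up class standing in for any non-serializable enclosing object, and the "clean" variant shows by hand what `ClosureCleaner` does automatically (dropping the unneeded reference to the outer object).

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A function literal that reads a field captures `this`, so the closure
// drags the whole (non-serializable) enclosing object along with it.
class Holder { // note: not Serializable
  val bonus = 10
  val dirty: Int => Int = x => x + bonus                 // captures `this`
  val clean: Int => Int = { val b = bonus; x => x + b }  // captures only an Int
}

object ClosureDemo {
  // True iff the value survives Java serialization, which is what Spark
  // needs before shipping a closure to executors.
  def serializable(f: AnyRef): Boolean =
    try {
      new ObjectOutputStream(new ByteArrayOutputStream).writeObject(f)
      true
    } catch { case _: NotSerializableException => false }

  def main(args: Array[String]): Unit = {
    val h = new Holder
    println(serializable(h.dirty))
    println(serializable(h.clean))
  }
}
```

Without cleaning, the `dirty` form fails at serialization time; with it, only the captured primitive is shipped.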
Author: Sean Owen <sowen@cloudera.com>
Closes #3690 from srowen/SPARK-785 and squashes the following commits:
8df68fe [Sean Owen] Clean context of most remaining functions in PairRDDFunctions, which ultimately call combineByKey
(cherry picked from commit 2a28bc61009a170af3853c78f7f36970898a6d56)
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
|