| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR fixes some typos in the following documentation files.
* `NOTICE`, `configuration.md`, and `hardware-provisioning.md`.
## How was the this patch tested?
manual tests
Author: Dongjoon Hyun <dongjoonapache.org>
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11289 from dongjoon-hyun/minor_fix_typos_notice_and_confdoc.
|
|
|
|
|
|
|
|
| |
default is "python2.7"
Author: Christopher C. Aycock <chris@chrisaycock.com>
Closes #11239 from chrisaycock/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mechanism.
https://issues.apache.org/jira/browse/SPARK-11627
Spark Streaming backpressure mechanism has no initial input rate limit, it might cause OOM exception.
In the firest batch task ,receivers receive data at the maximum speed they can reach,it might exhaust executors memory resources. Add a initial input rate limit value can make sure the Streaming job execute success in the first batch,then the backpressure mechanism can adjust receiving rate adaptively.
Author: junhao <junhao@mogujie.com>
Closes #9593 from junhaoMg/junhao-dev.
|
|
|
|
|
|
|
|
|
|
|
| |
This JIRA is related to
https://github.com/apache/spark/pull/5852
Had to do some minor rework and test to make sure it
works with current version of spark.
Author: Sanket <schintap@untilservice-lm>
Closes #10838 from redsanket/limit-outbound-connections.
|
|
|
|
|
|
|
|
|
|
| |
Remove spark.closure.serializer option and use JavaSerializer always
CC andrewor14 rxin I see there's a discussion in the JIRA but just thought I'd offer this for a look at what the change would be.
Author: Sean Owen <sowen@cloudera.com>
Closes #11150 from srowen/SPARK-12414.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
grained mesos mode.
This is the next iteration of tnachen's previous PR: https://github.com/apache/spark/pull/4027
In that PR, we resolved with andrewor14 and pwendell to implement the Mesos scheduler's support of `spark.executor.cores` to be consistent with YARN and Standalone. This PR implements that resolution.
This PR implements two high-level features. These two features are co-dependent, so they're implemented both here:
- Mesos support for spark.executor.cores
- Multiple executors per slave
We at Mesosphere have been working with Typesafe on a Spark/Mesos integration test suite: https://github.com/typesafehub/mesos-spark-integration-tests, which passes for this PR.
The contribution is my original work and I license the work to the project under the project's open source license.
Author: Michael Gummelt <mgummelt@mesosphere.io>
Closes #10993 from mgummelt/executor_sizing.
|
|
|
|
|
|
| |
Author: Bill Chambers <bill@databricks.com>
Closes #11094 from anabranch/dynamic-docs.
|
|
|
|
|
|
|
|
|
|
| |
dir with mesos conf and add docs.
Fix zookeeper dir configuration used in cluster mode, and also add documentation around these settings.
Author: Timothy Chen <tnachen@gmail.com>
Closes #10057 from tnachen/fix_mesos_dir.
|
|
|
|
|
|
|
|
|
|
| |
cluster mode
JIRA 1680 added a property called spark.yarn.appMasterEnv. This PR draws users' attention to this special case by adding an explanation in configuration.html#environment-variables
Author: Andrew <weiner.andrew.j@gmail.com>
Closes #10869 from weineran/branch-yarn-docs.
|
|
|
|
|
|
|
|
|
|
|
|
| |
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth to keep this config because using `DirectTaskResult` or `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10854 from zsxwing/remove-akka.
|
|
|
|
|
|
|
|
|
|
| |
properties
Several Spark properties equivalent to Spark submit command line options are missing.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10491 from felixcheung/sparksubmitdoc.
|
|
|
|
|
|
|
|
|
| |
Author: scwf <wangfei1@huawei.com>
Author: Marcelo Vanzin <vanzin@cloudera.com>
Author: WangTaoTheTonic <wangtao111@huawei.com>
Author: w00228970 <wangfei1@huawei.com>
Closes #10238 from vanzin/SPARK-2750.
|
|
|
|
|
|
|
|
|
|
| |
allowBatching configurations for Streaming
/cc tdas brkyvz
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10453 from zsxwing/streaming-conf.
|
|
|
|
|
|
|
|
| |
modify 'spark.memory.offHeap.enabled' default value to false
Author: zzcclp <xm_zzc@sina.com>
Closes #10633 from zzcclp/fix_spark.memory.offHeap.enabled_default_value.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted / deleted, leading to hard to diagnose bugs.
For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timetsamps in hashmaps, and a handful fewer threads.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
|
|
|
|
|
|
|
|
| |
We switched to TorrentBroadcast in Spark 1.1, and HttpBroadcast has been undocumented since then. It's time to remove it in Spark 2.0.
Author: Reynold Xin <rxin@databricks.com>
Closes #10531 from rxin/SPARK-12588.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
According the benchmark [1], LZ4-java could be 80% (or 30%) faster than Snappy.
After changing the compressor to LZ4, I saw 20% improvement on end-to-end time for a TPCDS query (Q4).
[1] https://github.com/ning/jvm-compressor-benchmark/wiki
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #10342 from davies/lz4.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.
davies Is this inconsistency intentional? Thanks!
Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY.
Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.
Author: gatorsmile <gatorsmile@gmail.com>
Closes #10092 from gatorsmile/persistStorageLevel.
|
|
|
|
|
|
|
|
| |
Please help to review, thanks a lot.
Author: jerryshao <sshao@hortonworks.com>
Closes #10195 from jerryshao/SPARK-10123.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds documentation for Spark configurations that affect off-heap memory and makes some naming and validation improvements for those configs.
- Change `spark.memory.offHeapSize` to `spark.memory.offHeap.size`. This is fine because this configuration has not shipped in any Spark release yet (it's new in Spark 1.6).
- Deprecated `spark.unsafe.offHeap` in favor of a new `spark.memory.offHeap.enabled` configuration. The motivation behind this change is to gather all memory-related configurations under the same prefix.
- Add a check which prevents users from setting `spark.memory.offHeap.enabled=true` when `spark.memory.offHeap.size == 0`. After SPARK-11389 (#9344), which was committed in Spark 1.6, Spark enforces a hard limit on the amount of off-heap memory that it will allocate to tasks. As a result, enabling off-heap execution memory without setting `spark.memory.offHeap.size` will lead to immediate OOMs. The new configuration validation makes this scenario easier to diagnose, helping to avoid user confusion.
- Document these configurations on the configuration page.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #10237 from JoshRosen/SPARK-12251.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This avoids bringing up yet another HTTP server on the driver, and
instead reuses the file server already managed by the driver's
RpcEnv. As a bonus, the repl now inherits the security features of
the network library.
There's also a small change to create the directory for storing classes
under the root temp dir for the application (instead of directly
under java.io.tmpdir).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9923 from vanzin/SPARK-11563.
|
|
|
|
|
|
| |
Author: rotems <roter>
Closes #10078 from Botnaim/KryoMultipleCustomRegistrators.
|
|
|
|
|
|
|
|
|
|
| |
The existing `spark.memory.fraction` (default 0.75) gives the system 25% of the space to work with. For small heaps, this is not enough: e.g. default 1GB leaves only 250MB system memory. This is especially a problem in local mode, where the driver and executor are crammed in the same JVM. Members of the community have reported driver OOM's in such cases.
**New proposal.** We now reserve 300MB before taking the 75%. For 1GB JVMs, this leaves `(1024 - 300) * 0.75 = 543MB` for execution and storage. This is proposal (1) listed in the [JIRA](https://issues.apache.org/jira/browse/SPARK-12081).
Author: Andrew Or <andrew@databricks.com>
Closes #10081 from andrewor14/unified-memory-small-heaps.
|
|
|
|
|
|
| |
Author: Jeff Zhang <zjffdu@apache.org>
Closes #9956 from zjffdu/dev_typo.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change abstracts the code that serves jars / files to executors so that
each RpcEnv can have its own implementation; the akka version uses the existing
HTTP-based file serving mechanism, while the netty versions uses the new
stream support added to the network lib, which makes file transfers benefit
from the easier security configuration of the network library, and should also
reduce overhead overall.
The change includes a small fix to TransportChannelHandler so that it propagates
user events to downstream handlers.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #9530 from vanzin/SPARK-11140.
|
|
|
|
|
|
| |
Author: Andrew Or <andrew@databricks.com>
Closes #9676 from andrewor14/memory-management-docs.
|
|
|
|
|
|
|
|
|
|
|
| |
`<\code>` end tag missing backslash in
docs/configuration.md{L308-L339}
ref #8795
Author: Kai Jiang <jiangkai@gmail.com>
Closes #9715 from vectorijk/minor-typo-docs.
|
|
|
|
|
|
|
|
|
|
| |
Remove Hadoop third party distro page, and move Hadoop cluster config info to configuration page
CC pwendell
Author: Sean Owen <sowen@cloudera.com>
Closes #9298 from srowen/SPARK-11305.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add a new spark conf option "spark.sparkr.r.driver.command" to specify the executable for an R script in client modes.
The existing spark conf option "spark.sparkr.r.command" is used to specify the executable for an R script in cluster modes for both driver and workers. See also [launch R worker script](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/api/r/RRDD.scala#L395).
BTW, [envrionment variable "SPARKR_DRIVER_R"](https://github.com/apache/spark/blob/master/launcher/src/main/java/org/apache/spark/launcher/SparkSubmitCommandBuilder.java#L275) is used to locate R shell on the local host.
For your information, PYSPARK has two environment variables serving simliar purpose:
PYSPARK_PYTHON Python binary executable to use for PySpark in both driver and workers (default is `python`).
PYSPARK_DRIVER_PYTHON Python binary executable to use for PySpark in driver only (default is PYSPARK_PYTHON).
pySpark use the code [here](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/PythonRunner.scala#L41) to determine the python executable for a python script.
Author: Sun Rui <rui.sun@intel.com>
Closes #9179 from sun-rui/SPARK-10971.
|
|
|
|
|
|
|
|
| |
There's a lot of duplication between SortShuffleManager and UnsafeShuffleManager. Given that these now provide the same set of functionality, now that UnsafeShuffleManager supports large records, I think that we should replace SortShuffleManager's serialized shuffle implementation with UnsafeShuffleManager's and should merge the two managers together.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8829 from JoshRosen/consolidate-sort-shuffle-implementations.
|
|
|
|
|
|
|
|
|
|
| |
Add documentation for configuration:
- spark.sql.ui.retainedExecutions
- spark.streaming.ui.retainedBatches
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #9052 from pnpritchard/SPARK-11039.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch unifies the memory management of the storage and execution regions such that either side can borrow memory from each other. When memory pressure arises, storage will be evicted in favor of execution. To avoid regressions in cases where storage is crucial, we dynamically allocate a fraction of space for storage that execution cannot evict. Several configurations are introduced:
- **spark.memory.fraction (default 0.75)**: fraction of the heap space used for execution and storage. The lower this is, the more frequently spills and cached data eviction occur. The purpose of this config is to set aside memory for internal metadata, user data structures, and imprecise size estimation in the case of sparse, unusually large records.
- **spark.memory.storageFraction (default 0.5)**: size of the storage region within the space set aside by `spark.memory.fraction`. Cached data may only be evicted if total storage exceeds this region.
- **spark.memory.useLegacyMode (default false)**: whether to use the memory management that existed in Spark 1.5 and before. This is mainly for backward compatibility.
For a detailed description of the design, see [SPARK-10000](https://issues.apache.org/jira/browse/SPARK-10000). This patch builds on top of the `MemoryManager` interface introduced in #9000.
Author: Andrew Or <andrew@databricks.com>
Closes #9084 from andrewor14/unified-memory-manager.
|
|
|
|
|
|
|
|
| |
1.4 docs noted that the units were MB - i have assumed this is still the case
Author: admackin <admackin@users.noreply.github.com>
Closes #9025 from admackin/master.
|
|
|
|
|
|
| |
Author: Bin Wang <wbin00@gmail.com>
Closes #8898 from wb14123/doc.
|
|
|
|
|
|
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8803 from vanzin/SPARK-10676.
|
|
|
|
|
|
|
|
|
|
| |
* Backticks are processed properly in Spark Properties table
* Removed unnecessary spaces
* See http://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/running-on-yarn.html
Author: Jacek Laskowski <jacek.laskowski@deepsense.io>
Closes #8795 from jaceklaskowski/docs-yarn-formatting.
|
|
|
|
|
|
|
|
|
|
| |
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #8812 from rxin/SPARK-9808-1.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
implementing the sufficientResourcesRegistered method
spark.scheduler.minRegisteredResourcesRatio configuration parameter works for YARN mode but not for Mesos Coarse grained mode.
If the parameter specified default value of 0 will be set for spark.scheduler.minRegisteredResourcesRatio in base class and this method will always return true.
There are no existing test for YARN mode too. Hence not added test for the same.
Author: Akash Mishra <akash.mishra20@gmail.com>
Closes #8672 from SleepyThread/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
From JIRA:
Add documentation for tungsten-sort.
From the mailing list "I saw a new "spark.shuffle.manager=tungsten-sort" implemented in
https://issues.apache.org/jira/browse/SPARK-7081, but it can't be found its
corresponding description in
http://people.apache.org/~pwendell/spark-releases/spark-1.5.0-rc3-docs/configuration.html(Currenlty
there are only 'sort' and 'hash' two options)."
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8638 from holdenk/SPARK-10469-document-tungsten-sort.
|
|
|
|
|
|
|
|
|
|
| |
about rate limiting and backpressure
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8656 from tdas/SPARK-10492 and squashes the following commits:
986cdd6 [Tathagata Das] Added information on backpressure
|
|
|
|
|
|
|
|
| |
We introduced the Netty network module for shuffle in Spark 1.2, and has turned it on by default for 3 releases. The old ConnectionManager is difficult to maintain. If we merge the patch now, by the time it is released, it would be 1 yr for which ConnectionManager is off by default. It's time to remove it.
Author: Reynold Xin <rxin@databricks.com>
Closes #8161 from rxin/SPARK-9767.
|
|
|
|
|
|
| |
Author: Tom Graves <tgraves@yahoo-inc.com>
Closes #8585 from tgravescs/SPARK-10432.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
SPARK-4223.
Currently we support setting view and modify acls but you have to specify a list of users. It would be nice to support * meaning all users have access.
Manual tests to verify that: "*" works for any user in:
a. Spark ui: view and kill stage. Done.
b. Spark history server. Done.
c. Yarn application killing. Done.
Author: zhuol <zhuol@yahoo-inc.com>
Closes #8398 from zhuoliu/4223.
|
|
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-10315
this parameter is not used any longer and there is some mistake in the current document , should be 'akka.remote.watch-failure-detector.threshold'
Author: CodingCat <zhunansjtu@gmail.com>
Closes #8483 from CodingCat/SPARK_10315.
|
|
|
|
|
|
|
|
| |
cc JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #8245 from davies/python_doc.
|
|
|
|
|
|
|
|
| |
Deprecate NIO ConnectionManager in Spark 1.5.0, before removing it in Spark 1.6.0.
Author: Reynold Xin <rxin@databricks.com>
Closes #8162 from rxin/SPARK-9934.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Document spark.shuffle.service.{enabled,port}
CC sryza tgravescs
This is pretty minimal; is there more to say here about the service?
Author: Sean Owen <sowen@cloudera.com>
Closes #7991 from srowen/SPARK-9641 and squashes the following commits:
3bb946e [Sean Owen] Add link to docs for setup and config of external shuffle service
2302e01 [Sean Owen] Document spark.shuffle.service.{enabled,port}
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Worker
https://issues.apache.org/jira/browse/SPARK-9202
Author: CodingCat <zhunansjtu@gmail.com>
Closes #7714 from CodingCat/SPARK-9202 and squashes the following commits:
23977fb [CodingCat] add comments about why we don't synchronize finishedExecutors & finishedDrivers
dc9772d [CodingCat] addressing the comments
e125241 [CodingCat] stylistic fix
80bfe52 [CodingCat] fix JsonProtocolSuite
d7d9485 [CodingCat] styistic fix and respect insert ordering
031755f [CodingCat] add license info & stylistic fix
c3b5361 [CodingCat] test cases and docs
c557b3a [CodingCat] applications are fine
9cac751 [CodingCat] application is fine...
ad87ed7 [CodingCat] trimFinishedExecutorsAndDrivers
|
|
|
|
|
|
|
|
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #7651 from vanzin/SPARK-9327 and squashes the following commits:
2923e23 [Marcelo Vanzin] [SPARK-9327] [docs] Fix documentation about classpath config options.
|