authorNezih Yigitbasi <nyigitbasi@netflix.com>2016-03-11 11:11:53 -0800
committerAndrew Or <andrew@databricks.com>2016-03-11 11:11:53 -0800
commitff776b2fc1cd4c571fd542dbf807e6fa3373cb34 (patch)
tree64e040c08ff39e3914130dc2fcb163fbf368f508 /python
parenteb650a81f14fa7bc665856397e19ddf1a92ca3c5 (diff)
[SPARK-13328][CORE] Poor read performance for broadcast variables with dynamic resource allocation
When dynamic resource allocation is enabled, fetching broadcast variables from removed executors was causing job failures. SPARK-9591 fixed this problem by trying all locations of a block before giving up. However, the locations of a block are retrieved only once from the driver in this process, and the entries in that list can be stale due to dynamic resource allocation. The situation gets worse on a large cluster, where the location list can contain several hundred entries, tens of which may be stale. What we have observed is that with the default settings of 3 max retries and 5s between retries (that is, 15s per location), reading a broadcast variable can take as long as ~17m (70 failed attempts * 15s/attempt).

Author: Nezih Yigitbasi <nyigitbasi@netflix.com>

Closes #11241 from nezihyigitbasi/SPARK-13328.
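The worst-case read time described in the commit message can be sketched with a small back-of-the-envelope calculation. This is an illustrative model only, not code from the patch; the variable names are hypothetical, and the retry/wait values are simply the defaults quoted above:

```python
# Hedged sketch: worst-case time to read a broadcast variable when the
# driver-supplied location list contains stale executor entries.
# All names here are illustrative, not Spark APIs.

max_retries = 3        # default max retries per location (from commit message)
retry_wait_s = 5       # default wait between retries, in seconds
stale_locations = 70   # stale entries observed on a large cluster

# Each stale location costs max_retries * retry_wait_s before we move on.
time_per_location_s = max_retries * retry_wait_s      # 15 s per stale location
worst_case_s = stale_locations * time_per_location_s  # total time wasted

print(worst_case_s)       # 1050 seconds
print(worst_case_s / 60)  # 17.5 minutes, matching the ~17m observed
```

This makes clear why a stale location list is so costly: the penalty scales linearly with the number of stale entries, each of which burns the full retry budget before the fetch moves on.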
Diffstat (limited to 'python')
0 files changed, 0 insertions, 0 deletions