| author | Josh Rosen <joshrosen@databricks.com> | 2015-08-18 22:30:13 -0700 |
|---|---|---|
| committer | Reynold Xin <rxin@databricks.com> | 2015-08-18 22:30:13 -0700 |
| commit | 010b03ed52f35fd4d426d522f8a9927ddc579209 (patch) | |
| tree | bc0cc0d53ccde1ca78819ae635746d1993c2ad81 /streaming/src/main | |
| parent | 1c843e284818004f16c3f1101c33b510f80722e3 (diff) | |
[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):
```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)     // Seq.fill returns a List here
for (i <- 0 until numItems) s(i)  // each s(i) is O(i) on a List
```
It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.
This patch fixes this by replacing `Seq` with `Array`.
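As a rough illustration of why the fix works (this is a standalone sketch, not the actual `DAGScheduler` code): `Array.apply(i)` is a constant-time indexed access, so the same loop that was quadratic over a `List` becomes linear over an `Array`:

```scala
// Sketch: the slow loop from above, but backed by an Array.
// Array indexing is O(1), so the whole loop is O(numItems)
// instead of O(numItems^2).
object ArrayIndexingDemo {
  def main(args: Array[String]): Unit = {
    val numItems = 100000
    val a = Array.fill(numItems)(1)  // contiguous, constant-time indexing
    var sum = 0
    for (i <- 0 until numItems) sum += a(i)  // O(1) per access
    println(sum)
  }
}
```

The same trade-off applies anywhere a collection built once is then indexed repeatedly: a `List` is cheap to build and prepend to, but any code that indexes it inside a loop should convert it to an `Array` (or `IndexedSeq`) first.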
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8178 from JoshRosen/dagscheduler-perf.