author    Josh Rosen <joshrosen@databricks.com>  2015-08-18 22:30:13 -0700
committer Reynold Xin <rxin@databricks.com>  2015-08-18 22:30:13 -0700
commit    010b03ed52f35fd4d426d522f8a9927ddc579209 (patch)
tree      bc0cc0d53ccde1ca78819ae635746d1993c2ad81 /streaming/src/main
parent    1c843e284818004f16c3f1101c33b510f80722e3 (diff)
download  spark-010b03ed52f35fd4d426d522f8a9927ddc579209.tar.gz
          spark-010b03ed52f35fd4d426d522f8a9927ddc579209.tar.bz2
          spark-010b03ed52f35fd4d426d522f8a9927ddc579209.zip
[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):

```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)
for (i <- 0 until numItems) s(i)
```

It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations, we implicitly perform an N^2 loop, reducing scheduling throughput.

This patch fixes this by replacing `Seq` with `Array`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8178 from JoshRosen/dagscheduler-perf.
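The access pattern the patch fixes can be sketched in isolation. This is not Spark's actual code: `CacheLocsSketch`, `getCacheLocs(numPartitions)`, and `countCachedPartitions` are hypothetical stand-ins that mimic the `getCacheLocs(rdd)(partition)` lookup repeated once per partition, with the return type already switched to `Array` so each index is O(1):

```scala
// Hypothetical sketch (not Spark's DAGScheduler): shows why returning Array
// instead of Seq/List avoids the accidental O(N^2) loop described above.
object CacheLocsSketch {
  // Stand-in for getCacheLocs: an Array gives O(1) indexing per partition,
  // whereas a List would cost O(partition index) per lookup.
  def getCacheLocs(numPartitions: Int): Array[Seq[String]] =
    Array.fill(numPartitions)(Seq("hostA"))

  // Mimics looping over all partitions to read their cache locations,
  // as getPreferredLocsInternal effectively does across tasks.
  def countCachedPartitions(numPartitions: Int): Int = {
    val locs = getCacheLocs(numPartitions)
    var cached = 0
    for (p <- 0 until numPartitions) {
      if (locs(p).nonEmpty) cached += 1 // O(1) Array index
    }
    cached
  }

  def main(args: Array[String]): Unit =
    println(countCachedPartitions(100000))
}
```

With `Array`, the whole loop is O(N); had `getCacheLocs` returned a `List`, the same loop would sum to O(N^2) lookups.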
Diffstat (limited to 'streaming/src/main')
0 files changed, 0 insertions, 0 deletions