| author | Josh Rosen <joshrosen@databricks.com> | 2015-08-18 22:30:13 -0700 |
|---|---|---|
| committer | Reynold Xin <rxin@databricks.com> | 2015-08-18 22:30:13 -0700 |
| commit | 010b03ed52f35fd4d426d522f8a9927ddc579209 (patch) | |
| tree | bc0cc0d53ccde1ca78819ae635746d1993c2ad81 /streaming/src/main | |
| parent | 1c843e284818004f16c3f1101c33b510f80722e3 (diff) | |
[SPARK-9952] Fix N^2 loop when DAGScheduler.getPreferredLocsInternal accesses cacheLocs
In Scala, `Seq.fill` always seems to return a List. Accessing a list by index is an O(N) operation. Thus, the following code will be really slow (~10 seconds on my machine):
```scala
val numItems = 100000
val s = Seq.fill(numItems)(1)     // Seq.fill returns a List here
for (i <- 0 until numItems) s(i)  // each s(i) is O(i) on a List
```
It turns out that we had a loop like this in DAGScheduler code, although it's a little tricky to spot. In `getPreferredLocsInternal`, there's a call to `getCacheLocs(rdd)(partition)`. The `getCacheLocs` call returns a Seq. If this Seq is a List and the RDD contains many partitions, then indexing into this list will cost O(partitions). Thus, when we loop over our tasks to compute their individual preferred locations we implicitly perform an N^2 loop, reducing scheduling throughput.
This patch fixes this by replacing `Seq` with `Array`.
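As a rough illustration of why the fix works (this is a standalone sketch, not the actual `DAGScheduler` code): `Array.apply(i)` is a constant-time indexed access, so the same loop that was quadratic over a `List` becomes linear over an `Array`:

```scala
// Sketch: the slow loop from above, but backed by an Array.
// Array indexing is O(1), so the whole loop is O(numItems)
// instead of O(numItems^2).
object ArrayIndexingDemo {
  def main(args: Array[String]): Unit = {
    val numItems = 100000
    val a = Array.fill(numItems)(1)  // contiguous, constant-time indexing
    var sum = 0
    for (i <- 0 until numItems) sum += a(i)  // O(1) per access
    println(sum)
  }
}
```

The same trade-off applies anywhere a collection built once is then indexed repeatedly: a `List` is cheap to build and prepend to, but any code that indexes it inside a loop should convert it to an `Array` (or `IndexedSeq`) first.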
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8178 from JoshRosen/dagscheduler-perf.