diff options
author | Takuya UESHIN <ueshin@happy-camper.st> | 2016-05-12 12:36:18 -0700 |
---|---|---|
committer | Kay Ousterhout <kayousterhout@gmail.com> | 2016-05-12 12:36:18 -0700 |
commit | a57aadae84aca27e5f02ac0bd64fd0ea34a64b61 (patch) | |
tree | f940f5796a40423b2a68faa9f2f8d90c200d6ce6 /R/pkg/inst/tests | |
parent | 81e3bfc16c6cfbf2f9f2c6c32ed651b8450795ba (diff) | |
download | spark-a57aadae84aca27e5f02ac0bd64fd0ea34a64b61.tar.gz spark-a57aadae84aca27e5f02ac0bd64fd0ea34a64b61.tar.bz2 spark-a57aadae84aca27e5f02ac0bd64fd0ea34a64b61.zip |
[SPARK-13902][SCHEDULER] Make DAGScheduler not to create duplicate stage.
## What changes were proposed in this pull request?
`DAGScheduler`sometimes generate incorrect stage graph.
Suppose you have the following DAG:
```
[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
\ /
<-------------
```
Note: [] means an RDD, () means a shuffle dependency.
Here, RDD `B` has a shuffle dependency on RDD `A`, and RDD `C` has shuffle dependency on both `B` and `A`. The shuffle dependency IDs are numbers in the `DAGScheduler`, but to make the example easier to understand, let's call the shuffled data from `A` shuffle dependency ID `s_A` and the shuffled data from `B` shuffle dependency ID `s_B`.
The `getAncestorShuffleDependencies` method in `DAGScheduler` (incorrectly) does not check for duplicates when it's adding ShuffleDependencies to the parents data structure, so for this DAG, when `getAncestorShuffleDependencies` gets called on `C` (previous of the final RDD), `getAncestorShuffleDependencies` will return `s_A`, `s_B`, `s_A` (`s_A` gets added twice: once when the method "visit"s RDD `C`, and once when the method "visit"s RDD `B`). This is problematic because this line of code: https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289 then generates a new shuffle stage for each dependency returned by `getAncestorShuffleDependencies`, resulting in duplicate map stages that compute the map output from RDD `A`.
As a result, `DAGScheduler` generates the following stages and their parents for each shuffle:
| | stage | parents |
|----|----|----|
| s_A | ShuffleMapStage 2 | List() |
| s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
| - | ResultStage 4 | List(ShuffleMapStage 3) |
The stage for s_A should be `ShuffleMapStage 0`, but the stage for `s_A` is generated twice as `ShuffleMapStage 2` and `ShuffleMapStage 0` is overwritten by `ShuffleMapStage 2`, and the stage `ShuffleMap Stage1` keeps referring the old stage `ShuffleMapStage 0`.
This patch is fixing it.
## How was this patch tested?
I added the sample RDD graph to show the illegal stage graph to `DAGSchedulerSuite`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #12655 from ueshin/issues/SPARK-13902.
Diffstat (limited to 'R/pkg/inst/tests')
0 files changed, 0 insertions, 0 deletions