author | CodingCat <zhunansjtu@gmail.com> | 2014-11-26 16:52:04 -0800
---|---|---
committer | Matei Zaharia <matei@databricks.com> | 2014-11-26 16:52:04 -0800
commit | 5af53ada65f62e6b5987eada288fb48e9211ef9d (patch) |
tree | 41656128e02df28f5add223233442a2393cf4dc4 /docs |
parent | 561d31d2f13cc7b1112ba9f9aa8f08bcd032aebb (diff) |
[SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator
https://issues.apache.org/jira/browse/SPARK-3628
In the current implementation, the accumulator is updated for every successfully finished task, even when the task comes from a resubmitted stage, which makes the accumulator's value counter-intuitive.
In this patch, I changed the way the DAGScheduler updates accumulators.
The DAGScheduler maintains a hash table mapping each stage id to the <accumulator_id, value> pairs received for that stage. When a task finishes, we check whether the hash table already contains its stage id, and save the task's <accumulator_id, value> pair only when the task is the first finished task of a new stage or the stage is running its first attempt. Only when the stage becomes independent (no job needs it any more) do we fold its <accumulator_id, value> pairs into the accumulators.
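The de-duplication bookkeeping described above can be sketched as follows. This is an illustrative plain-Python model, not Spark's actual DAGScheduler code; the names (`AccumulatorDeduper`, `on_task_end`) and the choice of (stage id, partition id) as the de-duplication key are hypothetical.

```python
# Hypothetical sketch: apply each task's accumulator update at most once,
# so that a resubmitted stage's duplicate task completions are ignored.

class AccumulatorDeduper:
    def __init__(self):
        self.totals = {}   # accumulator_id -> accumulated value
        self.seen = set()  # (stage_id, partition_id) pairs already applied

    def on_task_end(self, stage_id, partition_id, updates):
        """updates: dict of accumulator_id -> value reported by one finished task."""
        key = (stage_id, partition_id)
        if key in self.seen:
            return  # duplicate completion (e.g. resubmitted stage); skip it
        self.seen.add(key)
        for acc_id, value in updates.items():
            self.totals[acc_id] = self.totals.get(acc_id, 0) + value

d = AccumulatorDeduper()
d.on_task_end(stage_id=1, partition_id=0, updates={42: 5})
d.on_task_end(stage_id=1, partition_id=0, updates={42: 5})  # retry: ignored
print(d.totals[42])  # 5, not 10
```

The key point of the design is that de-duplication happens in the scheduler, keyed on which task produced the update, rather than in the accumulator itself.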
Author: CodingCat <zhunansjtu@gmail.com>
Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:
701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
Diffstat (limited to 'docs')
-rw-r--r-- | docs/programming-guide.md | 6 |
1 file changed, 6 insertions, 0 deletions
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 49f319ba77..c60de6e970 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1306,6 +1306,12 @@ vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
 </div>
+For accumulator updates performed inside <b>actions only</b>, Spark guarantees that each task's update to the accumulator
+will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware
+of that each task's update may be applied more than once if tasks or job stages are re-executed.
+
+
+
 # Deploying to a Cluster
 The [application submission guide](submitting-applications.html) describes how to submit applications to a cluster.
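The behavior the new documentation warns about can be illustrated with a small plain-Python model (no Spark involved): a side-effecting counter update inside a map-like transformation is applied again whenever that work is re-executed, which is why exactly-once accumulator semantics can only be guaranteed for updates performed inside actions.

```python
# Illustrative sketch: why accumulator updates made inside a transformation
# may be applied more than once when tasks or stages are re-executed.

counter = {"value": 0}

def mapper(x):
    counter["value"] += 1  # side-effecting update inside a "transformation"
    return x * 2

data = [1, 2, 3]
result = [mapper(x) for x in data]  # first attempt
result = [mapper(x) for x in data]  # task re-executed: updates applied again
print(counter["value"])  # 6, not 3 -- each re-execution re-applies the updates
```

With the de-duplication in this patch, the scheduler discards the second set of updates for the same tasks when they come from an action, so the counter would end at 3.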