[SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accumulator

https://issues.apache.org/jira/browse/SPARK-3628

In the current implementation, the accumulator is updated for every successfully finished task, even if the task belongs to a resubmitted stage, which makes accumulator values counter-intuitive.

In this patch, I changed the way the DAGScheduler updates accumulators. The DAGScheduler maintains a hash table mapping each stage id to the received <accumulator_id, value> pairs. Only when the stage becomes independent (no job needs it any more) do we accumulate the values of those <accumulator_id, value> pairs. When a task finishes, we check whether the hash table already contains its stageId; the <accumulator_id, value> pair is saved only when the task is the first finished task of a new stage or the stage is running its first attempt...

Author: CodingCat <zhunansjtu@gmail.com>

Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits:

701a1e8 [CodingCat] roll back change on Accumulator.scala
1433e6f [CodingCat] make MIMA happy
b233737 [CodingCat] address Matei's comments
02261b8 [CodingCat] rollback some changes
6b0aff9 [CodingCat] update document
2b2e8cf [CodingCat] updateAccumulator
83b75f8 [CodingCat] style fix
84570d2 [CodingCat] re-enable the bad accumulator guard
1e9e14d [CodingCat] add NPE guard
21b6840 [CodingCat] simplify the patch
88d1f03 [CodingCat] fix rebase error
f74266b [CodingCat] add test case for resubmitted result stage
5cf586f [CodingCat] de-duplicate on task level
138f9b3 [CodingCat] make MIMA happy
67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator

(cherry picked from commit 5af53ada65f62e6b5987eada288fb48e9211ef9d)
Signed-off-by: Matei Zaharia <matei@databricks.com>
author: CodingCat <zhunansjtu@gmail.com> 2014-11-26 16:52:04 -0800
committer: Matei Zaharia <matei@databricks.com> 2014-11-26 16:52:13 -0800
commit: 66cc2431462a5354bb50c196a59da0ffc258c466 (patch)
tree: 09d83f0ce94545d07149afc5375d5a0a6c135f58 /docs
parent: 69550f761c53da80343ae982db38780cd2ad956f (diff)
download: spark-66cc2431462a5354bb50c196a59da0ffc258c466.tar.gz
spark-66cc2431462a5354bb50c196a59da0ffc258c466.tar.bz2
spark-66cc2431462a5354bb50c196a59da0ffc258c466.zip
1 files changed, 6 insertions, 0 deletions
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 49f319ba77..c60de6e970 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1306,6 +1306,12 @@ vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
 
 </div>
 
+For accumulator updates performed inside <b>actions only</b>, Spark guarantees that each task's update to the accumulator 
+will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware 
+of that each task's update may be applied more than once if tasks or job stages are re-executed.
+
+
+
 # Deploying to a Cluster
 
 The [application submission guide](submitting-applications.html) describes how to submit applications to a cluster.
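
The guarantee documented in the hunk above (each task's update is applied only once, so restarted tasks do not change the value) can be illustrated with a small toy model. This is a minimal sketch in plain Python, not Spark's DAGScheduler code: the class `DedupingAccumulators`, its method `task_finished`, and the choice of a `(stage_id, partition_id)` key are all hypothetical names and assumptions made for illustration only.

```python
from collections import defaultdict


class DedupingAccumulators:
    """Toy model (hypothetical): merge each task's accumulator update at most once."""

    def __init__(self):
        # accumulator id -> merged value
        self.values = defaultdict(int)
        # (stage_id, partition_id) pairs whose updates were already merged
        self._applied = set()

    def task_finished(self, stage_id, partition_id, updates):
        """Merge a finished task's updates; return False if the task was a repeat."""
        key = (stage_id, partition_id)
        if key in self._applied:
            # A re-executed task for the same partition: drop its update.
            return False
        self._applied.add(key)
        for acc_id, delta in updates.items():
            self.values[acc_id] += delta
        return True


accs = DedupingAccumulators()
accs.task_finished(stage_id=0, partition_id=0, updates={"rows": 10})
accs.task_finished(stage_id=0, partition_id=1, updates={"rows": 7})
# The stage is resubmitted and partition 0 runs again; its update is ignored.
repeat_applied = accs.task_finished(stage_id=0, partition_id=0, updates={"rows": 10})
total_rows = accs.values["rows"]
```

Without the `_applied` check, the re-executed task would be counted a second time, which is the duplicate update this commit sets out to eliminate for actions.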