aboutsummaryrefslogtreecommitdiff
path: root/docs/programming-guide.md
diff options
context:
space:
mode:
authorCodingCat <zhunansjtu@gmail.com>2014-11-26 16:52:04 -0800
committerMatei Zaharia <matei@databricks.com>2014-11-26 16:52:04 -0800
commit5af53ada65f62e6b5987eada288fb48e9211ef9d (patch)
tree41656128e02df28f5add223233442a2393cf4dc4 /docs/programming-guide.md
parent561d31d2f13cc7b1112ba9f9aa8f08bcd032aebb (diff)
downloadspark-5af53ada65f62e6b5987eada288fb48e9211ef9d.tar.gz
spark-5af53ada65f62e6b5987eada288fb48e9211ef9d.tar.bz2
spark-5af53ada65f62e6b5987eada288fb48e9211ef9d.zip
[SPARK-732][SPARK-3628][CORE][RESUBMIT] eliminate duplicate update on accmulator
https://issues.apache.org/jira/browse/SPARK-3628 In current implementation, the accumulator will be updated for every successfully finished task, even the task is from a resubmitted stage, which makes the accumulator counter-intuitive In this patch, I changed the way for the DAGScheduler to update the accumulator, DAGScheduler maintains a HashTable, mapping the stage id to the received <accumulator_id , value> pairs. Only when the stage becomes independent, (no job needs it any more), we accumulate the values of the <accumulator_id , value> pairs, when a task finished, we check if the HashTable has contained such stageId, it saves the accumulator_id, value only when the task is the first finished task of a new stage or the stage is running for the first attempt... Author: CodingCat <zhunansjtu@gmail.com> Closes #2524 from CodingCat/SPARK-732-1 and squashes the following commits: 701a1e8 [CodingCat] roll back change on Accumulator.scala 1433e6f [CodingCat] make MIMA happy b233737 [CodingCat] address Matei's comments 02261b8 [CodingCat] rollback some changes 6b0aff9 [CodingCat] update document 2b2e8cf [CodingCat] updateAccumulator 83b75f8 [CodingCat] style fix 84570d2 [CodingCat] re-enable the bad accumulator guard 1e9e14d [CodingCat] add NPE guard 21b6840 [CodingCat] simplify the patch 88d1f03 [CodingCat] fix rebase error f74266b [CodingCat] add test case for resubmitted result stage 5cf586f [CodingCat] de-duplicate on task level 138f9b3 [CodingCat] make MIMA happy 67593d2 [CodingCat] make if allowing duplicate update as an option of accumulator
Diffstat (limited to 'docs/programming-guide.md')
-rw-r--r--docs/programming-guide.md6
1 files changed, 6 insertions, 0 deletions
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 49f319ba77..c60de6e970 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1306,6 +1306,12 @@ vecAccum = sc.accumulator(Vector(...), VectorAccumulatorParam())
</div>
+For accumulator updates performed inside <b>actions only</b>, Spark guarantees that each task's update to the accumulator
+will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware
+of that each task's update may be applied more than once if tasks or job stages are re-executed.
+
+
+
# Deploying to a Cluster
The [application submission guide](submitting-applications.html) describes how to submit applications to a cluster.