[SPARK-733] Add documentation on use of accumulators in lazy transformation

I've added documentation clarifying the particular lack of clarity highlighted in the relevant JIRA. I've also added code examples for this issue to clarify the explanation.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #4022 from ilganeli/SPARK-733 and squashes the following commits:

587def5 [Ilya Ganelin] Updated to clarify verbage
df3afd7 [Ilya Ganelin] Revert "Partially updated task metrics to make some vars private"
3f6c512 [Ilya Ganelin] Revert "Completed refactoring to make vars in TaskMetrics class private"
58034fb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
4dc2cdb [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-733
3a38db1 [Ilya Ganelin] Verified documentation update by building via jekyll
33b5a2d [Ilya Ganelin] Added code examples for java and python
1fd59b2 [Ilya Ganelin] Updated documentation for accumulators to highlight lazy evaluation issue
5525c20 [Ilya Ganelin] Completed refactoring to make vars in TaskMetrics class private
c64da4f [Ilya Ganelin] Partially updated task metrics to make some vars private

(cherry picked from commit fd3a8a1d15ad516ea056089e30d6fd14e2f2d9a1)
Signed-off-by: Imran Rashid <irashid@cloudera.com>
author: Ilya Ganelin <ilya.ganelin@capitalone.com> 2015-01-16 13:25:17 -0800
committer: Imran Rashid <irashid@cloudera.com> 2015-01-16 13:25:47 -0800
commit: 4a550acb28530ed69c0b5d84f850eb94e61968e1 (patch)
tree: b1073a9c83367cee8769860e02a2995729189a69
parent: 473777ef221f402c1c895327ab94b0101a1308e9 (diff)
1 files changed, 28 insertions, 0 deletions
diff --git a/docs/programming-guide.md b/docs/programming-guide.md
index 0211bbabc1..2443fc29b4 100644
--- a/docs/programming-guide.md
+++ b/docs/programming-guide.md
@@ -1316,7 +1316,35 @@ For accumulator updates performed inside <b>actions only</b>, Spark guarantees t
 will only be applied once, i.e. restarted tasks will not update the value. In transformations, users should be aware 
 of that each task's update may be applied more than once if tasks or job stages are re-executed.
 
+Accumulators do not change Spark's lazy evaluation model. If an accumulator is updated within an operation on an RDD, its value is only updated once that RDD is computed as part of an action. Consequently, accumulator updates are not guaranteed to be executed when made within a lazy transformation like `map()`. The code fragment below demonstrates this property:
 
+<div class="codetabs">
+
+<div data-lang="scala"  markdown="1">
+{% highlight scala %}
+val acc = sc.accumulator(0)
+data.map { x => acc += x; f(x) }
+// Here, acc is still 0 because no actions have caused the `map` to be computed.
+{% endhighlight %}
+</div>
+
+<div data-lang="java"  markdown="1">
+{% highlight java %}
+Accumulator<Integer> accum = sc.accumulator(0);
+data.map(x -> { accum.add(x); return f(x); });
+// Here, accum is still 0 because no actions have caused the `map` to be computed.
+{% endhighlight %}
+</div>
+
+<div data-lang="python"  markdown="1">
+{% highlight python %}
+accum = sc.accumulator(0)
+data.map(lambda x: (accum.add(x), f(x))[1])
+# Here, accum is still 0 because no actions have caused the `map` to be computed.
+{% endhighlight %}
+</div>
+
+</div>
 
 # Deploying to a Cluster
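
The lazy-evaluation behaviour documented in this patch can be illustrated without a Spark cluster. The following is a minimal plain-Python analogy, not Spark code: Python's built-in `map` is also lazy, so it stands in for an RDD transformation, and `list(...)` stands in for an action. The `Accumulator` class, the function `f`, and the helper `add_and_apply` are hypothetical stand-ins introduced only for this sketch.

```python
class Accumulator:
    """Hypothetical stand-in for a Spark accumulator (not the PySpark class)."""

    def __init__(self, value=0):
        self.value = value

    def add(self, x):
        self.value += x


def f(x):
    # Placeholder for the real per-element function; assumed for illustration.
    return x * 2


accum = Accumulator(0)
data = [1, 2, 3, 4]


def add_and_apply(x):
    # Side effect on the accumulator, then the actual mapping.
    accum.add(x)
    return f(x)


# Built-in map is lazy: no element has been processed yet.
mapped = map(add_and_apply, data)
value_before_action = accum.value

# Consuming the iterator forces evaluation, playing the role of an action.
result = list(mapped)
value_after_action = accum.value

print(value_before_action, value_after_action, result)
```

In Spark the same pattern applies: the accumulator only reflects the updates once an action such as `count()` or `collect()` has caused the mapped RDD to be computed.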