[SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics - spark

author	Andrew Or <andrew@databricks.com>	2016-01-29 18:03:04 -0800
committer	Andrew Or <andrew@databricks.com>	2016-01-29 18:03:08 -0800
commit	12252d1da90fa7d2dffa3a7c249ecc8821dee130 (patch)
tree	afac517a71e5639ba7796d55a3339167dd5a4f05 /dev
parent	70e69fc4dd619654f5d24b8b84f6a94f7705c59b (diff)
download	spark-12252d1da90fa7d2dffa3a7c249ecc8821dee130.tar.gz spark-12252d1da90fa7d2dffa3a7c249ecc8821dee130.tar.bz2 spark-12252d1da90fa7d2dffa3a7c249ecc8821dee130.zip

[SPARK-13071] Coalescing HadoopRDD overwrites existing input metrics

This issue is causing tests to fail consistently in master with Hadoop 2.6 / 2.7. This is because for Hadoop 2.5+ we overwrite existing values of `InputMetrics#bytesRead` in each call to `HadoopRDD#compute`. In the case of coalesce, e.g.

```
sc.textFile(..., 4).coalesce(2).count()
```

we will call `compute` multiple times in the same task, overwriting `bytesRead` values from previous calls to `compute`.

For a regression test, see `InputOutputMetricsSuite.input metrics for old hadoop with coalesce`. I did not add a new regression test because it's impossible without significant refactoring; there's a lot of existing duplicate code in this corner of Spark.

This was caused by #10835.

Author: Andrew Or <andrew@databricks.com>

Closes #10973 from andrewor14/fix-input-metrics-coalesce.
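The overwrite described above can be illustrated with a small, Spark-free toy model. This is only a sketch: `ToyInputMetrics`, `computeOverwriting`, `computeAccumulating`, and `runTask` are hypothetical names invented for illustration, not Spark's classes or the code in this patch. It assumes each call to `compute` reports the bytes read for a single input partition, and that a coalesced task calls `compute` once per parent partition while sharing one metrics object.

```scala
// Toy stand-in for a per-task input metrics object (hypothetical, not Spark's InputMetrics).
final class ToyInputMetrics {
  private var _bytesRead: Long = 0L
  def bytesRead: Long = _bytesRead
  def setBytesRead(v: Long): Unit = { _bytesRead = v }
}

object CoalesceMetricsSketch {
  // Problematic pattern: each compute call sets an absolute value,
  // discarding whatever earlier compute calls in the same task recorded.
  def computeOverwriting(metrics: ToyInputMetrics, partitionBytes: Long): Unit = {
    metrics.setBytesRead(partitionBytes)
  }

  // One possible remedy (an assumption, not necessarily the actual fix):
  // capture the value already recorded and add this call's bytes to it.
  def computeAccumulating(metrics: ToyInputMetrics, partitionBytes: Long): Unit = {
    val existingBytesRead = metrics.bytesRead
    metrics.setBytesRead(existingBytesRead + partitionBytes)
  }

  // Simulates one coalesced task: compute runs once per parent partition,
  // and every call shares the same metrics object.
  def runTask(partitionSizes: Seq[Long], compute: (ToyInputMetrics, Long) => Unit): Long = {
    val metrics = new ToyInputMetrics
    partitionSizes.foreach(size => compute(metrics, size))
    metrics.bytesRead
  }
}
```

With two parent partitions of 100 and 250 bytes in one task, the overwriting variant ends up reporting only the last partition's size, while the accumulating variant reports their sum.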

Diffstat (limited to 'dev')

0 files changed, 0 insertions, 0 deletions
