[SPARK-15726][SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD - spark


author	Hiroshi Inoue <inouehrs@jp.ibm.com>	2016-08-05 16:00:25 +0800
committer	Wenchen Fan <wenchen@databricks.com>	2016-08-05 16:00:25 +0800
commit	faaefab26ffea3a5edfeaff42db222c8cd3ff5f1 (patch)
tree	f5dce60b9767e3e7d5f0dfdce4261af902784d8c /sql/catalyst/src/main/scala/org/apache
parent	1fa644497aed0a6d22f5fc7bf8e752508053b75b (diff)

[SPARK-15726][SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD

## What changes were proposed in this pull request?

DatasetBenchmark compares the performance of RDD, DataFrame and Dataset while running the same operations. However, two problems make the comparison unfair.

1) In the backToBackMap test case, the DataFrame implementation does less work than the RDD and Dataset implementations. This test case processes Long+String pairs, but the output of the DataFrame implementation does not include the String part, while the RDD and Dataset implementations generate Long+String pairs as output. This difference significantly changes the performance characteristics because of the String manipulation and creation overheads.

2) In the back-to-back map and back-to-back filter test cases, the `map` or `filter` operation is executed only once for RDD, regardless of the `numChains` parameter. Hence the execution times for RDD have been largely underestimated.

These issues do not affect Spark users, but they may confuse Spark developers.

## How was this patch tested?

By executing the DatasetBenchmark.

Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

Closes #13459 from inouehrs/fix_benchmark_fairness.
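The second issue (an operation running only once regardless of `numChains`) can be illustrated without Spark. The sketch below is not the benchmark's code: `LazyPipe`, `restartedChain` and `accumulatedChain` are hypothetical names, and `LazyPipe` is only a minimal stand-in for a lazily evaluated collection such as an RDD. It assumes the unfairness comes from a loop that re-maps the original input on every iteration instead of extending the previous result, which is one plausible way to end up with a single stage.

```scala
object ChainSketch {
  // Hypothetical lazy stand-in for an RDD: map only records the function,
  // and nothing is computed until run() is called.
  final case class LazyPipe(source: Seq[Long], transform: Long => Long) {
    def map(f: Long => Long): LazyPipe = copy(transform = transform.andThen(f))
    def run(): Seq[Long] = source.map(transform)
  }

  // Assumed unfair pattern: every iteration maps the ORIGINAL input again,
  // so the final pipeline contains one map stage however large numChains is.
  def restartedChain(base: LazyPipe, numChains: Int, f: Long => Long): LazyPipe = {
    var res = base
    var i = 0
    while (i < numChains) {
      res = base.map(f)
      i += 1
    }
    res
  }

  // Fair pattern: every iteration extends the previous result,
  // so the final pipeline contains numChains map stages.
  def accumulatedChain(base: LazyPipe, numChains: Int, f: Long => Long): LazyPipe = {
    var res = base
    var i = 0
    while (i < numChains) {
      res = res.map(f)
      i += 1
    }
    res
  }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val f: Long => Long = { x => calls += 1; x + 1 }
    val base = LazyPipe(Seq(1L, 2L, 3L), (x: Long) => x)

    restartedChain(base, 10, f).run()
    println(s"restarted chain:   $calls calls") // 3 elements x 1 stage = 3

    calls = 0
    accumulatedChain(base, 10, f).run()
    println(s"accumulated chain: $calls calls") // 3 elements x 10 stages = 30
  }
}
```

With ten chained stages, the restarted variant does a tenth of the work of the accumulated one, which is the kind of gap that would make RDD timings look better than they should.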

Diffstat (limited to 'sql/catalyst/src/main/scala/org/apache')

0 files changed, 0 insertions, 0 deletions

