author Hiroshi Inoue <inouehrs@jp.ibm.com> 2016-08-05 16:00:25 +0800
committer Wenchen Fan <wenchen@databricks.com> 2016-08-05 16:00:25 +0800
commit faaefab26ffea3a5edfeaff42db222c8cd3ff5f1
tree f5dce60b9767e3e7d5f0dfdce4261af902784d8c /sql/catalyst/src/main/scala/org/apache
parent 1fa644497aed0a6d22f5fc7bf8e752508053b75b
[SPARK-15726][SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD
## What changes were proposed in this pull request?

DatasetBenchmark compares the performance of RDD, DataFrame and Dataset while running the same operations. However, two problems make the comparisons unfair.

1) In the backToBackMap test case, the DataFrame implementation does less work than the RDD and Dataset implementations. This test case processes Long+String pairs, but the output of the DataFrame implementation omits the String part, while the RDD and Dataset implementations generate Long+String pairs as output. This difference significantly changes the performance characteristics because of the overhead of String manipulation and creation.

2) In the back-to-back map and back-to-back filter test cases, the `map` or `filter` operation is executed only once for RDD, regardless of the `numChains` parameter. Hence the execution times for RDD have been largely underestimated.

These issues do not affect Spark users, but they may confuse Spark developers.

## How was this patch tested?

By executing the DatasetBenchmark.

Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

Closes #13459 from inouehrs/fix_benchmark_fairness.
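
The second problem can be illustrated with a minimal sketch. It is not the benchmark's actual code: it assumes a plain Scala `Iterator` as a stand-in for a lazy RDD, and `ChainSketch`, `run`, and the `chained` flag are hypothetical names introduced only for this illustration. When each loop iteration maps over the original source instead of the previous result, only the last `map` is ever evaluated, so the work done does not grow with `numChains`.

```scala
object ChainSketch {
  // Hypothetical helper (not from DatasetBenchmark): applies `func` in a loop
  // and returns (sum of the final elements, number of times `func` ran).
  def run(numChains: Int, chained: Boolean): (Long, Int) = {
    val data = Seq(1L, 2L, 3L)
    var calls = 0
    val func = (x: Long) => { calls += 1; x + 1 }

    // Iterator.map is lazy, which stands in for the laziness of RDD.map.
    var res: Iterator[Long] = data.iterator
    var i = 0
    while (i < numChains) {
      res =
        if (chained) res.map(func)   // builds on the previous result
        else data.iterator.map(func) // restarts from the source each time
      i += 1
    }

    // Forcing the iterator here is what actually runs `func`.
    val sum = res.sum
    (sum, calls)
  }

  def main(args: Array[String]): Unit = {
    println(run(3, chained = false)) // (9,3): func ran once per element
    println(run(3, chained = true))  // (15,9): func ran numChains times per element
  }
}
```

In the unchained variant the intermediate iterators are discarded without being consumed, which is why the measured work stays constant however large `numChains` is.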
Diffstat (limited to 'sql/catalyst/src/main/scala/org/apache')
0 files changed, 0 insertions, 0 deletions