[SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n) - spark

author	Ergin Seyfe <eseyfe@fb.com>	2016-09-14 09:51:14 +0100
committer	Sean Owen <sowen@cloudera.com>	2016-09-14 09:51:14 +0100
commit	4cea9da2ae88b40a5503111f8f37051e2372163e (patch)
tree	e040fbb42d09904fe1123fc3af2069c0a8cbfde2 /NOTICE
parent	18b4f035f40359b3164456d0dab52dbc762ea3b4 (diff)

[SPARK-17480][SQL] Improve performance by removing or caching List.length which is O(n)

## What changes were proposed in this pull request?

Scala's List.length method is O(N), and calling it repeatedly makes the gatherCompressibilityStats function O(N^2). This patch eliminates the List.length calls by rewriting the code in a more idiomatic Scala way.

https://github.com/scala/scala/blob/2.10.x/src/library/scala/collection/LinearSeqOptimized.scala#L36

As suggested, the fix was extended to the HiveInspectors and AggregationIterator classes as well.

## How was this patch tested?

Profiled a Spark job and found that CompressibleColumnBuilder uses 39% of the CPU. Out of this 39%, CompressibleColumnBuilder->gatherCompressibilityStats accounts for 23%. 6.24% of the CPU is spent in List.length, which is called inside gatherCompressibilityStats. After this change, that 6.24% of the CPU is saved.

Author: Ergin Seyfe <eseyfe@fb.com>

Closes #15032 from seyfe/gatherCompressibilityStats.
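The pattern this commit message describes can be illustrated with a minimal sketch. It is not the code changed by this commit: the object and method names below are hypothetical and exist only to show why calling List.length inside a loop is quadratic, and two common ways to avoid it (traversing the list directly, or computing the length once and reusing it).

```scala
// Hypothetical illustration only; these names do not appear in Spark.
object ListLengthSketch {

  // Quadratic: xs.length walks the whole list on every loop test,
  // and xs(i) walks i elements on every access.
  def sumIndexed(xs: List[Int]): Long = {
    var total = 0L
    var i = 0
    while (i < xs.length) {
      total += xs(i)
      i += 1
    }
    total
  }

  // Linear: a single traversal, with no length or index lookups.
  def sumTraversed(xs: List[Int]): Long = {
    var total = 0L
    xs.foreach(x => total += x)
    total
  }

  // When the element count is really needed, compute it once and reuse it.
  def mean(xs: List[Int]): Double = {
    val n = xs.length // single O(n) walk, cached in a local val
    if (n == 0) 0.0 else sumTraversed(xs).toDouble / n
  }
}
```

Both sum methods return the same result; only the amount of list walking differs. For code that genuinely needs indexed access, an Array or Vector is usually a better fit than a List.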


