[SPARK-15764][SQL] Replace N^2 loop in BindReferences - spark

diff options

author	Josh Rosen <joshrosen@databricks.com>	2016-06-06 11:44:51 -0700
committer	Josh Rosen <joshrosen@databricks.com>	2016-06-06 11:44:51 -0700
commit	0b8d694999b43ada4833388cad6c285c7757cbf7 (patch)
tree	136fd92aeee3b91daf1c0c4423b0bcba11c6659b /core
parent	4c74ee8d8e1c3139d3d322ae68977f2ab53295df (diff)
download	spark-0b8d694999b43ada4833388cad6c285c7757cbf7.tar.gz spark-0b8d694999b43ada4833388cad6c285c7757cbf7.tar.bz2 spark-0b8d694999b43ada4833388cad6c285c7757cbf7.zip

[SPARK-15764][SQL] Replace N^2 loop in BindReferences

BindReferences contains a n^2 loop which causes performance issues when operating over large schemas: to determine the ordinal of an attribute reference, we perform a linear scan over the `input` array. Because input can sometimes be a `List`, the call to `input(ordinal).nullable` can also be O(n). Instead of performing a linear scan, we can convert the input into an array and build a hash map to map from expression ids to ordinals. The greater up-front cost of the map construction is offset by the fact that an expression can contain multiple attribute references, so the cost of the map construction is amortized across a number of lookups. Perf. benchmarks to follow. /cc ericl Author: Josh Rosen <joshrosen@databricks.com> Closes #13505 from JoshRosen/bind-references-improvement.

Diffstat (limited to 'core')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: