[SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition. - spark

author	蒋星博 <jiangxingbo@meituan.com>	2016-07-14 00:21:27 +0800
committer	Cheng Lian <lian@databricks.com>	2016-07-14 00:21:27 +0800
commit	f376c37268848dbb4b2fb57677e22ef2bf207b49 (patch)
tree	bc4fc046291880943c4d2c5ad37625a7548baa84 /python/pyspark/mllib/fpm.py
parent	ea06e4ef34c860219a9aeec81816ef53ada96253 (diff)

[SPARK-16343][SQL] Improve the PushDownPredicate rule to pushdown predicates correctly in non-deterministic condition.

## What changes were proposed in this pull request?

Currently our Optimizer may reorder predicates to run them more efficiently. When a condition contains non-deterministic parts, however, changing the order between the deterministic and non-deterministic parts may change the number of input rows that the non-deterministic parts see. For example:

```SELECT a FROM t WHERE rand() < 0.1 AND a = 1```

and

```SELECT a FROM t WHERE a = 1 AND rand() < 0.1```

may call rand() a different number of times, and therefore the output rows differ.

This PR improves the handling of this case by checking whether a predicate is placed before any non-deterministic predicate; only such predicates are pushed down.

## How was this patch tested?

Expanded the related test cases in FilterPushdownSuite.

Author: 蒋星博 <jiangxingbo@meituan.com>

Closes #14012 from jiangxb1987/ppd.
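The check described above can be illustrated with a small standalone model. This is a sketch only, not the Spark implementation: the `Predicate` class and the `split_pushable` helper are hypothetical names introduced here, and the model assumes the condition has already been split into its top-level `AND` conjuncts in their written order.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Predicate:
    """Hypothetical stand-in for one conjunct of a WHERE clause."""
    sql: str
    deterministic: bool


def split_pushable(conjuncts: List[Predicate]) -> Tuple[List[Predicate], List[Predicate]]:
    """Split conjuncts into (pushable, must_stay).

    Only the deterministic predicates that appear before the first
    non-deterministic predicate are treated as safe to push down;
    everything from the first non-deterministic predicate onward
    stays where it is, in its original order.
    """
    for i, pred in enumerate(conjuncts):
        if not pred.deterministic:
            return conjuncts[:i], conjuncts[i:]
    # No non-deterministic predicate at all: everything is pushable.
    return list(conjuncts), []


# WHERE rand() < 0.1 AND a = 1: nothing precedes rand(), so nothing moves.
q1 = [Predicate("rand() < 0.1", False), Predicate("a = 1", True)]
# WHERE a = 1 AND rand() < 0.1: a = 1 precedes rand(), so it may be pushed.
q2 = [Predicate("a = 1", True), Predicate("rand() < 0.1", False)]

pushed1, kept1 = split_pushable(q1)
pushed2, kept2 = split_pushable(q2)
```

In this model the first query keeps both predicates in place, so `rand()` is still evaluated on every row of `t`, while the second query allows `a = 1` to move ahead, which matches the order in which it was written anyway.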

Diffstat (limited to 'python/pyspark/mllib/fpm.py')

0 files changed, 0 insertions, 0 deletions

