author    | Davies Liu <davies@databricks.com>            | 2017-01-20 16:11:40 -0800
committer | Herman van Hovell <hvanhovell@databricks.com> | 2017-01-20 16:11:40 -0800
commit    | 9b7a03f15ac45e5f7dcf118d1e7ce1556339aa46 (patch)
tree      | 67c03bb4a69f9631e845156ca6eaef25746bb02d /python
parent    | 552e5f08841828e55f5924f1686825626da8bcd0 (diff)
[SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join
## What changes were proposed in this pull request?
PythonUDF is unevaluable, so it cannot be used inside a join condition. Currently the optimizer pushes a PythonUDF that accesses attributes from both sides of the join into the join condition, and the query then fails to plan.
This PR fixes the issue by checking whether an expression is evaluable before pushing it into the Join.
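The idea behind the fix can be sketched with a toy expression model. Note that `Expr`, `is_evaluable`, and `split_join_predicates` below are hypothetical names for illustration, not Spark's Catalyst internals: a predicate is only safe to push into the join condition if every node in its tree is evaluable by the engine, which a PythonUDF node is not.

```python
# Toy sketch of the optimizer check (hypothetical names, not Spark internals):
# a filter predicate may be pushed into a join condition only if it contains
# no unevaluable nodes, such as a Python UDF.

class Expr:
    """A minimal expression-tree node."""
    def __init__(self, kind, children=(), evaluable=True):
        self.kind = kind
        self.children = list(children)
        self.evaluable = evaluable

def is_evaluable(expr):
    """An expression is evaluable iff the node and all its children are."""
    return expr.evaluable and all(is_evaluable(c) for c in expr.children)

def split_join_predicates(predicates):
    """Split predicates into those safe to push into the join condition
    and those that must stay in a Filter above the join."""
    pushable, kept = [], []
    for p in predicates:
        (pushable if is_evaluable(p) else kept).append(p)
    return pushable, kept

# A plain column equality is evaluable; a Python UDF is not.
eq = Expr("EqualTo", [Expr("a"), Expr("b")])
py_udf = Expr("PythonUDF", [Expr("a"), Expr("b")], evaluable=False)

pushable, kept = split_join_predicates([eq, py_udf])
```

Under this check, the UDF predicate stays in a Filter on top of the join (where Spark can evaluate it via the Python runner) instead of being pushed into a join condition the planner cannot execute.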
## How was this patch tested?
Add a regression test.
Author: Davies Liu <davies@databricks.com>
Closes #16581 from davies/pyudf_join.
Diffstat (limited to 'python')
python/pyspark/sql/tests.py | 9 +++++++++
1 file changed, 9 insertions(+), 0 deletions(-)
```diff
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 73a5df65e0..4bfe6e9eb3 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -342,6 +342,15 @@ class SQLTests(ReusedPySparkTestCase):
         df = df.withColumn('b', udf(lambda x: 'x')(df.a))
         self.assertEqual(df.filter('b = "x"').collect(), [Row(a=1, b='x')])
 
+    def test_udf_in_filter_on_top_of_join(self):
+        # regression test for SPARK-18589
+        from pyspark.sql.functions import udf
+        left = self.spark.createDataFrame([Row(a=1)])
+        right = self.spark.createDataFrame([Row(b=1)])
+        f = udf(lambda a, b: a == b, BooleanType())
+        df = left.crossJoin(right).filter(f("a", "b"))
+        self.assertEqual(df.collect(), [Row(a=1, b=1)])
+
     def test_udf_without_arguments(self):
         self.spark.catalog.registerFunction("foo", lambda: "bar")
         [row] = self.spark.sql("SELECT foo()").collect()
```