author    | Yanbo Liang <ybliang8@gmail.com> | 2015-10-27 11:28:59 +0100
committer | Michael Armbrust <michael@databricks.com> | 2015-10-27 11:28:59 +0100
commit    | 360ed832f5213b805ac28cf1d2828be09480f2d6 (patch)
tree      | 6375913415e936a7a847522fa03a27d8be94f312 /sql/core
parent    | 958a0ec8fa58ff091f595db2b574a7aa3ff41253 (diff)
[SPARK-11303][SQL] filter should not be pushed down into sample
When sampling and then filtering a DataFrame, the SQL optimizer will push the filter down into the sample and produce a wrong result. This is because the sampler's selections are computed over the original input, not over the input that remains after filtering.
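The effect can be illustrated without Spark. The sketch below uses a deterministic, position-based sampler as a stand-in for Spark's seeded sampler (the `sample` helper and `SamplePushdownDemo` object are hypothetical, for illustration only): because the sampler's choices depend on each row's position in its input, shrinking the input with a filter changes which rows get selected, so the pushed-down plan no longer matches the original plan.

```scala
object SamplePushdownDemo {
  // Deterministic stand-in for a seeded sampler: keep every 10th row by
  // position. Like a seeded random sampler, the outcome depends on each
  // row's position in the input, not on the row's value.
  def sample(rows: Seq[Long]): Seq[Long] =
    rows.zipWithIndex.collect { case (r, i) if i % 10 == 0 => r }

  def main(args: Array[String]): Unit = {
    val data = (0L until 100L).toSeq

    // Correct plan: sample first, then filter the sampled rows.
    val correct = sample(data).filter(_ % 2 == 0)
    // Plan after (invalid) pushdown: filter first, then sample the survivors.
    val pushedDown = sample(data.filter(_ % 2 == 0))

    println(correct)     // rows 0, 10, 20, ..., 90
    println(pushedDown)  // rows 0, 20, 40, 60, 80
    println(correct == pushedDown) // false: pushdown changed the result
  }
}
```

Filtering after sampling only discards rows from an already-fixed sample, which is why the test added in this patch can assert that the odd and even partitions of a sample always sum to the sample's total count, regardless of whether pushdown occurs.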
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9294 from yanboliang/spark-11303.
Diffstat (limited to 'sql/core')
-rw-r--r-- | sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala | 10 |
1 file changed, 10 insertions, 0 deletions
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
index 298c322906..f5ae3ae49b 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala
@@ -1860,4 +1860,14 @@ class SQLQuerySuite extends QueryTest with SharedSQLContext {
       Row(1))
   }
 }
+
+  test("SPARK-11303: filter should not be pushed down into sample") {
+    val df = sqlContext.range(100)
+    List(true, false).foreach { withReplacement =>
+      val sampled = df.sample(withReplacement, 0.1, 1)
+      val sampledOdd = sampled.filter("id % 2 != 0")
+      val sampledEven = sampled.filter("id % 2 = 0")
+      assert(sampled.count() == sampledOdd.count() + sampledEven.count())
+    }
+  }
 }