path: root/python/pyspark/sql/dataframe.py
author: Michael Armbrust <michael@databricks.com>, 2016-02-24 19:43:00 -0800
committer: Yin Huai <yhuai@databricks.com>, 2016-02-24 19:43:00 -0800
commit: 2b042577fb077865c3fce69c9d4eda22fde92673
tree: 201ba584e3e4ef81be598e8c4a28a9e4db261e06 (/python/pyspark/sql/dataframe.py)
parent: 5a7af9e7ac85e04aa4a420bc2887207bfa18f792
[SPARK-13092][SQL] Add ExpressionSet for constraint tracking
This PR adds a new abstraction called an `ExpressionSet`, which attempts to canonicalize expressions to remove cosmetic differences. Deterministic expressions that are in the set after canonicalization will always return the same answer given the same input (i.e. false positives should not be possible). However, it is possible that two canonical expressions that are not equal will in fact return the same answer given any input (i.e. false negatives are possible).

```scala
val set = ExpressionSet('a + 1 :: 1 + 'a :: Nil)

set.iterator          // => Iterator('a + 1)
set.contains('a + 1)  // => true
set.contains(1 + 'a)  // => true
set.contains('a + 2)  // => false
```

Other relevant changes include:
- Since this concept overlaps with the existing `semanticEquals` and `semanticHash`, those functions are also ported to this new infrastructure.
- A memoized `canonicalized` version of the expression is added as a `lazy val` to `Expression` and is used by both `semanticEquals` and `ExpressionSet`.
- A set of unit tests for `ExpressionSet` is added.
- Tests which expect `semanticEquals` to be less intelligent than it now is are updated.

As a followup, we should consider auditing the places where we do `O(n)` `semanticEquals` operations and replacing them with `ExpressionSet`. We should also consider consolidating `AttributeSet` as a specialized factory for an `ExpressionSet`.

Author: Michael Armbrust <michael@databricks.com>

Closes #11338 from marmbrus/expressionSet.
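
The idea described above can be illustrated with a small, self-contained toy model. This is only a sketch: `Expr`, `Attr`, `Lit`, `Add` and `ToyExpressionSet` are hypothetical names invented for this example and are not Spark's Catalyst classes. The sketch assumes that canonicalization only reorders the operands of a commutative `Add` by a stable key, which is far simpler than whatever rules the real implementation applies.

```scala
// Toy model only: hypothetical types, not Spark's Catalyst API.
sealed trait Expr {
  // Memoized canonical form: computed on first access, then cached per instance.
  lazy val canonicalized: Expr = this match {
    case Add(l, r) =>
      // Addition is commutative, so order the canonical operands by a stable key.
      val ops = Seq(l.canonicalized, r.canonicalized).sortBy(_.toString)
      Add(ops(0), ops(1))
    case other => other
  }

  // Two expressions are semantically equal when their canonical forms match.
  def semanticEquals(other: Expr): Boolean =
    canonicalized == other.canonicalized
}
case class Attr(name: String) extends Expr
case class Lit(value: Int) extends Expr
case class Add(left: Expr, right: Expr) extends Expr

// Keeps the first original expression seen for each canonical form.
class ToyExpressionSet(exprs: Seq[Expr]) {
  private val byCanonical =
    scala.collection.mutable.LinkedHashMap.empty[Expr, Expr]

  exprs.foreach { e =>
    if (!byCanonical.contains(e.canonicalized)) byCanonical(e.canonicalized) = e
  }

  // Hash lookup on the canonical form instead of an O(n) pairwise comparison.
  def contains(e: Expr): Boolean = byCanonical.contains(e.canonicalized)
  def iterator: Iterator[Expr] = byCanonical.valuesIterator
}

val a = Attr("a")
val toySet = new ToyExpressionSet(Seq(Add(a, Lit(1)), Add(Lit(1), a)))

toySet.iterator.toList                        // List(Add(Attr(a),Lit(1)))
toySet.contains(Add(Lit(1), a))               // true
toySet.contains(Add(a, Lit(2)))               // false
Add(a, Lit(1)).semanticEquals(Add(Lit(1), a)) // true
```

The toy also shows why false negatives remain possible: `Add(a, Add(Lit(1), Lit(1)))` and `Add(a, Lit(2))` always evaluate to the same value, yet their canonical forms differ, so this set would treat them as distinct.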
Diffstat (limited to 'python/pyspark/sql/dataframe.py')
0 files changed, 0 insertions, 0 deletions