author     Reynold Xin <rxin@databricks.com>  2015-02-02 19:01:47 -0800
committer  Reynold Xin <rxin@databricks.com>  2015-02-02 19:01:47 -0800
commit     554403fd913685da879cf6a280c58a9fad19448a (patch)
tree       b3a63382e7385fa1480b54707b348b0bde02190d /python/pyspark/tests.py
parent     eccb9fbb2d1bf6f7c65fb4f017e9205bb3034ec6 (diff)
[SQL] Improve DataFrame API error reporting
1. Throw UnsupportedOperationException if a Column is not computable.
2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run).
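The benefit of eager analysis can be sketched in plain Python. This is a hypothetical, simplified model of the idea (the class names, toy schema check, and `select`/`collect` methods here are illustrative only, not Spark's actual implementation): with lazy analysis a bad column reference survives every transformation and only blows up at action time, far from the mistake; with eager analysis it fails at the call site.

```python
# Toy model of lazy vs. eager analysis (illustrative only, not Spark code).

class LazyFrame:
    """Analysis deferred: a bad column name only fails at collect()."""
    def __init__(self, schema, plan=None):
        self.schema = schema          # set of known column names
        self.plan = plan or []        # columns selected so far

    def select(self, col):
        # No validation here -- the error is postponed to collect().
        return type(self)(self.schema, self.plan + [col])

    def collect(self):
        # Analysis happens only when an action runs.
        for col in self.plan:
            if col not in self.schema:
                raise ValueError("unresolved column: %s" % col)
        return self.plan


class EagerFrame(LazyFrame):
    """Analysis runs at transformation time, as in this patch."""
    def select(self, col):
        # Fail fast, at the line where the mistake was made.
        if col not in self.schema:
            raise ValueError("unresolved column: %s" % col)
        return EagerFrame(self.schema, self.plan + [col])
```

With `LazyFrame`, `select("ky")` on a frame whose schema is `{"key", "value"}` succeeds silently and only `collect()` raises; with `EagerFrame`, the same typo raises immediately at the `select` call, which is the behavior this commit moves the DataFrame API toward.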
Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #4296 from rxin/col-computability and squashes the following commits:
6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability
fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability
f79034c [Davies Liu] fix python tests
5afe1ff [Reynold Xin] Fix scala test.
17f6bae [Reynold Xin] Various fixes.
b932e86 [Reynold Xin] Added eager analysis for error reporting.
e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
Diffstat (limited to 'python/pyspark/tests.py')

-rw-r--r--  python/pyspark/tests.py | 6
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/python/pyspark/tests.py b/python/pyspark/tests.py
index bec1961f26..fef6c92875 100644
--- a/python/pyspark/tests.py
+++ b/python/pyspark/tests.py
@@ -1029,9 +1029,11 @@ class SQLTests(ReusedPySparkTestCase):
         g = df.groupBy()
         self.assertEqual([99, 100], sorted(g.agg({'key': 'max', 'value': 'count'}).collect()[0]))
         self.assertEqual([Row(**{"AVG(key#0)": 49.5})], g.mean().collect())
-        # TODO(davies): fix aggregators
+        from pyspark.sql import Aggregator as Agg
-        # self.assertEqual((0, '100'), tuple(g.agg(Agg.first(df.key), Agg.last(df.value)).first()))
+        self.assertEqual((0, u'99'), tuple(g.agg(Agg.first(df.key), Agg.last(df.value)).first()))
+        self.assertTrue(95 < g.agg(Agg.approxCountDistinct(df.key)).first()[0])
+        self.assertEqual(100, g.agg(Agg.countDistinct(df.value)).first()[0])

     def test_help_command(self):
         # Regression test for SPARK-5464