commit 926a93e54b83f1ee596096f3301fef015705b627
Author:    Reynold Xin <rxin@databricks.com>  2016-03-22 23:43:09 -0700
Committer: Reynold Xin <rxin@databricks.com>  2016-03-22 23:43:09 -0700
tree   97817dcf1069bcc8f148f996873bef5bb6643126 /python
parent 1a22cf1e9b6447005c9a329856d734d80a496a06
[SPARK-14088][SQL] Some Dataset API touch-up
## What changes were proposed in this pull request?
1. Deprecated `unionAll`. It is confusing to have both `union` and `unionAll` when the two do the same thing in Spark but differ in SQL.
2. Renamed `reduce` in `KeyValueGroupedDataset` to `reduceGroups` so it is more consistent with the rest of the functions in `KeyValueGroupedDataset`. This also makes it more obvious what `reduce` and `reduceGroups` mean; previously it was ambiguous whether a call was reducing a whole Dataset or just reducing each group.
3. Added a `name` function, which is a more natural way to name columns than `as` for non-SQL users.
4. Removed the `subtract` function, since it is just an alias for `except`.
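The semantics behind point 1 can be sketched in plain Python, without Spark: Spark's `union` behaves like SQL's `UNION ALL` (bag concatenation, duplicates kept), while SQL's `UNION` additionally deduplicates, which in Spark you get by chaining `.distinct()`. The helper names below are illustrative only, not part of any Spark API.

```python
# Illustrative model of the union semantics; rows are tuples in plain lists.

def union_all(left, right):
    """SQL UNION ALL / Spark's union: concatenation, duplicates kept."""
    return left + right

def union_distinct(left, right):
    """SQL UNION: concatenation followed by deduplication,
    i.e. the Spark idiom df1.union(df2).distinct()."""
    seen, out = set(), []
    for row in union_all(left, right):
        if row not in seen:
            seen.add(row)
            out.append(row)
    return out

a = [("alice", 1), ("bob", 2)]
b = [("bob", 2), ("carol", 3)]

print(union_all(a, b))       # 4 rows; ("bob", 2) appears twice
print(union_distinct(a, b))  # 3 rows; duplicates removed
```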
## How was this patch tested?
All changes should be covered by existing tests. Also added a couple of test cases to cover `name`.
Author: Reynold Xin <rxin@databricks.com>
Closes #11908 from rxin/SPARK-14088.
Diffstat (limited to 'python')
 python/pyspark/sql/column.py    |  2 ++
 python/pyspark/sql/dataframe.py | 14 ++++++++++++--
 2 files changed, 14 insertions(+), 2 deletions(-)
```diff
diff --git a/python/pyspark/sql/column.py b/python/pyspark/sql/column.py
index 19ec6fcc5d..43e9baece2 100644
--- a/python/pyspark/sql/column.py
+++ b/python/pyspark/sql/column.py
@@ -315,6 +315,8 @@ class Column(object):
         sc = SparkContext._active_spark_context
         return Column(getattr(self._jc, "as")(_to_seq(sc, list(alias))))

+    name = copy_func(alias, sinceversion=2.0, doc=":func:`name` is an alias for :func:`alias`.")
+
     @ignore_unicode_prefix
     @since(1.3)
     def cast(self, dataType):
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 7e1854c43b..5cfc348a69 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -911,14 +911,24 @@ class DataFrame(object):
         """
         return self.groupBy().agg(*exprs)

+    @since(2.0)
+    def union(self, other):
+        """ Return a new :class:`DataFrame` containing union of rows in this
+        frame and another frame.
+
+        This is equivalent to `UNION ALL` in SQL. To do a SQL-style set union
+        (that does deduplication of elements), use this function followed by a distinct.
+        """
+        return DataFrame(self._jdf.unionAll(other._jdf), self.sql_ctx)
+
     @since(1.3)
     def unionAll(self, other):
         """ Return a new :class:`DataFrame` containing union of rows in this
         frame and another frame.

-        This is equivalent to `UNION ALL` in SQL.
+        .. note:: Deprecated in 2.0, use union instead.
         """
-        return DataFrame(self._jdf.unionAll(other._jdf), self.sql_ctx)
+        return self.union(other)

     @since(1.3)
     def intersect(self, other):
```
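The `name = copy_func(alias, ...)` line in the diff creates a method alias that carries its own docstring. `copy_func` is a PySpark-internal helper whose implementation is not shown here; the sketch below is a hypothetical standalone version of that pattern using only the standard library, with a toy `Column` class to show how the alias binds as a regular method.

```python
import types

def copy_func(f, doc=None):
    """Hypothetical sketch of PySpark's internal helper: return a copy of
    function f so the alias can have its own docstring without mutating f."""
    g = types.FunctionType(f.__code__, f.__globals__, name=f.__name__,
                           argdefs=f.__defaults__, closure=f.__closure__)
    g.__doc__ = doc if doc is not None else f.__doc__
    return g

class Column:
    def alias(self, new_name):
        """Return this column renamed to new_name (toy implementation)."""
        return ("renamed", new_name)

    # The alias shares alias's code but gets its own doc, as in the diff.
    name = copy_func(alias, doc="name() is an alias for alias().")

c = Column()
print(c.name("years"))          # behaves exactly like c.alias("years")
print(Column.name.__doc__)      # the alias-specific docstring
```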