[SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join

After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double-checked the code.

For example, users can perform an equi-join like ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
- There is a bug in 1.5 and 1.4: the code ignores the third parameter (the join type) that users pass. The join is always performed as `Inner`, even if the user specifies another type (e.g., `Outer`).
- After the PR https://github.com/apache/spark/pull/8600, 1.6 no longer has this issue, but the description was not updated.

I plan to submit another PR to fix 1.5 so that it issues an error message if users specify a non-inner join type when using an equi-join on column names.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10477 from gatorsmile/pyOuterJoin.
author: gatorsmile <gatorsmile@gmail.com> 2015-12-27 23:18:48 -0800
committer: Davies Liu <davies.liu@gmail.com> 2015-12-27 23:18:48 -0800
commit: 9ab296ecdceef88ebca523ed62848fbeb5df353b (patch)
tree: c6f011243273ebebc2246cff2851bedc6bb9c469 /python/pyspark/sql/dataframe.py
parent: 1e97813951674aa5419744b455a4c7340462ac59 (diff)
download: spark-9ab296ecdceef88ebca523ed62848fbeb5df353b.tar.gz
spark-9ab296ecdceef88ebca523ed62848fbeb5df353b.tar.bz2
spark-9ab296ecdceef88ebca523ed62848fbeb5df353b.zip
1 files changed, 4 insertions, 1 deletions
diff --git a/python/pyspark/sql/dataframe.py b/python/pyspark/sql/dataframe.py
index 4b3791e1b8..ad621df910 100644
--- a/python/pyspark/sql/dataframe.py
+++ b/python/pyspark/sql/dataframe.py
@@ -608,13 +608,16 @@ class DataFrame(object):
         :param on: a string for join column name, a list of column names,
             , a join expression (Column) or a list of Columns.
             If `on` is a string or a list of string indicating the name of the join column(s),
-            the column(s) must exist on both sides, and this performs an inner equi-join.
+            the column(s) must exist on both sides, and this performs an equi-join.
         :param how: str, default 'inner'.
             One of `inner`, `outer`, `left_outer`, `right_outer`, `leftsemi`.
 
         >>> df.join(df2, df.name == df2.name, 'outer').select(df.name, df2.height).collect()
         [Row(name=None, height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
 
+        >>> df.join(df2, 'name', 'outer').select('name', 'height').collect()
+        [Row(name=u'Tom', height=80), Row(name=u'Alice', height=None), Row(name=u'Bob', height=85)]
+
         >>> cond = [df.name == df3.name, df.age == df3.age]
         >>> df.join(df3, cond, 'outer').select(df.name, df3.age).collect()
         [Row(name=u'Bob', age=5), Row(name=u'Alice', age=2)]
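
The two `outer` doctests in this diff differ only in the first row: joining on the expression `df.name == df2.name` yields `name=None` for the unmatched right-hand row, while joining on the column name `'name'` yields `name=u'Tom'`. The following is a minimal pure-Python sketch of why, not Spark's implementation. It assumes sample rows inferred from the doctest outputs above, and the helper names (`outer_join_pairs`, `select_expression_style`, `select_named_column_style`) are hypothetical. Row order is not meaningful here and need not match Spark's.

```python
# Pure-Python sketch (no Spark required) of a full outer equi-join on one key.
# Assumption: these rows mirror the doctest data implied by the outputs in the diff.
left = [{"name": "Alice", "age": 2}, {"name": "Bob", "age": 5}]
right = [{"name": "Tom", "height": 80}, {"name": "Bob", "height": 85}]


def outer_join_pairs(left_rows, right_rows, key):
    """Pair rows for a full outer equi-join; an unmatched side is None."""
    pairs = []
    matched_right = set()
    for lrow in left_rows:
        hits = [i for i, rrow in enumerate(right_rows) if rrow[key] == lrow[key]]
        if hits:
            for i in hits:
                matched_right.add(i)
                pairs.append((lrow, right_rows[i]))
        else:
            pairs.append((lrow, None))  # left row with no partner
    for i, rrow in enumerate(right_rows):
        if i not in matched_right:
            pairs.append((None, rrow))  # right row with no partner
    return pairs


def select_expression_style(pairs):
    # Analogous to select(df.name, df2.height): name is read from the left side only,
    # so a row that exists only on the right has no name.
    return [
        (lrow["name"] if lrow is not None else None,
         rrow["height"] if rrow is not None else None)
        for lrow, rrow in pairs
    ]


def select_named_column_style(pairs, key="name"):
    # Analogous to select('name', 'height') after joining on 'name': the result has
    # a single key column, filled from whichever side is present.
    return [
        (lrow[key] if lrow is not None else rrow[key],
         rrow["height"] if rrow is not None else None)
        for lrow, rrow in pairs
    ]


pairs = outer_join_pairs(left, right, "name")
print(select_expression_style(pairs))    # [('Alice', None), ('Bob', 85), (None, 80)]
print(select_named_column_style(pairs))  # [('Alice', None), ('Bob', 85), ('Tom', 80)]
```

The second result matches the rows in the doctest added by this commit, which is why the docstring now says "equi-join" rather than "inner equi-join".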