| Commit message | Author | Age | Files | Lines |
|
astype/drop_duplicates.
## What changes were proposed in this pull request?
We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but instead use their aliases). This patch simply removes all examples for these functions and notes that they are aliases.
## How was this patch tested?
Existing PySpark unit tests.
Closes #11543.
Author: Reynold Xin <rxin@databricks.com>
Closes #11698 from rxin/SPARK-10380.
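For reference, a minimal PySpark sketch of the alias relationships the docs now simply point to (assuming a running `sqlContext`, as in the other examples in this log):
```
from pyspark.sql import Row

df = sqlContext.createDataFrame([Row(age=2), Row(age=2), Row(age=5)])

# astype is an alias for Column.cast
df.select(df.age.cast("string"), df.age.astype("string")).collect()

# drop_duplicates is an alias for DataFrame.dropDuplicates
df.dropDuplicates().collect()
df.drop_duplicates().collect()
```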
|
This PR introduces several major changes:
1. Replacing `Expression.prettyString` with `Expression.sql`
The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
1. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
Before, we were using `prettyString` as column names when possible, and the resulting column names could sometimes be weird. Here are several examples:
Expression | `prettyString` | `sql` | Note
------------------ | -------------- | ---------- | ---------------
`a && b` | `a && b` | `a AND b` |
`a.getField("f")` | `a[f]` | `a.f` | `a` is a struct
1. Adding trait `NonSQLExpression`, which extends `Expression`, for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
`NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
Author: Cheng Lian <lian@databricks.com>
Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
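As a user-visible illustration of change 2 (a sketch; the exact generated names, including parenthesization, may vary across versions):
```
df = sqlContext.createDataFrame([(True, False)], ["a", "b"])

# The unnamed expression's generated column name now follows the SQL form
# (roughly "(a AND b)") rather than the old prettyString form ("a && b").
print(df.select(df.a & df.b).columns)
```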
|
"conditions" and "values"
This pull request rewrites the CaseWhen expression to break the single, monolithic `branches` field into a sequence of tuples (`Seq[(condition, value)]`) and an explicit optional `elseValue` field.
Prior to this pull request, each even position in `branches` represented the condition for a branch, and each odd position represented its value. Using them was pretty confusing, requiring a lot of sliding-window or `grouped(2)` calls.
Author: Reynold Xin <rxin@databricks.com>
Closes #10734 from rxin/simplify-case.
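For intuition, the user-facing `when`/`otherwise` API maps directly onto the new structure: each `when` call contributes one (condition, value) pair and `otherwise` supplies the optional else value. A PySpark sketch with illustrative data:
```
from pyspark.sql.functions import when, col

df = sqlContext.createDataFrame([(2,), (1,), (0,)], ["a"])

# Two (condition, value) branches plus an explicit else value.
expr = (when(col("a") > 1, "big")
        .when(col("a") > 0, "small")
        .otherwise("non-positive"))
df.select(expr.alias("size")).collect()
```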
|
Author: Reynold Xin <rxin@databricks.com>
Closes #10559 from rxin/remove-deprecated-sql.
|
They should use the existing SQLContext.
Author: Davies Liu <davies@databricks.com>
Closes #9914 from davies/create_udf.
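A usage sketch of the affected path (UDF name and data are illustrative): creating a UDF like this should now pick up the already-active SQLContext rather than constructing a new one:
```
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

df = sqlContext.createDataFrame([(1,), (2,)], ["x"])

# Defining the UDF reuses the running SQLContext behind the scenes.
plus_one = udf(lambda v: v + 1, IntegerType())
df.select(plus_one(df.x)).collect()
```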
|
This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014).
Added functionality: `Column` object in Python now supports exponential operator `**`
Example:
```
from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3 ** df.a, df.a ** 3, df.a ** df.a).collect()
```
Outputs:
```
[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8658 from 0x0FFF/SPARK-9014.
|
cc mengxr
Author: Davies Liu <davies@databricks.com>
Closes #8657 from davies/move_since.
|
`pyspark.sql.column.Column` objects have a `__getitem__` method, which makes them iterable in Python. In fact, `__getitem__` exists to handle the case where the column holds a list or dict, so that you can access individual elements of it through the DataFrame API. The ability to iterate over a Column is just a side effect, and one that can confuse people getting familiar with Spark DataFrames (since you might iterate this way over a Pandas DataFrame, for instance).
Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print(i)
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8574 from 0x0FFF/SPARK-10417.
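A minimal sketch of the kind of fix this implies (the actual patch may differ): define `__iter__` to fail loudly, so Python stops falling back to `__getitem__`-based iteration:
```
class Column(object):
    def __getitem__(self, k):
        pass  # element access for list/dict columns stays supported

    def __iter__(self):
        raise TypeError("Column is not iterable")
```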
|
to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
|
Inspiration drawn from this blog post: https://lab.getbase.com/pandarize-spark-dataframes/
Author: Reynold Xin <rxin@databricks.com>
Closes #7977 from rxin/isin and squashes the following commits:
9b1d3d6 [Reynold Xin] Added return.
2197d37 [Reynold Xin] Fixed test case.
7c1b6cf [Reynold Xin] Import warnings.
4f4a35d [Reynold Xin] [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.
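A short usage sketch of the renamed method (data is illustrative):
```
from pyspark.sql import Row

df = sqlContext.createDataFrame(
    [Row(name="Alice"), Row(name="Bob"), Row(name="Tom")])

# isin replaces the old inSet, matching the Pandas method name.
df[df.name.isin("Alice", "Bob")].collect()
```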
|
in boolean expression
It's a common mistake for users to put a Column in a boolean expression (together with `and`, `or`), which does not work as expected. We should raise an exception in that case and suggest that users use `&` and `|` instead.
Author: Davies Liu <davies@databricks.com>
Closes #6961 from davies/column_bool and squashes the following commits:
9f19beb [Davies Liu] update message
af74bd6 [Davies Liu] fix tests
07dff84 [Davies Liu] address comments, fix tests
f70c08e [Davies Liu] raise Exception if column is used in booelan expression
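A sketch of the two paths (the exact exception type and message come from the patch; the data here is illustrative):
```
df = sqlContext.createDataFrame([(25, "Alice")], ["age", "name"])

# Bitwise operators build element-wise Column expressions:
df.filter((df.age > 21) & (df.name == "Alice")).collect()

# `and` forces Python truth-testing on a Column, which now raises
# an exception instead of silently evaluating only one operand:
df.filter(df.age > 21 and df.name == "Alice")  # raises
```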
|
Thanks ogirardot, closes #6580
cc rxin JoshRosen
Author: Davies Liu <davies@databricks.com>
Closes #6590 from davies/when and squashes the following commits:
c0f2069 [Davies Liu] fix Column.when() and otherwise()
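For context, a usage sketch of the two methods being fixed:
```
from pyspark.sql.functions import when

df = sqlContext.createDataFrame([(2,), (5,)], ["age"])

# when() returns a Column whose .when() and .otherwise() are the
# methods fixed here.
df.select(when(df.age == 2, "two").when(df.age == 5, "five")
          .otherwise("other").alias("label")).collect()
```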
|
updates
1. ntile should take an integer as its parameter.
2. Added Python API (based on #6364)
3. Update documentation of various DataFrame Python functions.
Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>
Closes #6374 from rxin/window-final and squashes the following commits:
69004c7 [Reynold Xin] Style fix.
288cea9 [Reynold Xin] Update documentaiton.
7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
66092b4 [Davies Liu] update docs
ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
8936ade [Davies Liu] fix maxint in python 3
2649358 [Davies Liu] update docs
778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
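A usage sketch of the Python window-function API (column names and data are illustrative):
```
from pyspark.sql import Window
from pyspark.sql.functions import ntile, rank

df = sqlContext.createDataFrame(
    [("a", 1, 100), ("a", 2, 200), ("b", 3, 150)], ["dept", "id", "salary"])

# ntile takes an integer (point 1 above); window functions are applied
# over a window specification with over().
w = Window.partitionBy("dept").orderBy("salary")
df.select(df.id,
          rank().over(w).alias("rank"),
          ntile(4).over(w).alias("quartile")).collect()
```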
|
Author: kaka1992 <kaka_1992@163.com>
Closes #6313 from kaka1992/astype and squashes the following commits:
73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
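A usage sketch of the Pandas-style cast added here:
```
df = sqlContext.createDataFrame([(2,), (5,)], ["age"])

# astype mirrors the Pandas name and behaves like Column.cast.
df.select(df.age.astype("string")).collect()
```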
|
Add version info for public Python SQL API.
cc rxin
Author: Davies Liu <davies@databricks.com>
Closes #6295 from davies/versions and squashes the following commits:
cfd91e6 [Davies Liu] add more version for DataFrame API
600834d [Davies Liu] add version to SQL API docs
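A minimal sketch of how such version annotations can be attached via a decorator (an assumption for illustration; not necessarily the patch's exact code):
```
def since(version):
    """Append a `.. versionadded::` note to the decorated function's docstring."""
    def deco(f):
        f.__doc__ = (f.__doc__ or "") + "\n\n.. versionadded:: %s" % version
        return f
    return deco

@since(1.3)
def select(self, *cols):
    """Projects a set of expressions and returns a new DataFrame."""
```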
|
dataframe.py was split into column.py, group.py, and dataframe.py:
```
360 column.py
1223 dataframe.py
183 group.py
```
Author: Davies Liu <davies@databricks.com>
Closes #6201 from davies/split_df and squashes the following commits:
fc8f5ab [Davies Liu] split dataframe.py into multiple files
|