PySpark's AFTSurvivalRegression
If the user doesn't specify `quantileProbs` in `setParams`, it is reset to the default value, so we don't need special handling here. vectorijk yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #9001 from mengxr/SPARK-10957.
Implement Python API for AFTSurvivalRegression
Author: vectorijk <jiangkai@gmail.com>
Closes #8926 from vectorijk/spark-10688.
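A minimal sketch of how the new Python API might be invoked (illustrative only; it assumes an existing `sqlContext`, and the values are made up):
```python
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.mllib.linalg import Vectors

# censor == 1.0 means the event was observed; 0.0 means it was censored.
training = sqlContext.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158))], ["label", "censor", "features"])

aft = AFTSurvivalRegression(censorCol="censor",
                            quantileProbabilities=[0.3, 0.6],
                            quantilesCol="quantiles")
model = aft.fit(training)
model.transform(training).show()
```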
Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases.
Author: asokadiggs <asoka.diggs@intel.com>
Closes #8930 from asokadiggs/jira-10782.
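For illustration, a small sketch of the two aliased spellings (assuming an existing `sqlContext`):
```python
df = sqlContext.createDataFrame([(1, "a"), (1, "a"), (2, "b")], ["id", "val"])
df.dropDuplicates().show()           # deduplicate whole rows
df.drop_duplicates(["id"]).show()    # alias of dropDuplicates, optional subset
```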
Add method to easily convert a StatCounter instance into a Python dict
https://issues.apache.org/jira/browse/SPARK-6919
Note: This is my original work and the existing Spark license applies.
Author: Erik Shilts <erik.shilts@opower.com>
Closes #5516 from eshilts/statcounter-asdict.
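A quick sketch of the added method (assuming an existing SparkContext `sc`):
```python
stats = sc.parallelize([1.0, 2.0, 3.0, 4.0]).stats()  # returns a StatCounter
d = stats.asDict()  # e.g. keys: count, mean, sum, min, max, stdev, variance
```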
These are CSS/JavaScript changes to make navigation in the PySpark API a bit simpler by adding the following to the sidebar:
* Classes
* Functions
* Tags to highlight experimental features
![screen shot 2015-09-02 at 08 50 12](https://cloud.githubusercontent.com/assets/11915197/9634781/301f853a-518b-11e5-8d5c-fda202f6202f.png)
Online example here: https://dl.dropboxusercontent.com/u/20821334/pyspark-api-nav-enhance/pyspark.mllib.html
(The contribution is my original work and I license the work to the project under the project's open source license.)
Author: noelsmith <mail@noelsmith.com>
Closes #8571 from noel-smith/pyspark-api-nav-enhance.
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
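As a rough Python-side illustration of the `:` interaction syntax this enables (hypothetical DataFrame `df` with columns `y`, `a`, `b`):
```python
from pyspark.ml.feature import RFormula

# 'a:b' is an interaction term, as in R formulas.
formula = RFormula(formula="y ~ a + b + a:b",
                   featuresCol="features", labelCol="label")
output = formula.fit(df).transform(df)
```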
Python DataFrame.
Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).
This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.
Author: Reynold Xin <rxin@databricks.com>
Closes #8876 from rxin/SPARK-10731.
usingColumns
JIRA: https://issues.apache.org/jira/browse/SPARK-10446
Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8600 from viirya/usingcolumns_df.
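A sketch of what the using-columns join with an explicit join type looks like through the Python API (hypothetical DataFrames `df1`, `df2` sharing an `id` column):
```python
# Join on the common column, keeping unmatched left-side rows.
joined = df1.join(df2, ["id"], "left_outer")
```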
https://issues.apache.org/jira/browse/SPARK-10577
Author: Jian Feng <jzhang.chs@gmail.com>
Closes #8801 from Jianfeng-chs/master.
on OS X due to hidden file
Remove the ._SUCCESS.crc hidden file, which may cause problems in the distribution tar archive and is not used
Author: Sean Owen <sowen@cloudera.com>
Closes #8846 from srowen/SPARK-10716.
from the issue:
In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
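With this change, a sketch of the single-shuffle version in Python (the partition function returns a hash that Spark takes modulo the partition count; `get_file_type` is hypothetical):
```python
def filetype_hash(key):
    # Custom hash; Spark applies `% numPartitions` to its result.
    return hash(key)

counts = (weblogs.map(lambda s: (get_file_type(s), 1))
                 .reduceByKey(lambda v1, v2: v1 + v2, 3, filetype_hash))
```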
Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs, i.e.:
![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)
... which looks like this once the newline is added:
![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)
Author: noelsmith <mail@noelsmith.com>
Closes #8851 from noel-smith/docstring-missing-newline-fix.
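For reference, a docstring sketch with the blank line that Sphinx needs before the field list:
```python
def transform(self, value):
    """Apply the transformation to ``value``.

    :param value: the input to transform
    :return: the transformed value
    """
    return value
```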
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
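A minimal sketch of the new Python API (assuming an existing `sqlContext`):
```python
from pyspark.ml.feature import CountVectorizer

df = sqlContext.createDataFrame(
    [(["a", "b", "c"],), (["a", "b", "b", "c", "a"],)], ["raw"])
cv = CountVectorizer(inputCol="raw", outputCol="vectors",
                     vocabSize=3, minDF=1.0)
model = cv.fit(df)           # fits a CountVectorizerModel
model.vocabulary             # learned vocabulary
model.transform(df).show()   # term-frequency vectors
```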
There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.
Author: vinodkc <vinod.kc.in@gmail.com>
Closes #8834 from vinodkc/fix_SPARK-10631.
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
As ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8814 from yanboliang/spark-10615.
JIRA: https://issues.apache.org/jira/browse/SPARK-10642
When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8796 from viirya/fix-pyrdd-lookup.
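Sketch of the failing call (assuming an existing SparkContext `sc`):
```python
rdd = sc.parallelize([((1, 2), "a"), ((3, 4), "b")])
rdd.lookup((1, 2))  # previously raised the ClassCastException; returns ['a']
```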
pyspark.ml.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8692 from yu-iskw/SPARK-10282.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8665 from yu-iskw/SPARK-10274.
pyspark.mllib.util
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8689 from yu-iskw/SPARK-10279.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8685 from yu-iskw/SPARK-10278.
pyspark.ml.clustering
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8691 from yu-iskw/SPARK-10281.
pyspark.ml.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8693 from yu-iskw/SPARK-10283.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8694 from yu-iskw/SPARK-10284.
pyspark.mllib.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8677 from yu-iskw/SPARK-10276.
Author: Vinod K C <vinod.kc@huawei.com>
Closes #8682 from vinodkc/fix_SPARK-10516.
Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
Author: noelsmith <mail@noelsmith.com>
Closes #8773 from noel-smith/mllib-random-versionadded-fix.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8666 from yu-iskw/SPARK-10275.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
Author: noelsmith <mail@noelsmith.com>
Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
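Roughly, usage looks like this (`MyAlgo` is a hypothetical class):
```python
from pyspark import since

class MyAlgo(object):
    @since("1.6.0")
    def fit(self, dataset):
        """Fits the model; the decorator appends a versionadded note."""
        return self
```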
__eq__ and __hash__ correctly
The PySpark DenseVector and SparseVector ```__eq__``` methods should use semantic equality, so a DenseVector can be compared with a SparseVector.
Implement the PySpark DenseVector and SparseVector ```__hash__``` methods based on the first 16 entries. That makes PySpark Vector objects usable in collections.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8166 from yanboliang/spark-9793.
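A short sketch of the resulting semantics:
```python
from pyspark.mllib.linalg import DenseVector, SparseVector

dv = DenseVector([0.0, 1.0])
sv = SparseVector(2, {1: 1.0})
dv == sv       # True: semantic equality across vector types
len({dv, sv})  # 1: consistent __hash__ lets vectors live in sets/dicts
```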
Author: Davies Liu <davies@databricks.com>
Closes #8707 from davies/fix_namedtuple.
in Python
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8457 from yanboliang/spark-10194.
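A sketch of the parameter in use (assuming an RDD of LabeledPoints `training_rdd`):
```python
from pyspark.mllib.classification import LogisticRegressionWithSGD

# Training may stop before `iterations` once updates fall below the tolerance.
model = LogisticRegressionWithSGD.train(training_rdd, iterations=100,
                                        step=1.0, convergenceTol=0.001)
```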
Adding STDDEV support for DataFrame using a 1-pass online/parallel algorithm to compute variance. Please review the code change.
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
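From the Python side, usage might look like this (hypothetical DataFrame `df` with columns `key`, `value`):
```python
from pyspark.sql import functions as F

df.agg(F.stddev("value")).show()                  # overall
df.groupBy("key").agg(F.stddev("value")).show()   # per group
```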
This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014)
Added functionality: the `Column` object in Python now supports the exponentiation operator `**`
Example:
```
from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3**df.a,df.a**3,df.a**df.a).collect()
```
Outputs:
```
[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8658 from 0x0FFF/SPARK-9014.
Just fixing a typo in the exception message raised when attempting to pickle SparkContext.
Author: Icaro Medeiros <icaro.medeiros@gmail.com>
Closes #8724 from icaromedeiros/master.
jira: https://issues.apache.org/jira/browse/SPARK-8530
add python API for MinMaxScaler
jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7150 from hhbyyh/pythonMinMax.
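A minimal sketch of the new API (assuming an existing `sqlContext`):
```python
from pyspark.ml.feature import MinMaxScaler
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["features"])
scaler = MinMaxScaler(inputCol="features", outputCol="scaled",
                      min=0.0, max=1.0)
model = scaler.fit(df)        # learns per-feature min/max
model.transform(df).show()    # features rescaled into [min, max]
```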
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8679 from jkbradley/doc-fixes-1.5.
Add Python API for ```MultilayerPerceptronClassifier```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8067 from yanboliang/SPARK-9773.
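Roughly (hypothetical DataFrames `train_df`/`test_df` with label/features columns):
```python
from pyspark.ml.classification import MultilayerPerceptronClassifier

# layers: input size, hidden layer sizes, output size (number of classes)
mlp = MultilayerPerceptronClassifier(layers=[4, 5, 4, 3],
                                     blockSize=128, maxIter=100, seed=123)
model = mlp.fit(train_df)
result = model.transform(test_df)
```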
PySpark
LinearRegression and LogisticRegression lack some Params on the Python side, and some Params are not in shared classes, which means we need to write them for each class. These kinds of Params are listed here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them in shared params on the Python side and make the LinearRegression/LogisticRegression parameters on par with the Scala ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8508 from yanboliang/spark-10026.
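After this change, a sketch of the parameters now settable from Python:
```python
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.regression import LinearRegression

lr = LogisticRegression(elasticNetParam=0.5, fitIntercept=True,
                        standardization=True)
lin = LinearRegression(elasticNetParam=0.8, fitIntercept=False)
```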
Missing methods of ml.feature are listed here:
```StringIndexer``` lacks the parameter ```handleInvalid```.
```StringIndexerModel``` lacks the method ```labels```.
```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8313 from yanboliang/spark-10027.
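Sketch of the newly exposed pieces (hypothetical DataFrame `df`):
```python
from pyspark.ml.feature import StringIndexer, VectorIndexer

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                        handleInvalid="skip")    # new parameter
si_model = indexer.fit(df)
si_model.labels                                  # newly exposed

vi_model = VectorIndexer(inputCol="features", outputCol="indexed",
                         maxCategories=4).fit(df)
vi_model.numFeatures                             # newly exposed
vi_model.categoryMaps                            # newly exposed
```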
pyspark.sql.types.Row implements ```__getitem__```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8333 from yanboliang/spark-7544.
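A quick illustration:
```python
from pyspark.sql import Row

row = Row(name="Alice", age=11)
row["name"]   # 'Alice' -- equivalent to row.name
```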
Add Python API for ml.feature.VectorSlicer.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8102 from yanboliang/SPARK-9772.
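A minimal sketch (hypothetical DataFrame `df` with a vector column `features`):
```python
from pyspark.ml.feature import VectorSlicer

slicer = VectorSlicer(inputCol="features", outputCol="sliced",
                      indices=[1, 2])   # keep features 1 and 2
output = slicer.transform(df)
```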
Adds IndexToString to PySpark.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
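Roughly (hypothetical `indexed_df` produced by a StringIndexer):
```python
from pyspark.ml.feature import IndexToString

# Maps indexed labels back to the original strings.
converter = IndexToString(inputCol="categoryIndex",
                          outputCol="originalCategory")
converted = converter.transform(indexed_df)
```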
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
Author: noelsmith <mail@noelsmith.com>
Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
cc mengxr
Author: Davies Liu <davies@databricks.com>
Closes #8657 from davies/move_since.
guides and python docs
- Fixed information around Python API tags in streaming programming guides
- Added missing content to the Python docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8595 from tdas/SPARK-10440.
The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. In fact, it has `__getitem__` to address the case when the column might be a list or dict, so that you can access a certain element of it in the DataFrame API. The ability to iterate over it is just a side effect that might cause confusion for people getting familiar with Spark DataFrames (as you might iterate this way over a Pandas DataFrame, for instance).
Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8574 from 0x0FFF/SPARK-10417.
This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement
Issue reproduction on master:
```
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8556 from 0x0FFF/SPARK-10392.
function
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
* Timezone information of this datetime is ignored
* The datetime is assumed to be in the local timezone, which depends on the OS timezone setting
The fix includes both a code change and a regression test. Problem reproduction code on master:
```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
```
It gives the same timestamp ignoring time zone:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
```
After the fix:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
Scan PhysicalRDD[dt#0]
```
PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when I dropped the repo
Author: 0x0FFF <programmerag@gmail.com>
Closes #8555 from 0x0FFF/SPARK-10162.
Add a python API for the Stop Words Remover.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
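A minimal sketch of the added API (assuming an existing `sqlContext`):
```python
from pyspark.ml.feature import StopWordsRemover

df = sqlContext.createDataFrame(
    [(["I", "saw", "the", "red", "balloon"],)], ["raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(df).show()   # stop words like 'I' and 'the' removed
```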