| Commit message | Author | Age | Files | Lines |
[SPARK-10446][SQL] Support to specify join type when calling join with usingColumns
JIRA: https://issues.apache.org/jira/browse/SPARK-10446
Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It would be more convenient if it also supported the other join types.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8600 from viirya/usingcolumns_df.
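For illustration, the equivalent call from PySpark after this change might look like the following sketch (assuming a shell with `sqlContext` predefined; the `id` column is just an example):
```python
left = sqlContext.createDataFrame([(1, "a"), (2, "b")], ["id", "l"])
right = sqlContext.createDataFrame([(1, "x")], ["id", "r"])
# Previously a usingColumns join was always an inner join; now the join
# type can be passed explicitly as a third argument.
left.join(right, ["id"], "left_outer").show()
```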
https://issues.apache.org/jira/browse/SPARK-10577
Author: Jian Feng <jzhang.chs@gmail.com>
Closes #8801 from Jianfeng-chs/master.
[SPARK-10716] [BUILD] Distribution tar archive fails to uncompress on OS X due to hidden file
Remove the ._SUCCESS.crc hidden file, which may cause problems in the distribution tar archive and is not used.
Author: Sean Owen <sowen@cloudera.com>
Closes #8846 from srowen/SPARK-10716.
[SPARK-9821] [PYSPARK] reduceByKey should take a custom partitioner
From the issue:
In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
```
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(), _ + _)
```
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey, like so:
```
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3, hash_filetype).reduceByKey(lambda v1, v2: v1 + v2).collect()
```
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
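With this change, a custom partitioning function can be passed straight to `reduceByKey`, giving a single shuffle. A sketch, where `weblogs` is the RDD from the issue and `getFileType`/`hash_filetype` are the hypothetical helpers from the issue, stubbed here:
```python
def getFileType(s):
    # hypothetical stand-in: extract a file-type key from a log line
    return s.rsplit(".", 1)[-1]

def hash_filetype(key):
    # hypothetical custom partitioning function
    return hash(key)

counts = weblogs.map(lambda s: (getFileType(s), 1)) \
                .reduceByKey(lambda v1, v2: v1 + v2,
                             numPartitions=3, partitionFunc=hash_filetype) \
                .collect()
```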
Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs, i.e. this:
![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)
... looks like this once the newline is added:
![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)
Author: noelsmith <mail@noelsmith.com>
Closes #8851 from noel-smith/docstring-missing-newline-fix.
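For reference, a minimal sketch of the pattern the fix enforces:
```python
def train(data, iterations):
    """Trains a model.

    :param data: the training RDD
    :param iterations: number of passes over the data
    :return: the fitted model
    """
    # The blank line after the summary sentence is what this commit adds:
    # without it, Sphinx runs the summary and the :param: list together.
```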
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
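A minimal usage sketch of the new API (PySpark shell with `sqlContext` predefined; the data is illustrative):
```python
from pyspark.ml.feature import CountVectorizer

df = sqlContext.createDataFrame([(["a", "b", "b"],), (["a", "c"],)], ["words"])
cv = CountVectorizer(inputCol="words", outputCol="features", vocabSize=3)
model = cv.fit(df)            # learns the vocabulary from the corpus
model.transform(df).show(truncate=False)
print(model.vocabulary)       # e.g. ['a', 'b', 'c']
```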
There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.
Author: vinodkc <vinod.kc.in@gmail.com>
Closes #8834 from vinodkc/fix_SPARK-10631.
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
As ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8814 from yanboliang/spark-10615.
JIRA: https://issues.apache.org/jira/browse/SPARK-10642
When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8796 from viirya/fix-pyrdd-lookup.
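A minimal reproduction of the now-fixed case (PySpark shell, `sc` predefined):
```python
rdd = sc.parallelize([((1, 2), "a"), ((3, 4), "b")])
# portable_hash on a tuple key could yield a Python long, which the JVM
# side failed to cast to an Integer; after the fix the lookup just works.
print(rdd.lookup((1, 2)))   # ['a']
```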
Add @since annotation to pyspark.ml.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8692 from yu-iskw/SPARK-10282.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8665 from yu-iskw/SPARK-10274.
Add @since annotation to pyspark.mllib.util
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8689 from yu-iskw/SPARK-10279.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8685 from yu-iskw/SPARK-10278.
Add @since annotation to pyspark.ml.clustering
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8691 from yu-iskw/SPARK-10281.
Add @since annotation to pyspark.ml.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8693 from yu-iskw/SPARK-10283.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8694 from yu-iskw/SPARK-10284.
Add @since annotation to pyspark.mllib.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8677 from yu-iskw/SPARK-10276.
Author: Vinod K C <vinod.kc@huawei.com>
Closes #8682 from vinodkc/fix_SPARK-10516.
Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
Author: noelsmith <mail@noelsmith.com>
Closes #8773 from noel-smith/mllib-random-versionadded-fix.
Add @since annotation to pyspark.mllib.random (SPARK-10275)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8666 from yu-iskw/SPARK-10275.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
Author: noelsmith <mail@noelsmith.com>
Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
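For readers unfamiliar with the decorator, a minimal sketch of the idea (not the exact PySpark implementation):
```python
def since(version):
    """Append a Sphinx ``.. versionadded::`` note to a function's docstring."""
    def deco(f):
        doc = f.__doc__ or ""   # the tweak: tolerate missing docstrings
        f.__doc__ = doc.rstrip() + "\n\n.. versionadded:: %s" % version
        return f
    return deco

@since("1.5.0")
def transform(vector):
    """Apply the transformation."""
    return vector
```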
PySpark DenseVector and SparseVector should implement __eq__ and __hash__ correctly
The PySpark DenseVector and SparseVector ```__eq__``` methods should use semantic equality, and a DenseVector should be comparable with a SparseVector.
Implement the PySpark DenseVector and SparseVector ```__hash__``` methods based on the first 16 entries, so that PySpark Vector objects can be used in collections.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8166 from yanboliang/spark-9793.
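A sketch of the intended semantics (equality by value, not by representation):
```python
from pyspark.mllib.linalg import DenseVector, SparseVector

dv = DenseVector([1.0, 0.0, 3.0])
sv = SparseVector(3, {0: 1.0, 2: 3.0})
print(dv == sv)              # True: element-wise (semantic) equality
print(hash(dv) == hash(sv))  # True: equal vectors hash equally
print(len({dv, sv}))         # 1: vectors are now usable in sets/dicts
```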
Author: Davies Liu <davies@databricks.com>
Closes #8707 from davies/fix_namedtuple.
[SPARK-10194] SGD-based algorithms need a convergenceTol parameter in Python
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python as well; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases, since the default changed).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8457 from yanboliang/spark-10194.
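A usage sketch, assuming the Python keyword mirrors the Scala parameter name (PySpark shell, `sc` predefined; the data is illustrative):
```python
from pyspark.mllib.classification import LogisticRegressionWithSGD
from pyspark.mllib.regression import LabeledPoint

data = sc.parallelize([LabeledPoint(0.0, [0.0]), LabeledPoint(1.0, [1.0])])
# convergenceTol lets Python callers stop SGD once the solution stabilizes,
# matching the Scala-side knob added in SPARK-3382.
model = LogisticRegressionWithSGD.train(data, iterations=100, convergenceTol=1e-4)
```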
Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
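Once exposed through `pyspark.sql.functions`, usage looks roughly like this (shell with `sqlContext` predefined):
```python
from pyspark.sql.functions import stddev

df = sqlContext.createDataFrame([(1.0,), (2.0,), (3.0,)], ["v"])
df.agg(stddev("v")).show()   # sample standard deviation of column v
```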
This PR addresses [SPARK-9014](https://issues.apache.org/jira/browse/SPARK-9014)
Added functionality: the `Column` object in Python now supports the exponential operator `**`
Example:
```
from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3**df.a,df.a**3,df.a**df.a).collect()
```
Outputs:
```
[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8658 from 0x0FFF/SPARK-9014.
Just fixing a typo in the exception message raised when attempting to pickle a SparkContext.
Author: Icaro Medeiros <icaro.medeiros@gmail.com>
Closes #8724 from icaromedeiros/master.
jira: https://issues.apache.org/jira/browse/SPARK-8530
Add a Python API for MinMaxScaler.
jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7150 from hhbyyh/pythonMinMax.
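A minimal usage sketch (the `mllib.linalg` import reflects the 1.x-era package layout):
```python
from pyspark.ml.feature import MinMaxScaler
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([(Vectors.dense([0.0, 10.0]),),
                                 (Vectors.dense([5.0, 20.0]),)], ["features"])
scaler = MinMaxScaler(inputCol="features", outputCol="scaled")  # default range [0, 1]
model = scaler.fit(df)        # learns per-feature min and max
model.transform(df).show(truncate=False)
```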
Changes:
* Make the Scala doc for StringIndexerInverse clearer. Also remove the Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> no longer just for trees
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8679 from jkbradley/doc-fixes-1.5.
Add Python API for ```MultilayerPerceptronClassifier```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8067 from yanboliang/SPARK-9773.
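A usage sketch (toy data; the `mllib.linalg` import reflects the 1.x-era layout):
```python
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([(0.0, Vectors.dense([0.0, 0.0])),
                                 (1.0, Vectors.dense([1.0, 1.0]))],
                                ["label", "features"])
# layers: 2 input features, one hidden layer of 5 units, 2 output classes
mlp = MultilayerPerceptronClassifier(maxIter=100, layers=[2, 5, 2],
                                     blockSize=128, seed=123)
model = mlp.fit(df)
```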
[SPARK-10026] [ML] [PYSPARK] Implement some common Params for regression in PySpark
LinearRegression and LogisticRegression lack some Params on the Python side, and some Params are not shared classes, which means we need to write them for each class. These kinds of Params are listed here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them in shared params on the Python side and make the LinearRegression/LogisticRegression parameters match their Scala counterparts.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8508 from yanboliang/spark-10026.
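After this change, the shared Params can be set directly from Python, e.g.:
```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(elasticNetParam=0.5,  # HasElasticNetParam
                        fitIntercept=True,    # HasFitIntercept
                        standardization=True) # HasStandardization
```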
The missing methods of ml.feature are listed here:
```StringIndexer``` lacks the parameter ```handleInvalid```.
```StringIndexerModel``` lacks the method ```labels```.
```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8313 from yanboliang/spark-10027.
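A sketch of the newly exposed pieces:
```python
from pyspark.ml.feature import StringIndexer

df = sqlContext.createDataFrame([("a",), ("b",), ("a",)], ["cat"])
# handleInvalid="skip" drops unseen labels at transform time instead of erroring
indexer = StringIndexer(inputCol="cat", outputCol="idx", handleInvalid="skip")
model = indexer.fit(df)
print(model.labels)   # labels ordered by index, e.g. ['a', 'b']
```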
pyspark.sql.types.Row implements ```__getitem__```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8333 from yanboliang/spark-7544.
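A minimal example of the new access pattern:
```python
from pyspark.sql import Row

r = Row(name="Alice", age=5)
print(r["name"], r["age"])   # Alice 5 -- field lookup via __getitem__
```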
Add Python API for ml.feature.VectorSlicer.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8102 from yanboliang/SPARK-9772.
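A usage sketch (toy vector; `mllib.linalg` import reflects the 1.x-era layout):
```python
from pyspark.ml.feature import VectorSlicer
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([(Vectors.dense([0.0, 1.5, 7.0]),)], ["features"])
slicer = VectorSlicer(inputCol="features", outputCol="sliced", indices=[0, 2])
slicer.transform(df).show(truncate=False)   # keeps features 0 and 2
```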
Adds IndexToString to PySpark.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
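A sketch pairing it with StringIndexer to round-trip labels:
```python
from pyspark.ml.feature import IndexToString, StringIndexer

df = sqlContext.createDataFrame([("a",), ("b",)], ["cat"])
model = StringIndexer(inputCol="cat", outputCol="idx").fit(df)
indexed = model.transform(df)
# Map numeric indices back to the original string labels:
converter = IndexToString(inputCol="idx", outputCol="orig", labels=model.labels)
converter.transform(indexed).show()
```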
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
Author: noelsmith <mail@noelsmith.com>
Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
cc mengxr
Author: Davies Liu <davies@databricks.com>
Closes #8657 from davies/move_since.
[SPARK-10440] [STREAMING] [DOCS] Update Python API information in the streaming programming guides and Python docs
- Fixed information around Python API tags in the streaming programming guides
- Added missing items to the Python docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8595 from tdas/SPARK-10440.
The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. In fact, it has `__getitem__` to address the case when the column might be a list or dict, so that you can access a certain element of it in the DataFrame API. The ability to iterate over it is just a side effect that may confuse people getting familiar with Spark DataFrames (since you might iterate this way over a Pandas DataFrame, for instance).
Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8574 from 0x0FFF/SPARK-10417.
This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
The problem is that for the "start of epoch" date (01 Jan 1970), the PySpark DateType class returns 0 instead of a `datetime.date`, due to the implementation of its return statement.
Issue reproduction on master:
```
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
```
Author: 0x0FFF <programmerag@gmail.com>
Closes #8556 from 0x0FFF/SPARK-10392.
[SPARK-10162] Fix timezone handling in the PySpark DataFrame filter() function
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
* The timezone information of this datetime is ignored
* The datetime is assumed to be in the local timezone, which depends on the OS timezone setting
The fix includes both a code change and a regression test. Problem reproduction code on master:
```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
```
Both filters produce the same timestamp, ignoring the time zone:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
```
After the fix:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
Scan PhysicalRDD[dt#0]
```
PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me when dropping the repo
Author: 0x0FFF <programmerag@gmail.com>
Closes #8555 from 0x0FFF/SPARK-10162.
Add a Python API for the StopWordsRemover.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
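A minimal usage sketch:
```python
from pyspark.ml.feature import StopWordsRemover

df = sqlContext.createDataFrame([(["the", "quick", "fox"],)], ["raw"])
remover = StopWordsRemover(inputCol="raw", outputCol="filtered")
remover.transform(df).show(truncate=False)   # drops default English stop words
```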
Add Python API for SQLTransformer
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8527 from yanboliang/spark-10355.
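A minimal usage sketch (`__THIS__` is the placeholder for the input DataFrame):
```python
from pyspark.ml.feature import SQLTransformer

df = sqlContext.createDataFrame([(0, 1.0), (1, 2.0)], ["id", "v"])
sql = SQLTransformer(statement="SELECT *, (v + 1) AS v1 FROM __THIS__")
sql.transform(df).show()
```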
Add Python API for ml.feature.DCT.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8485 from yanboliang/spark-8472.
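A minimal usage sketch (`mllib.linalg` import reflects the 1.x-era layout):
```python
from pyspark.ml.feature import DCT
from pyspark.mllib.linalg import Vectors

df = sqlContext.createDataFrame([(Vectors.dense([0.0, 1.0, -2.0, 3.0]),)], ["vec"])
dct = DCT(inverse=False, inputCol="vec", outputCol="dctVec")  # forward DCT-II
dct.transform(df).show(truncate=False)
```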
* Added an isLargerBetter() method to the PySpark Evaluator to match the Scala version.
* JavaEvaluator delegates isLargerBetter() to the underlying Scala object.
* Added a check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
* Added test cases for where smaller is better (RMSE) and where larger is better (R-Squared).
(This contribution is my original work and I license the work to the project under Spark's open source license.)
Author: noelsmith <mail@noelsmith.com>
Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
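A sketch of how the flag drives model selection:
```python
from pyspark.ml.evaluation import RegressionEvaluator

# RMSE: smaller is better, so CrossValidator should take the argmin...
print(RegressionEvaluator(metricName="rmse").isLargerBetter())   # False
# ...while for R-squared it should take the argmax.
print(RegressionEvaluator(metricName="r2").isLargerBetter())     # True
```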
[SPARK-9964] [PYSPARK] [SQL] PySpark DataFrameReader should accept an RDD of Strings for JSON
The PySpark DataFrameReader should be able to accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8444 from yanboliang/spark-9964.
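With the change, an RDD of JSON strings can be read directly (shell with `sc`/`sqlContext` predefined):
```python
rdd = sc.parallelize(['{"name": "Alice"}', '{"name": "Bob"}'])
df = sqlContext.read.json(rdd)   # previously only a path was accepted
df.show()
```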
cc jkbradley
Author: Davies Liu <davies@databricks.com>
Closes #8470 from davies/fix_create_df.
[SPARK-9613] [CORE] Migrate usage of JavaConversions to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`.
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I can see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both the sbt build and the maven build.
I ran `mvn -Pkinesis-asl -DskipTests clean install` locally and verified the jars in my local repository were correct. I also checked the Python tests for the maven build, and they passed all tests.
Author: zsxwing <zsxwing@gmail.com>
Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:
e0b5818 [zsxwing] Fix the sbt build
c697627 [zsxwing] Add the jar pathes to the exception message
be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars