[SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) already support setting a weight column for ```LogisticRegression``` and ```LinearRegression```. It's an important feature that PySpark should also support. cc mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9811 from yanboliang/spark-11820.
Return Double.NaN for mean/average when count == 0 for all numeric types that are converted to Double; the Decimal type continues to return null.
Author: JihongMa <linlin200605@gmail.com>
Closes #9705 from JihongMA/SPARK-11720.
…ion in PySpark
Author: Jeff Zhang <zjffdu@apache.org>
Closes #9791 from zjffdu/SPARK-11804.
Fixed the merge conflicts in #7410
Closes #7410
Author: Shixiong Zhu <shixiong@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Author: jerryshao <sshao@hortonworks.com>
Closes #9742 from zsxwing/pr7410.
We checkpoint both when generating a batch and when completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, the checkpoint of the old batch actually has the latest information, so we want to recover from it. This PR uses the latest checkpoint time as the file name, so that we can always recover from the latest checkpoint file.
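Selecting the newest checkpoint by the time embedded in the file name could look like this (an illustrative sketch assuming names of the form `checkpoint-<millis>`, not the actual Streaming code):

```python
import re

def latest_checkpoint(filenames):
    """Pick the checkpoint file with the greatest embedded time.

    Naming files by checkpoint time lets recovery always choose the
    checkpoint carrying the newest information, even when an old batch's
    checkpoint was written after a newer batch's.
    """
    def time_of(name):
        m = re.match(r"checkpoint-(\d+)$", name)
        return int(m.group(1)) if m else -1

    candidates = [f for f in filenames if time_of(f) >= 0]
    return max(candidates, key=time_of) if candidates else None
```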
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #9707 from zsxwing/fix-checkpoint.
Author: Daniel Jalova <djalova@us.ibm.com>
Closes #9186 from djalova/SPARK-6328.
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
* `allowComments` (default `false`): ignores Java/C++ style comments in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)
To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options.
Also updated documentation to explain these options.
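The options bundle might look roughly like the following dataclass (an illustrative Python analogue of the Scala `JSONOptions` case class, not the actual code):

```python
from dataclasses import dataclass

@dataclass
class JSONOptions:
    """Illustrative analogue of the JSONOptions case class described above.

    Each flag defaults to the value documented for the JSON data source.
    """
    allow_comments: bool = False
    allow_unquoted_field_names: bool = False
    allow_single_quotes: bool = True
    allow_numeric_leading_zeros: bool = False
```

Bundling the flags in one value object keeps the parser's signatures stable as new options are added.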
Scala
![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png)
Python
![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)
Author: Reynold Xin <rxin@databricks.com>
Closes #9724 from rxin/SPARK-11745.
This PR adds pivot to the python api of GroupedData with the same syntax as Scala/Java.
Author: Andrew Ray <ray.andrew@gmail.com>
Closes #9653 from aray/sql-pivot-python.
report failures
This PR just checks the test results and returns 1 if the test fails, so that `run-tests.py` can mark it fail.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #9669 from zsxwing/streaming-python-tests.
Author: Chris Snow <chsnow123@gmail.com>
Closes #9640 from snowch/patch-3.
Example for sqlContext.createDataFrame from pandas.DataFrame has a typo
Author: Chris Snow <chsnow123@gmail.com>
Closes #9639 from snowch/patch-2.
switched stddev support from DeclarativeAggregate to ImperativeAggregate.
Author: JihongMa <linlin200605@gmail.com>
Closes #9380 from JihongMA/SPARK-11420.
Only install the signal handler in the main thread; otherwise creating a context in a non-main thread will fail.
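The guard can be sketched in plain Python (an illustrative helper, not PySpark's actual code): `signal.signal` raises `ValueError` when called from a non-main thread, so the handler is installed only from the main thread.

```python
import signal
import threading

def install_sigint_handler(handler):
    """Install a SIGINT handler only when running in the main thread.

    Returns True if the handler was installed, False otherwise.
    """
    if threading.current_thread() is threading.main_thread():
        signal.signal(signal.SIGINT, handler)
        return True
    return False
```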
Author: Davies Liu <davies@databricks.com>
Closes #9574 from davies/python_signal.
Python
cc jkbradley
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9534 from yu-iskw/SPARK-11566.
like `df.agg(corr("col1", "col2"))`
davies
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9536 from felixcheung/pyfunc.
evaluate AggregateExpression1s
https://issues.apache.org/jira/browse/SPARK-9830
This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).
Author: Yin Huai <yhuai@databricks.com>
Closes #9556 from yhuai/removeAgg1.
in Python more specific
cc jkbradley
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9577 from yu-iskw/SPARK-11610.
For now they are thin wrappers around the corresponding Hive UDAFs.
One limitation with these in Hive 0.13.0 is they only support aggregating primitive types.
I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns.
Do we also want to add these to `functions.py`?
This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089
marmbrus rxin
Author: Nick Buroojy <nick.buroojy@civitaslearning.com>
Closes #9526 from nburoojy/nick/udaf-alias.
(cherry picked from commit a6ee4f989d020420dd08b97abb24802200ff23b2)
Signed-off-by: Michael Armbrust <michael@databricks.com>
pyspark.ml.classification
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8690 from yu-iskw/SPARK-10280.
Could jkbradley and davies review it?
- Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8643 from yu-iskw/SPARK-8467-2.
#9527 missed updating the python tests.
Author: Michael Armbrust <michael@databricks.com>
Closes #9533 from marmbrus/hotfixTextValue.
…ithinPartitions.
Author: Nong Li <nong@databricks.com>
Closes #9504 from nongli/spark-11410.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
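The problem can be illustrated in Python with 64-bit integer arithmetic (the mixer below is a hypothetical stand-in, not `XORShiftRandom.hashSeed`): a plain widening cast leaves the high 32 bits all zero, while mixing fills them with seed-dependent bits.

```python
def widen_naively(seed32):
    """A plain widening cast: the high 32 bits of the result are all zero."""
    return seed32 & 0xFFFFFFFF

def widen_with_spread(seed32):
    """Spread seed bits into the high half too.

    Uses a 64-bit golden-ratio multiplicative hash purely for
    illustration; the point is that the upper 32 bits now depend
    on the seed instead of being constant zeros.
    """
    h = (seed32 & 0xFFFFFFFF) * 0x9E3779B97F4A7C15
    return h & 0xFFFFFFFFFFFFFFFF
```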
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes #8314 from squito/SPARK-10116.
normal equation solver
Follow-up to [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836): we should also support summary statistics for ```intercept```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9485 from yanboliang/spark-11473.
coefficients/intercept/scale
PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9492 from yanboliang/spark-11527.
return properly
This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.
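The bug class is easy to show with a toy stand-in (not the real `StreamingContext`): a wrapper that calls through to the underlying method but drops its result returns `None` instead of the expected boolean.

```python
class StreamingContextSketch:
    """Toy stand-in illustrating the missing-return bug described above."""

    def _await_jvm(self, timeout):
        # Pretend the underlying call reports whether the context stopped
        # within the timeout (hypothetical behavior for illustration).
        return timeout >= 5

    def await_termination_or_timeout_buggy(self, timeout):
        self._await_jvm(timeout)  # result silently dropped -> returns None

    def await_termination_or_timeout_fixed(self, timeout):
        return self._await_jvm(timeout)  # the one-line fix: propagate it
```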
tdas zsxwing
Author: Nick Evans <me@nicolasevans.org>
Closes #9336 from manygrams/fix_await_termination_or_timeout.
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9469 from yu-iskw/SPARK-10028.
We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions.
That is to say, after this change, we won't support
```scala
df.groupBy("key").kurtosis("colA", "colB")
```
However, we will still support
```scala
df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
```
Author: Reynold Xin <rxin@databricks.com>
Closes #9446 from rxin/SPARK-11489.
Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis
Author: Davies Liu <davies@databricks.com>
Closes #9424 from davies/py_var.
in ML models
Deprecated in `LogisticRegression` and `LinearRegression`
Author: vectorijk <jiangkai@gmail.com>
Closes #9311 from vectorijk/spark-10592.
and pyspark.ml.*
Author: lihao <lihaowhu@gmail.com>
Closes #9275 from lidinghao/SPARK-10286.
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6 and remove it in 1.7. It helps us simplify the implementation.
cc: srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes #9322 from mengxr/SPARK-11358.
provided schema
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606
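The lazy alternative mentioned above can be sketched as a validating generator (an illustrative helper, not PySpark's code; "schema" is reduced to a row width for the sketch): nothing is consumed until iteration starts, and every row is checked rather than only the first ten.

```python
def validated_rows(rows, schema_width):
    """Lazily verify each row against a (hypothetical) schema as it is read."""
    for row in rows:
        if len(row) != schema_width:
            raise ValueError(
                "row %r does not match schema width %d" % (row, schema_width))
        yield row
```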
Author: Jason White <jason.white@shopify.com>
Closes #9392 from JasonMWhite/createDataFrame_without_take.
JIRA: https://issues.apache.org/jira/browse/SPARK-11322
As reported by JoshRosen in [databricks/spark-redshift/issues/89](https://github.com/databricks/spark-redshift/issues/89#issuecomment-149828308), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep full stack trace in the captured exception.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #9283 from viirya/py-exception-stacktrace.
Adds DataFrameReader.text and DataFrameWriter.text.
Author: Reynold Xin <rxin@databricks.com>
Closes #9259 from rxin/SPARK-11292.
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) provided the ```WeightedLeastSquares``` solver ("normal") in ```LinearRegression``` with L2 regularization in Scala and R; Python ML ```LinearRegression``` should also support setting the solver ("auto", "normal", "l-bfgs").
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9328 from yanboliang/spark-11367.
returns incorrect answer in some cases
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
Supersedes https://github.com/apache/spark/pull/9293
Author: Sean Owen <sowen@cloudera.com>
Closes #9309 from srowen/SPARK-11302.2.
implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for Python API
in pyspark/ml/{classification, regression}.py
Author: vectorijk <jiangkai@gmail.com>
Closes #9233 from vectorijk/spark-10024.
BlockMatrix
This PR adds addition and multiplication to PySpark's `BlockMatrix` class via `add` and `multiply` functions.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes #9139 from dusenberrymw/SPARK-6488_Add_Addition_and_Multiplication_to_PySpark_BlockMatrix.
from the Kafka Streaming API
jerryshao tdas
I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise.
Instead of doing something like:
```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```
You can do something like:
```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```
Before:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```
After:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```
I couldn't find any tests - am I missing something?
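The shape of such a value-equality change can be sketched in plain Python (a minimal stand-in, not the real `pyspark.streaming.kafka` class):

```python
class TopicAndPartition:
    """Minimal sketch of value equality for (topic, partition) pairs."""

    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        return (isinstance(other, TopicAndPartition)
                and self._topic == other._topic
                and self._partition == other._partition)

    def __ne__(self, other):
        return not self == other

    def __hash__(self):
        # Keep hash consistent with __eq__ so instances behave in sets/dicts.
        return hash((self._topic, self._partition))
```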
Author: Nick Evans <me@nicolasevans.org>
Closes #9236 from manygrams/topic_and_partition_equality.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
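A minimal version of such a decorator might look like this (an illustrative sketch of the idea, not the actual pyspark implementation, which also handles indentation of the existing docstring):

```python
def since(version):
    """Append a Sphinx '.. versionadded::' note to a function's docstring.

    Tolerates functions without docstrings by starting from an empty string.
    """
    def decorator(f):
        f.__doc__ = (f.__doc__ or "") + "\n\n.. versionadded:: " + version
        return f
    return decorator
```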
Author: noelsmith <mail@noelsmith.com>
Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #9248 from zjffdu/SPARK-11279.
pyspark.mllib.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8684 from yu-iskw/SPARK-10277.
WIP
Author: Gábor Lipták <gliptak@gmail.com>
Closes #8323 from gliptak/SPARK-7021.
…rint in python
No test needed. Verify it manually in pyspark shell
Author: Jeff Zhang <zjffdu@apache.org>
Closes #9177 from zjffdu/SPARK-11205.
Without an empty line, sphinx will treat doctest as docstring. holdenk
~~~
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
~~~
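A correctly formed docstring keeps a blank line before the doctest (hypothetical example function, chosen only to show the layout):

```python
def tokenize(text):
    """Split a string on whitespace.

    Note the blank line above this example; without it, tools like
    sphinx can misparse doctest lines as ordinary docstring markup.

    >>> tokenize("a b b")
    ['a', 'b', 'b']
    """
    return text.split()
```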
Author: Xiangrui Meng <meng@databricks.com>
Closes #9188 from mengxr/py-count-vec-doc-fix.
Namely "." shows up in some places in the template when using the param docstring and not in others
Author: Holden Karau <holden@pigscanfly.ca>
Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
pyspark.mllib.classification
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes derived from the file history.
Note - some methods are inherited from the regression module (i.e. LinearModel.intercept) so these won't have version numbers in the API docs until that model is updated.
Author: noelsmith <mail@noelsmith.com>
Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to public methods + "versionadded::" to classes (derived from the git file history in pyspark).
Note - I added also the tags to MultilabelMetrics even though it isn't declared as public in the __all__ statement... if that's incorrect - I'll remove.
Author: noelsmith <mail@noelsmith.com>
Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.
Upgrade to Py4j0.9
Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>
Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.