[SPARK-12153] Add support for arbitrary-length sentences and other tuning for Word2Vec
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.
Add new similarity functions and a normalization option for distances in synonym finding.
Add new accessors for the internal structures (the vocabulary and word index) for convenience.
Instructions are needed on how to set the value of the Since annotation for newly added public functions. 1.5.3?
JIRA link: https://issues.apache.org/jira/browse/SPARK-12153
Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
Closes #10152 from ygcao/improvementForSentenceBoundary.
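For illustration, a minimal PySpark sketch of the workflow this commit tunes: fitting Word2Vec on variable-length tokenized sentences and querying synonyms. It assumes a SparkSession named `spark`; the data, column names, and parameter values are made up, and `findSynonyms` availability depends on the Spark version:
```python
from pyspark.ml.feature import Word2Vec

# Each row holds one tokenized sentence; lengths may vary freely.
doc = spark.createDataFrame([
    ("spark is a fast general engine".split(" "),),
    ("spark can run ml pipelines".split(" "),),
], ["sentence"])

w2v = Word2Vec(vectorSize=5, minCount=1, inputCol="sentence", outputCol="vector")
model = w2v.fit(doc)

# Words most similar to the query word, by cosine similarity.
model.findSynonyms("spark", 2).show()
```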
[SPARK-13302] Move doctest setup code outside of the doctests
Some of the new doctests in ml/clustering.py have a lot of setup code; move the setup code into the general test init to keep the doctests looking example-style.
In part this is a follow-up to https://github.com/apache/spark/pull/10999.
Note that the same pattern is followed in regression and recommendation, so we might as well clean up all three at the same time.
Author: Holden Karau <holden@us.ibm.com>
Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
Add Python API for spark.ml bisecting k-means.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10889 from yanboliang/spark-12974.
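A short usage sketch of the new Python API, assuming a SparkSession named `spark`; the toy data and parameter values are illustrative:
```python
from pyspark.ml.clustering import BisectingKMeans
from pyspark.ml.linalg import Vectors  # pyspark.mllib.linalg on older releases

data = spark.createDataFrame([
    (Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
    (Vectors.dense([9.0, 8.0]),), (Vectors.dense([8.0, 9.0]),),
], ["features"])

bkm = BisectingKMeans(k=2, seed=1)
model = bkm.fit(data)
model.transform(data).show()  # adds a "prediction" column
```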
[SPARK-13153] ML persistence fails for a param with no default value
Fix this defect by checking whether a default value exists.
yanboliang Please help to review.
Author: Tommy YU <tummyyu@163.com>
Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
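A hypothetical sketch of the guard described above, using the public `pyspark.ml.param.Params` accessors (`isDefined`, `getOrDefault`); the helper name is made up:
```python
def params_to_map(instance):
    """Collect param values for persistence, skipping params that are
    neither explicitly set nor equipped with a default value."""
    values = {}
    for param in instance.params:
        if instance.isDefined(param):  # set explicitly or has a default
            values[param.name] = instance.getOrDefault(param)
        # A param with no value and no default is skipped, not an error.
    return values
```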
PySpark's Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is currently no way of getting a Boolean to indicate whether a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.
In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print(nb.hasParam("smoothing"))
print(nb.hasParam("notAParam"))
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'
However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false
cc holdenk
Author: sethah <seth.hendrickson16@gmail.com>
Closes #10962 from sethah/SPARK-13047.
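A minimal sketch of the Scala-consistent behavior requested here (return a boolean instead of raising); the actual patch may differ in its details:
```python
from pyspark.ml.param import Param

def hasParam(self, paramName):
    """Test whether this instance has a param with the given name;
    return False (rather than raising AttributeError) if it does not."""
    if isinstance(paramName, str):
        return isinstance(getattr(self, paramName, None), Param)
    raise TypeError("hasParam(): paramName must be a string")
```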
PySpark ml.clustering support export/import.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10999 from yanboliang/spark-13035.
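With export/import in place, clustering persistence follows the usual pattern; a sketch with an illustrative path and an assumed DataFrame `data` holding a vector "features" column:
```python
from pyspark.ml.clustering import KMeans, KMeansModel

model = KMeans(k=2, seed=1).fit(data)
model.save("/tmp/kmeans-model")                    # illustrative path
same_model = KMeansModel.load("/tmp/kmeans-model")
```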
Test cases should be removed from the annotations of the ```setXXX``` functions; otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10975 from yanboliang/clustering-cleanup.
PySpark ml.recommendation support export/import.
Author: Kai Jiang <jiangkai@gmail.com>
Closes #11044 from vectorijk/spark-13037.
[SPARK-11939] PySpark support for model export/import, with LinearRegression as an example
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Make ```LinearRegression``` support ```save/load``` as an example. After this is merged, the work for the other transformers/estimators will be easy, so we can then list and distribute the tasks to the community.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #10469 from yanboliang/spark-11939.
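A sketch of the writer/reader surface this adds, with an illustrative path and an assumed `training` DataFrame carrying "features" and "label" columns:
```python
from pyspark.ml.regression import LinearRegression, LinearRegressionModel

model = LinearRegression(maxIter=5).fit(training)

model.save("/tmp/lr-model")                      # convenience form
model.write().overwrite().save("/tmp/lr-model")  # explicit writer form
loaded = LinearRegressionModel.load("/tmp/lr-model")
```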
https://issues.apache.org/jira/browse/SPARK-12780
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10724 from yinxusen/SPARK-12780.
The current Python ml params require cutting and pasting the param setup and description between the class and ```__init__``` methods. Remove this possible source of errors and simplify the use of custom params by adding a ```_copy_new_parent``` method to ```Param``` so as to avoid the cut-and-paste (and cut-and-paste at different indentation levels, ugh).
Author: Holden Karau <holden@us.ibm.com>
Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
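A minimal sketch of what such a method can look like (the merged implementation may differ): copy the class-level dummy Param and re-bind the copy to the owning instance:
```python
import copy

def _copy_new_parent(self, parent):
    """Copy this Param and associate the copy with a new parent."""
    if self.parent == "undefined":  # class-level dummy placeholder
        param = copy.copy(self)
        param.parent = parent.uid
        return param
    raise ValueError("Cannot copy from non-dummy parent %s." % parent)
```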
https://issues.apache.org/jira/browse/SPARK-11923
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10186 from yinxusen/SPARK-11923.
Add Python API for ml.feature.QuantileDiscretizer.
One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
cc brkyvz & mengxr
Author: Holden Karau <holden@us.ibm.com>
Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
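Illustrative usage, assuming a SparkSession named `spark` (fitting yields a `Bucketizer`):
```python
from pyspark.ml.feature import QuantileDiscretizer

df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])
qd = QuantileDiscretizer(numBuckets=2, inputCol="values", outputCol="bucket")
bucketizer = qd.fit(df)          # a Bucketizer with learned split points
bucketizer.transform(df).show()
```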
```PCAModel``` can output ```explainedVariance``` on the Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10830 from yanboliang/spark-12905.
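Illustrative access to the new member, assuming a SparkSession named `spark`:
```python
from pyspark.ml.feature import PCA
from pyspark.ml.linalg import Vectors  # pyspark.mllib.linalg on older releases

df = spark.createDataFrame([
    (Vectors.dense([1.0, 0.0, 7.0]),),
    (Vectors.dense([2.0, 0.0, 3.0]),),
    (Vectors.dense([4.0, 0.0, 1.0]),),
], ["features"])

model = PCA(k=2, inputCol="features", outputCol="pca_features").fit(df)
print(model.explainedVariance)  # proportion of variance per component
```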
This is #9263 from gliptak (improving the grouping/display of test case results) with a small fix to the bisecting k-means unit test.
Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Closes #10850 from mengxr/SPARK-11295.
This reverts commit c6f971b4aeca7265ab374fa46c5c452461d9b6a7.
[SPARK-9716] BinaryClassificationEvaluator should accept a double-typed prediction column
This PR allows the prediction column of `BinaryClassificationEvaluator` to be of double type.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10472 from BenFradet/SPARK-9716.
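A sketch of evaluating a plain double-valued score column directly, assuming a SparkSession named `spark`; the data and values are made up:
```python
from pyspark.ml.evaluation import BinaryClassificationEvaluator

scores = spark.createDataFrame(
    [(0.1, 0.0), (0.9, 1.0), (0.8, 1.0), (0.2, 0.0)],
    ["rawPrediction", "label"])  # double column, not a probability vector

evaluator = BinaryClassificationEvaluator(metricName="areaUnderROC")
print(evaluator.evaluate(scores))
```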
SPARK-11295 Add packages to JUnit output for Python tests
This improves grouping/display of test case results.
Author: Gábor Lipták <gliptak@gmail.com>
Closes #9263 from gliptak/SPARK-11295.
[SPARK-11925] Add missing PySpark methods and params for ml.feature found during Spark 1.6 QA:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support outputting ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support outputting ```pc```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9908 from yanboliang/spark-11925.
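Illustrative access to the newly exposed members; `df` is an assumed DataFrame with a vector "features" column:
```python
from pyspark.ml.feature import RegexTokenizer, MinMaxScaler

tokenizer = RegexTokenizer(inputCol="text", outputCol="words",
                           toLowercase=False)  # keep the original casing

scaler_model = MinMaxScaler(inputCol="features", outputCol="scaled").fit(df)
print(scaler_model.originalMin, scaler_model.originalMax)
```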
[SPARK-11815] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed
PySpark ```DecisionTreeClassifier``` and ```DecisionTreeRegressor``` should support ```setSeed```, as on the Scala side.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9807 from yanboliang/spark-11815.
Add ```computeCost``` to ```KMeansModel``` as an evaluator for PySpark spark.ml.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9931 from yanboliang/SPARK-11945.
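Typical use as an evaluator on a fitted model; `data` is an assumed DataFrame with a vector "features" column:
```python
from pyspark.ml.clustering import KMeans

model = KMeans(k=2, seed=1).fit(data)
wssse = model.computeCost(data)  # within-set sum of squared errors
print(wssse)
```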
From JIRA:
Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before the Python wrappers call Java's Params setter method.
A possible fix would be to include a method "_checkType" in PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method that copies params to Scala could call this method when available.
This fix instead checks the types at set time, since failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future.
Author: Holden Karau <holden@us.ibm.com>
Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
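The user-visible effect, plus a hypothetical converter in the spirit of the fix (the real conversion hooks may be named differently):
```python
from pyspark.ml.feature import Normalizer

norm = Normalizer(p=2)   # previously this required p=2.0
print(norm.getP())       # -> 2.0, the int converted at set time

def to_float(value):
    """Hypothetical int-to-float coercion used when setting float params."""
    if isinstance(value, (int, float)):
        return float(value)
    raise TypeError("Could not convert %r to float" % (value,))
```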
No JIRA was created since this is a trivial change.
davies Please help review it.
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10143 from zjffdu/pyspark_typo.
Extend CrossValidator with HasSeed in PySpark.
This PR replaces [https://github.com/apache/spark/pull/7997]
CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks!
Author: Joseph K. Bradley <joseph@databricks.com>
Author: Martin MENESTRET <mmenestret@ippon.fr>
Closes #10268 from jkbradley/pyspark-cv-seed.
Use ```coefficients``` to replace ```weights```; I hope these are the last two.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10065 from yanboliang/coefficients.
* Update the doc for PySpark ```HasCheckpointInterval``` so that users understand how to disable checkpointing.
* Update the doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to note the relationship between ```cacheNodeIds``` and ```checkpointInterval```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9856 from yanboliang/spark-11875.
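The documented behavior in a nutshell (parameter values are illustrative; checkpointing also requires a checkpoint directory on the SparkContext):
```python
from pyspark.ml.recommendation import ALS
from pyspark.ml.classification import DecisionTreeClassifier

# checkpointInterval=-1 disables checkpointing; 10 would checkpoint
# every 10 iterations.
als = ALS(checkpointInterval=-1)

# For trees, checkpointInterval only applies when cacheNodeIds=True.
dt = DecisionTreeClassifier(cacheNodeIds=True, checkpointInterval=10)
```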
[SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) already support setting a weight column for ```LogisticRegression``` and ```LinearRegression```. It's a very important feature that PySpark should also support. mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9811 from yanboliang/spark-11820.
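A sketch of setting the weight column from Python; `train` is an assumed DataFrame with "features", "label", and a double "weight" column:
```python
from pyspark.ml.classification import LogisticRegression

lr = LogisticRegression(weightCol="weight")  # rows count per their weight
model = lr.fit(train)
```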
[SPARK-10280] Add @since annotation to pyspark.ml.classification
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8690 from yu-iskw/SPARK-10280.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes #8314 from squito/SPARK-10116.
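A hypothetical Python analogue of the point being made: zero-extending a 32-bit hash into a 64-bit seed leaves the upper half constant, so both halves should come from the hash. This is a conceptual sketch, not Spark's Scala implementation:
```python
import hashlib
import struct

def hash_seed(seed):
    """Derive a 64-bit seed with hashed bits throughout, instead of
    zero-padding a 32-bit hash into the lower half of a long."""
    digest = hashlib.md5(struct.pack(">q", seed)).digest()
    return struct.unpack(">q", digest[:8])[0]

print(hex(hash_seed(42) & 0xFFFFFFFFFFFFFFFF))
```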
[SPARK-11473] Support summary statistics for the intercept in the normal equation solver
As a follow-up to [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for the ```intercept```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9485 from yanboliang/spark-11473.
[SPARK-11527] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale
PySpark ```AFTSurvivalRegressionModel``` should expose ```coefficients```, ```intercept```, and ```scale```. mengxr vectorijk
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9492 from yanboliang/spark-11527.
[SPARK-10592] Deprecate weights and use coefficients instead in ML models
Deprecated in `LogisticRegression` and `LinearRegression`.
Author: vectorijk <jiangkai@gmail.com>
Closes #9311 from vectorijk/spark-10592.
[SPARK-10286] Add @since annotation to pyspark.ml.param and pyspark.ml.*
Author: lihao <lihaowhu@gmail.com>
Closes #9275 from lidinghao/SPARK-10286.
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) already provides the ```WeightedLeastSquares``` solver ("normal") in ```LinearRegression``` with L2 regularization in Scala and R; Python ML ```LinearRegression``` should also support setting the solver ("auto", "normal", "l-bfgs").
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9328 from yanboliang/spark-11367.
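Selecting the solver from Python then looks like this sketch (`training` is an assumed DataFrame):
```python
from pyspark.ml.regression import LinearRegression

# "auto" picks automatically; "normal" uses WeightedLeastSquares
# (L2 only); "l-bfgs" uses iterative optimization.
lr = LinearRegression(solver="normal", regParam=0.1, elasticNetParam=0.0)
model = lr.fit(training)
```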
[SPARK-10024] Implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for the Python API in pyspark/ml/{classification, regression}.py
Author: vectorijk <jiangkai@gmail.com>
Closes #9233 from vectorijk/spark-10024.
[SPARK-7021] Add JUnit output for Python unit tests (WIP)
Author: Gábor Lipták <gliptak@gmail.com>
Closes #8323 from gliptak/SPARK-7021.
Without an empty line, Sphinx treats the doctest as ordinary docstring text and emits errors such as the following. holdenk
~~~
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
~~~
Author: Xiangrui Meng <meng@databricks.com>
Closes #9188 from mengxr/py-count-vec-doc-fix.
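The fix pattern, sketched on a made-up class: a blank line must separate the prose from the doctest so Sphinx renders it as a literal block:
```python
class CountVectorizerLike(object):
    """Extracts a vocabulary from a collection of documents.

    >>> model.transform(df).show()  # the blank line above makes Sphinx
    ...                             # treat this as a code block
    """
```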
Namely "." shows up in some places in the template when using the param docstring and not in others
Author: Holden Karau <holden@pigscanfly.ca>
Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
Add the Python API for IsotonicRegression.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.
[SPARK-10957] Handle quantileProbs correctly in setParams of PySpark's AFTSurvivalRegression
If the user doesn't specify `quantileProbs` in `setParams`, it gets reset to the default value; we don't need special handling here. vectorijk yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #9001 from mengxr/SPARK-10957.
Implement Python API for AFTSurvivalRegression
Author: vectorijk <jiangkai@gmail.com>
Closes #8926 from vectorijk/spark-10688.
This integrates the Interaction feature transformer with SparkR's R formula support (i.e., support for `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
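An interaction term sketched through the Python `RFormula` wrapper (column names are illustrative; SparkR formulas behave analogously):
```python
from pyspark.ml.feature import RFormula

# y ~ a + b + a:b --> both main effects plus the a-by-b interaction.
formula = RFormula(formula="y ~ a + b + a:b",
                   featuresCol="features", labelCol="label")
output = formula.fit(df).transform(df)  # `df` has columns y, a, b
```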
Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs, i.e.:
![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)
... looks like this once the newline is added:
![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)
Author: noelsmith <mail@noelsmith.com>
Closes #8851 from noel-smith/docstring-missing-newline-fix.
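The docstring pattern in question, on a hypothetical method: a blank line must precede the `:param:`/`:return:` field list:
```python
def setThreshold(self, value):
    """Sets the threshold used to binarize continuous features.

    :param value: threshold in [0, 1] (hypothetical parameter)
    :return: this instance, to allow chaining
    """
    self.threshold = value
    return self
```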
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
As ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8814 from yanboliang/spark-10615.
[SPARK-10282] Add @since annotation to pyspark.ml.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8692 from yu-iskw/SPARK-10282.
[SPARK-10281] Add @since annotation to pyspark.ml.clustering
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8691 from yu-iskw/SPARK-10281.
[SPARK-10283] Add @since annotation to pyspark.ml.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8693 from yu-iskw/SPARK-10283.
[SPARK-10284] Add @since annotation to pyspark.ml.tuning
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8694 from yu-iskw/SPARK-10284.
Add a Python API for MinMaxScaler.
JIRA: https://issues.apache.org/jira/browse/SPARK-8530
JIRA for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7150 from hhbyyh/pythonMinMax.