path: root/python/pyspark/ml
* [SPARK-12153][SPARK-7617][MLLIB] Add support of arbitrary length sentence and other tuning for Word2Vec (Yong Gang Cao, 2016-02-22, 1 file, -6/+6)
  Add support for arbitrary-length sentences by using the natural representation of sentences in the input. Add new similarity functions and a normalization option for distances in synonym finding. Add new accessors for internal structure (the vocabulary and word index) for convenience. Need instructions about how to set the value for the Since annotation for newly added public functions: 1.5.3?
  JIRA link: https://issues.apache.org/jira/browse/SPARK-12153
  Author: Yong Gang Cao <ygcao@amazon.com>
  Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
  Closes #10152 from ygcao/improvementForSentenceBoundary.
* [SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup outside of the doctests (Holden Karau, 2016-02-20, 3 files, -33/+42)
  Some of the new doctests in ml/clustering.py have a lot of setup code; move the setup code to the general test init to keep the doctests looking more example-style. In part this is a follow-up to https://github.com/apache/spark/pull/10999. Note that the same pattern is followed in regression & recommendation; might as well clean up all three at the same time.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
* [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means (Yanbo Liang, 2016-02-12, 1 file, -1/+124)
  Add Python API for spark.ml bisecting k-means.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10889 from yanboliang/spark-12974.
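
  For reference, a minimal usage sketch of the new estimator; the toy data and the `spark` handle are illustrative, and the import paths assume a Spark 2.x-style session:
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.clustering import BisectingKMeans

  spark = SparkSession.builder.getOrCreate()
  # Two well-separated groups of one-dimensional points.
  df = spark.createDataFrame(
      [(Vectors.dense([0.0]),), (Vectors.dense([1.0]),),
       (Vectors.dense([9.0]),), (Vectors.dense([10.0]),)],
      ["features"])

  bkm = BisectingKMeans(k=2, seed=1)
  model = bkm.fit(df)
  print(model.clusterCenters())  # two centers, roughly [0.5] and [9.5]
  ```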
* [SPARK-13153][PYSPARK] ML persistence failed when handling a parameter with no default value (Tommy YU, 2016-02-11, 1 file, -2/+3)
  Fix this defect by checking whether a default value exists or not. yanboliang Please help to review.
  Author: Tommy YU <tummyyu@163.com>
  Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
* [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error (sethah, 2016-02-11, 2 files, -4/+12)
  The Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

  In Python:
  ```python
  from pyspark.ml.classification import NaiveBayes
  nb = NaiveBayes()
  print nb.hasParam("smoothing")
  print nb.hasParam("notAParam")
  ```
  produces:
  > True
  > AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

  However, in Scala:
  ```scala
  import org.apache.spark.ml.classification.NaiveBayes
  val nb = new NaiveBayes()
  nb.hasParam("smoothing")
  nb.hasParam("notAParam")
  ```
  produces:
  > true
  > false

  cc holdenk
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #10962 from sethah/SPARK-13047.
* [SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import (Yanbo Liang, 2016-02-11, 1 file, -4/+25)
  PySpark ml.clustering support export/import.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10999 from yanboliang/spark-13035.
* [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py (Yanbo Liang, 2016-02-11, 2 files, -15/+9)
  Test cases should be removed from the annotations of the `setXXX` functions, otherwise they become part of the [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode). cc mengxr jkbradley
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10975 from yanboliang/clustering-cleanup.
* [SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/import (Kai Jiang, 2016-02-11, 1 file, -4/+27)
  PySpark ml.recommendation support export/import.
  Author: Kai Jiang <jiangkai@gmail.com>
  Closes #11044 from vectorijk/spark-13037.
* [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example (Yanbo Liang, 2016-01-29, 5 files, -29/+236)
  * Implement `MLWriter/MLWritable/MLReader/MLReadable` for PySpark.
  * Make `LinearRegression` support `save/load` as an example.
  After this is merged, the work for the other transformers/estimators will be easy; then we can list and distribute the tasks to the community. cc mengxr jkbradley
  Author: Yanbo Liang <ybliang8@gmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #10469 from yanboliang/spark-11939.
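
  A sketch of the resulting save/load round-trip; the path and toy data are illustrative (Spark 2.x-style imports):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.regression import LinearRegression, LinearRegressionModel

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(1.0, Vectors.dense([0.0])), (3.0, Vectors.dense([2.0]))],
      ["label", "features"])

  model = LinearRegression(maxIter=5).fit(df)
  model.save("/tmp/lr_model")                             # MLWritable
  reloaded = LinearRegressionModel.load("/tmp/lr_model")  # MLReadable
  print(reloaded.coefficients, reloaded.intercept)
  ```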
* [SPARK-12780] Fix inconsistent return values of ML Python models' properties (Xusen Yin, 2016-01-26, 1 file, -2/+3)
  https://issues.apache.org/jira/browse/SPARK-12780
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #10724 from yinxusen/SPARK-12780.
* [SPARK-10509][PYSPARK] Reduce excessive param boilerplate code (Holden Karau, 2016-01-26, 12 files, -317/+43)
  The current Python ml params require cut-and-pasting the param setup and description between the class & `__init__` methods. Remove this possible source of errors & simplify the use of custom params by adding a `_copy_new_parent` method to Param, so as to avoid cut-and-pasting (and cut-and-pasting at different indentation levels, urgh).
  Author: Holden Karau <holden@us.ibm.com>
  Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
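
  Roughly, the idea is that a `Param` declared once at class level is copied onto each instance with that instance as its new parent. A simplified, self-contained illustration of the pattern (not the actual pyspark source):
  ```python
  import copy

  class Param(object):
      def __init__(self, parent, name, doc):
          self.parent = parent
          self.name = name
          self.doc = doc

      def _copy_new_parent(self, parent):
          # Re-bind a class-level "template" param to a concrete instance.
          param = copy.copy(self)
          param.parent = parent
          return param

  class HasMaxIter(object):
      # Declared once; no duplicated name/doc strings inside __init__.
      maxIter = Param("undefined", "maxIter", "max number of iterations (>= 0)")

      def __init__(self):
          self.maxIter = HasMaxIter.maxIter._copy_new_parent(self)
  ```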
* [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector (Xusen Yin, 2016-01-26, 1 file, -1/+97)
  https://issues.apache.org/jira/browse/SPARK-11923
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #10186 from yinxusen/SPARK-11923.
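
  A minimal usage sketch with toy data (Spark 2.x-style imports); the selector keeps the single feature most associated with the label by a chi-squared test:
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.feature import ChiSqSelector

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(Vectors.dense([0.0, 0.0, 18.0, 1.0]), 1.0),
       (Vectors.dense([0.0, 1.0, 12.0, 0.0]), 0.0),
       (Vectors.dense([1.0, 0.0, 15.0, 0.1]), 0.0)],
      ["features", "label"])

  selector = ChiSqSelector(numTopFeatures=1, featuresCol="features",
                           outputCol="selectedFeatures", labelCol="label")
  selector.fit(df).transform(df).show(truncate=False)
  ```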
* [SPARK-11922][PYSPARK][ML] Python API for ml.feature.QuantileDiscretizer (Holden Karau, 2016-01-25, 1 file, -4/+85)
  Add Python API for ml.feature.QuantileDiscretizer. One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model? cc brkyvz & mengxr
  Author: Holden Karau <holden@us.ibm.com>
  Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
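
  A short sketch of how the new API is used (illustrative data):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.feature import QuantileDiscretizer

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame([(0.1,), (0.4,), (1.2,), (1.5,)], ["values"])

  qd = QuantileDiscretizer(numBuckets=2, inputCol="values", outputCol="buckets")
  qd.fit(df).transform(df).show()  # each value mapped to a bucket index
  ```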
* [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark (Yanbo Liang, 2016-01-25, 1 file, -0/+11)
  `PCAModel` can output `explainedVariance` on the Python side. cc mengxr srowen
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10830 from yanboliang/spark-12905.
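
  An illustrative sketch of the added accessor (toy data, Spark 2.x-style imports):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.feature import PCA

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(Vectors.dense([1.0, 2.0, 3.0]),),
       (Vectors.dense([2.0, 4.0, 6.5]),),
       (Vectors.dense([3.0, 6.1, 9.0]),)],
      ["features"])

  model = PCA(k=2, inputCol="features", outputCol="pca").fit(df)
  print(model.explainedVariance)  # variance explained by each component
  ```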
* [SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests (Gábor Lipták, 2016-01-20, 1 file, -0/+1)
  This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of the bisecting k-means unit test.
  Author: Gábor Lipták <gliptak@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #10850 from mengxr/SPARK-11295.
* Revert "[SPARK-11295] Add packages to JUnit output for Python tests" (Xiangrui Meng, 2016-01-19, 1 file, -1/+0)
  This reverts commit c6f971b4aeca7265ab374fa46c5c452461d9b6a7.
* [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column (BenFradet, 2016-01-19, 1 file, -2/+3)
  This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #10472 from BenFradet/SPARK-9716.
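
  After the change, a plain double column works as the raw-prediction input; a hedged sketch with toy scores:
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.evaluation import BinaryClassificationEvaluator

  spark = SparkSession.builder.getOrCreate()
  scores = spark.createDataFrame(
      [(0.9, 1.0), (0.8, 1.0), (0.2, 0.0), (0.3, 0.0)],
      ["prediction", "label"])  # prediction is DoubleType, not a vector

  evaluator = BinaryClassificationEvaluator(rawPredictionCol="prediction")
  print(evaluator.evaluate(scores))  # areaUnderROC by default
  ```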
* [SPARK-11295] Add packages to JUnit output for Python tests (Gábor Lipták, 2016-01-19, 1 file, -0/+1)
  This improves grouping/display of test case results.
  Author: Gábor Lipták <gliptak@gmail.com>
  Closes #9263 from gliptak/SPARK-11295.
* [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA (Yanbo Liang, 2016-01-15, 1 file, -10/+62)
  Add PySpark missing methods and params for ml.feature:
  * `RegexTokenizer` should support setting `toLowercase`.
  * `MinMaxScalerModel` should support output `originalMin` and `originalMax`.
  * `PCAModel` should support output `pc`.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9908 from yanboliang/spark-11925.
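
  An illustrative sketch of the three additions (toy data; Spark 2.x-style imports):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.feature import RegexTokenizer, MinMaxScaler, PCA

  spark = SparkSession.builder.getOrCreate()

  # RegexTokenizer now exposes toLowercase.
  tok = RegexTokenizer(inputCol="text", outputCol="words", toLowercase=False)

  df = spark.createDataFrame(
      [(Vectors.dense([1.0]),), (Vectors.dense([3.0]),)], ["features"])

  # MinMaxScalerModel now exposes originalMin / originalMax.
  mm = MinMaxScaler(inputCol="features", outputCol="scaled").fit(df)
  print(mm.originalMin, mm.originalMax)

  # PCAModel now exposes the principal-components matrix pc.
  pca = PCA(k=1, inputCol="features", outputCol="pca").fit(df)
  print(pca.pc)
  ```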
* [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed (Yanbo Liang, 2016-01-06, 2 files, -10/+17)
  PySpark `DecisionTreeClassifier` & `DecisionTreeRegressor` should support `setSeed` like what we do on the Scala side.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9807 from yanboliang/spark-11815.
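
  For example (a sketch; the keyword and setter forms should be equivalent):
  ```python
  from pyspark.ml.classification import DecisionTreeClassifier
  from pyspark.ml.regression import DecisionTreeRegressor

  # Both estimators now accept a seed, matching the Scala API.
  dtc = DecisionTreeClassifier(maxDepth=2, seed=42)
  dtr = DecisionTreeRegressor(maxDepth=2).setSeed(42)
  ```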
* [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml (Yanbo Liang, 2016-01-06, 1 file, -0/+10)
  Add `computeCost` to `KMeansModel` as an evaluator for PySpark spark.ml.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9931 from yanboliang/SPARK-11945.
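
  Usage sketch with toy data; `computeCost` returns the sum of squared distances from each point to its nearest center:
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.clustering import KMeans

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([1.0, 1.0]),),
       (Vectors.dense([9.0, 8.0]),)], ["features"])

  model = KMeans(k=2, seed=1).fit(df)
  print(model.computeCost(df))  # within-cluster sum of squared errors
  ```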
* [SPARK-7675][ML][PYSPARK] sparkml params type conversion (Holden Karau, 2016-01-06, 4 files, -103/+148)
  From JIRA: Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial, since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method. A possible fix would be to include a method "_checkType" in PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.
  This fix instead checks the types at set time, since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float; other conversions (like scipy matrix to array) are left for the future.
  Author: Holden Karau <holden@us.ibm.com>
  Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
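
  So after this change, a sketch like the following should succeed, with the int converted at set time (previously `p=2` raised an error):
  ```python
  from pyspark.ml.feature import Normalizer

  norm = Normalizer(p=2, inputCol="features", outputCol="normFeatures")
  print(norm.getP())  # 2.0 -- the int was converted to a float
  ```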
* [PYSPARK] Fix a Pyspark typo & add a missing abstractmethod annotation (Jeff Zhang, 2015-12-21, 2 files, -2/+3)
  No JIRA was created since this is a trivial change. davies Please help review it.
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #10143 from zjffdu/pyspark_typo.
* [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed (Martin Menestret, 2015-12-16, 1 file, -7/+13)
  Extend CrossValidator with HasSeed in PySpark. This PR replaces https://github.com/apache/spark/pull/7997. CC: yanboliang thunterdb mmenestret Would one of you mind taking a look? Thanks!
  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: Martin MENESTRET <mmenestret@ippon.fr>
  Closes #10268 from jkbradley/pyspark-cv-seed.
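
  A construction-only sketch of the new seed param (the estimator and evaluator choices are illustrative):
  ```python
  from pyspark.ml.classification import LogisticRegression
  from pyspark.ml.evaluation import BinaryClassificationEvaluator
  from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

  lr = LogisticRegression()
  grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()
  cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                      evaluator=BinaryClassificationEvaluator(),
                      numFolds=3, seed=42)  # seed now supported via HasSeed
  ```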
* [MINOR][ML] Use coefficients to replace weights (Yanbo Liang, 2015-12-03, 2 files, -2/+2)
  Use `coefficients` to replace `weights`; I hope these are the last two. mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10065 from yanboliang/coefficients.
* [SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval (Yanbo Liang, 2015-11-19, 2 files, -9/+11)
  * Update the doc for PySpark `HasCheckpointInterval` so that users can understand how to disable checkpointing.
  * Update the doc for PySpark `cacheNodeIds` of `DecisionTreeParams` to note the relationship between `cacheNodeIds` and `checkpointInterval`.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9856 from yanboliang/spark-11875.
* [SPARK-11820][ML][PYSPARK] PySpark LiR & LoR should support weightCol (Yanbo Liang, 2015-11-18, 2 files, -16/+17)
  [SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) have already supported setting a weight column for `LogisticRegression` and `LinearRegression`. It's a very important feature that PySpark should also support. mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9811 from yanboliang/spark-11820.
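
  A sketch with a toy weighted dataset (Spark 2.x-style imports); rows with larger weights count more toward the fit:
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.classification import LogisticRegression

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(1.0, 2.0, Vectors.dense([1.0])),
       (0.0, 1.0, Vectors.dense([-1.0]))],
      ["label", "weight", "features"])

  model = LogisticRegression(weightCol="weight").fit(df)
  ```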
* [SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification (Yu ISHIKAWA, 2015-11-09, 1 file, -0/+56)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8690 from yu-iskw/SPARK-10280.
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits (Imran Rashid, 2015-11-06, 2 files, -13/+13)
  https://issues.apache.org/jira/browse/SPARK-10116
  This is really trivial, just happened to notice it: if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #8314 from squito/SPARK-10116.
* [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver (Yanbo Liang, 2015-11-05, 1 file, -8/+8)
  Following up on [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for `intercept`.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9485 from yanboliang/spark-11473.
* [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale (Yanbo Liang, 2015-11-05, 1 file, -0/+24)
  PySpark `AFTSurvivalRegressionModel` should expose coefficients/intercept/scale. mengxr vectorijk
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9492 from yanboliang/spark-11527.
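
  An illustrative sketch (toy survival data with a `censor` column; Spark 2.x-style imports):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.regression import AFTSurvivalRegression

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(1.218, 1.0, Vectors.dense([1.560, -0.605])),
       (2.949, 0.0, Vectors.dense([0.346, 2.158])),
       (3.627, 0.0, Vectors.dense([1.380, 0.231])),
       (0.273, 1.0, Vectors.dense([0.520, 1.151]))],
      ["label", "censor", "features"])

  model = AFTSurvivalRegression().fit(df)
  print(model.coefficients, model.intercept, model.scale)  # newly exposed
  ```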
* [SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models (vectorijk, 2015-11-02, 2 files, -0/+25)
  Deprecated in `LogisticRegression` and `LinearRegression`.
  Author: vectorijk <jiangkai@gmail.com>
  Closes #9311 from vectorijk/spark-10592.
* [SPARK-10286][ML][PYSPARK][DOCS] Add @since annotation to pyspark.ml.param and pyspark.ml.* (lihao, 2015-11-02, 4 files, -0/+230)
  Author: lihao <lihaowhu@gmail.com>
  Closes #9275 from lidinghao/SPARK-10286.
* [SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solver (Yanbo Liang, 2015-10-28, 3 files, -22/+37)
  [SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided the `WeightedLeastSquares` solver ("normal") in `LinearRegression` with L2 regularization in Scala and R; Python ML `LinearRegression` should also support setting the solver ("auto", "normal", "l-bfgs").
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9328 from yanboliang/spark-11367.
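
  For example (a sketch; "auto" lets Spark pick, "normal" uses the normal-equation WeightedLeastSquares path, "l-bfgs" uses iterative optimization):
  ```python
  from pyspark.ml.regression import LinearRegression

  lr = LinearRegression(solver="normal")
  print(lr.getSolver())  # "normal"
  ```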
* [SPARK-10024][PYSPARK] Python API RF and GBT related params clear up (vectorijk, 2015-10-27, 2 files, -338/+168)
  Implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for the Python API in pyspark/ml/{classification, regression}.py.
  Author: vectorijk <jiangkai@gmail.com>
  Closes #9233 from vectorijk/spark-10024.
* [SPARK-7021] Add JUnit output for Python unit tests (Gábor Lipták, 2015-10-22, 1 file, -1/+8)
  WIP
  Author: Gábor Lipták <gliptak@gmail.com>
  Closes #8323 from gliptak/SPARK-7021.
* [MINOR][ML] fix doc warnings (Xiangrui Meng, 2015-10-20, 1 file, -0/+1)
  Without an empty line, sphinx will treat the doctest as part of the docstring, producing errors like these for the table rows in the CountVectorizer doctest output. holdenk
  ~~~
  /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | ..."
  /Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])"
  ~~~
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9188 from mengxr/py-count-vec-doc-fix.
* [SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent (Holden Karau, 2015-10-20, 3 files, -65/+65)
  Namely, "." shows up in some places in the template when using the param docstring and not in others.
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
* [SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression (Holden Karau, 2015-10-07, 3 files, -1/+149)
  Add the Python API for IsotonicRegression.
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.
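
  A usage sketch with toy data (Spark 2.x-style imports):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.regression import IsotonicRegression

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(0.0, Vectors.dense([0.0])), (0.5, Vectors.dense([1.0])),
       (1.0, Vectors.dense([2.0])), (0.8, Vectors.dense([3.0])),
       (1.2, Vectors.dense([4.0]))],
      ["label", "features"])

  model = IsotonicRegression().fit(df)
  print(model.boundaries)   # feature values where the fitted step function changes
  print(model.predictions)  # fitted values at those boundaries
  ```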
* [SPARK-10957] [ML] setParams changes quantileProbabilities unexpectedly in PySpark's AFTSurvivalRegression (Xiangrui Meng, 2015-10-06, 1 file, -5/+1)
  If the user doesn't specify `quantileProbs` in `setParams`, it gets reset to the default value. We don't need special handling here. vectorijk yanboliang
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9001 from mengxr/SPARK-10957.
* [SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression (vectorijk, 2015-10-06, 1 file, -2/+169)
  Implement the Python API for AFTSurvivalRegression.
  Author: vectorijk <jiangkai@gmail.com>
  Closes #8926 from vectorijk/spark-10688.
* [SPARK-9681] [ML] Support R feature interactions in RFormula (Eric Liang, 2015-09-25, 1 file, -1/+1)
  This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms. mengxr
  Author: Eric Liang <ekl@databricks.com>
  Closes #8830 from ericl/interaction-2.
* [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting (noelsmith, 2015-09-21, 4 files, -0/+9)
  Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs, i.e.:
  ![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)
  .. looks like this once the newline is added:
  ![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)
  Author: noelsmith <mail@noelsmith.com>
  Closes #8851 from noel-smith/docstring-missing-newline-fix.
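
  In other words, a docstring needs a blank line before the first `:param:` for Sphinx to render the field list; a hypothetical example of the fixed shape:
  ```python
  def setRange(self, min=0.0, max=1.0):
      """Sets the rescaling range.

      :param min: lower bound after transformation, shared by all features
      :param max: upper bound after transformation, shared by all features
      :return: self
      """
  ```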
* [SPARK-9769] [ML] [PY] add python api for countvectorizermodel (Holden Karau, 2015-09-21, 1 file, -6/+142)
  From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel.
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
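
  A minimal fit/transform sketch (toy word lists):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.feature import CountVectorizer

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(["a", "b", "c"],), (["a", "b", "b", "c", "a"],)], ["words"])

  model = CountVectorizer(inputCol="words", outputCol="vectors").fit(df)
  print(model.vocabulary)  # learned vocabulary, most frequent first
  model.transform(df).show(truncate=False)
  ```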
* [SPARK-10615] [PYSPARK] change assertEquals to assertEqual (Yanbo Liang, 2015-09-18, 1 file, -8/+8)
  As `assertEquals` is deprecated, we need to change `assertEquals` to `assertEqual` in the existing python unit tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #8814 from yanboliang/spark-10615.
* [SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+28)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8692 from yu-iskw/SPARK-10282.
* [SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+13)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8691 from yu-iskw/SPARK-10281.
* [SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+65)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8693 from yu-iskw/SPARK-10283.
* [SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning (Yu ISHIKAWA, 2015-09-17, 1 file, -0/+28)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8694 from yu-iskw/SPARK-10284.
* [SPARK-8530] [ML] add python API for MinMaxScaler (Yuhao Yang, 2015-09-11, 1 file, -5/+99)
  JIRA: https://issues.apache.org/jira/browse/SPARK-8530
  Add a python API for MinMaxScaler. JIRA for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #7150 from hhbyyh/pythonMinMax.
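
  A short sketch (toy data; the defaults rescale each feature to [0, 1]):
  ```python
  from pyspark.sql import SparkSession
  from pyspark.ml.linalg import Vectors
  from pyspark.ml.feature import MinMaxScaler

  spark = SparkSession.builder.getOrCreate()
  df = spark.createDataFrame(
      [(Vectors.dense([0.0]),), (Vectors.dense([2.0]),)], ["features"])

  scaler = MinMaxScaler(inputCol="features", outputCol="scaled")
  scaler.fit(df).transform(df).show()
  ```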