aboutsummaryrefslogtreecommitdiff
path: root/python/pyspark/mllib
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & ↵Yanbo Liang2016-02-221-2/+2
| | | | | | | | | | | | | | MLlib ## What changes were proposed in this pull request? In order to provide better and consistent result, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6``` which will be equal to ML ```LogisticRegression```. cc dbtsai ## How was the this patch tested? unit tests Author: Yanbo Liang <ybliang8@gmail.com> Closes #11299 from yanboliang/spark-13429.
* [SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent ↵Bryan Cutler2016-02-222-34/+102
| | | | | | | | | | | | | | format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the fpm and recommendation modules. Closes #10602 Closes #10897 Author: Bryan Cutler <cutlerb@gmail.com> Author: somideshmukh <somilde@us.ibm.com> Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
* Correct SparseVector.parse documentationMiles Yucht2016-02-161-1/+1
| | | | | | | | There's a small typo in the SparseVector.parse docstring (which says that it returns a DenseVector rather than a SparseVector), which seems to be incorrect. Author: Miles Yucht <miles@databricks.com> Closes #11213 from mgyucht/fix-sparsevector-docs.
* [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed testLiang-Chi Hsieh2016-02-131-6/+19
| | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12363 This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10539 from viirya/fix-poweriter.
* [SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to ↵vijaykiran2016-02-121-118/+143
| | | | | | | | | | | consistent format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module. Author: vijaykiran <mail@vijaykiran.com> Author: Bryan Cutler <cutlerb@gmail.com> Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
* [SPARK-12986][DOC] Fix pydoc warnings in mllib/regression.pyNam Pham2016-02-081-13/+21
| | | | | | | | I have fixed the warnings by running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs. Author: Nam Pham <phamducnam@gmail.com> Closes #11025 from nampham2/SPARK-12986.
* [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent ↵Bryan Cutler2016-02-021-74/+191
| | | | | | | | | | format Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module. Author: Bryan Cutler <cutlerb@gmail.com> Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
* [SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in ↵Xiangrui Meng2016-01-251-0/+1
| | | | | | | | | | | | | | PySpark for now I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakyness in SPARK-10086. gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"? cc: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #10909 from mengxr/SPARK-10086.
* [SPARK-11295][PYSPARK] Add packages to JUnit output for Python testsGábor Lipták2016-01-201-11/+15
| | | | | | | | | This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test. Author: Gábor Lipták <gliptak@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #10850 from mengxr/SPARK-11295.
* Revert "[SPARK-11295] Add packages to JUnit output for Python tests"Xiangrui Meng2016-01-191-14/+10
| | | | This reverts commit c6f971b4aeca7265ab374fa46c5c452461d9b6a7.
* [SPARK-11295] Add packages to JUnit output for Python testsGábor Lipták2016-01-191-10/+14
| | | | | | | | | | SPARK-11295 Add packages to JUnit output for Python tests This improves grouping/display of test case results. Author: Gábor Lipták <gliptak@gmail.com> Closes #9263 from gliptak/SPARK-11295.
* [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k meansHolden Karau2016-01-192-5/+142
| | | | | | | | From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans. Author: Holden Karau <holden@us.ibm.com> Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
* [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support ↵Yanbo Liang2016-01-111-13/+22
| | | | | | | | | | single instance predict/predictSoft PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10552 from yanboliang/spark-12603.
* [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 editionSean Owen2016-01-081-1/+1
| | | | | | | | Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs. Author: Sean Owen <sowen@cloudera.com> Closes #10570 from srowen/SPARK-12618.
* [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not Nonezero3232016-01-072-1/+13
| | | | | | | | If initial model passed to GMM is not empty it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to list. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #10644 from zero323/SPARK-12006.
* Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None"Yin Huai2016-01-062-13/+1
| | | | | | | | This reverts commit fcd013cf70e7890aa25a8fe3cb6c8b36bf0e1f04. Author: Yin Huai <yhuai@databricks.com> Closes #10632 from yhuai/pythonStyle.
* [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not Nonezero3232016-01-062-1/+13
| | | | | | | | If initial model passed to GMM is not empty it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9986 from zero323/SPARK-12006.
* [SPARK-11531][ML] SparseVector error MsgJoshi2016-01-061-1/+3
| | | | | | | | | PySpark SparseVector should have "Found duplicate indices" error message Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #9525 from rekhajoshm/SPARK-11531.
* [SPARK-12041][ML][PYSPARK] Add columnSimilarities to IndexedRowMatrixKai Jiang2016-01-051-0/+14
| | | | | | | | Add `columnSimilarities` to IndexedRowMatrix for PySpark spark.mllib.linalg. Author: Kai Jiang <jiangkai@gmail.com> Closes #10158 from vectorijk/spark-12041.
* [SPARK-12296][PYSPARK][MLLIB] Feature parity for pyspark mllib standard ↵Holden Karau2015-12-221-0/+40
| | | | | | | | | | scaler model Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel. Author: Holden Karau <holden@us.ibm.com> Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
* [SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDsBryan Cutler2015-12-201-0/+17
| | | | | | | | Added catch for casting Long to Int exception when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647." Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
* [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllibDavies Liu2015-12-163-11/+9
| | | | | | | | MLlib should use SQLContext.getOrCreate() instead of creating new SQLContext. Author: Davies Liu <davies@databricks.com> Closes #10338 from davies/create_context.
* [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pysparkLiang-Chi Hsieh2015-12-141-1/+5
| | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-12016 We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10100 from viirya/fix-load-py-wordvecmodel.
* [SPARK-10560][PYSPARK][MLLIB][DOCS] Make StreamingLogisticRegressionWithSGD ↵Bryan Cutler2015-11-232-23/+46
| | | | | | | | | | | | | | Python API equal to Scala one This is to bring the API documentation of StreamingLogisticReressionWithSGD and StreamingLinearRegressionWithSGC in line with the Scala versions. -Fixed the algorithm descriptions -Added default values to parameter descriptions -Changed StreamingLogisticRegressionWithSGD regParam to default to 0, as in the Scala version Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9141 from BryanCutler/StreamingLogisticRegressionWithSGD-python-api-sync.
* [SPARK-11566] [MLLIB] [PYTHON] Refactoring GaussianMixtureModel.gaussians in ↵Yu ISHIKAWA2015-11-101-1/+1
| | | | | | | | | | Python cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9534 from yu-iskw/SPARK-11566.
* [SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics ↵Yu ISHIKAWA2015-11-091-0/+6
| | | | | | | | | | in Python more specific cc jkbradley Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9577 from yu-iskw/SPARK-11610.
* [SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in PythonYu ISHIKAWA2015-11-061-15/+18
| | | | | | | | | | | | | Could jkbradley and davies review it? - Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it. - Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`. [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8643 from yu-iskw/SPARK-8467-2.
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bitsImran Rashid2015-11-061-2/+2
| | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-10116 This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod Author: Imran Rashid <irashid@cloudera.com> Closes #8314 from squito/SPARK-10116.
* [SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpanYu ISHIKAWA2015-11-041-1/+68
| | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #9469 from yu-iskw/SPARK-10028.
* [SPARK-11358][MLLIB] deprecate runs in k-meansXiangrui Meng2015-11-021-0/+4
| | | | | | | | | | This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation. cc: srowen Author: Xiangrui Meng <meng@databricks.com> Closes #9322 from mengxr/SPARK-11358.
* [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix ↵Sean Owen2015-10-271-2/+2
| | | | | | | | | | | | returns incorrect answer in some cases Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293 Author: Sean Owen <sowen@cloudera.com> Closes #9309 from srowen/SPARK-11302.2.
* [SPARK-6488][MLLIB][PYTHON] Support addition/multiplication in PySpark's ↵Mike Dusenberry2015-10-271-0/+68
| | | | | | | | | | BlockMatrix This PR adds addition and multiplication to PySpark's `BlockMatrix` class via `add` and `multiply` functions. Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9139 from dusenberrymw/SPARK-6488_Add_Addition_and_Multiplication_to_PySpark_BlockMatrix.
* [SPARK-10271][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.clusteringnoelsmith2015-10-261-1/+68
| | | | | | | | | | Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark). Author: noelsmith <mail@noelsmith.com> Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
* [SPARK-10277] [MLLIB] [PYSPARK] Add @since annotation to ↵Yu ISHIKAWA2015-10-231-1/+101
| | | | | | | | pyspark.mllib.regression Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8684 from yu-iskw/SPARK-10277.
* [SPARK-7021] Add JUnit output for Python unit testsGábor Lipták2015-10-221-1/+8
| | | | | | | | WIP Author: Gábor Lipták <gliptak@gmail.com> Closes #8323 from gliptak/SPARK-7021.
* [SPARK-10269][PYSPARK][MLLIB] Add @since annotation to ↵noelsmith2015-10-201-4/+66
| | | | | | | | | | | | | | pyspark.mllib.classification Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to methods + "versionadded::" to classes derived from the file history. Note - some methods are inherited from the regression module (i.e. LinearModel.intercept) so these won't have version numbers in the API docs until that model is updated. Author: noelsmith <mail@noelsmith.com> Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.
* [SPARK-10272][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.evaluationnoelsmith2015-10-201-0/+41
| | | | | | | | | | | | Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings). Added since to public methods + "versionadded::" to classes (derived from the git file history in pyspark). Note - I added also the tags to MultilabelMetrics even though it isn't declared as public in the __all__ statement... if that's incorrect - I'll remove. Author: noelsmith <mail@noelsmith.com> Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.
* [SPARK-11084] [ML] [PYTHON] Check if index can contain non-zero value before ↵zero3232015-10-162-2/+12
| | | | | | | | | | binary search At this moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks if result is in an expected range after that. It is possible to check if index can contain non-zero value before executing `np.searchsorted`. Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9098 from zero323/sparse_vector_getitem_improved.
* [SPARK-11050] [MLLIB] PySpark SparseVector can return wrong index in e…Bhargav Mangipudi2015-10-161-2/+3
| | | | | | | | | | | | …rror message For negative indices in the SparseVector, we update the index value. If we have an incorrect index at this point, the error message has the incorrect *updated* index instead of the original one. This change contains the fix for the same. Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com> Closes #9069 from bhargav/spark-10759.
* [SPARK-10535] Sync up API for matrix factorization model between Scala and ↵Vladimir Vladimirov2015-10-091-4/+28
| | | | | | | | | | PySpark Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com> Closes #8700 from smartkiwi/SPARK-10535_.
* [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train ↵Bryan Cutler2015-10-082-2/+3
| | | | | | | | | | with given regParam and convergenceTol parameters These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as regularization value. Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.
* [SPARK-10973] [ML] [PYTHON] __gettitem__ method throws IndexError exception ↵zero3232015-10-082-5/+10
| | | | | | | | | | | | | | | | | | | | | | | when we… __gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry from pyspark.mllib.linalg import Vectors sv = Vectors.sparse(5, {1: 3}) sv[0] ## 0.0 sv[1] ## 3.0 sv[2] ## Traceback (most recent call last): ## File "<stdin>", line 1, in <module> ## File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__ ## row_ind = inds[insert_index] ## IndexError: index out of bounds Author: zero323 <matthew.szymkiewicz@gmail.com> Closes #9009 from zero323/sparse_vector_index_error.
* [SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark ↵Evan Chen2015-10-071-2/+15
| | | | | | | | | | (spark.mllib) Provide initialModel param for pyspark.mllib.clustering.KMeans Author: Evan Chen <chene@us.ibm.com> Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
* [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formattingnoelsmith2015-09-212-1/+2
| | | | | | | | | | | | | | Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs. I.e: ![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png) .. looks like this once newline is added: ![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png) Author: noelsmith <mail@noelsmith.com> Closes #8851 from noel-smith/docstring-missing-newline-fix.
* [SPARK-10631] [DOCUMENTATION, MLLIB, PYSPARK] Added documentation for few APIsvinodkc2015-09-201-5/+17
| | | | | | | | There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts. Author: vinodkc <vinod.kc.in@gmail.com> Closes #8834 from vinodkc/fix_SPARK-10631.
* [SPARK-10615] [PYSPARK] change assertEquals to assertEqualYanbo Liang2015-09-181-81/+81
| | | | | | | | As ```assertEquals``` is deprecated, so we need to change ```assertEquals``` to ```assertEqual``` for existing python unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8814 from yanboliang/spark-10615.
* [SPARK-10274] [MLLIB] Add @since annotation to pyspark.mllib.fpmYu ISHIKAWA2015-09-171-1/+9
| | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8665 from yu-iskw/SPARK-10274.
* [SPARK-10279] [MLLIB] [PYSPARK] [DOCS] Add @since annotation to ↵Yu ISHIKAWA2015-09-171-2/+26
| | | | | | | | pyspark.mllib.util Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8689 from yu-iskw/SPARK-10279.
* [SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.treeYu ISHIKAWA2015-09-171-1/+35
| | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8685 from yu-iskw/SPARK-10278.
* [SPARK-10276] [MLLIB] [PYSPARK] Add @since annotation to ↵Yu ISHIKAWA2015-09-161-1/+35
| | | | | | | | pyspark.mllib.recommendation Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8677 from yu-iskw/SPARK-10276.