path: root/python/pyspark/ml
Commit message | Author | Age | Files | Lines
* [SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc (krishnakalyan3, 2016-07-27, 1 file, -2/+11)

  ## What changes were proposed in this pull request?
  Updated the ML pipeline Cross Validation Scaladoc & PyDoc.

  ## How was this patch tested?
  Documentation update.

  Author: krishnakalyan3 <krishnakalyan3@gmail.com>
  Closes #13894 from krishnakalyan3/kfold-cv.

* [SPARK-16653][ML][OPTIMIZER] Update ANN convergence tolerance param default to 1e-6 (WeichenXu, 2016-07-25, 1 file, -4/+4)

  ## What changes were proposed in this pull request?
  Change the ANN convergence tolerance param default from 1e-4 to 1e-6 so that it matches the other MLlib algorithms that use LBFGS as the optimizer.

  ## How was this patch tested?
  Existing tests.

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #14286 from WeichenXu123/update_ann_tol.

* [PYSPARK] Add picklable SparseMatrix in pyspark.ml.common (WeichenXu, 2016-07-24, 1 file, -0/+1)

  ## What changes were proposed in this pull request?
  Add the `SparseMatrix` class, which supports pickling.

  ## How was this patch tested?
  Existing tests.

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #14265 from WeichenXu123/picklable_py.

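  For reference, a minimal sketch (not from the patch) of the class the pickler covers, assuming a Spark 2.0+ environment with numpy available:

  ```python
  import pickle
  from pyspark.ml.linalg import SparseMatrix

  # A 2x2 CSC matrix: colPtrs=[0, 1, 2], rowIndices=[0, 1], values=[2.0, 3.0]
  # represents diag(2.0, 3.0).
  sm = SparseMatrix(2, 2, [0, 1, 2], [0, 1], [2.0, 3.0])
  rt = pickle.loads(pickle.dumps(sm))  # local pickle round-trip
  assert rt.toArray().tolist() == sm.toArray().tolist()
  ```
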
* [SPARK-16494][ML] Upgrade breeze version to 0.12 (Yanbo Liang, 2016-07-19, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  breeze 0.12 was released more than half a year ago, and it brings many new features, performance improvements, and bug fixes. One of the biggest features is `LBFGS-B`, an implementation of `LBFGS` with box constraints that is much faster for some special cases. We would like to implement the Huber loss function for `LinearRegression` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)), which requires `LBFGS-B` as the optimization solver, so we should bump the breeze dependency to 0.12. For more features, improvements, and bug fixes in breeze 0.12, refer to: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c

  ## How was this patch tested?
  No new tests; should pass the existing ones.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #14150 from yanboliang/spark-16494.

* [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide (Joseph K. Bradley, 2016-07-15, 2 files, -3/+3)

  ## What changes were proposed in this pull request?
  Made the DataFrame-based API primary:
  * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
  * mllib-guide.html keeps the RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
  * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
    * **Reviewers: please check this carefully**
  * (minor) Titles for the DF API no longer include the "- spark.ml" suffix. Titles for the RDD API have the "- RDD-based API" suffix
  * Moved the migration guide to ml-guide from mllib-guide
    * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
    * **Reviewers**: I did not change any of the content of the migration guides.

  Reorganized the DataFrame-based guide:
  * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
  * Moved the Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
    * **Reviewers**: I did not change the content of these guides, except some intro text.
  * The sidebar remains the same, but with pipeline and tuning sections added

  Other:
  * ml-classification-regression.html: Moved the text about linear methods to a new section in the page

  ## How was this patch tested?
  Generated docs locally.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #14213 from jkbradley/ml-guide-2.0.

* [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit for ML (Joseph K. Bradley, 2016-07-13, 4 files, -117/+11)

  ## What changes were proposed in this pull request?
  General decisions to follow, except where noted:
  * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone.
  * spark.ml, pyspark.ml:
    * Annotate Estimator-Model pairs of classes and companion objects the same way.
    * For all algorithms marked Experimental with Since tag <= 1.6, remove the Experimental annotation.
    * For all algorithms marked Experimental with Since tag = 2.0, leave the Experimental annotation.
  * DeveloperApi annotations are left alone, except where noted.
  * No changes to which types are sealed.

  Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new:
  * Model Summary classes
  * MLWriter, MLReader, MLWritable, MLReadable
  * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency.
  * RFormula: Its behavior may need to change slightly to match R in edge cases.
  * AFTSurvivalRegression
  * MultilayerPerceptronClassifier

  DeveloperApi changes:
  * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi

  ## How was this patch tested?
  N/A

  Note to reviewers:
  * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental.
  * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #14147 from jkbradley/experimental-audit.

* [SPARK-16348][ML][MLLIB][PYTHON] Use full classpaths for pyspark ML JVM calls (Joseph K. Bradley, 2016-07-05, 2 files, -9/+9)

  ## What changes were proposed in this pull request?
  Issue: Omitting the full classpath can cause problems when calling JVM methods or classes from pyspark. This PR changes all uses of jvm.X in pyspark.ml and pyspark.mllib to use the full classpath for X.

  ## How was this patch tested?
  Existing unit tests. Manual testing in an environment where this was an issue.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #14023 from jkbradley/SPARK-16348.

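  An illustrative sketch of the pattern this change enforces (the class below is the ml pickler registered under spark.ml.python per the SPARK-15364 entry later in this log; the snippet itself is not from the patch, and `_jvm` is a private py4j gateway attribute):

  ```python
  from pyspark import SparkContext

  sc = SparkContext.getOrCreate()
  # Bare name -- resolution depends on what the gateway happens to have imported:
  #   serde = sc._jvm.MLSerDe
  # Fully qualified classpath -- always resolvable:
  serde = sc._jvm.org.apache.spark.ml.python.MLSerDe
  ```
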
* [SPARK-16127][ML][PYSPARK] Audit @Since annotations related to ml.linalg (Nick Pentreath, 2016-06-22, 2 files, -6/+10)

  [SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals were not updated accordingly to `2.0.0`. This PR updates them.

  ## How was this patch tested?
  Existing unit tests.

  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.

* [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] Update some pydocs (Holden Karau, 2016-06-22, 1 file, -2/+36)

  ## What changes were proposed in this pull request?
  Mark ml.classification algorithms as experimental to match the Scala algorithms, update the PyDoc for thresholds on `LogisticRegression` to have the same level of info as Scala, and enable mathjax for PyDoc.

  ## How was this patch tested?
  Built docs locally & PySpark SQL tests.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.

* [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None (Bryan Cutler, 2016-06-21, 4 files, -7/+7)

  ## What changes were proposed in this pull request?
  Several places set the seed Param default value to None, which translates to a zero value on the Scala side. This is unnecessary because a fixed default value already exists, and if a test depends on a zero-valued seed it should set it to zero explicitly instead of relying on this translation. These cases can be safely removed, except for the ALS doc test, which has been changed to set the seed value to zero.

  ## How was this patch tested?
  Ran PySpark tests locally.

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.

* [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature (Nick Pentreath, 2016-06-21, 1 file, -5/+5)

  This PR adds missing `Since` annotations to the `ml.feature` package. Closes #8505.

  ## How was this patch tested?
  Existing tests.

  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13641 from MLnick/add-since-annotations.

* [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel (Bryan Cutler, 2016-06-20, 2 files, -2/+6)

  ## What changes were proposed in this pull request?
  Fixed a missing import for DecisionTreeRegressionModel, used in the GBTClassificationModel trees method.

  ## How was this patch tested?
  Local tests.

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.

* [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python (Liang-Chi Hsieh, 2016-06-13, 13 files, -16/+153)

  ## What changes were proposed in this pull request?
  We now have PySpark picklers for the new and old vector/matrix types, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers for the new vector/matrix under `spark.ml.python` instead.

  ## How was this patch tested?
  Existing tests.

  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #13219 from viirya/pyspark-pickler-ml.

* [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar ↵Bryan Cutler2016-06-101-0/+12
| | | | | | | | | | | | | | to Scala API ## What changes were proposed in this pull request? Adding __str__ to RFormula and model that will show the set formula param and resolved formula. This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review. ## How was this patch tested? run pyspark-ml tests locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
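  A minimal usage sketch (the formula string is made up for illustration):

  ```python
  from pyspark.ml.feature import RFormula

  rf = RFormula(formula="clicked ~ country + hour")
  # __str__ now includes the set formula param, e.g.
  # "RFormula(clicked ~ country + hour) ..."
  print(str(rf))
  ```
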
* [SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameter (WeichenXu, 2016-06-10, 1 file, -5/+24)

  ## What changes were proposed in this pull request?
  Add the max sentence length parameter to the Python Word2Vec API.

  ## How was this patch tested?
  Existing tests.

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.

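  A sketch of the new parameter in use, assuming it is exposed as `maxSentenceLength` to match the Scala side (column names illustrative):

  ```python
  from pyspark.ml.feature import Word2Vec

  # Sentences longer than maxSentenceLength tokens are split up before training.
  w2v = Word2Vec(vectorSize=50, minCount=2, maxSentenceLength=1000,
                 inputCol="words", outputCol="vectors")
  ```
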
* [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" propertyJeff Zhang2016-06-091-0/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? add method idf to IDF in pyspark ## How was this patch tested? add unit test Author: Jeff Zhang <zjffdu@apache.org> Closes #13540 from zjffdu/SPARK-15788.
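  A short sketch of reading the new property (the toy data and the `spark` session are assumed):

  ```python
  from pyspark.ml.feature import IDF
  from pyspark.ml.linalg import Vectors

  df = spark.createDataFrame([(Vectors.dense([1.0, 2.0]),),
                              (Vectors.dense([0.0, 1.0]),)], ["tf"])
  model = IDF(inputCol="tf", outputCol="tfidf").fit(df)
  print(model.idf)  # DenseVector of per-feature inverse document frequencies
  ```
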
* [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples (Yanbo Liang, 2016-06-06, 1 file, -1/+1)

  ## What changes were proposed in this pull request?
  Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated `precision` in `MulticlassClassificationEvaluator`, many ML examples broke:

  ```python
  pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
  ```

  We should use `accuracy` to replace `precision` in these examples.

  ## How was this patch tested?
  Offline tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13519 from yanboliang/spark-15771.

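  The corrected evaluator configuration, as a sketch (`predictions` is assumed to be a DataFrame with prediction and label columns):

  ```python
  from pyspark.ml.evaluation import MulticlassClassificationEvaluator

  evaluator = MulticlassClassificationEvaluator(metricName="accuracy")
  # accuracy = evaluator.evaluate(predictions)
  # metricName="precision" now fails with IllegalArgumentException.
  ```
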
* [MINOR] Fix Typos 'an -> a' (Zheng RuiFeng, 2016-06-06, 2 files, -3/+3)

  ## What changes were proposed in this pull request?
  `an -> a`. Used commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and reviewed them one by one.

  ## How was this patch tested?
  Manual tests.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #13515 from zhengruifeng/an_a.

* [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score (Ruifeng Zheng, 2016-06-04, 1 file, -3/+1)

  ## What changes were proposed in this pull request?
  1. Delete precision and recall in `ml.MulticlassClassificationEvaluator`.
  2. Update the user guide for `mllib.weightedFMeasure`.

  ## How was this patch tested?
  Local build.

  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Closes #13390 from zhengruifeng/clarify_f1.

* [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier (Holden Karau, 2016-06-03, 1 file, -9/+66)

  ## What changes were proposed in this pull request?
  MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params. Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169).

  ## How was this patch tested?
  Doc tests.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.

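  A sketch of a classifier configured with the newly exposed params (layer sizes and values are illustrative):

  ```python
  from pyspark.ml.classification import MultilayerPerceptronClassifier

  mlp = MultilayerPerceptronClassifier(layers=[4, 5, 3],  # input, hidden, output
                                       solver="l-bfgs",   # or "gd"
                                       stepSize=0.03,
                                       maxIter=100,
                                       seed=42)
  ```
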
* [SPARK-15092][SPARK-15139][PYSPARK][ML] Pyspark TreeEnsemble missing methods (Holden Karau, 2016-06-02, 2 files, -1/+67)

  ## What changes were proposed in this pull request?
  Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel`.

  ## How was this patch tested?
  Extended doc tests.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.

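  A sketch of the new accessors on a fitted ensemble (a `train_df` with default `features`/`label` columns is assumed):

  ```python
  from pyspark.ml.classification import RandomForestClassifier

  model = RandomForestClassifier(numTrees=10).fit(train_df)
  print(model.totalNumNodes)   # node count summed over all trees
  print(model.toDebugString)   # full text dump of every tree in the ensemble
  ```
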
* [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature (Yanbo Liang, 2016-06-01, 1 file, -16/+13)

  ## What changes were proposed in this pull request?
  ML 2.0 QA: Scala APIs audit for ml.feature. Mainly includes:
  * Remove seed for `QuantileDiscretizer`, since we use `approxQuantile` to produce bins and `seed` is useless.
  * Scala API docs update.
  * Sync Scala and Python API docs for these changes.

  ## How was this patch tested?
  Existing tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13410 from yanboliang/spark-15587.

* [MINOR][DOC][ML] ml.clustering scala & python api doc sync (Yanbo Liang, 2016-05-31, 1 file, -10/+25)

  ## What changes were proposed in this pull request?
  Since we did the Scala API audit for ml.clustering at #13148, we should also fix and update the corresponding Python API docs to keep them in sync.

  ## How was this patch tested?
  Docs change, no tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13291 from yanboliang/spark-15361-followup.

* [SPARK-15008][ML][PYSPARK] Add integration test for OneVsRest (yinxusen, 2016-05-27, 1 file, -23/+46)

  ## What changes were proposed in this pull request?
  1. Add `_transfer_param_map_to/from_java` for OneVsRest.
  2. Add `_compare_params` in ml/tests.py to help compare params.
  3. Add `test_onevsrest` as the integration test for OneVsRest.

  ## How was this patch tested?
  Python unit tests.

  Author: yinxusen <yinxusen@gmail.com>
  Closes #12875 from yinxusen/SPARK-15008.

* [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS (Nick Pentreath, 2016-05-25, 1 file, -4/+2)

  Remove "Default: MEMORY_AND_DISK" from the `Param` doc field in the ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice. We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.

  Tests N/A.

  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.

* [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions (Holden Karau, 2016-05-24, 1 file, -13/+17)

  ## What changes were proposed in this pull request?
  PySpark: Add links to the predictors from the models in regression.py and improve the linear and isotonic pydoc in minor ways.
  User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" Ubuntu install, and add sudo to match the rest of the commands.
  User guide: Add a note about using gem2.0 for systems with both Ruby 1.9 and 2.0 (e.g. some Ubuntu, but maybe more).

  ## How was this patch tested?
  Built pydocs locally, tested the new user build instructions.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.

* [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer (Nick Pentreath, 2016-05-24, 1 file, -15/+36)

  This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. It also cleans up a duplication of `numBuckets`, where the param was both a class and an instance attribute (the instance attribute was removed to match the style of params throughout `ml`). Finally, it cleans up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.

  ## How was this patch tested?
  A little doctest, and built API docs locally to check HTML doc generation.

  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #13228 from MLnick/SPARK-15442-py-relerror-param.

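  The param in use, as a sketch (column names illustrative):

  ```python
  from pyspark.ml.feature import QuantileDiscretizer

  # relativeError controls the precision of the approxQuantile computation
  # used to pick bucket boundaries (0.0 = exact but expensive).
  qd = QuantileDiscretizer(numBuckets=4, relativeError=0.01,
                           inputCol="hour", outputCol="hour_bucket")
  ```
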
* [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code (WeichenXu, 2016-05-23, 7 files, -107/+121)

  ## What changes were proposed in this pull request?
  Replace SQLContext and SparkContext with SparkSession using the builder pattern in the Python test code.

  ## How was this patch tested?
  Existing tests.

  Author: WeichenXu <WeichenXu123@outlook.com>
  Closes #13242 from WeichenXu123/python_doctest_update_sparksession.

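  The builder pattern in question, as a short sketch (master and app name are illustrative):

  ```python
  from pyspark.sql import SparkSession

  spark = SparkSession.builder \
      .master("local[2]") \
      .appName("ml-tests") \
      .getOrCreate()
  # ... run tests against `spark` ...
  spark.stop()
  ```
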
* [SPARK-15444][PYSPARK][ML][HOTFIX] Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression (Liang-Chi Hsieh, 2016-05-20, 1 file, -6/+5)

  ## What changes were proposed in this pull request?
  The default value of the param linkPredictionCol for GeneralizedLinearRegression differed between PySpark and Scala because of a default-value conflict between #13106 and #13129. This caused ml.tests to fail.

  ## How was this patch tested?
  Existing tests.

  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Closes #13220 from viirya/hotfix-regresstion.

* [MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync (Yanbo Liang, 2016-05-19, 1 file, -4/+1)

  ## What changes were proposed in this pull request?
  `ml.evaluation` Scala and Python API sync.

  ## How was this patch tested?
  Only API docs change, no new tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13195 from yanboliang/evaluation-doc.

* [SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression (Holden Karau, 2016-05-19, 1 file, -11/+35)

  ## What changes were proposed in this pull request?
  Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list.

  ## How was this patch tested?
  Doctests & built docs locally.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.

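  A sketch of the new param (the family/link choice is illustrative):

  ```python
  from pyspark.ml.regression import GeneralizedLinearRegression

  # linkPredictionCol stores the prediction on the link scale (eta),
  # alongside the usual response-scale prediction column.
  glr = GeneralizedLinearRegression(family="poisson", link="log",
                                    linkPredictionCol="linkPrediction")
  ```
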
* [DOC][MINOR] ml.feature Scala and Python API sync (Bryan Cutler, 2016-05-19, 1 file, -13/+26)

  ## What changes were proposed in this pull request?
  I reviewed the Scala and Python APIs for ml.feature and corrected discrepancies.

  ## How was this patch tested?
  Built docs locally, ran style checks.

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #13159 from BryanCutler/ml.feature-api-sync.

* [SPARK-14891][ML] Add schema validation for ALS (Nick Pentreath, 2016-05-18, 1 file, -4/+4)

  This PR adds schema validation to `ml`'s ALS and ALSModel. Previously, no schema validation was performed, as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, because of the missing validation, users who passed in Long (or Float, etc.) ids had them silently cast to Int with no warning or error thrown. With this PR, ALS now supports all numeric types for the `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently); however, for user/item the cast throws an error if the value is outside integer range. Behavior for the rating col is unchanged (as it is not an issue).

  ## How was this patch tested?
  New test cases in `ALSSuite`.

  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #12762 from MLnick/SPARK-14891-als-validate-schema.

* [SPARK-14978][PYSPARK] PySpark TrainValidationSplitModel should support validationMetrics (Takuya Kuwahara, 2016-05-18, 2 files, -10/+53)

  ## What changes were proposed in this pull request?
  This pull request adds support for validationMetrics on TrainValidationSplitModel in Python, plus a test for it.

  ## How was this patch tested?
  Test in `python/pyspark/ml/tests.py`.

  Author: Takuya Kuwahara <taakuu19@gmail.com>
  Closes #12767 from taku-k/spark-14978.

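  A sketch of the attribute after fitting (`lr`, `grid`, `evaluator`, and `train_df` are assumed to be configured elsewhere):

  ```python
  from pyspark.ml.tuning import TrainValidationSplit

  tvs = TrainValidationSplit(estimator=lr, estimatorParamMaps=grid,
                             evaluator=evaluator)
  model = tvs.fit(train_df)
  print(model.validationMetrics)  # one metric value per param map in the grid
  ```
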
* [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms (DB Tsai, 2016-05-17, 8 files, -112/+94)

  ## What changes were proposed in this pull request?
  Once SPARK-14487 and SPARK-14549 are merged, we will migrate to using the new vector and matrix types in the new ml-pipeline-based APIs.

  ## How was this patch tested?
  Unit tests.

  Author: DB Tsai <dbt@netflix.com>
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12627 from dbtsai/SPARK-14615-NewML.

* [SPARK-14906][ML] Copy linalg in PySpark to new ML package (Xiangrui Meng, 2016-05-17, 2 files, -45/+1556)

  ## What changes were proposed in this pull request?
  Copy the linalg classes (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to the new ML package.

  ## How was this patch tested?
  Existing tests.

  Author: Xiangrui Meng <meng@databricks.com>
  Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.

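  After the copy, DataFrame-based code imports from the new location; a short sketch:

  ```python
  # New home of the linalg types for the DataFrame-based API
  # (pyspark.mllib.linalg remains for the RDD-based API).
  from pyspark.ml.linalg import Vectors, Matrices

  v = Vectors.dense([1.0, 0.0, 3.0])
  m = Matrices.dense(2, 2, [1.0, 2.0, 3.0, 4.0])  # column-major values
  ```
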
* [SPARK-15181][ML][PYSPARK] Python API for GLR summaries (sethah, 2016-05-13, 2 files, -2/+238)

  ## What changes were proposed in this pull request?
  This patch adds a Python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.

  ## How was this patch tested?
  Added a unit test to `pyspark.ml.tests`.

  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #12961 from sethah/GLR_summary.

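  A sketch of the training summary (`train_df` is assumed; the fields shown are a sample of what the summary exposes):

  ```python
  from pyspark.ml.regression import GeneralizedLinearRegression

  model = GeneralizedLinearRegression(family="gaussian").fit(train_df)
  s = model.summary  # training summary computed during fit
  print(s.aic, s.deviance, s.degreesOfFreedom)
  ```
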
* [MINOR][PYSPARK] update _shared_params_code_gen.py (Zheng RuiFeng, 2016-05-13, 3 files, -10/+10)

  ## What changes were proposed in this pull request?
  1. Add argument checks for `tol` and `stepSize` to keep in line with `SharedParamsCodeGen.scala`.
  2. Fix one typo.

  ## How was this patch tested?
  Local build.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12996 from zhengruifeng/py_args_checking.

* [SPARK-15188] Add missing thresholds param to NaiveBayes in PySpark (Holden Karau, 2016-05-13, 1 file, -5/+10)

  ## What changes were proposed in this pull request?
  Add the missing thresholds param to NaiveBayes.

  ## How was this patch tested?
  Doctests.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12963 from holdenk/SPARK-15188-add-missing-naive-bayes-param.

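  The param in a sketch (threshold values illustrative, one per class):

  ```python
  from pyspark.ml.classification import NaiveBayes

  # thresholds rescales each class's probability; the class with the largest
  # p/threshold ratio is predicted.
  nb = NaiveBayes(smoothing=1.0, thresholds=[0.2, 0.8])
  ```
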
* [SPARK-15281][PYSPARK][ML][TRIVIAL] Add impurity param to GBTRegressor & add experimental inside of regression.py (Holden Karau, 2016-05-12, 1 file, -8/+44)

  ## What changes were proposed in this pull request?
  Add the impurity param to GBTRegressor and mark the models & regressors in regression.py as experimental to match the Scaladoc.

  ## How was this patch tested?
  Added a default value to init, tested with unit/doc tests.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #13071 from holdenk/SPARK-15281-GBTRegressor-impurity.

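  The param in a sketch:

  ```python
  from pyspark.ml.regression import GBTRegressor

  # "variance" is the impurity criterion supported for regression trees.
  gbt = GBTRegressor(maxIter=20, impurity="variance")
  ```
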
* [SPARK-15037][SQL][MLLIB] Part2: Use SparkSession instead of SQLContext in Python TestSuites (Sandeep Singh, 2016-05-11, 1 file, -53/+44)

  ## What changes were proposed in this pull request?
  Use SparkSession instead of SQLContext in the Python TestSuites.

  ## How was this patch tested?
  Existing tests.

  Author: Sandeep Singh <sandeep@techaddict.me>
  Closes #13044 from techaddict/SPARK-15037-python.

* [SPARK-15189][PYSPARK][DOCS] Update ml.evaluation PyDoc (Holden Karau, 2016-05-11, 1 file, -1/+12)

  ## What changes were proposed in this pull request?
  Fix a doctest issue, a short param description, and tag items as Experimental.

  ## How was this patch tested?
  Built docs locally & doctests.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12964 from holdenk/SPARK-15189-ml.Evaluation-PyDoc-issues.

* [SPARK-15195][PYSPARK][DOCS] Update ml.tuning PyDocs (Holden Karau, 2016-05-10, 1 file, -1/+15)

  ## What changes were proposed in this pull request?
  Tag classes in ml.tuning as experimental, add docs for the k-fold average metric, and copy the TrainValidationSplit scaladoc for a more detailed explanation.

  ## How was this patch tested?
  Built docs locally.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12967 from holdenk/SPARK-15195-pydoc-ml-tuning.

* [SPARK-15136][PYSPARK][DOC] Fix links to sphinx style and add a default param doc note (Holden Karau, 2016-05-09, 4 files, -22/+37)

  ## What changes were proposed in this pull request?
  PyDoc links in ml are in a non-standard format. Switch to the standard sphinx link format for better formatted documentation. Also add a note about a default value in one place, and copy some extended docs from Scala for GBT.

  ## How was this patch tested?
  Built docs locally.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.

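  The standard sphinx cross-reference style looks like this in a docstring (a made-up class, not a line from the patch):

  ```python
  class Example(object):
      """
      Standard sphinx-style links used in PySpark docstrings:

      :py:class:`pyspark.ml.classification.LogisticRegression`
      :py:attr:`threshold`
      """
  ```
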
* [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover (Burak Köse, 2016-05-06, 2 files, -16/+29)

  ## What changes were proposed in this pull request?
  This PR continues the work from #11871 with the following changes:
  * Load English stopwords as the default
  * Convert stopwords to a list in Python
  * Update some tests and docs

  ## How was this patch tested?
  Unit tests.

  Closes #11871. cc: burakkose srowen

  Author: Burak Köse <burakks41@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Burak KOSE <burakks41@gmail.com>
  Closes #12843 from mengxr/SPARK-14050.

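  A sketch of loading a bundled stopword list (assuming the Python helper mirrors the Scala `loadDefaultStopWords`):

  ```python
  from pyspark.ml.feature import StopWordsRemover

  english = StopWordsRemover.loadDefaultStopWords("english")  # list of words
  remover = StopWordsRemover(inputCol="words", outputCol="filtered",
                             stopWords=english)
  ```
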
* [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA" (Holden Karau, 2016-05-05, 1 file, -0/+4)

  ## What changes were proposed in this pull request?
  Copy the package documentation from Scala/Java to Python for the ML package and remove the beta tags. I am not completely sure we want to drop the BETA tag, but since we are making this API the default, now seems like the time to remove it (happy to put it back if we want to keep it BETA).

  ## How was this patch tested?
  Python documentation built locally as HTML and text and verified the output.

  Author: Holden Karau <holden@us.ibm.com>
  Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.

* [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up (Yanbo Liang, 2016-05-03, 10 files, -219/+110)

  ## What changes were proposed in this pull request?
  PySpark ML Params setter code clean up. For example, `setInputCol` can be simplified from

  ```python
  self._set(inputCol=value)
  return self
  ```

  to:

  ```python
  return self._set(inputCol=value)
  ```

  This is a pretty big sweep, and we cleaned up wherever possible.

  ## How was this patch tested?
  Existing unit tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12749 from yanboliang/spark-14971.

* [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update (Xusen Yin, 2016-05-01, 5 files, -12/+62)

  ## What changes were proposed in this pull request?
  This PR is an update for [https://github.com/apache/spark/pull/12738] which:
  * Adds a generic unit test for JavaParams wrappers in pyspark.ml that checks default Param values against the defaults on the Scala side
  * Various fixes for bugs found
    * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

  Defaults changed:
  * Scala
    * LogisticRegression: weightCol defaults to not set (instead of empty string)
    * StringIndexer: labels default to not set (instead of empty array)
    * GeneralizedLinearRegression:
      * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
      * weightCol defaults to not set (instead of empty string)
    * LinearRegression: weightCol defaults to not set (instead of empty string)
  * Python
    * MultilayerPerceptron: layers default to not set (instead of [1, 1])
    * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

  ## How was this patch tested?
  Generic unit test. Manually tested that unit test by changing defaults and verifying that it broke the test.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: yinxusen <yinxusen@gmail.com>
  Closes #12816 from jkbradley/yinxusen-SPARK-14931.

* [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0 (Herman van Hovell, 2016-04-30, 2 files, -19/+0)

  #### What changes were proposed in this pull request?
  This PR removes three methods that were deprecated in 1.6.0:
  - `PortableDataStream.close()`
  - `LinearRegression.weights`
  - `LogisticRegression.weights`

  The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

  #### How was this patch tested?
  Compilation succeeded.

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #12732 from hvanhovell/SPARK-14952.

* [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel (Junyang, 2016-04-30, 1 file, -7/+8)

  ## What changes were proposed in this pull request?
  This PR fixes the bug that generates infinite distances between word vectors. For example, before this PR,

  ```
  val synonyms = model.findSynonyms("who", 40)
  ```

  gave the following results:

  ```
  to Infinity
  and Infinity
  that Infinity
  with Infinity
  ```

  With this PR, the distance between words is a value between 0 and 1, as follows:

  ```
  scala> model.findSynonyms("who", 10)
  res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

  scala> model.findSynonyms("king", 10)
  res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

  scala> model.findSynonyms("queen", 10)
  res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
  ```

  ### There are two places changed in this PR:
  - Normalize the word vectors to avoid overflow when calculating inner products between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
  - Scale the learning rate by the number of iterations, to be consistent with the Google Word2Vec implementation.

  ## How was this patch tested?
  Used word2vec to train a text corpus and ran model.findSynonyms() to get the distances between word vectors.

  Author: Junyang <fly.shenjy@gmail.com>
  Author: flyskyfly <fly.shenjy@gmail.com>
  Closes #11812 from flyjy/TVec.