spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports ↵	Yanbo Liang	2017-03-08	1	-8/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tweedie distribution. ## What changes were proposed in this pull request? PySpark ```GeneralizedLinearRegression``` supports tweedie distribution. ## How was this patch tested? Add unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #17146 from yanboliang/spark-19806.
*	[SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe	Bryan Cutler	2017-03-03	1	-14/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
*	[SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across ↵	hyukjinkwon	2016-11-22	1	-16/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Python API documentation ## What changes were proposed in this pull request? It seems in Python, there are - `Note:` - `NOTE:` - `Note that` - `.. note::` This PR proposes to fix those to `.. note::` to be consistent. Before <img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png"> <img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png"> After <img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png"> <img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png"> ## How was this patch tested? The notes were found via ```bash grep -r "Note: " . grep -r "NOTE: " . grep -r "Note that " . ``` And then fixed one by one comparing with API documentation. After that, manually tested via `make html` under `./python/docs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15947 from HyukjinKwon/SPARK-18447.
*	[SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM	sethah	2016-11-21	1	-4/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <seth.hendrickson16@gmail.com> Closes #15777 from sethah/pyspark_cluster_summaries.
*	[SPARK-18239][SPARKR] Gradient Boosted Tree for R	Felix Cheung	2016-11-08	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Gradient Boosted Tree in R. With a few minor improvements to RandomForest in R. Since this is relatively isolated I'd like to target this for branch-2.1 ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15746 from felixcheung/rgbt.
*	[SPARK-18110][PYTHON][ML] add missing parameter in Python for RandomForest ↵	Felix Cheung	2016-10-30	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	regression and classification ## What changes were proposed in this pull request? Add subsmaplingRate to randomForestClassifier Add varianceCol to randomForestRegressor In Python ## How was this patch tested? manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15638 from felixcheung/pyrandomforest.
*	[SPARK-17281][ML][MLLIB] Add treeAggregateDepth parameter for ↵	WeichenXu	2016-09-22	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	AFTSurvivalRegression ## What changes were proposed in this pull request? Add treeAggregateDepth parameter for AFTSurvivalRegression to keep consistent with LiR/LoR. ## How was this patch tested? Existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14851 from WeichenXu123/add_treeAggregate_param_for_survival_regression.
*	[SPARK-17197][ML][PYSPARK] PySpark LiR/LoR supports tree aggregation level ↵	Yanbo Liang	2016-08-25	1	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	configurable. ## What changes were proposed in this pull request? [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) makes tree aggregation level in LiR/LoR configurable, this PR makes PySpark support this function. ## How was this patch tested? Since ```aggregationDepth``` is an expert param, I'm not prefer to test it in doctest which is also used for example. Here is the offline test result: ![image](https://cloud.githubusercontent.com/assets/1962026/17879457/f83d7760-68a6-11e6-9936-d0a884d5d6ec.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #14766 from yanboliang/spark-17197.
*	[SPARK-15113][PYSPARK][ML] Add missing num features num classes	Holden Karau	2016-08-22	1	-5/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
*	[MINOR][ML] Rename TreeEnsembleModels to TreeEnsembleModel for PySpark	Yanbo Liang	2016-08-11	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix the typo of ```TreeEnsembleModels``` for PySpark, it should ```TreeEnsembleModel``` which will be consistent with Scala. What's more, it represents a tree ensemble model, so ```TreeEnsembleModel``` should be more reasonable. This should not be used public, so it will not involve breaking change. ## How was this patch tested? No new tests, should pass existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14454 from yanboliang/TreeEnsembleModel.
*	[SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit ↵	Joseph K. Bradley	2016-07-13	1	-27/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	for ML ## What changes were proposed in this pull request? General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml Annotate Estimator-Model pairs of classes and companion objects the same way. For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new: * Model Summary classes * MLWriter, MLReader, MLWritable, MLReadable * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency. * RFormula: Its behavior may need to change slightly to match R in edge cases. * AFTSurvivalRegression * MultilayerPerceptronClassifier DeveloperApi changes: * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi ## How was this patch tested? N/A Note to reviewers: * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental. * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14147 from jkbradley/experimental-audit.
*	[SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg	Nick Pentreath	2016-06-22	1	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \|	[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
*	[SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None	Bryan Cutler	2016-06-21	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Several places set the seed Param default value to None which will translate to a zero value on the Scala side. This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero. ## How was this patch tested? Ran PySpark tests locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
*	[SPARK-16079][PYSPARK][ML] Added missing import for ↵	Bryan Cutler	2016-06-20	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DecisionTreeRegressionModel used in GBTClassificationModel ## What changes were proposed in this pull request? Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method. ## How was this patch tested? Local tests Author: Bryan Cutler <cutlerb@gmail.com> Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
*	[SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ↵	Liang-Chi Hsieh	2016-06-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ml.Matrix under spark.ml.python ## What changes were proposed in this pull request? Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13219 from viirya/pyspark-pickler-ml.
*	[SPARK-15092][SPARK-15139][PYSPARK][ML] Pyspark TreeEnsemble missing methods	Holden Karau	2016-06-02	1	-1/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel` ## How was this patch tested? Extended doc tests. Author: Holden Karau <holden@us.ibm.com> Closes #12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.
*	[SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression ↵	Holden Karau	2016-05-24	1	-13/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	pydoc & doc build insturctions ## What changes were proposed in this pull request? PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways. User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands. User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more). ## How was this patch tested? built pydocs locally, tested new user build instructions Author: Holden Karau <holden@us.ibm.com> Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
*	[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext ↵	WeichenXu	2016-05-23	1	-22/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	with SparkSession using builder pattern in python test code ## What changes were proposed in this pull request? Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
*	[SPARK-15444][PYSPARK][ML][HOTFIX] Default value mismatch of param ↵	Liang-Chi Hsieh	2016-05-20	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	linkPredictionCol for GeneralizedLinearRegression ## What changes were proposed in this pull request? Default value mismatch of param linkPredictionCol for GeneralizedLinearRegression between PySpark and Scala. That is because default value conflict between #13106 and #13129. This causes ml.tests failed. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13220 from viirya/hotfix-regresstion.
*	[SPARK-15316][PYSPARK][ML] Add linkPredictionCol to GeneralizedLinearRegression	Holden Karau	2016-05-19	1	-11/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add linkPredictionCol to GeneralizedLinearRegression and fix the PyDoc to generate the bullet list ## How was this patch tested? doctests & built docs locally Author: Holden Karau <holden@us.ibm.com> Closes #13106 from holdenk/SPARK-15316-add-linkPredictionCol-toGeneralizedLinearRegression.
*	[SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based ↵	DB Tsai	2016-05-17	1	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	algorithms ## What changes were proposed in this pull request? Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Xiangrui Meng <meng@databricks.com> Closes #12627 from dbtsai/SPARK-14615-NewML.
*	[SPARK-15181][ML][PYSPARK] Python API for GLR summaries.	sethah	2016-05-13	1	-1/+200
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs. ## How was this patch tested? Added a unit test to `pyspark.ml.tests` Author: sethah <seth.hendrickson16@gmail.com> Closes #12961 from sethah/GLR_summary.
*	[SPARK-15281][PYSPARK][ML][TRIVIAL] Add impurity param to GBTRegressor & add ↵	Holden Karau	2016-05-12	1	-8/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	experimental inside of regression.py ## What changes were proposed in this pull request? Add impurity param to GBTRegressor and mark the of the models & regressors in regression.py as experimental to match Scaladoc. ## How was this patch tested? Added default value to init, tested with unit/doc tests. Author: Holden Karau <holden@us.ibm.com> Closes #13071 from holdenk/SPARK-15281-GBTRegressor-impurity.
*	[SPARK-15136][PYSPARK][DOC] Fix links to sphinx style and add a default ↵	Holden Karau	2016-05-09	1	-5/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	param doc note ## What changes were proposed in this pull request? PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. Also add a note about default value in one place. Copy some extended docs from scala for GBT ## How was this patch tested? Built docs locally. Author: Holden Karau <holden@us.ibm.com> Closes #12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
*	[SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up	Yanbo Liang	2016-05-03	1	-24/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? PySpark ML Params setter code clean up. For examples, ```setInputCol``` can be simplified from ``` self._set(inputCol=value) return self ``` to: ``` return self._set(inputCol=value) ``` This is a pretty big sweeps, and we cleaned wherever possible. ## How was this patch tested? Exist unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12749 from yanboliang/spark-14971.
*	[SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in ↵	Xusen Yin	2016-05-01	1	-3/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Spark and PySpark - update ## What changes were proposed in this pull request? This PR is an update for [https://github.com/apache/spark/pull/12738] which: * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side * Various fixes for bugs found * This includes changing classes taking weightCol to treat unset and empty String Param values the same way. Defaults changed: * Scala * LogisticRegression: weightCol defaults to not set (instead of empty string) * StringIndexer: labels default to not set (instead of empty array) * GeneralizedLinearRegression: * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver) * weightCol defaults to not set (instead of empty string) * LinearRegression: weightCol defaults to not set (instead of empty string) * Python * MultilayerPerceptron: layers default to not set (instead of [1,1]) * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set) ## How was this patch tested? Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test. Author: Joseph K. Bradley <joseph@databricks.com> Author: yinxusen <yinxusen@gmail.com> Closes #12816 from jkbradley/yinxusen-SPARK-14931.
*	[SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0	Herman van Hovell	2016-04-30	1	-9/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? This PR removes three methods the were deprecated in 1.6.0: - `PortableDataStream.close()` - `LinearRegression.weights` - `LogisticRegression.weights` The rationale for doing this is that the impact is small and that Spark 2.0 is a major release. #### How was this patch tested? Compilation succeded. Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12732 from hvanhovell/SPARK-14952.
*	[MINOR][ML][PYSPARK] Fix omissive params which should use TypeConverter	Yanbo Liang	2016-04-20	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? #11663 adds type conversion functionality for parameters in Pyspark. This PR find out the omissive ```Param``` that did not pass corresponding ```TypeConverter``` argument and fix them. After this PR, all params in pyspark/ml/ used ```TypeConverter```. ## How was this patch tested? Existing tests. cc jkbradley sethah Author: Yanbo Liang <ybliang8@gmail.com> Closes #12529 from yanboliang/typeConverter.
*	[SPARK-14555] First cut of Python API for Structured Streaming	Burak Yavuz	2016-04-20	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes: - ContinuousQuery - Trigger - ProcessingTime in pyspark under `pyspark.sql.streaming`. In addition, it contains the new methods added under: - `DataFrameWriter` a) `startStream` b) `trigger` c) `queryName` - `DataFrameReader` a) `stream` - `DataFrame` a) `isStreaming` This PR doesn't contain all methods exposed for `ContinuousQuery`, for example: - `exception` - `sourceStatuses` - `sinkStatus` They may be added in a follow up. This PR also contains some very minor doc fixes in the Scala side. ## How was this patch tested? Python doc tests TODO: - [ ] verify Python docs look good Author: Burak Yavuz <brkyvz@gmail.com> Author: Burak Yavuz <burak@databricks.com> Closes #12320 from brkyvz/stream-python.
*	[SPARK-14104][PYSPARK][ML] All Python param setters should use the `_set` method	sethah	2016-04-15	1	-12/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens. Additional changes: * [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here * An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here. ## How was this patch tested? Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR. Author: sethah <seth.hendrickson16@gmail.com> Closes #11939 from sethah/SPARK-14104.
*	[SPARK-14374][ML][PYSPARK] PySpark ml GBTClassifier, Regressor support ↵	Yanbo Liang	2016-04-14	1	-2/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	export/import ## What changes were proposed in this pull request? PySpark ml GBTClassifier, Regressor support export/import. ## How was this patch tested? Doc test. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12383 from yanboliang/spark-14374.
*	[SPARK-14573][PYSPARK][BUILD] Fix PyDoc Makefile & highlighting issues	Holden Karau	2016-04-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The PyDoc Makefile used "=" rather than "?=" for setting env variables so it overwrote the user values. This ignored the environment variables we set for linting allowing warnings through. This PR also fixes the warnings that had been introduced. ## How was this patch tested? manual local export & make Author: Holden Karau <holden@us.ibm.com> Closes #12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
*	[SPARK-14472][PYSPARK][ML] Cleanup ML JavaWrapper and related class hierarchy	Bryan Cutler	2016-04-13	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params and JavaCallable is a separate mixin that provides methods to make Java calls. This change simplifies the class structure and to define the Java wrapper in a plain base class along with methods to make Java calls. Also, renames Java wrapper classes to better reflect their purpose. Ran existing Python ml tests and generated documentation to test this change. Author: Bryan Cutler <cutlerb@gmail.com> Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
*	[SPARK-13597][PYSPARK][ML] Python API for GeneralizedLinearRegression	Kai Jiang	2016-04-12	1	-0/+145
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Python API for GeneralizedLinearRegression JIRA: https://issues.apache.org/jira/browse/SPARK-13597 ## How was this patch tested? The patch is tested with Python doctest. Author: Kai Jiang <jiangkai@gmail.com> Closes #11468 from vectorijk/spark-13597.
*	[SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs	Joseph K. Bradley	2016-04-08	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Cleanups to documentation. No changes to code. * GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor * GLM regParam: needs doc saying it is for L2 only * TrainValidationSplitModel: add .. versionadded:: 2.0.0 * Rename “_transformer_params_from_java” to “_transfer_params_from_java” * LogReg Summary classes: “probability” col should not say “calibrated” * LR summaries: coefficientStandardErrors —> document that intercept stderr comes last. Same for t,p-values * approxCountDistinct: Document meaning of “rsd" argument. * LDA: note which params are for online LDA only ## How was this patch tested? Doc build Author: Joseph K. Bradley <joseph@databricks.com> Closes #12266 from jkbradley/ml-doc-cleanups.
*	[SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of ↵	wm624@hotmail.com	2016-04-08	1	-6/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	prediction: Python API ## What changes were proposed in this pull request? A new column VarianceCol has been added to DecisionTreeRegressor in ML scala code. This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor. ## How was this patch tested? ./dev/lint-python PEP8 checks passed. rm -rf _build/* pydoc checks passed. ./python/run-tests --python-executables=python2.7 --modules=pyspark-ml Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log Will test against the following Python executables: ['python2.7'] Will test the following Python modules: ['pyspark-ml'] Finished test(python2.7): pyspark.ml.evaluation (12s) Finished test(python2.7): pyspark.ml.clustering (18s) Finished test(python2.7): pyspark.ml.classification (30s) Finished test(python2.7): pyspark.ml.recommendation (28s) Finished test(python2.7): pyspark.ml.feature (43s) Finished test(python2.7): pyspark.ml.regression (31s) Finished test(python2.7): pyspark.ml.tuning (19s) Finished test(python2.7): pyspark.ml.tests (34s) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12116 from wangmiao1981/fix_api.
*	[SPARK-14373][PYSPARK] PySpark RandomForestClassifier, Regressor support ↵	Kai Jiang	2016-04-08	1	-2/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	export/import ## What changes were proposed in this pull request? supporting `RandomForest{Classifier, Regressor}` save/load for Python API. [JIRA](https://issues.apache.org/jira/browse/SPARK-14373) ## How was this patch tested? doctest Author: Kai Jiang <jiangkai@gmail.com> Closes #12238 from vectorijk/spark-14373.
*	[SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and ↵	Bryan Cutler	2016-04-06	1	-2/+243
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	logistic regression ## What changes were proposed in this pull request? Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML. ## How was this patch tested? Added unit tests to exercise the api calls for the summary classes. Also, manually verified values are expected and match those from Scala directly. Author: Bryan Cutler <cutlerb@gmail.com> Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
*	[SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark	sethah	2016-03-31	1	-10/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Feature importances are exposed in the python API for GBTs. Other changes: * Update the random forest feature importance documentation to not repeat decision tree docstring and instead place a reference to it. ## How was this patch tested? Python doc tests were updated to validate GBT feature importance. Author: sethah <seth.hendrickson16@gmail.com> Closes #12056 from sethah/Pyspark_GBT_feature_importance.
*	[SPARK-13949][ML][PYTHON] PySpark ml DecisionTreeClassifier, Regressor ↵	GayathriMurali	2016-03-24	1	-2/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	support export/import ## What changes were proposed in this pull request? Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests. ## How was this patch tested? Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor. Author: GayathriMurali <gayathri.m.softie@gmail.com> Closes #11892 from GayathriMurali/SPARK-13949.
*	[SPARK-14107][PYSPARK][ML] Add seed as named argument to GBTs in pyspark	sethah	2016-03-24	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets default value. ## How was this patch tested? Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors. Author: sethah <seth.hendrickson16@gmail.com> Closes #11944 from sethah/SPARK-14107.
*	[SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params	sethah	2016-03-23	1	-9/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type. This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira. ## How was this patch tested? Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided. Author: sethah <seth.hendrickson16@gmail.com> Closes #11663 from sethah/SPARK-13068-tc.
*	[SPARK-13951][ML][PYTHON] Nested Pipeline persistence	Joseph K. Bradley	2016-03-22	1	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations. Also: * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader. * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java Added new unit test for nested Pipelines. Abstracted validity check into a helper method for the 2 unit tests. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11866 from jkbradley/nested-pipeline-io. Closes #11835
*	[SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision tree and ↵	sethah	2016-03-11	1	-0/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	random forest ## What changes were proposed in this pull request? This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`. ## How was this patch tested? Python doc tests for the affected classes were updated to check feature importances. Author: sethah <seth.hendrickson16@gmail.com> Closes #11622 from sethah/SPARK-13787.
*	[SPARK-13033] [ML] [PYSPARK] Add import/export for ml.regression	Tommy YU	2016-02-25	1	-4/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/regression.py. yanboliang Please help to review. For doctest, I though it's enough to add one since it's common usage. But I can add to all if we want it. Author: Tommy YU <tummyyu@163.com> Closes #11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits: 3646b36 [Tommy YU] address review comments 9cddc98 [Tommy YU] change base on review and pr 11197 cc61d9d [Tommy YU] remove default parameter set 19535d4 [Tommy YU] add export/import to regression 44a9dc2 [Tommy YU] add import/export for ml.regression
*	[SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup ↵	Holden Karau	2016-02-20	1	-11/+14
\| \| \| \| \| \| \| \| \| \| \| \|	outside of the doctests Some of the new doctests in ml/clustering.py have a lot of setup code, move the setup code to the general test init to keep the doctest more example-style looking. In part this is a follow up to https://github.com/apache/spark/pull/10999 Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time. Author: Holden Karau <holden@us.ibm.com> Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
*	[SPARK-13032][ML][PYSPARK] PySpark support model export/import and take ↵	Yanbo Liang	2016-01-29	1	-5/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	LinearRegression as example * Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark. * Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #10469 from yanboliang/spark-11939.
*	[SPARK-10509][PYSPARK] Reduce excessive param boiler plate code	Holden Karau	2016-01-26	1	-46/+0
\| \| \| \| \| \| \| \|	The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh). Author: Holden Karau <holden@us.ibm.com> Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
*	[SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & ↵	Yanbo Liang	2016-01-06	1	-5/+9
\| \| \| \| \| \| \| \| \| \|	DecisionTreeRegressor should support setSeed PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9807 from yanboliang/spark-11815.
*	[MINOR][ML] Use coefficients replace weights	Yanbo Liang	2015-12-03	1	-1/+1
\| \| \| \| \| \| \| \| \|	Use ```coefficients``` replace ```weights```, I wish they are the last two. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #10065 from yanboliang/coefficients.