aboutsummaryrefslogtreecommitdiff
path: root/python/pyspark/ml/feature.py
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-15040][ML][PYSPARK] Add Imputer to PySparkNick Pentreath2017-03-241-0/+160
| | | | | | | | | | | | Add Python wrapper for `Imputer` feature transformer. ## How was this patch tested? New doc tests and tweak to PySpark ML `tests.py` Author: Nick Pentreath <nickp@za.ibm.com> Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
* [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safeBryan Cutler2017-03-031-60/+60
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `keyword_only` decorator in PySpark is not thread-safe. It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`. If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten. See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code. This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition. It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize. ## How was this patch tested? Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances. Author: Bryan Cutler <cutlerb@gmail.com> Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
* [SPARK-19734][PYTHON][ML] Correct OneHotEncoder doc string to say dropLastMark Grover2017-03-011-1/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Updates the doc string to match up with the code i.e. say dropLast instead of includeFirst ## How was this patch tested? Not much, since it's a doc-like change. Will run unit tests via Jenkins job. Author: Mark Grover <mark@apache.org> Closes #17127 from markgrover/spark_19734.
* [MINOR][ML] Fix comments in LSH Examples and Python APIYun Ni2017-03-011-1/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Remove `org.apache.spark.examples.` in Add slash in one of the python doc. ## How was this patch tested? Run examples using the commands in the comments. Author: Yun Ni <yunn@uber.com> Closes #17104 from Yunni/yunn_minor.
* [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive HashingYun Ni2017-02-151-0/+291
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH. ## How was this patch tested? API and examples are tested using spark-submit: `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py` `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py` User guide changes are generated and manually inspected: `SKIP_API=1 jekyll build` Author: Yun Ni <yunn@uber.com> Author: Yanbo Liang <ybliang8@gmail.com> Author: Yunni <Euler57721@gmail.com> Closes #16715 from Yunni/spark-18080.
* [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in ↵VinceShieh2017-02-151-1/+11
| | | | | | | | | | | | | | | | | pyspark ## What changes were proposed in this pull request? This PR is to document the changes on QuantileDiscretizer in pyspark for PR: https://github.com/apache/spark/pull/15428 ## How was this patch tested? No test needed Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #16922 from VinceShieh/spark-19590.
* [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor changePeng, Meng2017-01-101-4/+5
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add FDR test case in ml/feature/ChiSqSelectorSuite. Improve some comments in the code. This is a follow-up pr for #15212. ## How was this patch tested? ut Author: Peng, Meng <peng.meng@intel.com> Closes #16434 from mpjlu/fdr_fwe_update.
* [SPARK-17645][MLLIB][ML] add feature selector method based on: False ↵Peng2016-12-281-7/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | Discovery Rate (FDR) and Family wise error rate (FWE) ## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are a popular univariate statistical test for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests. https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? ut will be added soon (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Peng <peng.meng@intel.com> Author: Peng, Meng <peng.meng@intel.com> Closes #15212 from mpjlu/fdr_fwe.
* [SPARK-18628][ML] Update Scala param and Python param to have quoteskrishnakalyan32016-12-111-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Updated Scala param and Python param to have quotes around the options making it easier for users to read. ## How was this patch tested? Manually checked the docstrings Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #16242 from krishnakalyan3/doc-string.
* [SPARK-18366][PYSPARK][ML] Add handleInvalid to Pyspark for ↵Sandeep Singh2016-11-301-14/+71
| | | | | | | | | | | | | | | QuantileDiscretizer and Bucketizer ## What changes were proposed in this pull request? added the new handleInvalid param for these transformers to Python to maintain API parity. ## How was this patch tested? existing tests testing is done with new doctests Author: Sandeep Singh <sandeep@techaddict.me> Closes #15817 from techaddict/SPARK-18366.
* [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, ↵Yuhao2016-11-291-4/+0
| | | | | | | | | | | | | | | | | sealed audit ## What changes were proposed in this pull request? make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs. Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319 ## How was this patch tested? existing ut Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #15972 from hhbyyh/experimental21.
* [SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across ↵hyukjinkwon2016-11-221-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Python API documentation ## What changes were proposed in this pull request? It seems in Python, there are - `Note:` - `NOTE:` - `Note that` - `.. note::` This PR proposes to fix those to `.. note::` to be consistent. **Before** <img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png"> <img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png"> **After** <img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png"> <img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png"> ## How was this patch tested? The notes were found via ```bash grep -r "Note: " . grep -r "NOTE: " . grep -r "Note that " . ``` And then fixed one by one comparing with API documentation. After that, manually tested via `make html` under `./python/docs`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15947 from HyukjinKwon/SPARK-18447.
* [SPARK-18088][ML] Various ChiSqSelector cleanupsJoseph K. Bradley2016-11-011-18/+19
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? - Renamed kbest to numTopFeatures - Renamed alpha to fpr - Added missing Since annotations - Doc cleanups ## How was this patch tested? Added new standardized unit tests for spark.ml. Improved existing unit test coverage a bit. Author: Joseph K. Bradley <joseph@databricks.com> Closes #15647 from jkbradley/chisqselector-follow-ups.
* [SPARK-17219][ML] enhanced NaN value handling in BucketizerVinceShieh2016-10-271-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR is an enhancement of PR with commit ID:57dc326bd00cf0a49da971e9c573c48ae28acaa2. NaN is a special type of value which is commonly seen as invalid. But We find that there are certain cases where NaN are also valuable, thus need special handling. We provided user when dealing NaN values with 3 options, to either reserve an extra bucket for NaN values, or remove the NaN values, or report an error, by setting handleNaN "keep", "skip", or "error"(default) respectively. '''Before: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) '''After: val bucketizer: Bucketizer = new Bucketizer() .setInputCol("feature") .setOutputCol("result") .setSplits(splits) .setHandleNaN("keep") ## How was this patch tested? Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Author: Vincent Xie <vincent.xie@intel.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #15428 from VinceShieh/spark-17219_followup.
* [SPARK-17870][MLLIB][ML] Change statistic to pValue for SelectKBest and ↵Peng2016-10-141-2/+2
| | | | | | | | | | | | | | | | | SelectPercentile because of DoF difference ## What changes were proposed in this pull request? For feature selection method ChiSquareSelector, it is based on the ChiSquareTestResult.statistic (ChiSqure value) to select the features. It select the features with the largest ChiSqure value. But the Degree of Freedom (df) of ChiSqure value is different in Statistics.chiSqTest(RDD), and for different df, you cannot base on ChiSqure value to select features. So we change statistic to pValue for SelectKBest and SelectPercentile ## How was this patch tested? change existing test Author: Peng <peng.meng@intel.com> Closes #15444 from mpjlu/chisqure-bug.
* [SPARK-15957][FOLLOW-UP][ML][PYSPARK] Add Python API for RFormula ↵Yanbo Liang2016-10-131-4/+27
| | | | | | | | | | | | | | forceIndexLabel. ## What changes were proposed in this pull request? Follow-up work of #13675, add Python API for ```RFormula forceIndexLabel```. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15430 from yanboliang/spark-15957-python.
* [SPARK-17704][ML][MLLIB] ChiSqSelector performance improvement.Sean Owen2016-10-011-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Partial revert of #15277 to instead sort and store input to model rather than require sorted input ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15299 from srowen/SPARK-17704.2.
* [SPARK-17017][FOLLOW-UP][ML] Refactor of ChiSqSelector and add ML Python API.Yanbo Liang2016-09-261-6/+65
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? #14597 modified ```ChiSqSelector``` to support ```fpr``` type selector, however, it left some issue need to be addressed: * We should allow users to set selector type explicitly rather than switching them by using different setting function, since the setting order will involves some unexpected issue. For example, if users both set ```numTopFeatures``` and ```percentile```, it will train ```kbest``` or ```percentile``` model based on the order of setting (the latter setting one will be trained). This make users confused, and we should allow users to set selector type explicitly. We handle similar issues at other place of ML code base such as ```GeneralizedLinearRegression``` and ```LogisticRegression```. * Meanwhile, if there are more than one parameter except ```alpha``` can be set for ```fpr``` model, we can not handle it elegantly in the existing framework. And similar issues for ```kbest``` and ```percentile``` model. Setting selector type explicitly can solve this issue also. * If setting selector type explicitly by users is allowed, we should handle param interaction such as if users set ```selectorType = percentile``` and ```alpha = 0.1```, we should notify users the parameter ```alpha``` will take no effect. We should handle complex parameter interaction checks at ```transformSchema```. (FYI #11620) * We should use lower case of the selector type names to follow MLlib convention. * Add ML Python API. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15214 from yanboliang/spark-17017.
* [SPARK-17219][ML] Add NaN value handling in BucketizerVinceShieh2016-09-211-0/+5
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception. Before: ``` Bucketizer.transform on NaN value threw an illegal exception. ``` After: ``` NaN values will be grouped in an extra bucket. ``` ## How was this patch tested? New test cases added in `BucketizerSuite`. Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #14858 from VinceShieh/spark-17219.
* [SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit ↵Joseph K. Bradley2016-07-131-64/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | for ML ## What changes were proposed in this pull request? General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml ** Annotate Estimator-Model pairs of classes and companion objects the same way. ** For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new: * Model Summary classes * MLWriter, MLReader, MLWritable, MLReadable * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency. * RFormula: Its behavior may need to change slightly to match R in edge cases. * AFTSurvivalRegression * MultilayerPerceptronClassifier DeveloperApi changes: * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi ## How was this patch tested? N/A Note to reviewers: * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental. * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14147 from jkbradley/experimental-audit.
* [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to NoneBryan Cutler2016-06-211-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Several places set the seed Param default value to None which will translate to a zero value on the Scala side. This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation. These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero. ## How was this patch tested? Ran PySpark tests locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
* [SPARK-10258][DOC][ML] Add @Since annotations to ml.featureNick Pentreath2016-06-211-5/+5
| | | | | | | | | | | | | | This PR adds missing `Since` annotations to `ml.feature` package. Closes #8505. ## How was this patch tested? Existing tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13641 from MLnick/add-since-annotations.
* [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ↵Liang-Chi Hsieh2016-06-131-1/+1
| | | | | | | | | | | | | | | ml.Matrix under spark.ml.python ## What changes were proposed in this pull request? Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13219 from viirya/pyspark-pickler-ml.
* [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar ↵Bryan Cutler2016-06-101-0/+12
| | | | | | | | | | | | | | to Scala API ## What changes were proposed in this pull request? Adding __str__ to RFormula and model that will show the set formula param and resolved formula. This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review. ## How was this patch tested? run pyspark-ml tests locally Author: Bryan Cutler <cutlerb@gmail.com> Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
* [SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameterWeichenXu2016-06-101-5/+24
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Word2vec python add maxsentence parameter. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.
* [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" propertyJeff Zhang2016-06-091-0/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? add method idf to IDF in pyspark ## How was this patch tested? add unit test Author: Jeff Zhang <zjffdu@apache.org> Closes #13540 from zjffdu/SPARK-15788.
* [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.featureYanbo Liang2016-06-011-16/+13
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include: * Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless. * Scala API docs update. * Sync Scala and Python API docs for these changes. ## How was this patch tested? Exist tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13410 from yanboliang/spark-15587.
* [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark ↵Nick Pentreath2016-05-241-15/+36
| | | | | | | | | | | | | | | | | | QuantileDiscretizer This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`). Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`. ## How was this patch tested? A little doctest and built API docs locally to check HTML doc generation. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
* [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext ↵WeichenXu2016-05-231-32/+34
| | | | | | | | | | | | | | | | with SparkSession using builder pattern in python test code ## What changes were proposed in this pull request? Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code. ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
* [DOC][MINOR] ml.feature Scala and Python API syncBryan Cutler2016-05-191-13/+26
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? I reviewed Scala and Python APIs for ml.feature and corrected discrepancies. ## How was this patch tested? Built docs locally, ran style checks Author: Bryan Cutler <cutlerb@gmail.com> Closes #13159 from BryanCutler/ml.feature-api-sync.
* [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based ↵DB Tsai2016-05-171-13/+13
| | | | | | | | | | | | | | | | | | algorithms ## What changes were proposed in this pull request? Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Xiangrui Meng <meng@databricks.com> Closes #12627 from dbtsai/SPARK-14615-NewML.
* [SPARK-15136][PYSPARK][DOC] Fix links to sphinx style and add a default ↵Holden Karau2016-05-091-7/+6
| | | | | | | | | | | | | | | | param doc note ## What changes were proposed in this pull request? PyDoc links in ml are in non-standard format. Switch to standard sphinx link format for better formatted documentation. Also add a note about default value in one place. Copy some extended docs from scala for GBT ## How was this patch tested? Built docs locally. Author: Holden Karau <holden@us.ibm.com> Closes #12918 from holdenk/SPARK-15137-linkify-pyspark-ml-classification.
* [SPARK-14050][ML] Add multiple languages support and additional methods for ↵Burak Köse2016-05-061-16/+22
| | | | | | | | | | | | | | | | | | | | | | | | | Stop Words Remover ## What changes were proposed in this pull request? This PR continues the work from #11871 with the following changes: * load English stopwords as default * covert stopwords to list in Python * update some tests and doc ## How was this patch tested? Unit tests. Closes #11871 cc: burakkose srowen Author: Burak Köse <burakks41@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Author: Burak KOSE <burakks41@gmail.com> Closes #12843 from mengxr/SPARK-14050.
* [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean upYanbo Liang2016-05-031-74/+37
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? PySpark ML Params setter code clean up. For examples, ```setInputCol``` can be simplified from ``` self._set(inputCol=value) return self ``` to: ``` return self._set(inputCol=value) ``` This is a pretty big sweeps, and we cleaned wherever possible. ## How was this patch tested? Exist unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #12749 from yanboliang/spark-14971.
* [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in ↵Xusen Yin2016-05-011-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Spark and PySpark - update ## What changes were proposed in this pull request? This PR is an update for [https://github.com/apache/spark/pull/12738] which: * Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side * Various fixes for bugs found * This includes changing classes taking weightCol to treat unset and empty String Param values the same way. Defaults changed: * Scala * LogisticRegression: weightCol defaults to not set (instead of empty string) * StringIndexer: labels default to not set (instead of empty array) * GeneralizedLinearRegression: * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver) * weightCol defaults to not set (instead of empty string) * LinearRegression: weightCol defaults to not set (instead of empty string) * Python * MultilayerPerceptron: layers default to not set (instead of [1,1]) * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set) ## How was this patch tested? Generic unit test. Manually tested that unit test by changing defaults and verifying that broke the test. Author: Joseph K. Bradley <joseph@databricks.com> Author: yinxusen <yinxusen@gmail.com> Closes #12816 from jkbradley/yinxusen-SPARK-14931.
* [SPARK-13289][MLLIB] Fix infinite distances between word vectors in ↵Junyang2016-04-301-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Word2VecModel ## What changes were proposed in this pull request? This PR fixes the bug that generates infinite distances between word vectors. For example, Before this PR, we have ``` val synonyms = model.findSynonyms("who", 40) ``` will give the following results: ``` to Infinity and Infinity that Infinity with Infinity ``` With this PR, the distance between words is a value between 0 and 1, as follows: ``` scala> model.findSynonyms("who", 10) res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006)) scala> model.findSynonyms("king", 10) res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043)) scala> model.findSynonyms("queen", 10) res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199)) ``` ### There are two places changed in this PR: - Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once. - Scale the learning rate by number of iteration, to be consistent with Google Word2Vec implementation ## How was this patch tested? Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors. Author: Junyang <fly.shenjy@gmail.com> Author: flyskyfly <fly.shenjy@gmail.com> Closes #11812 from flyjy/TVec.
* [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg optionYanbo Liang2016-04-271-30/+11
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone. ## How was this patch tested? Unit tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12702 from yanboliang/spark-14899.
* [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3Yanbo Liang2016-04-251-8/+31
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method. Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work. ## How was this patch tested? unit tests. cc jkbradley MLnick Author: Yanbo Liang <ybliang8@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #12498 from yanboliang/spark-10574.
* [MINOR][ML][PYSPARK] Fix omissive param setters which should use _set methodYanbo Liang2016-04-201-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? #11939 make Python param setters use the `_set` method. This PR fix omissive ones. ## How was this patch tested? Existing tests. cc jkbradley sethah Author: Yanbo Liang <ybliang8@gmail.com> Closes #12531 from yanboliang/setters-omissive.
* [SPARK-14555] First cut of Python API for Structured StreamingBurak Yavuz2016-04-201-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes: - ContinuousQuery - Trigger - ProcessingTime in pyspark under `pyspark.sql.streaming`. In addition, it contains the new methods added under: - `DataFrameWriter` a) `startStream` b) `trigger` c) `queryName` - `DataFrameReader` a) `stream` - `DataFrame` a) `isStreaming` This PR doesn't contain all methods exposed for `ContinuousQuery`, for example: - `exception` - `sourceStatuses` - `sinkStatus` They may be added in a follow up. This PR also contains some very minor doc fixes in the Scala side. ## How was this patch tested? Python doc tests TODO: - [ ] verify Python docs look good Author: Burak Yavuz <brkyvz@gmail.com> Author: Burak Yavuz <burak@databricks.com> Closes #12320 from brkyvz/stream-python.
* [SPARK-14714][ML][PYTHON] Fixed issues with non-kwarg typeConverter arg for ↵Joseph K. Bradley2016-04-181-7/+10
| | | | | | | | | | | | | | | | | | | | | | | | | Param constructor ## What changes were proposed in this pull request? PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly. This PR changes all usages in pyspark/ml/ to keyword args. ## How was this patch tested? Existing unit tests. I will not test type conversion for every Param unless we really think it necessary. Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning: ``` /Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument. "Use typeConverter instead, as a keyword argument.") ``` That warning came from the typeConverter argument being passes as the expectedType arg by mistake. Author: Joseph K. Bradley <joseph@databricks.com> Closes #12480 from jkbradley/typeconverter-fix.
* [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize methodJason Lee2016-04-181-5/+23
| | | | | | | | | | | | ## What changes were proposed in this pull request? Added windowSize getter/setter to ML/MLlib ## How was this patch tested? Added test cases in tests.py under both ML and MLlib Author: Jason Lee <cjlee@us.ibm.com> Closes #12428 from jasoncl/SPARK-14564.
* [SPARK-14104][PYSPARK][ML] All Python param setters should use the `_set` methodsethah2016-04-151-35/+37
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens. Additional changes: * [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here * An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here. ## How was this patch tested? Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR. Author: sethah <seth.hendrickson16@gmail.com> Closes #11939 from sethah/SPARK-14104.
* [SPARK-14665][ML][PYTHON] Fixed bug with StopWordsRemover default stopwordsJoseph K. Bradley2016-04-151-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? The default stopwords were a Java object. They are no longer. ## How was this patch tested? Unit test which failed before the fix Author: Joseph K. Bradley <joseph@databricks.com> Closes #12422 from jkbradley/pyspark-stopwords.
* [SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark ↵Yong Tang2016-04-141-2/+22
| | | | | | | | | | | | | | | | | | HashingTF in ML & MLlib ## What changes were proposed in this pull request? This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1. Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done. ## How was this patch tested? This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib. Author: Yong Tang <yong.tang.github@outlook.com> Closes #12079 from yongtang/SPARK-14238.
* [SPARK-13967][PYSPARK][ML] Added binary Param to Python CountVectorizerBryan Cutler2016-04-141-5/+29
| | | | | | | | | | Added binary toggle param to CountVectorizer feature transformer in PySpark. Created a unit test for using CountVectorizer with the binary toggle on. Author: Bryan Cutler <cutlerb@gmail.com> Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
* [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark paramssethah2016-03-231-40/+55
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type. This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira. ## How was this patch tested? Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided. Author: sethah <seth.hendrickson16@gmail.com> Closes #11663 from sethah/SPARK-13068-tc.
* [SPARK-13951][ML][PYTHON] Nested Pipeline persistenceJoseph K. Bradley2016-03-221-43/+46
| | | | | | | | | | | | | | | Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations. Also: * Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader. * Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java Added new unit test for nested Pipelines. Abstracted validity check into a helper method for the 2 unit tests. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11866 from jkbradley/nested-pipeline-io. Closes #11835
* [SPARK-13993][PYSPARK] Add pyspark Rformula/RforumlaModel save/loadXusen Yin2016-03-201-3/+27
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13993 ## How was this patch tested? doctest Author: Xusen Yin <yinxusen@gmail.com> Closes #11807 from yinxusen/SPARK-13993.
* [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.pyXusen Yin2016-03-041-45/+296
| | | | | | | | | | Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`. In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`. Author: Xusen Yin <yinxusen@gmail.com> Closes #11203 from yinxusen/SPARK-13036.