path: root/mllib
Commit message | Author | Age | Files | Lines
* Preparing Spark release v1.4.0-rc1Patrick Wendell2015-05-191-1/+1
|
* Revert "Preparing Spark release v1.4.0-rc1"Patrick Wendell2015-05-191-1/+1
| | | | This reverts commit 79fb01a3be07b5086134a6fe103248e9a33a9500.
* Revert "Preparing development version 1.4.1-SNAPSHOT"Patrick Wendell2015-05-191-1/+1
| | | | This reverts commit a1d896b85bd3fb88284f8b6758d7e5f0a1bb9eb3.
* Preparing development version 1.4.1-SNAPSHOTPatrick Wendell2015-05-191-1/+1
|
* Preparing Spark release v1.4.0-rc1Patrick Wendell2015-05-191-1/+1
|
* Revert "Preparing Spark release v1.4.0-rc1"Patrick Wendell2015-05-191-1/+1
| | | | This reverts commit 38ccef36c1551dc36d9444f47df11ae34c1e139e.
* Revert "Preparing development version 1.4.1-SNAPSHOT"Patrick Wendell2015-05-191-1/+1
| | | | This reverts commit 40190ce22622cadd41f740a763fba061281c2966.
* [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansionXusen Yin2015-05-191-0/+91
| | | | | | | | | | | | | | | | | | | | | JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581). CC jkbradley Author: Xusen Yin <yinxusen@gmail.com> Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits: 1a7d80d [Xusen Yin] merge with master 892a8e9 [Xusen Yin] fix python 3 compatibility ec935bf [Xusen Yin] small fix 3e9fa1d [Xusen Yin] delete note 69fcf85 [Xusen Yin] simplify and add python example 81d21dc [Xusen Yin] add programming guide for Polynomial Expansion 40babfb [Xusen Yin] add java test suite for PolynomialExpansion (cherry picked from commit 6008ec14ed6491d0a854bb50548c46f2f9709269) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* Preparing development version 1.4.1-SNAPSHOTPatrick Wendell2015-05-191-1/+1
|
* Preparing Spark release v1.4.0-rc1Patrick Wendell2015-05-191-1/+1
|
* Revert "Preparing Spark release v1.4.0-rc1"Patrick Wendell2015-05-181-1/+1
| | | | This reverts commit e8e97e3a630dea3c68702e26bc56f61044b2db71.
* Revert "Preparing development version 1.4.1-SNAPSHOT"Patrick Wendell2015-05-181-1/+1
| | | | This reverts commit 758ca74bab7c342f94442f69476c6b9543ac1228.
* Preparing development version 1.4.1-SNAPSHOTPatrick Wendell2015-05-191-1/+1
|
* Preparing Spark release v1.4.0-rc1Patrick Wendell2015-05-191-1/+1
|
* [SPARK-7681] [MLLIB] Add SparseVector support for gemvLiang-Chi Hsieh2015-05-183-31/+224
| | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-7681 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6209 from viirya/sparsevector_gemv and squashes the following commits: ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y. b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector. 57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4. 458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too. 054f05d [Liang-Chi Hsieh] Fix scala style. 410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized. 4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix. 5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix. (cherry picked from commit d03638cc2d414cee9ac7481084672e454495dfc1) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python | Xiangrui Meng | 2015-05-18 | 3 | -7/+8
    This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

    1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
    2. Accept a list of param maps in `fit`.
    3. Use parent uid and name to identify param.

    jkbradley

    Author: Xiangrui Meng <meng@databricks.com>
    Author: Joseph K. Bradley <joseph@databricks.com>

    Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

    413c463 [Xiangrui Meng] remove unnecessary doc
    4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
    611c719 [Xiangrui Meng] fix python style
    68862b8 [Xiangrui Meng] update _java_obj initialization
    927ad19 [Xiangrui Meng] fix ml/tests.py
    0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
    9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
    c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
    7e0d27f [Xiangrui Meng] merge master
    46840fb [Xiangrui Meng] update wrappers
    b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
    46cb6ed [Xiangrui Meng] merge master
    a163413 [Xiangrui Meng] fix style
    1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
    9630eae [Xiangrui Meng] fix Identifiable._randomUID
    13bd70a [Xiangrui Meng] update ml/tests.py
    64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
    02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
    66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
    7431272 [Joseph K. Bradley] Rebased with master

    (cherry picked from commit 9c7e802a5a2b8cd3eb77642f84c54a8e976fc996)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model | Shuo Xiang | 2015-05-17 | 1 | -1/+1
    The `toString` method of `LogisticRegressionModel` calls the `get` method on an Option (threshold) without a safeguard. In spark-shell, the following line in the lbfgs code, `val model = algorithm.run(data).clearThreshold()`, will fail because `toString` is called right after `clearThreshold()` to show the result in the REPL.

    Author: Shuo Xiang <shuoxiangpub@gmail.com>

    Closes #6224 from coderxiang/getorelse and squashes the following commits:

    d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
    5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
    c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
    98804c9 [Shuo Xiang] fix bug in topBykey and update test

    (cherry picked from commit 775e6f9909d4495cbc11c377508b43482d782742)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
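The failure mode described for SPARK-7694 (formatting an optional threshold that has been cleared) can be illustrated outside Spark. The following is only a minimal Python sketch, not the Spark code: the class `ThresholdedModel` and its method names are hypothetical, and `None` stands in for Scala's empty `Option`.

```python
from typing import Optional


class ThresholdedModel:
    """Hypothetical stand-in for a classifier with an optional decision threshold."""

    def __init__(self, threshold: Optional[float] = 0.5) -> None:
        self.threshold = threshold

    def clear_threshold(self) -> "ThresholdedModel":
        # After this call there is no threshold to format.
        self.threshold = None
        return self

    def __repr__(self) -> str:
        # Fall back to a default string instead of formatting a missing value,
        # which is the same idea as using getOrElse instead of get.
        shown = "None" if self.threshold is None else f"{self.threshold:.2f}"
        return f"ThresholdedModel(threshold={shown})"
```

An interactive shell calls `repr` on the result of `ThresholdedModel().clear_threshold()` right away, so the fallback is what keeps the one-liner from raising.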
* [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API.Reynold Xin2015-05-1611-14/+14
| | | | | | | | | | | Author: Reynold Xin <rxin@databricks.com> Closes #6211 from rxin/mllib-reader and squashes the following commits: 79a2cb9 [Reynold Xin] [SPARK-7654][MLlib] Migrate MLlib to the DataFrame reader/writer API. (cherry picked from commit 161d0b4a41f453b21adde46a86e16c2743752799) Signed-off-by: Reynold Xin <rxin@databricks.com>
* [SPARK-7473] [MLLIB] Add reservoir sample in RandomForest | AiHe | 2015-05-15 | 2 | -4/+3
    Sample features with reservoir sampling, using the existing API.

    Author: AiHe <ai.he@ussuning.com>

    Closes #5988 from AiHe/reservoir and squashes the following commits:

    e7a41ac [AiHe] remove non-robust testing case
    28ffb9a [AiHe] set seed as rng.nextLong
    37459e1 [AiHe] set fixed seed
    1e98a4c [AiHe] [MLLIB][tree] Add reservoir sample in RandomForest

    (cherry picked from commit deb411335a09b91eb1f75421d77e1c3686719621)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
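For readers unfamiliar with the technique named in SPARK-7473, here is a minimal Python sketch of classic reservoir sampling (Algorithm R). It is an illustration only and is not the Spark helper the commit reuses; the function name `reservoir_sample` and its signature are assumptions made for this example.

```python
import random
from typing import Iterable, List, TypeVar

T = TypeVar("T")


def reservoir_sample(items: Iterable[T], k: int, seed: int) -> List[T]:
    """Pick k items uniformly at random from a stream in a single pass."""
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(items):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir
```

Passing an explicit seed keeps the sample reproducible, which is the concern behind the "set fixed seed" and "set seed as rng.nextLong" commits listed above.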
* [SPARK-7668] [MLLIB] Preserve isTransposed property for Matrix after calling map function | Liang-Chi Hsieh | 2015-05-15 | 1 | -2/+3
    JIRA: https://issues.apache.org/jira/browse/SPARK-7668

    Author: Liang-Chi Hsieh <viirya@gmail.com>

    Closes #6188 from viirya/fix_matrix_map and squashes the following commits:

    2a7cc97 [Liang-Chi Hsieh] Preserve isTransposed property for Matrix after calling map function.

    (cherry picked from commit f96b85ab44b82736363764ea39ee62884007f4a3)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-6258] [MLLIB] GaussianMixture Python API parity check | Yanbo Liang | 2015-05-15 | 2 | -11/+22
    Implement Python API for major disparities of GaussianMixture cluster algorithm between Scala & Python

    ```scala
    GaussianMixture
        setInitialModel
    GaussianMixtureModel
        k
    ```

    Author: Yanbo Liang <ybliang8@gmail.com>

    Closes #6087 from yanboliang/spark-6258 and squashes the following commits:

    b3af21c [Yanbo Liang] fix typo
    2b645c1 [Yanbo Liang] fix doc
    638b4b7 [Yanbo Liang] address comments
    b5bcade [Yanbo Liang] GaussianMixture Python API parity check

    (cherry picked from commit 94761485b207fa1f12a8410a68920300d851bf61)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-7407] [MLLIB] use uid + name to identify parameters | Xiangrui Meng | 2015-05-14 | 45 | -198/+413
    A param instance is strongly attached to a parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I kept the constructors for Param in this PR to reduce the size of the diff and made `parent` a mutable field.

    This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.

    jkbradley

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:

    c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
    520f0a2 [Xiangrui Meng] address comments
    2569168 [Xiangrui Meng] fix tests
    873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
    409ea08 [Xiangrui Meng] minor updates
    83a163c [Xiangrui Meng] update JavaDeveloperApiExample
    5db5325 [Xiangrui Meng] update OneVsRest
    7bde7ae [Xiangrui Meng] merge master
    697fdf9 [Xiangrui Meng] update Bucketizer
    7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
    629d402 [Xiangrui Meng] fix LRSuite
    154516f [Xiangrui Meng] merge master
    aa4a611 [Xiangrui Meng] fix examples/compile
    a4794dd [Xiangrui Meng] change Param to use to reduce the size of diff
    fdbc415 [Xiangrui Meng] all tests passed
    c255f17 [Xiangrui Meng] fix tests in ParamsSuite
    818e1db [Xiangrui Meng] merge master
    e1160cf [Xiangrui Meng] fix tests
    fbc39f0 [Xiangrui Meng] pass test:compile
    108937e [Xiangrui Meng] pass compile
    8726d39 [Xiangrui Meng] use parent uid in Param
    eaeed35 [Xiangrui Meng] update Identifiable

    (cherry picked from commit 1b8625f4258d6d1a049d0ba60e39e9757f5a568b)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
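The idea in SPARK-7407 (identify a param by its parent's UID plus its name, and preserve the UID when copying) can be sketched in a few lines. This is a hedged Python illustration, not the spark.ml implementation: `Param`, `Stage`, `max_iter`, and the UID format are all hypothetical names chosen for the example.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class Param:
    """A param is identified by (parent UID, name), not by object identity."""
    parent_uid: str
    name: str


class Stage:
    """Hypothetical pipeline stage with a single param."""

    def __init__(self, uid: Optional[str] = None) -> None:
        # Default constructor generates a UID; a copy passes the original UID in.
        self.uid = uid or f"stage_{uuid.uuid4().hex[:12]}"
        self.max_iter = Param(self.uid, "maxIter")
        self._param_map: Dict[Param, Any] = {}

    def set(self, param: Param, value: Any) -> "Stage":
        if param.parent_uid != self.uid:
            raise ValueError(f"{param} does not belong to {self.uid}")
        self._param_map[param] = value
        return self

    def get(self, param: Param) -> Any:
        return self._param_map[param]

    def copy(self) -> "Stage":
        # The UID is preserved, so params of the original resolve on the copy.
        that = Stage(self.uid)
        that._param_map = dict(self._param_map)
        return that
```

Because `Param` compares by value, a param object taken from the original stage is a valid key on the copy, so no per-instance remapping is needed when a meta-algorithm clones a stage.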
* [SPARK-7620] [ML] [MLLIB] Removed calling size, length in while condition to avoid extra JVM call | DB Tsai | 2015-05-13 | 13 | -44/+73
    Author: DB Tsai <dbt@netflix.com>

    Closes #6137 from dbtsai/clean and squashes the following commits:

    185816d [DB Tsai] fix compilication issue
    f418d08 [DB Tsai] first commit

    (cherry picked from commit d3db2fd66752e80865e9c7a75d8e8d945121697e)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7612] [MLLIB] update NB training to use mllib's BLASXiangrui Meng2015-05-131-23/+20
| | | | | | | | | | | | | | This is similar to the changes to k-means, which gives us better control on the performance. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #6128 from mengxr/SPARK-7612 and squashes the following commits: b5c24c5 [Xiangrui Meng] merge master a90e3ec [Xiangrui Meng] update NB training to use mllib's BLAS (cherry picked from commit d5f18de1657bfabf5493011e0b2c7ec29c02c64c) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that both training and predict features have values of 0 or 1 | leahmcguire | 2015-05-13 | 2 | -3/+58
    Author: leahmcguire <lmcguire@salesforce.com>

    Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits:

    b8442c2 [leahmcguire] changed to if else for value checks
    911bf83 [leahmcguire] undid reformat
    4eedf1e [leahmcguire] moved bernoulli check
    9ee9e84 [leahmcguire] fixed style error
    3f3b32c [leahmcguire] fixed zero one check so only called in combiner
    831fd27 [leahmcguire] got test working
    f44bb3c [leahmcguire] removed changes from CV branch
    67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one
    f191c71 [leahmcguire] fixed name
    58d060b [leahmcguire] changed param name and test according to comments
    04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access

    (cherry picked from commit 61e05fc58e1245de871c409b60951745b5db3420)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
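The kind of check SPARK-7545 describes (Bernoulli naive Bayes only accepts feature values of 0 or 1, at both training and prediction time) might look like the following. This is a Python sketch under assumptions: the function name `require_zero_one` and the error text are made up for illustration and are not taken from the Spark source.

```python
from typing import Sequence


def require_zero_one(features: Sequence[float]) -> None:
    """Raise if any feature value is something other than 0 or 1."""
    for value in features:
        if value != 0.0 and value != 1.0:
            raise ValueError(
                f"Bernoulli naive Bayes requires 0 or 1 feature values but found {value}."
            )
```

The same function would be applied to each training vector and to each vector passed to prediction, so that invalid input fails loudly instead of silently skewing the model.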
* [SPARK-7593] [ML] Python Api for ml.feature.BucketizerBurak Yavuz2015-05-132-2/+15
| | | | | | | | | | | | | | | | Added `ml.feature.Bucketizer` to PySpark. cc mengxr Author: Burak Yavuz <brkyvz@gmail.com> Closes #6124 from brkyvz/ml-bucket and squashes the following commits: 05285be [Burak Yavuz] added sphinx doc 6abb6ed [Burak Yavuz] added support for Bucketizer (cherry picked from commit 5db18ba6e1bd8c6307c41549176c53590cf344a0) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7528] [MLLIB] make RankingMetrics Java-friendlyXiangrui Meng2015-05-122-4/+87
| | | | | | | | | | | | | `RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #6098 from mengxr/SPARK-7528 and squashes the following commits: e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly (cherry picked from commit 2713bc65af1e0e81edd5fad0338e34fd127391f9) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7573] [ML] OneVsRest cleanups | Joseph K. Bradley | 2015-05-12 | 3 | -31/+23
    Minor cleanups discussed with [~mengxr]:

    * move OneVsRest from reduction to classification sub-package
    * make model constructor private

    Some doc cleanups too

    CC: harsha2010 Could you please verify this looks OK? Thanks!

    Author: Joseph K. Bradley <joseph@databricks.com>

    Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:

    4ecd48d [Joseph K. Bradley] org imports
    430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification. small java doc style fixes
    9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest. Made model constructor private to ml package.

    (cherry picked from commit 96c4846db89802f5a81dca5dcfa3f2a0f72b5cb8)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, TokenizerJoseph K. Bradley2015-05-121-0/+81
| | | | | | | | | | | | | | | | | | | | | Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer. Added JavaHashingTFSuite to test Java examples in new guide. I've run Scala, Python examples in the Spark/PySpark shells. I ran the Java examples via the test suite (with small modifications for printing). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits: d5d213f [Joseph K. Bradley] small fix dd6e91a [Joseph K. Bradley] fixes from code review of user guide 33c3ff9 [Joseph K. Bradley] small fix bc6058c [Joseph K. Bradley] fix link 361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer. Added JavaHashingTFSuite to test Java examples in new guide (cherry picked from commit f0c1bc3472a7422ae5649634f29c88e161f5ecaf) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7571] [MLLIB] rename Math to math | Xiangrui Meng | 2015-05-12 | 9 | -15/+15
    `scala.Math` has been deprecated since 2.8. This PR only touches `Math` usages in MLlib.

    dbtsai

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:

    fe8f8d3 [Xiangrui Meng] Math -> math

    (cherry picked from commit a4874b0d1820efd24071108434a4d89429473fe3)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket. | Xiangrui Meng | 2015-05-12 | 2 | -39/+41
    We give +inf special treatment in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 into bad, neutral, and good with splits 0, 4, 6, 10. It would read oddly if users had to put 0, 4, 6, 10.1 (or 11). This also updates the impl to use `Arrays.binarySearch` and `withClue` in the test.

    yinxusen jkbradley

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:

    e28f910 [Xiangrui Meng] update bucketizer impl

    (cherry picked from commit 23b9863e2aa7ecd0c4fa3aa8a59fdae09b4fe1d7)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
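The bucket semantics described for SPARK-7559 (half-open buckets, except that the last bucket also includes the largest split) can be sketched with a binary search. This is an illustrative Python version, not the Scala implementation; the function name `bucket_index` is hypothetical, and `bisect_right` plays the role that `Arrays.binarySearch` plays in the commit.

```python
from bisect import bisect_right
from typing import Sequence


def bucket_index(splits: Sequence[float], value: float) -> int:
    """Map value to a bucket: [x1, x2), [x2, x3), ..., [x(n-1), xn]."""
    # The chained comparison is False for NaN, so NaN is rejected here too.
    if not (splits[0] <= value <= splits[-1]):
        raise ValueError(f"{value} is outside [{splits[0]}, {splits[-1]}]")
    if value == splits[-1]:
        # The right-most boundary belongs to the last bucket.
        return len(splits) - 2
    # Number of splits <= value, minus one, is the bucket whose left edge
    # is the largest split not exceeding value.
    return bisect_right(splits, value) - 1


# The ratings example from the commit message: splits 0, 4, 6, 10.
RATING_SPLITS = [0.0, 4.0, 6.0, 10.0]
RATING_LABELS = ["bad", "neutral", "good"]
```

With these splits a rating of exactly 10 lands in "good" without the user having to write 10.1 or 11 as the upper split.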
* [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All | Ram Sriharsha | 2015-05-12 | 10 | -8/+471
    Initial cut of one against all. The test code is scaffolding, not fully implemented. This WIP is to gather early feedback.

    Author: Ram Sriharsha <rsriharsha@hw11853.local>

    Closes #5830 from harsha2010/reduction and squashes the following commits:

    5f4b495 [Ram Sriharsha] Fix Test
    386e98b [Ram Sriharsha] Style fix
    49b4a17 [Ram Sriharsha] Simplify the test
    02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
    bc78032 [Ram Sriharsha] Code Review Updates
    8ce4845 [Ram Sriharsha] Merge with Master
    2a807be [Ram Sriharsha] Merge branch 'master' into reduction
    e21bfcc [Ram Sriharsha] Style Fix
    5614f23 [Ram Sriharsha] Style Fix
    c75583a [Ram Sriharsha] Cleanup
    7a5f136 [Ram Sriharsha] Fix TODOs
    804826b [Ram Sriharsha] Merge with Master
    1448a5f [Ram Sriharsha] Style Fix
    6e47807 [Ram Sriharsha] Style Fix
    d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
    ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
    78fa82a [Ram Sriharsha] extra line
    0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
    a59a4f4 [Ram Sriharsha] @Experimental
    4167234 [Ram Sriharsha] Merge branch 'master' into reduction
    868a4fd [Ram Sriharsha] @Experimental
    041d905 [Ram Sriharsha] Code Review Fixes
    df188d8 [Ram Sriharsha] Style fix
    612ec48 [Ram Sriharsha] Style Fix
    6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
    6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
    e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
    1c7fa44 [Ram Sriharsha] Fix Tests
    ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
    221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
    26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
    9738744 [Ram Sriharsha] Merge branch 'master' into reduction
    1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
    32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
    ff272da [Ram Sriharsha] Style fix
    28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
    b60f874 [Ram Sriharsha] Fix Style issues in Test
    3191cdf [Ram Sriharsha] Remove this test, accidental commit
    23f056c [Ram Sriharsha] Fix Headers for test
    1b5e929 [Ram Sriharsha] Fix Style issues and add Header
    8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All

    (cherry picked from commit 595a67589a42f8025d3e5fd4da413b1faa2e14bf)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-7485] [BUILD] Remove pyspark files from assembly.Marcelo Vanzin2015-05-121-11/+0
| | | | | | | | | | | | | | | | | The sbt part of the build is hacky; it basically tricks sbt into generating the zip by using a generator, but returns an empty list for the generated files so that nothing is actually added to the assembly. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #6022 from vanzin/SPARK-7485 and squashes the following commits: 22c1e04 [Marcelo Vanzin] Remove unneeded code. 4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly. (cherry picked from commit 82e890fb19d6fbaffa69856eecb4699f2f8a81eb) Signed-off-by: Andrew Or <andrew@databricks.com>
* [SPARK-5893] [ML] Add bucketizer | Xusen Yin | 2015-05-11 | 3 | -0/+290
    JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).

    One thing to make clear: the `buckets` parameter, which is an array of `Double`, acts as the split points. For example,

    ```scala
    buckets = Array(-0.5, 0.0, 0.5)
    ```

    splits the real line into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which are encoded as 0, 1, 2, 3.

    Author: Xusen Yin <yinxusen@gmail.com>
    Author: Joseph K. Bradley <joseph@databricks.com>

    Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:

    dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
    1ca973a [Joseph K. Bradley] one more bucketizer test
    34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
    eacfcfa [Xusen Yin] change ML attribute from splits into buckets
    c3cc770 [Xusen Yin] add more unit test for binary search
    3a16cc2 [Xusen Yin] refine comments and names
    ac77859 [Xusen Yin] fix style error
    fb30d79 [Xusen Yin] fix and test binary search
    2466322 [Xusen Yin] refactor Bucketizer
    11fb00a [Xusen Yin] change it into an Estimator
    998bc87 [Xusen Yin] check buckets
    4024cf1 [Xusen Yin] add test suite
    5fe190e [Xusen Yin] add bucketizer

    (cherry picked from commit 35fb42a0b01d3043b7d5e27256d1b45a08583aab)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
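The encoding described in the SPARK-5893 message (split points -0.5, 0.0, 0.5 giving the ranges (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf) encoded as 0 to 3) corresponds to a left-biased binary search. A minimal Python sketch, assuming only the semantics stated in that message; the function name `encode_bucket` is hypothetical and this is not the Scala code from the PR.

```python
from bisect import bisect_left
from typing import Sequence


def encode_bucket(buckets: Sequence[float], value: float) -> int:
    """Index of the right-closed range (lower, upper] that contains value."""
    # bisect_left counts the split points strictly less than value, so a value
    # equal to a split point stays in the range that ends at that point.
    return bisect_left(buckets, value)


# The split points from the commit message.
EXAMPLE_BUCKETS = [-0.5, 0.0, 0.5]
```

Using `bisect_left` rather than `bisect_right` is what makes each range closed on the right, matching the (lower, upper] intervals listed in the message.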
* [SPARK-6092] [MLLIB] Add RankingMetrics in PySpark/MLlibYanbo Liang2015-05-111-0/+10
| | | | | | | | | | | | Author: Yanbo Liang <ybliang8@gmail.com> Closes #6044 from yanboliang/spark-6092 and squashes the following commits: 726a9b1 [Yanbo Liang] add newRankingMetrics 33f649c [Yanbo Liang] Add RankingMetrics in PySpark/MLlib (cherry picked from commit 042dda3c5c25b5ecb6ae4fd37c85b211b01c187b) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5521] PCA wrapper for easy transform vectors | Kirill A. Korinskiy | 2015-05-10 | 2 | -0/+141
    I implemented a simple PCA wrapper for easy transformation of vectors by PCA, for example in a LabeledPoint or another complicated structure.

    Example of usage:

    ```
    import org.apache.spark.mllib.regression.LinearRegressionWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.feature.PCA

    val data = sc.textFile("data/mllib/ridge-data/lpsa.data").map { line =>
      val parts = line.split(',')
      LabeledPoint(parts(0).toDouble, Vectors.dense(parts(1).split(' ').map(_.toDouble)))
    }.cache()

    val splits = data.randomSplit(Array(0.6, 0.4), seed = 11L)
    val training = splits(0).cache()
    val test = splits(1)

    val pca = PCA.create(training.first().features.size/2, data.map(_.features))
    val training_pca = training.map(p => p.copy(features = pca.transform(p.features)))
    val test_pca = test.map(p => p.copy(features = pca.transform(p.features)))

    val numIterations = 100
    val model = LinearRegressionWithSGD.train(training, numIterations)
    val model_pca = LinearRegressionWithSGD.train(training_pca, numIterations)

    val valuesAndPreds = test.map { point =>
      val score = model.predict(point.features)
      (score, point.label)
    }

    val valuesAndPreds_pca = test_pca.map { point =>
      val score = model_pca.predict(point.features)
      (score, point.label)
    }

    val MSE = valuesAndPreds.map{case(v, p) => math.pow((v - p), 2)}.mean()
    val MSE_pca = valuesAndPreds_pca.map{case(v, p) => math.pow((v - p), 2)}.mean()

    println("Mean Squared Error = " + MSE)
    println("PCA Mean Squared Error = " + MSE_pca)
    ```

    Author: Kirill A. Korinskiy <catap@catap.ru>
    Author: Joseph K. Bradley <joseph@databricks.com>

    Closes #4304 from catap/pca and squashes the following commits:

    501bcd9 [Joseph K. Bradley] Small updates: removed k from Java-friendly PCA fit(). In PCASuite, converted results to set for comparison. Added an error message for bad k in PCA.
    9dcc02b [Kirill A. Korinskiy] [SPARK-5521] fix scala style
    1892a06 [Kirill A. Korinskiy] [SPARK-5521] PCA wrapper for easy transform vectors

    (cherry picked from commit 8c07c75c9831d6c34f69fe840edb6470d4dfdfef)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-6091] [MLLIB] Add MulticlassMetrics in PySpark/MLlibYanbo Liang2015-05-101-0/+8
| | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-6091 Author: Yanbo Liang <ybliang8@gmail.com> Closes #6011 from yanboliang/spark-6091 and squashes the following commits: bb3e4ba [Yanbo Liang] trigger jenkins 53c045d [Yanbo Liang] keep compatibility for python 2.6 972d5ac [Yanbo Liang] Add MulticlassMetrics in PySpark/MLlib (cherry picked from commit bf7e81a51cd81706570615cd67362c86602dec88) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7498] [ML] removed varargs annotation from Params.setDefaultsJoseph K. Bradley2015-05-082-2/+2
| | | | | | | | | | | | | | In SPARK-7429 and PR https://github.com/apache/spark/pull/5960, I added the varargs annotation to Params.setDefault which takes a variable number of ParamPairs. It worked locally and on Jenkins for me. However, mengxr reported issues compiling on his machine. So I'm reverting the change introduced in https://github.com/apache/spark/pull/5960 by removing varargs. Author: Joseph K. Bradley <joseph@databricks.com> Closes #6021 from jkbradley/revert-varargs and squashes the following commits: 098ed39 [Joseph K. Bradley] removed varargs annotation from Params.setDefaults taking multiple ParamPairs (cherry picked from commit 29926238418223b0888d418d163feebf0217b35e) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7262] [ML] Binary LogisticRegression with L1/L2 (elastic net) using OWLQN in new ML package | DB Tsai | 2015-05-08 | 5 | -40/+821
    1) Handle scaling and addBias internally.
    2) L1/L2 elasticnet using OWLQN optimizer.

    Author: DB Tsai <dbt@netflix.com>

    Closes #5967 from dbtsai/lor and squashes the following commits:

    fa029bb [DB Tsai] made the bound smaller
    0806002 [DB Tsai] better initial intercept and more test
    5c31824 [DB Tsai] fix import
    c387e25 [DB Tsai] Merge branch 'master' into lor
    c84e931 [DB Tsai] Made MultiClassSummarizer private
    f98e711 [DB Tsai] address feedback
    a784321 [DB Tsai] fix style
    8ec65d2 [DB Tsai] remove new line
    f3f8c88 [DB Tsai] add more tests and they match R which is good. fix a bug
    34705bc [DB Tsai] first commit

    (cherry picked from commit 86ef4cfd436867d88bdc211f76d6ea668d474558)
    Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7488] [ML] Feature Parity in PySpark for ml.recommendationBurak Yavuz2015-05-081-4/+8
| | | | | | | | | | | | | | | | | Adds Python Api for `ALS` under `ml.recommendation` in PySpark. Also adds seed as a settable parameter in the Scala Implementation of ALS. Author: Burak Yavuz <brkyvz@gmail.com> Closes #6015 from brkyvz/ml-rec and squashes the following commits: be6e931 [Burak Yavuz] addressed comments eaed879 [Burak Yavuz] readd numFeatures 0bd66b1 [Burak Yavuz] fixed seed 7f6d964 [Burak Yavuz] merged master 52e2bda [Burak Yavuz] added ALS (cherry picked from commit 84bf931f36edf1f319c9116f7f326959a6118991) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5913] [MLLIB] Python API for ChiSqSelectorYanbo Liang2015-05-081-0/+10
| | | | | | | | | | | | | | Add a Python API for mllib.feature.ChiSqSelector https://issues.apache.org/jira/browse/SPARK-5913 Author: Yanbo Liang <ybliang8@gmail.com> Closes #5939 from yanboliang/spark-5913 and squashes the following commits: cdaac99 [Yanbo Liang] Python API for ChiSqSelector (cherry picked from commit 35c9599b94de759204ed33cdd46d8ee108bccd86) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-7383] [ML] Feature Parity in PySpark for ml.featuresBurak Yavuz2015-05-082-2/+2
| | | | | | | | | | | | | | | | Implemented python wrappers for Scala functions that don't exist in `ml.features` Author: Burak Yavuz <brkyvz@gmail.com> Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits: adcca55 [Burak Yavuz] add regex tokenizer to __all__ b91cb44 [Burak Yavuz] addressed comments bd39fd2 [Burak Yavuz] remove addition b82bd7c [Burak Yavuz] Parity in PySpark for ml.features (cherry picked from commit f5ff4a84c4c75143086aae7d38730156bee35933) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7452] [MLLIB] fix bug in topBykey and update test | Shuo Xiang | 2015-05-07 | 2 | -5/+6
    The toArray function of the BoundedPriorityQueue does not necessarily preserve order. Add a counter-example as the test, which would fail the original impl.

    Author: Shuo Xiang <shuoxiangpub@gmail.com>

    Closes #5990 from coderxiang/topbykey-test and squashes the following commits:

    98804c9 [Shuo Xiang] fix bug in topBykey and update test

    (cherry picked from commit 92f8f803a68e0c16771e9793098c6d76dfdf99af)
    Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
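The pitfall behind SPARK-7452 is general to heap-backed priority queues: dumping the backing array yields the right elements but not in sorted order. Below is a Python sketch of a bounded top-k that sorts explicitly before returning. It uses the standard `heapq` module rather than Spark's BoundedPriorityQueue, and the function name `top_k_sorted` is an assumption for this example.

```python
import heapq
from typing import Iterable, List


def top_k_sorted(values: Iterable[float], k: int) -> List[float]:
    """Return the k largest values in descending order."""
    if k <= 0:
        return []
    heap: List[float] = []  # min-heap holding the k largest values seen so far
    for value in values:
        if len(heap) < k:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # Evict the smallest retained value in favor of the new one.
            heapq.heapreplace(heap, value)
    # The heap's list layout only guarantees heap[0] is the minimum,
    # so sort explicitly instead of returning the raw list.
    return sorted(heap, reverse=True)
```

A test that only feeds already-ordered input can pass even when the explicit sort is missing, which is why the commit adds a counter-example as the test.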
* [SPARK-6948] [MLLIB] compress vectors in VectorAssemblerXiangrui Meng2015-05-072-2/+10
| | | | | | | | | | | | | | The compression is based on storage. brkyvz Author: Xiangrui Meng <meng@databricks.com> Closes #5985 from mengxr/SPARK-6948 and squashes the following commits: df56a00 [Xiangrui Meng] update python tests 6d90d45 [Xiangrui Meng] compress vectors in VectorAssembler (cherry picked from commit e43803b8f477b2c8d28836ac163cb54328d13f1a) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-5726] [MLLIB] Elementwise (Hadamard) Vector Product TransformerOctavian Geagla2015-05-073-0/+180
| | | | | | | | | | | | | | | | | | | | | | See https://issues.apache.org/jira/browse/SPARK-5726 Author: Octavian Geagla <ogeagla@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #4580 from ogeagla/spark-mllib-weighting and squashes the following commits: fac12ad [Octavian Geagla] [SPARK-5726] [MLLIB] Use new createTransformFunc. 90f7e39 [Joseph K. Bradley] small cleanups 4595165 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove erroneous test case. ded3ac6 [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks. 37d4705 [Octavian Geagla] [SPARK-5726] [MLLIB] Incorporated feedback. 1dffeee [Octavian Geagla] [SPARK-5726] [MLLIB] Pass style checks. e436896 [Octavian Geagla] [SPARK-5726] [MLLIB] Remove 'TF' from 'ElementwiseProductTF' cb520e6 [Octavian Geagla] [SPARK-5726] [MLLIB] Rename HadamardProduct to ElementwiseProduct 4922722 [Octavian Geagla] [SPARK-5726] [MLLIB] Hadamard Vector Product Transformer (cherry picked from commit 658a478d3f86456df09d0fbb1ba438fb36d8725c) Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-6093] [MLLIB] Add RegressionMetrics in PySpark/MLlibYanbo Liang2015-05-071-0/+9
| | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-6093 Author: Yanbo Liang <ybliang8@gmail.com> Closes #5941 from yanboliang/spark-6093 and squashes the following commits: 6934af3 [Yanbo Liang] change to @property aac3bc5 [Yanbo Liang] Add RegressionMetrics in PySpark/MLlib (cherry picked from commit 1712a7c7057bf6dd5da8aea1d7fbecdf96ea4b32) Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7388] [SPARK-7383] wrapper for VectorAssembler in PythonBurak Yavuz2015-05-074-5/+27
| The wrapper required the implementation of the `ArrayParam`, because `Array[T]` is hard to obtain from Python. `ArrayParam` has an extra function called `wCast` which is an internal function to obtain `Array[T]` from `Seq[T]`
|
| Author: Burak Yavuz <brkyvz@gmail.com>
| Author: Xiangrui Meng <meng@databricks.com>
|
| Closes #5930 from brkyvz/ml-feat and squashes the following commits:
|
| 73e745f [Burak Yavuz] Merge pull request #3 from mengxr/SPARK-7388
| c221db9 [Xiangrui Meng] overload StringArrayParam.w
| c81072d [Burak Yavuz] addressed comments
| 99c2ebf [Burak Yavuz] add to python_shared_params
| 39ecb07 [Burak Yavuz] fix scalastyle
| 7f7ea2a [Burak Yavuz] [SPARK-7388][SPARK-7383] wrapper for VectorAssembler in Python
|
| (cherry picked from commit 9e2ffb13287e6efe256b8d23a4654e4cc305e20b)
| Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7429] [ML] Params cleanups  Joseph K. Bradley  2015-05-07  3  -4/+4
| Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does.
|
| CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
|
| CC: mengxr
|
| Author: Joseph K. Bradley <joseph@databricks.com>
|
| Closes #5960 from jkbradley/params-cleanups and squashes the following commits:
|
| 118b158 [Joseph K. Bradley] Params.setDefault taking a set of ParamPairs should be annotated with varargs. I thought it would not work before, but it apparently does. CrossValidator.transform should call transformSchema since the underlying Model might be a PipelineModel
|
| (cherry picked from commit 4f87e9562aa0dfe5467d7fbaba9278213106377c)
| Signed-off-by: Xiangrui Meng <meng@databricks.com>
* [SPARK-7421] [MLLIB] OnlineLDA cleanups  Joseph K. Bradley  2015-05-07  4  -28/+34
| Small changes, primarily to allow us more flexibility in the future:
| * Rename "tau_0" to "tau0"
| * Mark LDAOptimizer trait sealed and DeveloperApi.
| * Mark LDAOptimizer subclasses as final.
| * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
|
| CC: hhbyyh
|
| Author: Joseph K. Bradley <joseph@databricks.com>
|
| Closes #5956 from jkbradley/onlinelda-cleanups and squashes the following commits:
|
| f4be508 [Joseph K. Bradley] added newline
| f4003e4 [Joseph K. Bradley] Changes: * Rename "tau_0" to "tau0" * Mark LDAOptimizer trait sealed and DeveloperApi. * Mark LDAOptimizer subclasses as final. * Mark setOptimizer (the one taking an LDAOptimizer) and getOptimizer as DeveloperApi since we may need to change them in the future
|
| (cherry picked from commit 8b6b46e4ff5f19fb7befecaaa0eda63bf29a0e2c)
| Signed-off-by: Joseph K. Bradley <joseph@databricks.com>
* [SPARK-5995] [ML] Make Prediction dev API public  Joseph K. Bradley  2015-05-06  16  -267/+206
| Changes:
| * Update protected prediction methods, following design doc. **<--most interesting change**
| * Changed abstract classes for Estimator and Model to be public. Added DeveloperApi tag. (I kept the traits for Estimator/Model Params private.)
| * Changed ProbabilisticClassificationModel method names to use probability instead of probabilities.
|
| CC: mengxr shivaram etrain
|
| Author: Joseph K. Bradley <joseph@databricks.com>
|
| Closes #5913 from jkbradley/public-dev-api and squashes the following commits:
|
| e9aa0ea [Joseph K. Bradley] moved findMax to DenseVector and renamed to argmax. fixed bug for vector of length 0
| 15b9957 [Joseph K. Bradley] renamed probabilities to probability in method names
| 5cda84d [Joseph K. Bradley] regenerated sharedParams
| 7d1877a [Joseph K. Bradley] Made spark.ml prediction abstractions public. Organized their prediction methods for efficient computation of multiple output columns.
|
| (cherry picked from commit 1ad04dae038673a448f529c39b17817b78d6acd0)
| Signed-off-by: Xiangrui Meng <meng@databricks.com>
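The squashed commit e9aa0ea above mentions an `argmax` on dense vectors and a fix for vectors of length 0. The sketch below shows one way to handle that edge case in plain Python. Returning -1 for an empty vector and breaking ties in favor of the first maximal element are assumptions made for this illustration; the commit message does not state which convention the patch adopted.

```python
def argmax(values):
    """Return the index of the largest element, or -1 for an empty vector (assumed convention)."""
    if not values:
        # Guard for the length-0 case so that values[0] below is never
        # evaluated on an empty vector.
        return -1
    best_index = 0
    best_value = values[0]
    for i in range(1, len(values)):
        # Strict comparison keeps the first maximal element on ties.
        if values[i] > best_value:
            best_index = i
            best_value = values[i]
    return best_index


print(argmax([0.1, 0.7, 0.2]))
print(argmax([]))
```

In a classifier, the index returned for a vector of per-class scores would serve as the predicted label, which is why the empty case needs a defined result.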