path: root/mllib
Commit message | Author | Age | Files | Lines
...
* [SPARK-10163] [ML] Allow single-category features for GBT models | Joseph K. Bradley | 2015-08-21 | 1 | -5/+0
| | | | | | | | | | | | Removed categorical feature info validation since it is no longer needed. This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8367 from jkbradley/gbt-single-cat.
* [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotation | MechCoder | 2015-08-21 | 68 | -862/+692
| | | | | | Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8352 from MechCoder/since.
* [SPARK-9245] [MLLIB] LDA topic assignments | Joseph K. Bradley | 2015-08-20 | 4 | -7/+74
| | | | | | | | | | For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments.
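One way to read "top topic per (document, term) pair" is sketched below. It assumes the estimate is the topic that maximizes the product of the document's topic weight and the topic's term weight; the commit message does not say how the estimate is computed, so that rule, the function name, and the input layout are all hypothetical.

```python
def top_topic_assignments(doc_term_ids, doc_topic, topic_term):
    """Return {term_id: top_topic} for the distinct terms of one document.

    doc_term_ids: term ids of the tokens in the document (repeats allowed)
    doc_topic:    list of K topic weights for this document
    topic_term:   K lists, each holding per-term weights for one topic
    The scoring rule (doc weight * term weight) is an assumption of this sketch.
    """
    assignments = {}
    # Tokens of the same term are exchangeable, so score each distinct term once.
    for term in set(doc_term_ids):
        scores = [doc_topic[k] * topic_term[k][term] for k in range(len(doc_topic))]
        assignments[term] = max(range(len(scores)), key=scores.__getitem__)
    return assignments


# Two topics, two vocabulary terms; term 1 appears twice but gets one estimate.
assignments = top_topic_assignments(
    doc_term_ids=[0, 1, 1],
    doc_topic=[0.8, 0.2],
    topic_term=[[0.9, 0.1], [0.1, 0.9]],
)
```

Scoring per distinct term rather than per token mirrors the note above that tokens within a document are exchangeable.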
* [SPARK-10108] Add since tags to mllib.feature | MechCoder | 2015-08-20 | 9 | -11/+76
| | | | | | Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8309 from MechCoder/tags_feature.
* [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite | Xiangrui Meng | 2015-08-20 | 2 | -27/+101
| | | | | | | | | | Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138.
* [SPARK-9895] User Guide for RFormula Feature Transformer | Eric Liang | 2015-08-19 | 1 | -2/+2
| | | | | | | | mengxr Author: Eric Liang <ekl@databricks.com> Closes #8293 from ericl/docs-2.
* [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clustering | Xiangrui Meng | 2015-08-19 | 9 | -52/+338
| | | | | | | | | | | | This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). MechCoder Closes #8256 Author: Xiangrui Meng <meng@databricks.com> Author: Xiaoqing Wang <spark445@126.com> Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8288 from mengxr/SPARK-8918.
* [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator` | Feynman Liang | 2015-08-19 | 9 | -20/+50
| | | | | | | | | | | | | Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097.
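The idea of a direction flag can be sketched in a few lines. This is not Spark's code: `select_best` and its arguments are hypothetical names, and the sketch only shows how a caller can pick `max` or `min` from the flag instead of negating the metric.

```python
def select_best(candidates, metric, is_larger_better):
    """Return (best_candidate, best_score) without negating the metric.

    metric maps a candidate to a score; is_larger_better says whether the
    score should be maximized (e.g. accuracy) or minimized (e.g. RMSE).
    """
    scored = [(metric(c), c) for c in candidates]
    pick = max if is_larger_better else min
    best_score, best = pick(scored, key=lambda pair: pair[0])
    return best, best_score


# Error-like metric: smaller is better, and the reported score stays positive.
errors = {"a": 0.9, "b": 0.3, "c": 0.5}
best, score = select_best(["a", "b", "c"], errors.get, is_larger_better=False)
```

Because the score is never negated, the value reported to the user is the metric itself.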
* [SPARK-10012] [ML] Missing test case for Params#arrayLengthGt | lewuathe | 2015-08-18 | 1 | -0/+3
| | | | | | | | Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012.
* [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.tree | Bryan Cutler | 2015-08-18 | 24 | -1/+157
| | | | | | | | Added since tags to mllib.tree Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
* [SPARK-9900] [MLLIB] User guide for Association Rules | Feynman Liang | 2015-08-18 | 1 | -1/+1
| | | | | | | | Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.
* [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate CountVectorizerModel | Yuhao Yang | 2015-08-18 | 4 | -155/+402
| | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.
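A minimal sketch of fitting a vocabulary by term frequency follows. It assumes `min_count` is compared against a term's total number of occurrences in the corpus; the commit message only says the filter applies "across the corpus", so that reading, along with the function names and the tie-breaking rule, is hypothetical.

```python
from collections import Counter


def fit_vocabulary(docs, vocab_size, min_count=1):
    """Pick up to vocab_size terms, most frequent first.

    Assumption: min_count filters on total occurrences across all documents.
    Ties are broken alphabetically so the result is deterministic.
    """
    counts = Counter(term for doc in docs for term in doc)
    kept = [(term, n) for term, n in counts.items() if n >= min_count]
    kept.sort(key=lambda pair: (-pair[1], pair[0]))
    return [term for term, _ in kept[:vocab_size]]


def transform(doc, vocabulary):
    """Count occurrences of each vocabulary term in doc; unknown terms are ignored."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for term in doc:
        if term in index:
            vector[index[term]] += 1
    return vector


docs = [["a", "b", "a"], ["b", "c"], ["a", "d"]]
vocab = fit_vocabulary(docs, vocab_size=2, min_count=2)
counts = transform(["a", "b", "a", "z"], vocab)
```

Splitting the work into a fitting step and a transform step follows the estimator/model split described above.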
* [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights public | Yanbo Liang | 2015-08-17 | 1 | -2/+2
| | | | | | | | Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Users cannot currently get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8263 from yanboliang/mlp-public.
* [SPARK-7808] [ML] add package doc for ml.feature | Xiangrui Meng | 2015-08-17 | 1 | -0/+89
| | | | | | | | This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8260 from mengxr/SPARK-7808.
* SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression | Prayag Chandran | 2015-08-17 | 9 | -12/+168
| | | | | | | | | | | | | Added since tags to mllib.regression Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits: fa4dda2 [Prayag Chandran] Re-formatting 6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags 1a0365f [Prayag Chandran] Reformating and adding a few more tags 89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
* [SPARK-8920] [MLLIB] Add @since tags to mllib.linalg | Sameer Abhyankar | 2015-08-17 | 8 | -17/+227
| | | | | | | Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920.
* [SPARK-9959] [MLLIB] Association Rules Java Compatibility | Feynman Liang | 2015-08-17 | 1 | -2/+28
| | | | | | | | mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java.
* [HOTFIX] fix duplicated braces | Davies Liu | 2015-08-14 | 1 | -1/+1
| | | | | | Author: Davies Liu <davies@databricks.com> Closes #8219 from davies/fix_typo.
* [SPARK-9981] [ML] Made labels public for StringIndexerModel | Joseph K. Bradley | 2015-08-14 | 2 | -1/+22
| | | | | | | | | | Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.
* [SPARK-9929] [SQL] support metadata in withColumn | Wenchen Fan | 2015-08-14 | 4 | -7/+6
| | | | | | | | In MLlib, we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8159 from cloud-fan/withColumn.
* [SPARK-8744] [ML] Add a public constructor to StringIndexer | Holden Karau | 2015-08-14 | 2 | -1/+5
| | | | | | | | It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model. Author: Holden Karau <holden@pigscanfly.ca> Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
* [SPARK-9956] [ML] Make trees work with one-category features | Joseph K. Bradley | 2015-08-14 | 2 | -10/+30
| | | | | | | | | | | | | | This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical. As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing. Targeted for 1.5 and master CC: manishamde mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8187 from jkbradley/tree-1cat.
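The temporary fix described above can be pictured as a small preprocessing step. The sketch assumes the categorical feature info is a mapping from feature index to number of categories; `split_feature_kinds` is a hypothetical helper, not the actual DecisionTreeMetadata code.

```python
def split_feature_kinds(categorical_features_info):
    """Separate truly categorical features from those to be treated as continuous.

    categorical_features_info: {feature_index: number_of_categories}
    A feature with exactly one category carries no split information as a
    categorical feature, so this sketch hands it back as continuous.
    """
    categorical = {}
    as_continuous = []
    for feature, arity in sorted(categorical_features_info.items()):
        if arity == 1:
            as_continuous.append(feature)
        else:
            categorical[feature] = arity
    return categorical, as_continuous


categorical, as_continuous = split_feature_kinds({0: 3, 1: 1, 2: 2})
```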
* [SPARK-9661] [MLLIB] minor clean-up of SPARK-9661 | Xiangrui Meng | 2015-08-14 | 4 | -25/+28
| | | | | | | | Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8190 from mengxr/SPARK-9661-fix.
* [SPARK-9922] [ML] rename StringIndexerReverse to IndexToString | Xiangrui Meng | 2015-08-13 | 2 | -36/+48
| | | | | | | | | | | | | What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng <meng@databricks.com> Closes #8152 from mengxr/SPARK-9922.
* [SPARK-9661] [MLLIB] [ML] Java compatibility | MechCoder | 2015-08-13 | 5 | -3/+99
| | | | | | | | | | | | I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same. 1. Some methods in LDAModel. 2. runMiniBatchSGD 3. kolmogorovSmirnovTest Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8126 from MechCoder/java_incop.
* [MINOR] [ML] change MultilayerPerceptronClassifierModel to MultilayerPerceptronClassificationModel | Yanbo Liang | 2015-08-13 | 1 | -8/+8
| | | | | | | | | | To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8164 from yanboliang/mlp-name.
* [SPARK-9073] [ML] spark.ml Models copy() should call setParent when there is a parent | lewuathe | 2015-08-13 | 39 | -20/+135
| | | | | | | | | | | Copied ML models must have the same parent as the original ones. Author: lewuathe <lewuathe@me.com> Author: Lewuathe <lewuathe@me.com> Closes #7447 from Lewuathe/SPARK-9073.
* [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol | Xiangrui Meng | 2015-08-12 | 2 | -50/+13
| | | | | | | | | | | | | | | | This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
* [SPARK-9914] [ML] define setters explicitly for Java and use setParam group in RFormula | Xiangrui Meng | 2015-08-12 | 1 | -5/+6
| | | | | | | | | | | | | | | The problem with defining setters in the base class is that it doesn't return the correct type in Java. ericl Author: Xiangrui Meng <meng@databricks.com> Closes #8143 from mengxr/SPARK-9914 and squashes the following commits: d36c887 [Xiangrui Meng] remove setters from model a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group
* [SPARK-8922] [DOCUMENTATION, MLLIB] Add @since tags to mllib.evaluation | shikai.tang | 2015-08-12 | 5 | -5/+50
| | | | | | Author: shikai.tang <tar.sky06@gmail.com> Closes #7429 from mosessky/master.
* [SPARK-9917] [ML] add getMin/getMax and doc for originalMin/originalMax in MinMaxScaler | Xiangrui Meng | 2015-08-12 | 1 | -1/+9
| | | | | | | | | | hhbyyh Author: Xiangrui Meng <meng@databricks.com> Closes #8145 from mengxr/SPARK-9917.
* [SPARK-9903] [MLLIB] skip local processing in PrefixSpan if there are no small prefixes | Xiangrui Meng | 2015-08-12 | 1 | -16/+21
| | | | | | | | | | There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8136 from mengxr/SPARK-9903.
* [SPARK-9704] [ML] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs | Joseph K. Bradley | 2015-08-12 | 3 | -10/+20
| | | | | | | | | | | | | | | | | | Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi. CC: mengxr EronWright Author: Joseph K. Bradley <joseph@databricks.com> Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits: 7ebefda [Joseph K. Bradley] update per code review 7ff0768 [Joseph K. Bradley] attempting to add mima fix 756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent ae7767d [Joseph K. Bradley] added another warning 94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
* [SPARK-9915] [ML] stopWords should use StringArrayParam | Xiangrui Meng | 2015-08-12 | 1 | -3/+3
| | | | | | | | hhbyyh Author: Xiangrui Meng <meng@databricks.com> Closes #8141 from mengxr/SPARK-9915.
* [SPARK-9912] [MLLIB] QRDecomposition should use QType and RType for type names instead of UType and VType | Xiangrui Meng | 2015-08-12 | 1 | -1/+1
| | | | | | | | | | hhbyyh Author: Xiangrui Meng <meng@databricks.com> Closes #8140 from mengxr/SPARK-9912.
* [SPARK-9909] [ML] [TRIVIAL] move weightCol to shared params | Holden Karau | 2015-08-12 | 3 | -15/+20
| | | | | | | | As per the TODO move weightCol to Shared Params. Author: Holden Karau <holden@pigscanfly.ca> Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
* [SPARK-9913] [MLLIB] LDAUtils should be private | Xiangrui Meng | 2015-08-12 | 1 | -1/+1
| | | | | | | | feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8142 from mengxr/SPARK-9913.
* [SPARK-9789] [ML] Added logreg threshold param back | Joseph K. Bradley | 2015-08-12 | 5 | -40/+137
| | | | | | | | | | Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set. CC: mengxr dbtsai feynmanliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8079 from jkbradley/logreg-reinstate-threshold.
* [SPARK-9847] [ML] Modified copyValues to distinguish between default, explicit param values | Joseph K. Bradley | 2015-08-12 | 2 | -3/+24
| | | | | | | | | | | | | From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics. This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8115 from jkbradley/copyvalues-fix.
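The distinction between default and explicit values can be illustrated with a toy parameter holder. The class below is a hypothetical stand-in, not Spark's `Params`: it keeps two maps and copies each one to its counterpart on the target, so a copied default never looks as if the user had set it.

```python
class Params:
    """Toy parameter holder with separate maps for defaults and explicit values."""

    def __init__(self):
        self.default_param_map = {}  # values supplied by the component itself
        self.param_map = {}          # values the user set explicitly

    def set_default(self, name, value):
        self.default_param_map[name] = value
        return self

    def set(self, name, value):
        self.param_map[name] = value
        return self

    def is_set(self, name):
        """True only for explicitly set params."""
        return name in self.param_map

    def get(self, name):
        """Explicit value if present, otherwise the default (KeyError if neither)."""
        if name in self.param_map:
            return self.param_map[name]
        return self.default_param_map[name]

    def copy_values(self, target):
        """Copy defaults to defaults and explicit values to explicit values."""
        target.default_param_map.update(self.default_param_map)
        target.param_map.update(self.param_map)
        return target


source = Params().set_default("threshold", 0.5).set("thresholds", [0.3, 0.7])
copied = source.copy_values(Params())
```

After the copy, `copied.is_set("threshold")` is false even though `copied.get("threshold")` still returns the default, which is the behavior the JIRA text asks for.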
* [HOTFIX] Fix style error caused by 017b5de | Andrew Or | 2015-08-11 | 1 | -1/+1
* [SPARK-8925] [MLLIB] Add @since tags to mllib.util | Sudhakar Thota | 2015-08-11 | 1 | -1/+21
| | | | | | | | | Went through the change history of the file MLUtils.scala and picked the version in which each change went in. Author: Sudhakar Thota <sudhakarthota@yahoo.com> Author: Sudhakar Thota <sudhakarthota@sudhakars-mbp-2.usca.ibm.com> Closes #7436 from sthota2014/SPARK-8925_thotas.
* [SPARK-9788] [MLLIB] Fix LDA Binary Compatibility | Feynman Liang | 2015-08-11 | 4 | -24/+46
| | | | | | | | | | | | | | | | 1. Add `asymmetricDocConcentration` and revert docConcentration changes. If the (internal) doc concentration vector is a single value, `getDocConcentration` returns it. If it is a constant vector, `getDocConcentration` returns the first item, and fails otherwise. 2. Give `LDAModel.gammaShape` a default value in `LDAModel` concrete class constructors. jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8077 from feynmanliang/SPARK-9788 and squashes the following commits: 6b07bc8 [Feynman Liang] Code review changes 9d6a71e [Feynman Liang] Add asymmetricAlpha alias bf4e685 [Feynman Liang] Asymmetric docConcentration 4cab972 [Feynman Liang] Default gammaShape
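The getter rule in item 1 can be sketched as follows. The function name and the choice of `ValueError` are hypothetical; only the three cases (single value, constant vector, anything else) come from the description above.

```python
def get_doc_concentration(alpha):
    """Return a scalar view of the doc concentration vector, or fail.

    A single-value vector returns that value, a constant vector returns its
    first item, and a vector with differing entries raises ValueError.
    """
    if len(alpha) == 0:
        raise ValueError("doc concentration vector is empty")
    first = alpha[0]
    if all(value == first for value in alpha):
        return first
    raise ValueError("doc concentration is asymmetric; no single value to return")
```

A caller that needs the full vector would use the asymmetric accessor instead of this scalar getter.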
* [SPARK-9750] [MLLIB] Improve equals on SparseMatrix and DenseMatrix | Feynman Liang | 2015-08-11 | 2 | -2/+24
| | | | | | | | | | | | | | | | | Adds unit test for `equals` on `mllib.linalg.Matrix` class and `equals` to both `SparseMatrix` and `DenseMatrix`. Supports equality testing between `SparseMatrix` and `DenseMatrix`. mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8042 from feynmanliang/SPARK-9750 and squashes the following commits: bb70d5e [Feynman Liang] Breeze compare for dense matrices as well, in case other is sparse ab6f3c8 [Feynman Liang] Sparse matrix compare for equals 22782df [Feynman Liang] Add equality based on matrix semantics, not representation 78f9426 [Feynman Liang] Add casts 43d28fa [Feynman Liang] Fix failing test 6416fa0 [Feynman Liang] Add failing sparse matrix equals tests
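Equality "based on matrix semantics, not representation" can be sketched by reducing both layouts to the same set of non-zero entries. The sketch assumes a column-major value array for the dense case and compressed sparse column (CSC) arrays for the sparse case; the helper names are hypothetical and this is not the Breeze-based comparison the commits mention.

```python
def dense_entries(num_rows, num_cols, values):
    """Non-zero entries {(row, col): value} of a column-major dense matrix."""
    assert len(values) == num_rows * num_cols
    return {(k % num_rows, k // num_rows): v for k, v in enumerate(values) if v != 0.0}


def sparse_entries(num_cols, col_ptrs, row_indices, values):
    """Non-zero entries {(row, col): value} of a CSC sparse matrix."""
    entries = {}
    for col in range(num_cols):
        for k in range(col_ptrs[col], col_ptrs[col + 1]):
            if values[k] != 0.0:  # explicitly stored zeros do not count
                entries[(row_indices[k], col)] = values[k]
    return entries


def matrices_equal(shape_a, entries_a, shape_b, entries_b):
    """Two matrices are equal when shapes and non-zero entries agree."""
    return shape_a == shape_b and entries_a == entries_b


# The same 2x2 diagonal matrix in both representations.
dense = dense_entries(2, 2, [1.0, 0.0, 0.0, 2.0])
sparse = sparse_entries(2, [0, 1, 2], [0, 1], [1.0, 2.0])
same = matrices_equal((2, 2), dense, (2, 2), sparse)
```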
* [SPARK-8764] [ML] string indexer should take option to handle unseen values | Holden Karau | 2015-08-11 | 4 | -4/+73
| | | | | | | | | | | | | | | | | | | | | | As a precursor to adding a public constructor add an option to handle unseen values by skipping rather than throwing an exception (default remains throwing an exception). Author: Holden Karau <holden@pigscanfly.ca> Closes #7266 from holdenk/SPARK-8764-string-indexer-should-take-option-to-handle-unseen-values and squashes the following commits: 38a4de9 [Holden Karau] fix long line 045bf22 [Holden Karau] Add a second b entry so b gets 0 for sure 81dd312 [Holden Karau] Update the docs for handleInvalid param to be more descriptive 7f37f6e [Holden Karau] remove extra space (scala style) 414e249 [Holden Karau] And switch to using handleInvalid instead of skipInvalid 1e53f9b [Holden Karau] update the param (codegen side) 7a22215 [Holden Karau] fix typo 100a39b [Holden Karau] Merge in master aa5b093 [Holden Karau] Since we filter we should never go down this code path if getSkipInvalid is true 75ffa69 [Holden Karau] Remove extra newline d69ef5e [Holden Karau] Add a test b5734be [Holden Karau] Add support for unseen labels afecd4e [Holden Karau] Add a param to skip invalid entries.
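The skip-or-fail behavior can be sketched as below. The option values `"error"` and `"skip"`, the frequency-based label ordering, and both function names are assumptions made for illustration; the commit message only states that unseen values can be skipped and that throwing remains the default.

```python
from collections import Counter


def fit_labels(values):
    """Order labels by descending frequency (ties alphabetical); an assumed convention."""
    counts = Counter(values)
    return sorted(counts, key=lambda v: (-counts[v], v))


def index_values(values, labels, handle_invalid="error"):
    """Map each value to its label index.

    handle_invalid="error" raises on a value missing from labels (the default);
    handle_invalid="skip" drops such rows from the output instead.
    """
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unknown handle_invalid option: {handle_invalid!r}")
    index = {label: i for i, label in enumerate(labels)}
    indexed = []
    for value in values:
        if value in index:
            indexed.append((value, index[value]))
        elif handle_invalid == "error":
            raise ValueError(f"unseen label: {value!r}")
        # handle_invalid == "skip": the row is filtered out
    return indexed


labels = fit_labels(["a", "b", "b", "c", "b", "a"])
kept = index_values(["a", "d", "b"], labels, handle_invalid="skip")
```

Passing `labels` in directly, rather than always fitting them, is also how a pre-computed index could be supplied.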
* [SPARK-8345] [ML] Add an SQL node as a feature transformer | Yanbo Liang | 2015-08-11 | 2 | -0/+116
| | | | | | | | | | | | | | Implements transforms that are defined by a SQL statement. Currently we only support SQL syntax like 'SELECT ... FROM __THIS__' where '__THIS__' represents the underlying table of the input dataset. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7465 from yanboliang/spark-8345 and squashes the following commits: b403fcb [Yanbo Liang] address comments 0d4bb15 [Yanbo Liang] a better transformSchema() implementation 51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer
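The `__THIS__` placeholder idea can be tried out without Spark. The sketch below uses Python's standard `sqlite3` module as a stand-in engine: it loads the input rows into a temporary table and substitutes the table name for the placeholder. The function name, the default table name, and the use of SQLite are all assumptions of this illustration.

```python
import sqlite3


def sql_transform(rows, columns, statement, table_name="input_rows"):
    """Run a 'SELECT ... FROM __THIS__' statement against in-memory rows.

    Returns (output_column_names, output_rows). The placeholder is replaced
    with the name of a temporary table holding the input.
    """
    conn = sqlite3.connect(":memory:")
    try:
        column_list = ", ".join(f'"{c}"' for c in columns)
        conn.execute(f'CREATE TABLE "{table_name}" ({column_list})')
        slots = ", ".join("?" for _ in columns)
        conn.executemany(f'INSERT INTO "{table_name}" VALUES ({slots})', rows)
        cursor = conn.execute(statement.replace("__THIS__", f'"{table_name}"'))
        names = [d[0] for d in cursor.description]
        return names, cursor.fetchall()
    finally:
        conn.close()


names, result = sql_transform(
    rows=[(0, 1.0, 3.0), (2, 2.0, 5.0)],
    columns=["id", "v1", "v2"],
    statement="SELECT *, (v1 + v2) AS v3 FROM __THIS__",
)
```

Keeping the table name behind a placeholder lets the same statement be reused on any input with matching columns.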
* [SPARK-9755] [MLLIB] Add docs to MultivariateOnlineSummarizer methods | Feynman Liang | 2015-08-10 | 1 | -0/+16
| | | | | | | | | | | | Adds method documentations back to `MultivariateOnlineSummarizer`, which were present in 1.4 but disappeared somewhere along the way to 1.5. jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8045 from feynmanliang/SPARK-9755 and squashes the following commits: af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs
* [SPARK-9719] [ML] Clean up Naive Bayes doc | Feynman Liang | 2015-08-07 | 1 | -0/+4
| | | | | | | | | | | | Small documentation cleanups, including: * Adds documentation for `pi` and `theta` * setParam to `setModelType` Author: Feynman Liang <fliang@databricks.com> Closes #8047 from feynmanliang/SPARK-9719 and squashes the following commits: b372438 [Feynman Liang] Clean up naive bayes doc
* [SPARK-9756] [ML] Make constructors in ML decision trees private | Feynman Liang | 2015-08-07 | 4 | -4/+7
| | | | | | | | | | | | | These should be made private until there is a public constructor for providing `rootNode: Node` to use these constructors. jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8046 from feynmanliang/SPARK-9756 and squashes the following commits: 2cbdf08 [Feynman Liang] Make RFRegressionModel aux constructor private a06f596 [Feynman Liang] Make constructors in ML decision trees private
* [SPARK-9748] [MLLIB] Centriod typo in KMeansModel | Bertrand Dechoux | 2015-08-07 | 1 | -5/+5
| | | | | | | | | | A minor typo (centriod -> centroid). Readable variable names help every user. Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com> Closes #8037 from BertrandDechoux/kmeans-typo and squashes the following commits: 47632fe [Bertrand Dechoux] centriod typo
* [SPARK-8481] [MLLIB] GaussianMixtureModel predict accepting single vector | Dariusz Kobylarz | 2015-08-07 | 2 | -0/+23
| | | | | | | | | | | | | | | Resubmit of [https://github.com/apache/spark/pull/6906] for adding single-vec predict to GMMs CC: dkobylarz mengxr To be merged with master and branch-1.5 Primary author: dkobylarz Author: Dariusz Kobylarz <darek.kobylarz@gmail.com> Closes #8039 from jkbradley/gmm-predict-vec and squashes the following commits: bfbedc4 [Dariusz Kobylarz] [SPARK-8481] [MLlib] GaussianMixtureModel predict accepting single vector