| Commit message | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
|
|
|
|
| |
Removed categorical feature info validation since no longer needed
This is needed to make the ML user guide examples work (in another current PR).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8367 from jkbradley/gbt-single-cat.
|
|
|
|
|
|
| |
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8352 from MechCoder/since.
|
|
|
|
|
|
|
|
|
|
| |
For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
CC: rotationsymmetry mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8329 from jkbradley/lda-topic-assignments.
|
|
|
|
|
|
| |
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8309 from MechCoder/tags_feature.
|
|
|
|
|
|
|
|
|
|
| |
Java test suite
Otherwise, setters do not return self type. jkbradley avulanov
Author: Xiangrui Meng <meng@databricks.com>
Closes #8342 from mengxr/SPARK-10138.
|
|
|
|
|
|
|
|
| |
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8293 from ericl/docs-2.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). MechCoder
Closes #8256
Author: Xiangrui Meng <meng@databricks.com>
Author: Xiaoqing Wang <spark445@126.com>
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8288 from mengxr/SPARK-8918.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, telling users of `Evaluator` whether the chosen metric should be maximized or minimized.
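A minimal sketch of the idea in plain Python (not the Spark API): the class and function names below are hypothetical, and the toy "models" are dicts carrying precomputed metrics. The point is only that the selection logic consults the flag rather than negating the metric.

```python
from typing import Callable, Sequence


class Evaluator:
    """Hypothetical stand-in for an evaluator that knows its metric direction."""

    def __init__(self, metric: Callable[[dict], float], is_larger_better: bool):
        self.metric = metric
        self.is_larger_better = is_larger_better

    def evaluate(self, model: dict) -> float:
        return self.metric(model)


def select_best(models: Sequence[dict], evaluator: Evaluator):
    """Return (best_model, best_metric) without negating any metric."""
    scored = [(evaluator.evaluate(m), m) for m in models]
    # Maximize or minimize depending on the evaluator's declared direction.
    pick = max if evaluator.is_larger_better else min
    best_metric, best_model = pick(scored, key=lambda pair: pair[0])
    return best_model, best_metric


# Toy candidates with made-up metric values, for illustration only.
candidates = [
    {"name": "a", "rmse": 2.5, "auc": 0.71},
    {"name": "b", "rmse": 1.2, "auc": 0.64},
]
rmse_eval = Evaluator(lambda m: m["rmse"], is_larger_better=False)
auc_eval = Evaluator(lambda m: m["auc"], is_larger_better=True)

best_by_rmse, rmse = select_best(candidates, rmse_eval)
best_by_auc, auc = select_best(candidates, auc_eval)
```

With this shape, the metric reported to the user is the raw value (for example a positive RMSE), never a negated one.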
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8290 from feynmanliang/SPARK-10097.
|
|
|
|
|
|
|
|
| |
Currently there is no test case for `Params#arrayLengthGt`.
Author: lewuathe <lewuathe@me.com>
Closes #8223 from Lewuathe/SPARK-10012.
|
|
|
|
|
|
|
|
| |
Added since tags to mllib.tree
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
|
|
|
|
|
|
|
|
| |
Updates FPM user guide to include Association Rules.
Author: Feynman Liang <fliang@databricks.com>
Closes #8207 from feynmanliang/SPARK-9900-arules.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CountVectorizerModel
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed `minCount` to act as a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.
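A rough sketch of vocabulary extraction by term frequency, in plain Python rather than the Spark API. It assumes `minCount` is compared against a term's total count over the whole corpus; the function names and the `vocab_size` cap are hypothetical and only there for illustration.

```python
from collections import Counter


def fit_vocabulary(docs, min_count=1, vocab_size=None):
    """Build a vocabulary ordered by descending corpus-wide term count."""
    counts = Counter(token for doc in docs for token in doc)
    # Assumption: min_count filters on the total count across the corpus.
    kept = [(term, c) for term, c in counts.items() if c >= min_count]
    # Sort by count (descending), then by term for a deterministic order.
    kept.sort(key=lambda tc: (-tc[1], tc[0]))
    if vocab_size is not None:
        kept = kept[:vocab_size]
    return [term for term, _ in kept]


def transform(doc, vocabulary):
    """Map a tokenized document to a dense count vector over the vocabulary."""
    index = {term: i for i, term in enumerate(vocabulary)}
    vector = [0] * len(vocabulary)
    for token in doc:
        i = index.get(token)
        if i is not None:  # out-of-vocabulary tokens are ignored
            vector[i] += 1
    return vector


docs = [["a", "b", "c"], ["a", "b", "b"], ["a"]]
vocab = fit_vocabulary(docs, min_count=2)
counts_vec = transform(["a", "b", "b", "c"], vocab)
```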
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7388 from hhbyyh/cvEstimator.
|
|
|
|
|
|
|
|
| |
Fix the issue that `layers` and `weights` should be public variables of `MultilayerPerceptronClassificationModel`. Currently, users cannot get `layers` and `weights` from a `MultilayerPerceptronClassificationModel`.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8263 from yanboliang/mlp-public.
|
|
|
|
|
|
|
|
| |
This PR adds a short description of the `ml.feature` package with a code example. The Java package doc will come in a separate PR. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8260 from mengxr/SPARK-7808.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Added since tags to mllib.regression
Author: Prayag Chandran <prayagchandran@gmail.com>
Closes #7518 from prayagchandran/sinceTags and squashes the following commits:
fa4dda2 [Prayag Chandran] Re-formatting
6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags
1a0365f [Prayag Chandran] Reformating and adding a few more tags
89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
|
|
|
|
|
|
|
| |
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome>
Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local>
Closes #7729 from sabhyankar/branch_8920.
|
|
|
|
|
|
|
|
| |
mengxr
Author: Feynman Liang <fliang@databricks.com>
Closes #8206 from feynmanliang/SPARK-9959-arules-java.
|
|
|
|
|
|
| |
Author: Davies Liu <davies@databricks.com>
Closes #8219 from davies/fix_typo.
|
|
|
|
|
|
|
|
|
|
| |
Also added unit test for integration between StringIndexerModel and IndexToString
CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8211 from jkbradley/stridx-labels.
|
|
|
|
|
|
|
|
| |
In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8159 from cloud-fan/withColumn.
|
|
|
|
|
|
|
|
| |
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical.
As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing.
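A small sketch of the temporary fix in plain Python (not the Spark code): features declared with a single category are dropped from the categorical map and so fall back to being continuous. The function name and the dict-based representation are assumptions for illustration.

```python
def split_feature_types(num_features, categorical_features_info):
    """Return (categorical map, list of continuous feature indices).

    categorical_features_info maps feature index -> number of categories.
    Features with only one category are treated as continuous.
    """
    categorical = {
        feature: arity
        for feature, arity in categorical_features_info.items()
        if arity > 1
    }
    continuous = [f for f in range(num_features) if f not in categorical]
    return categorical, continuous


# Feature 0 has a single category, feature 2 has four, feature 1 is undeclared.
categorical, continuous = split_feature_types(3, {0: 1, 2: 4})
```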
Targeted for 1.5 and master
CC: manishamde mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8187 from jkbradley/tree-1cat.
|
|
|
|
|
|
|
|
| |
Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8190 from mengxr/SPARK-9661-fix.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name does not clearly describe the transformation. Renaming it to `IndexToString` might be better.
~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~
I also removed `invert`.
jkbradley holdenk
Author: Xiangrui Meng <meng@databricks.com>
Closes #8152 from mengxr/SPARK-9922.
|
|
|
|
|
|
|
|
|
|
|
|
| |
I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same.
1. Some methods in LDAModel.
2. runMiniBatchSGD
3. kolmogorovSmirnovTest
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8126 from MechCoder/java_incop.
|
|
|
|
|
|
|
|
|
|
| |
MultilayerPerceptronClassificationModel
To follow the naming rule of ML, change `MultilayerPerceptronClassifierModel` to `MultilayerPerceptronClassificationModel` like `DecisionTreeClassificationModel`, `GBTClassificationModel` and so on.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8164 from yanboliang/mlp-name.
|
|
|
|
|
|
|
|
|
|
|
| |
a parent
Copied ML models must have the same parent as the original ones.
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <lewuathe@me.com>
Closes #7447 from Lewuathe/SPARK-9073.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.
This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.
jkbradley yu-iskw
Author: Xiangrui Meng <meng@databricks.com>
Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:
149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in RFormula
The problem with defining setters in the base class is that it doesn't return the correct type in Java.
ericl
Author: Xiangrui Meng <meng@databricks.com>
Closes #8143 from mengxr/SPARK-9914 and squashes the following commits:
d36c887 [Xiangrui Meng] remove setters from model
a49021b [Xiangrui Meng] define setters explicitly for Java and use setParam group
|
|
|
|
|
|
| |
Author: shikai.tang <tar.sky06@gmail.com>
Closes #7429 from mosessky/master.
|
|
|
|
|
|
|
|
|
|
| |
MinMaxScaler
hhbyyh
Author: Xiangrui Meng <meng@databricks.com>
Closes #8145 from mengxr/SPARK-9917.
|
|
|
|
|
|
|
|
|
|
| |
small prefixes
There exists a chance that the prefixes keep growing to the maximum pattern length. Then the final local processing step becomes unnecessary. feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8136 from mengxr/SPARK-9903.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
public APIs
Made ProbabilisticClassifier, Identifiable, VectorUDT public. All are annotated as DeveloperApi.
CC: mengxr EronWright
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8004 from jkbradley/ml-api-public-items and squashes the following commits:
7ebefda [Joseph K. Bradley] update per code review
7ff0768 [Joseph K. Bradley] attepting to add mima fix
756d84c [Joseph K. Bradley] VectorUDT annotated as AlphaComponent
ae7767d [Joseph K. Bradley] added another warning
94fd553 [Joseph K. Bradley] Made ProbabilisticClassifier, Identifiable, VectorUDT public APIs
|
|
|
|
|
|
|
|
| |
hhbyyh
Author: Xiangrui Meng <meng@databricks.com>
Closes #8141 from mengxr/SPARK-9915.
|
|
|
|
|
|
|
|
|
|
| |
names instead of UType and VType
hhbyyh
Author: Xiangrui Meng <meng@databricks.com>
Closes #8140 from mengxr/SPARK-9912.
|
|
|
|
|
|
|
|
| |
As per the TODO, move weightCol to Shared Params.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8144 from holdenk/SPARK-9909-move-weightCol-toSharedParams.
|
|
|
|
|
|
|
|
| |
feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8142 from mengxr/SPARK-9913.
|
|
|
|
|
|
|
|
|
|
| |
Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set.
CC: mengxr dbtsai feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8079 from jkbradley/logreg-reinstate-threshold.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
explicit param values
From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics.
This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params.
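The distinction can be sketched in plain Python. This is a toy model of the two maps, not the Spark `Params` class; all names are hypothetical. Defaults are copied into the target's default map, so a param that was never explicitly set on the source does not look explicitly set on the copy.

```python
class Params:
    """Toy parameter holder with separate default and explicit maps."""

    def __init__(self):
        self.default_param_map = {}
        self.param_map = {}

    def set_default(self, name, value):
        self.default_param_map[name] = value
        return self

    def set(self, name, value):
        self.param_map[name] = value
        return self

    def is_set(self, name):
        """True only for explicitly set params."""
        return name in self.param_map

    def get_or_default(self, name):
        if name in self.param_map:
            return self.param_map[name]
        return self.default_param_map[name]

    def copy_values(self, target):
        # Defaults go to the default map, explicit values to the explicit map.
        target.default_param_map.update(self.default_param_map)
        target.param_map.update(self.param_map)
        return target


estimator = Params().set_default("threshold", 0.5).set("thresholds", [0.3, 0.7])
model = estimator.copy_values(Params())
```

Here `model` still reports `threshold` as unset, so logic that checks which of the two params was set explicitly sees a consistent picture.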
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8115 from jkbradley/copyvalues-fix.
|
| |
|
|
|
|
|
|
|
|
|
| |
Went through the change history of the file MLUtils.scala and picked the version in which each change went in.
Author: Sudhakar Thota <sudhakarthota@yahoo.com>
Author: Sudhakar Thota <sudhakarthota@sudhakars-mbp-2.usca.ibm.com>
Closes #7436 from sthota2014/SPARK-8925_thotas.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. Add `asymmetricDocConcentration` and revert docConcentration changes. If the (internal) doc concentration vector is a single value, `getDocConcentration` returns it. If it is a constant vector, `getDocConcentration` returns the first item; otherwise it fails.
2. Give `LDAModel.gammaShape` a default value in `LDAModel` concrete class constructors.
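The getter behaviour described in item 1 can be sketched in plain Python. The function name is hypothetical, and rejecting an empty vector is an assumption not stated above.

```python
def get_doc_concentration(doc_concentration):
    """Return a scalar view of the doc concentration vector, if one exists."""
    if len(doc_concentration) == 0:
        # Assumption: an empty vector is treated as an error.
        raise ValueError("doc concentration vector is empty")
    first = doc_concentration[0]
    # A single value or a constant vector both collapse to the first item.
    if all(x == first for x in doc_concentration):
        return first
    raise ValueError("doc concentration is asymmetric; no single value exists")


single = get_doc_concentration([0.1])
constant = get_doc_concentration([0.5, 0.5, 0.5])
try:
    get_doc_concentration([0.1, 0.2])
    asymmetric_failed = False
except ValueError:
    asymmetric_failed = True
```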
jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8077 from feynmanliang/SPARK-9788 and squashes the following commits:
6b07bc8 [Feynman Liang] Code review changes
9d6a71e [Feynman Liang] Add asymmetricAlpha alias
bf4e685 [Feynman Liang] Asymmetric docConcentration
4cab972 [Feynman Liang] Default gammaShape
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds unit test for `equals` on `mllib.linalg.Matrix` class and `equals` to both `SparseMatrix` and `DenseMatrix`. Supports equality testing between `SparseMatrix` and `DenseMatrix`.
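As an illustration of equality by matrix semantics rather than representation, here is a plain-Python sketch. It uses a naive element-wise comparison, not the Breeze-based comparison mentioned in the squashed commits, and the toy classes (column-major dense, CSC sparse) are hypothetical stand-ins.

```python
class DenseMatrix:
    """Toy dense matrix storing values in column-major order."""

    def __init__(self, num_rows, num_cols, values):
        self.num_rows, self.num_cols, self.values = num_rows, num_cols, values

    def get(self, i, j):
        return self.values[i + j * self.num_rows]


class SparseMatrix:
    """Toy sparse matrix in compressed sparse column (CSC) format."""

    def __init__(self, num_rows, num_cols, col_ptrs, row_indices, values):
        self.num_rows, self.num_cols = num_rows, num_cols
        self.col_ptrs, self.row_indices, self.values = col_ptrs, row_indices, values

    def get(self, i, j):
        # Scan the stored entries of column j for row i; absent means zero.
        for k in range(self.col_ptrs[j], self.col_ptrs[j + 1]):
            if self.row_indices[k] == i:
                return self.values[k]
        return 0.0


def matrices_equal(a, b):
    """Compare two matrices entry by entry, regardless of storage format."""
    if (a.num_rows, a.num_cols) != (b.num_rows, b.num_cols):
        return False
    return all(
        a.get(i, j) == b.get(i, j)
        for j in range(a.num_cols)
        for i in range(a.num_rows)
    )


dense = DenseMatrix(2, 2, [1.0, 0.0, 0.0, 2.0])
sparse = SparseMatrix(2, 2, [0, 1, 2], [0, 1], [1.0, 2.0])
other = DenseMatrix(2, 2, [1.0, 0.0, 0.0, 3.0])
```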
mengxr
Author: Feynman Liang <fliang@databricks.com>
Closes #8042 from feynmanliang/SPARK-9750 and squashes the following commits:
bb70d5e [Feynman Liang] Breeze compare for dense matrices as well, in case other is sparse
ab6f3c8 [Feynman Liang] Sparse matrix compare for equals
22782df [Feynman Liang] Add equality based on matrix semantics, not representation
78f9426 [Feynman Liang] Add casts
43d28fa [Feynman Liang] Fix failing test
6416fa0 [Feynman Liang] Add failing sparse matrix equals tests
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As a precursor to adding a public constructor, add an option to handle unseen values by skipping them rather than throwing an exception (the default remains throwing an exception).
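A plain-Python sketch of the two behaviours (not the Spark implementation): the function name is hypothetical, and the option values `"error"` and `"skip"` are assumed here for illustration.

```python
def index_labels(values, labels, handle_invalid="error"):
    """Map each value to its label index, handling unseen values per option."""
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unsupported handle_invalid option: {handle_invalid}")
    index = {label: float(i) for i, label in enumerate(labels)}
    out = []
    for value in values:
        if value in index:
            out.append(index[value])
        elif handle_invalid == "skip":
            continue  # drop rows whose label was not seen at fit time
        else:
            raise ValueError(f"unseen label: {value}")
    return out


labels = ["a", "b"]
skipped = index_labels(["a", "c", "b"], labels, handle_invalid="skip")
try:
    index_labels(["a", "c", "b"], labels)  # default: raise on unseen values
    default_raised = False
except ValueError:
    default_raised = True
```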
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7266 from holdenk/SPARK-8764-string-indexer-should-take-option-to-handle-unseen-values and squashes the following commits:
38a4de9 [Holden Karau] fix long line
045bf22 [Holden Karau] Add a second b entry so b gets 0 for sure
81dd312 [Holden Karau] Update the docs for handleInvalid param to be more descriptive
7f37f6e [Holden Karau] remove extra space (scala style)
414e249 [Holden Karau] And switch to using handleInvalid instead of skipInvalid
1e53f9b [Holden Karau] update the param (codegen side)
7a22215 [Holden Karau] fix typo
100a39b [Holden Karau] Merge in master
aa5b093 [Holden Karau] Since we filter we should never go down this code path if getSkipInvalid is true
75ffa69 [Holden Karau] Remove extra newline
d69ef5e [Holden Karau] Add a test
b5734be [Holden Karau] Add support for unseen labels
afecd4e [Holden Karau] Add a param to skip invalid entries.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implements the transforms which are defined by SQL statement.
Currently we only support SQL syntax like 'SELECT ... FROM __THIS__'
where '__THIS__' represents the underlying table of the input dataset.
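The placeholder mechanism can be illustrated with Python's standard `sqlite3` module. This is a sketch of the substitution idea only, not the Spark transformer; the helper name, table name, and sample rows are made up.

```python
import sqlite3


def sql_transform(conn, table_name, statement):
    """Run a statement after binding __THIS__ to the input table's name."""
    return conn.execute(statement.replace("__THIS__", table_name)).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE input_data (id INTEGER, v1 REAL, v2 REAL)")
conn.executemany(
    "INSERT INTO input_data VALUES (?, ?, ?)",
    [(0, 1.0, 3.0), (2, 2.0, 5.0)],
)

# The statement is written once against __THIS__ and reused for any input.
rows = sql_transform(
    conn,
    "input_data",
    "SELECT id, v1 + v2 AS v3 FROM __THIS__ ORDER BY id",
)
```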
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7465 from yanboliang/spark-8345 and squashes the following commits:
b403fcb [Yanbo Liang] address comments
0d4bb15 [Yanbo Liang] a better transformSchema() implementation
51eb9e7 [Yanbo Liang] Add an SQL node as a feature transformer
|
|
|
|
|
|
|
|
|
|
|
|
| |
Adds method documentation back to `MultivariateOnlineSummarizer`, which was present in 1.4 but disappeared somewhere along the way to 1.5.
jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8045 from feynmanliang/SPARK-9755 and squashes the following commits:
af67fde [Feynman Liang] Add MultivariateOnlineSummarizer docs
|
|
|
|
|
|
|
|
|
|
|
|
| |
Small documentation cleanups, including:
* Adds documentation for `pi` and `theta`
* setParam to `setModelType`
Author: Feynman Liang <fliang@databricks.com>
Closes #8047 from feynmanliang/SPARK-9719 and squashes the following commits:
b372438 [Feynman Liang] Clean up naive bayes doc
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These should be made private until there is a public constructor for providing `rootNode: Node` to use these constructors.
jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8046 from feynmanliang/SPARK-9756 and squashes the following commits:
2cbdf08 [Feynman Liang] Make RFRegressionModel aux constructor private
a06f596 [Feynman Liang] Make constructors in ML decision trees private
|
|
|
|
|
|
|
|
|
|
| |
A minor typo (centriod -> centroid). Readable variable names help every user.
Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com>
Closes #8037 from BertrandDechoux/kmeans-typo and squashes the following commits:
47632fe [Bertrand Dechoux] centriod typo
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Resubmit of [https://github.com/apache/spark/pull/6906] for adding single-vec predict to GMMs
CC: dkobylarz mengxr
To be merged with master and branch-1.5
Primary author: dkobylarz
Author: Dariusz Kobylarz <darek.kobylarz@gmail.com>
Closes #8039 from jkbradley/gmm-predict-vec and squashes the following commits:
bfbedc4 [Dariusz Kobylarz] [SPARK-8481] [MLlib] GaussianMixtureModel predict accepting single vector
|