| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
|
|
|
|
|
|
|
| |
Add Python API for ```MultilayerPerceptronClassifier```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8067 from yanboliang/SPARK-9773.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
between Scala and Python API.
"checkpointInterval" is a member of DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them.
```
member of DecisionTreeParams <-> Scala API
shared param for all ML Transformer/Estimator <-> Python API
```
Proposal:
"checkpointInterval" is also used by ALS, so we make it a shared param in Scala.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8528 from yanboliang/spark-10023.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is convenient to implement data source API for LIBSVM format to have a better integration with DataFrames and ML pipeline API.
Two options are implemented.
* `numFeatures`: Specify the dimension of features vector
* `featuresType`: Specify the type of output vector. `sparse` is default.
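To make the two options concrete, here is a minimal sketch in plain Python of how one LIBSVM line could be turned into a label and a feature vector. It is not the data source's implementation: the function name, the `num_features` and `vector_type` parameters, and the tuple used for the sparse form are all assumptions chosen for illustration.

```python
from typing import List, Tuple, Union

# Hypothetical sparse representation: (size, 0-based indices, values).
SparseVector = Tuple[int, List[int], List[float]]


def parse_libsvm_line(
    line: str, num_features: int, vector_type: str = "sparse"
) -> Tuple[float, Union[SparseVector, List[float]]]:
    """Parse '<label> <index>:<value> ...' where indices are 1-based."""
    parts = line.split()
    label = float(parts[0])
    indices: List[int] = []
    values: List[float] = []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # LIBSVM indices start at 1
        values.append(float(val))
    if vector_type == "sparse":
        return label, (num_features, indices, values)
    if vector_type == "dense":
        dense = [0.0] * num_features
        for i, v in zip(indices, values):
            dense[i] = v
        return label, dense
    raise ValueError("vector_type must be 'sparse' or 'dense'")
```

The sparse form keeps only the stored entries plus the declared dimension, which is why the dimension has to be supplied (or inferred) separately from the data lines.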
Author: lewuathe <lewuathe@me.com>
Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits:
986999d [lewuathe] Change unit test phrase
11d513f [lewuathe] Fix some reviews
21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
9ce63c7 [lewuathe] Rewrite service loader file
1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
4f40891 [lewuathe] Improve test suites
5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
8660d0e [lewuathe] Fix Java unit test
b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
2c12894 [lewuathe] Remove unnecessary tag
7d693c2 [lewuathe] Resolv conflict
62010af [lewuathe] Merge branch 'master' into SPARK-10117
a97ee97 [lewuathe] Fix some points
aef9564 [lewuathe] Fix
70ee4dd [lewuathe] Add Java test
3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
40d3027 [lewuathe] Add Java test
7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
|
|
|
|
|
|
|
|
|
|
|
| |
The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary.
But if the class parameters are used in methods, then fields are created, so it is safer to keep the annotations.
The remainder are fixes for some potential bugs and deprecated syntax.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #8433 from skyluc/issue/sbt-2.11.
|
|
|
|
|
|
|
|
| |
Adds IndexToString to PySpark.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
|
|
|
|
|
|
|
|
|
| |
Add WeibullGenerator for RandomDataGenerator.
#8611 needs WeibullGenerator to generate random data based on the Weibull distribution.
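As an illustration of what such a generator produces, the sketch below draws Weibull samples by inverse transform sampling in plain Python. It is not the `WeibullGenerator` code; the function name and the `shape`/`scale` parameter names are assumptions. It uses the CDF F(x) = 1 - exp(-(x / scale)^shape), inverted to x = scale * (-ln(1 - u))^(1 / shape) for uniform u.

```python
import math
import random


def weibull_sample(shape: float, scale: float, rng: random.Random) -> float:
    """Draw one Weibull(shape, scale) sample via the inverse CDF."""
    u = rng.random()  # uniform in [0, 1), so 1 - u is in (0, 1]
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


def weibull_samples(n: int, shape: float, scale: float, seed: int) -> list:
    """Draw n samples with a fixed seed so runs are reproducible."""
    rng = random.Random(seed)
    return [weibull_sample(shape, scale, rng) for _ in range(n)]
```

With `shape = 1` the distribution reduces to an exponential with mean `scale`, which gives a quick sanity check on the sample mean.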
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8622 from yanboliang/spark-10464.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet.
There are a couple of TODOs that can be addressed in future PRs:
* consolidate summary statistics aggregators
* move `dspr` to `BLAS`
* etc
It would be nice to have this merged first because it blocks a couple of other features.
dbtsai
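The normal equation approach mentioned above can be sketched in plain Python: minimize sum_i w_i (y_i - x_i^T beta)^2 by accumulating A = X^T W X and b = X^T W y, then solving A beta = b. This is only an illustration under stated assumptions: the function name is hypothetical, an intercept column is always added, and the linear system is solved with Gaussian elimination, which is a substitute chosen for brevity and not necessarily the solver used in the PR.

```python
from typing import List, Sequence


def weighted_least_squares(
    xs: Sequence[Sequence[float]], ys: Sequence[float], ws: Sequence[float]
) -> List[float]:
    """Return [intercept, coef_1, ...] solving the weighted normal equations."""
    rows = [[1.0] + list(x) for x in xs]  # prepend intercept column
    d = len(rows[0])
    a = [[0.0] * d for _ in range(d)]  # A = X^T W X
    b = [0.0] * d                      # b = X^T W y
    for row, y, w in zip(rows, ys, ws):
        for i in range(d):
            b[i] += w * row[i] * y
            for j in range(d):
                a[i][j] += w * row[i] * row[j]
    # Gaussian elimination with partial pivoting.
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            raise ValueError("normal equations are singular")
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for r in range(col + 1, d):
            f = a[r][col] / a[col][col]
            for c in range(col, d):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * d
    for i in reversed(range(d)):
        s = b[i] - sum(a[i][j] * beta[j] for j in range(i + 1, d))
        beta[i] = s / a[i][i]
    return beta
```

Because A and b are small (d by d and d) and are built by a single pass of sums over the rows, this formulation lends itself to distributed aggregation when the number of features is modest.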
Author: Xiangrui Meng <meng@databricks.com>
Closes #8588 from mengxr/SPARK-9834.
|
|
|
|
|
|
|
|
|
| |
Loader.checkSchema was called to verify the schema after dataframe.select(...).
Schema verification should be done before dataframe.select(...)
Author: Vinod K C <vinod.kc@huawei.com>
Closes #8636 from vinodkc/fix_GaussianMixtureModel_load_verification.
|
|
|
|
|
|
|
|
|
| |
Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent.
This PR fixes that and adds a test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8637 from yanboliang/spark-10470.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR fixes two model ```copy()``` related issues:
[SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480)
```ML.LinearRegressionModel.copy()``` ignored the ```extra``` argument, so parameters passed through it did not take effect.
[SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479)
```ML.LogisticRegressionModel.copy()``` should copy model summary if available.
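The two expectations described above (a copy honors `extra`, and a copy carries the summary along) can be shown with a small stand-in class in plain Python. The `Model` class below is entirely hypothetical and only mirrors the behavior being discussed; it is not the Spark ML class hierarchy.

```python
from typing import Any, Dict, List, Optional


class Model:
    """Hypothetical stand-in for a fitted model with params and a summary."""

    def __init__(
        self,
        coefficients: List[float],
        params: Optional[Dict[str, Any]] = None,
        summary: Optional[Any] = None,
    ) -> None:
        self.coefficients = coefficients
        self.params = dict(params or {})
        self.summary = summary

    def copy(self, extra: Optional[Dict[str, Any]] = None) -> "Model":
        # Start from the current params, then let `extra` override them,
        # so values passed by the caller actually take effect.
        merged = dict(self.params)
        merged.update(extra or {})
        # Carry the summary over when one is available.
        return Model(list(self.coefficients), merged, self.summary)
```

The original instance is left untouched, since both the parameter map and the coefficient list are copied before being modified.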
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8641 from yanboliang/linear-regression-copy.
|
|
|
|
|
|
|
|
| |
From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
|
|
|
|
|
|
|
|
| |
We should make sure the scaladoc for params includes their default values through the models in ml/
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
|
|
|
|
|
|
|
|
| |
Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.
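A minimal sketch of the idea in plain Python, assuming a hypothetical `Params` class backed by two dictionaries: instead of letting a raw dictionary lookup fail, the lookup raises an error that names the missing parameter. The class, method names, and message wording are illustrative and are not taken from the PR.

```python
from typing import Any, Dict, Optional


class Params:
    """Hypothetical parameter holder with explicit and default values."""

    def __init__(self, defaults: Optional[Dict[str, Any]] = None) -> None:
        self._param_map: Dict[str, Any] = {}
        self._default_map: Dict[str, Any] = dict(defaults or {})

    def set(self, name: str, value: Any) -> "Params":
        self._param_map[name] = value
        return self

    def get_or_default(self, name: str) -> Any:
        # Explicitly set values win over defaults.
        if name in self._param_map:
            return self._param_map[name]
        if name in self._default_map:
            return self._default_map[name]
        # Name the parameter so the failure is easy to diagnose.
        raise KeyError(
            "Failed to find a value or a default value for param '%s'" % name
        )
```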
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
|
|
|
|
|
|
|
|
| |
Add a python API for the Stop Words Remover.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
|
|
|
|
|
|
|
|
|
|
|
| |
new label at binary reduction
Currently OneVsRest uses a UDF to generate the new binary label during training.
Now that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise```, which will be more efficient.
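The binary reduction itself is simple: for each class, every row is relabeled as positive if its label equals that class and negative otherwise. The plain-Python sketch below shows only that relabeling rule, on a list of labels; the function names are hypothetical, and in the DataFrame setting the same rule would be expressed as a conditional column expression instead of a per-row function.

```python
from typing import Dict, List


def binary_labels(labels: List[float], positive_class: float) -> List[float]:
    """Relabel for one-vs-rest: 1.0 for the chosen class, 0.0 for the rest."""
    return [1.0 if y == positive_class else 0.0 for y in labels]


def all_binary_problems(labels: List[float]) -> Dict[float, List[float]]:
    """Build one binary label list per distinct class."""
    return {k: binary_labels(labels, k) for k in sorted(set(labels))}
```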
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8519 from yanboliang/spark-10349.
|
|
|
|
|
|
|
|
| |
This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8182 from mengxr/SPARK-9954.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
initialization
* do not cache first cost RDD
* change following cost RDD cache level to MEMORY_AND_DISK
* remove Vector wrapper to save an object per instance
Further improvements will be addressed in SPARK-10329
cc: yu-iskw HuJiayin
Author: Xiangrui Meng <meng@databricks.com>
Closes #8526 from mengxr/SPARK-10354.
|
|
|
|
|
|
|
|
|
|
|
|
| |
of matrix multiplications
mengxr jkbradley rxin
It would be great if this fix made it into RC3!
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8525 from brkyvz/blas-scaling.
|
|
|
|
|
|
|
|
|
| |
### JIRA
[[SPARK-10260] Add Since annotation to ml.clustering - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10260)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8455 from yu-iskw/SPARK-10260.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
compatibility test
* Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes #8436 from feynmanliang/SPARK-10230.
|
|
|
|
|
|
|
|
|
|
|
|
| |
`GeneralizedLinearModel` creates a cached RDD when building a model. It's inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache.
The proposed solution is to always cache the dataset and remove the warning. There's a caveat though: the input dataset gets evaluated twice, first in line 270 when fitting `StandardScaler`, and again when running the optimizer. So it might be worth restoring the removed warning.
Another possible solution is to disable caching entirely and restore the removed warning. I don't know which approach is better.
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes #8395 from SlavikBaranov/SPARK-10182.
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Replaces instances of `Lists.newArrayList` with `Arrays.asList`
* Uses `commons.lang.StringUtils` in place of `com.google.collections.Strings`
* Uses the `List` interface in place of `ArrayList` implementations
This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
Author: Feynman Liang <fliang@databricks.com>
Closes #8451 from feynmanliang/SPARK-10257.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7, which causes the build to fail with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases.
Build for 2.10:
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install
and 2.11:
./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install
Author: Jacek Laskowski <jacek@japila.pl>
Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
|
|
|
|
|
|
|
|
| |
JavaTests
Author: Feynman Liang <fliang@databricks.com>
Closes #8447 from feynmanliang/SPARK-10256.
|
|
|
|
|
|
| |
Author: Feynman Liang <fliang@databricks.com>
Closes #8446 from feynmanliang/SPARK-10255.
|
|
|
|
|
|
|
|
|
| |
* Replaces `com.google.common` dependencies with `java.util.Arrays`
* Small clean up in `JavaNormalizerSuite`
Author: Feynman Liang <fliang@databricks.com>
Closes #8445 from feynmanliang/SPARK-10254.
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.recommendation`.
cc srowen coderxiang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8432 from mengxr/SPARK-10241.
|
|
|
|
|
|
|
|
|
|
| |
I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs.
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8452 from mengxr/SPARK-9665.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.feature`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits:
0e8d658 [Xiangrui Meng] remove unnecessary comment
ad70b03 [Xiangrui Meng] update since versions in mllib.feature
|
|
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.regression`.
cc freeman-lab dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8426 from mengxr/SPARK-10235 and squashes the following commits:
6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.tree`.
cc jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #8442 from mengxr/SPARK-10236.
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.clustering`.
cc feynmanliang yu-iskw
Author: Xiangrui Meng <meng@databricks.com>
Closes #8435 from mengxr/SPARK-10234.
|
|
|
|
|
|
|
|
|
|
|
|
| |
and mllib.stat
The same as #8241 but for `mllib.stat` and `mllib.random`.
cc feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8439 from mengxr/SPARK-10242.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.linalg`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8440 from mengxr/SPARK-10238 and squashes the following commits:
b38437e [Xiangrui Meng] update since versions in mllib.linalg
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.evaluation`.
cc avulanov
Author: Xiangrui Meng <meng@databricks.com>
Closes #8423 from mengxr/SPARK-10233.
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Adds two new sections to LDA's user guide; one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <fliang@databricks.com>
Closes #8254 from feynmanliang/SPARK-9888.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
mllib.util
Same as #8421 but for `mllib.pmml` and `mllib.util`.
cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8430 from mengxr/SPARK-10239 and squashes the following commits:
a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
|
|
|
|
|
|
|
|
|
|
| |
StreamingLinearRegressionWithSGD.setConvergenceTol default value
Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc
Author: Feynman Liang <fliang@databricks.com>
Closes #8424 from feynmanliang/SPARK-9797.
|
|
|
|
|
|
|
|
|
|
| |
Same as #8421 but for `mllib.fpm`.
cc feynmanliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #8429 from mengxr/SPARK-10237.
|
|
|
|
|
|
|
|
|
| |
* Adds doc for the alias of `runMiniBatchSGD`, documenting the default value for `convergenceTol`
* Cleans up a note in code
Author: Feynman Liang <fliang@databricks.com>
Closes #8425 from feynmanliang/SPARK-9800.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Update `Since` annotation in `mllib.classification`:
1. add version to classes, objects, constructors, and public variables declared in constructors
2. correct some versions
3. remove `Since` on `toString`
MechCoder dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8421 from mengxr/SPARK-10231 and squashes the following commits:
b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
|
|
|
|
|
|
|
|
|
|
| |
See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770)
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8422 from feynmanliang/SPARK-10230.
|
|
|
|
|
|
|
|
|
|
|
|
| |
to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
|
|
|
|
|
|
|
|
|
|
|
|
| |
GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests.
This PR adds a unit test which checks this. It failed previously but works with this fix.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8370 from jkbradley/gmm-fix.
|
|
|
|
|
|
|
|
|
|
| |
Add user guide for `VectorSlicer`, along with a Java test suite and a Python version of VectorSlicer.
Note that the Python version does not yet support selecting by names.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #8267 from yinxusen/SPARK-9893.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Removed categorical feature info validation since it is no longer needed.
This is needed to make the ML user guide examples work (in another current PR).
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8367 from jkbradley/gbt-single-cat.
|
|
|
|
|
|
| |
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8352 from MechCoder/since.
|
|
|
|
|
|
|
|
|
|
| |
For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
CC: rotationsymmetry mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8329 from jkbradley/lda-topic-assignments.
|
|
|
|
|
|
| |
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8309 from MechCoder/tags_feature.
|
|
|
|
|
|
|
|
|
|
| |
Java test suite
Otherwise, setters do not return self type. jkbradley avulanov
Author: Xiangrui Meng <meng@databricks.com>
Closes #8342 from mengxr/SPARK-10138.
|