path: root/mllib
Commit message (Author, Date; Files changed, Lines -/+)

* [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method (Jason Lee, 2016-04-18; 1 file, -1/+4)

  ## What changes were proposed in this pull request?
  Added a windowSize getter/setter to ML/MLlib.

  ## How was this patch tested?
  Added test cases in tests.py under both ML and MLlib.

  Author: Jason Lee <cjlee@us.ibm.com>
  Closes #12428 from jasoncl/SPARK-14564.

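  A minimal sketch of the Scala-side setter that the new Python method mirrors; the window size value is an arbitrary illustration:

  ```scala
  import org.apache.spark.mllib.feature.Word2Vec

  val word2vec = new Word2Vec()
    .setVectorSize(100)
    .setWindowSize(10)  // context window width; the Python API now exposes this too
  ```
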
* [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support export/import (Xusen Yin, 2016-04-18; 1 file, -0/+7)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-14306
  Adds PySpark OneVsRest save/load support.

  ## How was this patch tested?
  Tested with Python unit tests.

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #12439 from yinxusen/SPARK-14306-0415.

* [MINOR] Remove inappropriate type notation and extra anonymous closure within functional transformations (hyukjinkwon, 2016-04-16; 2 files, -3/+4)

  ## What changes were proposed in this pull request?
  This PR removes:

  - Inappropriate type notations. For example, from

  ```scala
  words.foreachRDD { (rdd: RDD[String], time: Time) => ...
  ```

  to

  ```scala
  words.foreachRDD { (rdd, time) => ...
  ```

  - Extra anonymous closures within functional transformations. For example,

  ```scala
  .map(item => {
    ...
  })
  ```

  which can be written simply as:

  ```scala
  .map { item =>
    ...
  }
  ```

  It also corrects some obvious style nits.

  ## How was this patch tested?
  This was tested after adding rules in `scalastyle-config.xml`, which did not end up catching every case. The rules applied were:

  - For the first correction:

  ```xml
  <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">(?m)\.[a-zA-Z_][a-zA-Z0-9]*\(\s*[^,]+s*=>\s*\{[^\}]+\}\s*\)</parameter></parameters>
  </check>
  ```

  ```xml
  <check customId="NoExtraClosure" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]([^\n>,]+=>)?\s*\{([^()]|(?R))*\}^[,]</parameter></parameters>
  </check>
  ```

  - For the second correction:

  ```xml
  <check customId="TypeNotation" level="error" class="org.scalastyle.file.RegexChecker" enabled="true">
    <parameters><parameter name="regex">\.[a-zA-Z_][a-zA-Z0-9]*\s*[\{|\(]\s*\([^):]*:R))*\}^[,]</parameter></parameters>
  </check>
  ```

  **Those rules were not added.**

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #12413 from HyukjinKwon/SPARK-style.

* [SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm for more family and link functions (Yanbo Liang, 2016-04-15; 1 file, -6/+46)

  ## What changes were proposed in this pull request?
  Expose R-like summary statistics in SparkR::glm for more family and link functions.
  Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; we only provide the most commonly used statistics in this PR. More statistics can be added in follow-up work.

  ## How was this patch tested?
  Unit tests.

  SparkR output:
  ```
  Deviance Residuals:
  (Note: These are approximate quantiles with relative error <= 0.01)
       Min       1Q   Median       3Q      Max
  -0.95096 -0.16585 -0.00232  0.17410  0.72918

  Coefficients:
                      Estimate  Std. Error  t value  Pr(>|t|)
  (Intercept)          1.6765   0.23536      7.1231  4.4561e-11
  Sepal_Length         0.34988  0.046301     7.5566  4.1873e-12
  Species_versicolor  -0.98339  0.072075   -13.644   0
  Species_virginica   -1.0075   0.093306   -10.798   0

  (Dispersion parameter for gaussian family taken to be 0.08351462)

      Null deviance: 28.307  on 149  degrees of freedom
  Residual deviance: 12.193  on 146  degrees of freedom
  AIC: 59.22

  Number of Fisher Scoring iterations: 1
  ```

  R output:
  ```
  Deviance Residuals:
       Min       1Q   Median       3Q      Max
  -0.95096 -0.16522  0.00171  0.18416  0.72918

  Coefficients:
                     Estimate Std. Error t value Pr(>|t|)
  (Intercept)         1.67650    0.23536   7.123 4.46e-11 ***
  Sepal.Length        0.34988    0.04630   7.557 4.19e-12 ***
  Speciesversicolor  -0.98339    0.07207 -13.644  < 2e-16 ***
  Speciesvirginica   -1.00751    0.09331 -10.798  < 2e-16 ***
  ---
  Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

  (Dispersion parameter for gaussian family taken to be 0.08351462)

      Null deviance: 28.307  on 149  degrees of freedom
  Residual deviance: 12.193  on 146  degrees of freedom
  AIC: 59.217

  Number of Fisher Scoring iterations: 2
  ```

  cc mengxr

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12393 from yanboliang/spark-13925.

* [SPARK-14370][MLLIB] removed duplicate generation of ids in OnlineLDAOptimizer (Pravin Gadakh, 2016-04-15; 2 files, -11/+10)

  ## What changes were proposed in this pull request?
  Removed duplicated generation of `ids` in OnlineLDAOptimizer.

  ## How was this patch tested?
  Tested with existing unit tests.

  Author: Pravin Gadakh <prgadakh@in.ibm.com>
  Closes #12176 from pravingadakh/SPARK-14370.

* [SPARK-14549][ML] Copy the Vector and Matrix classes from mllib to ml in mllib-local (DB Tsai, 2016-04-15; 1 file, -2/+2)

  ## What changes were proposed in this pull request?
  This task copies the Vector and Matrix classes from the mllib package to the ml package in the mllib-local jar. The UDTs and `Since` annotations are removed from the ml vector and matrix classes for now: UDTs will be handled by SPARK-14487, and `Since` will be replaced by standard `/* Since 1.2.0 */` Javadoc. The BLAS implementation is copied, and some of the test utilities are copied as well.

  Summary of changes:
  1. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/BLAS.scala
     - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/BLAS.scala
     - logDebug("gemm: alpha is equal to 0 and beta is equal to 1. Returning C.") is removed in the ml version.
  2. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Matrices.scala
     - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Matrices.scala
     - `Since` was removed; we'll use standard `/* Since */` Javadoc. Will be in another PR.
     - `UDT`-related code was removed; SPARK-13944 (https://github.com/apache/spark/pull/12259) will replace the annotation.
  3. In mllib-local/src/main/scala/org/apache/spark/**ml**/linalg/Vectors.scala
     - Copied from mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
     - `Since` was removed.
     - `UDT`-related code was removed.
     - `def parseNumeric` was throwing `throw new SparkException(s"Cannot parse $other.")`; it now throws `throw new IllegalArgumentException(s"Cannot parse $other.")`.
  4. In mllib/src/main/scala/org/apache/spark/**mllib**/linalg/Vectors.scala
     - For consistency with the ML version of vector, `def parseNumeric` now throws `throw new IllegalArgumentException(s"Cannot parse $other.")`.
  5. mllib/src/main/scala/org/apache/spark/**mllib**/util/NumericParser.scala is moved to mllib-local/src/main/scala/org/apache/spark/**ml**/util/NumericParser.scala
     - All the `throw new SparkException` calls were replaced by `throw new IllegalArgumentException`.

  ## How was this patch tested?
  Unit tests.

  Author: DB Tsai <dbt@netflix.com>
  Closes #12317 from dbtsai/dbtsai-ml-vector.

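  A minimal sketch of the new `org.apache.spark.ml.linalg` package described above (the values are arbitrary illustrations):

  ```scala
  import org.apache.spark.ml.linalg.{Matrices, Vectors}

  val dense  = Vectors.dense(1.0, 0.0, 3.0)
  val sparse = Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))   // size, indices, values
  val matrix = Matrices.dense(2, 2, Array(1.0, 2.0, 3.0, 4.0))   // column-major entries
  ```
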
* [SPARK-12869] Implemented an improved version of the toIndexedRowMatrix (Fokko Driesprong, 2016-04-14; 2 files, -8/+57)

  Hi guys, I've implemented an improved version of the `toIndexedRowMatrix` function on the `BlockMatrix`. I needed this for a project, but would like to share it with the rest of the community. In the case of dense matrices, it can increase performance up to 19 times: https://github.com/Fokko/BlockMatrixToIndexedRowMatrix

  If there are any questions or suggestions, please let me know. Keep up the good work! Cheers.

  Author: Fokko Driesprong <f.driesprong@catawiki.nl>
  Author: Fokko Driesprong <fokko@driesprongen.nl>
  Closes #10839 from Fokko/master.

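  A minimal sketch of the conversion this PR speeds up, assuming a SparkContext `sc` (the matrix entries are arbitrary illustrations):

  ```scala
  import org.apache.spark.mllib.linalg.distributed.{CoordinateMatrix, MatrixEntry}

  val entries = sc.parallelize(Seq(MatrixEntry(0, 0, 1.0), MatrixEntry(1, 1, 2.0)))
  val blockMatrix = new CoordinateMatrix(entries).toBlockMatrix().cache()
  val indexedRows = blockMatrix.toIndexedRowMatrix()   // the improved conversion
  ```
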
* [SPARK-14565][ML] RandomForest should use parseInt and parseDouble for feature subset size instead of regexes (Yong Tang, 2016-04-14; 4 files, -13/+25)

  ## What changes were proposed in this pull request?
  This fix changes RandomForest's supported strategies from using regexes to using parseInt and parseDouble, for robustness and maintainability.

  ## How was this patch tested?
  Existing tests passed.

  Author: Yong Tang <yong.tang.github@outlook.com>
  Closes #12360 from yongtang/SPARK-14565.

* [SPARK-14618][ML][DOC] Updated RegressionEvaluator.metricName param doc (Joseph K. Bradley, 2016-04-14; 1 file, -4/+5)

  ## What changes were proposed in this pull request?
  In Spark 1.4, we negated some metrics from RegressionEvaluator since CrossValidator always maximized metrics. This was fixed in 1.5, but the docs were not updated. This PR updates the docs.

  ## How was this patch tested?
  No tests.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12377 from jkbradley/regeval-doc.

* [SPARK-14612][ML] Consolidate the version of dependencies in mllib and mllib-local into one place (Sean Owen, 2016-04-14; 1 file, -13/+0)

  ## What changes were proposed in this pull request?
  Move the json4s and breeze dependency declarations into the parent POM.

  ## How was this patch tested?
  Should be no functional change, but Jenkins tests will verify that.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #12390 from srowen/SPARK-14612.

* [SPARK-14375][ML] Unit test for spark.ml KMeansSummary (Yanbo Liang, 2016-04-13; 3 files, -8/+47)

  ## What changes were proposed in this pull request?
  * Modify the ```KMeansSummary.clusterSizes``` method to make it robust to empty clusters.
  * Add a unit test for spark.ml ```KMeansSummary```.
  * Add Since tags.

  ## How was this patch tested?
  Unit tests.

  cc jkbradley

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12254 from yanboliang/spark-14375.

* [SPARK-14461][ML] GLM training summaries should provide solver (Yanbo Liang, 2016-04-13; 2 files, -3/+11)

  ## What changes were proposed in this pull request?
  GLM training summaries should provide the solver used.

  ## How was this patch tested?
  Unit tests.

  cc jkbradley

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12253 from yanboliang/spark-14461.

* [SPARK-10386][MLLIB] PrefixSpanModel supports save/load (Yanbo Liang, 2016-04-13; 3 files, -1/+163)

  ```PrefixSpanModel``` supports ```save/load```. It's similar to #9267.

  cc jkbradley

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10664 from yanboliang/spark-10386.

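  A minimal sketch of the new persistence calls, assuming a fitted `model: PrefixSpanModel[String]` and a SparkContext `sc` (the path is illustrative):

  ```scala
  import org.apache.spark.mllib.fpm.PrefixSpanModel

  model.save(sc, "/tmp/prefixspan-model")   // mllib-style save takes the SparkContext
  val sameModel = PrefixSpanModel.load(sc, "/tmp/prefixspan-model")
  ```
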
* [SPARK-13783][ML] Model export/import for spark.ml: GBTs (Yanbo Liang, 2016-04-13; 8 files, -137/+262)

  ## What changes were proposed in this pull request?
  * Added save/load for ```GBTClassifier/GBTClassificationModel/GBTRegressor/GBTRegressionModel```.
  * Meanwhile, modified ```EnsembleModelReadWrite.saveImpl/loadImpl``` to support saving/loading ```treeWeights```.

  ## How was this patch tested?
  Adds standard unit tests for GBT save/load.

  cc jkbradley GayathriMurali

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12230 from yanboliang/spark-13783.

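  A minimal sketch of the spark.ml persistence pattern this enables for GBTs, assuming a training DataFrame `trainingData` with label/features columns (the path is illustrative):

  ```scala
  import org.apache.spark.ml.classification.{GBTClassificationModel, GBTClassifier}

  val gbtModel = new GBTClassifier().setMaxIter(10).fit(trainingData)
  gbtModel.write.overwrite().save("/tmp/gbt-model")
  val loaded = GBTClassificationModel.load("/tmp/gbt-model")
  ```
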
* [SPARK-14568][ML] Instrumentation framework for logistic regression (Timothy Hunter, 2016-04-13; 2 files, -1/+127)

  ## What changes were proposed in this pull request?
  This adds extra logging information about a `LogisticRegression` estimator when it is fit on a dataset. With this PR, you see the following extra lines when running the example in the documentation:

  ```
  16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training: numPartitions=1 storageLevel=StorageLevel(disk=true, memory=true, offheap=false, deserialized=true, replication=1)
  16/04/13 07:19:00 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): {"regParam":0.3,"elasticNetParam":0.8,"maxIter":10}
  ...
  16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numClasses=2
  16/04/12 11:48:07 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_a89eb23cb386-358781145):numFeatures=692
  ...
  16/04/13 07:19:01 INFO Instrumentation: Instrumentation(LogisticRegression-logreg_55dd3c09f164-1230977381-1): training finished
  ```

  ## How was this patch tested?
  This PR was manually tested.

  Author: Timothy Hunter <timhunter@databricks.com>
  Closes #12331 from thunterdb/1604-instrumentation.

* Revert "[SPARK-14154][MLLIB] Simplify the implementation for ↵Xiangrui Meng2016-04-131-4/+73
| | | | | | Kolmogorov–Smirnov test" This reverts commit d2a819a6363190b946986ebf6f8001d520098c3b.
* [MINOR][SQL] Remove some unused imports in datasources. (hyukjinkwon, 2016-04-13; 1 file, -3/+0)

  ## What changes were proposed in this pull request?
  It looks like several recent commits for datasources (maybe while removing the old `HadoopFsRelation` interface) missed removing some unused imports. This PR removes some unused imports in datasources.

  ## How was this patch tested?
  `sbt scalastyle` and some unit tests for them.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #12326 from HyukjinKwon/minor-imports.

* [SPARK-14147][ML][SPARKR] SparkR predict should not output feature column (Yanbo Liang, 2016-04-12; 2 files, -2/+4)

  ## What changes were proposed in this pull request?
  SparkR does not support the vector type, which is the default type of the feature column in ML, and R's predict does not output an intermediate feature column either. So SparkR ```predict``` should not output the feature column. In this PR, I only fix this issue for ```naiveBayes``` and ```survreg```. ```kmeans``` already has the right code route, and ```glm``` will be fixed in the SparkRWrapper refactor (#12294).

  ## How was this patch tested?
  No new tests.

  cc mengxr shivaram

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #11958 from yanboliang/spark-14147.

* [SPARK-14563][ML] use a random table name instead of __THIS__ in SQLTransformer (Xiangrui Meng, 2016-04-12; 2 files, -4/+16)

  ## What changes were proposed in this pull request?
  Use a random table name instead of `__THIS__` in SQLTransformer, and add a test for `transformSchema`. The problems with using `__THIS__` are:
  * It doesn't work under HiveContext (in Spark 1.6).
  * Race conditions.

  ## How was this patch tested?
  * Manual test with HiveContext.
  * Added a unit test for `transformSchema` to improve coverage.

  cc: yhuai

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12330 from mengxr/SPARK-14563.

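  User code is unaffected: statements still reference the `__THIS__` placeholder, which is now substituted with a randomly named temp table internally. A minimal sketch, with illustrative column names and an input DataFrame `df`:

  ```scala
  import org.apache.spark.ml.feature.SQLTransformer

  val sqlTrans = new SQLTransformer()
    .setStatement("SELECT *, (v1 + v2) AS v3 FROM __THIS__")
  val transformed = sqlTrans.transform(df)
  ```
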
* [SPARK-13322][ML] AFTSurvivalRegression supports feature standardization (Yanbo Liang, 2016-04-12; 2 files, -34/+93)

  ## What changes were proposed in this pull request?
  AFTSurvivalRegression should support feature standardization; it improves the convergence rate. Testing the convergence rate on the [Ovarian](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/ovarian.html) data, which is standard data that comes with the Survival library in R:
  * without standardization (before this PR) -> 74 iterations.
  * with standardization (after this PR) -> 38 iterations.

  After this fix, training converges to the same solution with or without ```standardization```; that is, ```standardization = false``` runs the same code route as ```standardization = true```, because if the features are not standardized at all, convergence issues can result when the features have very different scales. This behavior is the same as ML [```LinearRegression``` and ```LogisticRegression```](https://issues.apache.org/jira/browse/SPARK-8522). See more discussion about this topic at #11247.

  cc mengxr

  ## How was this patch tested?
  Unit tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #11365 from yanboliang/spark-13322.

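  A minimal sketch of training an AFT model that now benefits from internal standardization, assuming a DataFrame `trainingData` with the usual label/censor/features columns:

  ```scala
  import org.apache.spark.ml.regression.AFTSurvivalRegression

  val aft = new AFTSurvivalRegression()
    .setFeaturesCol("features")
    .setCensorCol("censor")
  val aftModel = aft.fit(trainingData)   // features are standardized internally during optimization
  ```
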
* [SPARK-12566][SPARK-14324][ML] GLM model family, link function support in SparkR:::glm (Yanbo Liang, 2016-04-12; 2 files, -115/+79)

  * SparkR glm supports families and link functions which match R's signature for family.
  * SparkR glm API refactor. The comparative standard of the new API is R's glm, so I only expose the arguments that R's glm supports: ```formula, family, data, epsilon and maxit```.
  * This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this gets in.
  * This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.

  Unit tests.

  cc mengxr jkbradley hhbyyh

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12294 from yanboliang/spark-12566.

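  A minimal sketch of the Scala-side estimator that backs this R API (the family/link values and `trainingData` are illustrative):

  ```scala
  import org.apache.spark.ml.regression.GeneralizedLinearRegression

  val glr = new GeneralizedLinearRegression()
    .setFamily("binomial")
    .setLink("logit")
    .setMaxIter(25)
  val glrModel = glr.fit(trainingData)
  ```
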
* [SPARK-3724][ML] RandomForest: More options for feature subset size. (Yong Tang, 2016-04-12; 6 files, -3/+95)

  ## What changes were proposed in this pull request?
  This PR supports more options for feature subset size in the RandomForest implementation. Previously, RandomForest only supported "auto", "all", "sqrt", "log2", and "onethird". This PR supports any given value to allow model search. In this PR, `featureSubsetStrategy` can be passed: a) a real number in the range `(0.0, 1.0]` that represents the fraction of the number of features in each subset, or b) an integer number (`>0`) that represents the number of features in each subset.

  ## How was this patch tested?
  Two tests, `JavaRandomForestClassifierSuite` and `JavaRandomForestRegressorSuite`, have been updated to check the additional options for params in this PR. An additional test has been added to `org.apache.spark.mllib.tree.RandomForestSuite` to cover the cases in this PR.

  Author: Yong Tang <yong.tang.github@outlook.com>
  Closes #11989 from yongtang/SPARK-3724.

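  A sketch of the new values, using the spark.ml estimator (the specific numbers are illustrative):

  ```scala
  import org.apache.spark.ml.classification.RandomForestClassifier

  val rfHalf = new RandomForestClassifier().setFeatureSubsetStrategy("0.5")  // half of the features per subset
  val rfTen  = new RandomForestClassifier().setFeatureSubsetStrategy("10")   // exactly 10 features per subset
  ```
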
* [SPARK-14508][BUILD] Add a new ScalaStyle Rule `OmitBracesInCase` (Dongjoon Hyun, 2016-04-12; 5 files, -14/+7)

  ## What changes were proposed in this pull request?
  According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) and the [Scala Style Guide](http://docs.scala-lang.org/style/control-structures.html#curlybraces), we had better enforce the following rule:
  ```
  case: Always omit braces in case clauses.
  ```
  This PR adds a new ScalaStyle rule, `OmitBracesInCase`, and enforces it on the code.

  ## How was this patch tested?
  Pass the Jenkins tests (including Scala style checking).

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12280 from dongjoon-hyun/SPARK-14508.

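  An illustrative before/after of the style the rule enforces (the match itself is a made-up example):

  ```scala
  // Before: braces around a case clause body.
  def describeBefore(n: Int): String = n match {
    case 0 => {
      "zero"
    }
    case _ => "nonzero"
  }

  // After: braces omitted, per OmitBracesInCase.
  def describeAfter(n: Int): String = n match {
    case 0 =>
      "zero"
    case _ => "nonzero"
  }
  ```
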
* [SPARK-14535][SQL] Remove buildInternalScan from FileFormat (Wenchen Fan, 2016-04-11; 1 file, -33/+1)

  ## What changes were proposed in this pull request?
  Now that `HadoopFsRelation` with all kinds of file formats can be handled in `FileSourceStrategy`, we can remove the branches for `HadoopFsRelation` in `FileSourceStrategy` and the `buildInternalScan` API from `FileFormat`.

  ## How was this patch tested?
  Existing tests.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #12300 from cloud-fan/remove.

* [MINOR][ML] Fixed MLlib build warnings (Joseph K. Bradley, 2016-04-12; 1 file, -0/+3)

  ## What changes were proposed in this pull request?
  Fixes to eliminate warnings during package and doc builds.

  ## How was this patch tested?
  Existing unit tests.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12263 from jkbradley/warning-cleanups.

* [SPARK-14298][ML][MLLIB] Add unit test for EM LDA disable checkpointing (Yanbo Liang, 2016-04-11; 1 file, -0/+11)

  ## What changes were proposed in this pull request?
  This is a follow-up for #12089: add a unit test for EM LDA that tests disabling checkpointing when ```checkpointInterval = -1``` is set.

  ## How was this patch tested?
  Unit test.

  cc jkbradley

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12286 from yanboliang/spark-14298-followup.

* [SPARK-13600][MLLIB] Use approxQuantile from DataFrame stats in QuantileDiscretizer (Oliver Pierson, 2016-04-11; 2 files, -169/+65)

  ## What changes were proposed in this pull request?
  QuantileDiscretizer can return an unexpected number of buckets in certain cases. This PR proposes to fix this issue and also refactor QuantileDiscretizer to use approxQuantile from the DataFrame stats functions.

  ## How was this patch tested?
  QuantileDiscretizerSuite unit tests (some existing tests will change or even be removed in this PR).

  Author: Oliver Pierson <ocp@gatech.edu>
  Closes #11553 from oliverpierson/SPARK-13600.

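  A minimal sketch of the DataFrame stats function the discretizer now delegates to, assuming a DataFrame `df` with a numeric column `hour` (the column name and probabilities are illustrative):

  ```scala
  // approxQuantile(column, probabilities, relativeError) returns approximate quantiles.
  val splits = df.stat.approxQuantile("hour", Array(0.25, 0.5, 0.75), 0.01)
  ```
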
* [SPARK-14462][ML][MLLIB] Add the mllib-local build to maven pom (DB Tsai, 2016-04-11; 1 file, -0/+12)

  ## What changes were proposed in this pull request?
  In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR creates a new jar called mllib-local with minimal dependencies. The previous PR was failing the build because of the `spark-core:test` dependency, and that was reverted. In this PR, `FunSuite` with `// scalastyle:ignore funsuite` is used in the mllib-local tests, similar to sketch. Thanks.

  ## How was this patch tested?
  Unit tests.

  mengxr tedyu holdenk

  Author: DB Tsai <dbt@netflix.com>
  Closes #12298 from dbtsai/dbtsai-mllib-local-build-fix.

* [SPARK-14510][MLLIB] Add args-checking for LDA and StreamingKMeans (Zheng RuiFeng, 2016-04-11; 2 files, -3/+17)

  ## What changes were proposed in this pull request?
  Add argument checking for LDA and StreamingKMeans.

  ## How was this patch tested?
  Manual tests.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12062 from zhengruifeng/initmodel.

* [SPARK-14500][ML] Accept Dataset[_] instead of DataFrame in MLlib APIs (Xiangrui Meng, 2016-04-11; 73 files, -237/+293)

  ## What changes were proposed in this pull request?
  This PR updates MLlib APIs to accept `Dataset[_]` as input where `DataFrame` was the input type. This PR doesn't change the output type. In Java, `Dataset[_]` maps to `Dataset<?>`, which includes `Dataset<Row>`. Some implementations were changed in order to return `DataFrame`. Tests and examples were updated. Note that this is a breaking change for subclasses of Transformer/Estimator. Lol, we don't have to rename the input argument, which has been `dataset` since Spark 1.2.

  TODOs:
  - [x] update MiMaExcludes (seems all covered by explicit filters from SPARK-13920)
  - [x] Python
  - [x] add a new test to accept Dataset[LabeledPoint]
  - [x] remove unused imports of Dataset

  ## How was this patch tested?
  Existing unit tests with some modifications.

  cc: rxin jkbradley

  Author: Xiangrui Meng <meng@databricks.com>
  Closes #12274 from mengxr/SPARK-14500.

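  A simplified sketch of the signature change (an illustrative class, not the actual Estimator source):

  ```scala
  import org.apache.spark.sql.Dataset

  // Before: def fit(dataset: DataFrame): M
  // After: any Dataset is accepted; in Java this is Dataset<?>, which includes Dataset<Row>.
  abstract class MyEstimator[M] {
    def fit(dataset: Dataset[_]): M
  }
  ```
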
* [SPARK-14497][ML] Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer (fwang1, 2016-04-10; 2 files, -13/+8)

  ## What changes were proposed in this pull request?
  Replace sortBy() with top() to calculate the top N frequent words for the dictionary.

  ## How was this patch tested?
  Existing unit tests. Terms with the same TF are sorted in descending order; the test would fail if the terms with the same TF were hardcoded in the dictionary, like "c", "d"...

  Author: fwang1 <desperado.wf@gmail.com>
  Closes #12265 from lionelfeng/master.

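  A sketch of the pattern: `RDD.top` takes the N largest elements with per-partition bounded queues instead of a full sort (the tiny word-count RDD is an arbitrary illustration, assuming a SparkContext `sc`):

  ```scala
  val wordCounts = sc.parallelize(Seq(("a", 5L), ("b", 2L), ("c", 9L)))
  // top-2 terms by count, avoiding sortBy's full shuffle-and-sort
  val topTerms = wordCounts.top(2)(Ordering.by[(String, Long), Long](_._2))
  ```
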
* Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom"Xiangrui Meng2016-04-091-12/+0
| | | | This reverts commit 1598d11bb0248384872cf88bc2b16f3b238046ad.
* [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom (DB Tsai, 2016-04-09; 1 file, -0/+12)

  ## What changes were proposed in this pull request?
  In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR creates a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.

  ## How was this patch tested?
  Unit tests.

  Author: DB Tsai <dbt@netflix.com>
  Closes #12241 from dbtsai/dbtsai-mllib-local-build.

* [SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param (wm624@hotmail.com, 2016-04-09; 2 files, -20/+34)

  ## What changes were proposed in this pull request?
  CountVectorizerModel has a binary toggle param. This PR adds the binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param to CountVectorizer, I moved the binary param into CountVectorizerParams, so the estimator inherits it.

  ## How was this patch tested?
  Added a new test case, which fits the model with the binary flag set to true and then checks that all of the trained model's non-zero counts are set to 1.0. All tests in CountVectorizerSuite.scala pass.

  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #12200 from wangmiao1981/binary_param.

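  A minimal sketch of the toggle now available on the estimator, assuming an input DataFrame `df` (the column names are illustrative):

  ```scala
  import org.apache.spark.ml.feature.CountVectorizer

  val cv = new CountVectorizer()
    .setInputCol("words")
    .setOutputCol("features")
    .setBinary(true)   // every non-zero term count becomes 1.0
  val cvModel = cv.fit(df)
  ```
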
* [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs (Joseph K. Bradley, 2016-04-08; 7 files, -40/+80)

  ## What changes were proposed in this pull request?
  Cleanups to documentation. No changes to code.
  * GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier, GBTRegressor.
  * GLM regParam: needs doc saying it is for L2 only.
  * TrainValidationSplitModel: add .. versionadded:: 2.0.0
  * Rename "_transformer_params_from_java" to "_transfer_params_from_java".
  * LogReg summary classes: "probability" col should not say "calibrated".
  * LR summaries: coefficientStandardErrors -> document that the intercept stderr comes last. Same for t,p-values.
  * approxCountDistinct: document the meaning of the "rsd" argument.
  * LDA: note which params are for online LDA only.

  ## How was this patch tested?
  Doc build.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12266 from jkbradley/ml-doc-cleanups.

* [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint (Yanbo Liang, 2016-04-08; 2 files, -3/+6)

  ## What changes were proposed in this pull request?
  In the doc of [```checkpointInterval```](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we tell users that they can disable checkpointing by setting ```checkpointInterval = -1```. But we did not actually handle this situation for LDA, so we should fix this bug.

  ## How was this patch tested?
  Existing tests.

  cc jkbradley

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12089 from yanboliang/spark-14298.

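  A minimal sketch of the behavior the fix makes real (k is an arbitrary illustration):

  ```scala
  import org.apache.spark.ml.clustering.LDA

  val lda = new LDA()
    .setK(10)
    .setCheckpointInterval(-1)   // now genuinely disables checkpointing
  ```
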
* [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizer (Joseph K. Bradley, 2016-04-07; 7 files, -32/+191)

  ## What changes were proposed in this pull request?
  The EMLDAOptimizer should generally not delete its last checkpoint, since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint). This PR adds a "deleteLastCheckpoint" option which defaults to false. This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default.

  This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option. This also:
  * Makes MLlibTestSparkContext extend TempDirectory and sets the checkpointDir to tempDir.
  * Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory).
  * Adds a MIMA exclude for the DistributedLDAModel constructor, which is already ```private[clustering]```.

  ## How was this patch tested?
  Added 2 new unit tests to spark.ml LDASuite, which calls into spark.mllib.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12166 from jkbradley/emlda-save-checkpoint.

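  A sketch under the assumption that spark.ml exposes the option as a `keepLastCheckpoint` param on the EM optimizer, per the title (the setter name is an assumption here):

  ```scala
  import org.apache.spark.ml.clustering.LDA

  val lda = new LDA()
    .setOptimizer("em")
    .setKeepLastCheckpoint(true)   // keep the last checkpoint so DistributedLDAModel methods stay safe
  ```
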
* [SPARK-14134][CORE] Change the package name used for shading classes. (Marcelo Vanzin, 2016-04-06; 1 file, -1/+1)

  The current package name uses a dash, which is a little weird but seemed to work. That is, until a new test tried to mock a class that references one of those shaded types, and then things started failing. Most changes are just noise to fix the logging configs.

  For reference, SPARK-8815 also raised this issue, although at the time it did not cause any issues in Spark, so it was not addressed.

  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #11941 from vanzin/SPARK-14134.

* [SPARK-12382][ML] Remove mllib GBT implementation and wrap ml (sethah, 2016-04-06; 6 files, -226/+207)

  ## What changes were proposed in this pull request?
  This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ml.

  Primary changes:
  * Removed the `boost` method in mllib GradientBoostedTrees.scala.
  * Created a new test suite, GradientBoostedTreesSuite, in ML, which contains unit tests that were specific to GBT internals from mllib.

  Other changes:
  * Added an `updatePrediction` method in the GradientBoostedTrees package. This method provides consistency for methods that build predictions from boosted models. Several methods hard-code the prediction as sum_{i=1}^{numTrees} (treePrediction * treeWeight); calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training.
  * Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part of the `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ml unit tests. They are left in mllib as well so as to not break the API.

  ## How was this patch tested?
  Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected.

  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #12050 from sethah/SPARK-12382.

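  A tiny sketch of the weighted-sum prediction rule mentioned above (illustrative names, not the PR's code):

  ```scala
  // prediction = sum over i of treePrediction_i * treeWeight_i
  def ensemblePredict(treePredictions: Array[Double], treeWeights: Array[Double]): Double =
    treePredictions.zip(treeWeights).map { case (p, w) => p * w }.sum
  ```
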
* [SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ScalaDoc-style multiline comments (Dongjoon Hyun, 2016-04-06; 1 file, -4/+6)

  ## What changes were proposed in this pull request?
  According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the following:
  ```
  /** In Spark, we don't use the ScalaDoc style so this
   * is not correct.
   */
  ```

  ## How was this patch tested?
  Pass the Jenkins tests (including `lint-scala`).

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12221 from dongjoon-hyun/SPARK-14444.

* [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression (Bryan Cutler, 2016-04-06; 2 files, -12/+37)

  ## What changes were proposed in this pull request?
  Adding a Python API for the training summaries of LogisticRegression and LinearRegression in PySpark ML.

  ## How was this patch tested?
  Added unit tests to exercise the API calls for the summary classes. Also manually verified that values are as expected and match those from Scala directly.

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.

* [SPARK-13538][ML] Add GaussianMixture to ML (Zheng RuiFeng, 2016-04-06; 2 files, -0/+444)

  JIRA: https://issues.apache.org/jira/browse/SPARK-13538

  ## What changes were proposed in this pull request?
  Add GaussianMixture and GaussianMixtureModel to the ML package.

  ## How was this patch tested?
  Unit tests and manual tests were done. Local Scalastyle checks passed.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11419 from zhengruifeng/mlgmm.

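  A minimal sketch of the new spark.ml estimator, assuming a DataFrame `dataset` with a `features` vector column (k and maxIter are illustrative):

  ```scala
  import org.apache.spark.ml.clustering.GaussianMixture

  val gmm = new GaussianMixture()
    .setK(2)
    .setMaxIter(100)
  val gmmModel = gmm.fit(dataset)
  ```
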
* [SPARK-14322][MLLIB] Use treeAggregate instead of reduce in OnlineLDAOptimizer (Yuhao Yang, 2016-04-06; 1 file, -2/+3)

  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-14322

  OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate, which can cause scalability issues. This should be an easy fix. This is also a bug, since it modifies the first argument to reduce, so we should use aggregate or treeAggregate. See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452 and a few lines below it.

  ## How was this patch tested?
  Unit tests.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #12106 from hhbyyh/ldaTreeReduce.

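  A sketch of the swap, assuming an `rdd: RDD[Array[Double]]` whose arrays all have length `dim`: treeAggregate supplies a fresh zero value, so mutating the accumulator never corrupts an RDD element the way mutating reduce's first argument can:

  ```scala
  val summed = rdd.treeAggregate(Array.fill(dim)(0.0))(
    seqOp = (acc, v) => { for (i <- acc.indices) acc(i) += v(i); acc },   // safe: acc is the local zero copy
    combOp = (a, b) => { for (i <- a.indices) a(i) += b(i); a })
  ```
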
* [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning (Xusen Yin, 2016-04-06; 3 files, -0/+29)

  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-13786
  Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.

  ## How was this patch tested?
  Tested with Python doctests.

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #12020 from yinxusen/SPARK-13786.

* [SPARK-14284][ML] KMeansSummary deprecating size; adding clusterSizes (Shally Sangal, 2016-04-05; 2 files, -2/+3)

  ## What changes were proposed in this pull request?
  KMeansSummary class: deprecated size and added clusterSizes.

  Author: Shally Sangal <shallysangal@gmail.com>
  Closes #12084 from shallys/master.

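  A one-line sketch of the renamed accessor, assuming a fitted spark.ml `kmeansModel`:

  ```scala
  val sizes = kmeansModel.summary.clusterSizes   // replaces the deprecated `size`
  ```
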
* [SPARK-14386][ML] Changed spark.ml ensemble trees methods to return concrete types (Joseph K. Bradley, 2016-04-04; 6 files, -19/+22)

  ## What changes were proposed in this pull request?
  In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the trees method, but they should not, since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types. This PR: return concrete types. The MIMA checks appear to be OK with this change.

  ## How was this patch tested?
  Existing unit tests.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12158 from jkbradley/hide-dtm.

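  A sketch of what the change means for callers, assuming a fitted `rfModel: RandomForestClassificationModel` (the exact element type is my reading of "concrete types"):

  ```scala
  import org.apache.spark.ml.classification.DecisionTreeClassificationModel

  // trees now comes back typed as the concrete per-tree model class
  val trees: Array[DecisionTreeClassificationModel] = rfModel.trees
  ```
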
* [SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor (Joseph K. Bradley, 2016-04-04; 9 files, -103/+424)

  ## What changes were proposed in this pull request?
  **Main change**: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below).

  Modified numTrees method (*deprecation*):
  * Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
  * What this PR does: Moves the method numTrees outside of trait TreeEnsembleModel, adds it to the GBT and RF Models, and deprecates it in the RF Models in favor of the new method getNumTrees. In Spark 2.1, we can have RF Models include the Param numTrees.

  Minor items:
  * Fixes bugs in the GBTClassificationModel and GBTRegressionModel fromOld methods where they assigned the wrong old UID.

  **Implementation details**
  * Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
  * Added an EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs. These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
  * Split trait RandomForestParams into parts in order to add more Estimator Params to RF models.
  * Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame. Same for DefaultParamsReader.loadMetadata.

  ## How was this patch tested?
  Adds standard unit tests for RF save/load.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: GayathriMurali <gayathri.m.softie@gmail.com>
  Closes #12118 from jkbradley/GayathriMurali-SPARK-13784.

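  A minimal sketch combining the new persistence with the renamed tree-count accessor, assuming a fitted `rfModel` (the path is illustrative):

  ```scala
  import org.apache.spark.ml.classification.RandomForestClassificationModel

  rfModel.write.overwrite().save("/tmp/rf-model")
  val restored = RandomForestClassificationModel.load("/tmp/rf-model")
  val n = restored.getNumTrees   // numTrees on RF models is now deprecated
  ```
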
* [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results (Dongjoon Hyun, 2016-04-03; 12 files, -17/+17)

  ## What changes were proposed in this pull request?
  This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines):
  - Fix typos (exception/log strings, testcase names, comments) in 44 lines.
  - Fix lint-java errors (MaxLineLength) in 6 lines. (New code after SPARK-14011.)
  - Use diamond operators in 40 lines. (New code after SPARK-13702.)
  - Fix redundant semicolons in 5 lines.
  - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.

  ## How was this patch tested?
  Manual and pass the Jenkins tests.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12139 from dongjoon-hyun/SPARK-14355.

* [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code. (Dongjoon Hyun, 2016-04-02; 5 files, -17/+21)

  ## What changes were proposed in this pull request?
  This PR aims to fix all Scala-style multiline comments into Java-style multiline comments in Scala code. (All comment-only changes over 77 files: +786 lines, -747 lines.)

  ## How was this patch tested?
  Manual.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.

* [MINOR] Typo fixes (Jacek Laskowski, 2016-04-02; 2 files, -2/+2)

  ## What changes were proposed in this pull request?
  Typo fixes. No functional changes.

  ## How was this patch tested?
  Built the sources and ran with samples.

  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #11802 from jaceklaskowski/typo-fixes.