aboutsummaryrefslogtreecommitdiff
path: root/mllib
Commit message (Collapse)AuthorAgeFilesLines
...
* [SPARK-18628][ML] Update Scala param and Python param to have quoteskrishnakalyan32016-12-112-6/+6
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Updated Scala param and Python param to have quotes around the options making it easier for users to read. ## How was this patch tested? Manually checked the docstrings Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #16242 from krishnakalyan3/doc-string.
* [SPARK-3359][DOCS] Fix greater-than symbols in Javadoc to allow building ↵Michal Senkyr2016-12-106-6/+6
| | | | | | | | | | | | | | | | | | | | | | with Java 8 ## What changes were proposed in this pull request? The API documentation build was failing when using Java 8 due to incorrect character `>` in Javadoc. Replace `>` with literals in Javadoc to allow the build to pass. ## How was this patch tested? Documentation was built and inspected manually to ensure it still displays correctly in the browser ``` cd docs && jekyll serve ``` Author: Michal Senkyr <mike.senkyr@gmail.com> Closes #16201 from michalsenkyr/javadoc8-gt-fix.
* [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1Yanbo Liang2016-12-072-5/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues: * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle. * Fix ```spark.als``` params to make it consistent with MLlib. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16169 from yanboliang/spark-18326.
* [SPARK-18701][ML] Fix Poisson GLM failure due to wrong initializationactuaryzhang2016-12-072-10/+17
| | | | | | | | | | | | | | | | Poisson GLM fails for many standard data sets (see example in test or JIRA). The issue is incorrect initialization leading to almost zero probability and weights. Specifically, the mean is initialized as the response, which could be zero. Applying the log link results in very negative numbers (protected against -Inf), which again leads to close to zero probability and weights in the weighted least squares. Fix and test are included in the commits. ## What changes were proposed in this pull request? Update initialization in Poisson GLM ## How was this patch tested? Add test in GeneralizedLinearRegressionSuite srowen sethah yanboliang HyukjinKwon mengxr Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16131 from actuaryzhang/master.
* [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.Yanbo Liang2016-12-071-35/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Several cleanup and improvements for ```spark.logit```: * ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model. * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently. * SparkR test improvement: comparing the training result with native R glmnet. * Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users. ## How was this patch tested? Unit tests. The ```summary``` output after this change: multinomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > model <- spark.logit(df, Species ~ ., regParam = 0.5) > summary(model) $coefficients versicolor virginica setosa (Intercept) 1.514031 -2.609108 1.095077 Sepal_Length 0.02511006 0.2649821 -0.2900921 Sepal_Width -0.5291215 -0.02016446 0.549286 Petal_Length 0.03647411 0.1544119 -0.190886 Petal_Width 0.000236092 0.4195804 -0.4198165 ``` binomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > training <- df[df$Species %in% c("versicolor", "virginica"), ] > model <- spark.logit(training, Species ~ ., regParam = 0.5) > summary(model) $coefficients Estimate (Intercept) -6.053815 Sepal_Length 0.2449379 Sepal_Width 0.1648321 Petal_Length 0.4730718 Petal_Width 1.031947 ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #16117 from yanboliang/spark-18686.
* [SPARK-18374][ML] Incorrect words in StopWords/english.txtYuhao2016-12-072-27/+55
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently English stop words list in MLlib contains only the argumented words after removing all the apostrophes, so "wouldn't" become "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes. Adding original form to stop words list to match the behavior of Tokenizer and StopwordsRemover. Also remove "won" from list. see more discussion in the jira: https://issues.apache.org/jira/browse/SPARK-18374 ## How was this patch tested? existing ut Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #16103 from hhbyyh/addstopwords.
* [SPARK-18625][ML] OneVsRestModel should support setFeaturesCol and ↵Zheng RuiFeng2016-12-052-1/+22
| | | | | | | | | | | | | | setPredictionCol ## What changes were proposed in this pull request? add `setFeaturesCol` and `setPredictionCol` for `OneVsRestModel` ## How was this patch tested? added tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #16059 from zhengruifeng/ovrm_setCol.
* [SPARK-18695] Bump master branch version to 2.2.0-SNAPSHOTReynold Xin2016-12-021-1/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? This patch bumps master branch version to 2.2.0-SNAPSHOT. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #16126 from rxin/SPARK-18695.
* [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm ↵Yanbo Liang2016-12-021-71/+7
| | | | | | | | | | | | | | | | predict should output original label when family = binomial." ## What changes were proposed in this pull request? It's better we can fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, then they could output probabilities, log-space predictions, or original labels. In order to not involve breaking API change for 2.1, so revert this change firstly and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) resolved. ## How was this patch tested? Existing unit tests. This reverts commit daa975f4bfa4f904697bf3365a4be9987032e490. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16118 from yanboliang/spark-18291-revert.
* [SPARK-18658][SQL] Write text records directly to a FileOutputStreamNathan Howell2016-12-011-20/+8
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This replaces uses of `TextOutputFormat` with an `OutputStream`, which will either write directly to the filesystem or indirectly via a compressor (if so configured). This avoids intermediate buffering. The inverse of this (reading directly from a stream) is necessary for streaming large JSON records (when `wholeFile` is enabled) so I wanted to keep the read and write paths symmetric. ## How was this patch tested? Existing unit tests. Author: Nathan Howell <nhowell@godaddy.com> Closes #16089 from NathanHowell/SPARK-18658.
* [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should should support ↵wm624@hotmail.com2016-11-301-11/+26
| | | | | | | | | | | | | | | | | | output original label. ## What changes were proposed in this pull request? Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label. In this PR, original label output is supported and test cases are modified and added. Document is also modified. ## How was this patch tested? Unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15910 from wangmiao1981/audit.
* [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docsYanbo Liang2016-11-309-22/+28
| | | | | | | | | | | | ## What changes were proposed in this pull request? API review for 2.1, except ```LSH``` related classes which are still under development. ## How was this patch tested? Only doc changes, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16009 from yanboliang/spark-18318.
* [SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFunAnthony Truchet2016-11-301-0/+3
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix a broadcasted variable leak occurring at each invocation of CostFun in L-BFGS. ## How was this patch tested? UTests + check that fixed fatal memory consumption on Criteo's use cases. This contribution is made on behalf of Criteo S.A. (http://labs.criteo.com/) under the terms of the Apache v2 License. Author: Anthony Truchet <a.truchet@criteo.com> Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
* [SPARK-18319][ML][QA2.1] 2.1 QA: API: Experimental, DeveloperApi, final, ↵Yuhao2016-11-299-51/+7
| | | | | | | | | | | | | | | | | sealed audit ## What changes were proposed in this pull request? make a pass through the items marked as Experimental or DeveloperApi and see if any are stable enough to be unmarked. Also check for items marked final or sealed to see if they are stable enough to be opened up as APIs. Some discussions in the jira: https://issues.apache.org/jira/browse/SPARK-18319 ## How was this patch tested? existing ut Author: Yuhao <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #15972 from hhbyyh/experimental21.
* [SPARK-18592][ML] Move DT/RF/GBT Param setter methods to subclassesYanbo Liang2016-11-297-90/+260
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Mainly two changes: * Move DT/RF/GBT Param setter methods to subclasses. * Deprecate corresponding setter methods in the model classes. See discussion here https://github.com/apache/spark/pull/15913#discussion_r89662469. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16017 from yanboliang/spark-18592.
* [SPARK-18615][DOCS] Switch to multi-line doc to avoid a genjavadoc bug for ↵hyukjinkwon2016-11-2911-20/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | backticks ## What changes were proposed in this pull request? Currently, single line comment does not mark down backticks to `<code>..</code>` but prints as they are (`` `..` ``). For example, the line below: ```scala /** Return an RDD with the pairs from `this` whose keys are not in `other`. */ ``` So, we could work around this as below: ```scala /** * Return an RDD with the pairs from `this` whose keys are not in `other`. */ ``` - javadoc - **Before** ![2016-11-29 10 39 14](https://cloud.githubusercontent.com/assets/6477701/20693606/e64c8f90-b622-11e6-8dfc-4a029216e23d.png) - **After** ![2016-11-29 10 39 08](https://cloud.githubusercontent.com/assets/6477701/20693607/e7280d36-b622-11e6-8502-d2e21cd5556b.png) - scaladoc (this one looks fine either way) - **Before** ![2016-11-29 10 38 22](https://cloud.githubusercontent.com/assets/6477701/20693640/12c18aa8-b623-11e6-901a-693e2f6f8066.png) - **After** ![2016-11-29 10 40 05](https://cloud.githubusercontent.com/assets/6477701/20693642/14eb043a-b623-11e6-82ac-7cd0000106d1.png) I suspect this is related with SPARK-16153 and genjavadoc issue in ` typesafehub/genjavadoc#85`. ## How was this patch tested? I found them via ``` grep -r "\/\*\*.*\`" . | grep .scala ```` and then checked if each is in the public API documentation with manually built docs (`jekyll build`) with Java 7. Author: hyukjinkwon <gurwls223@gmail.com> Closes #16050 from HyukjinKwon/javadoc-markdown.
* [SPARK-3359][DOCS] Make javadoc8 working for unidoc/genjavadoc compatibility ↵hyukjinkwon2016-11-2961-208/+231
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in Java API documentation ## What changes were proposed in this pull request? This PR make `sbt unidoc` complete with Java 8. This PR roughly includes several fixes as below: - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` `` ```diff - * A column that will be computed based on the data in a [[DataFrame]]. + * A column that will be computed based on the data in a `DataFrame`. ``` - Fix throws annotations so that they are recognisable in javadoc - Fix URL links to `<a href="http..."></a>`. ```diff - * [[http://en.wikipedia.org/wiki/Decision_tree_learning Decision tree]] model for regression. + * <a href="http://en.wikipedia.org/wiki/Decision_tree_learning"> + * Decision tree (Wikipedia)</a> model for regression. ``` ```diff - * see http://en.wikipedia.org/wiki/Receiver_operating_characteristic + * see <a href="http://en.wikipedia.org/wiki/Receiver_operating_characteristic"> + * Receiver operating characteristic (Wikipedia)</a> ``` - Fix < to > to - `greater than`/`greater than or equal to` or `less than`/`less than or equal to` where applicable. - Wrap it with `{{{...}}}` to print them in javadoc or use `{code ...}` or `{literal ..}`. Please refer https://github.com/apache/spark/pull/16013#discussion_r89665558 - Fix `</p>` complaint ## How was this patch tested? Manually tested by `jekyll build` with Java 7 and 8 ``` java version "1.7.0_80" Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode) ``` ``` java version "1.8.0_45" Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) ``` Author: hyukjinkwon <gurwls223@gmail.com> Closes #16013 from HyukjinKwon/SPARK-3359-errors-more.
* [SPARK-18408][ML] API Improvements for LSHYun Ni2016-11-286-221/+306
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? (1) Change output schema to `Array of Vector` instead of `Vectors` (2) Use `numHashTables` as the dimension of Array (3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH` (4) Make `randUnitVectors/randCoefficients` private (5) Make Multi-Probe NN Search and `hashDistance` private for future discussion Saved for future PRs: (1) AND-amplification and `numHashFunctions` as the dimension of Vector are saved for a future PR. (2) `hashDistance` and MultiProbe NN Search needs more discussion. The current implementation is just a backward compatible one. ## How was this patch tested? Related unit tests are modified to make sure the performance of LSH are ensured, and the outputs of the APIs meets expectation. Author: Yun Ni <yunn@uber.com> Author: Yunni <Euler57721@gmail.com> Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
* [SPARK-18481][ML] ML 2.1 QA: Remove deprecated methods for MLYanbo Liang2016-11-2614-103/+78
| | | | | | | | | | | | ## What changes were proposed in this pull request? Remove deprecated methods for ML. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15913 from yanboliang/spark-18481.
* [SPARK-18356][ML] Improve MLKmeans PerformanceZakaria_Hili2016-11-251-4/+16
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Spark Kmeans fit() doesn't cache the RDD which generates a lot of warnings : WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached. So, Kmeans should cache the internal rdd before calling the Mllib.Kmeans algo, this helped to improve spark kmeans performance by 14% https://github.com/ZakariaHili/spark/commit/a9cf905cf7dbd50eeb9a8b4f891f2f41ea672472 hhbyyh ## How was this patch tested? Pass Kmeans tests and existing tests Author: Zakaria_Hili <zakahili@gmail.com> Author: HILI Zakaria <zakahili@gmail.com> Closes #15965 from ZakariaHili/zakbranch.
* [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will ↵hyukjinkwon2016-11-2569-310/+336
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | help unidoc/genjavadoc compatibility ## What changes were proposed in this pull request? This PR only tries to fix things that looks pretty straightforward and were fixed in other previous PRs before. This PR roughly fixes several things as below: - Fix unrecognisable class and method links in javadoc by changing it from `[[..]]` to `` `...` `` ``` [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found [error] * Loads text files and returns a {link DataFrame} whose schema starts with a string column named ``` - Fix an exception annotation and remove code backticks in `throws` annotation Currently, sbt unidoc with Java 8 complains as below: ``` [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text [error] * throws StreamingQueryException, if <code>this</code> query has terminated with an exception. ``` `throws` should specify the correct class name from `StreamingQueryException,` to `StreamingQueryException` without backticks. (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)). - Fix `[[http..]]` to `<a href="http..."></a>`. ```diff - * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle - * blog page]]. + * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https"> + * Oracle blog page</a>. ``` `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc. - It seems class can't have `return` annotation. So, two cases of this were removed. ``` [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return [error] * return New instance of IsotonicRegression. ``` - Fix < to `&lt;` and > to `&gt;` according to HTML rules. - Fix `</p>` complaint - Exclude unrecognisable in javadoc, `constructor`, `todo` and `groupname`. ## How was this patch tested? Manually tested by `jekyll build` with Java 7 and 8 ``` java version "1.7.0_80" Java(TM) SE Runtime Environment (build 1.7.0_80-b15) Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode) ``` ``` java version "1.8.0_45" Java(TM) SE Runtime Environment (build 1.8.0_45-b14) Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode) ``` Note: this does not yet make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15999 from HyukjinKwon/SPARK-3359-errors.
* [SPARK-18520][ML] Add missing setXXXCol methods for BisectingKMeansModel and ↵Zheng RuiFeng2016-11-242-0/+20
| | | | | | | | | | | | | | GaussianMixtureModel ## What changes were proposed in this pull request? add `setFeaturesCol` and `setPredictionCol` for BiKModel and GMModel add `setProbabilityCol` for GMModel ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15957 from zhengruifeng/bikm_set.
* [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear dataYanbo Liang2016-11-223-31/+90
| | | | | | | | | | | | | ## What changes were proposed in this pull request? * Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition. * Scala/Python GLM summary should throw exception if users get ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs". ## How was this patch tested? Add unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15930 from yanboliang/spark-18501.
* [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKMsethah2016-11-2112-34/+44
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark. ## How was this patch tested? Unit tests. Author: sethah <seth.hendrickson16@gmail.com> Closes #15777 from sethah/pyspark_cluster_summaries.
* [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in ↵sethah2016-11-201-62/+53
| | | | | | | | | | | | | | | | | | LogisticRegression training ## What changes were proposed in this pull request? This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing. Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark. ## How was this patch tested? This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch. Author: sethah <seth.hendrickson16@gmail.com> Closes #15893 from sethah/logreg_refactor.
* [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note ↵hyukjinkwon2016-11-1936-146/+185
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | that`/`'''Note:'''` across Scala/Java API documentation ## What changes were proposed in this pull request? It seems in Scala/Java, - `Note:` - `NOTE:` - `Note that` - `'''Note:'''` - `note` This PR proposes to fix those to `note` to be consistent. **Before** - Scala ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png) - Java ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png) **After** - Scala ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png) - Java ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png) ## How was this patch tested? The notes were found via ```bash grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:''' grep -v "// NOTE: " | \ # starting with // does not appear in API documentation. grep -E '.scala|.java' | \ # java/scala files grep -v Suite | \ # exclude tests grep -v Test | \ # exclude tests grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...` -e 'org.apache.spark.api.r' \ ... ``` ```bash grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:''' grep -v "// Note that " | \ # starting with // does not appear in API documentation. grep -E '.scala|.java' | \ # java/scala files grep -v Suite | \ # exclude tests grep -v Test | \ # exclude tests grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation -e 'org.apache.spark.api.java.function' \ -e 'org.apache.spark.api.r' \ ... ``` ```bash grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:''' grep -v "// Note: " | \ # starting with // does not appear in API documentation. grep -E '.scala|.java' | \ # java/scala files grep -v Suite | \ # exclude tests grep -v Test | \ # exclude tests grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation -e 'org.apache.spark.api.java.function' \ -e 'org.apache.spark.api.r' \ ... ``` ```bash grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:''' grep -v "// '''Note:''' " | \ # starting with // does not appear in API documentation. grep -E '.scala|.java' | \ # java/scala files grep -v Suite | \ # exclude tests grep -v Test | \ # exclude tests grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation -e 'org.apache.spark.api.java.function' \ -e 'org.apache.spark.api.r' \ ... ``` And then fixed one by one comparing with API documentation/access modifiers. After that, manually tested via `jekyll build`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15889 from HyukjinKwon/SPARK-18437.
* [SPARK-18480][DOCS] Fix wrong links for ML guide docsZheng RuiFeng2016-11-173-9/+9
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? 1, There are two `[Graph.partitionBy]` in `graphx-programming-guide.md`, the first one had no effert. 2, `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake. 3, `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessable, because class `PythonMLLibAPI` is private. 4, Other link updates. ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15912 from zhengruifeng/md_fix.
* [SPARK-17462][MLLIB]use VersionUtils to parse Spark version stringsVinceShieh2016-11-172-8/+4
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Several places in MLlib use custom regexes or other approaches to parse Spark versions. Those should be fixed to use the VersionUtils. This PR replaces custom regexes with VersionUtils to get Spark version numbers. ## How was this patch tested? Existing tests. Signed-off-by: VinceShieh vincent.xieintel.com Author: VinceShieh <vincent.xie@intel.com> Closes #15055 from VinceShieh/SPARK-17462.
* [SPARK-18434][ML] Add missing ParamValidations for ML algosZheng RuiFeng2016-11-166-10/+22
| | | | | | | | | | | ## What changes were proposed in this pull request? Add missing ParamValidations for ML algos ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15881 from zhengruifeng/arg_checking.
* [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.Yanbo Liang2016-11-161-27/+34
| | | | | | | | | | | | | ## What changes were proposed in this pull request? ```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers. BTW, I did some cleanup and improvement for ```spark.mlp```. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15883 from yanboliang/spark-18438.
* [SPARK-18166][MLLIB] Fix Poisson GLM bug due to wrong requirement of ↵actuaryzhang2016-11-142-2/+47
| | | | | | | | | | | | | | | response values ## What changes were proposed in this pull request? The current implementation of Poisson GLM seems to allow only positive values. This is incorrect since the support of Poisson includes the origin. The bug is easily fixed by changing the test of the Poisson variable from 'require(y **>** 0.0' to 'require(y **>=** 0.0'. mengxr srowen Author: actuaryzhang <actuaryzhang10@gmail.com> Author: actuaryzhang <actuaryzhang@uber.com> Closes #15683 from actuaryzhang/master.
* [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms ↵Yanbo Liang2016-11-135-38/+53
| | | | | | | | | | | | | | | | | | | | training on libsvm data ## What changes were proposed in this pull request? * Fix the following exceptions which throws when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) were fitted on libsvm data. ``` java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true. ``` See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug. * Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function. * Drop some unwanted columns when making prediction. ## How was this patch tested? Add unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15851 from yanboliang/spark-18412.
* [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayesYanbo Liang2016-11-122-39/+39
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? * Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it. * Avoid capturing the outer object for ```modelType```. * Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15826 from yanboliang/spark-14077-2.
* [SPARK-18060][ML] Avoid unnecessary computation for MLORsethah2016-11-121-51/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed the computation `numFeatures * numClasses` times. We only need to perform that computation `numFeatures` times, however. If we re-order the inner and outer loop, we can avoid this, but then we lose sequential memory access. In this patch, we instead lay out the coefficients in column major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model. ## How was this patch tested? This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below, and show significant speedups. ## Performance Tests **Setup** 3 node bare-metal cluster 120 cores total 384 gb RAM total **Results** NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for a single iteration of L-BFGS or OWL-QN. | | numPoints | numFeatures | numClasses | regParam | elasticNetParam | currentMasterTime (sec) | thisPatchTime (sec) | pctSpeedup | |----|-------------|---------------|--------------|------------|-------------------|---------------------------|-----------------------|--------------| | 0 | 1e+07 | 100 | 500 | 0.5 | 0 | 90 | 18 | 80 | | 1 | 1e+08 | 100 | 50 | 0.5 | 0 | 90 | 19 | 78 | | 2 | 1e+08 | 100 | 50 | 0.05 | 1 | 72 | 19 | 73 | | 3 | 1e+06 | 100 | 5000 | 0.5 | 0 | 93 | 53 | 43 | | 4 | 1e+07 | 100 | 5000 | 0.5 | 0 | 900 | 390 | 56 | | 5 | 1e+08 | 100 | 500 | 0.5 | 0 | 840 | 174 | 79 | | 6 | 1e+08 | 100 | 200 | 0.5 | 0 | 360 | 72 | 80 | | 7 | 1e+08 | 1000 | 5 | 0.5 | 0 | 9 | 3 | 66 | Author: sethah <seth.hendrickson16@gmail.com> Closes #15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.
* [SPARK-18401][SPARKR][ML] SparkR random forest should support output ↵Yanbo Liang2016-11-101-4/+24
| | | | | | | | | | | | | | original label. ## What changes were proposed in this pull request? SparkR ```spark.randomForest``` classification prediction should output original label rather than the indexed label. This issue is very similar with [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291). ## How was this patch tested? Add unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15842 from yanboliang/spark-18401.
* [SPARK-18268][ML][MLLIB] ALS fail with better message if ratings is empty rddSandeep Singh2016-11-104-0/+18
| | | | | | | | | | | | | ## What changes were proposed in this pull request? ALS.run fail with better message if ratings is empty rdd ALS.train and ALS.trainImplicit are also affected ## How was this patch tested? added new tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #15809 from techaddict/SPARK-18268.
* [SPARK-18239][SPARKR] Gradient Boosted Tree for RFelix Cheung2016-11-085-14/+326
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Gradient Boosted Tree in R. With a few minor improvements to RandomForest in R. Since this is relatively isolated I'd like to target this for branch-2.1 ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15746 from felixcheung/rgbt.
* [SPARK-17748][ML] Minor cleanups to one-pass linear regression with elastic netJoseph K. Bradley2016-11-083-12/+23
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? * Made SingularMatrixException private ml * WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0 ## How was this patch tested? existing tests Author: Joseph K. Bradley <joseph@databricks.com> Closes #15779 from jkbradley/wls-cleanups.
* [SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit testsHyukjin Kwon2016-11-071-6/+10
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Close `FileStreams`, `ZipFiles` etc to release the resources after using. Not closing the resources will cause IO Exception to be raised while deleting temp files. ## How was this patch tested? Existing tests Author: U-FAREAST\tl <tl@microsoft.com> Author: hyukjinkwon <gurwls223@gmail.com> Author: Tao LI <tl@microsoft.com> Closes #15618 from HyukjinKwon/SPARK-14914-1.
* [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label ↵Yanbo Liang2016-11-071-8/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when family = binomial. ## What changes were proposed in this pull request? SparkR ```spark.glm``` predict should output original label when family = "binomial". ## How was this patch tested? Add unit test. You can also run the following code to test: ```R training <- suppressWarnings(createDataFrame(iris)) training <- training[training$Species %in% c("versicolor", "virginica"), ] model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit")) showDF(predict(model, training)) ``` Before this change: ``` +------------+-----------+------------+-----------+----------+-----+-------------------+ |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label| prediction| +------------+-----------+------------+-----------+----------+-----+-------------------+ | 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| 0.8271421517601544| | 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| 0.6044595910413112| | 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| 0.7916340858281998| | 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|0.16080518180591158| | 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| 0.6112229217050189| | 5.7| 2.8| 4.5| 1.3|versicolor| 0.0| 0.2555087295500885| | 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| 0.5681507664364834| | 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|0.05990570219972002| | 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| 0.6644434078306246| | 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|0.11293577405862379| | 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|0.06152372321585971| | 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|0.35250697207602555| | 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|0.32267018290814303| | 6.1| 2.9| 4.7| 1.4|versicolor| 0.0| 0.433391153814592| | 5.6| 2.9| 3.6| 1.3|versicolor| 0.0| 0.2280744262436993| | 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| 0.7219848389339459| | 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|0.23527698971404695| | 5.8| 2.7| 4.1| 1.0|versicolor| 0.0| 0.285024533520016| | 6.2| 2.2| 4.5| 1.5|versicolor| 0.0| 0.4107047877447493| | 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|0.20083561961645083| +------------+-----------+------------+-----------+----------+-----+-------------------+ ``` After this change: ``` +------------+-----------+------------+-----------+----------+-----+----------+ |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label|prediction| +------------+-----------+------------+-----------+----------+-----+----------+ | 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| virginica| | 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| virginica| | 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| virginica| | 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|versicolor| | 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| virginica| | 5.7| 2.8| 4.5| 1.3|versicolor| 0.0|versicolor| | 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| virginica| | 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|versicolor| | 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| virginica| | 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|versicolor| | 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|versicolor| | 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|versicolor| | 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|versicolor| | 6.1| 2.9| 4.7| 1.4|versicolor| 0.0|versicolor| | 5.6| 2.9| 3.6| 1.3|versicolor| 0.0|versicolor| | 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| virginica| | 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|versicolor| | 5.8| 2.7| 4.1| 1.0|versicolor| 0.0|versicolor| | 6.2| 2.2| 4.5| 1.5|versicolor| 0.0|versicolor| | 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|versicolor| +------------+-----------+------------+-----------+----------+-----+----------+ ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #15788 from yanboliang/spark-18291.
* [SPARK-18210][ML] Pipeline.copy does not create an instance with the same UIDWojciech Szymanski2016-11-062-3/+21
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Motivation: `org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)` Solution: - fix for Pipeline UID - introduced new tests for `org.apache.spark.ml.Pipeline.copy` - minor improvements in test for `org.apache.spark.ml.PipelineModel.copy` ## How was this patch tested? Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"` Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"` Author: Wojciech Szymanski <wk.szymanski@gmail.com> Closes #15759 from wojtek-szymanski/SPARK-18210.
* [SPARK-18276][ML] ML models should copy the training summary and set parentsethah2016-11-0512-20/+62
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Only some of the models which contain a training summary currently set the summaries in the copy method. Linear/Logistic regression do, GLR, GMM, KM, and BKM do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent. ## How was this patch tested? Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary. Author: sethah <seth.hendrickson16@gmail.com> Closes #15773 from sethah/SPARK-18276.
* [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat ↵Sean Owen2016-11-021-2/+2
| | | | | | | | | | | | | | | to Locale.US ## What changes were proposed in this pull request? Fix `Locale.US` for all usages of `DateFormat`, `NumberFormat` ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15610 from srowen/SPARK-18076.
* [SPARK-18088][ML] Various ChiSqSelector cleanupsJoseph K. Bradley2016-11-015-121/+139
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? - Renamed kbest to numTopFeatures - Renamed alpha to fpr - Added missing Since annotations - Doc cleanups ## How was this patch tested? Added new standardized unit tests for spark.ml. Improved existing unit test coverage a bit. Author: Joseph K. Bradley <joseph@databricks.com> Closes #15647 from jkbradley/chisqselector-follow-ups.
* [SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fitZheng RuiFeng2016-11-019-11/+98
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? 1, move cast to `Predictor` 2, and then, remove unnecessary cast ## How was this patch tested? existing tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15414 from zhengruifeng/move_cast.
* [SPARK-18024][SQL] Introduce an internal commit protocol APIReynold Xin2016-10-311-10/+7
| | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch introduces an internal commit protocol API that is used by the batch data source to do write commits. It currently has only one implementation that uses Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits. ## How was this patch tested? Should be covered by existing write tests. Author: Reynold Xin <rxin@databricks.com> Author: Eric Liang <ekl@databricks.com> Closes #15707 from rxin/SPARK-18024-2.
* [SPARK-16137][SPARKR] randomForest for RFelix Cheung2016-10-303-0/+295
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Random Forest Regression and Classification for R Clean-up/reordering generics.R ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15607 from felixcheung/rrandomforest.
* [SPARK-3261][MLLIB] KMeans clusterer can return duplicate cluster centersSean Owen2016-10-303-65/+85
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Return potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #15450 from srowen/SPARK-3261.
* [SPARK-5992][ML] Locality Sensitive HashingYunni2016-10-286-0/+1208
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the [design doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit). Detailed changes are as follows: (1) Implement abstract LSH, LSHModel classes as Estimator-Model (2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel (3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance (4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin Things that will be implemented in a follow-up PR: - Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance - PySpark Integration for the scala classes and methods. ## How was this patch tested? Unit test is implemented for all the implemented classes and algorithms. A scalability test on Uber's dataset was performed internally. Tested the methods on [WEX dataset](https://aws.amazon.com/items/2345) from AWS, with the steps and results [here](https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro/edit). ## References Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529. Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014). Author: Yunni <Euler57721@gmail.com> Author: Yun Ni <yunn@uber.com> Closes #15148 from Yunni/SPARK-5992-yunn-lsh.
* [SPARK-18109][ML] Add instrumentation to GMMZheng RuiFeng2016-10-281-0/+6
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add instrumentation to GMM ## How was this patch tested? Test in spark-shell Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15636 from zhengruifeng/gmm_instr.