path: root/mllib
Commit log, most recent first. Each entry shows: commit message (author, date, files changed, -deleted/+added lines).
* [SPARK-19825][R][ML] spark.ml R API for FPGrowth (zero323, 2017-04-03, 2 files changed, -0/+88)
  ## What changes were proposed in this pull request?
  Adds SparkR API for FPGrowth [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
  - `spark.fpGrowth` for model training
  - `freqItemsets` and `associationRules` methods with new corresponding generics
  - Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
  - unit tests
  ## How was this patch tested?
  Feature-specific unit tests.
  Author: zero323 <zero323@users.noreply.github.com>
  Closes #17170 from zero323/SPARK-19825.
* [SPARK-19969][ML] Imputer doc and example (Yuhao Yang, 2017-04-03, 1 file changed, -1/+1)
  ## What changes were proposed in this pull request?
  Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316.
  ## How was this patch tested?
  Local doc generation and example execution.
  Author: Yuhao Yang <yuhao.yang@intel.com>
  Closes #17324 from hhbyyh/imputerdoc.
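  In the spirit of the examples this commit adds, a minimal, self-contained sketch of the `Imputer` API (the column names and data here are illustrative, not from the PR):
  ```scala
  import org.apache.spark.ml.feature.Imputer
  import org.apache.spark.sql.SparkSession

  object ImputerSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("imputer").getOrCreate()
    import spark.implicits._

    // NaN marks the missing entries to be filled in.
    val df = Seq((1.0, Double.NaN), (2.0, 4.0), (Double.NaN, 6.0)).toDF("a", "b")

    val imputer = new Imputer()
      .setInputCols(Array("a", "b"))
      .setOutputCols(Array("a_imputed", "b_imputed"))
      .setStrategy("mean") // "median" is the other supported strategy

    imputer.fit(df).transform(df).show()
    spark.stop()
  }
  ```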
* [SPARK-19985][ML] Fixed copy method for some ML Models (Bryan Cutler, 2017-04-03, 8 files changed, -8/+30)
  ## What changes were proposed in this pull request?
  Some ML Models were using `defaultCopy`, which expects a default constructor, and others were not setting the parent estimator. This change fixes these by creating a new instance of the model and explicitly setting values and parent.
  ## How was this patch tested?
  Added `MLTestingUtils.checkCopy` to the tests of the offending models to verify the copy is made and the parent is set.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
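  A hedged sketch of the fix pattern described here, using a hypothetical `ToyModel` (not a class from the PR): construct the new instance explicitly, copy the param values, and carry over the parent.
  ```scala
  import org.apache.spark.ml.Model
  import org.apache.spark.ml.param.ParamMap
  import org.apache.spark.sql.{DataFrame, Dataset}
  import org.apache.spark.sql.types.StructType

  // Hypothetical model with learned state and no zero-arg constructor,
  // so defaultCopy (which reflectively calls a default constructor) fails.
  class ToyModel(override val uid: String, val learned: Array[Double])
      extends Model[ToyModel] {

    // The fix pattern: new instance + copyValues + setParent.
    override def copy(extra: ParamMap): ToyModel =
      copyValues(new ToyModel(uid, learned), extra).setParent(parent)

    // Identity transform; a real model would use `learned` here.
    override def transform(dataset: Dataset[_]): DataFrame = dataset.toDF()
    override def transformSchema(schema: StructType): StructType = schema
  }
  ```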
* [DOCS] Docs-only improvements (Jacek Laskowski, 2017-03-30, 1 file changed, -1/+1)
  ## What changes were proposed in this pull request?
  Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (introduced in 2.1.0).
  ## How was this patch tested?
  Local build.
  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #17417 from jaceklaskowski/window-expression-scaladoc.
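  For reference, a small sketch of the recommended boundary constants in use (illustrative data, assuming a local SparkSession):
  ```scala
  import org.apache.spark.sql.SparkSession
  import org.apache.spark.sql.expressions.Window
  import org.apache.spark.sql.functions.sum

  object WindowSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("window").getOrCreate()
    import spark.implicits._

    val df = Seq(("a", 1), ("a", 2), ("b", 3)).toDF("k", "v")

    // Named constants instead of raw Long boundary values.
    val w = Window.partitionBy("k").orderBy("v")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)

    df.withColumn("running_sum", sum("v").over(w)).show()
    spark.stop()
  }
  ```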
* [SPARK-20040][ML][PYTHON] pyspark wrapper for ChiSquareTest (Bago Amirbekian, 2017-03-28, 1 file changed, -3/+3)
  ## What changes were proposed in this pull request?
  A pyspark wrapper for spark.ml.stat.ChiSquareTest.
  ## How was this patch tested?
  Unit tests, doctests.
  Author: Bago Amirbekian <bago@databricks.com>
  Closes #17421 from MrBago/chiSquareTestWrapper.
* [SPARK-20043][ML] DecisionTreeModel: ImpurityCalculator builder fails for uppercase impurity type Gini (颜发才(Yan Facai), 2017-03-28, 2 files changed, -1/+15)
  Fix bug: DecisionTreeModel can't recognize impurity "Gini" when loading.
  - [x] add unit test
  - [x] fix the bug
  Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
  Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
* [SPARK-17137][ML][WIP] Compress logistic regression coefficients (sethah, 2017-03-25, 2 files changed, -37/+49)
  ## What changes were proposed in this pull request?
  Use the new `compressed` method on matrices to store the logistic regression coefficients as sparse or dense, whichever requires less memory. Marked as WIP so we can add some performance test results. Basically, we should see if prediction is slower because of using a sparse matrix over a dense one. This can happen since sparse matrices do not use native BLAS operations when computing the margins.
  ## How was this patch tested?
  Unit tests added.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #17426 from sethah/SPARK-17137.
* [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark (Nick Pentreath, 2017-03-24, 1 file changed, -5/+5)
  Add Python wrapper for the `Imputer` feature transformer.
  ## How was this patch tested?
  New doc tests and a tweak to PySpark ML `tests.py`.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
* [SPARK-19636][ML] Feature parity for correlation statistics in MLlib (Timothy Hunter, 2017-03-23, 2 files changed, -0/+163)
  ## What changes were proposed in this pull request?
  This patch adds DataFrame-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket. The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
  ## How was this patch tested?
  ```
  build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
  ```
  Author: Timothy Hunter <timhunter@databricks.com>
  Closes #17108 from thunterdb/19636.
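  A minimal sketch of the resulting DataFrame-based API, `org.apache.spark.ml.stat.Correlation` (illustrative data; the pattern follows standard usage):
  ```scala
  import org.apache.spark.ml.linalg.{Matrix, Vectors}
  import org.apache.spark.ml.stat.Correlation
  import org.apache.spark.sql.{Row, SparkSession}

  object CorrelationSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("corr").getOrCreate()
    import spark.implicits._

    val df = Seq(
      Vectors.dense(1.0, 2.0),
      Vectors.dense(2.0, 4.1),
      Vectors.dense(3.0, 6.2)
    ).map(Tuple1.apply).toDF("features")

    // Pearson by default; pass "spearman" as a third argument for rank correlation.
    val Row(coeff: Matrix) = Correlation.corr(df, "features").head
    println(coeff)
    spark.stop()
  }
  ```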
* [MINOR][BUILD] Fix javadoc8 break (hyukjinkwon, 2017-03-23, 1 file changed, -2/+2)
  ## What changes were proposed in this pull request?
  Several javadoc8 breaks have been introduced. This PR proposes to fix those instances so that we can build Scala/Java API docs.
  ```
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
  [error]  * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
  [error]  * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
  [error]  * {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
  [error]  * {link GroupStateTimeout.EventTimeTimeout}).
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
  [error]  * Spark SQL types (see {link Encoder} for more details).
  [error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
  [error]  * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
  [error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
  [error]  * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
  [error]  * See {link GroupState} for more details.
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
  [error]  * See {link GroupState} for more details.
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
  [error]  * See {link GroupState} for more details.
  [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
  [error]  * See {link GroupState} for more details.
  [error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
  [error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
  [error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
  [error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
  [info] 13 errors
  ```
  ```
  jekyll 3.3.1 | Error: Unidoc generation failed
  ```
  ## How was this patch tested?
  Manually via `jekyll build`.
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #17389 from HyukjinKwon/minor-javadoc8-fix.
* [SPARK-20039][ML] rename ChiSquare to ChiSquareTest (Joseph K. Bradley, 2017-03-21, 2 files changed, -6/+6)
  ## What changes were proposed in this pull request?
  I realized that since ChiSquare is in the package stat, it's pretty unclear if it's the hypothesis test, distribution, or what. This PR renames it to ChiSquareTest to clarify this.
  ## How was this patch tested?
  Existing unit tests.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #17368 from jkbradley/SPARK-20039.
* [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile (Zheng RuiFeng, 2017-03-21, 1 file changed, -2/+2)
  ## What changes were proposed in this pull request?
  Update docs for NaN handling in approxQuantile.
  ## How was this patch tested?
  Existing tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #17369 from zhengruifeng/doc_quantiles_nan.
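  For context, a minimal sketch of the method whose docs were updated; per the clarified documentation, null and NaN values are removed from the column before the calculation (illustrative data):
  ```scala
  import org.apache.spark.sql.SparkSession

  object QuantileSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("quantile").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 2.0, 3.0, 4.0, Double.NaN).toDF("x")

    // relativeError = 0.0 requests exact quantiles; the NaN row is dropped first.
    val quartiles = df.stat.approxQuantile("x", Array(0.25, 0.5, 0.75), 0.0)
    println(quartiles.mkString(", "))
    spark.stop()
  }
  ```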
* [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter (christopher snow, 2017-03-21, 1 file changed, -8/+8)
  ## What changes were proposed in this pull request?
  API documentation and collaborative filtering documentation page changes to clarify the inconsistent description of the ALS rank parameter.
  - [DOCS] was previously: "rank is the number of latent factors in the model."
  - [API] was previously: "rank - number of features to use"
  This change describes rank in both places consistently as: "Number of features to use (also referred to as the number of latent factors)".
  Author: Chris Snow <chris.snow@uk.ibm.com>
  Author: christopher snow <chsnow123@gmail.com>
  Closes #17345 from snowch/SPARK-20011.
* [SPARK-19899][ML] Replace featuresCol with itemsCol in ml.fpm.FPGrowth (zero323, 2017-03-20, 2 files changed, -18/+31)
  ## What changes were proposed in this pull request?
  Replaces the `featuresCol` `Param` with `itemsCol`. See [SPARK-19899](https://issues.apache.org/jira/browse/SPARK-19899).
  ## How was this patch tested?
  Manual tests. Existing unit tests.
  Author: zero323 <zero323@users.noreply.github.com>
  Closes #17321 from zero323/SPARK-19899.
* [SPARK-19635][ML] DataFrame-based API for chi square test (Joseph K. Bradley, 2017-03-16, 4 files changed, -6/+192)
  ## What changes were proposed in this pull request?
  Wrapper taking and returning a DataFrame.
  ## How was this patch tested?
  Copied unit tests from the RDD-based API.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #17110 from jkbradley/df-hypotests.
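  A minimal sketch of the resulting API, using its post-rename name `ChiSquareTest` (see SPARK-20039 above; the data is illustrative):
  ```scala
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.ml.stat.ChiSquareTest
  import org.apache.spark.sql.SparkSession

  object ChiSquareSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("chi2").getOrCreate()
    import spark.implicits._

    val data = Seq(
      (0.0, Vectors.dense(0.5, 10.0)),
      (0.0, Vectors.dense(1.5, 20.0)),
      (1.0, Vectors.dense(1.5, 30.0)),
      (0.0, Vectors.dense(3.5, 30.0)),
      (1.0, Vectors.dense(3.5, 40.0))
    ).toDF("label", "features")

    // Single-row result with pValues, degreesOfFreedom, and statistics.
    ChiSquareTest.test(data, "features", "label").show(truncate = false)
    spark.stop()
  }
  ```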
* [SPARK-13568][ML] Create feature transformer to impute missing values (Yuhao Yang, 2017-03-16, 2 files changed, -0/+444)
  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-13568
  It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn. Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches; where possible, existing DataFrame code can be used (e.g. for approximate quantiles etc). Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
  ## How was this patch tested?
  New unit tests and manual test.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Author: Yuhao Yang <yuhao.yang@intel.com>
  Author: Yuhao <yuhao.yang@intel.com>
  Closes #11601 from hhbyyh/imputer.
* [SPARK-11569][ML] Fix StringIndexer to handle null value properly (Menglong TAN, 2017-03-14, 2 files changed, -22/+77)
  ## What changes were proposed in this pull request?
  This PR enhances StringIndexer with NULL-value handling. Before the PR, StringIndexer would throw an exception when it encountered NULL values. With this PR:
  - handleInvalid=error: throw an exception as before
  - handleInvalid=skip: skip null values as well as unseen labels
  - handleInvalid=keep: give null values an additional index, as for unseen labels
  BTW, I noticed someone was trying to solve the same problem ( #9920 ) but seems to be getting no progress or response for a long time. Would you mind giving me a chance to solve it? I'm eager to help. :-)
  ## How was this patch tested?
  New unit tests.
  Author: Menglong TAN <tanmenglong@renrenche.com>
  Author: Menglong TAN <tanmenglong@gmail.com>
  Closes #17233 from crackcell/11569_StringIndexer_NULL.
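  A small sketch of the behavior described above (illustrative data; assumes the "keep" option from this PR and SPARK-17498 below):
  ```scala
  import org.apache.spark.ml.feature.StringIndexer
  import org.apache.spark.sql.SparkSession

  object StringIndexerSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("indexer").getOrCreate()
    import spark.implicits._

    // Option[String] yields a nullable string column; None becomes NULL.
    val df = Seq(Some("a"), Some("b"), Some("a"), None).toDF("label")

    val indexer = new StringIndexer()
      .setInputCol("label")
      .setOutputCol("labelIndex")
      .setHandleInvalid("keep") // "error" throws, "skip" drops the rows

    indexer.fit(df).transform(df).show()
    spark.stop()
  }
  ```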
* [SPARK-19940][ML][MINOR] FPGrowthModel.transform should skip duplicated items (zero323, 2017-03-14, 2 files changed, -2/+16)
  ## What changes were proposed in this pull request?
  This commit moves `distinct` to its intended place to avoid duplicated predictions and adds a unit test covering the issue.
  ## How was this patch tested?
  Unit tests.
  Author: zero323 <zero323@users.noreply.github.com>
  Closes #17283 from zero323/SPARK-19940.
* [SPARK-19922][ML] small speedups to findSynonyms (Asher Krim, 2017-03-14, 1 file changed, -15/+19)
  Currently generating synonyms using a large model (I've tested with 3m words) is very slow. These efficiencies have sped things up for us by ~17%. I wasn't sure if such small changes were worthy of a JIRA, but the guidelines seemed to suggest that that is the preferred approach.
  ## What changes were proposed in this pull request?
  Address a few small issues in the findSynonyms logic:
  1) Remove usage of `Array.fill` to zero out the `cosineVec` array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed.
  2) Use Floats throughout. The conversion to Doubles before doing the `priorityQueue` is totally superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just serves to put extra strain on the GC.
  3) Convert the slow `for (i <- cosVec.indices)` to an ugly, but faster, `while` loop.
  These efficiencies are really only apparent when working with a large model.
  ## How was this patch tested?
  Existing unit tests + some in-house tests to time the difference.
  cc jkbradley MLNick srowen
  Author: Asher Krim <krim.asher@gmail.com>
  Author: Asher Krim <krim.asher@gmail>
  Closes #17263 from Krimit/fasterFindSynonyms.
* [SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR (actuaryzhang, 2017-03-14, 1 file changed, -4/+15)
  ## What changes were proposed in this pull request?
  Port Tweedie GLM #16344 to SparkR.
  felixcheung yanboliang
  ## How was this patch tested?
  New test in SparkR.
  Author: actuaryzhang <actuaryzhang10@gmail.com>
  Closes #16729 from actuaryzhang/sparkRTweedie.
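  For reference, a minimal Scala-side sketch of the Tweedie GLM being ported (from #16344); the data and parameter values are illustrative:
  ```scala
  import org.apache.spark.ml.linalg.Vectors
  import org.apache.spark.ml.regression.GeneralizedLinearRegression
  import org.apache.spark.sql.SparkSession

  object TweedieSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("tweedie").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1.0, Vectors.dense(1.0, 0.5)),
      (0.0, Vectors.dense(0.2, 0.9)),
      (2.5, Vectors.dense(1.5, 0.1))
    ).toDF("label", "features")

    val glr = new GeneralizedLinearRegression()
      .setFamily("tweedie")
      .setVariancePower(1.5) // compound Poisson-gamma; labels must be >= 0
      .setLinkPower(0.0)     // 0.0 selects the log link

    val model = glr.fit(df)
    println(model.coefficients)
    spark.stop()
  }
  ```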
* [MINOR][ML] Improve MLWriter overwrite error message (Joseph K. Bradley, 2017-03-13, 1 file changed, -2/+3)
  ## What changes were proposed in this pull request?
  Give proper syntax for Java and Python in addition to Scala.
  ## How was this patch tested?
  Manually.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #17215 from jkbradley/write-err-msg.
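  The Scala form of the syntax the improved message points to (the transformer and path here are illustrative):
  ```scala
  import org.apache.spark.ml.feature.Binarizer
  import org.apache.spark.sql.SparkSession

  object OverwriteSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("overwrite").getOrCreate()

    val binarizer = new Binarizer()
      .setInputCol("x").setOutputCol("y").setThreshold(0.5)

    // save() alone throws if the path already exists;
    // overwrite() opts in to replacing it.
    binarizer.write.overwrite().save("/tmp/binarizer-model")
    spark.stop()
  }
  ```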
* [SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models (Xin Ren, 2017-03-12, 4 files changed, -0/+4)
  ## What changes were proposed in this pull request?
  The RandomForest R Wrapper and GBT R Wrapper return the param `maxDepth` to R models. The following 4 R wrappers are changed:
  * `RandomForestClassificationWrapper`
  * `RandomForestRegressionWrapper`
  * `GBTClassificationWrapper`
  * `GBTRegressionWrapper`
  ## How was this patch tested?
  Tested manually on my local machine.
  Author: Xin Ren <iamshrek@126.com>
  Closes #17207 from keypointt/SPARK-19282.
* [SPARK-16440][MLLIB] Ensure broadcasted variables are destroyed even in case of exception (Anthony Truchet, 2017-03-08, 1 file changed, -3/+15)
  ## What changes were proposed in this pull request?
  Ensure broadcasted variables are destroyed even in case of exception.
  ## How was this patch tested?
  Word2VecSuite was run locally.
  Author: Anthony Truchet <a.truchet@criteo.com>
  Closes #14299 from AnthonyTruchet/SPARK-16440.
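  A hedged sketch of the pattern (not the PR's code): tying the broadcast's cleanup to a finally block so an exception in the job body cannot leak it.
  ```scala
  import scala.reflect.ClassTag

  import org.apache.spark.SparkContext
  import org.apache.spark.broadcast.Broadcast

  object BroadcastCleanup {
    // Run `body` with a broadcast of `value`, destroying the broadcast
    // on both the success and the exception path.
    def withBroadcast[T: ClassTag, R](sc: SparkContext, value: T)(body: Broadcast[T] => R): R = {
      val bc = sc.broadcast(value)
      try {
        body(bc)
      } finally {
        bc.destroy()
      }
    }
  }
  ```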
* [SPARK-19806][ML][PYSPARK] PySpark GeneralizedLinearRegression supports tweedie distribution (Yanbo Liang, 2017-03-08, 1 file changed, -4/+4)
  ## What changes were proposed in this pull request?
  PySpark `GeneralizedLinearRegression` supports the tweedie distribution.
  ## How was this patch tested?
  Added unit tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #17146 from yanboliang/spark-19806.
* [ML][MINOR] Separate estimator and model params for read/write test (Yanbo Liang, 2017-03-08, 23 files changed, -54/+59)
  ## What changes were proposed in this pull request?
  Since we allow an `Estimator` and its `Model` to not always share the same params (see `ALSParams` and `ALSModelParams`), we should pass in test params for the estimator and the model separately in the function `testEstimatorAndModelReadWrite`.
  ## How was this patch tested?
  Existing tests.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #17151 from yanboliang/test-rw.
* [SPARK-17629][ML] methods to return synonyms directly (Asher Krim, 2017-03-07, 2 files changed, -12/+45)
  ## What changes were proposed in this pull request?
  Provide methods to return synonyms directly, without wrapping them in a DataFrame. In performance-sensitive applications (such as user-facing APIs) the round trip to and from DataFrames is costly and unnecessary. The methods are named `findSynonymsArray` to make the return type clear, which also implies a local data structure.
  ## How was this patch tested?
  Updated word2vec tests.
  Author: Asher Krim <akrim@hubspot.com>
  Closes #16811 from Krimit/w2vFindSynonymsLocal.
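  A minimal sketch of the added method (tiny illustrative corpus; `findSynonyms` returns a DataFrame, while `findSynonymsArray` returns a local array):
  ```scala
  import org.apache.spark.ml.feature.Word2Vec
  import org.apache.spark.sql.SparkSession

  object SynonymsSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("w2v").getOrCreate()
    import spark.implicits._

    val docs = Seq("spark mllib is fun", "mllib supports word2vec")
      .map(_.split(" ")).map(Tuple1.apply).toDF("text")

    val model = new Word2Vec()
      .setInputCol("text").setOutputCol("vec")
      .setVectorSize(3).setMinCount(0)
      .fit(docs)

    // Local Array[(String, Double)]: no DataFrame round trip.
    val synonyms = model.findSynonymsArray("spark", 2)
    synonyms.foreach { case (word, sim) => println(s"$word: $sim") }
    spark.stop()
  }
  ```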
* [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels (VinceShieh, 2017-03-07, 2 files changed, -28/+71)
  ## What changes were proposed in this pull request?
  This PR is an enhancement to ML StringIndexer. Before this PR, StringIndexer only supported the "skip"/"error" options to deal with unseen records. But those unseen records might still be useful, and the user might like to keep the unseen labels in certain use cases. This PR enables StringIndexer to support keeping unseen labels as index [numLabels].
  Before:
      StringIndexer().setHandleInvalid("skip")
      StringIndexer().setHandleInvalid("error")
  After, supporting the third option "keep":
      StringIndexer().setHandleInvalid("keep")
  ## How was this patch tested?
  Test added in StringIndexerSuite.
  Signed-off-by: VinceShieh <vincent.xie@intel.com>
  Author: VinceShieh <vincent.xie@intel.com>
  Closes #16883 from VinceShieh/spark-17498.
* [SPARK-19382][ML] Test sparse vectors in LinearSVCSuite (wm624@hotmail.com, 2017-03-06, 1 file changed, -2/+22)
  ## What changes were proposed in this pull request?
  Add unit tests for testing SparseVector. We can't add a mixed DenseVector and SparseVector test case, as discussed in JIRA 19382:
      def merge(other: MultivariateOnlineSummarizer): this.type = {
        if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
          require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
            s"Expecting $n but got ${other.n}.")
  ## How was this patch tested?
  Unit tests.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Author: Miao Wang <wangmiao1981@users.noreply.github.com>
  Closes #16784 from wangmiao1981/bk.
* [SPARK-19535][ML] RecommendForAllUsers RecommendForAllItems for ALS on Dataframe (Sue Ann Hong, 2017-03-05, 4 files changed, -9/+297)
  ## What changes were proposed in this pull request?
  This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.
  ## How was this patch tested?
  Unit tests
  ```
  $ build/sbt
  > mllib/testOnly *ALSSuite -- -z "recommendFor"
  > mllib/testOnly
  ```
  Author: Your Name <you@example.com>
  Author: sueann <sueann@databricks.com>
  Closes #17090 from sueann/SPARK-19535.
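  A minimal sketch of the new methods on a fitted `ALSModel` (tiny illustrative ratings; the parameter values are arbitrary):
  ```scala
  import org.apache.spark.ml.recommendation.ALS
  import org.apache.spark.sql.SparkSession

  object RecommendSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("als").getOrCreate()
    import spark.implicits._

    val ratings = Seq((0, 0, 4.0f), (0, 1, 2.0f), (1, 1, 3.0f), (1, 2, 1.0f))
      .toDF("user", "item", "rating")

    val model = new ALS()
      .setUserCol("user").setItemCol("item").setRatingCol("rating")
      .setRank(2).setMaxIter(5)
      .fit(ratings)

    model.recommendForAllUsers(3).show(truncate = false) // top 3 items per user
    model.recommendForAllItems(3).show(truncate = false) // top 3 users per item
    spark.stop()
  }
  ```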
* [SPARK-19745][ML] SVCAggregator captures coefficients in its closure (sethah, 2017-03-02, 6 files changed, -24/+34)
  ## What changes were proposed in this pull request?
  JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745)
  Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val`, which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this. We can either:
  a) separate the gradient improvement into another patch,
  b) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch, or
  c) keep it as is.
  ## How was this patch tested?
  This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #17076 from sethah/svc_agg.
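  A hedged sketch (not the PR's code) of the two ideas: read the coefficients through the broadcast lazily instead of capturing the deserialized vector, and make the gradient buffer a lazy val so it is never allocated on the driver.
  ```scala
  import org.apache.spark.broadcast.Broadcast

  // Illustrative aggregator skeleton; the names are hypothetical.
  class SvcAggregatorSketch(
      numCoefficients: Int,
      bcCoefficients: Broadcast[Array[Double]]) extends Serializable {

    // Resolved from the broadcast on each executor; @transient keeps the
    // deserialized array out of the serialized task closure.
    @transient private lazy val coefficients: Array[Double] = bcCoefficients.value

    // Allocated lazily, so the driver never materializes it before shipping.
    private lazy val gradientSumArray: Array[Double] = new Array[Double](numCoefficients)

    def add(features: Array[Double]): this.type = {
      // Placeholder update, just to show both buffers in use.
      var i = 0
      while (i < math.min(features.length, numCoefficients)) {
        gradientSumArray(i) += coefficients(i) * features(i)
        i += 1
      }
      this
    }
  }
  ```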
* [SPARK-19704][ML] AFTSurvivalRegression should support numeric censorCol (Zheng RuiFeng, 2017-03-02, 3 files changed, -5/+37)
  ## What changes were proposed in this pull request?
  Make `AFTSurvivalRegression` support a numeric censorCol.
  ## How was this patch tested?
  Existing tests and added tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #17034 from zhengruifeng/aft_numeric_censor.
* [SPARK-19733][ML] Removed unnecessary castings and refactored checked casts in ALS (Vasilis Vryniotis, 2017-03-02, 2 files changed, -20/+95)
  ## What changes were proposed in this pull request?
  The original ALS was performing unnecessary casting of the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.
  ## How was this patch tested?
  I tested it by running the unit tests and by manually validating the result of checkedCast for various legal and illegal values.
  Author: Vasilis Vryniotis <bbriniotis@datumbox.com>
  Closes #17059 from datumbox/als_casting_fix.
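  A hedged sketch of the approach described (not the PR's exact code): accept `Any`, pass integral ids through, and validate only values that must round-trip to an `Int` exactly.
  ```scala
  object CheckedCastSketch {
    // Ids arrive as Any from a DataFrame column of unknown numeric type.
    def checkedCast(id: Any): Int = id match {
      case i: Int => i // no work needed for the common case
      case l: Long =>
        require(l.isValidInt, s"ALS id $l is out of Int range."); l.toInt
      case d: Double =>
        require(d.isValidInt, s"ALS id $d is not a valid integer."); d.toInt
      case f: Float =>
        require(f.isValidInt, s"ALS id $f is not a valid integer."); f.toInt
      case other =>
        throw new IllegalArgumentException(s"Unsupported id type: $other")
    }
  }
  ```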
* [SPARK-19787][ML] Changing the default parameter of regParam (Vasilis Vryniotis, 2017-03-01, 1 file changed, -1/+1)
  ## What changes were proposed in this pull request?
  In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1. I changed the one in the train() method to 0.1, and now it matches the default value which is visible to Spark users. The method is marked with DeveloperApi, so it should not affect users. Whenever we use the particular method we provide all parameters, so the default does not matter. The only exception is the unit tests in ALSSuite, but the change does not break them.
  Note: This PR should get the award of the laziest commit in Spark history. Originally I wanted to correct this in another PR, but MLnick [suggested](https://github.com/apache/spark/pull/17059#issuecomment-283333572) creating a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close this. :)
  ## How was this patch tested?
  Unit tests.
  Author: Vasilis Vryniotis <vvryniotis@hotels.com>
  Closes #17121 from datumbox/als_regparam.
* [SPARK-14503][ML] spark.ml API for FPGrowth (Yuhao, 2017-02-28, 2 files changed, -0/+469)
  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-14503
  Function parity: Add FPGrowth and AssociationRules to ML.
  Design doc: https://docs.google.com/document/d/1bVhABn5DiEj8bw0upqGMJT2L4nvO_0_cXdwu4uMT6uU/pub
  Currently I make FPGrowthModel a transformer. For each association rule, it will just examine the input items against antecedents and summarize the consequents.
  Update: Thinking again, FPGrowth is only the algorithm to find the frequent itemsets, and can be replaced by other algorithms. The frequent itemsets are used by AssociationRules to generate the association rules. Then we can use the association rules to predict with other records.
  ![drawing1](https://cloud.githubusercontent.com/assets/7981698/22489294/76b9302c-e7cb-11e6-8d2d-3fc53f407b2f.png)
  **For reviewers**: let's first decide if the current `transform` function meets your expectation. Current options:
  1. Current implementation: use the Estimator and Transformer pattern in ML; the `transform` function will examine the input items against all the association rules and summarize the consequents. Users can also access frequent items and association rules via other model members.
  2. Keep the Estimator and Transformer pattern, but AssociationRulesModel and FPGrowthModel will have empty `transform` functions, meaning the DataFrame has no change after transform. Users can still access frequent items and association rules via other model members.
  3. (mentioned by zhengruifeng) Keep the Estimator and Transformer pattern, but `FPGrowthModel` and `AssociationRulesModel` will just return the frequent itemsets and association rules DataFrames in the `transform` function, meaning the resulting DataFrame after `transform` will not be related to the input DataFrame.
  4. Discard the Estimator and Transformer pattern. Both FPGrowth and FPGrowthModel will directly extend from PipelineStage, thus we don't need a `transform` function.
  I'd like to hear more concrete suggestions. I would prefer option 1 or 2.
  Update 2: As discussed in the jira, we will not expose AssociationRules as a public API for now.
  ## How was this patch tested?
  New unit test suites.
  Author: Yuhao <yuhao.yang@intel.com>
  Author: Yuhao Yang <yuhao.yang@intel.com>
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #15415 from hhbyyh/mlfpm.
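  A minimal sketch of the resulting estimator/transformer API, using the `itemsCol` name it ended up with after SPARK-19899 above (illustrative transactions):
  ```scala
  import org.apache.spark.ml.fpm.FPGrowth
  import org.apache.spark.sql.SparkSession

  object FPGrowthSketch extends App {
    val spark = SparkSession.builder().master("local[2]").appName("fpm").getOrCreate()
    import spark.implicits._

    val df = Seq("1 2 5", "1 2 3 5", "1 2")
      .map(_.split(" ")).map(Tuple1.apply).toDF("items")

    val model = new FPGrowth()
      .setItemsCol("items")
      .setMinSupport(0.5)
      .setMinConfidence(0.6)
      .fit(df)

    model.freqItemsets.show()     // frequent itemsets and their frequencies
    model.associationRules.show() // antecedent, consequent, confidence
    model.transform(df).show()    // per-row predicted consequents
    spark.stop()
  }
  ```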
* [SPARK-14489][ML][PYSPARK] ALS unknown user/item prediction strategy (Nick Pentreath, 2017-02-28, 2 files changed, -4/+91)
  This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation. The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit`, since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible.
  The new param, `coldStartStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
  ## How was this patch tested?
  New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `coldStartStrategy` to `drop` results in valid metrics.
  Author: Nick Pentreath <nickp@za.ibm.com>
  Closes #12896 from MLnick/SPARK-14489-als-nan.
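  Continuing the ALS sketch above, the new param in use (values illustrative; `train`/`test` stand in for a real split):
  ```scala
  import org.apache.spark.ml.evaluation.RegressionEvaluator
  import org.apache.spark.ml.recommendation.ALS

  // With "drop", transform() removes rows whose user or item was unseen at
  // fit time instead of scoring them NaN, so evaluators return real metrics.
  val als = new ALS()
    .setUserCol("user").setItemCol("item").setRatingCol("rating")
    .setColdStartStrategy("drop") // default "nan" keeps the old behavior

  val evaluator = new RegressionEvaluator()
    .setMetricName("rmse").setLabelCol("rating").setPredictionCol("prediction")
  // evaluator.evaluate(als.fit(train).transform(test)) now yields a finite RMSE.
  ```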
* [SPARK-19746][ML] Faster indexing for logistic aggregator (sethah, 2017-02-28, 2 files changed, -3/+34)
  ## What changes were proposed in this pull request?
  JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746)
  The following code is inefficient:
  ```scala
  val localCoefficients: Vector = bcCoefficients.value
  features.foreachActive { (index, value) =>
    val stdValue = value / localFeaturesStd(index)
    var j = 0
    while (j < numClasses) {
      margins(j) += localCoefficients(index * numClasses + j) * stdValue
      j += 1
    }
  }
  ```
  `localCoefficients(index * numClasses + j)` calls `Vector.apply`, which creates a new Breeze vector and indexes it. Even if it is not that slow to create the object, we will generate a lot of extra garbage that may result in longer GC pauses. This is a hot inner loop, so we should optimize wherever possible.
  ## How was this patch tested?
  I don't think there's a great way to test this patch. It's purely performance related, so unit tests should guarantee that we haven't made any unwanted changes. Empirically I observed between 10-40% speedups just running short local tests. I suspect the big differences will be seen when large data/coefficient sizes have to pause for GC more often. I welcome other ideas for testing.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #17078 from sethah/logistic_agg_indexing.
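  A hedged sketch of the faster-indexing idea (the free names mirror the quoted snippet; this is not necessarily the PR's exact fix): flatten the coefficients to a primitive array once, so the hot loop avoids `Vector.apply`.
  ```scala
  import org.apache.spark.broadcast.Broadcast
  import org.apache.spark.ml.linalg.Vector

  object MarginsSketch {
    def addMargins(
        features: Vector,
        localFeaturesStd: Array[Double],
        margins: Array[Double],
        numClasses: Int,
        bcCoefficients: Broadcast[Vector]): Unit = {
      // One toArray call up front; the loop then indexes primitive storage
      // with no per-element object allocation.
      val coefArr: Array[Double] = bcCoefficients.value.toArray
      features.foreachActive { (index, value) =>
        val stdValue = value / localFeaturesStd(index)
        var j = 0
        while (j < numClasses) {
          margins(j) += coefArr(index * numClasses + j) * stdValue
          j += 1
        }
      }
    }
  }
  ```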
* [MINOR][BUILD] Fix lint-java breaks in Java (hyukjinkwon, 2017-02-27, 1 file changed, -1/+2)
  ## What changes were proposed in this pull request?
  This PR proposes to fix the lint breaks as below:
  ```
  [ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
  [ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
  [ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
  [ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
  [ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
  [ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
  [ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
  [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
  [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
  [ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
  [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
  [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
  [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
  [ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
  [ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
  [ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
  [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
  [ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
  [ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
  [ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
  [ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
  [ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
  [ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
  ```
  ## How was this patch tested?
  Manually via
  ```bash
  ./dev/lint-java
  ```
  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #17072 from HyukjinKwon/java-lint.
* [MINOR][ML][DOC] Document default value for GeneralizedLinearRegression.linkPower (Joseph K. Bradley, 2017-02-25, 1 file changed, -0/+2)
  Add Scaladoc for the GeneralizedLinearRegression.linkPower default value.
  Follow-up to https://github.com/apache/spark/pull/16344
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #17069 from jkbradley/tweedie-comment.
* [SPARK-19616][SPARKR] weightCol and aggregationDepth should be improved for some SparkR APIs (wm624@hotmail.com, 2017-02-22, 4 files changed, -5/+15)
  ## What changes were proposed in this pull request?
  This is a follow-up PR of #16800. When doing SPARK-19456, we found that "" should be considered a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.
  ## How was this patch tested?
  Existing tests.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #16945 from wangmiao1981/svc.
* [SPARK-19679][ML] Destroy broadcasted object without blocking (Zheng RuiFeng, 2017-02-22, 3 files changed, -3/+3)
  ## What changes were proposed in this pull request?
  Destroy broadcasted objects without blocking. Call sites located with:
  `find mllib -name '*.scala' | xargs -i bash -c 'egrep "destroy" -n {} && echo {}'`
  ## How was this patch tested?
  Existing tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #17016 from zhengruifeng/destroy_without_block.
* [SPARK-19694][ML] Add missing 'setTopicDistributionCol' for LDAModel (Zheng RuiFeng, 2017-02-22, 1 file changed, -0/+3)
  ## What changes were proposed in this pull request?
  Add the missing 'setTopicDistributionCol' for LDAModel.
  ## How was this patch tested?
  Existing tests.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #17021 from zhengruifeng/lda_outputCol.
* [SPARK-19534][TESTS] Convert Java tests to use lambdas, Java 8 features (Sean Owen, 2017-02-19, 7 files changed, -79/+42)
  ## What changes were proposed in this pull request?
  Convert tests to use Java 8 lambdas, and modest related fixes to surrounding code.
  ## How was this patch tested?
  Jenkins tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #16964 from srowen/SPARK-19534.
* [MLLIB][TYPO] Replace LeastSquaresAggregator with LogisticAggregator (Moussa Taifi, 2017-02-18, 1 file changed, -1/+1)
  ## What changes were proposed in this pull request?
  Replace LeastSquaresAggregator with LogisticAggregator in the require statement of the merge op.
  ## How was this patch tested?
  Simple message fix.
  Author: Moussa Taifi <moutai10@gmail.com>
  Closes #16903 from moutai/master.
* [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing (Yun Ni, 2017-02-15, 1 file changed, -3/+4)
  ## What changes were proposed in this pull request?
  This pull request includes a Python API and examples for LSH. The API changes were based on yanboliang's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
  ## How was this patch tested?
  API and examples are tested using spark-submit:
  `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
  `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
  User guide changes are generated and manually inspected:
  `SKIP_API=1 jekyll build`
  Author: Yun Ni <yunn@uber.com>
  Author: Yanbo Liang <ybliang8@gmail.com>
  Author: Yunni <Euler57721@gmail.com>
  Closes #16715 from Yunni/spark-18080.
* [SPARK-19456][SPARKR] Add LinearSVC R API (wm624@hotmail.com, 2017-02-15, 2 files changed, -0/+154)
  ## What changes were proposed in this pull request?
  The linear SVM classifier was newly added into ML, and a Python API has been added. This JIRA is to add the R-side API. Marked as WIP, as I am designing unit tests.
  ## How was this patch tested?
  Please review http://spark.apache.org/contributing.html before opening a pull request.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #16800 from wangmiao1981/svc.
* [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner (sureshthalamati, 2017-02-14, 1 file changed, -2/+2)
  ## What changes were proposed in this pull request?
  The reason for the test failure is that the property "oracle.jdbc.mapDateToTimestamp" set by the test was getting converted into all lower case. Oracle database expects this property in a case-sensitive manner. This test was passing in previous releases because connection properties were sent as user-specified for the test case scenario; the fix to handle all options uniformly in a case-insensitive manner converted the JDBC connection properties to lower case as well.
  This PR enhances CaseInsensitiveMap to keep track of the input case-sensitive keys, and uses those when creating connection properties that are passed to the JDBC connection. An alternative approach, PR https://github.com/apache/spark/pull/16847, is to pass the original input keys to the JDBC data source by adding a check in the data source class and handling case-insensitivity in the JDBC source code.
  ## How was this patch tested?
  Added new test cases to JdbcSuite and OracleIntegrationSuite. Ran docker integration tests on my laptop; all tests passed successfully.
  Author: sureshthalamati <suresh.thalamati@gmail.com>
  Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
* [SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private (sueann, 2017-02-10, 1 file changed, -6/+6)
  ## What changes were proposed in this pull request?
  spark.ml.*LDAModel classes were exposing spark.mllib LDA models via protected methods. Made them package (clustering) private.
  ## How was this patch tested?
  ```
  build/sbt doc      # "mllib.clustering" no longer appears in the docs for *LDA* classes
  build/sbt compile  # compiles
  build/sbt
  > mllib/testOnly   # tests pass
  ```
  Author: sueann <sueann@databricks.com>
  Closes #16860 from sueann/SPARK-18613.
* [SPARK-19400][ML] Allow GLM to handle intercept only model (actuaryzhang, 2017-02-08, 3 files changed, -1/+60)
  ## What changes were proposed in this pull request?
  Intercept-only GLM is failing for non-Gaussian families because of reducing an empty array in IWLS. The following code
      val maxTolOfCoefficients = oldCoefficients.toArray.reduce { (x, y) =>
        math.max(math.abs(x), math.abs(y))
      }
  fails in the intercept-only model because `oldCoefficients` is empty. This PR fixes this issue.
  yanboliang srowen imatiach-msft zhengruifeng
  ## How was this patch tested?
  New test for intercept-only model.
  Author: actuaryzhang <actuaryzhang10@gmail.com>
  Closes #16740 from actuaryzhang/interceptOnly.
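  For illustration, one total way to write the quoted reduction (a sketch, not necessarily the PR's exact fix): fold with an identity element, so an empty coefficient array yields 0.0 instead of throwing.
  ```scala
  object EmptyReduceSketch extends App {
    val oldCoefficients = Array.empty[Double] // intercept-only: no coefficients

    // reduce on an empty collection throws UnsupportedOperationException:
    // oldCoefficients.reduce { (x, y) => math.max(math.abs(x), math.abs(y)) }

    // foldLeft with an identity is total and returns 0.0 here.
    val maxTol = oldCoefficients.foldLeft(0.0) { (acc, x) => math.max(acc, math.abs(x)) }
    println(maxTol) // 0.0
  }
  ```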
* [SPARK-19397][SQL] Make option names of LIBSVM and TEXT case insensitive (gatorsmile, 2017-02-08, 3 files changed, -6/+73)
  ### What changes were proposed in this pull request?
  Prior to Spark 2.1, the option names are case sensitive for all the formats. Since Spark 2.1, the option key names have become case insensitive except for the formats `Text` and `LibSVM`. This PR is to fix these issues. Also, add a check to know whether the input option vector type is legal for `LibSVM`.
  ### How was this patch tested?
  Added test cases.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #16737 from gatorsmile/libSVMTextOptions.
* [SPARK-19279][SQL] Infer Schema for Hive Serde Tables and Block Creating a Hive Table With an Empty Schema (gatorsmile, 2017-02-06, 1 file changed, -0/+27)
  ### What changes were proposed in this pull request?
  So far, we allow users to create a table with an empty schema: `CREATE TABLE tab1`. This could break many code paths if we enable it. Thus, we should follow Hive and block it. For Hive serde tables, some serde libraries require the specified schema and record it in the metastore. To get the list, we need to check `hive.serdes.using.metastore.for.schema`, which contains a list of serdes that require a user-specified schema. The default values are:
  - org.apache.hadoop.hive.ql.io.orc.OrcSerde
  - org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  - org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe
  - org.apache.hadoop.hive.serde2.dynamic_type.DynamicSerDe
  - org.apache.hadoop.hive.serde2.MetadataTypedColumnsetSerDe
  - org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe
  - org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
  - org.apache.hadoop.hive.serde2.lazybinary.LazyBinarySerDe
  ### How was this patch tested?
  Added test cases for both Hive and data source tables.
  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #16636 from gatorsmile/fixEmptyTableSchema.