aboutsummaryrefslogtreecommitdiff
path: root/mllib
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects ↵William Benton2016-09-174-21/+74
| | | | | | | | | | | | | | | | the best match when invoked with a vector ## What changes were proposed in this pull request? This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector. ## How was this patch tested? I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected. Author: William Benton <willb@redhat.com> Closes #15105 from willb/fix/findSynonyms.
* [SPARK-17507][ML][MLLIB] check weight vector size in ANNWeichenXu2016-09-151-6/+4
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? as the TODO described, check weight vector size and if wrong throw exception. ## How was this patch tested? existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15060 from WeichenXu123/check_input_weight_size_of_ann.
* [SPARK-17389][FOLLOW-UP][ML] Change KMeans k-means|| default init steps from ↵Yanbo Liang2016-09-112-3/+3
| | | | | | | | | | | | | | 5 to 2. ## What changes were proposed in this pull request? #14956 reduced default k-means|| init steps to 2 from 5 only for spark.mllib package, we should also do same change for spark.ml and PySpark. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15050 from yanboliang/spark-17389.
* [SPARK-17389][ML][MLLIB] KMeans speedup with better choice of k-means|| init ↵Sean Owen2016-09-112-10/+6
| | | | | | | | | | | | | | | | | steps = 2 ## What changes were proposed in this pull request? Reduce default k-means|| init steps to 2 from 5. See JIRA for discussion. See also https://github.com/apache/spark/pull/14948 ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #14956 from srowen/SPARK-17389.2.
* [SPARK-15509][FOLLOW-UP][ML][SPARKR] R MLlib algorithms should support input ↵Yanbo Liang2016-09-108-42/+14
| | | | | | | | | | | | | | | | | | | | | | | | | columns "features" and "label" ## What changes were proposed in this pull request? #13584 resolved the issue of features and label columns conflict with ```RFormula``` default ones when loading libsvm data, but it still left some issues should be resolved: 1, It’s not necessary to check and rename label column. Since we have considerations on the design of ```RFormula```, it can handle the case of label column already exists(with restriction of the existing label column should be numeric/boolean type). So it’s not necessary to change the column name to avoid conflict. If the label column is not numeric/boolean type, ```RFormula``` will throw exception. 2, We should rename features column name to new one if there is conflict, but appending a random value is enough since it was used internally only. We done similar work when implementing ```SQLTransformer```. 3, We should set correct new features column for the estimators. Take ```GLM``` as example: ```GLM``` estimator should set features column with the changed one(rFormula.getFeaturesCol) rather than the default “features”. Although it’s same when training model, but it involves problems when predicting. The following is the prediction result of GLM before this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18308227/84c3c452-74a8-11e6-9caa-9d6d846cc957.png) We should drop the internal used feature column name, otherwise, it will appear on the prediction DataFrame which will confused users. And this behavior is same as other scenarios which does not exist column name conflict. After this PR: ![image](https://cloud.githubusercontent.com/assets/1962026/18308240/92082a04-74a8-11e6-9226-801f52b856d9.png) ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14993 from yanboliang/spark-15509.
* [SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ↵Liwei Lin2016-09-075-6/+6
| | | | | | | | | | | | | | | | ArrayBuffer.append(A) in performance critical paths ## What changes were proposed in this pull request? We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing. ## How was this patch tested? N/A Author: Liwei Lin <lwlin7@gmail.com> Closes #14914 from lw-lin/append_to_plus_eq_v2.
* [MINOR] Remove unnecessary check in MLSerDeZheng RuiFeng2016-09-061-5/+4
| | | | | | | | | | | | | ## What changes were proposed in this pull request? 1, remove unnecessary `require()`, because it will make following check useless. 2, update the error msg. ## How was this patch tested? no test Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14972 from zhengruifeng/del_unnecessary_check.
* [MINOR][ML] Correct weights doc of MultilayerPerceptronClassificationModel.Yanbo Liang2016-09-061-1/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? ```weights``` of ```MultilayerPerceptronClassificationModel``` should be the output weights of layers rather than initial weights, this PR correct it. ## How was this patch tested? Doc change. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14967 from yanboliang/mlp-weights.
* [SPARK-17279][SQL] better error message for exceptions during ScalaUDF executionWenchen Fan2016-09-061-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? If `ScalaUDF` throws exceptions during executing user code, sometimes it's hard for users to figure out what's wrong, especially when they use Spark shell. An example ``` org.apache.spark.SparkException: Job aborted due to stage failure: Task 12 in stage 325.0 failed 4 times, most recent failure: Lost task 12.3 in stage 325.0 (TID 35622, 10.0.207.202): java.lang.NullPointerException at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40) at line8414e872fb8b42aba390efc153d1611a12.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$2.apply(<console>:40) at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source) ... ``` We should catch these exceptions and rethrow them with better error message, to say that the exception is happened in scala udf. This PR also does some clean up for `ScalaUDF` and add a unit test suite for it. ## How was this patch tested? the new test suite Author: Wenchen Fan <wenchen@databricks.com> Closes #14850 from cloud-fan/npe.
* [MINOR][ML][MLLIB] Remove work around for breeze sparse matrix.Yanbo Liang2016-09-041-9/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Since we have updated breeze version to 0.12, we should remove work around for bug of breeze sparse matrix in v0.11. I checked all mllib code and found this is the only work around for breeze 0.11. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14953 from yanboliang/matrices.
* [SPARK-17311][MLLIB] Standardize Python-Java MLlib API to accept optional ↵Sean Owen2016-09-041-9/+11
| | | | | | | | | | | | | | | | | | | | long seeds in all cases ## What changes were proposed in this pull request? Related to https://github.com/apache/spark/pull/14524 -- just the 'fix' rather than a behavior change. - PythonMLlibAPI methods that take a seed now always take a `java.lang.Long` consistently, allowing the Python API to specify "no seed" - .mllib's Word2VecModel seemed to be an odd man out in .mllib in that it picked its own random seed. Instead it defaults to None, meaning, letting the Scala implementation pick a seed - BisectingKMeansModel arguably should not hard-code a seed for consistency with .mllib, I think. However I left it. ## How was this patch tested? Existing tests Author: Sean Owen <sowen@cloudera.com> Closes #14826 from srowen/SPARK-16832.2.
* [SPARK-17308] Improved the spark core code by replacing all pattern match on ↵Shivansh2016-09-041-5/+4
| | | | | | | | | | | | | | | boolean value by if/else block. ## What changes were proposed in this pull request? Improved the code quality of spark by replacing all pattern match on boolean value by if/else block. ## How was this patch tested? By running the tests Author: Shivansh <shiv4nsh@gmail.com> Closes #14873 from shiv4nsh/SPARK-17308.
* [SPARK-17315][SPARKR] Kolmogorov-Smirnov test SparkR wrapperJunyang Qian2016-09-031-0/+57
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR tries to add Kolmogorov-Smirnov Test wrapper to SparkR. This wrapper implementation only supports one sample test against normal distribution. ## How was this patch tested? R unit test. Author: Junyang Qian <junyangq@databricks.com> Closes #14881 from junyangq/SPARK-17315.
* [SPARK-17363][ML][MLLIB] fix MultivariantOnlineSummerizer.numNonZerosWeichenXu2016-09-032-3/+3
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? fix `MultivariantOnlineSummerizer.numNonZeros` method, return `nnz` array, instead of `weightSum` array ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14923 from WeichenXu123/fix_MultivariantOnlineSummerizer_numNonZeros.
* [SPARK-15509][ML][SPARKR] R MLlib algorithms should support input columns ↵Xin Ren2016-09-029-14/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "features" and "label" https://issues.apache.org/jira/browse/SPARK-15509 ## What changes were proposed in this pull request? Currently in SparkR, when you load a LibSVM dataset using the sqlContext and then pass it to an MLlib algorithm, the ML wrappers will fail since they will try to create a "features" column, which conflicts with the existing "features" column from the LibSVM loader. E.g., using the "mnist" dataset from LibSVM: `training <- loadDF(sqlContext, ".../mnist", "libsvm")` `model <- naiveBayes(label ~ features, training)` This fails with: ``` 16/05/24 11:52:41 ERROR RBackendHandler: fit on org.apache.spark.ml.r.NaiveBayesWrapper failed Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.IllegalArgumentException: Output column features already exists. at org.apache.spark.ml.feature.VectorAssembler.transformSchema(VectorAssembler.scala:120) at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:179) at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57) at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66) at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186) at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:179) at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:67) at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:131) at org.apache.spark.ml.feature.RFormula.fit(RFormula.scala:169) at org.apache.spark.ml.r.NaiveBayesWrapper$.fit(NaiveBayesWrapper.scala:62) at org.apache.spark.ml.r.NaiveBayesWrapper.fit(NaiveBayesWrapper.sca The same issue appears for the "label" column once you rename the "features" column. ``` The cause is, when using `loadDF()` to generate dataframes, sometimes it’s with default column name `“label”` and `“features”`, and these two name will conflict with default column names `setDefault(labelCol, "label")` and ` setDefault(featuresCol, "features")` of `SharedParams.scala` ## How was this patch tested? Test on my local machine. Author: Xin Ren <iamshrek@126.com> Closes #13584 from keypointt/SPARK-15509.
* [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arraysSean Owen2016-09-014-7/+7
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]() ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14895 from srowen/SPARK-17331.
* [SPARK-17241][SPARKR][MLLIB] SparkR spark.glm should have configurable ↵Xin Ren2016-08-312-1/+43
| | | | | | | | | | | | | | | | | | regularization parameter https://issues.apache.org/jira/browse/SPARK-17241 ## What changes were proposed in this pull request? Spark has configurable L2 regularization parameter for generalized linear regression. It is very important to have them in SparkR so that users can run ridge regression. ## How was this patch tested? Test manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #14856 from keypointt/SPARK-17241.
* [MINOR][MLLIB][SQL] Clean up unused variables and unused importXin Ren2016-08-305-8/+8
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Clean up unused variables and unused import statements, unnecessary `return` and `toArray`, and some more style improvement, when I walk through the code examples. ## How was this patch tested? Testet manually on local laptop. Author: Xin Ren <iamshrek@126.com> Closes #14836 from keypointt/codeWalkThroughML.
* [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when ↵Sean Owen2016-08-274-53/+76
| | | | | | | | | | | | | | | | withMean=True ## What changes were proposed in this pull request? Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages. ## How was this patch tested? Jenkins tests, including new caes to reflect the new behavior. Author: Sean Owen <sowen@cloudera.com> Closes #14663 from srowen/SPARK-17001.
* [ML][MLLIB] The require condition and message doesn't match in SparseMatrix.Peng, Meng2016-08-271-4/+7
| | | | | | | | | | | | | ## What changes were proposed in this pull request? The require condition and message doesn't match, and the condition also should be optimized. Small change. Please kindly let me know if JIRA required. ## How was this patch tested? No additional test required. Author: Peng, Meng <peng.meng@intel.com> Closes #14824 from mpjlu/smallChangeForMatrixRequire.
* [SPARK-17207][MLLIB] fix comparing Vector bug in TestingUtilsPeng, Meng2016-08-263-10/+285
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? fix comparing Vector bug in TestingUtils. There is the same bug for Matrix comparing. How to check the length of Matrix should be discussed first. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Peng, Meng <peng.meng@intel.com> Closes #14785 from mpjlu/testUtils.
* [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkRXin Ren2016-08-242-0/+136
| | | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-16445 ## What changes were proposed in this pull request? Create Multilayer Perceptron Classifier wrapper in SparkR ## How was this patch tested? Tested manually on local machine Author: Xin Ren <iamshrek@126.com> Closes #14447 from keypointt/SPARK-16445.
* [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer ↵VinceShieh2016-08-242-1/+25
| | | | | | | | | | | | | | | | | | | | | | when some quantiles are duplicated ## What changes were proposed in this pull request? In cases when QuantileDiscretizerSuite is called upon a numeric array with duplicated elements, we will take the unique elements generated from approxQuantiles as input for Bucketizer. ## How was this patch tested? An unit test is added in QuantileDiscretizerSuite QuantileDiscretizer.fit will throw an illegal exception when calling setSplits on a list of splits with duplicated elements. Bucketizer.setSplits should only accept either a numeric vector of two or more unique cut points, although that may produce less number of buckets than requested. Signed-off-by: VinceShieh <vincent.xieintel.com> Author: VinceShieh <vincent.xie@intel.com> Closes #14747 from VinceShieh/SPARK-17086.
* [TRIVIAL] Typo FixZheng RuiFeng2016-08-231-1/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? Fix a typo ## How was this patch tested? no tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14772 from zhengruifeng/minor_numClasses.
* [SPARK-17095] [Documentation] [Latex and Scala doc do not play nicely]Jagadeesan2016-08-234-13/+24
| | | | | | | | | | ## What changes were proposed in this pull request? In Latex, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added Mathjax to render Latex equations in scaladoc. However, when scala doc sees "}}}" or "{{{" it treats it as a special character for code block. This results in some very strange output. Author: Jagadeesan <as2@us.ibm.com> Closes #14688 from jagadeesanas2/SPARK-17095.
* [SPARK-17090][FOLLOW-UP][ML] Add expert param support to SharedParamsCodeGenhqzizania2016-08-222-6/+12
| | | | | | | | | | ## What changes were proposed in this pull request? Add expert param support to SharedParamsCodeGen where aggregationDepth a expert param is added. Author: hqzizania <hqzizania@gmail.com> Closes #14738 from hqzizania/SPARK-17090-minor.
* [SPARK-15113][PYSPARK][ML] Add missing num features num classesHolden Karau2016-08-221-0/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Expiremental to match Scala doc. ## How was this patch tested? Extended doctests Author: Holden Karau <holden@us.ibm.com> Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
* [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalogWenchen Fan2016-08-211-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Spark SQL doesn't have its own meta store yet, and use hive's currently. However, hive's meta store has some limitations(e.g. columns can't be too many, not case-preserving, bad decimal type support, etc.), so we have some hacks to successfully store data source table metadata into hive meta store, i.e. put all the information in table properties. This PR moves these hacks to `HiveExternalCatalog`, tries to isolate hive specific logic in one place. changes overview: 1. **before this PR**: we need to put metadata(schema, partition columns, etc.) of data source tables to table properties before saving it to external catalog, even the external catalog doesn't use hive metastore(e.g. `InMemoryCatalog`) **after this PR**: the table properties tricks are only in `HiveExternalCatalog`, the caller side doesn't need to take care of it anymore. 2. **before this PR**: because the table properties tricks are done outside of external catalog, so we also need to revert these tricks when we read the table metadata from external catalog and use it. e.g. in `DescribeTableCommand` we will read schema and partition columns from table properties. **after this PR**: The table metadata read from external catalog is exactly the same with what we saved to it. bonus: now we can create data source table using `SessionCatalog`, if schema is specified. breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #14155 from cloud-fan/catalog-table.
* [SPARK-17090][ML] Make tree aggregation level in linear/logistic regression ↵hqzizania2016-08-205-17/+74
| | | | | | | | | | | | configurable ## What changes were proposed in this pull request? Linear/logistic regression use treeAggregate with default depth (always = 2) for collecting coefficient gradient updates to the driver. For high dimensional problems, this can cause OOM error on the driver. This patch makes it configurable to avoid this problem if users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala`, and extends it to both Linear regression and logistic regression in .ml Author: hqzizania <hqzizania@gmail.com> Closes #14717 from hqzizania/SPARK-17090.
* [SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapperJunyang Qian2016-08-192-0/+121
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated. ## How was this patch tested? SparkR unit tests. (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) ![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png) ![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png) Author: Junyang Qian <junyangq@databricks.com> Closes #14384 from junyangq/SPARK-16443.
* [SPARK-17141][ML] MinMaxScaler should remain NaN value.Yanbo Liang2016-08-192-2/+31
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In the existing code, ```MinMaxScaler``` handle ```NaN``` value indeterminately. * If a column has identity value, that is ```max == min```, ```MinMaxScalerModel``` transformation will output ```0.5``` for all rows even the original value is ```NaN```. * Otherwise, it will remain ```NaN``` after transformation. I think we should unify the behavior by remaining ```NaN``` value at any condition, since we don't know how to transform a ```NaN``` value. In Python sklearn, it will throw exception when there is ```NaN``` in the dataset. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14716 from yanboliang/spark-17141.
* [SPARK-7159][ML] Add multiclass logistic regression to Spark MLsethah2016-08-184-88/+2062
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch adds a new estimator/transformer `MultinomialLogisticRegression` to spark ML. JIRA: [SPARK-7159](https://issues.apache.org/jira/browse/SPARK-7159) ## How was this patch tested? Added new test suite `MultinomialLogisticRegressionSuite`. ## Approach ### Do not use a "pivot" class in the algorithm formulation Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models where K is the number of possible outcomes in the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach to the problem models class conditional probabilites using the softmax function and will return uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by dbtsai. ### Separate multinomial logistic regression and binary logistic regression The initial design makes multinomial logistic regression a separate estimator/transformer than the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one. **Arguments for:** * The multinomial case without pivot is distinctly different than the current binary case since the binary case uses a pivot class. * The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant. **Arguments against:** * It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two class outcomes). * Some portions of the code are repeated. This is a major design point and warrants more discussion. ### Mean centering When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail [here](https://core.ac.uk/download/files/153/6287975.pdf). R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well. ### Feature scaling In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. This same approach is maintained in this patch as well, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space _before_ optimizaton. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this has significant (4x in some cases) slow downs in each iteration. This can be avoided by transforming the input to the scaled space ala MLlib once, before iteration begins. This does add some overhead initially, but can make significant time savings in some cases. One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm _much_ slower. The tradeoffs here merit more discussion. ### Specifying and inferring the number of outcome classes The estimator checks the dataframe label column for metadata which specifies the number of values. If they are not specified, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption then, is that the labels are zero-indexed when they are provided to the algorithm. ## Performance Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary. Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each Notes: * Time in units of seconds * Metric is classification accuracy | algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam | |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------| | ml | 0 | true | 0.746415 | 30 | 100000 | 3 | 100000 | 327.923 | true | 0 | | mllib | 0 | true | 0.743785 | 30 | 100000 | 3 | 100000 | 390.217 | true | 0 | | algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam | |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------| | ml | 0 | true | 0.973238 | 30 | 2000000 | 3 | 10000 | 385.476 | true | 0 | | mllib | 0 | true | 0.949828 | 30 | 2000000 | 3 | 10000 | 550.403 | true | 0 | | algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam | |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------| | mllib | 0 | true | 0.864358 | 30 | 2000000 | 3 | 10000 | 543.359 | true | 0.1 | | ml | 0 | true | 0.867418 | 30 | 2000000 | 3 | 10000 | 401.955 | true | 0.1 | | algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam | |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------| | ml | 1 | true | 0.807449 | 30 | 2000000 | 3 | 10000 | 334.892 | true | 0.05 | | algo | elasticNetParam | fitIntercept | metric | maxIter | numPoints | numClasses | numFeatures | time | standardization | regParam | |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------| | ml | 0 | true | 0.602006 | 30 | 2000000 | 500 | 100 | 112.319 | true | 0 | | mllib | 0 | true | 0.567226 | 30 | 2000000 | 500 | 100 | 263.768 | true | 0 |e | 0.567226 | 30 | 2000000 | 500 | 100 | 263.768 | true | 0 | ## References Friedman, et al. ["Regularization Paths for Generalized Linear Models via Coordinate Descent"](https://core.ac.uk/download/files/153/6287975.pdf) [http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html](http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html) ## Follow up items * Consider using level 2 BLAS routines in the gradient computations - [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134) * Add model summary for MLOR - [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) * Add initial model to MLOR and add test for intercept priors - [SPARK-17140](https://issues.apache.org/jira/browse/SPARK-17140) * Python API - [SPARK-17138](https://issues.apache.org/jira/browse/SPARK-17138) * Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090) * Refactor helper classes out of `LogisticRegression.scala` - [SPARK-17135](https://issues.apache.org/jira/browse/SPARK-17135) * Design optimizer interface for added flexibility in ML algos - [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136) * Support compressing the coefficients and intercepts for MLOR models - [SPARK-17137](https://issues.apache.org/jira/browse/SPARK-17137) Author: sethah <seth.hendrickson16@gmail.com> Closes #13796 from sethah/SPARK-7159_M.
* [SPARK-16447][ML][SPARKR] LDA wrapper in SparkRXusen Yin2016-08-183-0/+222
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add LDA Wrapper in SparkR with the following interfaces: - spark.lda(data, ...) - spark.posterior(object, newData, ...) - spark.perplexity(object, ...) - summary(object) - write.ml(object) - read.ml(path) ## How was this patch tested? Test with SparkR unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #14229 from yinxusen/SPARK-16447.
* [SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkRYanbo Liang2016-08-172-0/+130
| | | | | | | | | | | | ## What changes were proposed in this pull request? Gaussian Mixture Model wrapper in SparkR, similarly to R's ```mvnormalmixEM```. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14392 from yanboliang/spark-16446.
* [SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkRwm624@hotmail.com2016-08-172-0/+121
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Add Isotonic Regression wrapper in SparkR Wrappers in R and Scala are added. Unit tests Documentation ## How was this patch tested? Manually tested with sudo ./R/run-tests.sh (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Author: wm624@hotmail.com <wm624@hotmail.com> Closes #14182 from wangmiao1981/isoR.
* [SPARK-16934][ML][MLLIB] Update LogisticCostAggregator serialization code to ↵WeichenXu2016-08-151-16/+20
| | | | | | | | | | | | | | | | | | | make it consistent with LinearRegression ## What changes were proposed in this pull request? Update LogisticCostAggregator serialization code to make it consistent with #14109 ## How was this patch tested? MLlib 2.0: ![image](https://cloud.githubusercontent.com/assets/19235986/17649601/5e2a79ac-61ee-11e6-833c-3bd8b5250470.png) After this PR: ![image](https://cloud.githubusercontent.com/assets/19235986/17649599/52b002ae-61ee-11e6-9402-9feb3439880f.png) Author: WeichenXu <WeichenXu123@outlook.com> Closes #14520 from WeichenXu123/improve_logistic_regression_costfun.
* [TRIVIAL][ML] Fix LogisticRegression typo in error message.Yanbo Liang2016-08-151-1/+1
| | | | | | | | | | | | ## What changes were proposed in this pull request? Fix ```LogisticRegression``` typo in error message. ## How was this patch tested? Docs change, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14633 from yanboliang/lr-typo.
* [SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySizezero3232016-08-142-4/+30
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Replaces custom choose function with o.a.commons.math3.CombinatoricsUtils.binomialCoefficient ## How was this patch tested? Spark unit tests Author: zero323 <zero323@users.noreply.github.com> Closes #14614 from zero323/SPARK-17027.
* [SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve ↵Yanbo Liang2016-08-121-1/+2
| | | | | | | | | | | | | | | performance ## What changes were proposed in this pull request? ```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test of dataset with 200 features and 1M instance, I found there is 20% increased performance. BTW, we should destroy broadcast variable ```compute``` at the end of each iteration. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14621 from yanboliang/spark-17033.
* [SPARK-16710][SPARKR][ML] spark.glm should support weightColYanbo Liang2016-08-101-1/+3
| | | | | | | | | | | | ## What changes were proposed in this pull request? Training GLMs on weighted dataset is very important use cases, but it is not supported by SparkR currently. Users can pass argument ```weights``` to specify the weights vector in native R. For ```spark.glm```, we can pass in the ```weightCol``` which is consistent with MLlib. ## How was this patch tested? Unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14346 from yanboliang/spark-16710.
* [SPARK-16933][ML] Fix AFTAggregator in AFTSurvivalRegression serializes ↵Yanbo Liang2016-08-091-18/+29
| | | | | | | | | | | | | | | | | | | | unnecessary data. ## What changes were proposed in this pull request? Similar to ```LeastSquaresAggregator``` in #14109, ```AFTAggregator``` used for ```AFTSurvivalRegression``` ends up serializing the ```parameters``` and ```featuresStd```, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. This PR is highly inspired by #14109. ## How was this patch tested? I tested this locally and verified the serialization reduction. Before patch ![image](https://cloud.githubusercontent.com/assets/1962026/17512035/abb93f04-5dda-11e6-97d3-8ae6b61a0dfd.png) After patch ![image](https://cloud.githubusercontent.com/assets/1962026/17512024/9e0dc44c-5dda-11e6-93d0-6e130ba0d6aa.png) Author: Yanbo Liang <ybliang8@gmail.com> Closes #14519 from yanboliang/spark-16933.
* [SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add ↵Holden Karau2016-08-081-1/+0
| | | | | | | | | | | | | | | | much and remove whitelisting ## What changes were proposed in this pull request? Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability. ## How was this patch tested? Existing tests. Author: Holden Karau <holden@us.ibm.com> Closes #14407 from holdenk/SPARK-16779.
* [SPARK-16404][ML] LeastSquaresAggregators serializes unnecessary datasethah2016-08-081-20/+45
| | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Similar to `LogisticAggregator`, `LeastSquaresAggregator` used for linear regression ends up serializing the coefficients and the features standard deviations, which is not necessary and can cause performance issues for high dimensional data. This patch removes this serialization. In https://github.com/apache/spark/pull/13729 the approach was to pass these values directly to the add method. The approach used here, initially, is to mark these fields as transient instead which gives the benefit of keeping the signature of the add method simple and interpretable. The downside is that it requires the use of `transient lazy val`s which are difficult to reason about if one is not quite familiar with serialization in Scala/Spark. ## How was this patch tested? **MLlib** ![image](https://cloud.githubusercontent.com/assets/7275795/16703660/436f79fa-4524-11e6-9022-ef00058ec718.png) **ML without patch** ![image](https://cloud.githubusercontent.com/assets/7275795/16703831/c4d50b9e-4525-11e6-80cb-9b58c850cd41.png) **ML with patch** ![image](https://cloud.githubusercontent.com/assets/7275795/16703675/63e0cf40-4524-11e6-9120-1f512a70e083.png) Author: sethah <seth.hendrickson16@gmail.com> Closes #14109 from sethah/LIR_serialize.
* [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for ↵Yanbo Liang2016-08-054-5/+16
| | | | | | | | | | | | | | | StringIndexer/VectorAssembler and fix failed tests. ## What changes were proposed in this pull request? This is follow-up for #14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review. The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process. ## How was this patch tested? Modified unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14455 from yanboliang/transformSchema.
* [SPARK-16863][ML] ProbabilisticClassifier.fit check threshoulds' lengthZheng RuiFeng2016-08-044-0/+28
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add threshoulds' length checking for Classifiers which extends ProbabilisticClassifier ## How was this patch tested? unit tests and manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.
* [SPARK-16880][ML][MLLIB] make ann training data persisted if neededWeichenXu2016-08-041-2/+7
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? To Make sure ANN layer input training data to be persisted, so that it can avoid overhead cost if the RDD need to be computed from lineage. ## How was this patch tested? Existing Tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14483 from WeichenXu123/add_ann_persist_training_data.
* [SPARK-16822][DOC] Support latex in scaladoc.Shuai Lin2016-08-026-117/+205
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Support using latex in scaladoc by adding MathJax javascript to the js template. ## How was this patch tested? Generated scaladoc. Preview: - LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) - MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) Author: Shuai Lin <linshuai2012@gmail.com> Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
* [SPARK-16851][ML] Incorrect threshould length in 'setThresholds()' evoke ↵Zheng RuiFeng2016-08-021-1/+6
| | | | | | | | | | | | | | Exception ## What changes were proposed in this pull request? Add a length checking for threshoulds' length in method `setThreshoulds()` of classification models. ## How was this patch tested? unit tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #14457 from zhengruifeng/check_setThresholds.
* [SPARK-16485][DOC][ML] Remove useless latex in a log messge.Shuai Lin2016-08-011-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Removed useless latex in a log messge. ## How was this patch tested? Check generated scaladoc. Author: Shuai Lin <linshuai2012@gmail.com> Closes #14380 from lins05/fix-docs-formatting.
* [SPARK-16696][ML][MLLIB] destroy KMeans bcNewCenters when loop finished and ↵WeichenXu2016-07-303-9/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | update code where should release unused broadcast/RDD in proper time ## What changes were proposed in this pull request? update unused broadcast in KMeans/Word2Vec, use destroy(false) to release memory in time. and several place destroy() update to destroy(false) so that it will be async-called, it will better than blocking called. and update bcNewCenters in KMeans to make it destroy in correct time. I use a list to store all historical `bcNewCenters` generated in each loop iteration and delay them to release at the end of loop. fix TODO in `BisectingKMeans.run` "unpersist old indices", Implements the pattern "persist current step RDD, and unpersist previous one" in the loop iteration. ## How was this patch tested? Existing tests. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14333 from WeichenXu123/broadvar_unpersist_to_destroy.