[SPARK-17163][ML] Unified LogisticRegression interface

## What changes were proposed in this pull request? Merge `MultinomialLogisticRegression` into `LogisticRegression` and remove `MultinomialLogisticRegression`. Marked as WIP because we should discuss the coefficients API in the model. See discussion below. JIRA: [SPARK-17163](https://issues.apache.org/jira/browse/SPARK-17163) ## How was this patch tested? Merged test suites and added some new unit tests. ## Design ### Switching between binomial and multinomial We default to automatically detecting whether we should run binomial or multinomial lor. We expose a new parameter called `family` which defaults to auto. When "auto" is used, we run normal binomial lor with pivoting if there are 1 or 2 label classes. Otherwise, we run multinomial. If the user explicitly sets the family, then we abide by that setting. In the case where "binomial" is set but multiclass lor is detected, we throw an error. ### coefficients/intercept model API (TODO) This is the biggest design point remaining, IMO. We need to decide how to store the coefficients and intercepts in the model, and in turn how to expose them via the API. Two important points: * We must maintain compatibility with the old API, i.e. we must expose `def coefficients: Vector` and `def intercept: Double` * There are two separate cases: binomial lr where we have a single set of coefficients and a single intercept and multinomial lr where we have `numClasses` sets of coefficients and `numClasses` intercepts. Some options: 1. **Store the binomial coefficients as a `2 x numFeatures` matrix.** This means that we would center the model coefficients before storing them in the model. The BLOR algorithm gives `1 * numFeatures` coefficients, but we would convert them to `2 x numFeatures` coefficients before storing them, effectively doubling the storage in the model. This has the advantage that we can make the code cleaner (i.e. less `if (isMultinomial) ... else ...`) and we don't have to reason about the different cases as much. It has the disadvantage that we double the storage space and we could see small regressions at prediction time since there are 2x the number of operations in the prediction algorithms. Additionally, we still have to produce the uncentered coefficients/intercept via the API, so we will have to either ALSO store the uncentered version, or compute it in `def coefficients: Vector` every time. 2. **Store the binomial coefficients as a `1 x numFeatures` matrix.** We still store the coefficients as a matrix and the intercepts as a vector. When users call `coefficients` we return them a `Vector` that is backed by the same underlying array as the `coefficientMatrix`, so we don't duplicate any data. At prediction time, we use the old prediction methods that are specialized for binary LOR. The benefits here are that we don't store extra data, and we won't see any regressions in performance. The cost of this is that we have separate implementations for predict methods in the binary vs multiclass case. The duplicated code is really not very high, but it's still a bit messy. If we do decide to store the 2x coefficients, we would likely want to see some performance tests to understand the potential regressions. **Update:** We have chosen option 2 ### Threshold/thresholds (TODO) Currently, when `threshold` is set we clear whatever value is in `thresholds` and when `thresholds` is set we clear whatever value is in `threshold`. [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) was created to prefer thresholds over threshold. We should decide if we should implement this behavior now or if we want to do it in a separate JIRA. **Update:** Let's leave it for a follow up PR ## Follow up * Summary model for multiclass logistic regression [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139) * Thresholds vs threshold [SPARK-11543](https://issues.apache.org/jira/browse/SPARK-11543) Author: sethah <seth.hendrickson16@gmail.com> Closes #14834 from sethah/SPARK-17163.
author: sethah <seth.hendrickson16@gmail.com> 2016-09-19 21:33:54 -0700
committer: DB Tsai <dbt@netflix.com> 2016-09-19 21:33:54 -0700
commit: 26145a5af9a88053c0eaf280206ca2621c8919f6 (patch)
tree: edf2965b4903c10504183006e0b60ee011dc286c /project/MimaExcludes.scala
parent: e719b1c045ba185d242d21bbfcdee2c84dafc587 (diff)
download: spark-26145a5af9a88053c0eaf280206ca2621c8919f6.tar.gz
spark-26145a5af9a88053c0eaf280206ca2621c8919f6.tar.bz2
spark-26145a5af9a88053c0eaf280206ca2621c8919f6.zip
1 files changed, 3 insertions, 0 deletions
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index 1bdcf9a623..d4cbf510b9 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -812,6 +812,9 @@ object MimaExcludes {
       ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToTotalCores"),
       ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToTasksMax"),
       ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ui.exec.ExecutorsListener.executorToJvmGCTime")
+    ) ++ Seq(
+      // [SPARK-17163] Unify logistic regression interface. Private constructor has new signature.
+      ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionModel.this")
     )
   }
author	sethah <seth.hendrickson16@gmail.com>	2016-09-19 21:33:54 -0700
committer	DB Tsai <dbt@netflix.com>	2016-09-19 21:33:54 -0700
commit	26145a5af9a88053c0eaf280206ca2621c8919f6 (patch)
tree	edf2965b4903c10504183006e0b60ee011dc286c /project/MimaExcludes.scala
parent	e719b1c045ba185d242d21bbfcdee2c84dafc587 (diff)
download	spark-26145a5af9a88053c0eaf280206ca2621c8919f6.tar.gz spark-26145a5af9a88053c0eaf280206ca2621c8919f6.tar.bz2 spark-26145a5af9a88053c0eaf280206ca2621c8919f6.zip