author     Holden Karau <holden@pigscanfly.ca>    2015-08-04 18:15:26 -0700
committer  Joseph K. Bradley <joseph@databricks.com>    2015-08-04 18:15:26 -0700
commit     d92fa14179287c996407d9c7d249103109f9cdef (patch)
tree       c7b0cf69980dbafad77fa675d53270d49680ba1a /mllib/src/test
parent     629e26f7ee916e70f59b017cb6083aa441b26b2c (diff)
download   spark-d92fa14179287c996407d9c7d249103109f9cdef.tar.gz
           spark-d92fa14179287c996407d9c7d249103109f9cdef.tar.bz2
           spark-d92fa14179287c996407d9c7d249103109f9cdef.zip
[SPARK-8601] [ML] Add an option to disable standardization for linear regression
All compressed sensing applications, and some regression use cases, get better results with feature scaling turned off. However, implementing this naively, by training on the dataset without any standardization, gives a poor rate of convergence. Instead, we can still standardize the training dataset but penalize each component differently, which yields effectively the same objective function as the unstandardized problem while remaining numerically well conditioned. As a result, columns with high variance are penalized less, and vice versa; without this, all features are standardized and therefore penalized equally.

In R's glmnet, there is an option for this:

    standardize: Logical flag for x variable standardization, prior to
    fitting the model sequence. The coefficients are always returned on
    the original scale. Default is standardize=TRUE. If variables are in
    the same units already, you might not wish to standardize. See details
    below for y standardization with family="gaussian".

Note that the primary author for this PR is holdenk.

Author: Holden Karau <holden@pigscanfly.ca>
Author: DB Tsai <dbt@netflix.com>

Closes #7875 from dbtsai/SPARK-8522 and squashes the following commits:

e856036 [DB Tsai] scala doc
596e96c [DB Tsai] minor
bbff347 [DB Tsai] naming
baa0805 [DB Tsai] touch up
d6234ba [DB Tsai] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
6b1dc09 [Holden Karau] Merge branch 'master' into SPARK-8522-Disable-Linear_featureScaling-Spark-8601-in-Linear_regression
332f140 [Holden Karau] Merge in master
eebe10a [Holden Karau] Use same comparison operator throughout the test
3f92935 [Holden Karau] merge
b83a41e [Holden Karau] Expand the tests and make them similar to the other PR, which also provides an option to disable standardization (but for LoR).
0c334a2 [Holden Karau] Remove extra line
99ce053 [Holden Karau] merge in master
e54a8a9 [Holden Karau] Fix long line
e47c574 [Holden Karau] Add support for L2 without standardization.
55d3a66 [Holden Karau] Add standardization param for linear regression
00a1dc5 [Holden Karau] Add the param to the linearregression impl
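To make the penalty-rescaling argument concrete, here is a short derivation; this is an illustrative sketch (with sigma_j denoting the standard deviation of column j), not text from the patch. The unstandardized L1 objective over original features x and weights w is

    \min_w \; \frac{1}{2n}\sum_{i=1}^n \bigl(y_i - w^\top x_i\bigr)^2
        + \lambda \sum_j |w_j|

Substituting \hat{x}_{ij} = x_{ij}/\sigma_j and \hat{w}_j = w_j\sigma_j (so that w^\top x_i = \hat{w}^\top \hat{x}_i) turns this into

    \min_{\hat{w}} \; \frac{1}{2n}\sum_{i=1}^n \bigl(y_i - \hat{w}^\top \hat{x}_i\bigr)^2
        + \lambda \sum_j \frac{|\hat{w}_j|}{\sigma_j}

That is, the same objective is solved in the well-conditioned standardized space, but component j is penalized by \lambda/\sigma_j: high-variance columns are penalized less, exactly as described above.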
Diffstat (limited to 'mllib/src/test')
-rw-r--r--  mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala | 278
1 file changed, 205 insertions, 73 deletions
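Before the diff itself, a minimal sketch of how the new parameter is meant to be used; this is my own illustration based on the tests below, where `dataset` stands for any DataFrame with "features" and "label" columns, as in the suite:

    import org.apache.spark.ml.regression.LinearRegression

    // Two otherwise-identical estimators; only the new flag differs.
    val standardized = new LinearRegression()
      .setElasticNetParam(1.0)   // pure L1, as in the L1 tests
      .setRegParam(0.57)
    val unstandardized = new LinearRegression()
      .setElasticNetParam(1.0)
      .setRegParam(0.57)
      .setStandardization(false)

    // With regParam = 0 the two converge to the same solution; with
    // L1/L2 regularization they differ, matching glmnet's
    // standardize=TRUE vs. standardize=FALSE reference values below.
    val model1 = standardized.fit(dataset)
    val model2 = unstandardized.fit(dataset)

    // Coefficients are reported on the original feature scale either way.
    println(s"${model1.weights} ${model1.intercept}")
    println(s"${model2.weights} ${model2.intercept}")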
diff --git a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
index 7cdda3db88..21ad8225bd 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala
@@ -70,6 +70,7 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
assert(lir.getRegParam === 0.0)
assert(lir.getElasticNetParam === 0.0)
assert(lir.getFitIntercept)
+ assert(lir.getStandardization)
val model = lir.fit(dataset)
model.transform(dataset)
.select("label", "prediction")
@@ -81,8 +82,11 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
}
test("linear regression with intercept without regularization") {
- val trainer = new LinearRegression
- val model = trainer.fit(dataset)
+ val trainer1 = new LinearRegression
+ // Without regularization, the result should be the same regardless of standardization
+ val trainer2 = (new LinearRegression).setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
Using the following R code to load the data and train the model using the glmnet package.
@@ -95,28 +99,36 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
> weights
3 x 1 sparse Matrix of class "dgCMatrix"
s0
- (Intercept) 6.300528
- as.numeric.data.V2. 4.701024
- as.numeric.data.V3. 7.198257
+ (Intercept) 6.298698
+ as.numeric.data.V2. 4.700706
+ as.numeric.data.V3. 7.199082
*/
val interceptR = 6.298698
val weightsR = Vectors.dense(4.700706, 7.199082)
- assert(model.intercept ~== interceptR relTol 1E-3)
- assert(model.weights ~= weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR relTol 1E-3)
+ assert(model1.weights ~= weightsR relTol 1E-3)
+ assert(model2.intercept ~== interceptR relTol 1E-3)
+ assert(model2.weights ~= weightsR relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
test("linear regression without intercept without regularization") {
- val trainer = (new LinearRegression).setFitIntercept(false)
- val model = trainer.fit(dataset)
- val modelWithoutIntercept = trainer.fit(datasetWithoutIntercept)
+ val trainer1 = (new LinearRegression).setFitIntercept(false)
+ // Without regularization the results should be the same
+ val trainer2 = (new LinearRegression).setFitIntercept(false).setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val modelWithoutIntercept1 = trainer1.fit(datasetWithoutIntercept)
+ val model2 = trainer2.fit(dataset)
+ val modelWithoutIntercept2 = trainer2.fit(datasetWithoutIntercept)
+
/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 0, lambda = 0,
@@ -130,26 +142,34 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
*/
val weightsR = Vectors.dense(6.995908, 5.275131)
- assert(model.intercept ~== 0 absTol 1E-3)
- assert(model.weights ~= weightsR relTol 1E-3)
+ assert(model1.intercept ~== 0 absTol 1E-3)
+ assert(model1.weights ~= weightsR relTol 1E-3)
+ assert(model2.intercept ~== 0 absTol 1E-3)
+ assert(model2.weights ~= weightsR relTol 1E-3)
+
/*
Then again with the data with no intercept:
> weightsWithoutIntercept
3 x 1 sparse Matrix of class "dgCMatrix"
- s0
+ s0
(Intercept) .
as.numeric.data3.V2. 4.70011
as.numeric.data3.V3. 7.19943
*/
val weightsWithoutInterceptR = Vectors.dense(4.70011, 7.19943)
- assert(modelWithoutIntercept.intercept ~== 0 absTol 1E-3)
- assert(modelWithoutIntercept.weights ~= weightsWithoutInterceptR relTol 1E-3)
+ assert(modelWithoutIntercept1.intercept ~== 0 absTol 1E-3)
+ assert(modelWithoutIntercept1.weights ~= weightsWithoutInterceptR relTol 1E-3)
+ assert(modelWithoutIntercept2.intercept ~== 0 absTol 1E-3)
+ assert(modelWithoutIntercept2.weights ~= weightsWithoutInterceptR relTol 1E-3)
}
test("linear regression with intercept with L1 regularization") {
- val trainer = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
- val model = trainer.fit(dataset)
+ val trainer1 = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
+ val trainer2 = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
+ .setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 1.0, lambda = 0.57))
@@ -160,24 +180,44 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
as.numeric.data.V2. 4.024821
as.numeric.data.V3. 6.679841
*/
- val interceptR = 6.24300
- val weightsR = Vectors.dense(4.024821, 6.679841)
+ val interceptR1 = 6.24300
+ val weightsR1 = Vectors.dense(4.024821, 6.679841)
- assert(model.intercept ~== interceptR relTol 1E-3)
- assert(model.weights ~= weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR1 relTol 1E-3)
+ assert(model1.weights ~= weightsR1 relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+ /*
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 1.0, lambda = 0.57,
+ standardize=FALSE))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) 6.416948
+ as.numeric.data.V2. 3.893869
+ as.numeric.data.V3. 6.724286
+ */
+ val interceptR2 = 6.416948
+ val weightsR2 = Vectors.dense(3.893869, 6.724286)
+
+ assert(model2.intercept ~== interceptR2 relTol 1E-3)
+ assert(model2.weights ~= weightsR2 relTol 1E-3)
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
test("linear regression without intercept with L1 regularization") {
- val trainer = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
+ val trainer1 = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
.setFitIntercept(false)
- val model = trainer.fit(dataset)
+ val trainer2 = (new LinearRegression).setElasticNetParam(1.0).setRegParam(0.57)
+ .setFitIntercept(false).setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 1.0, lambda = 0.57,
@@ -189,51 +229,90 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
as.numeric.data.V2. 6.299752
as.numeric.data.V3. 4.772913
*/
- val interceptR = 0.0
- val weightsR = Vectors.dense(6.299752, 4.772913)
+ val interceptR1 = 0.0
+ val weightsR1 = Vectors.dense(6.299752, 4.772913)
- assert(model.intercept ~== interceptR absTol 1E-5)
- assert(model.weights ~= weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR1 absTol 1E-3)
+ assert(model1.weights ~= weightsR1 relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+ /*
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 1.0, lambda = 0.57,
+ intercept=FALSE, standardize=FALSE))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) .
+ as.numeric.data.V2. 6.232193
+ as.numeric.data.V3. 4.764229
+ */
+ val interceptR2 = 0.0
+ val weightsR2 = Vectors.dense(6.232193, 4.764229)
+
+ assert(model2.intercept ~== interceptR2 absTol 1E-3)
+ assert(model2.weights ~= weightsR2 relTol 1E-3)
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
test("linear regression with intercept with L2 regularization") {
- val trainer = (new LinearRegression).setElasticNetParam(0.0).setRegParam(2.3)
- val model = trainer.fit(dataset)
+ val trainer1 = (new LinearRegression).setElasticNetParam(0.0).setRegParam(2.3)
+ val trainer2 = (new LinearRegression).setElasticNetParam(0.0).setRegParam(2.3)
+ .setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
- weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.0, lambda = 2.3))
- > weights
- 3 x 1 sparse Matrix of class "dgCMatrix"
- s0
- (Intercept) 6.328062
- as.numeric.data.V2. 3.222034
- as.numeric.data.V3. 4.926260
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.0, lambda = 2.3))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) 5.269376
+ as.numeric.data.V2. 3.736216
+ as.numeric.data.V3. 5.712356
*/
- val interceptR = 5.269376
- val weightsR = Vectors.dense(3.736216, 5.712356)
+ val interceptR1 = 5.269376
+ val weightsR1 = Vectors.dense(3.736216, 5.712356)
- assert(model.intercept ~== interceptR relTol 1E-3)
- assert(model.weights ~= weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR1 relTol 1E-3)
+ assert(model1.weights ~= weightsR1 relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+ /*
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.0, lambda = 2.3,
+ standardize=FALSE))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) 5.791109
+ as.numeric.data.V2. 3.435466
+ as.numeric.data.V3. 5.910406
+ */
+ val interceptR2 = 5.791109
+ val weightsR2 = Vectors.dense(3.435466, 5.910406)
+
+ assert(model2.intercept ~== interceptR2 relTol 1E-3)
+ assert(model2.weights ~= weightsR2 relTol 1E-3)
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
test("linear regression without intercept with L2 regularization") {
- val trainer = (new LinearRegression).setElasticNetParam(0.0).setRegParam(2.3)
+ val trainer1 = (new LinearRegression).setElasticNetParam(0.0).setRegParam(2.3)
.setFitIntercept(false)
- val model = trainer.fit(dataset)
+ val trainer2 = (new LinearRegression).setElasticNetParam(0.0).setRegParam(2.3)
+ .setFitIntercept(false).setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.0, lambda = 2.3,
@@ -245,23 +324,42 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
as.numeric.data.V2. 5.522875
as.numeric.data.V3. 4.214502
*/
- val interceptR = 0.0
- val weightsR = Vectors.dense(5.522875, 4.214502)
+ val interceptR1 = 0.0
+ val weightsR1 = Vectors.dense(5.522875, 4.214502)
- assert(model.intercept ~== interceptR absTol 1E-3)
- assert(model.weights ~== weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR1 absTol 1E-3)
+ assert(model1.weights ~= weightsR1 relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+ /*
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.0, lambda = 2.3,
+ intercept=FALSE, standardize=FALSE))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) .
+ as.numeric.data.V2. 5.263704
+ as.numeric.data.V3. 4.187419
+ */
+ val interceptR2 = 0.0
+ val weightsR2 = Vectors.dense(5.263704, 4.187419)
+
+ assert(model2.intercept ~== interceptR2 absTol 1E-3)
+ assert(model2.weights ~= weightsR2 relTol 1E-3)
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
test("linear regression with intercept with ElasticNet regularization") {
- val trainer = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
- val model = trainer.fit(dataset)
+ val trainer1 = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
+ val trainer2 = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
+ .setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.3, lambda = 1.6))
@@ -272,24 +370,43 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
as.numeric.data.V2. 3.168435
as.numeric.data.V3. 5.200403
*/
- val interceptR = 5.696056
- val weightsR = Vectors.dense(3.670489, 6.001122)
+ val interceptR1 = 5.696056
+ val weightsR1 = Vectors.dense(3.670489, 6.001122)
- assert(model.intercept ~== interceptR relTol 1E-3)
- assert(model.weights ~== weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR1 relTol 1E-3)
+ assert(model1.weights ~= weightsR1 relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+ /*
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.3, lambda = 1.6,
+ standardize=FALSE))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) 6.114723
+ as.numeric.data.V2. 3.409937
+ as.numeric.data.V3. 6.146531
+ */
+ val interceptR2 = 6.114723
+ val weightsR2 = Vectors.dense(3.409937, 6.146531)
+
+ assert(model2.intercept ~== interceptR2 relTol 1E-3)
+ assert(model2.weights ~= weightsR2 relTol 1E-3)
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
test("linear regression without intercept with ElasticNet regularization") {
- val trainer = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
+ val trainer1 = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
.setFitIntercept(false)
- val model = trainer.fit(dataset)
+ val trainer2 = (new LinearRegression).setElasticNetParam(0.3).setRegParam(1.6)
+ .setFitIntercept(false).setStandardization(false)
+ val model1 = trainer1.fit(dataset)
+ val model2 = trainer2.fit(dataset)
/*
weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.3, lambda = 1.6,
@@ -301,16 +418,32 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
as.numeric.dataM.V2. 5.673348
as.numeric.dataM.V3. 4.322251
*/
- val interceptR = 0.0
- val weightsR = Vectors.dense(5.673348, 4.322251)
+ val interceptR1 = 0.0
+ val weightsR1 = Vectors.dense(5.673348, 4.322251)
- assert(model.intercept ~== interceptR absTol 1E-3)
- assert(model.weights ~= weightsR relTol 1E-3)
+ assert(model1.intercept ~== interceptR1 absTol 1E-3)
+ assert(model1.weights ~= weightsR1 relTol 1E-3)
- model.transform(dataset).select("features", "prediction").collect().foreach {
+ /*
+ weights <- coef(glmnet(features, label, family="gaussian", alpha = 0.3, lambda = 1.6,
+ intercept=FALSE, standardize=FALSE))
+ > weights
+ 3 x 1 sparse Matrix of class "dgCMatrix"
+ s0
+ (Intercept) .
+ as.numeric.data.V2. 5.477988
+ as.numeric.data.V3. 4.297622
+ */
+ val interceptR2 = 0.0
+ val weightsR2 = Vectors.dense(5.477988, 4.297622)
+
+ assert(model2.intercept ~== interceptR2 absTol 1E-3)
+ assert(model2.weights ~= weightsR2 relTol 1E-3)
+
+ model1.transform(dataset).select("features", "prediction").collect().foreach {
case Row(features: DenseVector, prediction1: Double) =>
val prediction2 =
- features(0) * model.weights(0) + features(1) * model.weights(1) + model.intercept
+ features(0) * model1.weights(0) + features(1) * model1.weights(1) + model1.intercept
assert(prediction1 ~== prediction2 relTol 1E-5)
}
}
@@ -372,5 +505,4 @@ class LinearRegressionSuite extends SparkFunSuite with MLlibTestSparkContext {
.zip(testSummary.residuals.select("residuals").collect())
.forall { case (Row(r1: Double), Row(r2: Double)) => r1 ~== r2 relTol 1E-5 }
}
-
}
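As an aside, the per-row prediction check repeated throughout the suite (hard-coded for two features) generalizes to any dimensionality. A hypothetical helper, not part of the patch; the imports assume the suite's test scope, including the mllib TestingUtils implicits that provide `~==`:

    import org.apache.spark.ml.regression.LinearRegressionModel
    import org.apache.spark.mllib.linalg.DenseVector
    import org.apache.spark.mllib.util.TestingUtils._
    import org.apache.spark.sql.{DataFrame, Row}

    // Hypothetical helper: recompute each prediction as
    // weights . features + intercept and compare it against the
    // transformer's "prediction" column.
    def checkPredictions(model: LinearRegressionModel, data: DataFrame): Unit = {
      model.transform(data).select("features", "prediction").collect().foreach {
        case Row(features: DenseVector, prediction1: Double) =>
          val prediction2 = features.toArray.zip(model.weights.toArray)
            .map { case (x, w) => x * w }.sum + model.intercept
          assert(prediction1 ~== prediction2 relTol 1E-5)
      }
    }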