author      Xiangrui Meng <meng@databricks.com>    2015-02-16 22:09:04 -0800
committer   Xiangrui Meng <meng@databricks.com>    2015-02-16 22:09:12 -0800
commit      dfe0fa01cce2fefc272c0f05f7d63216be553e03
tree        30df91f06ca04f6e6d5bda717d88691ec39ab2ac /mllib/src/main
parent      d0701d9bfb238bb4f53a0454eb809ab160e17cec
[SPARK-5802][MLLIB] cache transformed data in glm
If we need to transform the input data, we should cache the transformed output so that the feature vectors are not recomputed on every iteration; a minimal sketch of this caching pattern follows the commit trailers below. cc dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #4593 from mengxr/SPARK-5802 and squashes the following commits:
ae3be84 [Xiangrui Meng] cache transformed data in glm
(cherry picked from commit fd84229e2aeb6a03760703c9dccd2db853779400)
Signed-off-by: Xiangrui Meng <meng@databricks.com>
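The pattern this commit describes, materializing a transformed RDD once with cache() before an iterative optimizer makes repeated passes over it, can be illustrated outside of GeneralizedLinearAlgorithm. The following is a minimal, self-contained sketch and not the actual GLM code: the object name, the toy (label, feature) pairs, and the hand-rolled gradient loop are hypothetical stand-ins for scaler.transform/appendBias and the MLlib optimizers.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical, minimal sketch of the caching pattern from this commit;
// not the actual GeneralizedLinearAlgorithm code.
object CacheTransformedDataSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("cache-transformed-data").setMaster("local[*]"))

    // Toy training data as (label, feature) pairs. In the real GLM code the
    // transformation below would be scaler.transform(...) plus appendBias(...).
    val input = sc.parallelize(Seq((1.0, 4.0), (0.0, 2.0), (1.0, 8.0)))

    // Transform once and cache, so the iterative passes below reuse the
    // materialized result instead of re-running the map on every pass.
    val data = input.map { case (label, x) => (label, x / 8.0) }.cache()
    val n = data.count()  // action; also forces the first materialization

    // Toy gradient-descent loop: each iteration makes one full pass over
    // `data`, the same access pattern as MLlib's gradient-based optimizers.
    var weight = 0.0
    for (_ <- 1 to 10) {
      val gradient = data.map { case (label, x) => (weight * x - label) * x }.sum() / n
      weight -= 0.1 * gradient
    }

    println(s"learned weight: $weight")
    sc.stop()
  }
}

Without the cache() call, every one of the ten passes would re-run the map transformation from the raw input; with it, the transformed partitions are computed once and reused, which is exactly what this commit changes for the scaled, bias-appended training data in GeneralizedLinearAlgorithm.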
Diffstat (limited to 'mllib/src/main')
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala  29
1 file changed, 15 insertions(+), 14 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 17de215b97..2b7145362a 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -205,7 +205,7 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
       throw new SparkException("Input validation failed.")
     }
 
-    /**
+    /*
      * Scaling columns to unit variance as a heuristic to reduce the condition number:
      *
      * During the optimization process, the convergence (rate) depends on the condition number of
@@ -225,26 +225,27 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
      * Currently, it's only enabled in LogisticRegressionWithLBFGS
      */
     val scaler = if (useFeatureScaling) {
-      (new StandardScaler(withStd = true, withMean = false)).fit(input.map(x => x.features))
+      new StandardScaler(withStd = true, withMean = false).fit(input.map(_.features))
     } else {
       null
     }
 
     // Prepend an extra variable consisting of all 1.0's for the intercept.
-    val data = if (addIntercept) {
-      if (useFeatureScaling) {
-        input.map(labeledPoint =>
-          (labeledPoint.label, appendBias(scaler.transform(labeledPoint.features))))
-      } else {
-        input.map(labeledPoint => (labeledPoint.label, appendBias(labeledPoint.features)))
-      }
-    } else {
-      if (useFeatureScaling) {
-        input.map(labeledPoint => (labeledPoint.label, scaler.transform(labeledPoint.features)))
+    // TODO: Apply feature scaling to the weight vector instead of input data.
+    val data =
+      if (addIntercept) {
+        if (useFeatureScaling) {
+          input.map(lp => (lp.label, appendBias(scaler.transform(lp.features)))).cache()
+        } else {
+          input.map(lp => (lp.label, appendBias(lp.features))).cache()
+        }
       } else {
-        input.map(labeledPoint => (labeledPoint.label, labeledPoint.features))
+        if (useFeatureScaling) {
+          input.map(lp => (lp.label, scaler.transform(lp.features))).cache()
+        } else {
+          input.map(lp => (lp.label, lp.features))
+        }
       }
-    }
 
     /**
      * TODO: For better convergence, in logistic regression, the intercepts should be computed