[SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached data

`GeneralizedLinearModel` creates a cached RDD when building a model. This is inconvenient: when several models are built in a row, these RDDs fill up memory, so useful data may get evicted from the cache.

The proposed solution is to always cache the dataset and remove the warning. There is a caveat, though: the input dataset gets evaluated twice, first in line 270 when fitting `StandardScaler`, and a second time when running the optimizer. So it might be worth restoring the removed warning.

Another possible solution is to disable caching entirely and restore the removed warning. I don't really know which approach is better.

Author: Vyacheslav Baranov <slavik.baranov@gmail.com>

Closes #8395 from SlavikBaranov/SPARK-10182.
author: Vyacheslav Baranov <slavik.baranov@gmail.com> 2015-08-27 18:56:18 +0100
committer: Sean Owen <sowen@cloudera.com> 2015-08-27 18:56:18 +0100
commit: fdd466bed7a7151dd066d732ef98d225f4acda4a (patch)
tree: 4d8291b830846f72ca89c59d1f04cb1a9ee8bb79 /mllib
parent: e1f4de4a7d15d4ca4b5c64ff929ac3980f5d706f (diff)
download: spark-fdd466bed7a7151dd066d732ef98d225f4acda4a.tar.gz
spark-fdd466bed7a7151dd066d732ef98d225f4acda4a.tar.bz2
spark-fdd466bed7a7151dd066d732ef98d225f4acda4a.zip
1 files changed, 5 insertions, 0 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
index 7e3b4d5648..8f657bfb9c 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/regression/GeneralizedLinearAlgorithm.scala
@@ -359,6 +359,11 @@ abstract class GeneralizedLinearAlgorithm[M <: GeneralizedLinearModel]
         + " parent RDDs are also uncached.")
     }
 
+    // Unpersist cached data
+    if (data.getStorageLevel != StorageLevel.NONE) {
+      data.unpersist(false)
+    }
+
     createModel(weights, intercept)
   }
 }
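
The cache-then-release pattern that the patch applies can be sketched in isolation. This is a minimal illustration only, not the MLlib implementation: the object `UnpersistSketch`, the helper `meanAfterScaling`, and the doubling step that stands in for feature scaling are all hypothetical names introduced here, and the example assumes a Spark dependency on the classpath and a local master.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {
  // Hypothetical stand-in for a training step: derive an RDD from the input,
  // cache it, make several passes over it, then release the cached blocks.
  def meanAfterScaling(input: RDD[Double], passes: Int): Double = {
    // Derived RDD is cached because the loop below reads it repeatedly.
    val data = input.map(_ * 2.0).cache()

    var result = 0.0
    for (_ <- 1 to passes) {
      result = data.sum() / data.count()
    }

    // Same guard as in the patch: only unpersist if a storage level was set.
    // blocking = false returns without waiting for the blocks to be dropped.
    if (data.getStorageLevel != StorageLevel.NONE) {
      data.unpersist(blocking = false)
    }
    result
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try {
      val input = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0))
      println(s"mean = ${meanAfterScaling(input, passes = 3)}")
      // The context should no longer track the derived RDD as persistent.
      println(s"persistent RDDs left: ${sc.getPersistentRDDs.size}")
    } finally {
      sc.stop()
    }
  }
}
```

Releasing the derived RDD inside the helper keeps repeated calls from accumulating cached copies, which is the problem the commit message describes. The input RDD itself is left untouched, so a caller who cached it deliberately keeps that cache.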