author | omgteam <Kimlong.Liu@gmail.com> | 2014-10-13 09:59:41 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2014-10-13 09:59:41 -0700 |
commit | 942847fd94c920f7954ddf01f97263926e512b0e (patch) | |
tree | b60c81cd8310bda92cb3d7ac8178d22d247f938f /mllib | |
parent | 92e017fb894be1e8e2b2b5274fec4c31a7a4412e (diff) | |
Bug Fix: missing unpersist call in RandomForest.scala
While training Gradient Boosting Decision Trees on large-scale sparse data, Spark spilled large amounts of data to disk. Investigation turned up the following bug:
In version 1.1.0, the train method in DecisionTree.scala persists treeInput in memory but never unpersists it, which caused heavy disk usage.
In the GitHub version (possibly 1.2.0), the train method in RandomForest.scala persists baggedInput but likewise never unpersists it.
After adding the unpersist call, it works correctly.
https://issues.apache.org/jira/browse/SPARK-3918
Author: omgteam <Kimlong.Liu@gmail.com>
Closes #2775 from omgteam/master and squashes the following commits:
815d543 [omgteam] adjust tab to spaces
1a36f83 [omgteam] Bug: fix without unpersist baggedInput in RandomForest.scala
Diffstat (limited to 'mllib')
-rw-r--r-- | mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala | 2 |
1 files changed, 2 insertions, 0 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
index fa7a26f17c..ebbd8e0257 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/RandomForest.scala
@@ -176,6 +176,8 @@ private class RandomForest (
       timer.stop("findBestSplits")
     }
 
+    baggedInput.unpersist()
+
     timer.stop("total")
 
     logInfo("Internal timing for DecisionTree:")
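
The pattern the patch applies (persist an RDD for an iterative computation, then release it when the loop is done) can be sketched in isolation as below. This is a minimal illustration, not the code in RandomForest.scala: the `withPersisted` helper, the `UnpersistSketch` object, the local master setting, and the toy data are all assumptions made for the example, and the `try`/`finally` wrapper is an addition that the patch itself does not use (the patch simply calls `baggedInput.unpersist()` after the training loop).

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

object UnpersistSketch {

  // Hypothetical helper (not part of Spark or of this patch): caches `input`
  // while `body` runs and releases the cached blocks afterwards, even if
  // `body` throws.
  def withPersisted[T, R](input: RDD[T], level: StorageLevel)(body: RDD[T] => R): R = {
    input.persist(level)
    try body(input)
    finally input.unpersist()
  }

  def main(args: Array[String]): Unit = {
    // Local master and app name are placeholders for the example.
    val conf = new SparkConf().setMaster("local[2]").setAppName("unpersist-sketch")
    val sc = new SparkContext(conf)
    try {
      val data = sc.parallelize(1 to 1000)

      // Several passes over the same RDD, standing in for the repeated
      // passes an iterative trainer makes over its input.
      val total = withPersisted(data, StorageLevel.MEMORY_AND_DISK) { cached =>
        (1 to 3).map(_ => cached.map(_.toLong).reduce(_ + _)).sum
      }

      println(s"total = $total")
      // Once unpersist() has run, the RDD reports StorageLevel.NONE again.
      println(s"released = ${data.getStorageLevel == StorageLevel.NONE}")
    } finally {
      sc.stop()
    }
  }
}
```

The helper scopes the cache to the block that needs it, so an early exit or exception cannot leave the blocks behind. Whether that wrapper is preferable to a single `unpersist()` call after the loop, as in the patch, is a design choice. The result of `body` should be a materialized value rather than a lazy RDD derived from the cached one, since a derived RDD would be recomputed without the cache once it is released.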