[SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one - spark

author	Joseph K. Bradley <joseph@databricks.com>	2016-03-23 21:16:00 -0700
committer	Joseph K. Bradley <joseph@databricks.com>	2016-03-23 21:16:00 -0700
commit	cf823bead18c5be86b36da59b4bbf935c4804d04 (patch)
tree	7e48dd6b225e4f2ce670d6c5513215c914053194 /R
parent	f42eaf42bdca8bc6f390f1f31ee60faa1662489b (diff)
download	spark-cf823bead18c5be86b36da59b4bbf935c4804d04.tar.gz spark-cf823bead18c5be86b36da59b4bbf935c4804d04.tar.bz2 spark-cf823bead18c5be86b36da59b4bbf935c4804d04.zip

[SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one

Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.

ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses. These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.

ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests.

Copied developer comments from spark.mllib implementation to spark.ml one.

Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite. The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run. Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in. Those same split calculations were already being tested in other unit tests, for each dataset type.

**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB. This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit(). Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats. This does not remove information from the tree, and it will save a bit of storage.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11855 from jkbradley/remove-mllib-tree-impl.
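
Illustration of the maxMemoryInMB change described under "Changes of behavior": a minimal, standalone sketch of memory-bounded node selection in which the first node is always accepted, even when it alone exceeds the budget. This is not the code from RandomForest.selectNodesToSplit(). `NodeSelectionSketch`, `PendingNode`, `memBytes` and `maxMemoryBytes` are hypothetical names introduced only for this example, and the per-node memory estimate is assumed to be given.

```scala
import scala.collection.mutable

object NodeSelectionSketch {
  // Hypothetical stand-in for a tree node waiting to be split,
  // with an assumed, precomputed memory estimate in bytes.
  final case class PendingNode(id: Int, memBytes: Long)

  /**
   * Dequeue nodes in FIFO order while they fit within the memory budget.
   * The first node is always taken, even if it exceeds the budget by itself,
   * so an undersized budget degrades to one node per group instead of an error.
   */
  def selectNodesToSplit(
      queue: mutable.Queue[PendingNode],
      maxMemoryBytes: Long): Seq[PendingNode] = {
    val selected = mutable.ArrayBuffer.empty[PendingNode]
    var memUsage = 0L
    var done = false
    while (queue.nonEmpty && !done) {
      val next = queue.head
      if (selected.isEmpty || memUsage + next.memBytes <= maxMemoryBytes) {
        // Accept the node and charge its estimate against the budget.
        selected += queue.dequeue()
        memUsage += next.memBytes
      } else {
        // The next node does not fit; leave it queued for a later group.
        done = true
      }
    }
    selected.toSeq
  }
}
```

In this sketch, a queue whose head needs 500 bytes against a 100-byte budget still yields that one node, and the remaining nodes stay in the queue for the next call.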

Diffstat (limited to 'R')

0 files changed, 0 insertions, 0 deletions

