author | Sean Owen <sowen@cloudera.com> | 2015-07-30 17:26:18 -0700 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2015-07-30 17:26:18 -0700 |
commit | 65fa4181c35135080870c1e4c1f904ada3a8cf59 (patch) | |
tree | df7cef6db7095640e72e0e3e46e3172ef3dadce9 | |
parent | 351eda0e2fd47c183c4298469970032097ad07a0 (diff) | |
[SPARK-9077] [MLLIB] Improve error message for decision trees when numExamples < maxCategoriesPerFeature
Improve error message when number of examples is less than arity of high-arity categorical feature
CC jkbradley is this about what you had in mind? I know it's a starter, but was on my list to close out in the short term.
Author: Sean Owen <sowen@cloudera.com>
Closes #7800 from srowen/SPARK-9077 and squashes the following commits:
b8f6cdb [Sean Owen] Improve error message when number of examples is less than arity of high-arity categorical feature
-rw-r--r-- | mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala | 8 |
1 files changed, 6 insertions, 2 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala b/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
index 380291ac22..9fe264656e 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DecisionTreeMetadata.scala
@@ -128,9 +128,13 @@ private[spark] object DecisionTreeMetadata extends Logging {
       // based on the number of training examples.
       if (strategy.categoricalFeaturesInfo.nonEmpty) {
         val maxCategoriesPerFeature = strategy.categoricalFeaturesInfo.values.max
+        val maxCategory =
+          strategy.categoricalFeaturesInfo.find(_._2 == maxCategoriesPerFeature).get._1
         require(maxCategoriesPerFeature <= maxPossibleBins,
-          s"DecisionTree requires maxBins (= $maxPossibleBins) >= max categories " +
-          s"in categorical features (= $maxCategoriesPerFeature)")
+          s"DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the " +
+          s"number of values in each categorical feature, but categorical feature $maxCategory " +
+          s"has $maxCategoriesPerFeature values. Considering remove this and other categorical " +
+          "features with a large number of values, or add more training examples.")
       }
 
       val unorderedFeatures = new mutable.HashSet[Int]()
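For readers who want to try the check outside Spark, the following is a minimal standalone sketch of the validation this patch touches. It is not the Spark source: the object `MaxBinsCheckSketch` and the helper `checkMaxBins` are hypothetical names, the map stands in for `strategy.categoricalFeaturesInfo` (feature index to number of categories), and the `math.min(32, 10)` call assumes, as the commit title suggests, that the effective bin limit is capped by the number of training examples. It uses `maxBy` to obtain the offending feature and its arity in one pass, which is a variation on the `values.max` plus `find` pair in the patch, and it shortens the message.

```scala
// Standalone sketch of the maxBins check; names here are illustrative, not Spark's API.
object MaxBinsCheckSketch {

  // categoricalFeaturesInfo: feature index -> number of categories (arity).
  // maxPossibleBins: effective bin limit (assumed already capped by the example count).
  def checkMaxBins(categoricalFeaturesInfo: Map[Int, Int], maxPossibleBins: Int): Unit = {
    if (categoricalFeaturesInfo.nonEmpty) {
      // maxBy returns the (featureIndex, arity) pair with the largest arity.
      val (maxCategory, maxCategoriesPerFeature) = categoricalFeaturesInfo.maxBy(_._2)
      require(maxCategoriesPerFeature <= maxPossibleBins,
        s"DecisionTree requires maxBins (= $maxPossibleBins) to be at least as large as the " +
        s"number of values in each categorical feature, but categorical feature $maxCategory " +
        s"has $maxCategoriesPerFeature values.")
    }
  }

  def main(args: Array[String]): Unit = {
    val info = Map(0 -> 3, 1 -> 50)

    // 50 categories fit within 64 bins, so this passes silently.
    checkMaxBins(info, 64)

    // Assumed scenario: maxBins = 32 but only 10 training examples, so the limit is 10.
    // require throws IllegalArgumentException, whose message names feature 1.
    try checkMaxBins(info, math.min(32, 10))
    catch { case e: IllegalArgumentException => println(e.getMessage) }
  }
}
```

Naming the feature index in the message is the point of the change: with only the maximum arity reported, a user could not tell which categorical feature to drop or re-encode.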