[SPARK-15957][ML] RFormula supports forcing to index label

## What changes were proposed in this pull request? ```RFormula``` will index label only when it is string type currently. If the label is numeric type and we use ```RFormula``` to present a classification model, there is no label attributes in label column metadata. The label attributes are useful when making prediction for classification, so we can force to index label by ```StringIndexer``` whether it is numeric or string type for classification. Then SparkR wrappers can extract label attributes from label column metadata successfully. This feature can help us to fix bug similar with [SPARK-15153](https://issues.apache.org/jira/browse/SPARK-15153). For regression, we will still to keep label as numeric type. In this PR, we add a param ```indexLabel``` to control whether to force to index label for ```RFormula```. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13675 from yanboliang/spark-15957.
author: Yanbo Liang <ybliang8@gmail.com> 2016-10-10 22:50:59 -0700
committer: Yanbo Liang <ybliang8@gmail.com> 2016-10-10 22:50:59 -0700
commit: 19401a203b441e3355f0d3fc3fd062b6d5bdee1f (patch)
tree: 8a5d1158afcdaa9468eb978f301586333f2e3d47 /mllib/src/main
parent: b515768f2668749ad37a3bdf9d265ce45ec447b1 (diff)
download: spark-19401a203b441e3355f0d3fc3fd062b6d5bdee1f.tar.gz
spark-19401a203b441e3355f0d3fc3fd062b6d5bdee1f.tar.bz2
spark-19401a203b441e3355f0d3fc3fd062b6d5bdee1f.zip
1 files changed, 26 insertions, 3 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index 2ee899bcca..389898666e 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -26,7 +26,7 @@ import org.apache.spark.annotation.{Experimental, Since}
 import org.apache.spark.ml.{Estimator, Model, Pipeline, PipelineModel, PipelineStage, Transformer}
 import org.apache.spark.ml.attribute.AttributeGroup
 import org.apache.spark.ml.linalg.VectorUDT
-import org.apache.spark.ml.param.{Param, ParamMap}
+import org.apache.spark.ml.param.{BooleanParam, Param, ParamMap}
 import org.apache.spark.ml.param.shared.{HasFeaturesCol, HasLabelCol}
 import org.apache.spark.ml.util._
 import org.apache.spark.sql.{DataFrame, Dataset}
@@ -104,6 +104,27 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
   @Since("1.5.0")
   def setLabelCol(value: String): this.type = set(labelCol, value)
 
+  /**
+   * Force to index label whether it is numeric or string type.
+   * Usually we index label only when it is string type.
+   * If the formula was used by classification algorithms,
+   * we can force to index label even it is numeric type by setting this param with true.
+   * Default: false.
+   * @group param
+   */
+  @Since("2.1.0")
+  val forceIndexLabel: BooleanParam = new BooleanParam(this, "forceIndexLabel",
+    "Force to index label whether it is numeric or string")
+  setDefault(forceIndexLabel -> false)
+
+  /** @group getParam */
+  @Since("2.1.0")
+  def getForceIndexLabel: Boolean = $(forceIndexLabel)
+
+  /** @group setParam */
+  @Since("2.1.0")
+  def setForceIndexLabel(value: Boolean): this.type = set(forceIndexLabel, value)
+
   /** Whether the formula specifies fitting an intercept. */
   private[ml] def hasIntercept: Boolean = {
     require(isDefined(formula), "Formula must be defined first.")
@@ -167,8 +188,8 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
     encoderStages += new VectorAttributeRewriter($(featuresCol), prefixesToRewrite.toMap)
     encoderStages += new ColumnPruner(tempColumns.toSet)
 
-    if (dataset.schema.fieldNames.contains(resolvedFormula.label) &&
-      dataset.schema(resolvedFormula.label).dataType == StringType) {
+    if ((dataset.schema.fieldNames.contains(resolvedFormula.label) &&
+      dataset.schema(resolvedFormula.label).dataType == StringType) || $(forceIndexLabel)) {
       encoderStages += new StringIndexer()
         .setInputCol(resolvedFormula.label)
         .setOutputCol($(labelCol))
@@ -181,6 +202,8 @@ class RFormula @Since("1.5.0") (@Since("1.5.0") override val uid: String)
   @Since("1.5.0")
   // optimistic schema; does not contain any ML attributes
   override def transformSchema(schema: StructType): StructType = {
+    require(!hasLabelCol(schema) || !$(forceIndexLabel),
+      "If label column already exists, forceIndexLabel can not be set with true.")
     if (hasLabelCol(schema)) {
       StructType(schema.fields :+ StructField($(featuresCol), new VectorUDT, true))
     } else {
author	Yanbo Liang <ybliang8@gmail.com>	2016-10-10 22:50:59 -0700
committer	Yanbo Liang <ybliang8@gmail.com>	2016-10-10 22:50:59 -0700
commit	19401a203b441e3355f0d3fc3fd062b6d5bdee1f (patch)
tree	8a5d1158afcdaa9468eb978f301586333f2e3d47 /mllib/src/main
parent	b515768f2668749ad37a3bdf9d265ce45ec447b1 (diff)
download	spark-19401a203b441e3355f0d3fc3fd062b6d5bdee1f.tar.gz spark-19401a203b441e3355f0d3fc3fd062b6d5bdee1f.tar.bz2 spark-19401a203b441e3355f0d3fc3fd062b6d5bdee1f.zip