aboutsummaryrefslogtreecommitdiff
path: root/mllib/src
diff options
context:
space:
mode:
authorPeng <peng.meng@intel.com>2016-12-28 00:49:36 -0800
committerYanbo Liang <ybliang8@gmail.com>2016-12-28 00:49:36 -0800
commit79ff8536315aef97ee940c52d71cd8de777c7ce6 (patch)
tree97008066cdca759546e876c5379aa150f91ef27b /mllib/src
parent2af8b5cffa97cd2ca11afe504f6756fe5721dfb6 (diff)
downloadspark-79ff8536315aef97ee940c52d71cd8de777c7ce6.tar.gz
spark-79ff8536315aef97ee940c52d71cd8de777c7ce6.tar.bz2
spark-79ff8536315aef97ee940c52d71cd8de777c7ce6.zip
[SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE)
## What changes were proposed in this pull request? Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate. In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypothesis tests. https://en.wikipedia.org/wiki/Family-wise_error_rate We add FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection ## How was this patch tested? Unit tests will be added soon (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: Peng <peng.meng@intel.com> Author: Peng, Meng <peng.meng@intel.com> Closes #15212 from mpjlu/fdr_fwe.
Diffstat (limited to 'mllib/src')
-rw-r--r--mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala48
-rw-r--r--mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala4
-rw-r--r--mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala62
-rw-r--r--mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala6
-rw-r--r--mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala147
5 files changed, 222 insertions, 45 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
index 8699929bab..353bd186da 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/ChiSqSelector.scala
@@ -92,8 +92,36 @@ private[feature] trait ChiSqSelectorParams extends Params
def getFpr: Double = $(fpr)
/**
+ * The upper bound of the expected false discovery rate.
+ * Only applicable when selectorType = "fdr".
+ * Default value is 0.05.
+ * @group param
+ */
+ @Since("2.2.0")
+ final val fdr = new DoubleParam(this, "fdr",
+ "The upper bound of the expected false discovery rate.", ParamValidators.inRange(0, 1))
+ setDefault(fdr -> 0.05)
+
+ /** @group getParam */
+ def getFdr: Double = $(fdr)
+
+ /**
+ * The upper bound of the expected family-wise error rate.
+ * Only applicable when selectorType = "fwe".
+ * Default value is 0.05.
+ * @group param
+ */
+ @Since("2.2.0")
+ final val fwe = new DoubleParam(this, "fwe",
+ "The upper bound of the expected family-wise error rate.", ParamValidators.inRange(0, 1))
+ setDefault(fwe -> 0.05)
+
+ /** @group getParam */
+ def getFwe: Double = $(fwe)
+
+ /**
* The selector type of the ChisqSelector.
- * Supported options: "numTopFeatures" (default), "percentile", "fpr".
+ * Supported options: "numTopFeatures" (default), "percentile", "fpr", "fdr", "fwe".
* @group param
*/
@Since("2.1.0")
@@ -111,11 +139,17 @@ private[feature] trait ChiSqSelectorParams extends Params
/**
* Chi-Squared feature selection, which selects categorical features to use for predicting a
* categorical label.
- * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
+ * - `fdr` uses the [Benjamini-Hochberg procedure]
+ * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ * to choose all features whose false discovery rate is below a threshold.
+ * - `fwe` chooses all features whose p-value is below a threshold,
+ * thus controlling the family-wise error rate of selection.
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
@@ -139,6 +173,14 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
def setFpr(value: Double): this.type = set(fpr, value)
/** @group setParam */
+ @Since("2.2.0")
+ def setFdr(value: Double): this.type = set(fdr, value)
+
+ /** @group setParam */
+ @Since("2.2.0")
+ def setFwe(value: Double): this.type = set(fwe, value)
+
+ /** @group setParam */
@Since("2.1.0")
def setSelectorType(value: String): this.type = set(selectorType, value)
@@ -167,6 +209,8 @@ final class ChiSqSelector @Since("1.6.0") (@Since("1.6.0") override val uid: Str
.setNumTopFeatures($(numTopFeatures))
.setPercentile($(percentile))
.setFpr($(fpr))
+ .setFdr($(fdr))
+ .setFwe($(fwe))
val model = selector.fit(input)
copyValues(new ChiSqSelectorModel(uid, model).setParent(this))
}
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
index 034e3625e8..b32d3f252a 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
@@ -639,12 +639,16 @@ private[python] class PythonMLLibAPI extends Serializable {
numTopFeatures: Int,
percentile: Double,
fpr: Double,
+ fdr: Double,
+ fwe: Double,
data: JavaRDD[LabeledPoint]): ChiSqSelectorModel = {
new ChiSqSelector()
.setSelectorType(selectorType)
.setNumTopFeatures(numTopFeatures)
.setPercentile(percentile)
.setFpr(fpr)
+ .setFdr(fdr)
+ .setFwe(fwe)
.fit(data.rdd)
}
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
index 7ef2a95b96..9dea3c3e84 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/ChiSqSelector.scala
@@ -171,11 +171,17 @@ object ChiSqSelectorModel extends Loader[ChiSqSelectorModel] {
/**
* Creates a ChiSquared feature selector.
- * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`.
+ * The selector supports different selection methods: `numTopFeatures`, `percentile`, `fpr`,
+ * `fdr`, `fwe`.
* - `numTopFeatures` chooses a fixed number of top features according to a chi-squared test.
* - `percentile` is similar but chooses a fraction of all features instead of a fixed number.
* - `fpr` chooses all features whose p-value is below a threshold, thus controlling the false
* positive rate of selection.
+ * - `fdr` uses the [Benjamini-Hochberg procedure]
+ * (https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure)
+ * to choose all features whose false discovery rate is below a threshold.
+ * - `fwe` chooses all features whose p-value is below a threshold,
+ * thus controlling the family-wise error rate of selection.
* By default, the selection method is `numTopFeatures`, with the default number of top features
* set to 50.
*/
@@ -184,6 +190,8 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
var numTopFeatures: Int = 50
var percentile: Double = 0.1
var fpr: Double = 0.05
+ var fdr: Double = 0.05
+ var fwe: Double = 0.05
var selectorType = ChiSqSelector.NumTopFeatures
/**
@@ -215,6 +223,20 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
this
}
+ @Since("2.2.0")
+ def setFdr(value: Double): this.type = {
+ require(0.0 <= value && value <= 1.0, "FDR must be in [0,1]")
+ fdr = value
+ this
+ }
+
+ @Since("2.2.0")
+ def setFwe(value: Double): this.type = {
+ require(0.0 <= value && value <= 1.0, "FWE must be in [0,1]")
+ fwe = value
+ this
+ }
+
@Since("2.1.0")
def setSelectorType(value: String): this.type = {
require(ChiSqSelector.supportedSelectorTypes.contains(value),
@@ -245,6 +267,21 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
case ChiSqSelector.FPR =>
chiSqTestResult
.filter { case (res, _) => res.pValue < fpr }
+ case ChiSqSelector.FDR =>
+ // This uses the Benjamini-Hochberg procedure.
+ // https://en.wikipedia.org/wiki/False_discovery_rate#Benjamini.E2.80.93Hochberg_procedure
+ val tempRes = chiSqTestResult
+ .sortBy { case (res, _) => res.pValue }
+ val maxIndex = tempRes
+ .zipWithIndex
+ .filter { case ((res, _), index) =>
+ res.pValue <= fdr * (index + 1) / chiSqTestResult.length }
+ .map { case (_, index) => index }
+ .max
+ tempRes.take(maxIndex + 1)
+ case ChiSqSelector.FWE =>
+ chiSqTestResult
+ .filter { case (res, _) => res.pValue < fwe / chiSqTestResult.length }
case errorType =>
throw new IllegalStateException(s"Unknown ChiSqSelector Type: $errorType")
}
@@ -255,19 +292,22 @@ class ChiSqSelector @Since("2.1.0") () extends Serializable {
private[spark] object ChiSqSelector {
- /**
- * String name for `numTopFeatures` selector type.
- */
- val NumTopFeatures: String = "numTopFeatures"
+ /** String name for `numTopFeatures` selector type. */
+ private[spark] val NumTopFeatures: String = "numTopFeatures"
- /**
- * String name for `percentile` selector type.
- */
- val Percentile: String = "percentile"
+ /** String name for `percentile` selector type. */
+ private[spark] val Percentile: String = "percentile"
/** String name for `fpr` selector type. */
- val FPR: String = "fpr"
+ private[spark] val FPR: String = "fpr"
+
+ /** String name for `fdr` selector type. */
+ private[spark] val FDR: String = "fdr"
+
+ /** String name for `fwe` selector type. */
+ private[spark] val FWE: String = "fwe"
+
/** Set of selector types that ChiSqSelector supports. */
- val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR)
+ val supportedSelectorTypes: Array[String] = Array(NumTopFeatures, Percentile, FPR, FDR, FWE)
}
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
index 80970fd744..f6c68b9314 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/ChiSqSelectorSuite.scala
@@ -79,6 +79,12 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext
ChiSqSelectorSuite.testSelector(selector, dataset)
}
+ test("Test Chi-Square selector: fwe") {
+ val selector = new ChiSqSelector()
+ .setOutputCol("filtered").setSelectorType("fwe").setFwe(0.6)
+ ChiSqSelectorSuite.testSelector(selector, dataset)
+ }
+
test("read/write") {
def checkModelData(model: ChiSqSelectorModel, model2: ChiSqSelectorModel): Unit = {
assert(model.selectedFeatures === model2.selectedFeatures)
diff --git a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
index 77219e5006..305cb4cbbd 100644
--- a/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/mllib/feature/ChiSqSelectorSuite.scala
@@ -27,60 +27,143 @@ class ChiSqSelectorSuite extends SparkFunSuite with MLlibTestSparkContext {
/*
* Contingency tables
- * feature0 = {8.0, 0.0}
+ * feature0 = {6.0, 0.0, 8.0}
* class 0 1 2
- * 8.0||1|0|1|
- * 0.0||0|2|0|
+ * 6.0||1|0|0|
+ * 0.0||0|3|0|
+ * 8.0||0|0|2|
+ * degree of freedom = 4, statistic = 12, pValue = 0.017
*
* feature1 = {7.0, 9.0}
* class 0 1 2
* 7.0||1|0|0|
- * 9.0||0|2|1|
+ * 9.0||0|3|2|
+ * degree of freedom = 2, statistic = 6, pValue = 0.049
*
- * feature2 = {0.0, 6.0, 8.0, 5.0}
+ * feature2 = {0.0, 6.0, 3.0, 8.0}
* class 0 1 2
* 0.0||1|0|0|
- * 6.0||0|1|0|
+ * 6.0||0|1|2|
+ * 3.0||0|1|0|
* 8.0||0|1|0|
- * 5.0||0|0|1|
+ * degree of freedom = 6, statistic = 8.66, pValue = 0.193
+ *
+ * feature3 = {7.0, 0.0, 5.0, 4.0}
+ * class 0 1 2
+ * 7.0||1|0|0|
+ * 0.0||0|2|0|
+ * 5.0||0|1|1|
+ * 4.0||0|0|1|
+ * degree of freedom = 6, statistic = 9.5, pValue = 0.147
+ *
+ * feature4 = {6.0, 5.0, 4.0, 0.0}
+ * class 0 1 2
+ * 6.0||1|1|0|
+ * 5.0||0|2|0|
+ * 4.0||0|0|1|
+ * 0.0||0|0|1|
+ * degree of freedom = 6, statistic = 8.0, pValue = 0.238
+ *
+ * feature5 = {0.0, 9.0, 5.0, 4.0}
+ * class 0 1 2
+ * 0.0||1|0|1|
+ * 9.0||0|1|0|
+ * 5.0||0|1|0|
+ * 4.0||0|1|1|
+ * degree of freedom = 6, statistic = 5, pValue = 0.54
*
* Use chi-squared calculator from Internet
*/
- test("ChiSqSelector transform test (sparse & dense vector)") {
- val labeledDiscreteData = sc.parallelize(
- Seq(LabeledPoint(0.0, Vectors.sparse(3, Array((0, 8.0), (1, 7.0)))),
- LabeledPoint(1.0, Vectors.sparse(3, Array((1, 9.0), (2, 6.0)))),
- LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0))),
- LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0)))), 2)
+ lazy val labeledDiscreteData = sc.parallelize(
+ Seq(LabeledPoint(0.0, Vectors.sparse(6, Array((0, 6.0), (1, 7.0), (3, 7.0), (4, 6.0)))),
+ LabeledPoint(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 6.0), (4, 5.0), (5, 9.0)))),
+ LabeledPoint(1.0, Vectors.sparse(6, Array((1, 9.0), (2, 3.0), (4, 5.0), (5, 5.0)))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 5.0, 6.0, 4.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 5.0, 4.0, 4.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 6.0, 4.0, 0.0, 0.0)))), 2)
+
+ test("ChiSqSelector transform by numTopFeatures test (sparse & dense vector)") {
val preFilteredData =
- Seq(LabeledPoint(0.0, Vectors.dense(Array(8.0))),
- LabeledPoint(1.0, Vectors.dense(Array(0.0))),
- LabeledPoint(1.0, Vectors.dense(Array(0.0))),
- LabeledPoint(2.0, Vectors.dense(Array(8.0))))
- val model = new ChiSqSelector(1).fit(labeledDiscreteData)
+ Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0))))
+
+ val model = new ChiSqSelector(3).fit(labeledDiscreteData)
val filteredData = labeledDiscreteData.map { lp =>
LabeledPoint(lp.label, model.transform(lp.features))
- }.collect().toSeq
+ }.collect().toSet
assert(filteredData === preFilteredData)
}
- test("ChiSqSelector by fpr transform test (sparse & dense vector)") {
- val labeledDiscreteData = sc.parallelize(
- Seq(LabeledPoint(0.0, Vectors.sparse(4, Array((0, 8.0), (1, 7.0)))),
- LabeledPoint(1.0, Vectors.sparse(4, Array((1, 9.0), (2, 6.0), (3, 4.0)))),
- LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 8.0, 4.0))),
- LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0, 9.0)))), 2)
+ test("ChiSqSelector transform by Percentile test (sparse & dense vector)") {
val preFilteredData =
- Seq(LabeledPoint(0.0, Vectors.dense(Array(0.0))),
- LabeledPoint(1.0, Vectors.dense(Array(4.0))),
- LabeledPoint(1.0, Vectors.dense(Array(4.0))),
- LabeledPoint(2.0, Vectors.dense(Array(9.0))))
- val model: ChiSqSelectorModel = new ChiSqSelector().setSelectorType("fpr")
- .setFpr(0.1).fit(labeledDiscreteData)
+ Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0))))
+
+ val model = new ChiSqSelector().setSelectorType("percentile").setPercentile(0.5)
+ .fit(labeledDiscreteData)
+ val filteredData = labeledDiscreteData.map { lp =>
+ LabeledPoint(lp.label, model.transform(lp.features))
+ }.collect().toSet
+ assert(filteredData === preFilteredData)
+ }
+
+ test("ChiSqSelector transform by FPR test (sparse & dense vector)") {
+ val preFilteredData =
+ Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0, 7.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 0.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0, 5.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 5.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0, 4.0))))
+
+ val model = new ChiSqSelector().setSelectorType("fpr").setFpr(0.15)
+ .fit(labeledDiscreteData)
+ val filteredData = labeledDiscreteData.map { lp =>
+ LabeledPoint(lp.label, model.transform(lp.features))
+ }.collect().toSet
+ assert(filteredData === preFilteredData)
+ }
+
+ test("ChiSqSelector transform by FDR test (sparse & dense vector)") {
+ val preFilteredData =
+ Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))))
+
+ val model = new ChiSqSelector().setSelectorType("fdr").setFdr(0.15)
+ .fit(labeledDiscreteData)
+ val filteredData = labeledDiscreteData.map { lp =>
+ LabeledPoint(lp.label, model.transform(lp.features))
+ }.collect().toSet
+ assert(filteredData === preFilteredData)
+ }
+
+ test("ChiSqSelector transform by FWE test (sparse & dense vector)") {
+ val preFilteredData =
+ Set(LabeledPoint(0.0, Vectors.dense(Array(6.0, 7.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+ LabeledPoint(1.0, Vectors.dense(Array(0.0, 9.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))),
+ LabeledPoint(2.0, Vectors.dense(Array(8.0, 9.0))))
+
+ val model = new ChiSqSelector().setSelectorType("fwe").setFwe(0.3)
+ .fit(labeledDiscreteData)
val filteredData = labeledDiscreteData.map { lp =>
LabeledPoint(lp.label, model.transform(lp.features))
- }.collect().toSeq
+ }.collect().toSet
assert(filteredData === preFilteredData)
}