[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames

## What changes were proposed in this pull request?

Change line 113 of QuantileDiscretizer.scala to `val requiredSamples = math.max(numBins * numBins, 10000.0)` so that `requiredSamples` is a `Double`. This fixes the division on line 114, which currently results in zero if `requiredSamples < dataset.count`.

## How was this patch tested?

Manual tests. I was having problems using QuantileDiscretizer with my dataset, and after making this change QuantileDiscretizer behaves as expected.

Author: Oliver Pierson <ocp@gatech.edu>
Author: Oliver Pierson <opierson@umd.edu>

Closes #11319 from oliverpierson/SPARK-13444.
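
For readers unfamiliar with the failure mode, the following is a minimal sketch of why the division truncates to zero. It is not the actual code from QuantileDiscretizer.scala: the object name, the method names, and the `math.min(..., 1.0)` clamp are assumptions made for illustration; only `requiredSamples`, `numBins`, and the `10000` / `10000.0` literals come from the description above.

```scala
// Illustrative sketch only; the surrounding structure is assumed, not copied from Spark.
object SampleFractionSketch {

  // Before the fix: requiredSamples is an Int, so Int / Long is integer division
  // and truncates to 0 whenever datasetCount > requiredSamples.
  def fractionBefore(numBins: Int, datasetCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000)
    math.min(requiredSamples / datasetCount, 1.0)
  }

  // After the fix: the 10000.0 literal makes requiredSamples a Double,
  // so the division is floating-point and yields a usable sampling fraction.
  def fractionAfter(numBins: Int, datasetCount: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0)
    math.min(requiredSamples / datasetCount, 1.0)
  }

  def main(args: Array[String]): Unit = {
    val datasetCount = 10001L // one row more than the 10000-sample floor
    println(s"before: ${fractionBefore(5, datasetCount)}")
    println(s"after:  ${fractionAfter(5, datasetCount)}")
  }
}
```

Under these assumptions, a sampling fraction of zero would leave no sampled rows from which to derive split points, which is consistent with the bad splits reported for large DataFrames.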
author: Oliver Pierson <ocp@gatech.edu> 2016-02-25 13:24:46 +0000
committer: Sean Owen <sowen@cloudera.com> 2016-02-25 13:24:46 +0000
commit: 6f8e835c68dff6fcf97326dc617132a41ff9d043 (patch)
tree: d0842e5e46ef3e8c7a3bd0f3873a7bd67af34ba1 /mllib/src/test/scala/org/apache
parent: 3fa6491be66dad690ca5329dd32e7c82037ae8c1 (diff)
download: spark-6f8e835c68dff6fcf97326dc617132a41ff9d043.tar.gz
spark-6f8e835c68dff6fcf97326dc617132a41ff9d043.tar.bz2
spark-6f8e835c68dff6fcf97326dc617132a41ff9d043.zip
1 files changed, 20 insertions, 0 deletions
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
index 6a2c601bbe..25fabf64d5 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
@@ -71,6 +71,26 @@ class QuantileDiscretizerSuite
     }
   }
 
+  test("Test splits on dataset larger than minSamplesRequired") {
+    val sqlCtx = SQLContext.getOrCreate(sc)
+    import sqlCtx.implicits._
+
+    val datasetSize = QuantileDiscretizer.minSamplesRequired + 1
+    val numBuckets = 5
+    val df = sc.parallelize((1.0 to datasetSize by 1.0).map(Tuple1.apply)).toDF("input")
+    val discretizer = new QuantileDiscretizer()
+      .setInputCol("input")
+      .setOutputCol("result")
+      .setNumBuckets(numBuckets)
+      .setSeed(1)
+
+    val result = discretizer.fit(df).transform(df)
+    val observedNumBuckets = result.select("result").distinct.count
+
+    assert(observedNumBuckets === numBuckets,
+      "Observed number of buckets does not equal expected number of buckets.")
+  }
+
   test("read/write") {
     val t = new QuantileDiscretizer()
       .setInputCol("myInputCol")