author | Oliver Pierson <ocp@gatech.edu> | 2016-02-25 13:24:46 +0000 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2016-02-25 13:24:46 +0000 |
commit | 6f8e835c68dff6fcf97326dc617132a41ff9d043 (patch) | |
tree | d0842e5e46ef3e8c7a3bd0f3873a7bd67af34ba1 /mllib/src/test/scala/org/apache | |
parent | 3fa6491be66dad690ca5329dd32e7c82037ae8c1 (diff) | |
[SPARK-13444][MLLIB] QuantileDiscretizer chooses bad splits on large DataFrames
## What changes were proposed in this pull request?
Change line 113 of QuantileDiscretizer.scala to
`val requiredSamples = math.max(numBins * numBins, 10000.0)`
so that `requiredSamples` is a `Double`. This fixes the division on line 114, which currently uses integer arithmetic and therefore evaluates to zero whenever `requiredSamples < dataset.count`.
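The arithmetic behind the bug can be illustrated with a minimal sketch. This is not the Spark source: the object `SampleFractionSketch` and its two helper methods are hypothetical names, and the only parts taken from the description above are the `math.max(numBins * numBins, ...)` expression and the division by the dataset count.

```scala
object SampleFractionSketch {
  // Hypothetical reconstruction of the behavior before the fix:
  // requiredSamples is an Int, so dividing by a Long count truncates.
  def fractionWithIntSamples(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000) // Int
    val ratio = requiredSamples / count                      // integer division
    math.min(ratio.toDouble, 1.0)
  }

  // Behavior after the fix: the 10000.0 literal makes requiredSamples
  // a Double, so the division keeps its fractional part.
  def fractionWithDoubleSamples(numBins: Int, count: Long): Double = {
    val requiredSamples = math.max(numBins * numBins, 10000.0) // Double
    math.min(requiredSamples / count, 1.0)
  }
}
```

With `numBins = 5` and a dataset of 10001 rows, the first helper returns `0.0` (nothing would be sampled), while the second returns a fraction just below `1.0`.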
## How was this patch tested?
Manual tests. I was having problems using QuantileDiscretizer with my dataset; after making this change, QuantileDiscretizer behaves as expected.
Author: Oliver Pierson <ocp@gatech.edu>
Author: Oliver Pierson <opierson@umd.edu>
Closes #11319 from oliverpierson/SPARK-13444.
Diffstat (limited to 'mllib/src/test/scala/org/apache')
-rw-r--r-- | mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala | 20 |
1 files changed, 20 insertions, 0 deletions
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
index 6a2c601bbe..25fabf64d5 100644
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/QuantileDiscretizerSuite.scala
@@ -71,6 +71,26 @@ class QuantileDiscretizerSuite
     }
   }
 
+  test("Test splits on dataset larger than minSamplesRequired") {
+    val sqlCtx = SQLContext.getOrCreate(sc)
+    import sqlCtx.implicits._
+
+    val datasetSize = QuantileDiscretizer.minSamplesRequired + 1
+    val numBuckets = 5
+    val df = sc.parallelize((1.0 to datasetSize by 1.0).map(Tuple1.apply)).toDF("input")
+    val discretizer = new QuantileDiscretizer()
+      .setInputCol("input")
+      .setOutputCol("result")
+      .setNumBuckets(numBuckets)
+      .setSeed(1)
+
+    val result = discretizer.fit(df).transform(df)
+    val observedNumBuckets = result.select("result").distinct.count
+
+    assert(observedNumBuckets === numBuckets,
+      "Observed number of buckets does not equal expected number of buckets.")
+  }
+
   test("read/write") {
     val t = new QuantileDiscretizer()
       .setInputCol("myInputCol")