[SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer

This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.

Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).

Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.

## How was this patch tested?

A little doctest and built API docs locally to check HTML doc generation.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
author: Nick Pentreath <nickp@za.ibm.com> 2016-05-24 10:02:10 +0200
committer: Nick Pentreath <nickp@za.ibm.com> 2016-05-24 10:02:10 +0200
commit: 6075f5b4d8e98483d26c31576f58e2229024b4f4 (patch)
tree: b49308cd5da2fb5ab3ffe80546887016b6a794cd /mllib/src/main/scala
parent: d642b273544bb77ef7f584326aa2d214649ac61b (diff)
download: spark-6075f5b4d8e98483d26c31576f58e2229024b4f4.tar.gz
spark-6075f5b4d8e98483d26c31576f58e2229024b4f4.tar.bz2
spark-6075f5b4d8e98483d26c31576f58e2229024b4f4.zip
1 files changed, 8 insertions, 5 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
index 5a6daa06ef..61483590cd 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala
@@ -50,13 +50,13 @@ private[feature] trait QuantileDiscretizerBase extends Params
   /**
    * Relative error (see documentation for
    * [[org.apache.spark.sql.DataFrameStatFunctions.approxQuantile approxQuantile]] for description)
-   * Must be a number in [0, 1].
+   * Must be in the range [0, 1].
    * default: 0.001
    * @group param
    */
   val relativeError = new DoubleParam(this, "relativeError", "The relative target precision " +
-    "for approxQuantile",
-    ParamValidators.inRange(0.0, 1.0))
+    "for the approximate quantile algorithm used to generate buckets. " +
+    "Must be in the range [0, 1].", ParamValidators.inRange(0.0, 1.0))
   setDefault(relativeError -> 0.001)
 
   /** @group getParam */
@@ -66,8 +66,11 @@ private[feature] trait QuantileDiscretizerBase extends Params
 /**
  * :: Experimental ::
  * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned
- * categorical features. The bin ranges are chosen by taking a sample of the data and dividing it
- * into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity,
+ * categorical features. The number of bins can be set using the `numBuckets` parameter.
+ * The bin ranges are chosen using an approximate algorithm (see the documentation for
+ * [[org.apache.spark.sql.DataFrameStatFunctions.approxQuantile approxQuantile]]
+ * for a detailed description). The precision of the approximation can be controlled with the
+ * `relativeError` parameter. The lower and upper bin bounds will be `-Infinity` and `+Infinity`,
  * covering all real values.
  */
 @Experimental
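
The updated doc comment describes three behaviours: `numBuckets` sets the number of bins, `relativeError` must lie in [0, 1], and the outermost bin bounds are `-Infinity` and `+Infinity`. The following is a minimal plain-Python sketch of those behaviours, not Spark's implementation. It swaps in exact quantiles over a sorted list for the approximate algorithm behind `approxQuantile`, so `relative_error` is only range-checked here. The helper names `quantile_splits` and `bucket_index` are hypothetical and exist only for this illustration.

```python
import math
from bisect import bisect_right


def quantile_splits(values, num_buckets, relative_error=0.001):
    """Hypothetical helper: compute bucket boundaries from exact quantiles.

    Assumption: exact quantiles stand in for the approximate algorithm,
    so relative_error is validated but does not change the result.
    """
    # Mirrors the inclusive [0, 1] range check described in the param doc.
    if not 0.0 <= relative_error <= 1.0:
        raise ValueError("relative_error must be in the range [0, 1]")
    if num_buckets < 2:
        raise ValueError("num_buckets must be at least 2")
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")

    n = len(ordered)
    inner = set()
    for i in range(1, num_buckets):
        # Integer arithmetic avoids floating-point rounding in the rank.
        idx = min(n - 1, (i * n) // num_buckets)
        inner.add(ordered[idx])

    # Outer bounds are infinite so that every real value falls in a bucket.
    return [-math.inf] + sorted(inner) + [math.inf]


def bucket_index(value, splits):
    """Hypothetical helper: bucket i covers [splits[i], splits[i + 1])."""
    # Clamp so that +inf itself lands in the last bucket.
    return min(bisect_right(splits, value) - 1, len(splits) - 2)


data = [0.1, 0.4, 1.2, 1.5, 2.0, 3.3]
splits = quantile_splits(data, num_buckets=3)
buckets = [bucket_index(v, splits) for v in data]
print(splits)
print(buckets)
```

Because the outer bounds are infinite, values far outside the range seen when the splits were computed still map to the first or last bucket rather than failing.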