From 6075f5b4d8e98483d26c31576f58e2229024b4f4 Mon Sep 17 00:00:00 2001 From: Nick Pentreath Date: Tue, 24 May 2016 10:02:10 +0200 Subject: [SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`). Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`. ## How was this patch tested? A little doctest and built API docs locally to check HTML doc generation. Author: Nick Pentreath Closes #13228 from MLnick/SPARK-15442-py-relerror-param. --- .../org/apache/spark/ml/feature/QuantileDiscretizer.scala | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) (limited to 'mllib/src/main/scala') diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala index 5a6daa06ef..61483590cd 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/QuantileDiscretizer.scala @@ -50,13 +50,13 @@ private[feature] trait QuantileDiscretizerBase extends Params /** * Relative error (see documentation for * [[org.apache.spark.sql.DataFrameStatFunctions.approxQuantile approxQuantile]] for description) - * Must be a number in [0, 1]. + * Must be in the range [0, 1]. * default: 0.001 * @group param */ val relativeError = new DoubleParam(this, "relativeError", "The relative target precision " + - "for approxQuantile", - ParamValidators.inRange(0.0, 1.0)) + "for the approximate quantile algorithm used to generate buckets. " + + "Must be in the range [0, 1].", ParamValidators.inRange(0.0, 1.0)) setDefault(relativeError -> 0.001) /** @group getParam */ @@ -66,8 +66,11 @@ private[feature] trait QuantileDiscretizerBase extends Params /** * :: Experimental :: * `QuantileDiscretizer` takes a column with continuous features and outputs a column with binned - * categorical features. The bin ranges are chosen by taking a sample of the data and dividing it - * into roughly equal parts. The lower and upper bin bounds will be -Infinity and +Infinity, + * categorical features. The number of bins can be set using the `numBuckets` parameter. + * The bin ranges are chosen using an approximate algorithm (see the documentation for + * [[org.apache.spark.sql.DataFrameStatFunctions.approxQuantile approxQuantile]] + * for a detailed description). The precision of the approximation can be controlled with the + * `relativeError` parameter. The lower and upper bin bounds will be `-Infinity` and `+Infinity`, * covering all real values. */ @Experimental -- cgit v1.2.3