author    Joseph K. Bradley <joseph@databricks.com>  2016-04-20 11:48:30 -0700
committer Joseph K. Bradley <joseph@databricks.com>  2016-04-20 11:48:30 -0700
commit    acc7e592c4ee5b4a6f42945329fc289fd11e1793 (patch)
tree      1acd3764fdd3158986383dbc823d624ed7218e74 /mllib
parent    08f84d7a9a7429b3d6651b5af4d7740027b53d39 (diff)
[SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std
## What changes were proposed in this pull request?

Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does. This PR documents this fact.

## How was this patch tested?

doc only

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12519 from jkbradley/scaler-variance-doc.
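As context for the change (not part of the patch): a minimal, Spark-free Scala sketch of the statistic the new doc text refers to, the corrected sample standard deviation, i.e. the square root of the unbiased sample variance with an n - 1 denominator. The helper name `correctedSampleStd` is illustrative, not a Spark API.

```scala
// Corrected (unbiased-variance) sample standard deviation: divide the sum of
// squared deviations by n - 1 rather than n, then take the square root.
def correctedSampleStd(xs: Seq[Double]): Double = {
  require(xs.length > 1, "need at least two samples for the n - 1 denominator")
  val mean = xs.sum / xs.length
  val unbiasedVariance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.length - 1)
  math.sqrt(unbiasedVariance)
}

correctedSampleStd(Seq(1.0, 2.0, 3.0))  // 1.0
// The uncorrected (population) std of the same column would be sqrt(2.0 / 3.0) ≈ 0.816.
```

This is the same n - 1 convention as R's sd(), which is what the commit message's comparison to R's scaling behavior rests on.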
Diffstat (limited to 'mllib')
 mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala    | 5 +++++
 mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala | 5 +++++
 2 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala
index 118a6e3e6a..626e97efb4 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala
@@ -66,6 +66,11 @@ private[feature] trait StandardScalerParams extends Params with HasInputCol with
* :: Experimental ::
* Standardizes features by removing the mean and scaling to unit variance using column summary
* statistics on the samples in the training set.
+ *
+ * The "unit std" is computed using the
+ * [[https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation
+ * corrected sample standard deviation]],
+ * which is computed as the square root of the unbiased sample variance.
*/
@Experimental
class StandardScaler(override val uid: String) extends Estimator[StandardScalerModel]
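For illustration only (not part of the patch), a hedged sketch of using the spark.ml estimator documented above. The column names, the tiny one-column dataset, and the local SQLContext setup are assumptions; note that in this era of the codebase, feature vectors still come from org.apache.spark.mllib.linalg.

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
val df = sqlContext
  .createDataFrame(Seq(
    (0, Vectors.dense(1.0)),
    (1, Vectors.dense(2.0)),
    (2, Vectors.dense(3.0))))
  .toDF("id", "features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)  // center first (dense output only)
  .setWithStd(true)   // then divide by the corrected sample std

// fit() estimates column means and corrected sample stds from df;
// transform() applies them, yielding -1.0, 0.0, 1.0 for the column (1, 2, 3).
scaler.fit(df).transform(df).show()
```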
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala
index 5c35e1b91c..ee97045f34 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala
@@ -27,6 +27,11 @@ import org.apache.spark.rdd.RDD
* Standardizes features by removing the mean and scaling to unit std using column summary
* statistics on the samples in the training set.
*
+ * The "unit std" is computed using the
+ * [[https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation
+ * corrected sample standard deviation]],
+ * which is computed as the square root of the unbiased sample variance.
+ *
* @param withMean False by default. Centers the data with mean before scaling. It will build a
* dense output, so this does not work on sparse input and will raise an exception.
* @param withStd True by default. Scales the data to unit standard deviation.
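A matching sketch against the RDD-based spark.mllib class documented in this second hunk; again the one-column dataset is an assumption, chosen so the corrected divisor is easy to verify by hand.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("scaler-check").setMaster("local[*]"))
val data = sc.parallelize(Seq(Vectors.dense(1.0), Vectors.dense(2.0), Vectors.dense(3.0)))

// For the column (1, 2, 3): mean = 2 and the corrected sample std is
// sqrt(2 / (3 - 1)) = 1, so the scaled column is (-1.0, 0.0, 1.0). With the
// uncorrected std, sqrt(2 / 3) ≈ 0.816, it would be roughly (-1.22, 0.0, 1.22).
val model = new StandardScaler(withMean = true, withStd = true).fit(data)
model.transform(data).collect().foreach(println)
```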