author    Joseph K. Bradley <joseph@databricks.com>  2016-04-20 11:48:30 -0700
committer Joseph K. Bradley <joseph@databricks.com>  2016-04-20 11:48:30 -0700
commit    acc7e592c4ee5b4a6f42945329fc289fd11e1793 (patch)
tree      1acd3764fdd3158986383dbc823d624ed7218e74 /mllib
parent    08f84d7a9a7429b3d6651b5af4d7740027b53d39 (diff)
[SPARK-14478][ML][MLLIB][DOC] Doc that StandardScaler uses the corrected sample std
## What changes were proposed in this pull request?

Currently, MLlib's StandardScaler scales columns using the corrected standard deviation (sqrt of unbiased variance). This matches what R's scale package does. This PR documents this fact.

## How was this patch tested?

doc only

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12519 from jkbradley/scaler-variance-doc.
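As context for the change (not part of the patch): a minimal, Spark-free Scala sketch of the statistic the new doc text refers to, the corrected sample standard deviation, i.e. the square root of the unbiased sample variance with an n - 1 denominator. The helper name `correctedSampleStd` is illustrative, not a Spark API.

```scala
// Corrected (unbiased-variance) sample standard deviation: divide the sum of
// squared deviations by n - 1 rather than n, then take the square root.
def correctedSampleStd(xs: Seq[Double]): Double = {
  require(xs.length > 1, "need at least two samples for the n - 1 denominator")
  val mean = xs.sum / xs.length
  val unbiasedVariance = xs.map(x => (x - mean) * (x - mean)).sum / (xs.length - 1)
  math.sqrt(unbiasedVariance)
}

correctedSampleStd(Seq(1.0, 2.0, 3.0))  // 1.0
// The uncorrected (population) std of the same column would be sqrt(2.0 / 3.0) ≈ 0.816.
```

This is the same n - 1 convention as R's sd(), which is what the commit message's comparison to R's scaling behavior rests on.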
Diffstat (limited to 'mllib')
 mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala    | 5 +++++
 mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala | 5 +++++
 2 files changed, 10 insertions(+), 0 deletions(-)
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala
index 118a6e3e6a..626e97efb4 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/StandardScaler.scala
@@ -66,6 +66,11 @@ private[feature] trait StandardScalerParams extends Params with HasInputCol with
* :: Experimental ::
* Standardizes features by removing the mean and scaling to unit variance using column summary
* statistics on the samples in the training set.
+ *
+ * The "unit std" is computed using the
+ * [[https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation
+ * corrected sample standard deviation]],
+ * which is computed as the square root of the unbiased sample variance.
*/
@Experimental
class StandardScaler(override val uid: String) extends Estimator[StandardScalerModel]
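For illustration only (not part of the patch), a hedged sketch of using the spark.ml estimator documented above. The column names, the tiny one-column dataset, and the local SQLContext setup are assumptions; note that in this era of the codebase, feature vectors still come from org.apache.spark.mllib.linalg.

```scala
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)  // assumes an existing SparkContext `sc`
val df = sqlContext
  .createDataFrame(Seq(
    (0, Vectors.dense(1.0)),
    (1, Vectors.dense(2.0)),
    (2, Vectors.dense(3.0))))
  .toDF("id", "features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithMean(true)  // center first (dense output only)
  .setWithStd(true)   // then divide by the corrected sample std

// fit() estimates column means and corrected sample stds from df;
// transform() applies them, yielding -1.0, 0.0, 1.0 for the column (1, 2, 3).
scaler.fit(df).transform(df).show()
```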
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala
index 5c35e1b91c..ee97045f34 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/StandardScaler.scala
@@ -27,6 +27,11 @@ import org.apache.spark.rdd.RDD
* Standardizes features by removing the mean and scaling to unit std using column summary
* statistics on the samples in the training set.
*
+ * The "unit std" is computed using the
+ * [[https://en.wikipedia.org/wiki/Standard_deviation#Corrected_sample_standard_deviation
+ * corrected sample standard deviation]],
+ * which is computed as the square root of the unbiased sample variance.
+ *
* @param withMean False by default. Centers the data with mean before scaling. It will build a
* dense output, so this does not work on sparse input and will raise an exception.
* @param withStd True by default. Scales the data to unit standard deviation.
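A matching sketch against the RDD-based spark.mllib class documented in this second hunk; again the one-column dataset is an assumption, chosen so the corrected divisor is easy to verify by hand.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.StandardScaler
import org.apache.spark.mllib.linalg.Vectors

val sc = new SparkContext(new SparkConf().setAppName("scaler-check").setMaster("local[*]"))
val data = sc.parallelize(Seq(Vectors.dense(1.0), Vectors.dense(2.0), Vectors.dense(3.0)))

// For the column (1, 2, 3): mean = 2 and the corrected sample std is
// sqrt(2 / (3 - 1)) = 1, so the scaled column is (-1.0, 0.0, 1.0). With the
// uncorrected std, sqrt(2 / 3) ≈ 0.816, it would be roughly (-1.22, 0.0, 1.22).
val model = new StandardScaler(withMean = true, withStd = true).fit(data)
model.transform(data).collect().foreach(println)
```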