| author | Octavian Geagla <ogeagla@gmail.com> | 2015-02-01 09:21:14 -0800 |
|---|---|---|
| committer | Xiangrui Meng <meng@databricks.com> | 2015-02-01 09:21:14 -0800 |
| commit | bdb0680d37614ccdec8933d2dec53793825e43d7 | |
| tree | 4a665b0a605a63b8b19886022d4f5246e3fdedc4 /docs | |
| parent | 80bd715a3e2c39449ed5e4d4e7058d75281ef3cb | |
[SPARK-5207] [MLLIB] StandardScalerModel mean and variance re-use
This seems complete. The duplication of tests for provided means/variances might be overkill; I would appreciate some feedback.
Author: Octavian Geagla <ogeagla@gmail.com>
Closes #4140 from ogeagla/SPARK-5207 and squashes the following commits:
fa64dfa [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel to take stddev instead of variance
9078fe0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] Incorporate code review feedback: change arg ordering, add dev api annotations, do better null checking, add another test and some doc for this.
997d2e0 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] make withMean and withStd public, add constructor which uses defaults, un-refactor test class
64408a4 [Octavian Geagla] [SPARK-5207] [MLLIB] [WIP] change StandardScalerModel constructor to not be private to mllib, added tests for newly-exposed functionality
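
A minimal Scala sketch of what the newly public constructor described above enables: a `StandardScalerModel` can now be built directly from precomputed per-feature standard deviations and means, re-using summary statistics without calling `fit` again. The vector values here are illustrative, not from the patch, and the auto-setting of `withStd`/`withMean` from non-null arguments is an assumption about the merged implementation.

```scala
import org.apache.spark.mllib.feature.StandardScalerModel
import org.apache.spark.mllib.linalg.Vectors

// Build a model directly from precomputed statistics. Note the argument
// order introduced by this patch: standard deviation first, then mean.
val model = new StandardScalerModel(
  Vectors.dense(0.5, 2.0),  // per-feature standard deviations
  Vectors.dense(1.0, -1.0)) // per-feature means

// transform applies (x - mean) / std per feature:
// (2.0 - 1.0) / 0.5 = 2.0 and (1.0 - (-1.0)) / 2.0 = 1.0
println(model.transform(Vectors.dense(2.0, 1.0))) // [2.0,1.0]
```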
Diffstat (limited to 'docs')
docs/mllib-feature-extraction.md | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)
```diff
diff --git a/docs/mllib-feature-extraction.md b/docs/mllib-feature-extraction.md
index 197bc77d50..d4a61a7fbf 100644
--- a/docs/mllib-feature-extraction.md
+++ b/docs/mllib-feature-extraction.md
@@ -240,11 +240,11 @@ following parameters in the constructor:
 * `withMean` False by default. Centers the data with mean before scaling. It will build a dense
 output, so this does not work on sparse input and will raise an exception.
-* `withStd` True by default. Scales the data to unit variance.
+* `withStd` True by default. Scales the data to unit standard deviation.
 
 We provide a [`fit`](api/scala/index.html#org.apache.spark.mllib.feature.StandardScaler) method in
 `StandardScaler` which can take an input of `RDD[Vector]`, learn the summary statistics, and then
-return a model which can transform the input dataset into unit variance and/or zero mean features
+return a model which can transform the input dataset into unit standard deviation and/or zero mean features
 depending how we configure the `StandardScaler`.
 
 This model implements [`VectorTransformer`](api/scala/index.html#org.apache.spark.mllib.feature.VectorTransformer)
 
@@ -257,7 +257,7 @@ for that feature.
 ### Example
 
 The example below demonstrates how to load a dataset in libsvm format, and standardize the features
-so that the new features have unit variance and/or zero mean.
+so that the new features have unit standard deviation and/or zero mean.
 
 <div class="codetabs">
 <div data-lang="scala">
@@ -271,6 +271,8 @@ val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
 
 val scaler1 = new StandardScaler().fit(data.map(x => x.features))
 val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))
+// scaler3 is an identical model to scaler2, and will produce identical transformations
+val scaler3 = new StandardScalerModel(scaler2.std, scaler2.mean)
 
 // data1 will be unit variance.
 val data1 = data.map(x => (x.label, scaler1.transform(x.features)))
@@ -294,6 +296,9 @@ features = data.map(lambda x: x.features)
 
 scaler1 = StandardScaler().fit(features)
 scaler2 = StandardScaler(withMean=True, withStd=True).fit(features)
+# scaler3 is an identical model to scaler2, and will produce identical transformations
+scaler3 = StandardScalerModel(scaler2.std, scaler2.mean)
+
 # data1 will be unit variance.
 data1 = label.zip(scaler1.transform(features))
```
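
For context, the updated Scala example reads as follows when made self-contained. The `SparkConf`/`SparkContext` boilerplate and the `Vectors.dense` densification for the `withMean` path are assumptions drawn from the surrounding doc page, not part of the hunks shown above; the densification follows the doc's own note that centering does not work on sparse input.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.feature.{StandardScaler, StandardScalerModel}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.util.MLUtils

val conf = new SparkConf().setAppName("StandardScalerExample")
val sc = new SparkContext(conf)

// Load labeled points in libsvm format; features are sparse vectors.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")

val scaler1 = new StandardScaler().fit(data.map(x => x.features))
val scaler2 = new StandardScaler(withMean = true, withStd = true).fit(data.map(x => x.features))
// scaler3 is an identical model to scaler2, and will produce identical transformations
val scaler3 = new StandardScalerModel(scaler2.std, scaler2.mean)

// data1 will have unit standard deviation.
val data1 = data.map(x => (x.label, scaler1.transform(x.features)))

// Centering (withMean) requires dense input, so densify the sparse libsvm
// features before transforming with scaler2 (or, equivalently, scaler3).
val data2 = data.map(x => (x.label, scaler2.transform(Vectors.dense(x.features.toArray))))
```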