author | Sean Owen <sowen@cloudera.com> | 2016-08-27 08:48:56 +0100 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2016-08-27 08:48:56 +0100 |
commit | e07baf14120bc94b783649dabf5fffea58bff0de (patch) | |
tree | 557979925874c18034e793057a9706c3ee6924fa /docs/ml-features.md | |
parent | 9fbced5b25c2f24d50c50516b4b7737f7e3eaf86 (diff) | |
[SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
## How was this patch tested?
Jenkins tests, including new cases to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes #14663 from srowen/SPARK-17001.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- | docs/ml-features.md | 2 |
1 files changed, 1 insertions, 1 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index e41bf78521..746593fb9e 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -768,7 +768,7 @@ for more details on the API.
 `StandardScaler` transforms a dataset of `Vector` rows, normalizing each feature to have unit standard deviation and/or zero mean. It takes parameters:
 * `withStd`: True by default. Scales the data to unit standard deviation.
-* `withMean`: False by default. Centers the data with mean before scaling. It will build a dense output, so this does not work on sparse input and will raise an exception.
+* `withMean`: False by default. Centers the data with mean before scaling. It will build a dense output, so take care when applying to sparse input.
 `StandardScaler` is an `Estimator` which can be `fit` on a dataset to produce a `StandardScalerModel`; this amounts to computing summary statistics. The model can then transform a `Vector` column in a dataset to have unit standard deviation and/or zero mean features.
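The caution about sparse input in the new `withMean` text can be illustrated with a small, Spark-free sketch. This is not Spark's implementation. The `(size, {index: value})` representation and the helper names `fit_stats` and `scale_row` are hypothetical, and the use of the sample standard deviation (dividing by n - 1) and the mapping of zero-variance features to 0.0 are assumptions made for the illustration. It shows that scaling alone keeps zeros at zero, while centering turns every implicit zero into `-mean`, so the output has to be dense.

```python
import math

# A sparse vector is modelled here as (size, {index: value}).
# This representation is hypothetical, chosen only for illustration.

def fit_stats(rows):
    """Return per-feature means and sample standard deviations."""
    n = len(rows)
    size = rows[0][0]
    means = [0.0] * size
    for _, entries in rows:
        for i, v in entries.items():
            means[i] += v / n
    sq_dev = [0.0] * size
    for _, entries in rows:
        for i in range(size):
            d = entries.get(i, 0.0) - means[i]
            sq_dev[i] += d * d
    # Assumption: sample standard deviation (n - 1 in the denominator).
    stds = [math.sqrt(s / (n - 1)) for s in sq_dev]
    return means, stds

def scale_row(row, means, stds, with_mean=False, with_std=True):
    """Scale one row; centering forces a dense list as output."""
    size, entries = row

    def divide(v, i):
        # Assumption: features with zero variance are mapped to 0.0.
        return v / stds[i] if stds[i] != 0.0 else 0.0

    if with_mean:
        # Every implicit zero becomes -mean, so all positions are stored.
        centered = [entries.get(i, 0.0) - means[i] for i in range(size)]
        return [divide(v, i) if with_std else v
                for i, v in enumerate(centered)]
    # Without centering, zeros stay zero and the row stays sparse.
    scaled = {i: (divide(v, i) if with_std else v)
              for i, v in entries.items()}
    return size, scaled
```

With two rows of size 3, `scale_row(rows[0], means, stds)` returns a sparse pair with the same number of stored entries as the input, whereas `scale_row(rows[0], means, stds, with_mean=True)` returns a list of length 3. On wide, mostly empty feature vectors that difference is what drives the memory cost the documentation warns about.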