author | Yuhao Yang <hhbyyh@gmail.com> | 2016-06-21 00:47:36 -0700 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2016-06-21 00:47:36 -0700 |
commit | a58f40239444d42adbc480ddde02cbb02a79bbe4 | |
tree | c8bbf6103f2138f7430559df86a7a971356a1e97 /docs | |
parent | 37494a18e8d6e22113338523d6498e00ac9725ea | |
[SPARK-16045][ML][DOC] Spark 2.0 ML.feature: doc update for stopwords and binarizer
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.
## How was this patch tested?
manual review for doc
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #13375 from hhbyyh/stopdoc.
Diffstat (limited to 'docs')
-rw-r--r-- | docs/ml-features.md | 16 |
1 files changed, 10 insertions, 6 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 3db24a3840..3cb26443b9 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -251,11 +251,12 @@ frequently and don't carry as much meaning.
 `StopWordsRemover` takes as input a sequence of strings (e.g. the output
 of a [Tokenizer](ml-features.html#tokenizer)) and drops all the stop
 words from the input sequences. The list of stopwords is specified by
-the `stopWords` parameter. We provide [a list of stop
-words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
-default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
-if the matches should be case sensitive (false by default).
+the `stopWords` parameter. Default stop words for some languages are accessible
+by calling `StopWordsRemover.loadDefaultStopWords(language)`, for which available
+options are "danish", "dutch", "english", "finnish", "french", "german", "hungarian",
+"italian", "norwegian", "portuguese", "russian", "spanish", "swedish" and "turkish".
+A boolean parameter `caseSensitive` indicates if the matches should be case sensitive
+(false by default).
 
 **Examples**
 
@@ -346,7 +347,10 @@ for more details on the API.
 
 Binarization is the process of thresholding numerical features to binary (0/1) features.
 
-`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold` for binarization. Feature values greater than the threshold are binarized to 1.0; values equal to or less than the threshold are binarized to 0.0.
+`Binarizer` takes the common parameters `inputCol` and `outputCol`, as well as the `threshold`
+for binarization. Feature values greater than the threshold are binarized to 1.0; values equal
+to or less than the threshold are binarized to 0.0. Both Vector and Double types are supported
+for `inputCol`.
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
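
The behavior described by the updated text can be illustrated with a minimal plain-Python sketch. It is not Spark code and does not call the Spark API. The function names `remove_stop_words` and `binarize`, the two-word stop list, and the `0.0` default threshold are assumptions made for illustration only. The sketch assumes that case-insensitive matching compares lowercased forms, and it applies the documented rule that only values strictly greater than the threshold become 1.0.

```python
from typing import Iterable, List, Sequence


def remove_stop_words(tokens: Sequence[str],
                      stop_words: Iterable[str],
                      case_sensitive: bool = False) -> List[str]:
    """Drop every token that matches a stop word (hypothetical helper)."""
    if case_sensitive:
        stop_set = set(stop_words)
        return [t for t in tokens if t not in stop_set]
    # Case-insensitive matching: compare lowercased forms on both sides.
    stop_set = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stop_set]


def binarize(values: Sequence[float], threshold: float = 0.0) -> List[float]:
    """Map values strictly greater than the threshold to 1.0, all others to 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


if __name__ == "__main__":
    tokens = ["I", "saw", "the", "red", "balloon"]
    stops = ["i", "the"]  # hypothetical tiny list, not Spark's default list

    # Default (case-insensitive): "I" matches the stop word "i".
    print(remove_stop_words(tokens, stops))
    # Case-sensitive: "I" no longer matches "i" and is kept.
    print(remove_stop_words(tokens, stops, case_sensitive=True))
    # A value equal to the threshold is binarized to 0.0.
    print(binarize([0.1, 0.5, 0.8], threshold=0.5))
```

The middle value in the last call shows the boundary case the documentation spells out: a feature equal to the threshold maps to 0.0, not 1.0.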