[SPARK-17498][ML] StringIndexer enhancement for handling unseen labels

## What changes were proposed in this pull request?

This PR is an enhancement to ML StringIndexer.
Before this PR, StringIndexer only supported the "skip" and "error" options for dealing with unseen records.
But those unseen records might still be useful, and users may want to keep the unseen labels in certain use cases.
This PR enables StringIndexer to support keeping unseen labels at index [numLabels].

'''Before
StringIndexer().setHandleInvalid("skip")
StringIndexer().setHandleInvalid("error")
'''After support the third option "keep"
StringIndexer().setHandleInvalid("keep")

## How was this patch tested?

Test added in StringIndexerSuite

Signed-off-by: VinceShieh <vincent.xie@intel.com>
(Please fill in changes proposed in this fix)

Author: VinceShieh <vincent.xie@intel.com>

Closes #16883 from VinceShieh/spark-17498.
author: VinceShieh <vincent.xie@intel.com> 2017-03-07 11:24:20 -0800
committer: Joseph K. Bradley <joseph@databricks.com> 2017-03-07 11:24:20 -0800
commit: 4a9034b17374cf19c77cb74e36c86cd085d59602 (patch)
tree: 1bb589f4efd445295897173aff2146ecfec425f8 /docs/ml-features.md
parent: c05baabf10dd4c808929b4ae7a6d118aba6dd665 (diff)
download: spark-4a9034b17374cf19c77cb74e36c86cd085d59602.tar.gz
spark-4a9034b17374cf19c77cb74e36c86cd085d59602.tar.bz2
spark-4a9034b17374cf19c77cb74e36c86cd085d59602.zip
1 files changed, 20 insertions, 2 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 57605bafbf..dad1c6db18 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -503,6 +503,7 @@ for more details on the API.
 
 `StringIndexer` encodes a string column of labels to a column of label indices.
 The indices are in `[0, numLabels)`, ordered by label frequencies, so the most frequent label gets index `0`.
+The unseen labels will be put at index numLabels if user chooses to keep them.
 If the input column is numeric, we cast it to string and index the string
 values. When downstream pipeline components such as `Estimator` or
 `Transformer` make use of this string-indexed label, you must set the input
@@ -542,12 +543,13 @@ column, we should get the following:
 "a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
 index `2`.
 
-Additionally, there are two strategies regarding how `StringIndexer` will handle
+Additionally, there are three strategies regarding how `StringIndexer` will handle
 unseen labels when you have fit a `StringIndexer` on one dataset and then use it
 to transform another:
 
 - throw an exception (which is the default)
 - skip the row containing the unseen label entirely
+- put unseen labels in a special additional bucket, at index numLabels
 
 **Examples**
 
@@ -561,6 +563,7 @@ Let's go back to our previous example but this time reuse our previously defined
  1  | b
  2  | c
  3  | d
+ 4  | e
 ~~~~
 
 If you've not set how `StringIndexer` handles unseen labels or set it to
@@ -576,7 +579,22 @@ will be generated:
  2  | c        | 1.0
 ~~~~
 
-Notice that the row containing "d" does not appear.
+Notice that the rows containing "d" or "e" do not appear.
+
+If you call `setHandleInvalid("keep")`, the following dataset
+will be generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0  | a        | 0.0
+ 1  | b        | 2.0
+ 2  | c        | 1.0
+ 3  | d        | 3.0
+ 4  | e        | 3.0
+~~~~
+
+Notice that the rows containing "d" or "e" are mapped to index "3.0"
 
 <div class="codetabs">
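
The three strategies documented in this patch can be illustrated with a small plain-Python model. This is only a sketch of the documented behavior, not Spark's implementation: `fit_labels` and `transform` are hypothetical helper names, the alphabetical tie-break is an assumption, and the training values are assumed to be ones that yield the indices stated in the docs ("a" at `0`, "c" at `1`, "b" at `2`).

```python
from collections import Counter

VALID_MODES = ("error", "skip", "keep")


def fit_labels(values):
    """Return distinct labels ordered by descending frequency.

    Ties are broken alphabetically here; that tie-break is an assumption
    made for this sketch only.
    """
    counts = Counter(values)
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [label for label, _ in ordered]


def transform(labels, rows, handle_invalid="error"):
    """Map (id, category) rows to (id, category, index) rows.

    handle_invalid mirrors the three documented strategies:
      "error": raise on the first unseen label (the default)
      "skip":  drop rows whose label was not seen during fitting
      "keep":  send every unseen label to one extra bucket at index numLabels
    """
    if handle_invalid not in VALID_MODES:
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")

    index = {label: float(i) for i, label in enumerate(labels)}
    unseen_index = float(len(labels))  # numLabels, shared by all unseen labels

    out = []
    for row_id, category in rows:
        if category in index:
            out.append((row_id, category, index[category]))
        elif handle_invalid == "keep":
            out.append((row_id, category, unseen_index))
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError(f"unseen label: {category!r}")
    return out


# Assumed training column that produces the indices stated in the docs.
labels = fit_labels(["a", "b", "c", "a", "a", "c"])

# The second dataset from the diff, including the unseen labels "d" and "e".
rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d"), (4, "e")]

for row in transform(labels, rows, handle_invalid="keep"):
    print(row)
```

With `"keep"`, both "d" and "e" land in the same bucket at index `3.0`, so the output cannot distinguish one unseen label from another. With `"skip"`, only the rows for "a", "b", and "c" remain.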