author BenFradet <benjamin.fradet@gmail.com> 2015-12-11 15:43:00 -0800
committer Joseph K. Bradley <joseph@databricks.com> 2015-12-11 15:43:00 -0800
commit aea676ca2d07c72b1a752e9308c961118e5bfc3c
tree 8140f9e7bfa90e7a89f826c8372450c60afff7eb
parent 1b8220387e6903564f765fabb54be0420c3e99d7
[SPARK-12217][ML] Document invalid handling for StringIndexer
Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.

I wonder if I should also add a snippet to the code example, input welcome.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10257 from BenFradet/SPARK-12217.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- docs/ml-features.md 36
1 files changed, 36 insertions, 0 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 6494fed0a0..8b00cc652d 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -459,6 +459,42 @@ column, we should get the following:
"a" gets index `0` because it is the most frequent, followed by "c" with index `1` and "b" with
index `2`.
+Additionally, there are two strategies for how `StringIndexer` handles
+unseen labels when you have fit a `StringIndexer` on one dataset and then use it
+to transform another:
+
+- throw an exception (which is the default)
+- skip the row containing the unseen label entirely
+
+**Examples**
+
+Let's go back to our previous example but this time reuse our previously defined
+`StringIndexer` on the following dataset:
+
+~~~~
+ id | category
+----|----------
+ 0  | a
+ 1  | b
+ 2  | c
+ 3  | d
+~~~~
+
+If you have not set how `StringIndexer` handles unseen labels, or have set it to
+"error", an exception is thrown because the label "d" was not seen during fitting.
+However, if you called `setHandleInvalid("skip")`, the following dataset is
+generated:
+
+~~~~
+ id | category | categoryIndex
+----|----------|---------------
+ 0  | a        | 0.0
+ 1  | b        | 2.0
+ 2  | c        | 1.0
+~~~~
+
+Notice that the row containing "d" does not appear.
+
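The two strategies can also be illustrated without Spark. The following is a minimal pure-Python sketch of the behaviour described above, not Spark's implementation: `fit_labels`, `transform`, and the `handle_invalid` parameter are hypothetical stand-ins for fitting a `StringIndexer`, transforming with it, and `setHandleInvalid`. The training column is an assumption, chosen so that the indices match the ones quoted earlier ("a" is `0`, "c" is `1`, "b" is `2`).

```python
from collections import Counter


def fit_labels(labels):
    """Map each label to an index, most frequent label first (index 0.0)."""
    counts = Counter(labels)
    # Descending frequency; ties are broken alphabetically here, which is an
    # assumption made only to keep this sketch deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def transform(rows, index, handle_invalid="error"):
    """Append the label index to each (id, category) row.

    handle_invalid="error" raises on an unseen label; "skip" drops the row.
    """
    if handle_invalid not in ("error", "skip"):
        raise ValueError(f"unsupported strategy: {handle_invalid!r}")
    out = []
    for row_id, category in rows:
        if category not in index:
            if handle_invalid == "skip":
                continue  # drop the whole row, as in the "skip" strategy
            raise ValueError(f"Unseen label: {category}")
        out.append((row_id, category, index[category]))
    return out


# Assumed training column, chosen so that a -> 0.0, c -> 1.0, b -> 2.0.
index = fit_labels(["a", "b", "c", "a", "a", "c"])

# The second dataset from the example, including the unseen label "d".
new_rows = [(0, "a"), (1, "b"), (2, "c"), (3, "d")]
skipped = transform(new_rows, index, handle_invalid="skip")
```

With `handle_invalid="skip"`, `skipped` holds only the rows for "a", "b" and "c"; calling `transform(new_rows, index)` with the default strategy raises instead, mirroring the two cases above.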
<div class="codetabs">
<div data-lang="scala" markdown="1">