author VinceShieh <vincent.xie@intel.com> 2017-03-07 11:24:20 -0800
committer Joseph K. Bradley <joseph@databricks.com> 2017-03-07 11:24:20 -0800
commit 4a9034b17374cf19c77cb74e36c86cd085d59602
tree 1bb589f4efd445295897173aff2146ecfec425f8 /project
parent c05baabf10dd4c808929b4ae7a6d118aba6dd665
[SPARK-17498][ML] StringIndexer enhancement for handling unseen labels
## What changes were proposed in this pull request?

This PR is an enhancement to ML StringIndexer. Before this PR, StringIndexer supported only the "skip" and "error" options for dealing with unseen records. Those unseen records might still be useful, and in certain use cases users would like to keep the unseen labels. This PR enables StringIndexer to keep unseen labels by assigning them the index numLabels.

Before:

    StringIndexer().setHandleInvalid("skip")
    StringIndexer().setHandleInvalid("error")

After, a third option "keep" is supported:

    StringIndexer().setHandleInvalid("keep")

## How was this patch tested?

Test added in StringIndexerSuite.

Signed-off-by: VinceShieh <vincent.xie@intel.com>

Author: VinceShieh <vincent.xie@intel.com>

Closes #16883 from VinceShieh/spark-17498.
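To make the three `handleInvalid` modes concrete, here is a minimal sketch in plain Scala that models their semantics without depending on Spark. `LabelIndexer` and its `fit`, `index`, and `transform` methods are hypothetical names introduced only for illustration; they are not Spark's `StringIndexer` or `StringIndexerModel`, and the alphabetical tie-breaking in `fit` is an assumption made so the example is deterministic.

```scala
// Hypothetical stand-in for a fitted indexer; NOT Spark's StringIndexerModel.
final case class LabelIndexer(labels: Vector[String], handleInvalid: String) {

  // Label -> index lookup, where the index is the label's position in `labels`.
  private val labelToIndex: Map[String, Double] =
    labels.zipWithIndex.map { case (label, i) => label -> i.toDouble }.toMap

  /** Returns Some(index) for a row that is kept and None for a row that is dropped. */
  def index(label: String): Option[Double] =
    labelToIndex.get(label) match {
      case seen @ Some(_) => seen
      case None =>
        handleInvalid match {
          // "keep": every unseen label shares one extra index, numLabels.
          case "keep" => Some(labels.size.toDouble)
          // "skip": the row carrying the unseen label is filtered out.
          case "skip" => None
          // "error": fail on the first unseen label.
          case "error" =>
            throw new IllegalArgumentException(s"Unseen label: $label")
          case other =>
            throw new IllegalArgumentException(s"Unsupported handleInvalid: $other")
        }
    }

  /** Indexes every row, dropping the rows for which `index` returns None. */
  def transform(rows: Seq[String]): Seq[Double] =
    rows.flatMap(row => index(row).toList)
}

object LabelIndexer {
  /** Orders labels by descending frequency; ties are broken alphabetically (an assumption). */
  def fit(rows: Seq[String], handleInvalid: String = "error"): LabelIndexer = {
    val ordered = rows
      .groupBy(identity)
      .toVector
      .sortBy { case (label, occurrences) => (-occurrences.size, label) }
      .map { case (label, _) => label }
    LabelIndexer(ordered, handleInvalid)
  }
}
```

Fitting on `Seq("a", "b", "a", "c", "a", "b")` yields the labels `a`, `b`, `c` at indices 0, 1, 2. Transforming `Seq("a", "d", "c")` then gives `Seq(0.0, 3.0, 2.0)` under "keep", since the unseen `d` maps to numLabels = 3, gives `Seq(0.0, 2.0)` under "skip", and throws under "error".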
Diffstat (limited to 'project')
-rw-r--r-- project/MimaExcludes.scala 4
1 files changed, 4 insertions, 0 deletions
diff --git a/project/MimaExcludes.scala b/project/MimaExcludes.scala
index 56b8c0b95e..bd4528bd21 100644
--- a/project/MimaExcludes.scala
+++ b/project/MimaExcludes.scala
@@ -915,6 +915,10 @@ object MimaExcludes {
// [SPARK-17163] Unify logistic regression interface. Private constructor has new signature.
ProblemFilters.exclude[DirectMissingMethodProblem]("org.apache.spark.ml.classification.LogisticRegressionModel.this")
) ++ Seq(
+ // [SPARK-17498] StringIndexer enhancement for handling unseen labels
+ ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.ml.feature.StringIndexer"),
+ ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.ml.feature.StringIndexerModel")
+ ) ++ Seq(
// [SPARK-17365][Core] Remove/Kill multiple executors together to reduce RPC call time
ProblemFilters.exclude[MissingTypesProblem]("org.apache.spark.SparkContext")
) ++ Seq(