[SPARK-18374][ML] Incorrect words in StopWords/english.txt

## What changes were proposed in this pull request?

Currently the English stop words list in MLlib contains only the word fragments that remain after removing all apostrophes, so "wouldn't" becomes "wouldn" and "t". Yet by default Tokenizer and RegexTokenizer don't split on apostrophes or quotes.

This change adds the original forms to the stop words list to match the behavior of Tokenizer and StopWordsRemover. It also removes "won" from the list.

See more discussion in the JIRA: https://issues.apache.org/jira/browse/SPARK-18374

## How was this patch tested?

Existing unit tests.

Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #16103 from hhbyyh/addstopwords.
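
The mismatch described above can be illustrated with a minimal sketch in plain Python. It does not use Spark: `tokenize` and `remove_stop_words` are hypothetical stand-ins, `tokenize` assumes a tokenizer that lowercases and splits on whitespace only (so apostrophes stay inside tokens), and the two word sets are small excerpts taken from the diff, not the full lists.

```python
def tokenize(text):
    """Hypothetical stand-in for a whitespace tokenizer: lowercase, then split on whitespace."""
    return text.lower().split()


def remove_stop_words(tokens, stop_words):
    """Drop every token that exactly matches an entry in stop_words."""
    return [t for t in tokens if t not in stop_words]


# Excerpts only (assumption: enough to show the effect, not the full english.txt).
OLD_FRAGMENTS = {"wouldn", "couldn", "didn"}      # entries removed by this patch
NEW_FORMS = {"wouldn't", "couldn't", "didn't"}    # entries added by this patch

tokens = tokenize("She wouldn't go")

# With the old fragments, "wouldn't" never matches "wouldn", so nothing is removed.
before = remove_stop_words(tokens, OLD_FRAGMENTS)

# With the full contracted forms, the token matches and is filtered out.
after = remove_stop_words(tokens, NEW_FORMS)

print(tokens)
print(before)
print(after)
```

Because the apostrophe stays inside the token, only a stop word entry that also contains the apostrophe can match it, which is the motivation given in the commit message.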
author: Yuhao <yuhao.yang@intel.com> 2016-12-07 05:12:24 +0800
committer: Sean Owen <sowen@cloudera.com> 2016-12-07 05:12:24 +0800
commit: fac5b75b74b2d76b6314c69be3c769f1f321688c (patch)
tree: e926e5d43b0cf97e258ec8b919c781eb9c0fb9b2
parent: 1ef6b296d7cd2d93cdfd5f54940842d6bb915ce0 (diff)
download: spark-fac5b75b74b2d76b6314c69be3c769f1f321688c.tar.gz
spark-fac5b75b74b2d76b6314c69be3c769f1f321688c.tar.bz2
spark-fac5b75b74b2d76b6314c69be3c769f1f321688c.zip
2 files changed, 55 insertions, 27 deletions
diff --git a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
index d075cc0bab..d6094d774a 100644
--- a/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
+++ b/mllib/src/main/resources/org/apache/spark/ml/feature/stopwords/english.txt
@@ -125,29 +125,57 @@ just
 don
 should
 now
-d
-ll
-m
-o
-re
-ve
-y
-ain
-aren
-couldn
-didn
-doesn
-hadn
-hasn
-haven
-isn
-ma
-mightn
-mustn
-needn
-shan
-shouldn
-wasn
-weren
-won
-wouldn
+i'll
+you'll
+he'll
+she'll
+we'll
+they'll
+i'd
+you'd
+he'd
+she'd
+we'd
+they'd
+i'm
+you're
+he's
+she's
+it's
+we're
+they're
+i've
+we've
+you've
+they've
+isn't
+aren't
+wasn't
+weren't
+haven't
+hasn't
+hadn't
+don't
+doesn't
+didn't
+won't
+wouldn't
+shan't
+shouldn't
+mustn't
+can't
+couldn't
+cannot
+could
+here's
+how's
+let's
+ought
+that's
+there's
+what's
+when's
+where's
+who's
+why's
+would
+\ No newline at end of file
diff --git a/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
index 957cf58a68..5262b146b1 100755
--- a/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
+++ b/mllib/src/test/scala/org/apache/spark/ml/feature/StopWordsRemoverSuite.scala
@@ -45,7 +45,7 @@ class StopWordsRemoverSuite
       .setOutputCol("filtered")
     val dataSet = Seq(
       (Seq("test", "test"), Seq("test", "test")),
-      (Seq("a", "b", "c", "d"), Seq("b", "c")),
+      (Seq("a", "b", "c", "d"), Seq("b", "c", "d")),
       (Seq("a", "the", "an"), Seq()),
       (Seq("A", "The", "AN"), Seq()),
       (Seq(null), Seq(null)),
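
The changed expectation in the test hunk above can be reproduced with a rough stand-alone sketch. This is plain Python rather than the Scala suite, `remove_stop_words` is a hypothetical helper that assumes case-insensitive matching (consistent with the "A", "The", "AN" row in the test data), and both stop word sets are short excerpts chosen for illustration.

```python
def remove_stop_words(tokens, stop_words):
    """Case-insensitive filter; None tokens are passed through unchanged."""
    return [t for t in tokens if t is None or t.lower() not in stop_words]


# Excerpts only (assumption: not the full english.txt contents).
OLD_EXCERPT = {"a", "the", "an", "d"}    # "d" was one of the fragments removed by the patch
NEW_EXCERPT = {"a", "the", "an", "i'd"}  # "d" is gone; "i'd" is one of the added forms

# Input rows copied from the test data shown in the hunk.
rows = [
    ["test", "test"],
    ["a", "b", "c", "d"],
    ["a", "the", "an"],
    ["A", "The", "AN"],
    [None],
]

old_out = [remove_stop_words(r, OLD_EXCERPT) for r in rows]
new_out = [remove_stop_words(r, NEW_EXCERPT) for r in rows]

# Only the second row differs between the two lists.
print(old_out[1])
print(new_out[1])
```

Only the row containing "d" changes, which matches the single expectation edited in the suite.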