[SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector

## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

Author: William Benton <willb@redhat.com>

Closes #15105 from willb/fix/findSynonyms.
author: William Benton <willb@redhat.com> 2016-09-17 12:49:58 +0100
committer: Sean Owen <sowen@cloudera.com> 2016-09-17 12:49:58 +0100
commit: 25cbbe6ca334140204e7035ab8b9d304da9b8a8a (patch)
tree: 7e0ec70179b52f4b39336c2fbb841a8584e83a48 /python
parent: f15d41be3ce7569736ccbf2ffe1bec265865f55d (diff)
download: spark-25cbbe6ca334140204e7035ab8b9d304da9b8a8a.tar.gz spark-25cbbe6ca334140204e7035ab8b9d304da9b8a8a.tar.bz2 spark-25cbbe6ca334140204e7035ab8b9d304da9b8a8a.zip
1 file changed, 9 insertions, 3 deletions
diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index b32d0c70ec..5d99644fca 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -544,8 +544,7 @@ class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):
 
 @ignore_unicode_prefix
 class Word2Vec(object):
-    """
-    Word2Vec creates vector representation of words in a text corpus.
+    """Word2Vec creates vector representation of words in a text corpus.
     The algorithm first constructs a vocabulary from the corpus
     and then learns vector representation of words in the vocabulary.
     The vector representation can be used as features in
@@ -567,13 +566,19 @@ class Word2Vec(object):
     >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
     >>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
 
+    Querying for synonyms of a word will not return that word:
+
     >>> syms = model.findSynonyms("a", 2)
     >>> [s[0] for s in syms]
     [u'b', u'c']
+
+    But querying for synonyms of a vector may return the word whose
+    representation is that vector:
+
     >>> vec = model.transform("a")
     >>> syms = model.findSynonyms(vec, 2)
     >>> [s[0] for s in syms]
-    [u'b', u'c']
+    [u'a', u'b']
 
     >>> import os, tempfile
     >>> path = tempfile.mkdtemp()
@@ -591,6 +596,7 @@ class Word2Vec(object):
     ...     pass
 
     .. versionadded:: 1.2.0
+
     """
     def __init__(self):
         """
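
The doctest change in this diff can be illustrated without a Spark cluster. The following is a minimal pure-Python sketch, not Spark's implementation: `cosine`, `find_synonyms`, and the `toy` vocabulary are hypothetical names and data invented for illustration. It assumes that candidates are ranked by cosine similarity and that only a word query excludes the query word itself, which is the behavior the updated doctest shows.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def find_synonyms(vectors, query, num):
    """Hypothetical stand-in for findSynonyms over a dict of word -> vector.

    A word query looks up the word's vector and excludes that word from the
    results. A vector query excludes nothing, so the best match is kept even
    when it is the word whose representation equals the query vector.
    """
    if isinstance(query, str):
        exclude = query
        query_vec = vectors[query]
    else:
        exclude = None
        query_vec = query

    # Rank every vocabulary word by similarity to the query, best first.
    scored = sorted(
        ((word, cosine(query_vec, vec)) for word, vec in vectors.items()),
        key=lambda pair: -pair[1],
    )
    # Drop only the query word (if any), then keep the top `num` matches.
    return [(w, s) for w, s in scored if w != exclude][:num]


# Toy two-dimensional "model"; the values are made up for illustration.
toy = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.5, 0.5], "d": [0.0, 1.0]}

by_word = [w for w, _ in find_synonyms(toy, "a", 2)]
by_vector = [w for w, _ in find_synonyms(toy, toy["a"], 2)]
print(by_word)    # ['b', 'c']
print(by_vector)  # ['a', 'b']
```

Before this change, the vector query would have dropped its best match unconditionally, which is why the old doctest expected `[u'b', u'c']` for both calls.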