author: William Benton <willb@redhat.com> 2016-09-17 12:49:58 +0100
committer: Sean Owen <sowen@cloudera.com> 2016-09-17 12:49:58 +0100
commit: 25cbbe6ca334140204e7035ab8b9d304da9b8a8a
tree: 7e0ec70179b52f4b39336c2fbb841a8584e83a48 /python
parent: f15d41be3ce7569736ccbf2ffe1bec265865f55d
[SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector
## What changes were proposed in this pull request?

This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.

## How was this patch tested?

I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.

Author: William Benton <willb@redhat.com>

Closes #15105 from willb/fix/findSynonyms.
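The following is a minimal pure-Python sketch of the behavior exercised by the updated doctest in this patch: a query by word excludes that word from the results, while a query by vector keeps the best match even when it is the word whose representation is that vector. It is not Spark's implementation. The helper names `cosine` and `find_synonyms`, the toy dictionary `toy_vectors`, and its two-dimensional values are all hypothetical and chosen only for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def find_synonyms(vectors, query, num):
    """Return the `num` most similar (word, similarity) pairs.

    `vectors` is a hypothetical stand-in for a trained model: a dict
    mapping each vocabulary word to its vector. `query` is either a
    word (str) or a vector (sequence of floats).
    """
    if isinstance(query, str):
        # Word query: the query word itself is excluded from the results.
        exclude = query
        query_vec = vectors[query]
    else:
        # Vector query: nothing is excluded, so the best match is kept.
        exclude = None
        query_vec = query
    scored = [(word, cosine(query_vec, vec))
              for word, vec in vectors.items() if word != exclude]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:num]


# Toy data (assumed values, not produced by any real model).
toy_vectors = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.5, 0.5],
    "d": [0.0, 1.0],
}

by_word = [w for w, _ in find_synonyms(toy_vectors, "a", 2)]
by_vector = [w for w, _ in find_synonyms(toy_vectors, toy_vectors["a"], 2)]
print(by_word)    # ['b', 'c']
print(by_vector)  # ['a', 'b']
```

With this toy data the two calls mirror the doctest results in the diff: the word query skips `a`, and the vector query returns `a` first.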
Diffstat (limited to 'python')
-rw-r--r-- python/pyspark/mllib/feature.py | 12
1 file changed, 9 insertions, 3 deletions
diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index b32d0c70ec..5d99644fca 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -544,8 +544,7 @@ class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):
@ignore_unicode_prefix
class Word2Vec(object):
- """
- Word2Vec creates vector representation of words in a text corpus.
+ """Word2Vec creates vector representation of words in a text corpus.
The algorithm first constructs a vocabulary from the corpus
and then learns vector representation of words in the vocabulary.
The vector representation can be used as features in
@@ -567,13 +566,19 @@ class Word2Vec(object):
>>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
>>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
+ Querying for synonyms of a word will not return that word:
+
>>> syms = model.findSynonyms("a", 2)
>>> [s[0] for s in syms]
[u'b', u'c']
+
+ But querying for synonyms of a vector may return the word whose
+ representation is that vector:
+
>>> vec = model.transform("a")
>>> syms = model.findSynonyms(vec, 2)
>>> [s[0] for s in syms]
- [u'b', u'c']
+ [u'a', u'b']
>>> import os, tempfile
>>> path = tempfile.mkdtemp()
@@ -591,6 +596,7 @@ class Word2Vec(object):
... pass
.. versionadded:: 1.2.0
+
"""
def __init__(self):
"""