author | William Benton <willb@redhat.com> | 2016-09-17 12:49:58 +0100 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2016-09-17 12:49:58 +0100 |
commit | 25cbbe6ca334140204e7035ab8b9d304da9b8a8a (patch) | |
tree | 7e0ec70179b52f4b39336c2fbb841a8584e83a48 /python | |
parent | f15d41be3ce7569736ccbf2ffe1bec265865f55d (diff) | |
[SPARK-17548][MLLIB] Word2VecModel.findSynonyms no longer spuriously rejects the best match when invoked with a vector
## What changes were proposed in this pull request?
This pull request changes the behavior of `Word2VecModel.findSynonyms` so that it will not spuriously reject the best match when invoked with a vector that does not correspond to a word in the model's vocabulary. Instead of blindly discarding the best match, the changed implementation discards a match that corresponds to the query word (in cases where `findSynonyms` is invoked with a word) or that has an identical angle to the query vector.
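The word-versus-vector distinction can be illustrated with a small pure-Python sketch. This is not Spark's implementation: the `find_synonyms` helper, the `VECTORS` table, and its values are hypothetical and exist only for illustration. It follows the updated doctest in `feature.py`, where a word query never returns the query word but a vector query may return the word whose representation is that vector.

```python
import math

# Toy vocabulary; the words and vectors are made up for illustration.
VECTORS = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
}


def cosine(u, v):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def find_synonyms(query, num, vectors=VECTORS):
    """Return the `num` (word, similarity) pairs most similar to `query`.

    `query` is either a word in `vectors` or a raw vector. Only a word
    query excludes a result, and it excludes only the query word itself.
    """
    if isinstance(query, str):
        exclude, vec = query, vectors[query]
    else:
        exclude, vec = None, query
    ranked = sorted(
        ((word, cosine(vec, v)) for word, v in vectors.items() if word != exclude),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return ranked[:num]


print([w for w, _ in find_synonyms("a", 2)])         # ['b', 'c']
print([w for w, _ in find_synonyms([1.0, 0.0], 2)])  # ['a', 'b']
```

Before a change of this kind, the second call would have dropped `'a'` simply because it ranked first, even though the caller never named a word to exclude.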
## How was this patch tested?
I added a test to `Word2VecSuite` to ensure that the word with the most similar vector from a supplied vector would not be spuriously rejected.
Author: William Benton <willb@redhat.com>
Closes #15105 from willb/fix/findSynonyms.
Diffstat (limited to 'python')
-rw-r--r-- | python/pyspark/mllib/feature.py | 12 |
1 files changed, 9 insertions, 3 deletions
diff --git a/python/pyspark/mllib/feature.py b/python/pyspark/mllib/feature.py
index b32d0c70ec..5d99644fca 100644
--- a/python/pyspark/mllib/feature.py
+++ b/python/pyspark/mllib/feature.py
@@ -544,8 +544,7 @@ class Word2VecModel(JavaVectorTransformer, JavaSaveable, JavaLoader):
 
 @ignore_unicode_prefix
 class Word2Vec(object):
-    """
-    Word2Vec creates vector representation of words in a text corpus.
+    """Word2Vec creates vector representation of words in a text corpus.
     The algorithm first constructs a vocabulary from the corpus
     and then learns vector representation of words in the vocabulary.
     The vector representation can be used as features in
@@ -567,13 +566,19 @@ class Word2Vec(object):
     >>> doc = sc.parallelize(localDoc).map(lambda line: line.split(" "))
     >>> model = Word2Vec().setVectorSize(10).setSeed(42).fit(doc)
 
+    Querying for synonyms of a word will not return that word:
+
     >>> syms = model.findSynonyms("a", 2)
     >>> [s[0] for s in syms]
     [u'b', u'c']
+
+    But querying for synonyms of a vector may return the word whose
+    representation is that vector:
+
     >>> vec = model.transform("a")
     >>> syms = model.findSynonyms(vec, 2)
     >>> [s[0] for s in syms]
-    [u'b', u'c']
+    [u'a', u'b']
 
     >>> import os, tempfile
     >>> path = tempfile.mkdtemp()
@@ -591,6 +596,7 @@ class Word2Vec(object):
     ...     pass
 
     .. versionadded:: 1.2.0
+
     """
     def __init__(self):
         """