author | Xusen Yin <yinxusen@gmail.com> | 2015-05-20 10:41:18 +0100 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2015-05-20 10:44:06 +0100 |
commit | b3abf0b8d9bca13840eb759953d76905c2ba9b8a (patch) | |
tree | 75b9ad1e525868fb385f2398239b865350ff6089 | |
parent | 60336e3bc02a2587fdf315f9011bbe7c9d3a58c4 (diff) | |
download | spark-b3abf0b8d9bca13840eb759953d76905c2ba9b8a.tar.gz spark-b3abf0b8d9bca13840eb759953d76905c2ba9b8a.tar.bz2 spark-b3abf0b8d9bca13840eb759953d76905c2ba9b8a.zip |
[SPARK-7663] [MLLIB] Add requirement for word2vec model
JIRA issue [link](https://issues.apache.org/jira/browse/SPARK-7663).
We should check the vocabulary size of the word2vec model to prevent an unexpectedly empty vocabulary.
CC srowen.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #6228 from yinxusen/SPARK-7663 and squashes the following commits:
21770c5 [Xusen Yin] check the vocab size
54ae63e [Xusen Yin] add requirement for word2vec model
-rw-r--r-- | mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 3 |
1 files changed, 3 insertions, 0 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 731f7576c2..f65f78299d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -158,6 +158,9 @@ class Word2Vec extends Serializable with Logging {
       .sortWith((a, b) => a.cn > b.cn)
 
     vocabSize = vocab.length
+    require(vocabSize > 0, "The vocabulary size should be > 0. You may need to check " +
+      "the setting of minCount, which could be large enough to remove all your words in sentences.")
+
     var a = 0
     while (a < vocabSize) {
       vocabHash += vocab(a).word -> a
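
The effect of the new check can be illustrated with a minimal sketch outside of Spark. It assumes plain Scala collections in place of the RDD pipeline in `Word2Vec.scala`; the object `VocabCheckSketch`, the helper `buildVocab`, and the simplified two-field `VocabWord` are hypothetical names introduced only for this illustration and are not part of the patch.

```scala
// Hypothetical stand-in for the vocabulary-building step. Plain Scala
// collections replace the RDD pipeline used in Word2Vec.scala.
object VocabCheckSketch {
  // Simplified: the real VocabWord carries more fields than word and count.
  final case class VocabWord(word: String, cn: Int)

  // Count words, drop those seen fewer than minCount times,
  // and sort the rest by descending count.
  def buildVocab(sentences: Seq[Seq[String]], minCount: Int): Array[VocabWord] = {
    val vocab = sentences.flatten
      .groupBy(identity)
      .map { case (w, occurrences) => VocabWord(w, occurrences.size) }
      .filter(_.cn >= minCount)
      .toArray
      .sortWith((a, b) => a.cn > b.cn)

    // Same guard as the patch: fail fast instead of continuing with an empty vocabulary.
    require(vocab.length > 0, "The vocabulary size should be > 0. You may need to check " +
      "the setting of minCount, which could be large enough to remove all your words in sentences.")
    vocab
  }

  def main(args: Array[String]): Unit = {
    val sentences = Seq(Seq("a", "b", "a"), Seq("a", "c"))

    // With minCount = 2 only "a" (three occurrences) survives the filter.
    val vocab = buildVocab(sentences, minCount = 2)
    println(vocab.map(v => s"${v.word}:${v.cn}").mkString(", "))

    // With minCount = 10 every word is filtered out, so require throws
    // an IllegalArgumentException, captured here as a Failure.
    val result = scala.util.Try(buildVocab(sentences, minCount = 10))
    println(result.isFailure)
  }
}
```

Because `require` throws `IllegalArgumentException`, a too-large `minCount` surfaces as an immediate, descriptive error at vocabulary construction rather than as a failure later in training.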