author Xusen Yin <yinxusen@gmail.com> 2015-05-20 10:41:18 +0100
committer Sean Owen <sowen@cloudera.com> 2015-05-20 10:44:06 +0100
commit b3abf0b8d9bca13840eb759953d76905c2ba9b8a
tree 75b9ad1e525868fb385f2398239b865350ff6089 /mllib
parent 60336e3bc02a2587fdf315f9011bbe7c9d3a58c4
[SPARK-7663] [MLLIB] Add requirement for word2vec model
JIRA issue [link](https://issues.apache.org/jira/browse/SPARK-7663).

We should check the vocabulary size of the word2vec model to prevent it from being unexpectedly empty.

CC srowen.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #6228 from yinxusen/SPARK-7663 and squashes the following commits:

21770c5 [Xusen Yin] check the vocab size
54ae63e [Xusen Yin] add requirement for word2vec model
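The effect of the check can be sketched outside Spark. The snippet below is a minimal illustration, not Spark's implementation: `VocabCheckSketch` and `buildVocab` are hypothetical names, and the vocabulary is a plain local collection rather than an RDD. It assumes only that words are counted, filtered by `minCount`, and sorted by descending count, then guarded with the same `require` message this commit adds.

```scala
// Hypothetical stand-in for the vocabulary-building step; not Spark's code.
object VocabCheckSketch {
  // Counts words, drops those seen fewer than minCount times,
  // and sorts the survivors by descending count.
  def buildVocab(sentences: Seq[Seq[String]], minCount: Int): Array[(String, Int)] = {
    val vocab = sentences.flatten
      .groupBy(identity)
      .map { case (word, occurrences) => (word, occurrences.size) }
      .filter { case (_, count) => count >= minCount }
      .toArray
      .sortWith((a, b) => a._2 > b._2)

    // Same guard as in the commit: fail fast when the vocabulary is empty.
    require(vocab.length > 0, "The vocabulary size should be > 0. You may need to check " +
      "the setting of minCount, which could be large enough to remove all your words in sentences.")
    vocab
  }

  def main(args: Array[String]): Unit = {
    val sentences = Seq(Seq("a", "b", "a"), Seq("a", "c"))
    // With minCount = 2 at least one word survives the filter.
    println(buildVocab(sentences, minCount = 2).mkString(", "))
    // With minCount = 5 every word is filtered out, so require throws
    // IllegalArgumentException; Try captures it as a Failure.
    println(scala.util.Try(buildVocab(sentences, minCount = 5)).isFailure)
  }
}
```

Because `require` throws `IllegalArgumentException`, a caller sees the `minCount` hint immediately instead of an empty model surfacing later.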
Diffstat (limited to 'mllib')
-rw-r--r-- mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 3
1 files changed, 3 insertions, 0 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 731f7576c2..f65f78299d 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -158,6 +158,9 @@ class Word2Vec extends Serializable with Logging {
       .sortWith((a, b) => a.cn > b.cn)
     vocabSize = vocab.length
+    require(vocabSize > 0, "The vocabulary size should be > 0. You may need to check " +
+      "the setting of minCount, which could be large enough to remove all your words in sentences.")
+
     var a = 0
     while (a < vocabSize) {
       vocabHash += vocab(a).word -> a