author Yuhao Yang <hhbyyh@gmail.com> 2015-12-05 15:27:31 +0000
committer Sean Owen <sowen@cloudera.com> 2015-12-05 15:27:31 +0000
commit ee94b70ce56661ea26c5aad17778ade32f3f1d3d
tree 95f1d75df182253e4e418a8e598d1ff277b0fc59 /mllib
parent 3af53e61fd604fe8000e1fdf656d60b79c842d1c
[SPARK-12096][MLLIB] remove the old constraint in word2vec
jira: https://issues.apache.org/jira/browse/SPARK-12096

word2vec can now handle a much bigger vocabulary. The old constraint vocabSize.toLong * vectorSize < Int.MaxValue / 8 should be removed. The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue).

I tested with a vocabSize over 18M and vectorSize = 100.

srowen jkbradley Sorry to miss this in the last PR. I was reminded today.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #10103 from hhbyyh/w2vCapacity.
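To make the arithmetic behind the two constraints concrete, here is a minimal standalone sketch. The object and method names are hypothetical and not part of Spark; only the two comparisons are taken from the diff. The vocabulary size of exactly 18,000,000 is an illustrative stand-in for the "over 18M" mentioned above, and 30,000,000 is an assumed value chosen only to show what happens past the limit.

```scala
// Hypothetical helper object for illustration; not part of Word2Vec.scala.
object Word2VecCapacity {
  // Old check from the removed line: rejects once vocabSize * vectorSize * 8 reaches Int.MaxValue.
  def rejectedByOldCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize * 8 >= Int.MaxValue

  // New check from the added line: rejects once vocabSize * vectorSize reaches Int.MaxValue.
  def rejectedByNewCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize >= Int.MaxValue

  // Same product without .toLong: computed in 32-bit Int, so it can wrap around.
  def naiveIntProduct(vocabSize: Int, vectorSize: Int): Int =
    vocabSize * vectorSize

  def main(args: Array[String]): Unit = {
    // 18,000,000 * 100 = 1,800,000,000, which is below Int.MaxValue (2,147,483,647).
    println(rejectedByOldCheck(18000000, 100)) // true: 14,400,000,000 >= Int.MaxValue
    println(rejectedByNewCheck(18000000, 100)) // false: 1,800,000,000 < Int.MaxValue

    // 30,000,000 * 100 = 3,000,000,000 exceeds Int.MaxValue.
    println(rejectedByNewCheck(30000000, 100)) // true
    // The Int-only product wraps to a negative number, so a check
    // without .toLong would wrongly let this configuration through.
    println(naiveIntProduct(30000000, 100) < 0) // true
  }
}
```

The widening to Long before multiplying is what keeps the comparison meaningful: without it, a product above Int.MaxValue wraps around and can compare as smaller than the limit.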
Diffstat (limited to 'mllib')
-rw-r--r-- mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 4
1 files changed, 2 insertions, 2 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 655ac0bb55..be12d45286 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -306,10 +306,10 @@ class Word2Vec extends Serializable with Logging {
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)

-    if (vocabSize.toLong * vectorSize * 8 >= Int.MaxValue) {
+    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
       throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
         " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
-        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue/8`.")
+        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue`.")
     }

     val syn0Global =