author | Yuhao Yang <hhbyyh@gmail.com> | 2015-12-05 15:27:31 +0000 |
---|---|---|
committer | Sean Owen <sowen@cloudera.com> | 2015-12-05 15:27:31 +0000 |
commit | ee94b70ce56661ea26c5aad17778ade32f3f1d3d (patch) | |
tree | 95f1d75df182253e4e418a8e598d1ff277b0fc59 /mllib/src/main | |
parent | 3af53e61fd604fe8000e1fdf656d60b79c842d1c (diff) | |
[SPARK-12096][MLLIB] remove the old constraint in word2vec
jira: https://issues.apache.org/jira/browse/SPARK-12096
word2vec can now handle a much bigger vocabulary.
The old constraint, vocabSize.toLong * vectorSize < Int.MaxValue / 8, should be removed.
The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue).
I tested with vocabSize over 18M and vectorSize = 100.
srowen jkbradley Sorry for missing this in the last PR. I was reminded of it today.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #10103 from hhbyyh/w2vCapacity.
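The effect of relaxing the constraint can be sketched with the two guard predicates side by side. This is a minimal illustration, not code from the patch: the object and method names (`Word2VecCapacityCheck`, `rejectedByOldCheck`, `rejectedByNewCheck`) are hypothetical, and the value 18,000,000 stands in for the "over 18M" vocabulary mentioned above. Only the two comparison expressions mirror the diff.

```scala
// Hypothetical helper object; only the two comparisons are taken from the diff.
object Word2VecCapacityCheck {
  // Old guard: fails once vocabSize * vectorSize * 8 reaches Int.MaxValue.
  def rejectedByOldCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize * 8 >= Int.MaxValue

  // New guard: fails once vocabSize * vectorSize itself reaches Int.MaxValue.
  def rejectedByNewCheck(vocabSize: Int, vectorSize: Int): Boolean =
    vocabSize.toLong * vectorSize >= Int.MaxValue

  def main(args: Array[String]): Unit = {
    val vocabSize = 18000000 // illustrative stand-in for "over 18M"
    val vectorSize = 100

    // 18,000,000 * 100 * 8 = 14,400,000,000, which exceeds Int.MaxValue.
    println(s"old check rejects: ${rejectedByOldCheck(vocabSize, vectorSize)}")
    // 18,000,000 * 100 = 1,800,000,000, which is below Int.MaxValue (2,147,483,647).
    println(s"new check rejects: ${rejectedByNewCheck(vocabSize, vectorSize)}")
  }
}
```

With vectorSize = 100, the old guard capped the vocabulary at roughly 2.68M words, while the new guard allows roughly 21.47M.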
Diffstat (limited to 'mllib/src/main')
-rw-r--r-- | mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala | 4 |
1 files changed, 2 insertions, 2 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index 655ac0bb55..be12d45286 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -306,10 +306,10 @@ class Word2Vec extends Serializable with Logging {
     val newSentences = sentences.repartition(numPartitions).cache()
     val initRandom = new XORShiftRandom(seed)
 
-    if (vocabSize.toLong * vectorSize * 8 >= Int.MaxValue) {
+    if (vocabSize.toLong * vectorSize >= Int.MaxValue) {
       throw new RuntimeException("Please increase minCount or decrease vectorSize in Word2Vec" +
         " to avoid an OOM. You are highly recommended to make your vocabSize*vectorSize, " +
-        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue/8`.")
+        "which is " + vocabSize + "*" + vectorSize + " for now, less than `Int.MaxValue`.")
     }
 
     val syn0Global =
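The guard in the diff keeps the `.toLong` conversion on `vocabSize`, and a small sketch may help show why it matters. The names below (`ProductOverflowDemo`, `productAsInt`, `productAsLong`) are hypothetical and the input values are made up for illustration. The sketch assumes nothing about Word2Vec beyond the comparison against `Int.MaxValue`.

```scala
// Hypothetical demo object; not part of the patch.
object ProductOverflowDemo {
  // Plain Int multiplication wraps around silently on overflow.
  def productAsInt(vocabSize: Int, vectorSize: Int): Int =
    vocabSize * vectorSize

  // Widening one operand to Long first keeps the true product.
  def productAsLong(vocabSize: Int, vectorSize: Int): Long =
    vocabSize.toLong * vectorSize

  def main(args: Array[String]): Unit = {
    val vocabSize = 30000000 // illustrative value, too large for the new guard
    val vectorSize = 100

    // The Int product wraps to a negative number, so a comparison
    // against Int.MaxValue would wrongly let this size through.
    println(productAsInt(vocabSize, vectorSize) >= Int.MaxValue)
    // The Long product is 3,000,000,000, so the guard fires as intended.
    println(productAsLong(vocabSize, vectorSize) >= Int.MaxValue)
  }
}
```

Doing the multiplication in `Long` is what lets the check detect sizes whose product no longer fits in an `Int`.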