[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec

jira: https://issues.apache.org/jira/browse/SPARK-11813

I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits:
1. Better performance, since less data is serialized.
2. A much larger capacity for Word2Vec.

Currently, in the fit of Word2Vec, the closure mainly includes the serialized Word2Vec instance and 2 global tables.
The main part of Word2Vec is the vocab, of size: vocab * 40 * 2 * 4 = 320 * vocab bytes.
The 2 global tables: vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160 * vocab bytes.

Their sum cannot exceed Int.MaxValue due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab decreases the size of the serialized closure, especially when vectorSize is small, and thus allows a larger vocabulary.

There is another possible fix: make local copies of the fields to avoid including Word2Vec in the closure. Let me know if that's preferred.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9803 from hhbyyh/w2vVocab.
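
The size estimate in the message can be worked through with a small sketch. The per-word costs (320 bytes for vocab, vectorSize * 8 bytes for the two tables) come from the message; the object and method names are hypothetical, and the model ignores any other closure contents and serialization overhead, so the results are rough upper bounds only.

```scala
object ClosureSizeEstimate {
  // Per-word cost of the serialized vocab, as given in the commit message.
  val vocabBytesPerWord: Long = 40L * 2 * 4 // 320 bytes

  // Per-word cost of the 2 global tables, as given in the commit message.
  def tableBytesPerWord(vectorSize: Int): Long = vectorSize.toLong * 8

  // Largest vocabulary whose estimated closure still fits in
  // Int.MaxValue bytes (the ByteArrayOutputStream limit).
  def maxVocab(vectorSize: Int, vocabSerialized: Boolean): Long = {
    val vocabCost = if (vocabSerialized) vocabBytesPerWord else 0L
    Int.MaxValue.toLong / (tableBytesPerWord(vectorSize) + vocabCost)
  }

  def main(args: Array[String]): Unit = {
    // vectorSize = 20, as in the example from the message.
    println(s"vocab serialized:     ${maxVocab(20, vocabSerialized = true)}")
    println(s"vocab not serialized: ${maxVocab(20, vocabSerialized = false)}")
  }
}
```

Under these assumptions, with vectorSize = 20 the limit moves from roughly 4.5 million words (480 bytes per word) to roughly 13.4 million words (160 bytes per word).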
author: Yuhao Yang <hhbyyh@gmail.com> 2015-11-18 13:25:15 -0800
committer: Xiangrui Meng <meng@databricks.com> 2015-11-18 13:25:15 -0800
commit: e391abdf2cb6098a35347bd123b815ee9ac5b689 (patch)
tree: 4f08ede312f02f383e8484231de51c7894ab83cb /mllib
parent: 2acdf10b1f3bb1242dba64efa798c672fde9f0d2 (diff)
1 files changed, 2 insertions, 2 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
index f3e4d346e3..7ab0d89d23 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala
@@ -145,8 +145,8 @@ class Word2Vec extends Serializable with Logging {
 
   private var trainWordsCount = 0
   private var vocabSize = 0
-  private var vocab: Array[VocabWord] = null
-  private var vocabHash = mutable.HashMap.empty[String, Int]
+  @transient private var vocab: Array[VocabWord] = null
+  @transient private var vocabHash = mutable.HashMap.empty[String, Int]
 
   private def learnVocab(words: RDD[String]): Unit = {
     vocab = words.map(w => (w, 1))
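
The effect of marking a field `@transient` can be seen in isolation with plain Java serialization. This is a minimal sketch, not the Spark code: `WithVocab` and `WithTransientVocab` are hypothetical stand-ins for a serializable class that holds a large array, and `serializedSize` is a helper introduced only for illustration.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Hypothetical stand-in: the large array is written with the instance.
class WithVocab(n: Int) extends Serializable {
  private var vocab: Array[Int] = new Array[Int](n)
  private var vectorSize = 100
}

// Same layout, but the large array is skipped during serialization.
class WithTransientVocab(n: Int) extends Serializable {
  @transient private var vocab: Array[Int] = new Array[Int](n)
  private var vectorSize = 100
}

object TransientDemo {
  // Serializes obj with Java serialization and returns the byte count.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size()
  }

  def main(args: Array[String]): Unit = {
    println(s"plain field:     ${serializedSize(new WithVocab(100000))} bytes")
    println(s"transient field: ${serializedSize(new WithTransientVocab(100000))} bytes")
  }
}
```

A transient field is null after deserialization, so this approach is only safe when the code that runs on the deserialized copy never reads that field.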