[SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec - spark


author	Yuhao Yang <hhbyyh@gmail.com>	2015-12-01 09:26:58 +0000
committer	Sean Owen <sowen@cloudera.com>	2015-12-01 09:26:58 +0000
commit	a0af0e351e45a8be47a6f65efd132eaa4a00c9e4 (patch)
tree	4627576439dac39019d2945108baa992d10b4d33 /CONTRIBUTING.md
parent	9693b0d5a55bc1d2da96f04fe2c6de59a8dfcc1b (diff)
download	spark-a0af0e351e45a8be47a6f65efd132eaa4a00c9e4.tar.gz spark-a0af0e351e45a8be47a6f65efd132eaa4a00c9e4.tar.bz2 spark-a0af0e351e45a8be47a6f65efd132eaa4a00c9e4.zip

[SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec

jira: https://issues.apache.org/jira/browse/SPARK-11898

syn0Global and syn1Global in Word2Vec are quite large objects, with size (vocab * vectorSize * 8), yet they are passed to the workers using basic task serialization.

Using broadcast can greatly improve performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast can:

1. decrease worker memory consumption by 45%.
2. decrease running time by 40%.

This will also help extend the upper limit for Word2Vec.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #9878 from hhbyyh/w2vBC.
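
To put the size formula in the message in concrete terms, here is a rough back-of-the-envelope sketch in Python. It takes the 8 bytes per element, the 1M vocabulary and the vectorSize of 100 from the commit message. The number of concurrent tasks per executor is a hypothetical value chosen for illustration, and the function names are made up for this sketch. The model assumes that each running task holds its own deserialized copy of the tables when they travel with the task, while tasks on one executor share a single copy of a broadcast value. It ignores everything else a worker keeps in memory, so it is not meant to reproduce the measured 45% figure.

```python
def table_bytes(vocab_size: int, vector_size: int, bytes_per_element: int = 8) -> int:
    """Size of one global table, using the formula from the commit message."""
    return vocab_size * vector_size * bytes_per_element


def resident_bytes(vocab_size: int, vector_size: int, copies: int, tables: int = 2) -> int:
    """Memory held on one executor when `copies` copies of both tables are resident."""
    return tables * table_bytes(vocab_size, vector_size) * copies


VOCAB = 1_000_000        # vocabulary size from the commit's benchmark
VECTOR_SIZE = 100        # default vectorSize cited in the commit
CONCURRENT_TASKS = 4     # hypothetical: tasks running at once on one executor

one_table = table_bytes(VOCAB, VECTOR_SIZE)

# Assumption: with task serialization, every running task has its own copy.
per_task_copies = resident_bytes(VOCAB, VECTOR_SIZE, copies=CONCURRENT_TASKS)

# Assumption: with a broadcast, the executor keeps one shared copy.
shared_copy = resident_bytes(VOCAB, VECTOR_SIZE, copies=1)

print(f"one table: {one_table / 1e6:.0f} MB")
print(f"both tables, one copy per task: {per_task_copies / 1e9:.1f} GB")
print(f"both tables, one shared broadcast copy: {shared_copy / 1e9:.1f} GB")
```

Under these assumptions the gap grows linearly with the number of tasks that run concurrently on an executor, which is consistent with the message's point that broadcasting raises the practical upper limit on vocabulary size.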

Diffstat (limited to 'CONTRIBUTING.md')

0 files changed, 0 insertions, 0 deletions

