Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec
jira: https://issues.apache.org/jira/browse/SPARK-11898 syn0Global and sync1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to worker using basic task serialization. Use broadcast can greatly improve the performance. My benchmark shows that, for 1M vocabulary and default vectorSize 100, changing to broadcast can help, 1. decrease the worker memory consumption by 45%. 2. decrease running time by 40%. This will also help extend the upper limit for Word2Vec. Author: Yuhao Yang <[email protected]> Closes apache#9878 from hhbyyh/w2vBC.
- Loading branch information