diff options
author | Burak Yavuz <brkyvz@gmail.com> | 2015-10-16 15:30:07 -0700 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2015-10-16 15:30:07 -0700 |
commit | 10046ea76cf8f0d08fe7ef548e4dbec69d9c73b8 (patch) | |
tree | 55a35672cffbfab9e941f010641e34689b616bf6 /python/pyspark/mllib/linalg/__init__.py | |
parent | 1ec0a0dc2819d3db3555799cb78c2946f652bff4 (diff) | |
download | spark-10046ea76cf8f0d08fe7ef548e4dbec69d9c73b8.tar.gz spark-10046ea76cf8f0d08fe7ef548e4dbec69d9c73b8.tar.bz2 spark-10046ea76cf8f0d08fe7ef548e4dbec69d9c73b8.zip |
[SPARK-10599] [MLLIB] Lower communication for block matrix multiplication
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
- Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
- Send the block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
**NOTE**: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was.
Initial benchmarking showed promising results (look below), however I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8757 from brkyvz/opt-bmm.
Diffstat (limited to 'python/pyspark/mllib/linalg/__init__.py')
0 files changed, 0 insertions, 0 deletions