path: root/examples/src/main/python/sql/hive.py
author: Yanbo Liang <ybliang8@gmail.com> 2017-01-09 21:38:46 -0800
committer: Yanbo Liang <ybliang8@gmail.com> 2017-01-09 21:38:46 -0800
commit: 3ef6d98a803fdff182ab4556c3273ec5fa0ff002
tree: 1d8dba974664353e4146b7d9ca801e0473d0d9f4 /examples/src/main/python/sql/hive.py
parent: faabe69cc081145f43f9c68db1a7a8c5c39684fb
[SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
## What changes were proposed in this pull request?

Copy the `GaussianMixture` implementation from mllib to ml, so that we can add new features to it.

Unlike some other algorithms, I left mllib `GaussianMixture` untouched rather than making it wrap the ml implementation, for the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting an initial model, but ml currently does not. (We will definitely add this feature to ml in the future.)

We could work around these issues and make mllib a wrapper that calls into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.

Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we only need to store the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduces the shuffled data size by about 50% and accelerates job execution.

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)

After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)

## How was this patch tested?

Existing tests and added new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15413 from yanboliang/spark-17847.
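
The storage saving described in the commit message can be illustrated with a minimal pure-Python sketch. This is not the code from the PR (which is written in Scala); the helper names `pack_upper` and `unpack_upper` and the column-by-column packing order are assumptions made for illustration. A symmetric n x n matrix has only n(n+1)/2 distinct entries, so packing the upper triangle approaches a 50% reduction as n grows.

```python
def pack_upper(matrix):
    """Return the upper triangle (diagonal included) of a square matrix,
    flattened column by column. Hypothetical helper, not a Spark API."""
    n = len(matrix)
    return [matrix[i][j] for j in range(n) for i in range(j + 1)]


def unpack_upper(packed, n):
    """Rebuild the full symmetric n x n matrix from its packed upper triangle."""
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for j in range(n):
        for i in range(j + 1):
            # Mirror each stored entry across the diagonal.
            full[i][j] = packed[k]
            full[j][i] = packed[k]
            k += 1
    return full


# A small symmetric covariance matrix used purely as example data.
cov = [
    [4.0, 1.0, 0.5],
    [1.0, 3.0, 0.2],
    [0.5, 0.2, 2.0],
]

packed = pack_upper(cov)            # 6 values instead of 9
restored = unpack_upper(packed, 3)  # identical to cov
```

For a 100-dimensional covariance matrix the packed form holds 5,050 values instead of 10,000, which is in line with the roughly 50% reduction in shuffled data reported above.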
Diffstat (limited to 'examples/src/main/python/sql/hive.py')
0 files changed, 0 insertions, 0 deletions