[SPARK-3541][MLLIB] New ALS implementation with improved storage - spark


author	Xiangrui Meng <meng@databricks.com>	2015-01-22 22:09:13 -0800
committer	Xiangrui Meng <meng@databricks.com>	2015-01-22 22:09:13 -0800
commit	ea74365b7c5a3ac29cae9ba66f140f1fa5e8d312 (patch)
tree	88fbd65face2e0ea7545d60e614083e6a93ec5ad /python/pyspark/rdd.py
parent	e0f7fb7f9f497b34d42f9ba147197cf9ffc51607 (diff)

[SPARK-3541][MLLIB] New ALS implementation with improved storage

This PR adds a new ALS implementation to `spark.ml` using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under `spark.mllib`, the new implementation

1. uses the same algorithm,
2. uses float type for ratings,
3. uses primitive arrays to avoid GC,
4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance.

The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html):

![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png)

I keep the `spark.mllib`'s ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under `spark.mllib` and then make it a wrapper of the new implementation, in a separate PR.

TODO:
- [X] Add unit tests for implicit preferences.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3720 from mengxr/SPARK-3541 and squashes the following commits:

1b9e852 [Xiangrui Meng] fix compile
5129be9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3541
dd0d0e8 [Xiangrui Meng] simplify test code
c627de3 [Xiangrui Meng] add tests for implicit feedback
b84f41c [Xiangrui Meng] address comments
a76da7b [Xiangrui Meng] update ALS tests
2a8deb3 [Xiangrui Meng] add some ALS tests
857e876 [Xiangrui Meng] add tests for rating block and encoded block
d3c1ac4 [Xiangrui Meng] rename some classes for better code readability add more doc and comments
213d163 [Xiangrui Meng] org imports
771baf3 [Xiangrui Meng] chol doc update
ca9ad9d [Xiangrui Meng] add unit tests for chol
b4fd17c [Xiangrui Meng] add unit tests for NormalEquation
d0f99d3 [Xiangrui Meng] add tests for LocalIndexEncoder
80b8e61 [Xiangrui Meng] fix imports
4937fd4 [Xiangrui Meng] update ALS example
56c253c [Xiangrui Meng] rename product to item
bce8692 [Xiangrui Meng] doc for parameters and project the output columns
3f2d81a [Xiangrui Meng] add doc
1efaecf [Xiangrui Meng] add example code
8ae86b5 [Xiangrui Meng] add a working copy of the new ALS implementation
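
Illustration of point 4 in the description above (sorted ratings solved one subproblem at a time with a single reusable normal equation). This is a minimal plain-Python sketch, not the code from this PR: the names `NormalEquation`, `cholesky_solve`, and `solve_block` are hypothetical here, the ratings are assumed to be already sorted by user, and a plain ridge term `reg` added to the diagonal is an assumption about how regularization is applied.

```python
import math


class NormalEquation:
    """Accumulates A^T A and A^T b for one least squares subproblem.

    Hypothetical helper for illustration: a single instance is zeroed and
    reused for every user instead of allocating a new one each time.
    """

    def __init__(self, k):
        self.k = k
        self.ata = [[0.0] * k for _ in range(k)]
        self.atb = [0.0] * k

    def add(self, a, b):
        # Rank-one update: A^T A += a a^T, A^T b += b * a
        for i in range(self.k):
            self.atb[i] += b * a[i]
            for j in range(self.k):
                self.ata[i][j] += a[i] * a[j]

    def reset(self):
        # Zero the buffers in place so no new arrays are allocated.
        for i in range(self.k):
            self.atb[i] = 0.0
            for j in range(self.k):
                self.ata[i][j] = 0.0


def cholesky_solve(ata, atb, reg):
    """Solves (A^T A + reg * I) x = A^T b via Cholesky factorization."""
    k = len(atb)
    m = [row[:] for row in ata]
    for i in range(k):
        m[i][i] += reg  # assumed plain ridge term
    low = [[0.0] * k for _ in range(k)]
    for i in range(k):
        for j in range(i + 1):
            s = m[i][j] - sum(low[i][p] * low[j][p] for p in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    # Forward substitution: L y = A^T b
    y = [0.0] * k
    for i in range(k):
        y[i] = (atb[i] - sum(low[i][p] * y[p] for p in range(i))) / low[i][i]
    # Backward substitution: L^T x = y
    x = [0.0] * k
    for i in reversed(range(k)):
        x[i] = (y[i] - sum(low[p][i] * x[p] for p in range(i + 1, k))) / low[i][i]
    return x


def solve_block(sorted_ratings, item_factors, rank, reg):
    """Solves one user factor per run of equal user ids.

    sorted_ratings: list of (user, item, rating) tuples sorted by user.
    item_factors: dict mapping item id to a factor vector of length rank.
    """
    ne = NormalEquation(rank)
    user_factors = {}
    current = None
    for user, item, rating in sorted_ratings:
        if current is not None and user != current:
            # The run for the previous user is complete: solve, then reuse.
            user_factors[current] = cholesky_solve(ne.ata, ne.atb, reg)
            ne.reset()
        current = user
        ne.add(item_factors[item], rating)
    if current is not None:
        user_factors[current] = cholesky_solve(ne.ata, ne.atb, reg)
    return user_factors
```

Because the ratings arrive grouped by user, each subproblem is finished before the next one starts, so one accumulator is enough for the whole block. Calling `solve_block([(0, 0, 2.0), (0, 1, 3.0), (1, 0, 4.0)], {0: [1.0, 0.0], 1: [0.0, 1.0]}, rank=2, reg=1.0)` returns one factor vector for user 0 and one for user 1.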

Diffstat (limited to 'python/pyspark/rdd.py')

0 files changed, 0 insertions, 0 deletions

