path: root/project/MimaExcludes.scala
author Liang-Chi Hsieh <viirya@gmail.com> 2015-07-02 10:18:23 -0700
committer Xiangrui Meng <meng@databricks.com> 2015-07-02 10:18:23 -0700
commit 0e553a3e9360a736920e2214d634373fef0dbcf7 (patch)
tree 317ac5efd679292c4aaadb91b06b03c3e122f229 /project/MimaExcludes.scala
parent 52302a803967114b29a8bf6b74459477364c5b88 (diff)
[SPARK-8708] [MLLIB] Partition ALS ratings based on both users and products
JIRA: https://issues.apache.org/jira/browse/SPARK-8708

Previously, the ratings were partitioned only by product. So if the `usersProducts` given for prediction contains only a few products, or even a single product, the generated ratings are pushed into a few partitions or a single partition and cannot take advantage of high parallelism.

The following code is the example reported in the JIRA. Because it asks for predictions for all users on product 2 only, just one of the five partitions in the result is non-empty.

    >>> from operator import itemgetter
    >>> from pyspark.mllib.recommendation import ALS
    >>> r1 = (1, 1, 1.0)
    >>> r2 = (1, 2, 2.0)
    >>> r3 = (2, 1, 2.0)
    >>> r4 = (2, 2, 2.0)
    >>> r5 = (3, 1, 1.0)
    >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5)
    >>> users = ratings.map(itemgetter(0)).distinct()
    >>> model = ALS.trainImplicit(ratings, 1, seed=10)
    >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2)))
    >>> predictions_for_2.glom().map(len).collect()
    [0, 0, 3, 0, 0]

This PR partitions the ratings by both user and product instead of by product alone.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7121 from viirya/mfm_fix_partition and squashes the following commits:

779946d [Liang-Chi Hsieh] Calculate approximate numbers of users and products in one pass.
4336dc2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into mfm_fix_partition
83e56c1 [Liang-Chi Hsieh] Instead of additional join, use the numbers of users and products to decide how to perform join.
b534dc8 [Liang-Chi Hsieh] Partition ratings based on both users and products.
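The skew described above can be reproduced without Spark. The following is a minimal sketch in plain Python, not the code from this PR: `partition_sizes`, `product_key`, and `user_product_key` are hypothetical names, and the combined key `user * 31 + product` is an arbitrary choice made only to show that including the user in the key spreads the rows out.

```python
from collections import Counter


def partition_sizes(pairs, num_partitions, key_fn):
    """Count how many (user, product) pairs fall into each partition
    when the partition index is key_fn(pair) modulo num_partitions."""
    counts = Counter(key_fn(pair) % num_partitions for pair in pairs)
    return [counts.get(i, 0) for i in range(num_partitions)]


def product_key(pair):
    # Partition on the product ID alone (the behavior described as "previously").
    return pair[1]


def user_product_key(pair):
    # Hypothetical combined key; any function mixing both IDs would do.
    user, product = pair
    return user * 31 + product


# Three users, all asking for a prediction on product 2.
pairs = [(user, 2) for user in (1, 2, 3)]

by_product = partition_sizes(pairs, 5, product_key)
by_both = partition_sizes(pairs, 5, user_product_key)

print(by_product)  # [0, 0, 3, 0, 0]
print(by_both)     # [1, 0, 0, 1, 1]
```

With the product-only key, every row maps to partition 2, matching the `[0, 0, 3, 0, 0]` result from the JIRA example. With the combined key, the same three rows occupy three different partitions.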
Diffstat (limited to 'project/MimaExcludes.scala')
0 files changed, 0 insertions, 0 deletions