7 files changed, 164 insertions, 101 deletions
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index b8f0566d87..5c33292aaf 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -21,7 +21,8 @@ following parameters:
 * *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
 * *rank* is the number of latent factors in the model.
-* *iterations* is the number of iterations to run.
+* *iterations* is the number of iterations of ALS to run. ALS typically converges to a reasonable
+  solution in 20 iterations or less.
 * *lambda* specifies the regularization parameter in ALS.
 * *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
   *implicit feedback* data.
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
index 1250bc1a07..85d609386f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
@@ -152,7 +152,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] {
  * [[http://dx.doi.org/10.1145/335191.335372 Han et al., Mining frequent patterns without candidate
  * generation]].
  *
- * @param minSupport the minimal support level of the frequent pattern, any pattern appears
+ * @param minSupport the minimal support level of the frequent pattern, any pattern that appears
  * more than (minSupport * size-of-the-dataset) times will be output
  * @param numPartitions number of partitions used by parallel FP-growth
  *
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index ed49c9492f..94a24b527b 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -38,9 +38,9 @@ import org.apache.spark.storage.StorageLevel
  * The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
  * Efficiently by Prefix-Projected Pattern Growth ([[http://doi.org/10.1109/ICDE.2001.914830]]).
  *
- * @param minSupport the minimal support level of the sequential pattern, any pattern appears
- * more than (minSupport * size-of-the-dataset) times will be output
- * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears
+ * @param minSupport the minimal support level of the sequential pattern, any pattern that appears
+ * more than (minSupport * size-of-the-dataset) times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
  * less than maxPatternLength will be output
  * @param maxLocalProjDBSize The maximum number of items (including delimiters used in the internal
  * storage format) allowed in a projected database before local
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
index 33aaf853e5..3e619c4264 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
@@ -218,7 +218,7 @@ class ALS private (
   }

   /**
-   * Run ALS with the configured parameters on an input RDD of (user, product, rating) triples.
+   * Run ALS with the configured parameters on an input RDD of [[Rating]] objects.
    * Returns a MatrixFactorizationModel with feature vectors for each user and product.
    */
   @Since("0.8.0")
@@ -279,18 +279,17 @@ class ALS private (
 @Since("0.8.0")
 object ALS {
   /**
-   * Train a matrix factorization model given an RDD of ratings given by users to some products,
-   * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
-   * product of two lower-rank matrices of a given rank (number of features). To solve for these
-   * features, we run a given number of iterations of ALS. This is done using a level of
-   * parallelism given by `blocks`.
+   * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+   * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+   * (number of features). To solve for these features, ALS is run iteratively with a configurable
+   * level of parallelism.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
-   * @param lambda regularization factor (recommended: 0.01)
+   * @param iterations number of iterations of ALS
+   * @param lambda regularization parameter
    * @param blocks level of parallelism to split computation into
-   * @param seed random seed
+   * @param seed random seed for initial matrix factorization model
    */
   @Since("0.9.1")
   def train(
@@ -305,16 +304,15 @@ object ALS {
   }

   /**
-   * Train a matrix factorization model given an RDD of ratings given by users to some products,
-   * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
-   * product of two lower-rank matrices of a given rank (number of features). To solve for these
-   * features, we run a given number of iterations of ALS. This is done using a level of
-   * parallelism given by `blocks`.
+   * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+   * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+   * (number of features). To solve for these features, ALS is run iteratively with a configurable
+   * level of parallelism.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
-   * @param lambda regularization factor (recommended: 0.01)
+   * @param iterations number of iterations of ALS
+   * @param lambda regularization parameter
    * @param blocks level of parallelism to split computation into
    */
   @Since("0.8.0")
@@ -329,16 +327,15 @@ object ALS {
   }

   /**
-   * Train a matrix factorization model given an RDD of ratings given by users to some products,
-   * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
-   * product of two lower-rank matrices of a given rank (number of features). To solve for these
-   * features, we run a given number of iterations of ALS. The level of parallelism is determined
-   * automatically based on the number of partitions in `ratings`.
+   * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+   * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+   * (number of features). To solve for these features, ALS is run iteratively with a level of
+   * parallelism automatically based on the number of partitions in `ratings`.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
-   * @param lambda regularization factor (recommended: 0.01)
+   * @param iterations number of iterations of ALS
+   * @param lambda regularization parameter
    */
   @Since("0.8.0")
   def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double)
@@ -347,15 +344,14 @@ object ALS {
   }

   /**
-   * Train a matrix factorization model given an RDD of ratings given by users to some products,
-   * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
-   * product of two lower-rank matrices of a given rank (number of features). To solve for these
-   * features, we run a given number of iterations of ALS. The level of parallelism is determined
-   * automatically based on the number of partitions in `ratings`.
+   * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+   * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+   * (number of features). To solve for these features, ALS is run iteratively with a level of
+   * parallelism automatically based on the number of partitions in `ratings`.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
+   * @param iterations number of iterations of ALS
    */
   @Since("0.8.0")
   def train(ratings: RDD[Rating], rank: Int, iterations: Int)
@@ -372,11 +368,11 @@ object ALS {
    *
    * @param ratings RDD of (userID, productID, rating) pairs
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
-   * @param lambda regularization factor (recommended: 0.01)
+   * @param iterations number of iterations of ALS
+   * @param lambda regularization parameter
    * @param blocks level of parallelism to split computation into
    * @param alpha confidence parameter
-   * @param seed random seed
+   * @param seed random seed for initial matrix factorization model
    */
   @Since("0.8.1")
   def trainImplicit(
@@ -392,16 +388,15 @@ object ALS {
   }

   /**
-   * Train a matrix factorization model given an RDD of 'implicit preferences' given by users
-   * to some products, in the form of (userID, productID, preference) pairs. We approximate the
-   * ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
-   * To solve for these features, we run a given number of iterations of ALS. This is done using
-   * a level of parallelism given by `blocks`.
+   * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
+   * subset of products. The ratings matrix is approximated as the product of two lower-rank
+   * matrices of a given rank (number of features). To solve for these features, ALS is run
+   * iteratively with a configurable level of parallelism.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
-   * @param lambda regularization factor (recommended: 0.01)
+   * @param iterations number of iterations of ALS
+   * @param lambda regularization parameter
    * @param blocks level of parallelism to split computation into
    * @param alpha confidence parameter
    */
@@ -418,16 +413,16 @@ object ALS {
   }

   /**
-   * Train a matrix factorization model given an RDD of 'implicit preferences' given by users to
-   * some products, in the form of (userID, productID, preference) pairs. We approximate the
-   * ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
-   * To solve for these features, we run a given number of iterations of ALS. The level of
-   * parallelism is determined automatically based on the number of partitions in `ratings`.
+   * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
+   * subset of products. The ratings matrix is approximated as the product of two lower-rank
+   * matrices of a given rank (number of features). To solve for these features, ALS is run
+   * iteratively with a level of parallelism determined automatically based on the number of
+   * partitions in `ratings`.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
-   * @param lambda regularization factor (recommended: 0.01)
+   * @param iterations number of iterations of ALS
+   * @param lambda regularization parameter
    * @param alpha confidence parameter
    */
   @Since("0.8.1")
@@ -437,16 +432,15 @@ object ALS {
   }

   /**
-   * Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by
-   * users to some products, in the form of (userID, productID, rating) pairs. We approximate the
-   * ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
-   * To solve for these features, we run a given number of iterations of ALS. The level of
-   * parallelism is determined automatically based on the number of partitions in `ratings`.
-   * Model parameters `alpha` and `lambda` are set to reasonable default values
+   * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
+   * subset of products. The ratings matrix is approximated as the product of two lower-rank
+   * matrices of a given rank (number of features). To solve for these features, ALS is run
+   * iteratively with a level of parallelism determined automatically based on the number of
+   * partitions in `ratings`.
    *
-   * @param ratings RDD of (userID, productID, rating) pairs
+   * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
    * @param rank number of features to use
-   * @param iterations number of iterations of ALS (recommended: 10-20)
+   * @param iterations number of iterations of ALS
    */
   @Since("0.8.1")
   def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int)
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
index 0dc40483dd..628cf1dd57 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
@@ -206,7 +206,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
   }

   /**
-   * Recommends topK products for all users.
+   * Recommends top products for all users.
    *
    * @param num how many products to return for every user.
    * @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
@@ -224,7 +224,7 @@ class MatrixFactorizationModel @Since("0.8.0") (


   /**
-   * Recommends topK users for all products.
+   * Recommends top users for all products.
    *
    * @param num how many users to return for every product.
    * @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array
diff --git a/python/pyspark/mllib/fpm.py b/python/pyspark/mllib/fpm.py
index 2039decc0c..7a2d77a4da 100644
--- a/python/pyspark/mllib/fpm.py
+++ b/python/pyspark/mllib/fpm.py
@@ -29,7 +29,6 @@ __all__ = ['FPGrowth', 'FPGrowthModel', 'PrefixSpan', 'PrefixSpanModel']
 @inherit_doc
 @ignore_unicode_prefix
 class FPGrowthModel(JavaModelWrapper):
-
     """
     .. note:: Experimental

@@ -68,11 +67,15 @@ class FPGrowth(object):
         """
         Computes an FP-Growth model that contains frequent itemsets.

-        :param data: The input data set, each element contains a
-          transaction.
-        :param minSupport: The minimal support level (default: `0.3`).
-        :param numPartitions: The number of partitions used by
-          parallel FP-growth (default: same as input data).
+        :param data:
+          The input data set, each element contains a transaction.
+        :param minSupport:
+          The minimal support level.
+          (default: 0.3)
+        :param numPartitions:
+          The number of partitions used by parallel FP-growth. A value
+          of -1 will use the same number as input data.
+          (default: -1)
         """
         model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), int(numPartitions))
         return FPGrowthModel(model)
@@ -128,17 +131,27 @@ class PrefixSpan(object):
     @since("1.6.0")
     def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000):
         """
-        Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
-
-        :param data: The input data set, each element contains a sequnce of itemsets.
-        :param minSupport: the minimal support level of the sequential pattern, any pattern appears
-          more than (minSupport * size-of-the-dataset) times will be output (default: `0.1`)
-        :param maxPatternLength: the maximal length of the sequential pattern, any pattern appears
-          less than maxPatternLength will be output. (default: `10`)
-        :param maxLocalProjDBSize: The maximum number of items (including delimiters used in
-          the internal storage format) allowed in a projected database before local
-          processing. If a projected database exceeds this size, another
-          iteration of distributed prefix growth is run. (default: `32000000`)
+        Finds the complete set of frequent sequential patterns in the
+        input sequences of itemsets.
+
+        :param data:
+          The input data set, each element contains a sequence of
+          itemsets.
+        :param minSupport:
+          The minimal support level of the sequential pattern, any
+          pattern that appears more than (minSupport *
+          size-of-the-dataset) times will be output.
+          (default: 0.1)
+        :param maxPatternLength:
+          The maximal length of the sequential pattern, any pattern
+          that appears less than maxPatternLength will be output.
+          (default: 10)
+        :param maxLocalProjDBSize:
+          The maximum number of items (including delimiters used in the
+          internal storage format) allowed in a projected database before
+          local processing. If a projected database exceeds this size,
+          another iteration of distributed prefix growth is run.
+          (default: 32000000)
         """
         model = callMLlibFunc("trainPrefixSpanModel",
                               data, minSupport, maxPatternLength, maxLocalProjDBSize)
diff --git a/python/pyspark/mllib/recommendation.py b/python/pyspark/mllib/recommendation.py
index 93e47a797f..7e60255d43 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -138,7 +138,8 @@ class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
     @since("0.9.0")
     def predictAll(self, user_product):
         """
-        Returns a list of predicted ratings for input user and product pairs.
+        Returns a list of predicted ratings for input user and product
+        pairs.
         """
         assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)"
         first = user_product.first()
@@ -165,28 +166,33 @@ class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
     @since("1.4.0")
     def recommendUsers(self, product, num):
         """
-        Recommends the top "num" number of users for a given product and returns a list
-        of Rating objects sorted by the predicted rating in descending order.
+        Recommends the top "num" number of users for a given product and
+        returns a list of Rating objects sorted by the predicted rating in
+        descending order.
         """
         return list(self.call("recommendUsers", product, num))

     @since("1.4.0")
     def recommendProducts(self, user, num):
         """
-        Recommends the top "num" number of products for a given user and returns a list
-        of Rating objects sorted by the predicted rating in descending order.
+        Recommends the top "num" number of products for a given user and
+        returns a list of Rating objects sorted by the predicted rating in
+        descending order.
         """
         return list(self.call("recommendProducts", user, num))

     def recommendProductsForUsers(self, num):
         """
-        Recommends top "num" products for all users. The number returned may be less than this.
+        Recommends the top "num" number of products for all users. The
+        number of recommendations returned per user may be less than "num".
         """
         return self.call("wrappedRecommendProductsForUsers", num)

     def recommendUsersForProducts(self, num):
         """
-        Recommends top "num" users for all products. The number returned may be less than this.
+        Recommends the top "num" number of users for all products. The
+        number of recommendations returned per product may be less than
+        "num".
         """
         return self.call("wrappedRecommendUsersForProducts", num)

@@ -234,11 +240,34 @@ class ALS(object):
     def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False,
               seed=None):
         """
-        Train a matrix factorization model given an RDD of ratings given by users to some products,
-        in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
-        product of two lower-rank matrices of a given rank (number of features). To solve for these
-        features, we run a given number of iterations of ALS. This is done using a level of
-        parallelism given by `blocks`.
+        Train a matrix factorization model given an RDD of ratings by users
+        for a subset of products. The ratings matrix is approximated as the
+        product of two lower-rank matrices of a given rank (number of
+        features). To solve for these features, ALS is run iteratively with
+        a configurable level of parallelism.
+
+        :param ratings:
+          RDD of `Rating` or (userID, productID, rating) tuple.
+        :param rank:
+          Rank of the feature matrices computed (number of features).
+        :param iterations:
+          Number of iterations of ALS.
+          (default: 5)
+        :param lambda_:
+          Regularization parameter.
+          (default: 0.01)
+        :param blocks:
+          Number of blocks used to parallelize the computation. A value
+          of -1 will use an auto-configured number of blocks.
+          (default: -1)
+        :param nonnegative:
+          A value of True will solve least-squares with nonnegativity
+          constraints.
+          (default: False)
+        :param seed:
+          Random seed for initial matrix factorization model. A value
+          of None will use system time as the seed.
+          (default: None)
         """
         model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
                               lambda_, blocks, nonnegative, seed)
@@ -249,11 +278,37 @@ class ALS(object):
     def trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01,
                       nonnegative=False, seed=None):
         """
-        Train a matrix factorization model given an RDD of 'implicit preferences' given by users
-        to some products, in the form of (userID, productID, preference) pairs. We approximate the
-        ratings matrix as the product of two lower-rank matrices of a given rank (number of
-        features). To solve for these features, we run a given number of iterations of ALS.
-        This is done using a level of parallelism given by `blocks`.
+        Train a matrix factorization model given an RDD of 'implicit
+        preferences' of users for a subset of products. The ratings matrix
+        is approximated as the product of two lower-rank matrices of a
+        given rank (number of features). To solve for these features, ALS
+        is run iteratively with a configurable level of parallelism.
+
+        :param ratings:
+          RDD of `Rating` or (userID, productID, rating) tuple.
+        :param rank:
+          Rank of the feature matrices computed (number of features).
+        :param iterations:
+          Number of iterations of ALS.
+          (default: 5)
+        :param lambda_:
+          Regularization parameter.
+          (default: 0.01)
+        :param blocks:
+          Number of blocks used to parallelize the computation. A value
+          of -1 will use an auto-configured number of blocks.
+          (default: -1)
+        :param alpha:
+          A constant used in computing confidence.
+          (default: 0.01)
+        :param nonnegative:
+          A value of True will solve least-squares with nonnegativity
+          constraints.
+          (default: False)
+        :param seed:
+          Random seed for initial matrix factorization model. A value
+          of None will use system time as the seed.
+          (default: None)
         """
         model = callMLlibFunc("trainImplicitALSModel", cls._prepare(ratings), rank,
                               iterations, lambda_, blocks, alpha, nonnegative, seed)
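The ALS docstrings in this change describe approximating the ratings matrix as the product of two lower-rank matrices, solved by running ALS for a number of iterations with a regularization parameter. As a rough illustration of that idea only, here is a minimal pure-Python sketch restricted to rank 1. It is not Spark's implementation and does not use PySpark; the names `als_rank1` and `squared_error` are hypothetical, and the parameter names `iterations`, `lambda_`, and `seed` are borrowed from the docstrings purely for readability.

```python
import random


def als_rank1(ratings, iterations=5, lambda_=0.01, seed=None):
    """Toy rank-1 ALS over (user, product, rating) tuples.

    Hypothetical helper for illustration; not part of Spark or PySpark.
    """
    rng = random.Random(seed)
    users = sorted({u for u, _, _ in ratings})
    products = sorted({p for _, p, _ in ratings})
    # Random positive starting factors, standing in for the seeded initial model.
    x = {u: rng.random() + 0.1 for u in users}
    y = {p: rng.random() + 0.1 for p in products}

    def solve(fixed, target_idx, fixed_idx):
        # With one side held fixed, each factor on the other side has the
        # closed-form ridge solution sum(r * f) / (lambda + sum(f * f)),
        # summed over the observed ratings that involve it.
        num, den = {}, {}
        for row in ratings:
            t, f, r = row[target_idx], fixed[row[fixed_idx]], row[2]
            num[t] = num.get(t, 0.0) + r * f
            den[t] = den.get(t, 0.0) + f * f
        return {t: num[t] / (lambda_ + den[t]) for t in num}

    for _ in range(iterations):
        x = solve(y, 0, 1)  # update user factors with product factors fixed
        y = solve(x, 1, 0)  # update product factors with user factors fixed
    return x, y


def squared_error(ratings, x, y):
    """Sum of squared differences between observed and predicted ratings."""
    return sum((r - x[u] * y[p]) ** 2 for u, p, r in ratings)


if __name__ == "__main__":
    data = [(1, 10, 1.0), (1, 20, 3.0), (2, 10, 2.0), (2, 20, 6.0)]
    user_f, product_f = als_rank1(data, iterations=5, lambda_=0.01, seed=42)
    print(squared_error(data, user_f, product_f))
```

Each pass alternates between the two sides, which is what "ALS is run iteratively" refers to. A larger `lambda_` shrinks the factors harder, so the fit on the observed ratings becomes looser.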