-rw-r--r--  docs/mllib-collaborative-filtering.md  3
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala  2
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala  6
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala  114
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala  4
-rw-r--r--  python/pyspark/mllib/fpm.py  47
-rw-r--r--  python/pyspark/mllib/recommendation.py  89
7 files changed, 164 insertions, 101 deletions
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index b8f0566d87..5c33292aaf 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -21,7 +21,8 @@ following parameters:
* *numBlocks* is the number of blocks used to parallelize computation (set to -1 to auto-configure).
* *rank* is the number of latent factors in the model.
-* *iterations* is the number of iterations to run.
+* *iterations* is the number of iterations of ALS to run. ALS typically converges to a reasonable
+ solution in 20 iterations or less.
* *lambda* specifies the regularization parameter in ALS.
* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
*implicit feedback* data.
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
index 1250bc1a07..85d609386f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/FPGrowth.scala
@@ -152,7 +152,7 @@ object FPGrowthModel extends Loader[FPGrowthModel[_]] {
* [[http://dx.doi.org/10.1145/335191.335372 Han et al., Mining frequent patterns without candidate
* generation]].
*
- * @param minSupport the minimal support level of the frequent pattern, any pattern appears
+ * @param minSupport the minimal support level of the frequent pattern, any pattern that appears
* more than (minSupport * size-of-the-dataset) times will be output
* @param numPartitions number of partitions used by parallel FP-growth
*
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
index ed49c9492f..94a24b527b 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/fpm/PrefixSpan.scala
@@ -38,9 +38,9 @@ import org.apache.spark.storage.StorageLevel
* The PrefixSpan algorithm is described in J. Pei, et al., PrefixSpan: Mining Sequential Patterns
* Efficiently by Prefix-Projected Pattern Growth ([[http://doi.org/10.1109/ICDE.2001.914830]]).
*
- * @param minSupport the minimal support level of the sequential pattern, any pattern appears
- * more than (minSupport * size-of-the-dataset) times will be output
- * @param maxPatternLength the maximal length of the sequential pattern, any pattern appears
+ * @param minSupport the minimal support level of the sequential pattern, any pattern that appears
+ * more than (minSupport * size-of-the-dataset) times will be output
+ * @param maxPatternLength the maximal length of the sequential pattern, any pattern that appears
* less than maxPatternLength will be output
* @param maxLocalProjDBSize The maximum number of items (including delimiters used in the internal
* storage format) allowed in a projected database before local
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
index 33aaf853e5..3e619c4264 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/ALS.scala
@@ -218,7 +218,7 @@ class ALS private (
}
/**
- * Run ALS with the configured parameters on an input RDD of (user, product, rating) triples.
+ * Run ALS with the configured parameters on an input RDD of [[Rating]] objects.
* Returns a MatrixFactorizationModel with feature vectors for each user and product.
*/
@Since("0.8.0")
@@ -279,18 +279,17 @@ class ALS private (
@Since("0.8.0")
object ALS {
/**
- * Train a matrix factorization model given an RDD of ratings given by users to some products,
- * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
- * product of two lower-rank matrices of a given rank (number of features). To solve for these
- * features, we run a given number of iterations of ALS. This is done using a level of
- * parallelism given by `blocks`.
+ * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+ * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+ * (number of features). To solve for these features, ALS is run iteratively with a configurable
+ * level of parallelism.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
- * @param lambda regularization factor (recommended: 0.01)
+ * @param iterations number of iterations of ALS
+ * @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
- * @param seed random seed
+ * @param seed random seed for initial matrix factorization model
*/
@Since("0.9.1")
def train(
@@ -305,16 +304,15 @@ object ALS {
}
/**
- * Train a matrix factorization model given an RDD of ratings given by users to some products,
- * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
- * product of two lower-rank matrices of a given rank (number of features). To solve for these
- * features, we run a given number of iterations of ALS. This is done using a level of
- * parallelism given by `blocks`.
+ * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+ * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+ * (number of features). To solve for these features, ALS is run iteratively with a configurable
+ * level of parallelism.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
- * @param lambda regularization factor (recommended: 0.01)
+ * @param iterations number of iterations of ALS
+ * @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
*/
@Since("0.8.0")
@@ -329,16 +327,15 @@ object ALS {
}
/**
- * Train a matrix factorization model given an RDD of ratings given by users to some products,
- * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
- * product of two lower-rank matrices of a given rank (number of features). To solve for these
- * features, we run a given number of iterations of ALS. The level of parallelism is determined
- * automatically based on the number of partitions in `ratings`.
+ * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+ * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+ * (number of features). To solve for these features, ALS is run iteratively with a level of
+ * parallelism automatically based on the number of partitions in `ratings`.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
- * @param lambda regularization factor (recommended: 0.01)
+ * @param iterations number of iterations of ALS
+ * @param lambda regularization parameter
*/
@Since("0.8.0")
def train(ratings: RDD[Rating], rank: Int, iterations: Int, lambda: Double)
@@ -347,15 +344,14 @@ object ALS {
}
/**
- * Train a matrix factorization model given an RDD of ratings given by users to some products,
- * in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
- * product of two lower-rank matrices of a given rank (number of features). To solve for these
- * features, we run a given number of iterations of ALS. The level of parallelism is determined
- * automatically based on the number of partitions in `ratings`.
+ * Train a matrix factorization model given an RDD of ratings by users for a subset of products.
+ * The ratings matrix is approximated as the product of two lower-rank matrices of a given rank
+ * (number of features). To solve for these features, ALS is run iteratively with a level of
+ * parallelism automatically based on the number of partitions in `ratings`.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
+ * @param iterations number of iterations of ALS
*/
@Since("0.8.0")
def train(ratings: RDD[Rating], rank: Int, iterations: Int)
@@ -372,11 +368,11 @@ object ALS {
*
* @param ratings RDD of (userID, productID, rating) pairs
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
- * @param lambda regularization factor (recommended: 0.01)
+ * @param iterations number of iterations of ALS
+ * @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
* @param alpha confidence parameter
- * @param seed random seed
+ * @param seed random seed for initial matrix factorization model
*/
@Since("0.8.1")
def trainImplicit(
@@ -392,16 +388,15 @@ object ALS {
}
/**
- * Train a matrix factorization model given an RDD of 'implicit preferences' given by users
- * to some products, in the form of (userID, productID, preference) pairs. We approximate the
- * ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
- * To solve for these features, we run a given number of iterations of ALS. This is done using
- * a level of parallelism given by `blocks`.
+ * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
+ * subset of products. The ratings matrix is approximated as the product of two lower-rank
+ * matrices of a given rank (number of features). To solve for these features, ALS is run
+ * iteratively with a configurable level of parallelism.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
- * @param lambda regularization factor (recommended: 0.01)
+ * @param iterations number of iterations of ALS
+ * @param lambda regularization parameter
* @param blocks level of parallelism to split computation into
* @param alpha confidence parameter
*/
@@ -418,16 +413,16 @@ object ALS {
}
/**
- * Train a matrix factorization model given an RDD of 'implicit preferences' given by users to
- * some products, in the form of (userID, productID, preference) pairs. We approximate the
- * ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
- * To solve for these features, we run a given number of iterations of ALS. The level of
- * parallelism is determined automatically based on the number of partitions in `ratings`.
+ * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
+ * subset of products. The ratings matrix is approximated as the product of two lower-rank
+ * matrices of a given rank (number of features). To solve for these features, ALS is run
+ * iteratively with a level of parallelism determined automatically based on the number of
+ * partitions in `ratings`.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
- * @param lambda regularization factor (recommended: 0.01)
+ * @param iterations number of iterations of ALS
+ * @param lambda regularization parameter
* @param alpha confidence parameter
*/
@Since("0.8.1")
@@ -437,16 +432,15 @@ object ALS {
}
/**
- * Train a matrix factorization model given an RDD of 'implicit preferences' ratings given by
- * users to some products, in the form of (userID, productID, rating) pairs. We approximate the
- * ratings matrix as the product of two lower-rank matrices of a given rank (number of features).
- * To solve for these features, we run a given number of iterations of ALS. The level of
- * parallelism is determined automatically based on the number of partitions in `ratings`.
- * Model parameters `alpha` and `lambda` are set to reasonable default values
+ * Train a matrix factorization model given an RDD of 'implicit preferences' of users for a
+ * subset of products. The ratings matrix is approximated as the product of two lower-rank
+ * matrices of a given rank (number of features). To solve for these features, ALS is run
+ * iteratively with a level of parallelism determined automatically based on the number of
+ * partitions in `ratings`.
*
- * @param ratings RDD of (userID, productID, rating) pairs
+ * @param ratings RDD of [[Rating]] objects with userID, productID, and rating
* @param rank number of features to use
- * @param iterations number of iterations of ALS (recommended: 10-20)
+ * @param iterations number of iterations of ALS
*/
@Since("0.8.1")
def trainImplicit(ratings: RDD[Rating], rank: Int, iterations: Int)
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
index 0dc40483dd..628cf1dd57 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/recommendation/MatrixFactorizationModel.scala
@@ -206,7 +206,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
}
/**
- * Recommends topK products for all users.
+ * Recommends top products for all users.
*
* @param num how many products to return for every user.
* @return [(Int, Array[Rating])] objects, where every tuple contains a userID and an array of
@@ -224,7 +224,7 @@ class MatrixFactorizationModel @Since("0.8.0") (
/**
- * Recommends topK users for all products.
+ * Recommends top users for all products.
*
* @param num how many users to return for every product.
* @return [(Int, Array[Rating])] objects, where every tuple contains a productID and an array
diff --git a/python/pyspark/mllib/fpm.py b/python/pyspark/mllib/fpm.py
index 2039decc0c..7a2d77a4da 100644
--- a/python/pyspark/mllib/fpm.py
+++ b/python/pyspark/mllib/fpm.py
@@ -29,7 +29,6 @@ __all__ = ['FPGrowth', 'FPGrowthModel', 'PrefixSpan', 'PrefixSpanModel']
@inherit_doc
@ignore_unicode_prefix
class FPGrowthModel(JavaModelWrapper):
-
"""
.. note:: Experimental
@@ -68,11 +67,15 @@ class FPGrowth(object):
"""
Computes an FP-Growth model that contains frequent itemsets.
- :param data: The input data set, each element contains a
- transaction.
- :param minSupport: The minimal support level (default: `0.3`).
- :param numPartitions: The number of partitions used by
- parallel FP-growth (default: same as input data).
+ :param data:
+ The input data set, each element contains a transaction.
+ :param minSupport:
+ The minimal support level.
+ (default: 0.3)
+ :param numPartitions:
+ The number of partitions used by parallel FP-growth. A value
+ of -1 will use the same number as input data.
+ (default: -1)
"""
model = callMLlibFunc("trainFPGrowthModel", data, float(minSupport), int(numPartitions))
return FPGrowthModel(model)
@@ -128,17 +131,27 @@ class PrefixSpan(object):
@since("1.6.0")
def train(cls, data, minSupport=0.1, maxPatternLength=10, maxLocalProjDBSize=32000000):
"""
- Finds the complete set of frequent sequential patterns in the input sequences of itemsets.
-
- :param data: The input data set, each element contains a sequnce of itemsets.
- :param minSupport: the minimal support level of the sequential pattern, any pattern appears
- more than (minSupport * size-of-the-dataset) times will be output (default: `0.1`)
- :param maxPatternLength: the maximal length of the sequential pattern, any pattern appears
- less than maxPatternLength will be output. (default: `10`)
- :param maxLocalProjDBSize: The maximum number of items (including delimiters used in
- the internal storage format) allowed in a projected database before local
- processing. If a projected database exceeds this size, another
- iteration of distributed prefix growth is run. (default: `32000000`)
+ Finds the complete set of frequent sequential patterns in the
+ input sequences of itemsets.
+
+ :param data:
+ The input data set, each element contains a sequence of
+ itemsets.
+ :param minSupport:
+ The minimal support level of the sequential pattern, any
+ pattern that appears more than (minSupport *
+ size-of-the-dataset) times will be output.
+ (default: 0.1)
+ :param maxPatternLength:
+ The maximal length of the sequential pattern, any pattern
+ that appears less than maxPatternLength will be output.
+ (default: 10)
+ :param maxLocalProjDBSize:
+ The maximum number of items (including delimiters used in the
+ internal storage format) allowed in a projected database before
+ local processing. If a projected database exceeds this size,
+ another iteration of distributed prefix growth is run.
+ (default: 32000000)
"""
model = callMLlibFunc("trainPrefixSpanModel",
data, minSupport, maxPatternLength, maxLocalProjDBSize)
diff --git a/python/pyspark/mllib/recommendation.py b/python/pyspark/mllib/recommendation.py
index 93e47a797f..7e60255d43 100644
--- a/python/pyspark/mllib/recommendation.py
+++ b/python/pyspark/mllib/recommendation.py
@@ -138,7 +138,8 @@ class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
@since("0.9.0")
def predictAll(self, user_product):
"""
- Returns a list of predicted ratings for input user and product pairs.
+ Returns a list of predicted ratings for input user and product
+ pairs.
"""
assert isinstance(user_product, RDD), "user_product should be RDD of (user, product)"
first = user_product.first()
@@ -165,28 +166,33 @@ class MatrixFactorizationModel(JavaModelWrapper, JavaSaveable, JavaLoader):
@since("1.4.0")
def recommendUsers(self, product, num):
"""
- Recommends the top "num" number of users for a given product and returns a list
- of Rating objects sorted by the predicted rating in descending order.
+ Recommends the top "num" number of users for a given product and
+ returns a list of Rating objects sorted by the predicted rating in
+ descending order.
"""
return list(self.call("recommendUsers", product, num))
@since("1.4.0")
def recommendProducts(self, user, num):
"""
- Recommends the top "num" number of products for a given user and returns a list
- of Rating objects sorted by the predicted rating in descending order.
+ Recommends the top "num" number of products for a given user and
+ returns a list of Rating objects sorted by the predicted rating in
+ descending order.
"""
return list(self.call("recommendProducts", user, num))
def recommendProductsForUsers(self, num):
"""
- Recommends top "num" products for all users. The number returned may be less than this.
+ Recommends the top "num" number of products for all users. The
+ number of recommendations returned per user may be less than "num".
"""
return self.call("wrappedRecommendProductsForUsers", num)
def recommendUsersForProducts(self, num):
"""
- Recommends top "num" users for all products. The number returned may be less than this.
+ Recommends the top "num" number of users for all products. The
+ number of recommendations returned per product may be less than
+ "num".
"""
return self.call("wrappedRecommendUsersForProducts", num)
@@ -234,11 +240,34 @@ class ALS(object):
def train(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, nonnegative=False,
seed=None):
"""
- Train a matrix factorization model given an RDD of ratings given by users to some products,
- in the form of (userID, productID, rating) pairs. We approximate the ratings matrix as the
- product of two lower-rank matrices of a given rank (number of features). To solve for these
- features, we run a given number of iterations of ALS. This is done using a level of
- parallelism given by `blocks`.
+ Train a matrix factorization model given an RDD of ratings by users
+ for a subset of products. The ratings matrix is approximated as the
+ product of two lower-rank matrices of a given rank (number of
+ features). To solve for these features, ALS is run iteratively with
+ a configurable level of parallelism.
+
+ :param ratings:
+ RDD of `Rating` or (userID, productID, rating) tuple.
+ :param rank:
+ Rank of the feature matrices computed (number of features).
+ :param iterations:
+ Number of iterations of ALS.
+ (default: 5)
+ :param lambda_:
+ Regularization parameter.
+ (default: 0.01)
+ :param blocks:
+ Number of blocks used to parallelize the computation. A value
+ of -1 will use an auto-configured number of blocks.
+ (default: -1)
+ :param nonnegative:
+ A value of True will solve least-squares with nonnegativity
+ constraints.
+ (default: False)
+ :param seed:
+ Random seed for initial matrix factorization model. A value
+ of None will use system time as the seed.
+ (default: None)
"""
model = callMLlibFunc("trainALSModel", cls._prepare(ratings), rank, iterations,
lambda_, blocks, nonnegative, seed)
@@ -249,11 +278,37 @@ class ALS(object):
def trainImplicit(cls, ratings, rank, iterations=5, lambda_=0.01, blocks=-1, alpha=0.01,
nonnegative=False, seed=None):
"""
- Train a matrix factorization model given an RDD of 'implicit preferences' given by users
- to some products, in the form of (userID, productID, preference) pairs. We approximate the
- ratings matrix as the product of two lower-rank matrices of a given rank (number of
- features). To solve for these features, we run a given number of iterations of ALS.
- This is done using a level of parallelism given by `blocks`.
+ Train a matrix factorization model given an RDD of 'implicit
+ preferences' of users for a subset of products. The ratings matrix
+ is approximated as the product of two lower-rank matrices of a
+ given rank (number of features). To solve for these features, ALS
+ is run iteratively with a configurable level of parallelism.
+
+ :param ratings:
+ RDD of `Rating` or (userID, productID, rating) tuple.
+ :param rank:
+ Rank of the feature matrices computed (number of features).
+ :param iterations:
+ Number of iterations of ALS.
+ (default: 5)
+ :param lambda_:
+ Regularization parameter.
+ (default: 0.01)
+ :param blocks:
+ Number of blocks used to parallelize the computation. A value
+ of -1 will use an auto-configured number of blocks.
+ (default: -1)
+ :param alpha:
+ A constant used in computing confidence.
+ (default: 0.01)
+ :param nonnegative:
+ A value of True will solve least-squares with nonnegativity
+ constraints.
+ (default: False)
+ :param seed:
+ Random seed for initial matrix factorization model. A value
+ of None will use system time as the seed.
+ (default: None)
"""
model = callMLlibFunc("trainImplicitALSModel", cls._prepare(ratings), rank,
iterations, lambda_, blocks, alpha, nonnegative, seed)
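
Note on the reworded ALS docstrings: they describe approximating the ratings matrix as the product of two lower-rank matrices and solving for the factors iteratively. The following is a minimal plain-Python sketch of that idea for rank 1 only. It is not Spark's implementation: the names `als_rank1` and `predict` are hypothetical, the regularization is plain ridge with a constant `lambda_`, and there is no blocking or parallelism.

```python
# Hypothetical illustration, not part of the commit and not Spark code.
# Ratings are (userID, productID, rating) triples, as in the docstrings.

def als_rank1(ratings, iterations=20, lambda_=0.01):
    """Alternate between solving for user and product factors (rank 1)."""
    users = sorted({u for u, _, _ in ratings})
    products = sorted({p for _, p, _ in ratings})
    user_f = {u: 1.0 for u in users}      # initial model (Spark seeds this randomly)
    prod_f = {p: 1.0 for p in products}
    for _ in range(iterations):
        # Hold product factors fixed; each user is a 1-D ridge regression.
        for u in users:
            rows = [(p, r) for uu, p, r in ratings if uu == u]
            num = sum(r * prod_f[p] for p, r in rows)
            den = lambda_ + sum(prod_f[p] ** 2 for p, _ in rows)
            user_f[u] = num / den
        # Hold user factors fixed; solve for each product the same way.
        for p in products:
            cols = [(u, r) for u, pp, r in ratings if pp == p]
            num = sum(r * user_f[u] for u, r in cols)
            den = lambda_ + sum(user_f[u] ** 2 for u, _ in cols)
            prod_f[p] = num / den
    return user_f, prod_f


def predict(user_f, prod_f, user, product):
    """Predicted rating is the product of the two scalar factors."""
    return user_f[user] * prod_f[product]


# Ratings generated from user factors (1, 2) and product factors (2, 3, 4);
# the entry for user 1, product 2 (true value 8) is held out.
observed = [(0, 0, 2.0), (0, 1, 3.0), (0, 2, 4.0), (1, 0, 4.0), (1, 1, 6.0)]
user_f, prod_f = als_rank1(observed)
estimate = predict(user_f, prod_f, 1, 2)
print(round(estimate, 2))
```

On this toy data the held-out estimate should land close to 8, since the observed ratings are exactly rank 1. Real data is noisy and needs a higher rank, which turns each scalar division above into a small regularized linear solve.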