author     BenFradet <benjamin.fradet@gmail.com>    2016-02-16 13:03:28 +0000
committer  Sean Owen <sowen@cloudera.com>           2016-02-16 13:03:28 +0000
commit     00c72d27bf2e3591c4068fb344fa3edf1662ad81 (patch)
tree       b32ed039fd5f4e3775622a9918173df53b943e30 /docs
parent     827ed1c06785692d14857bd41f1fd94a0853874a (diff)
[SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and collaborative filtering in general
This documents the implementation of ALS in `spark.ml`, with example code in Scala, Java and Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10411 from BenFradet/SPARK-12247.
Diffstat (limited to 'docs')
-rw-r--r--  docs/_data/menu-ml.yaml                |   2
-rw-r--r--  docs/ml-collaborative-filtering.md     | 148
-rw-r--r--  docs/mllib-collaborative-filtering.md  |  30
-rw-r--r--  docs/mllib-guide.md                    |   1
4 files changed, 166 insertions, 15 deletions
diff --git a/docs/_data/menu-ml.yaml b/docs/_data/menu-ml.yaml
index 2eea9a917a..3fd3ee2823 100644
--- a/docs/_data/menu-ml.yaml
+++ b/docs/_data/menu-ml.yaml
@@ -6,5 +6,7 @@
url: ml-classification-regression.html
- text: Clustering
url: ml-clustering.html
+- text: Collaborative filtering
+ url: ml-collaborative-filtering.html
- text: Advanced topics
url: ml-advanced.html
diff --git a/docs/ml-collaborative-filtering.md b/docs/ml-collaborative-filtering.md
new file mode 100644
index 0000000000..4514a358e1
--- /dev/null
+++ b/docs/ml-collaborative-filtering.md
@@ -0,0 +1,148 @@
+---
+layout: global
+title: Collaborative Filtering - spark.ml
+displayTitle: Collaborative Filtering - spark.ml
+---
+
+* Table of contents
+{:toc}
+
+## Collaborative filtering
+
+[Collaborative filtering](http://en.wikipedia.org/wiki/Recommender_system#Collaborative_filtering)
+is commonly used for recommender systems. These techniques aim to fill in the
+missing entries of a user-item association matrix. `spark.ml` currently supports
+model-based collaborative filtering, in which users and products are described
+by a small set of latent factors that can be used to predict missing entries.
+`spark.ml` uses the [alternating least squares
+(ALS)](http://dl.acm.org/citation.cfm?id=1608614)
+algorithm to learn these latent factors. The implementation in `spark.ml` has the
+following parameters (a short configuration sketch follows the list):
+
+* *numBlocks* is the number of blocks the users and items will be partitioned into in order to parallelize computation (defaults to 10).
+* *rank* is the number of latent factors in the model (defaults to 10).
+* *maxIter* is the maximum number of iterations to run (defaults to 10).
+* *regParam* specifies the regularization parameter in ALS (defaults to 1.0).
+* *implicitPrefs* specifies whether to use the *explicit feedback* ALS variant or one adapted for
+ *implicit feedback* data (defaults to `false` which means using *explicit feedback*).
+* *alpha* is a parameter applicable to the implicit feedback variant of ALS that governs the
+ *baseline* confidence in preference observations (defaults to 1.0).
+* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
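+
+As a quick orientation aid (not taken from the linked examples), the sketch below shows how each
+of these parameters maps onto a setter of the Scala `ALS` estimator, using the default values
+listed above:
+
+{% highlight scala %}
+import org.apache.spark.ml.recommendation.ALS
+
+val als = new ALS()
+  .setNumBlocks(10)        // partitioning of users and items for parallelism
+  .setRank(10)             // number of latent factors
+  .setMaxIter(10)          // maximum number of iterations
+  .setRegParam(1.0)        // regularization parameter
+  .setImplicitPrefs(false) // use the explicit feedback variant
+  .setAlpha(1.0)           // only relevant when implicitPrefs is true
+  .setNonnegative(false)   // no nonnegativity constraints on least squares
+{% endhighlight %}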
+
+### Explicit vs. implicit feedback
+
+The standard approach to matrix factorization based collaborative filtering treats
+the entries in the user-item matrix as *explicit* preferences given by the user to the item,
+for example, users giving ratings to movies.
+
+It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
+clicks, purchases, likes, shares, etc.). The approach used in `spark.ml` to deal with such data is taken
+from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
+Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
+as numbers representing the *strength* of observations of user actions (such as the number of clicks,
+or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
+confidence in observed user preferences, rather than explicit ratings given to items. The model
+then tries to find latent factors that can be used to predict the expected preference of a user for
+an item.
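+
+Concretely, the linked paper pairs each observation `\(r_{ui}\)` with a binary preference and a
+confidence level that grows linearly with the observed strength,
+`\(c_{ui} = 1 + \alpha r_{ui}\)`, which is where the *alpha* parameter described above enters.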
+
+### Scaling of the regularization parameter
+
+We scale the regularization parameter `regParam` in solving each least squares problem by
+the number of ratings the user generated in updating user factors,
+or the number of ratings the product received in updating product factors.
+This approach is named "ALS-WR" and discussed in the paper
+"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
+It makes `regParam` less dependent on the scale of the dataset, so we can apply the
+best parameter learned from a sampled subset to the full dataset and expect similar performance.
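+
+Put differently, the explicit-feedback objective being minimized under this scaling is,
+in rough sketch form (for orientation, not quoted from the implementation):
+`\(\min_{x_*, y_*} \sum_{(u,i)} (r_{ui} - x_u^T y_i)^2 + \lambda (\sum_u n_u \|x_u\|^2 + \sum_i n_i \|y_i\|^2)\)`,
+where `\(n_u\)` and `\(n_i\)` are the number of ratings produced by user `\(u\)` and received by item `\(i\)`.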
+
+## Examples
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+In the following example, we load rating data from the
+[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
+consisting of a user, a movie, a rating and a timestamp.
+We then train an ALS model which assumes, by default, that the ratings are
+explicit (`implicitPrefs` is `false`).
+We evaluate the recommendation model by measuring the root-mean-square error of
+rating prediction.
+
+Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.ml.recommendation.ALS)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/ALSExample.scala %}
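+
+For orientation, the core of that example boils down to the following condensed sketch, assuming
+a `ratings` DataFrame with `userId`, `movieId` and `rating` columns (see the linked example for
+the complete, runnable code):
+
+{% highlight scala %}
+import org.apache.spark.ml.evaluation.RegressionEvaluator
+import org.apache.spark.ml.recommendation.ALS
+
+// `ratings` is assumed to be a DataFrame of (userId, movieId, rating) rows.
+// Hold out 20% of the ratings for evaluation.
+val Array(training, test) = ratings.randomSplit(Array(0.8, 0.2))
+
+val als = new ALS()
+  .setMaxIter(5)
+  .setRegParam(0.01)
+  .setUserCol("userId")
+  .setItemCol("movieId")
+  .setRatingCol("rating")
+val model = als.fit(training)
+
+// Compute the RMSE of the predicted ratings on the test data.
+val predictions = model.transform(test)
+val evaluator = new RegressionEvaluator()
+  .setMetricName("rmse")
+  .setLabelCol("rating")
+  .setPredictionCol("prediction")
+println(s"Root-mean-square error = ${evaluator.evaluate(predictions)}")
+{% endhighlight %}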
+
+If the rating matrix is derived from another source of information (i.e. it is
+inferred from other signals), you can set `implicitPrefs` to `true` to get
+better results:
+
+{% highlight scala %}
+val als = new ALS()
+ .setMaxIter(5)
+ .setRegParam(0.01)
+ .setImplicitPrefs(true)
+ .setUserCol("userId")
+ .setItemCol("movieId")
+ .setRatingCol("rating")
+{% endhighlight %}
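+
+One caveat worth keeping in mind: with `implicitPrefs` enabled, the model predicts
+confidence-weighted *preferences* rather than ratings, so an RMSE computed against raw
+rating values, as in the example above, is less meaningful in that setting.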
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+In the following example, we load rating data from the
+[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
+consisting of a user, a movie, a rating and a timestamp.
+We then train an ALS model which assumes, by default, that the ratings are
+explicit (`implicitPrefs` is `false`).
+We evaluate the recommendation model by measuring the root-mean-square error of
+rating prediction.
+
+Refer to the [`ALS` Java docs](api/java/org/apache/spark/ml/recommendation/ALS.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaALSExample.java %}
+
+If the rating matrix is derived from another source of information (i.e. it is
+inferred from other signals), you can set `implicitPrefs` to `true` to get
+better results:
+
+{% highlight java %}
+ALS als = new ALS()
+ .setMaxIter(5)
+ .setRegParam(0.01)
+ .setImplicitPrefs(true)
+ .setUserCol("userId")
+ .setItemCol("movieId")
+ .setRatingCol("rating");
+{% endhighlight %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+In the following example, we load rating data from the
+[MovieLens dataset](http://grouplens.org/datasets/movielens/), each row
+consisting of a user, a movie, a rating and a timestamp.
+We then train an ALS model which assumes, by default, that the ratings are
+explicit (`implicitPrefs` is `False`).
+We evaluate the recommendation model by measuring the root-mean-square error of
+rating prediction.
+
+Refer to the [`ALS` Python docs](api/python/pyspark.ml.html#pyspark.ml.recommendation.ALS)
+for more details on the API.
+
+{% include_example python/ml/als_example.py %}
+
+If the rating matrix is derived from another source of information (i.e. it is
+inferred from other signals), you can set `implicitPrefs` to `True` to get
+better results:
+
+{% highlight python %}
+als = ALS(maxIter=5, regParam=0.01, implicitPrefs=True,
+ userCol="userId", itemCol="movieId", ratingCol="rating")
+{% endhighlight %}
+
+</div>
+</div>
diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 1ebb4654ae..b8f0566d87 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -31,17 +31,18 @@ following parameters:
### Explicit vs. implicit feedback
The standard approach to matrix factorization based collaborative filtering treats
-the entries in the user-item matrix as *explicit* preferences given by the user to the item.
+the entries in the user-item matrix as *explicit* preferences given by the user to the item,
+for example, users giving ratings to movies.
It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
-from
-[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
-Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
-as a combination of binary preferences and *confidence values*. The ratings are then related to the
-level of confidence in observed user preferences, rather than explicit ratings given to items. The
-model then tries to find latent factors that can be used to predict the expected preference of a
-user for an item.
+from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
+Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
+as numbers representing the *strength* of observations of user actions (such as the number of clicks,
+or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
+confidence in observed user preferences, rather than explicit ratings given to items. The model
+then tries to find latent factors that can be used to predict the expected preference of a user for
+an item.
### Scaling of the regularization parameter
@@ -50,9 +51,8 @@ the number of ratings the user generated in updating user factors,
or the number of ratings the product received in updating product factors.
This approach is named "ALS-WR" and discussed in the paper
"[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
-It makes `lambda` less dependent on the scale of the dataset.
-So we can apply the best parameter learned from a sampled subset to the full dataset
-and expect similar performance.
+It makes `lambda` less dependent on the scale of the dataset, so we can apply the
+best parameter learned from a sampled subset to the full dataset and expect similar performance.
## Examples
@@ -64,11 +64,11 @@ We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.rec
method which assumes ratings are explicit. We evaluate the
recommendation model by measuring the Mean Squared Error of rating prediction.
-Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for details on the API.
+Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for more details on the API.
{% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}
-If the rating matrix is derived from another source of information (e.g., it is inferred from
+If the rating matrix is derived from another source of information (i.e. it is inferred from
other signals), you can use the `trainImplicit` method to get better results.
{% highlight scala %}
@@ -85,7 +85,7 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a
calling `.rdd()` on your `JavaRDD` object. A self-contained application example
that is equivalent to the provided example in Scala is given below:
-Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for details on the API.
+Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.
{% include_example java/org/apache/spark/examples/mllib/JavaRecommendationExample.java %}
</div>
@@ -99,7 +99,7 @@ Refer to the [`ALS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.rec
{% include_example python/mllib/recommendation_example.py %}
-If the rating matrix is derived from other source of information (i.e., it is inferred from other
+If the rating matrix is derived from another source of information (i.e. it is inferred from other
signals), you can use the trainImplicit method to get better results.
{% highlight python %}
diff --git a/docs/mllib-guide.md b/docs/mllib-guide.md
index 7ef91a178c..fa5e906035 100644
--- a/docs/mllib-guide.md
+++ b/docs/mllib-guide.md
@@ -71,6 +71,7 @@ We list major functionality from both below, with links to detailed guides.
* [Extracting, transforming and selecting features](ml-features.html)
* [Classification and regression](ml-classification-regression.html)
* [Clustering](ml-clustering.html)
+* [Collaborative filtering](ml-collaborative-filtering.html)
* [Advanced topics](ml-advanced.html)
Some techniques are not available yet in spark.ml, most notably dimensionality reduction