From 00c72d27bf2e3591c4068fb344fa3edf1662ad81 Mon Sep 17 00:00:00 2001
From: BenFradet <benjamin.fradet@gmail.com>
Date: Tue, 16 Feb 2016 13:03:28 +0000
Subject: [SPARK-12247][ML][DOC] Documentation for spark.ml's ALS and
 collaborative filtering in general

This documents the implementation of ALS in `spark.ml` with example code in Scala, Java and Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10411 from BenFradet/SPARK-12247.
---
 docs/mllib-collaborative-filtering.md | 30 +++++++++++++++---------------
 1 file changed, 15 insertions(+), 15 deletions(-)

diff --git a/docs/mllib-collaborative-filtering.md b/docs/mllib-collaborative-filtering.md
index 1ebb4654ae..b8f0566d87 100644
--- a/docs/mllib-collaborative-filtering.md
+++ b/docs/mllib-collaborative-filtering.md
@@ -31,17 +31,18 @@ following parameters:
 ### Explicit vs. implicit feedback
 
 The standard approach to matrix factorization based collaborative filtering treats 
-the entries in the user-item matrix as *explicit* preferences given by the user to the item.
+the entries in the user-item matrix as *explicit* preferences given by the user to the item,
+for example, users giving ratings to movies.
 
 It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
 clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
-from
-[Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
-Essentially instead of trying to model the matrix of ratings directly, this approach treats the data
-as a combination of binary preferences and *confidence values*. The ratings are then related to the
-level of confidence in observed user preferences, rather than explicit ratings given to items.  The
-model then tries to find latent factors that can be used to predict the expected preference of a
-user for an item.
+from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
+Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
+as numbers representing the *strength* of observed user actions (such as the number of clicks,
+or the cumulative duration someone spent viewing a movie). Those numbers are then related to the level of
+confidence in observed user preferences, rather than explicit ratings given to items. The model
+then tries to find latent factors that can be used to predict the expected preference of a user for
+an item.
 
 ### Scaling of the regularization parameter
 
@@ -50,9 +51,8 @@ the number of ratings the user generated in updating user factors,
 or the number of ratings the product received in updating product factors.
 This approach is named "ALS-WR" and discussed in the paper
 "[Large-Scale Parallel Collaborative Filtering for the Netflix Prize](http://dx.doi.org/10.1007/978-3-540-68880-8_32)".
-It makes `lambda` less dependent on the scale of the dataset.
-So we can apply the best parameter learned from a sampled subset to the full dataset
-and expect similar performance.
+It makes `lambda` less dependent on the scale of the dataset, so we can apply the
+best parameter learned from a sampled subset to the full dataset and expect similar performance.
 
 ## Examples
 
@@ -64,11 +64,11 @@ We use the default [ALS.train()](api/scala/index.html#org.apache.spark.mllib.rec
 method which assumes ratings are explicit. We evaluate the
 recommendation model by measuring the Mean Squared Error of rating prediction.
 
-Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for details on the API.
+Refer to the [`ALS` Scala docs](api/scala/index.html#org.apache.spark.mllib.recommendation.ALS) for more details on the API.
 
 {% include_example scala/org/apache/spark/examples/mllib/RecommendationExample.scala %}
 
-If the rating matrix is derived from another source of information (e.g., it is inferred from
+If the rating matrix is derived from another source of information (i.e. it is inferred from
 other signals), you can use the `trainImplicit` method to get better results.
 
 {% highlight scala %}
@@ -85,7 +85,7 @@ Spark Java API uses a separate `JavaRDD` class. You can convert a Java RDD to a
 calling `.rdd()` on your `JavaRDD` object. A self-contained application example
 that is equivalent to the provided example in Scala is given below:
 
-Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for details on the API.
+Refer to the [`ALS` Java docs](api/java/org/apache/spark/mllib/recommendation/ALS.html) for more details on the API.
 
 {% include_example java/org/apache/spark/examples/mllib/JavaRecommendationExample.java %}
 </div>
@@ -99,7 +99,7 @@ Refer to the [`ALS` Python docs](api/python/pyspark.mllib.html#pyspark.mllib.rec
 
 {% include_example python/mllib/recommendation_example.py %}
 
-If the rating matrix is derived from other source of information (i.e., it is inferred from other
+If the rating matrix is derived from another source of information (i.e. it is inferred from other
 signals), you can use the trainImplicit method to get better results.
 
 {% highlight python %}
-- 
cgit v1.2.3