author Yanbo Liang <ybliang8@gmail.com> 2016-01-25 11:52:26 -0800
committer Xiangrui Meng <meng@databricks.com> 2016-01-25 11:52:26 -0800
commit dd2325d9a7de7bef9a6bc2f0d5f26e605545b52d
tree a1cce28acaedf721d574f5b7bdf645903475d39b /mllib
parent 4ee8191e57cb823a23ceca17908af86e70354554
[SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions
Update the user guide for RFormula feature interactions. This also documents other features new in Spark 1.6, such as support for string labels.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10222 from yanboliang/spark-11965.
Diffstat (limited to 'mllib')
-rw-r--r-- mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala 21
1 files changed, 21 insertions, 0 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
index 6cc9d02544..c21da218b3 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/RFormula.scala
@@ -45,6 +45,27 @@ private[feature] trait RFormulaBase extends HasFeaturesCol with HasLabelCol {
* Implements the transforms required for fitting a dataset against an R model formula. Currently
* we support a limited subset of the R operators, including '~', '.', ':', '+', and '-'. Also see
* the R formula docs here: http://stat.ethz.ch/R-manual/R-patched/library/stats/html/formula.html
+ *
+ * The basic operators are:
+ * - `~` separates the target from the terms
+ * - `+` concatenates terms; "+ 0" removes the intercept
+ * - `-` removes a term; "- 1" removes the intercept
+ * - `:` interaction (multiplication for numeric values, or binarized categorical values)
+ * - `.` all columns except the target
+ *
+ * Suppose `a` and `b` are double columns. The following simple examples
+ * illustrate the effect of `RFormula`:
+ * - `y ~ a + b` means model `y ~ w0 + w1 * a + w2 * b` where `w0` is the intercept and `w1, w2`
+ * are coefficients.
+ * - `y ~ a + b + a:b - 1` means model `y ~ w1 * a + w2 * b + w3 * a * b` where `w1, w2, w3`
+ * are coefficients.
+ *
+ * RFormula produces a vector column of features and a double or string column of labels.
+ * As when formulas are used in R for linear regression, string input columns are one-hot
+ * encoded, and numeric columns are cast to doubles.
+ * If the label column is of type string, it is first transformed to double with
+ * `StringIndexer`. If the label column does not exist in the DataFrame, the output label column
+ * is created from the specified response variable in the formula.
*/
@Experimental
class RFormula(override val uid: String) extends Estimator[RFormulaModel] with RFormulaBase {
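
The two examples in the doc comment above can be illustrated with a minimal plain-Scala sketch. It does not use Spark or the `RFormula` class itself: `InteractionSketch`, `Row`, `additive`, and `withInteraction` are hypothetical names introduced only for this illustration. It assumes `a` and `b` are double columns and that the intercept is handled by the downstream model rather than stored in the feature vector.

```scala
// Hypothetical helper, not part of Spark: mimics the feature terms that the
// doc comment describes for two numeric (double) columns `a` and `b`.
object InteractionSketch {
  final case class Row(a: Double, b: Double)

  // Terms for `y ~ a + b`: the two columns as-is.
  // The intercept w0 is assumed to be fitted by the model, not stored here.
  def additive(r: Row): Vector[Double] = Vector(r.a, r.b)

  // Terms for `y ~ a + b + a:b - 1`: the two columns plus their product,
  // since `:` on numeric values means multiplication.
  def withInteraction(r: Row): Vector[Double] = Vector(r.a, r.b, r.a * r.b)

  def main(args: Array[String]): Unit = {
    val rows = Seq(Row(1.0, 2.0), Row(3.0, 4.0))
    rows.foreach { r =>
      println(s"$r -> ${additive(r)} | ${withInteraction(r)}")
    }
  }
}
```

In this sketch each coefficient `w1, w2, w3` from the doc comment would pair with one entry of the vector returned by `withInteraction`.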