author sethah <seth.hendrickson16@gmail.com> 2016-10-05 18:28:21 +0000
committer DB Tsai <dbtsai@dbtsai.com> 2016-10-05 18:28:21 +0000
commit 9df54f5325c2942bb77008ff1810e2fb5f6d848b
tree b178ea4d0033a44e37902426104586be1a2ddf56 /docs/ml-classification-regression.md
parent 6a05eb24d043aa93390f353850d56efa6124e063
[SPARK-17239][ML][DOC] Update user guide for multiclass logistic regression

## What changes were proposed in this pull request?

Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training.

## How was this patch tested?

Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #15349 from sethah/SPARK-17239.

Diffstat (limited to 'docs/ml-classification-regression.md')
-rw-r--r-- docs/ml-classification-regression.md | 65
1 files changed, 58 insertions, 7 deletions
diff --git a/docs/ml-classification-regression.md b/docs/ml-classification-regression.md
index 7c2437eacd..bb2e404330 100644
--- a/docs/ml-classification-regression.md
+++ b/docs/ml-classification-regression.md
@@ -34,17 +34,22 @@ discussing specific classes of algorithms, such as linear methods, trees, and en
## Logistic regression
-Logistic regression is a popular method to predict a binary response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcome.
-For more background and more details about the implementation, refer to the documentation of the [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression).
+Logistic regression is a popular method to predict a categorical response. It is a special case of [Generalized Linear models](https://en.wikipedia.org/wiki/Generalized_linear_model) that predicts the probability of the outcomes.
+In `spark.ml` logistic regression can be used to predict a binary outcome by using binomial logistic regression, or it can be used to predict a multiclass outcome by using multinomial logistic regression. Use the `family`
+parameter to select between these two algorithms, or leave it unset and Spark will infer the correct variant.
- > The current implementation of logistic regression in `spark.ml` only supports binary classes. Support for multiclass regression will be added in the future.
+ > Multinomial logistic regression can be used for binary classification by setting the `family` param to "multinomial". It will produce two sets of coefficients and two intercepts.
> When fitting LogisticRegressionModel without intercept on dataset with constant nonzero column, Spark MLlib outputs zero coefficients for constant nonzero columns. This behavior is the same as R glmnet but different from LIBSVM.
+### Binomial logistic regression
+
+For more background and more details about the implementation of binomial logistic regression, refer to the documentation of [logistic regression in `spark.mllib`](mllib-linear-methods.html#logistic-regression).
+
**Example**
-The following example shows how to train a logistic regression model
-with elastic net regularization. `elasticNetParam` corresponds to
+The following example shows how to train binomial and multinomial logistic regression
+models for binary classification with elastic net regularization. `elasticNetParam` corresponds to
$\alpha$ and `regParam` corresponds to $\lambda$.
<div class="codetabs">
@@ -92,8 +97,8 @@ provides a summary for a
[`LogisticRegressionModel`](api/java/org/apache/spark/ml/classification/LogisticRegressionModel.html).
Currently, only binary classification is supported and the
summary must be explicitly cast to
-[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
-This will likely change when multiclass classification is supported.
+[`BinaryLogisticRegressionTrainingSummary`](api/java/org/apache/spark/ml/classification/BinaryLogisticRegressionTrainingSummary.html).
+Support for multiclass model summaries will be added in the future.
Continuing the earlier example:
@@ -107,6 +112,52 @@ Logistic regression model summary is not yet supported in Python.
</div>
+### Multinomial logistic regression
+
+Multiclass classification is supported via multinomial logistic (softmax) regression. In multinomial logistic regression,
+the algorithm produces $K$ sets of coefficients, or a matrix of dimension $K \times J$ where $K$ is the number of outcome
+classes and $J$ is the number of features. If the algorithm is fit with an intercept term then a length $K$ vector of
+intercepts is available.
+
+ > Multinomial coefficients are available as `coefficientMatrix` and intercepts are available as `interceptVector`.
+
+ > `coefficients` and `intercept` methods on a logistic regression model trained with multinomial family are not supported. Use `coefficientMatrix` and `interceptVector` instead.
+
+The conditional probabilities of the outcome classes $k \in \{1, 2, ..., K\}$ are modeled using the softmax function.
+
+`\[
+ P(Y=k|\mathbf{X}, \boldsymbol{\beta}_k, \beta_{0k}) = \frac{e^{\boldsymbol{\beta}_k \cdot \mathbf{X} + \beta_{0k}}}{\sum_{k'=0}^{K-1} e^{\boldsymbol{\beta}_{k'} \cdot \mathbf{X} + \beta_{0k'}}}
+\]`
+
+We minimize the weighted negative log-likelihood, using a multinomial response model, with elastic-net penalty to control for overfitting.
+
+`\[
+\min_{\beta, \beta_0} -\left[\sum_{i=1}^L w_i \cdot \log P(Y = y_i|\mathbf{x}_i)\right] + \lambda \left[\frac{1}{2}\left(1 - \alpha\right)||\boldsymbol{\beta}||_2^2 + \alpha ||\boldsymbol{\beta}||_1\right]
+\]`
+
+For a detailed derivation please see [here](https://en.wikipedia.org/wiki/Multinomial_logistic_regression#As_a_log-linear_model).
+
+**Example**
+
+The following example shows how to train a multiclass logistic regression
+model with elastic net regularization.
+
+<div class="codetabs">
+
+<div data-lang="scala" markdown="1">
+{% include_example scala/org/apache/spark/examples/ml/MulticlassLogisticRegressionWithElasticNetExample.scala %}
+</div>
+
+<div data-lang="java" markdown="1">
+{% include_example java/org/apache/spark/examples/ml/JavaMulticlassLogisticRegressionWithElasticNetExample.java %}
+</div>
+
+<div data-lang="python" markdown="1">
+{% include_example python/ml/multiclass_logistic_regression_with_elastic_net.py %}
+</div>
+
+</div>
+
## Decision tree classifier
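
A note on the multinomial section this patch adds: the softmax probabilities and the elastic-net objective can be checked by hand on small inputs. The sketch below is plain Python and is not Spark's implementation. The function names `softmax_probabilities` and `elastic_net_objective` are hypothetical, labels are assumed to be zero-indexed (matching the index range of the sum in the softmax denominator), and the objective follows the formula exactly as written in the patch, without any scaling or standardization that Spark may apply internally.

```python
import math


def softmax_probabilities(coefficient_matrix, intercept_vector, features):
    """Return P(Y = k | x) for every class k.

    coefficient_matrix: K rows of J coefficients (one row per class).
    intercept_vector:   K intercepts (all zeros if fit without intercept).
    features:           J feature values for a single example.
    """
    # Margin for class k is beta_k . x + beta_0k
    margins = [
        sum(b * x for b, x in zip(row, features)) + b0
        for row, b0 in zip(coefficient_matrix, intercept_vector)
    ]
    # Subtracting the largest margin leaves the ratios unchanged
    # and avoids overflow in exp().
    largest = max(margins)
    exps = [math.exp(m - largest) for m in margins]
    total = sum(exps)
    return [e / total for e in exps]


def elastic_net_objective(coefficient_matrix, intercept_vector, rows,
                          reg_param, elastic_net_param):
    """Weighted negative log-likelihood plus elastic-net penalty.

    rows: iterable of (features, label, weight) with label in 0..K-1.
    reg_param plays the role of lambda, elastic_net_param of alpha.
    """
    neg_log_likelihood = 0.0
    for features, label, weight in rows:
        probs = softmax_probabilities(
            coefficient_matrix, intercept_vector, features)
        neg_log_likelihood -= weight * math.log(probs[label])

    # The penalty applies to coefficients only, not to intercepts.
    flat = [b for row in coefficient_matrix for b in row]
    l2_squared = sum(b * b for b in flat)
    l1 = sum(abs(b) for b in flat)
    penalty = reg_param * (
        0.5 * (1.0 - elastic_net_param) * l2_squared
        + elastic_net_param * l1
    )
    return neg_log_likelihood + penalty
```

With all coefficients and intercepts set to zero, every class receives probability $1/K$, so a single example with weight 1 contributes $\log K$ to the objective and the penalty term vanishes. That gives a quick sanity check for any reimplementation of the formulas.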