path: root/mllib
author     Shuai Lin <linshuai2012@gmail.com>    2016-08-02 09:14:08 -0700
committer  Sean Owen <sowen@cloudera.com>        2016-08-02 09:14:08 -0700
commit     36827ddafeaa7a683362eb8da31065aaff9676d5 (patch)
tree       87ff13f11c8c858996a5a85e11c0022de9b1104b /mllib
parent     511dede1118f20a7756f614acb6fc88af52c9de9 (diff)
[SPARK-16822][DOC] Support latex in scaladoc.
## What changes were proposed in this pull request?

Support using latex in scaladoc by adding MathJax javascript to the js template.

## How was this patch tested?

Generated scaladoc. Preview:

- LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient)
- MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
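For reference, this is the kind of Scaladoc the change enables. The class and formula below are invented for illustration, but the `<p><blockquote> $$ ... $$ </blockquote></p>` wrapping mirrors the pattern applied throughout this diff; inline `$...$` spans are assumed to render as inline math, as the inline usages in the diff suggest the MathJax configuration allows.

```scala
/**
 * Computes the harmonic mean of two positive values, defined as:
 *
 * <p><blockquote>
 * $$
 * H(a, b) = \frac{2ab}{a + b}
 * $$
 * </blockquote></p>
 *
 * For the case $a == b$, $H(a, b) = a$.
 */
class HarmonicMean {
  def compute(a: Double, b: Double): Double = 2.0 * a * b / (a + b)
}
```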
Diffstat (limited to 'mllib')
-rw-r--r--  mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala             |  10
-rw-r--r--  mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala |  94
-rw-r--r--  mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala      | 120
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala           |   2
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala  |   2
-rw-r--r--  mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala         |  94
6 files changed, 205 insertions, 117 deletions
diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
index 068f11a2a5..9f3d2ca6db 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala
@@ -76,11 +76,15 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H
/**
* Rescale each feature individually to a common range [min, max] linearly using column summary
* statistics, which is also known as min-max normalization or Rescaling. The rescaled value for
- * feature E is calculated as,
+ * feature E is calculated as:
*
- * `Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min`
+ * <p><blockquote>
+ * $$
+ * Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
+ * $$
+ * </blockquote></p>
*
- * For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)`.
+ * For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$.
* Note that since zero values will probably be transformed to non-zero values, output of the
* transformer will be DenseVector even for sparse input.
*/
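As a quick numeric check of the rescaling formula above (values invented for illustration): a feature value $e_i = 3$ in a column with $E_{min} = 1$ and $E_{max} = 5$, rescaled into the default range $[min, max] = [0, 1]$, gives

$$
Rescaled(3) = \frac{3 - 1}{5 - 1} * (1 - 0) + 0 = 0.5
$$

so a value halfway between the column minimum and maximum lands at the midpoint of the target range.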
diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
index d4ae59deff..be234f7fea 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala
@@ -412,50 +412,72 @@ object AFTSurvivalRegressionModel extends MLReadable[AFTSurvivalRegressionModel]
* Two AFTAggregator can be merged together to have a summary of loss and gradient of
* the corresponding joint dataset.
*
- * Given the values of the covariates x^{'}, for random lifetime t_{i} of subjects i = 1, ..., n,
+ * Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of subjects i = 1,..,n,
* with possible right-censoring, the likelihood function under the AFT model is given as
- * {{{
- * L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
- * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
- * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
- * }}}
- * Where \delta_{i} is the indicator of the event has occurred i.e. uncensored or not.
- * Using \epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}, the log-likelihood function
+ *
+ * <p><blockquote>
+ * $$
+ * L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
+ * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
+ * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
+ * $$
+ * </blockquote></p>
+ *
+ * Where $\delta_{i}$ is the indicator of the event has occurred i.e. uncensored or not.
+ * Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function
* assumes the form
- * {{{
- * \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+
- * \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
- * }}}
- * Where S_{0}(\epsilon_{i}) is the baseline survivor function,
- * and f_{0}(\epsilon_{i}) is corresponding density function.
+ *
+ * <p><blockquote>
+ * $$
+ * \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+
+ * \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
+ * $$
+ * </blockquote></p>
+ * Where $S_{0}(\epsilon_{i})$ is the baseline survivor function,
+ * and $f_{0}(\epsilon_{i})$ is corresponding density function.
*
* The most commonly used log-linear survival regression method is based on the Weibull
* distribution of the survival time. The Weibull distribution for lifetime corresponding
* to extreme value distribution for log of the lifetime,
- * and the S_{0}(\epsilon) function is
- * {{{
- * S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
- * }}}
- * the f_{0}(\epsilon_{i}) function is
- * {{{
- * f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
- * }}}
+ * and the $S_{0}(\epsilon)$ function is
+ *
+ * <p><blockquote>
+ * $$
+ * S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
+ * $$
+ * </blockquote></p>
+ *
+ * and the $f_{0}(\epsilon_{i})$ function is
+ *
+ * <p><blockquote>
+ * $$
+ * f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
+ * $$
+ * </blockquote></p>
+ *
* The log-likelihood function for Weibull distribution of lifetime is
- * {{{
- * \iota(\beta,\sigma)=
- * -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
- * }}}
+ *
+ * <p><blockquote>
+ * $$
+ * \iota(\beta,\sigma)=
+ * -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
+ * $$
+ * </blockquote></p>
+ *
* Due to minimizing the negative log-likelihood equivalent to maximum a posteriori probability,
- * the loss function we use to optimize is -\iota(\beta,\sigma).
- * The gradient functions for \beta and \log\sigma respectively are
- * {{{
- * \frac{\partial (-\iota)}{\partial \beta}=
- * \sum_{1=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma}
- * }}}
- * {{{
- * \frac{\partial (-\iota)}{\partial (\log\sigma)}=
- * \sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
- * }}}
+ * the loss function we use to optimize is $-\iota(\beta,\sigma)$.
+ * The gradient functions for $\beta$ and $\log\sigma$ respectively are
+ *
+ * <p><blockquote>
+ * $$
+ * \frac{\partial (-\iota)}{\partial \beta}=
+ * \sum_{1=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma} \\
+ *
+ * \frac{\partial (-\iota)}{\partial (\log\sigma)}=
+ * \sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
+ * $$
+ * </blockquote></p>
+ *
* @param parameters including three part: The log of scale parameter, the intercept and
* regression coefficients corresponding to the features.
* @param fitIntercept Whether to fit an intercept term.
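A minimal, standalone sketch of the per-instance loss and gradients written out above. It is not Spark's `AFTAggregator`; the object name, parameter shapes, and return type are chosen for illustration, and the body is a direct transcription of the formulas.

```scala
object AFTSketch {
  /**
   * One observation's contribution to $-\iota(\beta,\sigma)$ and its gradients.
   * t is the lifetime; delta is 1.0 if the event occurred (uncensored), 0.0 if censored.
   * Returns (loss, gradient w.r.t. beta, gradient w.r.t. log(sigma)).
   */
  def lossAndGradient(
      x: Array[Double],
      t: Double,
      delta: Double,
      beta: Array[Double],
      logSigma: Double): (Double, Array[Double], Double) = {
    val sigma = math.exp(logSigma)
    // \epsilon_i = (\log{t_i} - x'\beta) / \sigma
    val epsilon = (math.log(t) - x.zip(beta).map { case (xi, bi) => xi * bi }.sum) / sigma
    // Per-instance term of -\iota: \delta_i \log\sigma - \delta_i \epsilon_i + e^{\epsilon_i}
    val loss = delta * logSigma - delta * epsilon + math.exp(epsilon)
    // \partial(-\iota)/\partial\beta contribution: (\delta_i - e^{\epsilon_i}) x_i / \sigma
    val gradBeta = x.map(_ * (delta - math.exp(epsilon)) / sigma)
    // \partial(-\iota)/\partial(\log\sigma) contribution: \delta_i + (\delta_i - e^{\epsilon_i}) \epsilon_i
    val gradLogSigma = delta + (delta - math.exp(epsilon)) * epsilon
    (loss, gradBeta, gradLogSigma)
  }
}
```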
diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
index f3dc65e0df..6d5e398dfe 100644
--- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
+++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala
@@ -58,7 +58,12 @@ private[regression] trait LinearRegressionParams extends PredictorParams
*
* The learning objective is to minimize the squared error, with regularization.
* The specific squared error loss function used is:
- * L = 1/2n ||A coefficients - y||^2^
+ *
+ * <p><blockquote>
+ * $$
+ * L = 1/2n ||A coefficients - y||^2^
+ * $$
+ * </blockquote></p>
*
* This supports multiple types of regularization:
* - none (a.k.a. ordinary least squares)
@@ -759,66 +764,103 @@ class LinearRegressionSummary private[regression] (
*
* When training with intercept enabled,
* The objective function in the scaled space is given by
- * {{{
- * L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
- * }}}
- * where \bar{x_i} is the mean of x_i, \hat{x_i} is the standard deviation of x_i,
- * \bar{y} is the mean of label, and \hat{y} is the standard deviation of label.
+ *
+ * <p><blockquote>
+ * $$
+ * L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
+ * $$
+ * </blockquote></p>
+ *
+ * where $\bar{x_i}$ is the mean of $x_i$, $\hat{x_i}$ is the standard deviation of $x_i$,
+ * $\bar{y}$ is the mean of label, and $\hat{y}$ is the standard deviation of label.
*
* If we fitting the intercept disabled (that is forced through 0.0),
- * we can use the same equation except we set \bar{y} and \bar{x_i} to 0 instead
+ * we can use the same equation except we set $\bar{y}$ and $\bar{x_i}$ to 0 instead
* of the respective means.
*
* This can be rewritten as
- * {{{
- * L = 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
- * + \bar{y} / \hat{y}||^2
- * = 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2
- * }}}
- * where w_i^\prime^ is the effective coefficients defined by w_i/\hat{x_i}, offset is
- * {{{
- * - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}.
- * }}}, and diff is
- * {{{
- * \sum_i w_i^\prime x_i - y / \hat{y} + offset
- * }}}
*
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * L &= 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
+ * + \bar{y} / \hat{y}||^2 \\
+ * &= 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2
+ * \end{align}
+ * $$
+ * </blockquote></p>
+ *
+ * where $w_i^\prime$ is the effective coefficients defined by $w_i/\hat{x_i}$, offset is
+ *
+ * <p><blockquote>
+ * $$
+ * - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}.
+ * $$
+ * </blockquote></p>
+ *
+ * and diff is
+ *
+ * <p><blockquote>
+ * $$
+ * \sum_i w_i^\prime x_i - y / \hat{y} + offset
+ * $$
+ * </blockquote></p>
*
* Note that the effective coefficients and offset don't depend on training dataset,
* so they can be precomputed.
*
* Now, the first derivative of the objective function in scaled space is
- * {{{
- * \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
- * }}}
- * However, ($x_i - \bar{x_i}$) will densify the computation, so it's not
+ *
+ * <p><blockquote>
+ * $$
+ * \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
+ * $$
+ * </blockquote></p>
+ *
+ * However, $(x_i - \bar{x_i})$ will densify the computation, so it's not
* an ideal formula when the training dataset is sparse format.
*
- * This can be addressed by adding the dense \bar{x_i} / \hat{x_i} terms
+ * This can be addressed by adding the dense $\bar{x_i} / \hat{x_i}$ terms
* in the end by keeping the sum of diff. The first derivative of total
* objective function from all the samples is
- * {{{
- * \frac{\partial L}{\partial w_i} =
- * 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i}
- * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i})
- * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
- * }}},
- * where correction_i = - diffSum \bar{x_i} / \hat{x_i}
+ *
+ *
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * \frac{\partial L}{\partial w_i} &=
+ * 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} \\
+ * &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i}) \\
+ * &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
+ * \end{align}
+ * $$
+ * </blockquote></p>
+ *
+ * where $correction_i = - diffSum \bar{x_i} / \hat{x_i}$
*
* A simple math can show that diffSum is actually zero, so we don't even
* need to add the correction terms in the end. From the definition of diff,
- * {{{
- * diffSum = \sum_j (\sum_i w_i(x_{ij} - \bar{x_i}) / \hat{x_i} - (y_j - \bar{y}) / \hat{y})
- * = N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y})
- * = 0
- * }}}
+ *
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * diffSum &= \sum_j (\sum_i w_i(x_{ij} - \bar{x_i})
+ * / \hat{x_i} - (y_j - \bar{y}) / \hat{y}) \\
+ * &= N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y}) \\
+ * &= 0
+ * \end{align}
+ * $$
+ * </blockquote></p>
*
* As a result, the first derivative of the total objective function only depends on
* the training dataset, which can be easily computed in distributed fashion, and is
* sparse format friendly.
- * {{{
- * \frac{\partial L}{\partial w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i})
- * }}},
+ *
+ * <p><blockquote>
+ * $$
+ * \frac{\partial L}{\partial w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i})
+ * $$
+ * </blockquote></p>
*
* @param coefficients The coefficients corresponding to the features.
* @param labelStd The standard deviation value of the label.
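A minimal, self-contained sketch of the precomputed effective coefficients, offset, and the sparse-friendly gradient derived above. It is not Spark's `LeastSquaresAggregator`; it simply spells the formulas out over a tiny dense dataset, with illustrative names.

```scala
object ScaledLeastSquaresSketch {
  /** Gradient of L in the scaled space, accumulated over (label, features) pairs. */
  def gradient(
      data: Array[(Double, Array[Double])],
      w: Array[Double],
      featureMean: Array[Double],   // \bar{x_i}
      featureStd: Array[Double],    // \hat{x_i}
      labelMean: Double,            // \bar{y}
      labelStd: Double              // \hat{y}
  ): Array[Double] = {
    val n = data.length
    // Effective coefficients w_i' = w_i / \hat{x_i}, precomputed once.
    val wEff = w.indices.map(i => w(i) / featureStd(i)).toArray
    // offset = -\sum_i w_i' \bar{x_i} + \bar{y} / \hat{y}, also precomputed.
    val offset = labelMean / labelStd - wEff.indices.map(i => wEff(i) * featureMean(i)).sum
    val grad = Array.fill(w.length)(0.0)
    for ((y, x) <- data) {
      // diff_j = \sum_i w_i' x_{ij} - y_j / \hat{y} + offset
      val diff = wEff.indices.map(i => wEff(i) * x(i)).sum - y / labelStd + offset
      // Accumulate 1/N \sum_j diff_j x_{ij} / \hat{x_i}; no \bar{x_i} correction term
      // is needed because diffSum is zero, as shown above.
      for (i <- grad.indices) grad(i) += diff * x(i) / (featureStd(i) * n)
    }
    grad
  }
}
```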
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala
index 647d37bd82..1f6e1a077f 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala
@@ -25,7 +25,7 @@ import breeze.numerics._
private[clustering] object LDAUtils {
/**
* Log Sum Exp with overflow protection using the identity:
- * For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\}
+ * For any a: $\log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\}$
*/
private[clustering] def logSumExp(x: BDV[Double]): Double = {
val a = max(x)
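A minimal standalone sketch of the same identity on a plain `Array[Double]` (Spark's version operates on a Breeze `BDV[Double]`; this hypothetical helper only illustrates the overflow guard):

```scala
object LogSumExpSketch {
  /** log(\sum_n exp(x_n)) computed as a + log(\sum_n exp(x_n - a)) with a = max(x). */
  def logSumExp(x: Array[Double]): Double = {
    val a = x.max
    // Each x_n - a <= 0, so exp never overflows; a carries the scale back out.
    a + math.log(x.map(v => math.exp(v - a)).sum)
  }
}

// LogSumExpSketch.logSumExp(Array(1000.0, 1000.0)) == 1000.0 + math.log(2.0),
// whereas the naive math.exp(1000.0) would overflow to infinity.
```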
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala
index ef45c9fd9e..ce44215151 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala
@@ -73,7 +73,7 @@ class RegressionMetrics @Since("2.0.0") (
/**
* Returns the variance explained by regression.
- * explainedVariance = \sum_i (\hat{y_i} - \bar{y})^2 / n
+ * explainedVariance = $\sum_i (\hat{y_i} - \bar{y})^2 / n$
* @see [[https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained]]
*/
@Since("1.2.0")
diff --git a/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala b/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala
index 450ed8f22b..81e64de4e5 100644
--- a/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala
+++ b/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala
@@ -67,43 +67,53 @@ abstract class Gradient extends Serializable {
* http://statweb.stanford.edu/~tibs/ElemStatLearn/ , Eq. (4.17) on page 119 gives the formula of
* multinomial logistic regression model. A simple calculation shows that
*
- * {{{
- * P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))
- * P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))
- * ...
- * P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))
- * }}}
+ * <p><blockquote>
+ * $$
+ * P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\
+ * P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\
+ * ...\\
+ * P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))\\
+ * $$
+ * </blockquote></p>
*
* for K classes multiclass classification problem.
*
- * The model weights w = (w_1, w_2, ..., w_{K-1})^T becomes a matrix which has dimension of
+ * The model weights $w = (w_1, w_2, ..., w_{K-1})^T$ becomes a matrix which has dimension of
* (K-1) * (N+1) if the intercepts are added. If the intercepts are not added, the dimension
* will be (K-1) * N.
*
* As a result, the loss of objective function for a single instance of data can be written as
- * {{{
- * l(w, x) = -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w)
- * = log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1}
- * = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
- * }}}
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * l(w, x) &= -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w) \\
+ * &= log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\
+ * &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
+ * \end{align}
+ * $$
+ * </blockquote></p>
*
- * where \alpha(i) = 1 if i != 0, and
- * \alpha(i) = 0 if i == 0,
- * margins_i = x w_i.
+ * where $\alpha(i) = 1$ if $i \ne 0$, and
+ * $\alpha(i) = 0$ if $i == 0$,
+ * $margins_i = x w_i$.
*
* For optimization, we have to calculate the first derivative of the loss function, and
* a simple calculation shows that
*
- * {{{
- * \frac{\partial l(w, x)}{\partial w_{ij}}
- * = (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j
- * = multiplier_i * x_j
- * }}}
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * \frac{\partial l(w, x)}{\partial w_{ij}} &=
+ * (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j \\
+ * &= multiplier_i * x_j
+ * \end{align}
+ * $$
+ * </blockquote></p>
*
- * where \delta_{i, j} = 1 if i == j,
- * \delta_{i, j} = 0 if i != j, and
+ * where $\delta_{i, j} = 1$ if $i == j$,
+ * $\delta_{i, j} = 0$ if $i != j$, and
* multiplier =
- * \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})
+ * $\exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})$
*
* If any of margins is larger than 709.78, the numerical computation of multiplier and loss
* function will be suffered from arithmetic overflow. This issue occurs when there are outliers
@@ -113,26 +123,36 @@ abstract class Gradient extends Serializable {
* Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can be
* easily rewritten into the following equivalent numerically stable formula.
*
- * {{{
- * l(w, x) = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
- * = log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin
- * - (1-\alpha(y)) margins_{y-1}
- * = log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1}
- * }}}
- *
- * where sum = \exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1.
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * l(w, x) &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\
+ * &= log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin
+ * - (1-\alpha(y)) margins_{y-1} \\
+ * &= log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1}
+ * \end{align}
+ * $$
+ * </blockquote></p>
+
+ * where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.
*
- * Note that each term, (margins_i - maxMargin) in \exp is smaller than zero; as a result,
+ * Note that each term, $(margins_i - maxMargin)$ in $\exp$ is smaller than zero; as a result,
* overflow will not happen with this formula.
*
* For multiplier, similar trick can be applied as the following,
*
- * {{{
- * multiplier = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})
- * = \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1})
- * }}}
+ * <p><blockquote>
+ * $$
+ * \begin{align}
+ * multiplier
+ * &= \exp(margins_i) /
+ * (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1}) \\
+ * &= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1})
+ * \end{align}
+ * $$
+ * </blockquote></p>
*
- * where each term in \exp is also smaller than zero, so overflow is not a concern.
+ * where each term in $\exp$ is also smaller than zero, so overflow is not a concern.
*
* For the detailed mathematical derivation, see the reference at
* http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297
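A minimal, self-contained sketch of the numerically stable loss above for a single instance. This is not `LogisticGradient.compute`; the object name and the toy margins are made up, and the body only writes out the pivot-on-maxMargin trick.

```scala
object StableMultinomialLossSketch {
  /**
   * margins(i) = x \cdot w_{i+1} for classes 1..K-1 (class 0 is the pivot);
   * label is in {0, ..., K-1}. Returns -log P(label | x, w).
   */
  def loss(margins: Array[Double], label: Int): Double = {
    // The margin of the true class is subtracted only when the label is not the pivot class 0.
    val marginOfLabel = if (label > 0) margins(label - 1) else 0.0
    val maxMargin = math.max(0.0, margins.max)
    if (maxMargin > 0) {
      // sum = \exp(-maxMargin) + \sum_i \exp(margins_i - maxMargin) - 1; every exponent is <= 0.
      val sum = math.exp(-maxMargin) + margins.map(m => math.exp(m - maxMargin)).sum - 1.0
      // log(1 + sum) + maxMargin recovers log(1 + \sum_i \exp(margins_i)) without overflow.
      math.log1p(sum) + maxMargin - marginOfLabel
    } else {
      math.log1p(margins.map(math.exp).sum) - marginOfLabel
    }
  }
}

// Example: K = 3 classes, a margin far above 709.78 on the true class stays finite:
// StableMultinomialLossSketch.loss(Array(1.5, 800.0), 2)  // ~0.0, no overflow
```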