From 36827ddafeaa7a683362eb8da31065aaff9676d5 Mon Sep 17 00:00:00 2001 From: Shuai Lin Date: Tue, 2 Aug 2016 09:14:08 -0700 Subject: [SPARK-16822][DOC] Support latex in scaladoc. ## What changes were proposed in this pull request? Support using latex in scaladoc by adding MathJax javascript to the js template. ## How was this patch tested? Generated scaladoc. Preview: - LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) - MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) Author: Shuai Lin Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc. --- .../org/apache/spark/ml/feature/MinMaxScaler.scala | 10 +- .../ml/regression/AFTSurvivalRegression.scala | 94 +++++++++------- .../spark/ml/regression/LinearRegression.scala | 120 ++++++++++++++------- .../apache/spark/mllib/clustering/LDAUtils.scala | 2 +- .../spark/mllib/evaluation/RegressionMetrics.scala | 2 +- .../apache/spark/mllib/optimization/Gradient.scala | 94 +++++++++------- 6 files changed, 205 insertions(+), 117 deletions(-) (limited to 'mllib') diff --git a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala index 068f11a2a5..9f3d2ca6db 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala @@ -76,11 +76,15 @@ private[feature] trait MinMaxScalerParams extends Params with HasInputCol with H /** * Rescale each feature individually to a common range [min, max] linearly using column summary * statistics, which is also known as min-max normalization or Rescaling. The rescaled value for - * feature E is calculated as, + * feature E is calculated as: * - * `Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min` + *
+ * <blockquote>
+ *    $$
+ *    Rescaled(e_i) = \frac{e_i - E_{min}}{E_{max} - E_{min}} * (max - min) + min
+ *    $$
+ * </blockquote>
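As a worked illustration of the rescaling formula above, here is a minimal Scala sketch (illustrative only; the `rescale` helper is hypothetical and not part of MinMaxScaler). It also applies the $E_{max} == E_{min}$ convention noted just below.

{{{
// Hypothetical standalone min-max rescaling of a single feature value.
def rescale(e: Double, eMin: Double, eMax: Double, min: Double, max: Double): Double =
  if (eMax == eMin) 0.5 * (max + min)                    // constant-column convention
  else (e - eMin) / (eMax - eMin) * (max - min) + min    // Rescaled(e_i)

rescale(5.0, 0.0, 10.0, 0.0, 1.0)   // maps 5.0 from the observed range [0, 10] onto [0, 1] => 0.5
}}}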
* - * For the case `E_{max} == E_{min}`, `Rescaled(e_i) = 0.5 * (max + min)`. + * For the case $E_{max} == E_{min}$, $Rescaled(e_i) = 0.5 * (max + min)$. * Note that since zero values will probably be transformed to non-zero values, output of the * transformer will be DenseVector even for sparse input. */ diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala index d4ae59deff..be234f7fea 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala @@ -412,50 +412,72 @@ object AFTSurvivalRegressionModel extends MLReadable[AFTSurvivalRegressionModel] * Two AFTAggregator can be merged together to have a summary of loss and gradient of * the corresponding joint dataset. * - * Given the values of the covariates x^{'}, for random lifetime t_{i} of subjects i = 1, ..., n, + * Given the values of the covariates $x^{'}$, for random lifetime $t_{i}$ of subjects i = 1,..,n, * with possible right-censoring, the likelihood function under the AFT model is given as - * {{{ - * L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0} - * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0} - * (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}} - * }}} - * Where \delta_{i} is the indicator of the event has occurred i.e. uncensored or not. - * Using \epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}, the log-likelihood function + * + *
+ * <blockquote>
+ *    $$
+ *    L(\beta,\sigma)=\prod_{i=1}^n[\frac{1}{\sigma}f_{0}
+ *    (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})]^{\delta_{i}}S_{0}
+ *    (\frac{\log{t_{i}}-x^{'}\beta}{\sigma})^{1-\delta_{i}}
+ *    $$
+ * </blockquote>
+ * + * Where $\delta_{i}$ is the indicator of the event has occurred i.e. uncensored or not. + * Using $\epsilon_{i}=\frac{\log{t_{i}}-x^{'}\beta}{\sigma}$, the log-likelihood function * assumes the form - * {{{ - * \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+ - * \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}] - * }}} - * Where S_{0}(\epsilon_{i}) is the baseline survivor function, - * and f_{0}(\epsilon_{i}) is corresponding density function. + * + *
+ * <blockquote>
+ *    $$
+ *    \iota(\beta,\sigma)=\sum_{i=1}^{n}[-\delta_{i}\log\sigma+
+ *    \delta_{i}\log{f_{0}}(\epsilon_{i})+(1-\delta_{i})\log{S_{0}(\epsilon_{i})}]
+ *    $$
+ * </blockquote>
+ * Where $S_{0}(\epsilon_{i})$ is the baseline survivor function, + * and $f_{0}(\epsilon_{i})$ is corresponding density function. * * The most commonly used log-linear survival regression method is based on the Weibull * distribution of the survival time. The Weibull distribution for lifetime corresponding * to extreme value distribution for log of the lifetime, - * and the S_{0}(\epsilon) function is - * {{{ - * S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}}) - * }}} - * the f_{0}(\epsilon_{i}) function is - * {{{ - * f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}}) - * }}} + * and the $S_{0}(\epsilon)$ function is + * + *
+ * <blockquote>
+ *    $$
+ *    S_{0}(\epsilon_{i})=\exp(-e^{\epsilon_{i}})
+ *    $$
+ * </blockquote>
+ * + * and the $f_{0}(\epsilon_{i})$ function is + * + *
+ * <blockquote>
+ *    $$
+ *    f_{0}(\epsilon_{i})=e^{\epsilon_{i}}\exp(-e^{\epsilon_{i}})
+ *    $$
+ * </blockquote>
+ * * The log-likelihood function for Weibull distribution of lifetime is - * {{{ - * \iota(\beta,\sigma)= - * -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}] - * }}} + * + *
+ * <blockquote>
+ *    $$
+ *    \iota(\beta,\sigma)=
+ *    -\sum_{i=1}^n[\delta_{i}\log\sigma-\delta_{i}\epsilon_{i}+e^{\epsilon_{i}}]
+ *    $$
+ * </blockquote>
+ * * Due to minimizing the negative log-likelihood equivalent to maximum a posteriori probability, - * the loss function we use to optimize is -\iota(\beta,\sigma). - * The gradient functions for \beta and \log\sigma respectively are - * {{{ - * \frac{\partial (-\iota)}{\partial \beta}= - * \sum_{1=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma} - * }}} - * {{{ - * \frac{\partial (-\iota)}{\partial (\log\sigma)}= - * \sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}] - * }}} + * the loss function we use to optimize is $-\iota(\beta,\sigma)$. + * The gradient functions for $\beta$ and $\log\sigma$ respectively are + * + *
+ * <blockquote>
+ *    $$
+ *    \frac{\partial (-\iota)}{\partial \beta}=
+ *    \sum_{i=1}^{n}[\delta_{i}-e^{\epsilon_{i}}]\frac{x_{i}}{\sigma} \\
+ *
+ *    \frac{\partial (-\iota)}{\partial (\log\sigma)}=
+ *    \sum_{i=1}^{n}[\delta_{i}+(\delta_{i}-e^{\epsilon_{i}})\epsilon_{i}]
+ *    $$
+ * </blockquote>
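To make the AFT formulas above concrete, here is a self-contained Scala sketch of the per-observation loss $-\iota(\beta,\sigma)$ and its gradients under the Weibull (extreme-value) case. The function and argument names are hypothetical and do not mirror the private fields of AFTAggregator.

{{{
// Hypothetical per-observation Weibull-AFT loss and gradients, following the formulas above.
// x: covariates, t: lifetime, delta: 1.0 if the event occurred (uncensored), 0.0 if censored.
def aftLossAndGradients(
    beta: Array[Double], logSigma: Double,
    x: Array[Double], t: Double, delta: Double): (Double, Array[Double], Double) = {
  val sigma = math.exp(logSigma)
  val xBeta = x.zip(beta).map { case (xi, bi) => xi * bi }.sum
  val epsilon = (math.log(t) - xBeta) / sigma
  val loss = delta * logSigma - delta * epsilon + math.exp(epsilon)       // one term of -iota
  val gradBeta = x.map(xi => (delta - math.exp(epsilon)) * xi / sigma)    // d(-iota)/d(beta)
  val gradLogSigma = delta + (delta - math.exp(epsilon)) * epsilon        // d(-iota)/d(log sigma)
  (loss, gradBeta, gradLogSigma)
}
}}}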
+ * * @param parameters including three part: The log of scale parameter, the intercept and * regression coefficients corresponding to the features. * @param fitIntercept Whether to fit an intercept term. diff --git a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala index f3dc65e0df..6d5e398dfe 100644 --- a/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala +++ b/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala @@ -58,7 +58,12 @@ private[regression] trait LinearRegressionParams extends PredictorParams * * The learning objective is to minimize the squared error, with regularization. * The specific squared error loss function used is: - * L = 1/2n ||A coefficients - y||^2^ + * + *
+ * <blockquote>
+ *    $$
+ *    L = 1/2n ||A coefficients - y||^2
+ *    $$
+ * </blockquote>
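For illustration, a tiny Scala sketch of this loss on dense data (a hypothetical helper, not the optimized code used by the solvers):

{{{
// Hypothetical dense evaluation of L = 1/(2n) ||A w - y||^2, without the regularization term.
def squaredErrorLoss(a: Array[Array[Double]], w: Array[Double], y: Array[Double]): Double = {
  val residuals = a.zip(y).map { case (row, yi) =>
    row.zip(w).map { case (aij, wj) => aij * wj }.sum - yi
  }
  residuals.map(r => r * r).sum / (2.0 * a.length)
}
}}}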
* * This supports multiple types of regularization: * - none (a.k.a. ordinary least squares) @@ -759,66 +764,103 @@ class LinearRegressionSummary private[regression] ( * * When training with intercept enabled, * The objective function in the scaled space is given by - * {{{ - * L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2, - * }}} - * where \bar{x_i} is the mean of x_i, \hat{x_i} is the standard deviation of x_i, - * \bar{y} is the mean of label, and \hat{y} is the standard deviation of label. + * + *
+ * <blockquote>
+ *    $$
+ *    L = 1/2n ||\sum_i w_i(x_i - \bar{x_i}) / \hat{x_i} - (y - \bar{y}) / \hat{y}||^2,
+ *    $$
+ * </blockquote>
+ * + * where $\bar{x_i}$ is the mean of $x_i$, $\hat{x_i}$ is the standard deviation of $x_i$, + * $\bar{y}$ is the mean of label, and $\hat{y}$ is the standard deviation of label. * * If we fitting the intercept disabled (that is forced through 0.0), - * we can use the same equation except we set \bar{y} and \bar{x_i} to 0 instead + * we can use the same equation except we set $\bar{y}$ and $\bar{x_i}$ to 0 instead * of the respective means. * * This can be rewritten as - * {{{ - * L = 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y} - * + \bar{y} / \hat{y}||^2 - * = 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2 - * }}} - * where w_i^\prime^ is the effective coefficients defined by w_i/\hat{x_i}, offset is - * {{{ - * - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}. - * }}}, and diff is - * {{{ - * \sum_i w_i^\prime x_i - y / \hat{y} + offset - * }}} * + *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    L &= 1/2n ||\sum_i (w_i/\hat{x_i})x_i - \sum_i (w_i/\hat{x_i})\bar{x_i} - y / \hat{y}
+ *    + \bar{y} / \hat{y}||^2 \\
+ *    &= 1/2n ||\sum_i w_i^\prime x_i - y / \hat{y} + offset||^2 = 1/2n diff^2
+ *    \end{align}
+ *    $$
+ * </blockquote>
+ * + * where $w_i^\prime$ is the effective coefficients defined by $w_i/\hat{x_i}$, offset is + * + *
+ * <blockquote>
+ *    $$
+ *    - \sum_i (w_i/\hat{x_i})\bar{x_i} + \bar{y} / \hat{y}.
+ *    $$
+ * </blockquote>
+ * + * and diff is + * + *
+ * <blockquote>
+ *    $$
+ *    \sum_i w_i^\prime x_i - y / \hat{y} + offset
+ *    $$
+ * </blockquote>
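As a sketch of how the pieces above can be precomputed, here are hypothetical Scala helpers (names such as `effectiveCoefficients` do not appear in this file):

{{{
// Hypothetical precomputation of w'_i = w_i / xStd_i, the offset, and diff for one sample.
def effectiveCoefficients(w: Array[Double], xStd: Array[Double]): Array[Double] =
  w.zip(xStd).map { case (wi, si) => wi / si }

def offset(w: Array[Double], xMean: Array[Double], xStd: Array[Double],
    yMean: Double, yStd: Double): Double =
  -w.indices.map(i => w(i) / xStd(i) * xMean(i)).sum + yMean / yStd

def diff(wEff: Array[Double], x: Array[Double], y: Double, yStd: Double, off: Double): Double =
  wEff.zip(x).map { case (wi, xi) => wi * xi }.sum - y / yStd + off
}}}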
* * Note that the effective coefficients and offset don't depend on training dataset, * so they can be precomputed. * * Now, the first derivative of the objective function in scaled space is - * {{{ - * \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i} - * }}} - * However, ($x_i - \bar{x_i}$) will densify the computation, so it's not + * + *
+ * <blockquote>
+ *    $$
+ *    \frac{\partial L}{\partial w_i} = diff/N (x_i - \bar{x_i}) / \hat{x_i}
+ *    $$
+ * </blockquote>
+ * + * However, $(x_i - \bar{x_i})$ will densify the computation, so it's not * an ideal formula when the training dataset is sparse format. * - * This can be addressed by adding the dense \bar{x_i} / \hat{x_i} terms + * This can be addressed by adding the dense $\bar{x_i} / \hat{x_i}$ terms * in the end by keeping the sum of diff. The first derivative of total * objective function from all the samples is - * {{{ - * \frac{\partial L}{\partial w_i} = - * 1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} - * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i}) - * = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i) - * }}}, - * where correction_i = - diffSum \bar{x_i} / \hat{x_i} + * + * + *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    \frac{\partial L}{\partial w_i} &=
+ *    1/N \sum_j diff_j (x_{ij} - \bar{x_i}) / \hat{x_i} \\
+ *    &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - diffSum \bar{x_i} / \hat{x_i}) \\
+ *    &= 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) + correction_i)
+ *    \end{align}
+ *    $$
+ * </blockquote>
+ * + * where $correction_i = - diffSum \bar{x_i} / \hat{x_i}$ * * A simple math can show that diffSum is actually zero, so we don't even * need to add the correction terms in the end. From the definition of diff, - * {{{ - * diffSum = \sum_j (\sum_i w_i(x_{ij} - \bar{x_i}) / \hat{x_i} - (y_j - \bar{y}) / \hat{y}) - * = N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y}) - * = 0 - * }}} + * + *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    diffSum &= \sum_j (\sum_i w_i(x_{ij} - \bar{x_i})
+ *    / \hat{x_i} - (y_j - \bar{y}) / \hat{y}) \\
+ *    &= N * (\sum_i w_i(\bar{x_i} - \bar{x_i}) / \hat{x_i} - (\bar{y} - \bar{y}) / \hat{y}) \\
+ *    &= 0
+ *    \end{align}
+ *    $$
+ * </blockquote>
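A quick numeric check of the diffSum identity above, as a toy Scala sketch (values are made up; this assumes the intercept is fitted so that means are subtracted):

{{{
// Hypothetical toy check that diffSum vanishes (up to floating point) for any coefficients.
val xs = Array(Array(1.0, 2.0), Array(3.0, 5.0), Array(4.0, 8.0))
val ys = Array(1.0, 2.0, 4.0)
val w  = Array(0.3, -0.1)
val xMean = Array(0, 1).map(j => xs.map(_(j)).sum / xs.length)
val xStd  = Array(0, 1).map(j => math.sqrt(xs.map(r => math.pow(r(j) - xMean(j), 2)).sum / xs.length))
val yMean = ys.sum / ys.length
val yStd  = math.sqrt(ys.map(y => (y - yMean) * (y - yMean)).sum / ys.length)
val diffSum = xs.zip(ys).map { case (x, y) =>
  w.indices.map(i => w(i) * (x(i) - xMean(i)) / xStd(i)).sum - (y - yMean) / yStd
}.sum
// diffSum == 0.0 up to rounding error
}}}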
* * As a result, the first derivative of the total objective function only depends on * the training dataset, which can be easily computed in distributed fashion, and is * sparse format friendly. - * {{{ - * \frac{\partial L}{\partial w_i} = 1/N ((\sum_j diff_j x_{ij} / \hat{x_i}) - * }}}, + * + *
+ * <blockquote>
+ *    $$
+ *    \frac{\partial L}{\partial w_i} = 1/N (\sum_j diff_j x_{ij} / \hat{x_i})
+ *    $$
+ * </blockquote>
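As a sketch of the sparse-friendly accumulation this formula suggests (the helper below is hypothetical and only illustrates the per-sample update):

{{{
// Hypothetical per-sample update of the gradient sum 1/N * sum_j diff_j * x_{ij} / xStd_i,
// touching only the non-zero entries of a sparse row given as (indices, values).
def addGradient(
    gradSum: Array[Double],
    indices: Array[Int], values: Array[Double],
    diffJ: Double, xStd: Array[Double]): Unit = {
  var k = 0
  while (k < indices.length) {
    val i = indices(k)
    gradSum(i) += diffJ * values(k) / xStd(i)
    k += 1
  }
}
// After all samples have been added, divide each entry of gradSum by N.
}}}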
* * @param coefficients The coefficients corresponding to the features. * @param labelStd The standard deviation value of the label. diff --git a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala index 647d37bd82..1f6e1a077f 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAUtils.scala @@ -25,7 +25,7 @@ import breeze.numerics._ private[clustering] object LDAUtils { /** * Log Sum Exp with overflow protection using the identity: - * For any a: \log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\} + * For any a: $\log \sum_{n=1}^N \exp\{x_n\} = a + \log \sum_{n=1}^N \exp\{x_n - a\}$ */ private[clustering] def logSumExp(x: BDV[Double]): Double = { val a = max(x) diff --git a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala index ef45c9fd9e..ce44215151 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/evaluation/RegressionMetrics.scala @@ -73,7 +73,7 @@ class RegressionMetrics @Since("2.0.0") ( /** * Returns the variance explained by regression. - * explainedVariance = \sum_i (\hat{y_i} - \bar{y})^2 / n + * explainedVariance = $\sum_i (\hat{y_i} - \bar{y})^2 / n$ * @see [[https://en.wikipedia.org/wiki/Fraction_of_variance_unexplained]] */ @Since("1.2.0") diff --git a/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala b/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala index 450ed8f22b..81e64de4e5 100644 --- a/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala +++ b/mllib/src/main/scala/org/apache/spark/mllib/optimization/Gradient.scala @@ -67,43 +67,53 @@ abstract class Gradient extends Serializable { * http://statweb.stanford.edu/~tibs/ElemStatLearn/ , Eq. (4.17) on page 119 gives the formula of * multinomial logistic regression model. A simple calculation shows that * - * {{{ - * P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i)) - * P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i)) - * ... - * P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i)) - * }}} + *
+ * <blockquote>
+ *    $$
+ *    P(y=0|x, w) = 1 / (1 + \sum_i^{K-1} \exp(x w_i))\\
+ *    P(y=1|x, w) = exp(x w_1) / (1 + \sum_i^{K-1} \exp(x w_i))\\
+ *    ...\\
+ *    P(y=K-1|x, w) = exp(x w_{K-1}) / (1 + \sum_i^{K-1} \exp(x w_i))\\
+ *    $$
+ * </blockquote>
* * for K classes multiclass classification problem. * - * The model weights w = (w_1, w_2, ..., w_{K-1})^T becomes a matrix which has dimension of + * The model weights $w = (w_1, w_2, ..., w_{K-1})^T$ becomes a matrix which has dimension of * (K-1) * (N+1) if the intercepts are added. If the intercepts are not added, the dimension * will be (K-1) * N. * * As a result, the loss of objective function for a single instance of data can be written as - * {{{ - * l(w, x) = -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w) - * = log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} - * = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} - * }}} + *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    l(w, x) &= -log P(y|x, w) = -\alpha(y) log P(y=0|x, w) - (1-\alpha(y)) log P(y|x, w) \\
+ *    &= log(1 + \sum_i^{K-1}\exp(x w_i)) - (1-\alpha(y)) x w_{y-1} \\
+ *    &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1}
+ *    \end{align}
+ *    $$
+ * </blockquote>
* - * where \alpha(i) = 1 if i != 0, and - * \alpha(i) = 0 if i == 0, - * margins_i = x w_i. + * where $\alpha(i) = 1$ if $i \ne 0$, and + * $\alpha(i) = 0$ if $i == 0$, + * $margins_i = x w_i$. * * For optimization, we have to calculate the first derivative of the loss function, and * a simple calculation shows that * - * {{{ - * \frac{\partial l(w, x)}{\partial w_{ij}} - * = (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j - * = multiplier_i * x_j - * }}} + *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    \frac{\partial l(w, x)}{\partial w_{ij}} &=
+ *    (\exp(x w_i) / (1 + \sum_k^{K-1} \exp(x w_k)) - (1-\alpha(y)\delta_{y, i+1})) * x_j \\
+ *    &= multiplier_i * x_j
+ *    \end{align}
+ *    $$
+ * </blockquote>
* - * where \delta_{i, j} = 1 if i == j, - * \delta_{i, j} = 0 if i != j, and + * where $\delta_{i, j} = 1$ if $i == j$, + * $\delta_{i, j} = 0$ if $i != j$, and * multiplier = - * \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1}) + * $\exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})$ * * If any of margins is larger than 709.78, the numerical computation of multiplier and loss * function will be suffered from arithmetic overflow. This issue occurs when there are outliers @@ -113,26 +123,36 @@ abstract class Gradient extends Serializable { * Fortunately, when max(margins) = maxMargin > 0, the loss function and the multiplier can be * easily rewritten into the following equivalent numerically stable formula. * - * {{{ - * l(w, x) = log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} - * = log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin - * - (1-\alpha(y)) margins_{y-1} - * = log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1} - * }}} - * - * where sum = \exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1. + *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    l(w, x) &= log(1 + \sum_i^{K-1}\exp(margins_i)) - (1-\alpha(y)) margins_{y-1} \\
+ *    &= log(\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin)) + maxMargin
+ *    - (1-\alpha(y)) margins_{y-1} \\
+ *    &= log(1 + sum) + maxMargin - (1-\alpha(y)) margins_{y-1}
+ *    \end{align}
+ *    $$
+ * </blockquote>
+ *
+ * where sum = $\exp(-maxMargin) + \sum_i^{K-1}\exp(margins_i - maxMargin) - 1$.
 *
- * Note that each term, (margins_i - maxMargin) in \exp is smaller than zero; as a result,
+ * Note that each term, $(margins_i - maxMargin)$ in $\exp$ is smaller than zero; as a result,
 * overflow will not happen with this formula.
 *
 * For multiplier, similar trick can be applied as the following,
 *
- * {{{
- * multiplier = \exp(margins_i) / (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1})
- * = \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1})
- * }}}
+ *
+ * <blockquote>
+ *    $$
+ *    \begin{align}
+ *    multiplier
+ *    &= \exp(margins_i) /
+ *    (1 + \sum_k^{K-1} \exp(margins_i)) - (1-\alpha(y)\delta_{y, i+1}) \\
+ *    &= \exp(margins_i - maxMargin) / (1 + sum) - (1-\alpha(y)\delta_{y, i+1})
+ *    \end{align}
+ *    $$
+ * </blockquote>
* - * where each term in \exp is also smaller than zero, so overflow is not a concern. + * where each term in $\exp$ is also smaller than zero, so overflow is not a concern. * * For the detailed mathematical derivation, see the reference at * http://www.slideshare.net/dbtsai/2014-0620-mlor-36132297 -- cgit v1.2.3
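To make the maxMargin trick described in the Gradient.scala comment above concrete, here is a self-contained Scala sketch (a hypothetical helper, not the LogisticGradient implementation itself) that evaluates the stable loss and multipliers for one sample:

{{{
// Hypothetical numerically stable multinomial logistic loss and multipliers for one sample.
// margins(i) = x * w_{i+1} for i = 0..K-2; y is the label in {0, ..., K-1}.
def stableLossAndMultipliers(margins: Array[Double], y: Int): (Double, Array[Double]) = {
  val maxMargin = math.max(0.0, margins.max)   // shifting only helps when some margin is positive
  // sum = e^{-maxMargin} + \sum_i e^{margins_i - maxMargin} - 1
  val sum = math.exp(-maxMargin) + margins.map(m => math.exp(m - maxMargin)).sum - 1.0
  val marginOfLabel = if (y > 0) margins(y - 1) else 0.0
  val loss = math.log1p(sum) + maxMargin - marginOfLabel
  val multipliers = margins.zipWithIndex.map { case (m, i) =>
    math.exp(m - maxMargin) / (1.0 + sum) - (if (y == i + 1) 1.0 else 0.0)
  }
  (loss, multipliers)
}
}}}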