author | Joseph K. Bradley <joseph@databricks.com> | 2014-11-25 20:10:15 -0800 |
---|---|---|
committer | Xiangrui Meng <meng@databricks.com> | 2014-11-25 20:10:15 -0800 |
commit | c251fd7405db57d3ab2686c38712601fd8f13ccd (patch) | |
tree | 0496b7175e6081cc7f4f3278e1e44448217866f7 /data/mllib/pagerank_data.txt | |
parent | 7eba0fbe456c451122d7a2353ff0beca00f15223 (diff) | |
[SPARK-4583] [mllib] LogLoss for GradientBoostedTrees fix + doc updates
Currently, the LogLoss used by GradientBoostedTrees has 2 issues:
* the gradient (and therefore loss) does not match that used by Friedman (1999)
* the error computation uses 0/1 accuracy, not log loss
This PR updates LogLoss.
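For reference, the log loss and gradient described above can be sketched as follows. This is a minimal illustration, not the code from this PR: it assumes labels in {-1, +1} and the form log(1 + exp(-2 y F)) from Friedman (1999), and any constant scaling applied in the actual MLlib implementation is not shown in this commit message. The function names are hypothetical.

```python
import math


def log1p_exp(x):
    """Numerically stable log(1 + exp(x)); avoids overflow for large x."""
    if x > 0:
        return x + math.log1p(math.exp(-x))
    return math.log1p(math.exp(x))


def log_loss(prediction, label):
    """Per-example loss log(1 + exp(-2 * y * F)), with label y in {-1, +1}.

    Assumption: Friedman (1999) form; the scaling used in MLlib may differ.
    """
    return log1p_exp(-2.0 * label * prediction)


def log_loss_gradient(prediction, label):
    """Derivative of log_loss with respect to the prediction F:
    -2 * y / (1 + exp(2 * y * F)), evaluated without overflowing exp().
    """
    margin = 2.0 * label * prediction
    if margin > 0:
        e = math.exp(-margin)
        return -2.0 * label * e / (1.0 + e)
    return -2.0 * label / (1.0 + math.exp(margin))
```

The error is computed from the same function as the gradient, so the quantity reported per iteration is the one being minimized rather than 0/1 accuracy.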
It also adds some doc for boosting and forests.
I tested it on sample data and made sure the log loss is monotonically decreasing with each boosting iteration.
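A check of this kind could look like the sketch below. It is a toy stand-in, not the test that was run for this PR: it assumes labels in {-1, +1}, the Friedman-style loss log(1 + exp(-2 y F)), and a constant base learner (the mean negative gradient) in place of decision trees. All names and the sample labels are hypothetical.

```python
import math


def log_loss(prediction, label):
    """log(1 + exp(-2 * y * F)), computed stably; label y in {-1, +1}."""
    margin = -2.0 * label * prediction
    if margin > 0:
        return margin + math.log1p(math.exp(-margin))
    return math.log1p(math.exp(margin))


def log_loss_gradient(prediction, label):
    """Derivative of log_loss with respect to the prediction F."""
    return -2.0 * label / (1.0 + math.exp(2.0 * label * prediction))


def boost_constant_learners(labels, num_iterations=10, learning_rate=0.5):
    """Toy boosting: each iteration adds a constant equal to the scaled
    mean negative gradient. Returns the mean log loss after each iteration."""
    predictions = [0.0] * len(labels)
    losses = []
    for _ in range(num_iterations):
        mean_gradient = sum(
            log_loss_gradient(f, y) for f, y in zip(predictions, labels)
        ) / len(labels)
        step = -learning_rate * mean_gradient
        predictions = [f + step for f in predictions]
        losses.append(
            sum(log_loss(f, y) for f, y in zip(predictions, labels)) / len(labels)
        )
    return losses


def is_monotonically_decreasing(values):
    """True if every value is strictly smaller than the one before it."""
    return all(later < earlier for earlier, later in zip(values, values[1:]))


# Hypothetical sample labels: three positive examples and one negative.
losses = boost_constant_learners([1.0, 1.0, 1.0, -1.0])
print(is_monotonically_decreasing(losses))
```

With real trees as base learners the same per-iteration loss list would be collected from the ensemble's predictions after each added tree.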
CC: mengxr manishamde codedeft
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #3439 from jkbradley/gbt-loss-fix and squashes the following commits:
cfec17e [Joseph K. Bradley] removed forgotten temp comments
a27eb6d [Joseph K. Bradley] corrections to last log loss commit
ed5da2c [Joseph K. Bradley] updated LogLoss (boosting) for numerical stability
5e52bff [Joseph K. Bradley]
* Removed the 1/2 from SquaredError. This also required updating the test suite since it effectively doubles the gradient and loss.
* Added doc for developers within RandomForest.
* Small cleanup in test suite (generating data only once)
e57897a [Joseph K. Bradley] Fixed LogLoss for GradientBoostedTrees, and updated doc for losses, forests, and boosting
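The effect of removing the 1/2 from SquaredError (commit 5e52bff) can be seen in a small sketch. The function names are hypothetical and this is not the MLlib code; it only illustrates why the loss and gradient values expected by the test suite double.

```python
def squared_error_with_half(prediction, label):
    """Old convention: (1/2) * (F - y)^2."""
    return 0.5 * (prediction - label) ** 2


def squared_error_with_half_gradient(prediction, label):
    """Derivative of the old loss with respect to F: (F - y)."""
    return prediction - label


def squared_error(prediction, label):
    """New convention: (F - y)^2, without the 1/2."""
    return (prediction - label) ** 2


def squared_error_gradient(prediction, label):
    """Derivative of the new loss with respect to F: 2 * (F - y)."""
    return 2.0 * (prediction - label)


# Both the loss and the gradient are exactly twice the old values.
prediction, label = 3.0, 1.0
print(squared_error(prediction, label) / squared_error_with_half(prediction, label))
print(
    squared_error_gradient(prediction, label)
    / squared_error_with_half_gradient(prediction, label)
)
```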