author | MechCoder <manojkumarsivaraj334@gmail.com> | 2015-02-24 15:13:22 -0800 |
---|---|---|
committer | Joseph K. Bradley <joseph@databricks.com> | 2015-02-24 15:13:22 -0800 |
commit | 2a0fe34891882e0fde1b5722d8227aa99acc0f1f (patch) | |
tree | 238c58f540e0b8c727e131b6359041d137c4e780 /docs/mllib-ensembles.md | |
parent | da505e59274d1c838653c1109db65ad374e65304 (diff) | |
[SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation
Training can stop early if the decrease in validation error is less than a given tolerance, or if the error increases because the model has overfit the training data.
This introduces a new method, runWithValidation, which takes a pair of RDDs: one for the training data and the other for validation.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #4677 from MechCoder/spark-5436 and squashes the following commits:
1bb21d4 [MechCoder] Combine regression and classification tests into a single one
e4d799b [MechCoder] Addresses indentation and doc comments
b48a70f [MechCoder] COSMIT
b928a19 [MechCoder] Move validation while training section under usage tips
fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance.
55e5c3b [MechCoder] One liner for prevValidateError
3e74372 [MechCoder] TST: Add test for classification
77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
Diffstat (limited to 'docs/mllib-ensembles.md')
-rw-r--r-- | docs/mllib-ensembles.md | 11 |
1 files changed, 11 insertions, 0 deletions
diff --git a/docs/mllib-ensembles.md b/docs/mllib-ensembles.md
index fb90b70399..00040e6073 100644
--- a/docs/mllib-ensembles.md
+++ b/docs/mllib-ensembles.md
@@ -427,6 +427,17 @@ We omit some decision tree parameters since those are covered in the [decision t
 * **`algo`**: The algorithm or task (classification vs. regression) is set using the tree [Strategy] parameter.
+#### Validation while training
+
+Gradient boosting can overfit when trained with more trees. In order to prevent overfitting, it is useful to validate while
+training. The method runWithValidation has been provided to make use of this option. It takes a pair of RDD's as arguments, the
+first one being the training dataset and the second being the validation dataset.
+
+The training is stopped when the improvement in the validation error is not more than a certain tolerance
+(supplied by the `validationTol` argument in `BoostingStrategy`). In practice, the validation error
+decreases initially and later increases. There might be cases in which the validation error does not change monotonically,
+and the user is advised to set a large enough negative tolerance and examine the validation curve to to tune the number of
+iterations.
 ### Examples
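
The stopping rule described in the added documentation can be illustrated with a small standalone sketch. This is not Spark's implementation: the function name `iterations_to_keep` and its inputs are hypothetical, and the comparison follows the wording of the documentation (stop when the improvement over the best error so far is not more than the tolerance). It assumes the validation error after each boosting iteration has already been recorded in a list.

```python
from typing import List, Tuple


def iterations_to_keep(validation_errors: List[float],
                       validation_tol: float) -> Tuple[int, float]:
    """Hypothetical sketch of early stopping on a validation curve.

    validation_errors[i] is the validation error after i + 1 boosting
    iterations. Returns (iterations kept, best validation error seen).
    """
    if not validation_errors:
        raise ValueError("need at least one validation error")

    best_error = validation_errors[0]
    best_iters = 1
    for m in range(1, len(validation_errors)):
        current = validation_errors[m]
        improvement = best_error - current
        # Assumption taken from the doc text: stop when the improvement
        # is "not more than" the tolerance.
        if improvement <= validation_tol:
            break
        # A negative tolerance lets training continue through small
        # increases; only a real decrease updates the best model.
        if current < best_error:
            best_error = current
            best_iters = m + 1
    return best_iters, best_error


# A non-monotonic validation curve: it rises slightly at iteration 4.
curve = [0.5, 0.4, 0.35, 0.36, 0.30, 0.31]
print(iterations_to_keep(curve, 0.0))    # stops at the first increase
print(iterations_to_keep(curve, -0.05))  # tolerates the small increase
```

With a tolerance of zero the sketch stops at the first increase and keeps three iterations, while a sufficiently negative tolerance lets it ride through the bump and find the lower error at iteration five. This is the behavior the documentation alludes to when it recommends a negative tolerance for non-monotonic validation curves.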