Diffstat (limited to 'docs/ml-guide.md')
-rw-r--r-- docs/ml-guide.md | 8
1 files changed, 4 insertions, 4 deletions
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
index cc353df1ec..dae86d8480 100644
--- a/docs/ml-guide.md
+++ b/docs/ml-guide.md
@@ -47,7 +47,7 @@ mostly inspired by the [scikit-learn](http://scikit-learn.org/) project.
E.g., a `DataFrame` could have different columns storing text, feature vectors, true labels, and predictions.
* **[`Transformer`](ml-guide.html#transformers)**: A `Transformer` is an algorithm which can transform one `DataFrame` into another `DataFrame`.
-E.g., an ML model is a `Transformer` which transforms `DataFrame` with features into a `DataFrame` with predictions.
+E.g., an ML model is a `Transformer` which transforms a `DataFrame` with features into a `DataFrame` with predictions.
* **[`Estimator`](ml-guide.html#estimators)**: An `Estimator` is an algorithm which can be fit on a `DataFrame` to produce a `Transformer`.
E.g., a learning algorithm is an `Estimator` which trains on a `DataFrame` and produces a model.
@@ -292,13 +292,13 @@ However, it is also a well-established method for choosing parameters which is m
## Example: model selection via train validation split
In addition to `CrossValidator` Spark also offers `TrainValidationSplit` for hyper-parameter tuning.
-`TrainValidationSplit` only evaluates each combination of parameters once as opposed to k times in
- case of `CrossValidator`. It is therefore less expensive,
+`TrainValidationSplit` only evaluates each combination of parameters once, as opposed to k times in
+ the case of `CrossValidator`. It is therefore less expensive,
but will not produce as reliable results when the training dataset is not sufficiently large.
`TrainValidationSplit` takes an `Estimator`, a set of `ParamMap`s provided in the `estimatorParamMaps` parameter,
and an `Evaluator`.
-It begins by splitting the dataset into two parts using `trainRatio` parameter
+It begins by splitting the dataset into two parts using the `trainRatio` parameter
which are used as separate training and test datasets. For example with `$trainRatio=0.75$` (default),
`TrainValidationSplit` will generate a training and test dataset pair where 75% of the data is used for training and 25% for validation.
Similar to `CrossValidator`, `TrainValidationSplit` also iterates through the set of `ParamMap`s.
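
As a reading aid for the paragraph touched by this hunk, here is a minimal plain-Python sketch of the train/validation procedure it describes: split once by a train ratio, evaluate each parameter combination exactly once on the held-out part, and keep the best. It does not use Spark. The helper names (`split_rows`, `fit_slope`, `mean_squared_error`, `train_validation_select`), the toy one-parameter model, and the `reg` parameter are all hypothetical stand-ins for an `Estimator`, an `Evaluator`, and the `ParamMap`s, not the library's API. The final refit on the full dataset is included on the assumption that it mirrors how `CrossValidator` is used.

```python
import random


def split_rows(rows, train_ratio=0.75, seed=42):
    """Shuffle once, then cut the rows into (train, validation) by train_ratio."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]


def fit_slope(rows, reg):
    """Hypothetical estimator: 1-D least squares through the origin with an L2 penalty."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + reg)


def mean_squared_error(slope, rows):
    """Hypothetical evaluator: lower is better."""
    return sum((y - slope * x) ** 2 for x, y in rows) / len(rows)


def train_validation_select(rows, param_maps, train_ratio=0.75, seed=42):
    """Evaluate every parameter map once on a single train/validation split."""
    train, validation = split_rows(rows, train_ratio, seed)
    scored = []
    for params in param_maps:
        slope = fit_slope(train, **params)
        scored.append((mean_squared_error(slope, validation), params))
    best_error, best_params = min(scored, key=lambda pair: pair[0])
    # Assumption: refit on the full dataset with the winning parameters.
    final_slope = fit_slope(rows, **best_params)
    return best_params, best_error, final_slope


# Toy data following y = 2x exactly, so the unregularized fit should win.
rows = [(float(x), 2.0 * x) for x in range(1, 41)]
param_maps = [{"reg": 0.0}, {"reg": 10.0}, {"reg": 1000.0}]

train, validation = split_rows(rows)
best_params, best_error, final_slope = train_validation_select(rows, param_maps)
```

Each parameter map costs one fit here, where k-fold cross-validation would cost k fits. The trade-off is that the score depends on a single split, which is why the result is less reliable on small datasets.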