author: Timothy Hunter <timhunter@databricks.com> 2015-12-10 12:50:46 -0800
committer: Joseph K. Bradley <joseph@databricks.com> 2015-12-10 12:50:46 -0800
commit: 2ecbe02d5b28ee562d10c1735244b90a08532c9e
tree: c589a01a2900513aa1b277303ed7cdffc1961ba4 /docs/ml-decision-tree.md
parent: ec5f9ed5de2218938dba52152475daafd4dc4786
[SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation.
Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark). It also removes some files that I forgot to delete with #10207.

Author: Timothy Hunter <timhunter@databricks.com>

Closes #10234 from thunterdb/12212.
Diffstat (limited to 'docs/ml-decision-tree.md')
-rw-r--r-- docs/ml-decision-tree.md 171
1 file changed, 4 insertions, 167 deletions
diff --git a/docs/ml-decision-tree.md b/docs/ml-decision-tree.md
index 2bfac6f6c8..a721d55bc6 100644
--- a/docs/ml-decision-tree.md
+++ b/docs/ml-decision-tree.md
@@ -1,171 +1,8 @@
---
layout: global
-title: Decision Trees - SparkML
-displayTitle: <a href="ml-guide.html">ML</a> - Decision Trees
+title: Decision trees - spark.ml
+displayTitle: Decision trees - spark.ml
---
-**Table of Contents**
-
-* This will become a table of contents (this text will be scraped).
-{:toc}
-
-
-# Overview
-
-[Decision trees](http://en.wikipedia.org/wiki/Decision_tree_learning)
-and their ensembles are popular methods for the machine learning tasks of
-classification and regression. Decision trees are widely used since they are easy to interpret,
-handle categorical features, extend to the multiclass classification setting, do not require
-feature scaling, and are able to capture non-linearities and feature interactions. Tree ensemble
-algorithms such as random forests and boosting are among the top performers for classification and
-regression tasks.
-
-MLlib supports decision trees for binary and multiclass classification and for regression,
-using both continuous and categorical features. The implementation partitions data by rows,
-allowing distributed training with millions or even billions of instances.
-
-Users can find more information about the decision tree algorithm in the [MLlib Decision Tree guide](mllib-decision-tree.html). In this section, we demonstrate the Pipelines API for Decision Trees.
-
-The Pipelines API for Decision Trees offers a bit more functionality than the original API. In particular, for classification, users can get the predicted probability of each class (a.k.a. class conditional probabilities).
-
-Ensembles of trees (Random Forests and Gradient-Boosted Trees) are described in the [Ensembles guide](ml-ensembles.html).
-
-# Inputs and Outputs
-
-We list the input and output (prediction) column types here.
-All output columns are optional; to exclude an output column, set its corresponding Param to an empty string.
-
-## Input Columns
-
-<table class="table">
- <thead>
- <tr>
- <th align="left">Param name</th>
- <th align="left">Type(s)</th>
- <th align="left">Default</th>
- <th align="left">Description</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>labelCol</td>
- <td>Double</td>
- <td>"label"</td>
- <td>Label to predict</td>
- </tr>
- <tr>
- <td>featuresCol</td>
- <td>Vector</td>
- <td>"features"</td>
- <td>Feature vector</td>
- </tr>
- </tbody>
-</table>
-
-## Output Columns
-
-<table class="table">
- <thead>
- <tr>
- <th align="left">Param name</th>
- <th align="left">Type(s)</th>
- <th align="left">Default</th>
- <th align="left">Description</th>
- <th align="left">Notes</th>
- </tr>
- </thead>
- <tbody>
- <tr>
- <td>predictionCol</td>
- <td>Double</td>
- <td>"prediction"</td>
- <td>Predicted label</td>
- <td></td>
- </tr>
- <tr>
- <td>rawPredictionCol</td>
- <td>Vector</td>
- <td>"rawPrediction"</td>
- <td>Vector of length # classes, with the counts of training instance labels at the tree node which makes the prediction</td>
- <td>Classification only</td>
- </tr>
- <tr>
- <td>probabilityCol</td>
- <td>Vector</td>
- <td>"probability"</td>
- <td>Vector of length # classes equal to rawPrediction normalized to a multinomial distribution</td>
- <td>Classification only</td>
- </tr>
- </tbody>
-</table>
-
-# Examples
-
-The below examples demonstrate the Pipelines API for Decision Trees. The main differences between this API and the [original MLlib Decision Tree API](mllib-decision-tree.html) are:
-
-* support for ML Pipelines
-* separation of Decision Trees for classification vs. regression
-* use of DataFrame metadata to distinguish continuous and categorical features
-
-
-## Classification
-
-The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
-We use two feature transformers to prepare the data; these help index categories for the label and categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.classification.DecisionTreeClassifier).
-
-{% include_example scala/org/apache/spark/examples/ml/DecisionTreeClassificationExample.scala %}
-
-</div>
-
-<div data-lang="java" markdown="1">
-
-More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/classification/DecisionTreeClassifier.html).
-
-{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeClassificationExample.java %}
-
-</div>
-
-<div data-lang="python" markdown="1">
-
-More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.classification.DecisionTreeClassifier).
-
-{% include_example python/ml/decision_tree_classification_example.py %}
-
-</div>
-
-</div>
-
-
-## Regression
-
-The following examples load a dataset in LibSVM format, split it into training and test sets, train on the first dataset, and then evaluate on the held-out test set.
-We use a feature transformer to index categorical features, adding metadata to the `DataFrame` which the Decision Tree algorithm can recognize.
-
-<div class="codetabs">
-<div data-lang="scala" markdown="1">
-
-More details on parameters can be found in the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.regression.DecisionTreeRegressor).
-
-{% include_example scala/org/apache/spark/examples/ml/DecisionTreeRegressionExample.scala %}
-</div>
-
-<div data-lang="java" markdown="1">
-
-More details on parameters can be found in the [Java API documentation](api/java/org/apache/spark/ml/regression/DecisionTreeRegressor.html).
-
-{% include_example java/org/apache/spark/examples/ml/JavaDecisionTreeRegressionExample.java %}
-</div>
-
-<div data-lang="python" markdown="1">
-
-More details on parameters can be found in the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.regression.DecisionTreeRegressor).
-
-{% include_example python/ml/decision_tree_regression_example.py %}
-</div>
-
-</div>
+ > This section has been moved into the
+ [classification and regression section](ml-classification-regression.html#decision-trees).