author: BenFradet <benjamin.fradet@gmail.com> 2015-12-08 12:45:34 -0800
committer: Joseph K. Bradley <joseph@databricks.com> 2015-12-08 12:45:34 -0800
commit: 06746b3005e5e9892d0314bee3bfdfaebc36d3d4
tree: d0bdd5af1a56b07fe00c6c5c44a0da3f276338e6
parent: 5cb4695051e3dac847b1ea14d62e54dcf672c31c
[SPARK-12159][ML] Add user guide section for IndexToString transformer

Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10166 from BenFradet/SPARK-12159.
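
The round trip that the new guide section describes (labels to indices with `StringIndexer`, then indices back to labels with `IndexToString`) can be sketched in plain Python. This is only an illustration of the semantics, not Spark's implementation and not the pyspark API: the helper names `fit_labels`, `string_to_index`, and `index_to_string` are hypothetical, and the alphabetical tie-break between equally frequent labels is an assumption. The data mirrors the `id`/`categoryIndex`/`originalCategory` tables added in this diff.

```python
from collections import Counter


def fit_labels(values):
    """Order distinct labels by descending frequency.

    Ties are broken alphabetically; that tie-break is an assumption
    made for this sketch, not documented Spark behaviour.
    """
    counts = Counter(values)
    return sorted(counts, key=lambda label: (-counts[label], label))


def string_to_index(values, labels):
    """Map each label to its position in `labels`, as a float index."""
    position = {label: float(i) for i, label in enumerate(labels)}
    return [position[v] for v in values]


def index_to_string(indices, labels):
    """Map each float index back to the label stored at that position."""
    return [labels[int(i)] for i in indices]


# Category column from the StringIndexer example (ids 0 through 5).
categories = ["a", "b", "c", "a", "a", "c"]

labels = fit_labels(categories)
category_index = string_to_index(categories, labels)
original_category = index_to_string(category_index, labels)

for row_id, (idx, label) in enumerate(zip(category_index, original_category)):
    print(row_id, idx, label)
```

In this sketch the label list is passed around explicitly. The guide text in this commit says that `IndexToString` can infer the labels from the column's metadata, or use labels that you supply yourself.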
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r-- docs/ml-features.md | 104
1 files changed, 88 insertions, 16 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index 01d6abeb5b..e15c26836a 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -835,10 +835,10 @@ dctDf.select("featuresDCT").show(3);
`StringIndexer` encodes a string column of labels to a column of label indices.
The indices are in `[0, numLabels)`, ordered by label frequencies.
So the most frequent label gets index `0`.
-If the input column is numeric, we cast it to string and index the string
-values. When downstream pipeline components such as `Estimator` or
-`Transformer` make use of this string-indexed label, you must set the input
-column of the component to this string-indexed column name. In many cases,
+If the input column is numeric, we cast it to string and index the string
+values. When downstream pipeline components such as `Estimator` or
+`Transformer` make use of this string-indexed label, you must set the input
+column of the component to this string-indexed column name. In many cases,
you can set the input column with `setInputCol`.
**Examples**
@@ -951,9 +951,78 @@ indexed.show()
</div>
</div>
+
+## IndexToString
+
+Symmetrically to `StringIndexer`, `IndexToString` maps a column of label indices
+back to a column containing the original labels as strings. The common use case
+is to produce indices from labels with `StringIndexer`, train a model with those
+indices and retrieve the original labels from the column of predicted indices
+with `IndexToString`. However, you are free to supply your own labels.
+
+**Examples**
+
+Building on the `StringIndexer` example, let's assume we have the following
+DataFrame with columns `id` and `categoryIndex`:
+
+~~~~
+ id | categoryIndex
+----|---------------
+ 0 | 0.0
+ 1 | 2.0
+ 2 | 1.0
+ 3 | 0.0
+ 4 | 0.0
+ 5 | 1.0
+~~~~
+
+Applying `IndexToString` with `categoryIndex` as the input column,
+`originalCategory` as the output column, we are able to retrieve our original
+labels (they will be inferred from the columns' metadata):
+
+~~~~
+ id | categoryIndex | originalCategory
+----|---------------|-----------------
+ 0 | 0.0 | a
+ 1 | 2.0 | b
+ 2 | 1.0 | c
+ 3 | 0.0 | a
+ 4 | 0.0 | a
+ 5 | 1.0 | c
+~~~~
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+Refer to the [IndexToString Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IndexToString)
+for more details on the API.
+
+{% include_example scala/org/apache/spark/examples/ml/IndexToStringExample.scala %}
+
+</div>
+
+<div data-lang="java" markdown="1">
+
+Refer to the [IndexToString Java docs](api/java/org/apache/spark/ml/feature/IndexToString.html)
+for more details on the API.
+
+{% include_example java/org/apache/spark/examples/ml/JavaIndexToStringExample.java %}
+
+</div>
+
+<div data-lang="python" markdown="1">
+
+Refer to the [IndexToString Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IndexToString)
+for more details on the API.
+
+{% include_example python/ml/index_to_string_example.py %}
+
+</div>
+</div>
+
## OneHotEncoder
-[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
+[One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a column of label indices to a column of binary vectors, with at most a single one-value. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features
<div class="codetabs">
<div data-lang="scala" markdown="1">
@@ -979,10 +1048,11 @@ val indexer = new StringIndexer()
.fit(df)
val indexed = indexer.transform(df)
-val encoder = new OneHotEncoder().setInputCol("categoryIndex").
- setOutputCol("categoryVec")
+val encoder = new OneHotEncoder()
+ .setInputCol("categoryIndex")
+ .setOutputCol("categoryVec")
val encoded = encoder.transform(indexed)
-encoded.select("id", "categoryVec").foreach(println)
+encoded.select("id", "categoryVec").show()
{% endhighlight %}
</div>
@@ -1015,7 +1085,7 @@ JavaRDD<Row> jrdd = jsc.parallelize(Arrays.asList(
RowFactory.create(5, "c")
));
StructType schema = new StructType(new StructField[]{
- new StructField("id", DataTypes.DoubleType, false, Metadata.empty()),
+ new StructField("id", DataTypes.IntegerType, false, Metadata.empty()),
new StructField("category", DataTypes.StringType, false, Metadata.empty())
});
DataFrame df = sqlContext.createDataFrame(jrdd, schema);
@@ -1029,6 +1099,7 @@ OneHotEncoder encoder = new OneHotEncoder()
.setInputCol("categoryIndex")
.setOutputCol("categoryVec");
DataFrame encoded = encoder.transform(indexed);
+encoded.select("id", "categoryVec").show();
{% endhighlight %}
</div>
@@ -1054,6 +1125,7 @@ model = stringIndexer.fit(df)
indexed = model.transform(df)
encoder = OneHotEncoder(includeFirst=False, inputCol="categoryIndex", outputCol="categoryVec")
encoded = encoder.transform(indexed)
+encoded.select("id", "categoryVec").show()
{% endhighlight %}
</div>
</div>
@@ -1582,7 +1654,7 @@ from pyspark.mllib.linalg import Vectors
data = [(Vectors.dense([1.0, 2.0, 3.0]),), (Vectors.dense([4.0, 5.0, 6.0]),)]
df = sqlContext.createDataFrame(data, ["vector"])
-transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
+transformer = ElementwiseProduct(scalingVec=Vectors.dense([0.0, 1.0, 2.0]),
inputCol="vector", outputCol="transformedVector")
transformer.transform(df).show()
@@ -1837,15 +1909,15 @@ for more details on the API.
sub-array of the original features. It is useful for extracting features from a vector column.
`VectorSlicer` accepts a vector column with a specified indices, then outputs a new vector column
-whose values are selected via those indices. There are two types of indices,
+whose values are selected via those indices. There are two types of indices,
1. Integer indices that represents the indices into the vector, `setIndices()`;
- 2. String indices that represents the names of features into the vector, `setNames()`.
+ 2. String indices that represents the names of features into the vector, `setNames()`.
*This requires the vector column to have an `AttributeGroup` since the implementation matches on
the name field of an `Attribute`.*
-Specification by integer and string are both acceptable. Moreover, you can use integer index and
+Specification by integer and string are both acceptable. Moreover, you can use integer index and
string name simultaneously. At least one feature must be selected. Duplicate features are not
allowed, so there can be no overlap between selected indices and names. Note that if names of
features are selected, an exception will be threw out when encountering with empty input attributes.
@@ -1858,9 +1930,9 @@ followed by the selected names (in the order given).
Suppose that we have a DataFrame with the column `userFeatures`:
~~~
- userFeatures
+ userFeatures
------------------
- [0.0, 10.0, 0.5]
+ [0.0, 10.0, 0.5]
~~~
`userFeatures` is a vector column that contains three user features. Assuming that the first column
@@ -1874,7 +1946,7 @@ column named `features`:
[0.0, 10.0, 0.5] | [10.0, 0.5]
~~~
-Suppose also that we have a potential input attributes for the `userFeatures`, i.e.
+Suppose also that we have a potential input attributes for the `userFeatures`, i.e.
`["f1", "f2", "f3"]`, then we can use `setNames("f2", "f3")` to select them.
~~~