path: root/docs/ml-features.md
author     Yuhao Yang <hhbyyh@gmail.com>          2015-09-28 22:40:02 -0700
committer  Xiangrui Meng <meng@databricks.com>    2015-09-28 22:40:02 -0700
commit     9b9fe5f7bf55257269d8febcd64e95677075dfb6 (patch)
tree       5e960bbf3257491b3454d7468e96be4e33dad589 /docs/ml-features.md
parent     bf4199e261c3c8dd2970e2a154c97b46fb339f02 (diff)
[SPARK-10670] [ML] [Doc] add api reference for ml doc
jira: https://issues.apache.org/jira/browse/SPARK-10670

In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md

This JIRA is just for spark.ml, not spark.mllib.

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #8901 from hhbyyh/docAPI.
Diffstat (limited to 'docs/ml-features.md')
-rw-r--r--  docs/ml-features.md  259
1 files changed, 195 insertions, 64 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index b70da4ac63..44a9882939 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -28,12 +28,15 @@ The algorithm combines Term Frequency (TF) counts with the [hashing trick](http:
**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.
Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
-For API details, refer to the [HashingTF API docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and the [IDF API docs](api/scala/index.html#org.apache.spark.ml.feature.IDF).
In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.
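For readers skimming this diff, here is a minimal, self-contained sketch of the flow described above (illustrative only, not part of the patch; it assumes the 1.5-era `spark.ml` API and a `SQLContext` named `sqlContext` already in scope):

{% highlight scala %}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}

// Toy corpus: (label, sentence) pairs.
val sentenceData = sqlContext.createDataFrame(Seq(
  (0, "Hi I heard about Spark"),
  (1, "Logistic regression models are neat")
)).toDF("label", "sentence")

// Split each sentence into words.
val tokenizer = new Tokenizer().setInputCol("sentence").setOutputCol("words")
val wordsData = tokenizer.transform(sentenceData)

// Hash each bag of words into a fixed-length term-frequency vector.
val hashingTF = new HashingTF()
  .setInputCol("words").setOutputCol("rawFeatures").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)

// IDF is an Estimator: fit on the corpus, then rescale the TF vectors.
val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")
val idfModel = idf.fit(featurizedData)
val rescaledData = idfModel.transform(featurizedData)
rescaledData.select("features", "label").take(3).foreach(println)
{% endhighlight %}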
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [HashingTF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and
+the [IDF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IDF) for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
@@ -54,6 +57,10 @@ rescaledData.select("features", "label").take(3).foreach(println)
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [HashingTF Java docs](api/java/org/apache/spark/ml/feature/HashingTF.html) and the
+[IDF Java docs](api/java/org/apache/spark/ml/feature/IDF.html) for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -100,6 +107,10 @@ for (Row r : rescaledData.select("features", "label").take(3)) {
</div>
<div data-lang="python" markdown="1">
+
+Refer to the [HashingTF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF) and
+the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
@@ -267,9 +278,11 @@ each vector represents the token counts of the document over the vocabulary.
<div class="codetabs">
<div data-lang="scala" markdown="1">
-More details can be found in the API docs for
-[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and
-[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+
+Refer to the [CountVectorizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
+and the [CountVectorizerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.feature.CountVectorizerModel
@@ -297,9 +310,11 @@ cvModel.transform(df).select("features").show()
</div>
<div data-lang="java" markdown="1">
-More details can be found in the API docs for
-[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and
-[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+
+Refer to the [CountVectorizer Java docs](api/java/org/apache/spark/ml/feature/CountVectorizer.html)
+and the [CountVectorizerModel Java docs](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.CountVectorizer;
@@ -351,6 +366,11 @@ cvModel.transform(df).show();
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [Tokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)
+and the [RegexTokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}
@@ -373,6 +393,11 @@ regexTokenized.select("words", "label").take(3).foreach(println)
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [Tokenizer Java docs](api/java/org/apache/spark/ml/feature/Tokenizer.html)
+and the [RegexTokenizer Java docs](api/java/org/apache/spark/ml/feature/RegexTokenizer.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -414,6 +439,11 @@ RegexTokenizer regexTokenizer = new RegexTokenizer()
</div>
<div data-lang="python" markdown="1">
+
+Refer to the [Tokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer) and
+the [RegexTokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import Tokenizer, RegexTokenizer
@@ -443,7 +473,8 @@ words from the input sequences. The list of stopwords is specified by
the `stopWords` parameter. We provide [a list of stop
words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance.
+`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
+whether the matches should be case sensitive (false by default).
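For illustration only (not part of the patch), a short Scala sketch of these parameters; it assumes a DataFrame named `dataSet` with a string-array column `raw`, as in the example that follows:

{% highlight scala %}
import org.apache.spark.ml.feature.StopWordsRemover

// Remove English stop words; caseSensitive is false by default,
// so "The" and "the" are both filtered out.
val remover = new StopWordsRemover()
  .setInputCol("raw")
  .setOutputCol("filtered")
  .setCaseSensitive(false)

// Peek at the default stop word list.
println(remover.getStopWords.take(5).mkString(", "))
remover.transform(dataSet).show()
{% endhighlight %}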
**Examples**
@@ -473,10 +504,8 @@ filtered out.
<div data-lang="scala" markdown="1">
-[`StopWordsRemover`](api/scala/index.html#org.apache.spark.ml.feature.StopWordsRemover)
-takes an input column name, an output column name, a list of stop words,
-and a boolean indicating if the matches should be case sensitive (false
-by default).
+Refer to the [StopWordsRemover Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StopWordsRemover)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.ml.feature.StopWordsRemover
@@ -495,10 +524,8 @@ remover.transform(dataSet).show()
<div data-lang="java" markdown="1">
-[`StopWordsRemover`](api/java/org/apache/spark/ml/feature/StopWordsRemover.html)
-takes an input column name, an output column name, a list of stop words,
-and a boolean indicating if the matches should be case sensitive (false
-by default).
+Refer to the [StopWordsRemover Java docs](api/java/org/apache/spark/ml/feature/StopWordsRemover.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -531,10 +558,9 @@ remover.transform(dataset).show();
</div>
<div data-lang="python" markdown="1">
-[`StopWordsRemover`](api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover)
-takes an input column name, an output column name, a list of stop words,
-and a boolean indicating if the matches should be case sensitive (false
-by default).
+
+Refer to the [StopWordsRemover Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover)
+for more details on the API.
{% highlight python %}
from pyspark.ml.feature import StopWordsRemover
@@ -560,7 +586,8 @@ An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (t
<div data-lang="scala" markdown="1">
-[`NGram`](api/scala/index.html#org.apache.spark.ml.feature.NGram) takes an input column name, an output column name, and an optional length parameter n (n=2 by default).
+Refer to the [NGram Scala docs](api/scala/index.html#org.apache.spark.ml.feature.NGram)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.ml.feature.NGram
@@ -579,7 +606,8 @@ ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(pri
<div data-lang="java" markdown="1">
-[`NGram`](api/java/org/apache/spark/ml/feature/NGram.html) takes an input column name, an output column name, and an optional length parameter n (n=2 by default).
+Refer to the [NGram Java docs](api/java/org/apache/spark/ml/feature/NGram.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -617,7 +645,8 @@ for (Row r : ngramDataFrame.select("ngrams", "label").take(3)) {
<div data-lang="python" markdown="1">
-[`NGram`](api/python/pyspark.ml.html#pyspark.ml.feature.NGram) takes an input column name, an output column name, and an optional length parameter n (n=2 by default).
+Refer to the [NGram Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.NGram)
+for more details on the API.
{% highlight python %}
from pyspark.ml.feature import NGram
@@ -645,7 +674,8 @@ Binarization is the process of thresholding numerical features to binary (0/1) f
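As a hedged sketch of the thresholding described above (not part of the patch), assuming a DataFrame `continuousDataFrame` with a `Double` column `feature`:

{% highlight scala %}
import org.apache.spark.ml.feature.Binarizer

// Values greater than the threshold map to 1.0; values at or below it map to 0.0.
val binarizer = new Binarizer()
  .setInputCol("feature")
  .setOutputCol("binarized_feature")
  .setThreshold(0.5)

val binarizedDataFrame = binarizer.transform(continuousDataFrame)
binarizedDataFrame.select("binarized_feature").show()
{% endhighlight %}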
<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [Binarizer API doc](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) for more details.
+Refer to the [Binarizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Binarizer)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.ml.feature.Binarizer
@@ -671,7 +701,8 @@ binarizedFeatures.collect().foreach(println)
<div data-lang="java" markdown="1">
-Refer to the [Binarizer API doc](api/java/org/apache/spark/ml/feature/Binarizer.html) for more details.
+Refer to the [Binarizer Java docs](api/java/org/apache/spark/ml/feature/Binarizer.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -711,7 +742,8 @@ for (Row r : binarizedFeatures.collect()) {
<div data-lang="python" markdown="1">
-Refer to the [Binarizer API doc](api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer) for more details.
+Refer to the [Binarizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer)
+for more details on the API.
{% highlight python %}
from pyspark.ml.feature import Binarizer
@@ -736,7 +768,10 @@ for binarized_feature, in binarizedFeatures.collect():
<div class="codetabs">
<div data-lang="scala" markdown="1">
-See the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.feature.PCA) for API details.
+
+Refer to the [PCA Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PCA)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
@@ -759,7 +794,10 @@ result.show()
</div>
<div data-lang="java" markdown="1">
-See the [Java API documentation](api/java/org/apache/spark/ml/feature/PCA.html) for API details.
+
+Refer to the [PCA Java docs](api/java/org/apache/spark/ml/feature/PCA.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -799,7 +837,10 @@ result.show();
</div>
<div data-lang="python" markdown="1">
-See the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.feature.PCA) for API details.
+
+Refer to the [PCA Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.PCA)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors
@@ -822,6 +863,10 @@ result.show(truncate=False)
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [PolynomialExpansion Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.mllib.linalg.Vectors
@@ -842,6 +887,10 @@ polyDF.select("polyFeatures").take(3).foreach(println)
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [PolynomialExpansion Java docs](api/java/org/apache/spark/ml/feature/PolynomialExpansion.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -882,6 +931,10 @@ for (Row r : row) {
</div>
<div data-lang="python" markdown="1">
+
+Refer to the [PolynomialExpansion Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.PolynomialExpansion)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import PolynomialExpansion
from pyspark.mllib.linalg import Vectors
@@ -915,6 +968,10 @@ $0$th DCT coefficient and _not_ the $N/2$th).
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [DCT Scala docs](api/scala/index.html#org.apache.spark.ml.feature.DCT)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.DCT
import org.apache.spark.mllib.linalg.Vectors
@@ -934,6 +991,10 @@ dctDf.select("featuresDCT").show(3)
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [DCT Java docs](api/java/org/apache/spark/ml/feature/DCT.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1018,8 +1079,8 @@ index `2`.
<div data-lang="scala" markdown="1">
-[`StringIndexer`](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer) takes an input
-column name and an output column name.
+Refer to the [StringIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.ml.feature.StringIndexer
@@ -1036,8 +1097,9 @@ indexed.show()
</div>
<div data-lang="java" markdown="1">
-[`StringIndexer`](api/java/org/apache/spark/ml/feature/StringIndexer.html) takes an input column
-name and an output column name.
+
+Refer to the [StringIndexer Java docs](api/java/org/apache/spark/ml/feature/StringIndexer.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -1074,8 +1136,8 @@ indexed.show();
<div data-lang="python" markdown="1">
-[`StringIndexer`](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) takes an input
-column name and an output column name.
+Refer to the [StringIndexer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer)
+for more details on the API.
{% highlight python %}
from pyspark.ml.feature import StringIndexer
@@ -1096,6 +1158,10 @@ indexed.show()
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
@@ -1122,6 +1188,10 @@ encoded.select("id", "categoryVec").foreach(println)
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [OneHotEncoder Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoder.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1164,6 +1234,10 @@ DataFrame encoded = encoder.transform(indexed);
</div>
<div data-lang="python" markdown="1">
+
+Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import OneHotEncoder, StringIndexer
@@ -1197,12 +1271,14 @@ It can both automatically decide which features are categorical and convert orig
Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.
-Please refer to the [VectorIndexer API docs](api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer) for more details.
-
In the example below, we read in a dataset of labeled points and then use `VectorIndexer` to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as `DecisionTreeRegressor` that handle categorical features.
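A minimal Scala sketch of that flow (illustrative, not part of the patch); it assumes `data` is a DataFrame with a vector column `features`, e.g. loaded from the libsvm sample used elsewhere in this guide:

{% highlight scala %}
import org.apache.spark.ml.feature.VectorIndexer

val indexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexed")
  .setMaxCategories(10)

// Fit decides which features are categorical based on distinct value counts.
val indexerModel = indexer.fit(data)

// Features with at most 10 distinct values are treated as categorical
// and their values are re-encoded as category indices.
val indexedData = indexerModel.transform(data)
{% endhighlight %}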
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [VectorIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.VectorIndexer
@@ -1223,6 +1299,10 @@ val indexedData = indexerModel.transform(data)
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [VectorIndexer Java docs](api/java/org/apache/spark/ml/feature/VectorIndexer.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Map;
@@ -1250,6 +1330,10 @@ DataFrame indexedData = indexerModel.transform(data);
</div>
<div data-lang="python" markdown="1">
+
+Refer to the [VectorIndexer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorIndexer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import VectorIndexer
@@ -1273,6 +1357,10 @@ The following example demonstrates how to load a dataset in libsvm format and th
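For reference, a short sketch of the normalization step (not part of the patch), assuming a DataFrame `dataFrame` with a vector column `features`:

{% highlight scala %}
import org.apache.spark.ml.feature.Normalizer

// Normalize each row vector to unit L^1 norm.
val normalizer = new Normalizer()
  .setInputCol("features")
  .setOutputCol("normFeatures")
  .setP(1.0)
val l1NormData = normalizer.transform(dataFrame)

// The norm can also be overridden at transform time, e.g. L^inf.
val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.PositiveInfinity)
{% endhighlight %}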
<div class="codetabs">
<div data-lang="scala">
+
+Refer to the [Normalizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Normalizer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.Normalizer
@@ -1292,6 +1380,10 @@ val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.Positi
</div>
<div data-lang="java">
+
+Refer to the [Normalizer Java docs](api/java/org/apache/spark/ml/feature/Normalizer.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.ml.feature.Normalizer;
import org.apache.spark.sql.DataFrame;
@@ -1313,6 +1405,10 @@ DataFrame lInfNormData =
</div>
<div data-lang="python">
+
+Refer to the [Normalizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import Normalizer
@@ -1341,14 +1437,14 @@ lInfNormData = normalizer.transform(dataFrame, {normalizer.p: float("inf")})
Note that if the standard deviation of a feature is zero, the transformed value for that feature will default to `0.0` in the `Vector`.
-More details can be found in the API docs for
-[StandardScaler](api/scala/index.html#org.apache.spark.ml.feature.StandardScaler) and
-[StandardScalerModel](api/scala/index.html#org.apache.spark.ml.feature.StandardScalerModel).
-
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.
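A brief illustrative sketch (not part of the patch), again assuming a DataFrame `dataFrame` with a vector column `features`:

{% highlight scala %}
import org.apache.spark.ml.feature.StandardScaler

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaledFeatures")
  .setWithStd(true)   // scale to unit standard deviation
  .setWithMean(false) // mean-centering would densify sparse vectors

// StandardScaler is an Estimator: compute summary statistics, then transform.
val scalerModel = scaler.fit(dataFrame)
val scaledData = scalerModel.transform(dataFrame)
{% endhighlight %}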
<div class="codetabs">
<div data-lang="scala">
+
+Refer to the [StandardScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StandardScaler)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.StandardScaler
@@ -1369,6 +1465,10 @@ val scaledData = scalerModel.transform(dataFrame)
</div>
<div data-lang="java">
+
+Refer to the [StandardScaler Java docs](api/java/org/apache/spark/ml/feature/StandardScaler.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.feature.StandardScalerModel;
@@ -1391,6 +1491,10 @@ DataFrame scaledData = scalerModel.transform(dataFrame);
</div>
<div data-lang="python">
+
+Refer to the [StandardScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StandardScaler)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import StandardScaler
@@ -1429,9 +1533,11 @@ The following example demonstrates how to load a dataset in libsvm format and th
<div class="codetabs">
<div data-lang="scala" markdown="1">
-More details can be found in the API docs for
-[MinMaxScaler](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and
-[MinMaxScalerModel](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel).
+
+Refer to the [MinMaxScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)
+and the [MinMaxScalerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.MinMaxScaler
@@ -1450,9 +1556,11 @@ val scaledData = scalerModel.transform(dataFrame)
</div>
<div data-lang="java" markdown="1">
-More details can be found in the API docs for
-[MinMaxScaler](api/java/org/apache/spark/ml/feature/MinMaxScaler.html) and
-[MinMaxScalerModel](api/java/org/apache/spark/ml/feature/MinMaxScalerModel.html).
+
+Refer to the [MinMaxScaler Java docs](api/java/org/apache/spark/ml/feature/MinMaxScaler.html)
+and the [MinMaxScalerModel Java docs](api/java/org/apache/spark/ml/feature/MinMaxScalerModel.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.MinMaxScaler;
@@ -1490,6 +1598,10 @@ The following example demonstrates how to bucketize a column of `Double`s into a
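As an illustrative sketch (not part of the patch), assuming a DataFrame `dataFrame` with a `Double` column `features`:

{% highlight scala %}
import org.apache.spark.ml.feature.Bucketizer

// Splits define the bucket boundaries; open-ended buckets use +/- infinity
// so that no input value falls outside all buckets.
val splits = Array(Double.NegativeInfinity, -0.5, 0.0, 0.5, Double.PositiveInfinity)

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

val bucketedData = bucketizer.transform(dataFrame)
{% endhighlight %}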
<div class="codetabs">
<div data-lang="scala">
+
+Refer to the [Bucketizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.DataFrame
@@ -1510,6 +1622,10 @@ val bucketedData = bucketizer.transform(dataFrame)
</div>
<div data-lang="java">
+
+Refer to the [Bucketizer Java docs](api/java/org/apache/spark/ml/feature/Bucketizer.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1545,6 +1661,10 @@ DataFrame bucketedData = bucketizer.transform(dataFrame);
</div>
<div data-lang="python">
+
+Refer to the [Bucketizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Bucketizer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import Bucketizer
@@ -1581,14 +1701,14 @@ v_N
\end{pmatrix}
\]`
-[`ElementwiseProduct`](api/scala/index.html#org.apache.spark.ml.feature.ElementwiseProduct) takes the following parameter:
-
-* `scalingVec`: the transforming vector.
-
The example below demonstrates how to transform vectors using a transforming vector value.
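A minimal Scala sketch of the element-wise (Hadamard) product described above (not part of the patch), assuming a DataFrame `dataFrame` with a vector column `vector`:

{% highlight scala %}
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors

// Each input vector is multiplied element-wise by the scaling vector.
val transformingVector = Vectors.dense(0.0, 1.0, 2.0)
val transformer = new ElementwiseProduct()
  .setScalingVec(transformingVector)
  .setInputCol("vector")
  .setOutputCol("transformedVector")

transformer.transform(dataFrame).show()
{% endhighlight %}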
<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [ElementwiseProduct Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ElementwiseProduct)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
@@ -1611,6 +1731,10 @@ transformer.transform(dataFrame).show()
</div>
<div data-lang="java" markdown="1">
+
+Refer to the [ElementwiseProduct Java docs](api/java/org/apache/spark/ml/feature/ElementwiseProduct.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1649,6 +1773,10 @@ transformer.transform(dataFrame).show();
</div>
<div data-lang="python" markdown="1">
+
+Refer to the [ElementwiseProduct Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ElementwiseProduct)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors
@@ -1702,8 +1830,8 @@ output column to `features`, after transformation we should get the following Da
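For orientation, a short sketch of the assembly step (not part of the patch), assuming a DataFrame `dataset` with numeric columns `hour` and `mobile`, a vector column `userFeatures`, and a label column `clicked`:

{% highlight scala %}
import org.apache.spark.ml.feature.VectorAssembler

// Combine several raw columns into a single vector column named "features".
val assembler = new VectorAssembler()
  .setInputCols(Array("hour", "mobile", "userFeatures"))
  .setOutputCol("features")

val output = assembler.transform(dataset)
println(output.select("features", "clicked").first())
{% endhighlight %}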
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`VectorAssembler`](api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler) takes an array
-of input column names and an output column name.
+Refer to the [VectorAssembler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
@@ -1722,8 +1850,8 @@ println(output.select("features", "clicked").first())
<div data-lang="java" markdown="1">
-[`VectorAssembler`](api/java/org/apache/spark/ml/feature/VectorAssembler.html) takes an array
-of input column names and an output column name.
+Refer to the [VectorAssembler Java docs](api/java/org/apache/spark/ml/feature/VectorAssembler.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -1759,8 +1887,8 @@ System.out.println(output.select("features", "clicked").first());
<div data-lang="python" markdown="1">
-[`VectorAssembler`](api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler) takes a list
-of input column names and an output column name.
+Refer to the [VectorAssembler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler)
+for more details on the API.
{% highlight python %}
from pyspark.mllib.linalg import Vectors
@@ -1836,8 +1964,8 @@ Suppose also that we have potential input attributes for the `userFeatures`, i
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`VectorSlicer`](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer) takes an input
-column name with specified indices or names and an output column name.
+Refer to the [VectorSlicer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
@@ -1870,8 +1998,8 @@ println(output.select("userFeatures", "features").first())
<div data-lang="java" markdown="1">
-[`VectorSlicer`](api/java/org/apache/spark/ml/feature/VectorSlicer.html) takes an input column name
-with specified indices or names and an output column name.
+Refer to the [VectorSlicer Java docs](api/java/org/apache/spark/ml/feature/VectorSlicer.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -1941,7 +2069,8 @@ id | country | hour | clicked | features | label
<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`RFormula`](api/scala/index.html#org.apache.spark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+Refer to the [RFormula Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RFormula)
+for more details on the API.
{% highlight scala %}
import org.apache.spark.ml.feature.RFormula
@@ -1962,7 +2091,8 @@ output.select("features", "label").show()
<div data-lang="java" markdown="1">
-[`RFormula`](api/java/org/apache/spark/ml/feature/RFormula.html) takes an R formula string, and optional parameters for the names of its output columns.
+Refer to the [RFormula Java docs](api/java/org/apache/spark/ml/feature/RFormula.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -2000,7 +2130,8 @@ output.select("features", "label").show();
<div data-lang="python" markdown="1">
-[`RFormula`](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+Refer to the [RFormula Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula)
+for more details on the API.
{% highlight python %}
from pyspark.ml.feature import RFormula