-rw-r--r-- | docs/ml-features.md | 259
1 file changed, 195 insertions, 64 deletions
diff --git a/docs/ml-features.md b/docs/ml-features.md
index b70da4ac63..44a9882939 100644
--- a/docs/ml-features.md
+++ b/docs/ml-features.md
@@ -28,12 +28,15 @@ The algorithm combines Term Frequency (TF) counts with the [hashing trick](http:

**IDF**: `IDF` is an `Estimator` which fits on a dataset and produces an `IDFModel`. The `IDFModel` takes feature vectors (generally created from `HashingTF`) and scales each column. Intuitively, it down-weights columns which appear frequently in a corpus.

Please refer to the [MLlib user guide on TF-IDF](mllib-feature-extraction.html#tf-idf) for more details on Term Frequency and Inverse Document Frequency.
-For API details, refer to the [HashingTF API docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and the [IDF API docs](api/scala/index.html#org.apache.spark.ml.feature.IDF).

In the following code segment, we start with a set of sentences. We split each sentence into words using `Tokenizer`. For each sentence (bag of words), we use `HashingTF` to hash the sentence into a feature vector. We use `IDF` to rescale the feature vectors; this generally improves performance when using text as features. Our feature vectors could then be passed to a learning algorithm.

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [HashingTF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.HashingTF) and
+the [IDF Scala docs](api/scala/index.html#org.apache.spark.ml.feature.IDF) for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
@@ -54,6 +57,10 @@ rescaledData.select("features", "label").take(3).foreach(println)
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [HashingTF Java docs](api/java/org/apache/spark/ml/feature/HashingTF.html) and the
+[IDF Java docs](api/java/org/apache/spark/ml/feature/IDF.html) for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -100,6 +107,10 @@ for (Row r : rescaledData.select("features", "label").take(3)) {
</div>

<div data-lang="python" markdown="1">
+
+Refer to the [HashingTF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.HashingTF) and
+the [IDF Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.IDF) for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import HashingTF, IDF, Tokenizer
@@ -267,9 +278,11 @@ each vector represents the token counts of the document over the vocabulary.

<div class="codetabs">
<div data-lang="scala" markdown="1">
-More details can be found in the API docs for
-[CountVectorizer](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer) and
-[CountVectorizerModel](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel).
+
+Refer to the [CountVectorizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizer)
+and the [CountVectorizerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.CountVectorizerModel)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.ml.feature.CountVectorizerModel
@@ -297,9 +310,11 @@ cvModel.transform(df).select("features").show()
</div>

<div data-lang="java" markdown="1">
-More details can be found in the API docs for
-[CountVectorizer](api/java/org/apache/spark/ml/feature/CountVectorizer.html) and
-[CountVectorizerModel](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html).
+
+Refer to the [CountVectorizer Java docs](api/java/org/apache/spark/ml/feature/CountVectorizer.html)
+and the [CountVectorizerModel Java docs](api/java/org/apache/spark/ml/feature/CountVectorizerModel.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.CountVectorizer;
@@ -351,6 +366,11 @@ cvModel.transform(df).show();

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [Tokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Tokenizer)
+and the [RegexTokenizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RegexTokenizer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.{Tokenizer, RegexTokenizer}
@@ -373,6 +393,11 @@ regexTokenized.select("words", "label").take(3).foreach(println)
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [Tokenizer Java docs](api/java/org/apache/spark/ml/feature/Tokenizer.html)
+and the [RegexTokenizer Java docs](api/java/org/apache/spark/ml/feature/RegexTokenizer.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -414,6 +439,11 @@ RegexTokenizer regexTokenizer = new RegexTokenizer()
</div>

<div data-lang="python" markdown="1">
+
+Refer to the [Tokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Tokenizer) and
+the [RegexTokenizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RegexTokenizer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import Tokenizer, RegexTokenizer
@@ -443,7 +473,8 @@ words from the input sequences. The list of stopwords is specified by the
`stopWords` parameter. We provide [a list of stop words](http://ir.dcs.gla.ac.uk/resources/linguistic_utils/stop_words) by
default, accessible by calling `getStopWords` on a newly instantiated
-`StopWordsRemover` instance.
+`StopWordsRemover` instance. A boolean parameter `caseSensitive` indicates
+if the matches should be case sensitive (false by default).

**Examples**
@@ -473,10 +504,8 @@ filtered out.

<div data-lang="scala" markdown="1">
-[`StopWordsRemover`](api/scala/index.html#org.apache.spark.ml.feature.StopWordsRemover)
-takes an input column name, an output column name, a list of stop words,
-and a boolean indicating if the matches should be case sensitive (false
-by default).
+Refer to the [StopWordsRemover Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StopWordsRemover)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.StopWordsRemover
@@ -495,10 +524,8 @@ remover.transform(dataSet).show()

<div data-lang="java" markdown="1">
-[`StopWordsRemover`](api/java/org/apache/spark/ml/feature/StopWordsRemover.html)
-takes an input column name, an output column name, a list of stop words,
-and a boolean indicating if the matches should be case sensitive (false
-by default).
+Refer to the [StopWordsRemover Java docs](api/java/org/apache/spark/ml/feature/StopWordsRemover.html)
+for more details on the API.

{% highlight java %}
import java.util.Arrays;
@@ -531,10 +558,9 @@ remover.transform(dataset).show();
</div>

<div data-lang="python" markdown="1">
-[`StopWordsRemover`](api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover)
-takes an input column name, an output column name, a list of stop words,
-and a boolean indicating if the matches should be case sensitive (false
-by default).
+
+Refer to the [StopWordsRemover Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StopWordsRemover)
+for more details on the API.

{% highlight python %}
from pyspark.ml.feature import StopWordsRemover
@@ -560,7 +586,8 @@ An [n-gram](https://en.wikipedia.org/wiki/N-gram) is a sequence of $n$ tokens (t

<div data-lang="scala" markdown="1">
-[`NGram`](api/scala/index.html#org.apache.spark.ml.feature.NGram) takes an input column name, an output column name, and an optional length parameter n (n=2 by default).
+Refer to the [NGram Scala docs](api/scala/index.html#org.apache.spark.ml.feature.NGram)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.NGram
@@ -579,7 +606,8 @@ ngramDataFrame.take(3).map(_.getAs[Stream[String]]("ngrams").toList).foreach(pri

<div data-lang="java" markdown="1">
-[`NGram`](api/java/org/apache/spark/ml/feature/NGram.html) takes an input column name, an output column name, and an optional length parameter n (n=2 by default).
+Refer to the [NGram Java docs](api/java/org/apache/spark/ml/feature/NGram.html)
+for more details on the API.

{% highlight java %}
import java.util.Arrays;
@@ -617,7 +645,8 @@ for (Row r : ngramDataFrame.select("ngrams", "label").take(3)) {

<div data-lang="python" markdown="1">
-[`NGram`](api/python/pyspark.ml.html#pyspark.ml.feature.NGram) takes an input column name, an output column name, and an optional length parameter n (n=2 by default).
+Refer to the [NGram Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.NGram)
+for more details on the API.

{% highlight python %}
from pyspark.ml.feature import NGram
@@ -645,7 +674,8 @@ Binarization is the process of thresholding numerical features to binary (0/1) f

<div class="codetabs">
<div data-lang="scala" markdown="1">
-Refer to the [Binarizer API doc](api/scala/index.html#org.apache.spark.ml.feature.Binarizer) for more details.
+Refer to the [Binarizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Binarizer)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.Binarizer
@@ -671,7 +701,8 @@ binarizedFeatures.collect().foreach(println)

<div data-lang="java" markdown="1">
-Refer to the [Binarizer API doc](api/java/org/apache/spark/ml/feature/Binarizer.html) for more details.
+Refer to the [Binarizer Java docs](api/java/org/apache/spark/ml/feature/Binarizer.html)
+for more details on the API.

{% highlight java %}
import java.util.Arrays;
@@ -711,7 +742,8 @@ for (Row r : binarizedFeatures.collect()) {

<div data-lang="python" markdown="1">
-Refer to the [Binarizer API doc](api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer) for more details.
+Refer to the [Binarizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Binarizer)
+for more details on the API.

{% highlight python %}
from pyspark.ml.feature import Binarizer
@@ -736,7 +768,10 @@ for binarized_feature, in binarizedFeatures.collect():

<div class="codetabs">
<div data-lang="scala" markdown="1">
-See the [Scala API documentation](api/scala/index.html#org.apache.spark.ml.feature.PCA) for API details.
+
+Refer to the [PCA Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PCA)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.PCA
import org.apache.spark.mllib.linalg.Vectors
@@ -759,7 +794,10 @@ result.show()
</div>

<div data-lang="java" markdown="1">
-See the [Java API documentation](api/java/org/apache/spark/ml/feature/PCA.html) for API details.
+
+Refer to the [PCA Java docs](api/java/org/apache/spark/ml/feature/PCA.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -799,7 +837,10 @@ result.show();
</div>

<div data-lang="python" markdown="1">
-See the [Python API documentation](api/python/pyspark.ml.html#pyspark.ml.feature.PCA) for API details.
+
+Refer to the [PCA Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.PCA)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import PCA
from pyspark.mllib.linalg import Vectors
@@ -822,6 +863,10 @@ result.show(truncate=False)

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [PolynomialExpansion Scala docs](api/scala/index.html#org.apache.spark.ml.feature.PolynomialExpansion)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.PolynomialExpansion
import org.apache.spark.mllib.linalg.Vectors
@@ -842,6 +887,10 @@ polyDF.select("polyFeatures").take(3).foreach(println)
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [PolynomialExpansion Java docs](api/java/org/apache/spark/ml/feature/PolynomialExpansion.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -882,6 +931,10 @@ for (Row r : row) {
</div>

<div data-lang="python" markdown="1">
+
+Refer to the [PolynomialExpansion Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.PolynomialExpansion)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import PolynomialExpansion
from pyspark.mllib.linalg import Vectors
@@ -915,6 +968,10 @@ $0$th DCT coefficient and _not_ the $N/2$th).

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [DCT Scala docs](api/scala/index.html#org.apache.spark.ml.feature.DCT)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.DCT
import org.apache.spark.mllib.linalg.Vectors
@@ -934,6 +991,10 @@ dctDf.select("featuresDCT").show(3)
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [DCT Java docs](api/java/org/apache/spark/ml/feature/DCT.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1018,8 +1079,8 @@ index `2`.

<div data-lang="scala" markdown="1">
-[`StringIndexer`](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer) takes an input
-column name and an output column name.
+Refer to the [StringIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StringIndexer)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.StringIndexer
@@ -1036,8 +1097,9 @@ indexed.show()
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [StringIndexer Java docs](api/java/org/apache/spark/ml/feature/StringIndexer.html)
+for more details on the API.

{% highlight java %}
import java.util.Arrays;
@@ -1074,8 +1136,8 @@ indexed.show();

<div data-lang="python" markdown="1">
-[`StringIndexer`](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer) takes an input
-column name and an output column name.
+Refer to the [StringIndexer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StringIndexer)
+for more details on the API.
{% highlight python %}
from pyspark.ml.feature import StringIndexer
@@ -1096,6 +1158,10 @@ indexed.show()

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.{OneHotEncoder, StringIndexer}
@@ -1122,6 +1188,10 @@ encoded.select("id", "categoryVec").foreach(println)
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [OneHotEncoder Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoder.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1164,6 +1234,10 @@ DataFrame encoded = encoder.transform(indexed);
</div>

<div data-lang="python" markdown="1">
+
+Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import OneHotEncoder, StringIndexer
@@ -1197,12 +1271,14 @@ It can both automatically decide which features are categorical and convert orig
Indexing categorical features allows algorithms such as Decision Trees and Tree Ensembles to treat categorical features appropriately, improving performance.

-Please refer to the [VectorIndexer API docs](api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer) for more details.
-
In the example below, we read in a dataset of labeled points and then use `VectorIndexer` to decide which features should be treated as categorical. We transform the categorical feature values to their indices. This transformed data could then be passed to algorithms such as `DecisionTreeRegressor` that handle categorical features.

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [VectorIndexer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorIndexer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.VectorIndexer
@@ -1223,6 +1299,10 @@ val indexedData = indexerModel.transform(data)
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [VectorIndexer Java docs](api/java/org/apache/spark/ml/feature/VectorIndexer.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Map;
@@ -1250,6 +1330,10 @@ DataFrame indexedData = indexerModel.transform(data);
</div>

<div data-lang="python" markdown="1">
+
+Refer to the [VectorIndexer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorIndexer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import VectorIndexer
@@ -1273,6 +1357,10 @@ The following example demonstrates how to load a dataset in libsvm format and th

<div class="codetabs">
<div data-lang="scala">
+
+Refer to the [Normalizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Normalizer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.Normalizer
@@ -1292,6 +1380,10 @@ val lInfNormData = normalizer.transform(dataFrame, normalizer.p -> Double.Positi
</div>

<div data-lang="java">
+
+Refer to the [Normalizer Java docs](api/java/org/apache/spark/ml/feature/Normalizer.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.ml.feature.Normalizer;
import org.apache.spark.sql.DataFrame;
@@ -1313,6 +1405,10 @@ DataFrame lInfNormData =
</div>

<div data-lang="python">
+
+Refer to the [Normalizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Normalizer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import Normalizer
@@ -1341,14 +1437,14 @@ lInfNormData = normalizer.transform(dataFrame, {normalizer.p: float("inf")})
Note that if the standard deviation of a feature is zero, it will return a default value of `0.0` in the `Vector` for that feature.

-More details can be found in the API docs for
-[StandardScaler](api/scala/index.html#org.apache.spark.ml.feature.StandardScaler) and
-[StandardScalerModel](api/scala/index.html#org.apache.spark.ml.feature.StandardScalerModel).
-
The following example demonstrates how to load a dataset in libsvm format and then normalize each feature to have unit standard deviation.

<div class="codetabs">
<div data-lang="scala">
+
+Refer to the [StandardScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.StandardScaler)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.StandardScaler
@@ -1369,6 +1465,10 @@ val scaledData = scalerModel.transform(dataFrame)
</div>

<div data-lang="java">
+
+Refer to the [StandardScaler Java docs](api/java/org/apache/spark/ml/feature/StandardScaler.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.feature.StandardScalerModel;
@@ -1391,6 +1491,10 @@ DataFrame scaledData = scalerModel.transform(dataFrame);
</div>

<div data-lang="python">
+
+Refer to the [StandardScaler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.StandardScaler)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import StandardScaler
@@ -1429,9 +1533,11 @@ The following example demonstrates how to load a dataset in libsvm format and th

<div class="codetabs">
<div data-lang="scala" markdown="1">
-More details can be found in the API docs for
-[MinMaxScaler](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and
-[MinMaxScalerModel](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel).
+
+Refer to the [MinMaxScaler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)
+and the [MinMaxScalerModel Scala docs](api/scala/index.html#org.apache.spark.ml.feature.MinMaxScalerModel)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.MinMaxScaler
@@ -1450,9 +1556,11 @@ val scaledData = scalerModel.transform(dataFrame)
</div>

<div data-lang="java" markdown="1">
-More details can be found in the API docs for
-[MinMaxScaler](api/java/org/apache/spark/ml/feature/MinMaxScaler.html) and
-[MinMaxScalerModel](api/java/org/apache/spark/ml/feature/MinMaxScalerModel.html).
+
+Refer to the [MinMaxScaler Java docs](api/java/org/apache/spark/ml/feature/MinMaxScaler.html)
+and the [MinMaxScalerModel Java docs](api/java/org/apache/spark/ml/feature/MinMaxScalerModel.html)
+for more details on the API.
+
{% highlight java %}
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.ml.feature.MinMaxScaler;
@@ -1490,6 +1598,10 @@ The following example demonstrates how to bucketize a column of `Double`s into a

<div class="codetabs">
<div data-lang="scala">
+
+Refer to the [Bucketizer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.Bucketizer)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.Bucketizer
import org.apache.spark.sql.DataFrame
@@ -1510,6 +1622,10 @@ val bucketedData = bucketizer.transform(dataFrame)
</div>

<div data-lang="java">
+
+Refer to the [Bucketizer Java docs](api/java/org/apache/spark/ml/feature/Bucketizer.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1545,6 +1661,10 @@ DataFrame bucketedData = bucketizer.transform(dataFrame);
</div>

<div data-lang="python">
+
+Refer to the [Bucketizer Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.Bucketizer)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import Bucketizer
@@ -1581,14 +1701,14 @@ v_N
\end{pmatrix}
\]`

-[`ElementwiseProduct`](api/scala/index.html#org.apache.spark.ml.feature.ElementwiseProduct) takes the following parameter:
-
-* `scalingVec`: the transforming vector.
-
The example below demonstrates how to transform vectors using a transforming vector value.

<div class="codetabs">
<div data-lang="scala" markdown="1">
+
+Refer to the [ElementwiseProduct Scala docs](api/scala/index.html#org.apache.spark.ml.feature.ElementwiseProduct)
+for more details on the API.
+
{% highlight scala %}
import org.apache.spark.ml.feature.ElementwiseProduct
import org.apache.spark.mllib.linalg.Vectors
@@ -1611,6 +1731,10 @@ transformer.transform(dataFrame).show()
</div>

<div data-lang="java" markdown="1">
+
+Refer to the [ElementwiseProduct Java docs](api/java/org/apache/spark/ml/feature/ElementwiseProduct.html)
+for more details on the API.
+
{% highlight java %}
import java.util.Arrays;
@@ -1649,6 +1773,10 @@ transformer.transform(dataFrame).show();
</div>

<div data-lang="python" markdown="1">
+
+Refer to the [ElementwiseProduct Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.ElementwiseProduct)
+for more details on the API.
+
{% highlight python %}
from pyspark.ml.feature import ElementwiseProduct
from pyspark.mllib.linalg import Vectors
@@ -1702,8 +1830,8 @@ output column to `features`, after transformation we should get the following Da

<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`VectorAssembler`](api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler) takes an array
-of input column names and an output column name.
+Refer to the [VectorAssembler Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorAssembler)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
@@ -1722,8 +1850,8 @@ println(output.select("features", "clicked").first())

<div data-lang="java" markdown="1">
-[`VectorAssembler`](api/java/org/apache/spark/ml/feature/VectorAssembler.html) takes an array
-of input column names and an output column name.
+Refer to the [VectorAssembler Java docs](api/java/org/apache/spark/ml/feature/VectorAssembler.html)
+for more details on the API.
{% highlight java %}
import java.util.Arrays;
@@ -1759,8 +1887,8 @@ System.out.println(output.select("features", "clicked").first());

<div data-lang="python" markdown="1">
-[`VectorAssembler`](api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler) takes a list
-of input column names and an output column name.
+Refer to the [VectorAssembler Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler)
+for more details on the API.

{% highlight python %}
from pyspark.mllib.linalg import Vectors
@@ -1836,8 +1964,8 @@ Suppose also that we have a potential input attributes for the `userFeatures`, i

<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`VectorSlicer`](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer) takes an input
-column name with specified indices or names and an output column name.
+Refer to the [VectorSlicer Scala docs](api/scala/index.html#org.apache.spark.ml.feature.VectorSlicer)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.mllib.linalg.Vectors
@@ -1870,8 +1998,8 @@ println(output.select("userFeatures", "features").first())

<div data-lang="java" markdown="1">
-[`VectorSlicer`](api/java/org/apache/spark/ml/feature/VectorSlicer.html) takes an input column name
-with specified indices or names and an output column name.
+Refer to the [VectorSlicer Java docs](api/java/org/apache/spark/ml/feature/VectorSlicer.html)
+for more details on the API.

{% highlight java %}
import java.util.Arrays;
@@ -1941,7 +2069,8 @@ id | country | hour | clicked | features | label

<div class="codetabs">
<div data-lang="scala" markdown="1">
-[`RFormula`](api/scala/index.html#org.apache.spark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+Refer to the [RFormula Scala docs](api/scala/index.html#org.apache.spark.ml.feature.RFormula)
+for more details on the API.

{% highlight scala %}
import org.apache.spark.ml.feature.RFormula
@@ -1962,7 +2091,8 @@ output.select("features", "label").show()

<div data-lang="java" markdown="1">
-[`RFormula`](api/java/org/apache/spark/ml/feature/RFormula.html) takes an R formula string, and optional parameters for the names of its output columns.
+Refer to the [RFormula Java docs](api/java/org/apache/spark/ml/feature/RFormula.html)
+for more details on the API.

{% highlight java %}
import java.util.Arrays;
@@ -2000,7 +2130,8 @@ output.select("features", "label").show();

<div data-lang="python" markdown="1">
-[`RFormula`](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula) takes an R formula string, and optional parameters for the names of its output columns.
+Refer to the [RFormula Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.RFormula)
+for more details on the API.

{% highlight python %}
from pyspark.ml.feature import RFormula
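
Taken together, the patch applies one convention across every feature transformer section: each language tab gets a short "Refer to the ... docs" line pointing at the language-specific API page, in place of the assorted per-tab descriptions. As a rough sketch of the resulting tab layout (the transformer name `SomeFeature` is a placeholder for illustration, not something added by this patch), a Scala tab now reads:

    <div data-lang="scala" markdown="1">

    Refer to the [SomeFeature Scala docs](api/scala/index.html#org.apache.spark.ml.feature.SomeFeature)
    for more details on the API.

    {% highlight scala %}
    import org.apache.spark.ml.feature.SomeFeature
    // example code for the transformer goes here
    {% endhighlight %}
    </div>

The Java and Python tabs follow the same shape, linking to api/java/org/apache/spark/ml/feature/SomeFeature.html and api/python/pyspark.ml.html#pyspark.ml.feature.SomeFeature respectively.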