## What changes were proposed in this pull request?
PySpark ML Params setter code clean up.
For example,
```setInputCol``` can be simplified from
```
self._set(inputCol=value)
return self
```
to:
```
return self._set(inputCol=value)
```
This is a pretty big sweep, and we cleaned up wherever possible.
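A minimal, self-contained sketch of the pattern (the class and `_set` here are simplified stand-ins for the real pyspark.ml machinery):

```python
class SimpleParams:
    """Simplified stand-in for a pyspark.ml Params class."""

    def __init__(self):
        self._paramMap = {}

    def _set(self, **kwargs):
        # store the params and return self, so setters can be one-liners
        self._paramMap.update(kwargs)
        return self

    def setInputCol(self, value):
        # the cleaned-up form: no separate `return self` needed
        return self._set(inputCol=value)
```

Because `_set` returns `self`, setter calls still chain exactly as before.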
## How was this patch tested?
Existing unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12749 from yanboliang/spark-14971.
***
Spark and PySpark - update
## What changes were proposed in this pull request?
This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
* This includes changing classes taking weightCol to treat unset and empty String Param values the same way.
Defaults changed:
* Scala
* LogisticRegression: weightCol defaults to not set (instead of empty string)
* StringIndexer: labels default to not set (instead of empty array)
* GeneralizedLinearRegression:
* maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
* weightCol defaults to not set (instead of empty string)
* LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
* MultilayerPerceptron: layers default to not set (instead of [1,1])
* ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)
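The weightCol change can be illustrated with a hypothetical helper (not the actual Spark code) that treats unset and empty-string values identically:

```python
def effective_weight_col(param_map, name="weightCol"):
    # after this change, an unset weightCol and an empty-string weightCol
    # mean the same thing: no instance weighting
    value = param_map.get(name)
    return None if value in (None, "") else value
```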
## How was this patch tested?
Generic unit test. Manually tested the unit test by changing defaults and verifying that doing so broke the test.
Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>
Closes #12816 from jkbradley/yinxusen-SPARK-14931.
***
Word2VecModel
## What changes were proposed in this pull request?
This PR fixes the bug that generates infinite distances between word vectors. For example,
Before this PR,
```
val synonyms = model.findSynonyms("who", 40)
```
would give the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))
scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))
scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```
### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by the number of iterations, to be consistent with the Google Word2Vec implementation
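The first change can be sketched in plain Python (function names are illustrative, not the Spark internals): normalizing each vector once up front bounds every pairwise dot product, so no distance can blow up to Infinity:

```python
import math

def normalize(vec):
    # unit-normalize so dot products become cosine similarities in [-1, 1]
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def find_synonyms(vectors, words, query, num):
    # vectors are normalized once; all subsequent distances stay bounded
    vecs = [normalize(v) for v in vectors]
    q = vecs[words.index(query)]
    sims = [sum(a * b for a, b in zip(v, q)) for v in vecs]
    ranked = sorted(range(len(words)), key=lambda i: -sims[i])
    return [(words[i], sims[i]) for i in ranked if words[i] != query][:num]
```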
## How was this patch tested?
Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors.
Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>
Closes #11812 from flyjy/TVec.
***
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.
## How was this patch tested?
Unit tests.
cc jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12702 from yanboliang/spark-14899.
***
## What changes were proposed in this pull request?
As discussed at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it the default hash algorithm. We should also expose a set/get API for ```hashAlgorithm``` so users can choose the hash method.
Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in follow-up work.
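A rough sketch of the set/get API shape (a simplified stand-in, not the actual HashingTF; the real MurmurHash3 is stubbed here with Python's built-in hash):

```python
class HashingTFSketch:
    SUPPORTED = ("murmur3", "native")

    def __init__(self, numFeatures=1 << 18, hashAlgorithm="murmur3"):
        self.numFeatures = numFeatures
        self.setHashAlgorithm(hashAlgorithm)

    def setHashAlgorithm(self, value):
        if value not in self.SUPPORTED:
            raise ValueError("unsupported hash algorithm: %r" % (value,))
        self.hashAlgorithm = value
        return self

    def getHashAlgorithm(self):
        return self.hashAlgorithm

    def indexOf(self, term):
        # placeholder: the real class dispatches on the chosen algorithm
        return hash((self.hashAlgorithm, term)) % self.numFeatures
```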
## How was this patch tested?
unit tests.
cc jkbradley MLnick
Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12498 from yanboliang/spark-10574.
***
## What changes were proposed in this pull request?
#11939 made Python param setters use the `_set` method. This PR fixes the setters that were missed.
## How was this patch tested?
Existing tests.
cc jkbradley sethah
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #12531 from yanboliang/setters-omissive.
***
## What changes were proposed in this pull request?
This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
- ContinuousQuery
- Trigger
- ProcessingTime
in pyspark under `pyspark.sql.streaming`.
In addition, it contains the new methods added under:
- `DataFrameWriter`
a) `startStream`
b) `trigger`
c) `queryName`
- `DataFrameReader`
a) `stream`
- `DataFrame`
a) `isStreaming`
This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
- `exception`
- `sourceStatuses`
- `sinkStatus`
They may be added in a follow up.
This PR also contains some very minor doc fixes in the Scala side.
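The trigger classes can be sketched in plain Python (a simplified illustration of the intended shape, not the actual pyspark.sql.streaming code):

```python
class Trigger:
    """Base class for triggers that control how often a query fires."""

class ProcessingTime(Trigger):
    def __init__(self, interval):
        # interval is a duration string such as "5 seconds"
        if not isinstance(interval, str) or not interval.strip():
            raise ValueError("interval should be a non-empty duration string")
        self.interval = interval.strip()
```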
## How was this patch tested?
Python doc tests
TODO:
- [ ] verify Python docs look good
Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>
Closes #12320 from brkyvz/stream-python.
***
Param constructor
## What changes were proposed in this pull request?
PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.
This PR changes all usages in pyspark/ml/ to keyword args.
## How was this patch tested?
Existing unit tests. I will not test type conversion for every Param unless we really think it necessary.
Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
```
/Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
"Use typeConverter instead, as a keyword argument.")
```
That warning came from the typeConverter argument being passed as the expectedType arg by mistake.
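A stripped-down illustration of why the keyword matters (a simplified `Param`; the real class lives in pyspark.ml.param):

```python
import warnings

class Param:
    """Simplified stand-in for pyspark.ml.param.Param."""

    def __init__(self, parent, name, doc, expectedType=None, typeConverter=None):
        # a positional 4th argument lands in expectedType, not typeConverter --
        # exactly the mistake this PR fixes by switching to keyword args
        if expectedType is not None:
            warnings.warn("expectedType is deprecated and will be removed in "
                          "2.1. Use typeConverter instead, as a keyword argument.")
        self.parent, self.name, self.doc = parent, name, doc
        self.typeConverter = typeConverter

# wrong: Param(None, "k", "doc", int)               -> deprecation warning
# right: Param(None, "k", "doc", typeConverter=int) -> no warning
```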
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12480 from jkbradley/typeconverter-fix.
***
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib
## How was this patch tested?
Added test cases in tests.py under both ML and MLlib
Author: Jason Lee <cjlee@us.ibm.com>
Closes #12428 from jasoncl/SPARK-14564.
***
## What changes were proposed in this pull request?
Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.
Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for the StringIndexer `handleInvalid` param in a previous PR. This is fixed here.
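The `_set`-only pattern can be sketched as follows (simplified: converters are keyed by param name here, whereas the real `_set` resolves `Param` objects on the instance):

```python
class ParamsSketch:
    """Simplified stand-in for a Params class with type-checked _set."""

    def __init__(self, converters):
        self._converters = converters
        self._paramMap = {}

    def _set(self, **kwargs):
        # every write goes through _set, so every value gets type-checked
        for name, value in kwargs.items():
            converter = self._converters.get(name)
            if converter is not None:
                try:
                    value = converter(value)
                except (TypeError, ValueError):
                    raise TypeError("Invalid value %r for param %s" % (value, name))
            self._paramMap[name] = value
        return self
```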
## How was this patch tested?
Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #11939 from sethah/SPARK-14104.
***
## What changes were proposed in this pull request?
The default stopwords were a Java object. They are no longer.
## How was this patch tested?
Unit test which failed before the fix
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #12422 from jkbradley/pyspark-stopwords.
***
HashingTF in ML & MLlib
## What changes were proposed in this pull request?
This fix adds a binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, all non-zero counts are set to 1.
Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.
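The toggle's effect on a term-frequency map, as a hypothetical pure-Python helper (illustration only, not the Spark code):

```python
def apply_binary_toggle(term_counts, binary):
    # with the binary toggle on, every non-zero count becomes 1.0
    if binary:
        return {idx: 1.0 for idx, count in term_counts.items() if count != 0}
    return dict(term_counts)
```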
## How was this patch tested?
This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.
Author: Yong Tang <yong.tang.github@outlook.com>
Closes #12079 from yongtang/SPARK-14238.
***
Added binary toggle param to CountVectorizer feature transformer in PySpark.
Created a unit test for using CountVectorizer with the binary toggle on.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
***
## What changes were proposed in this pull request?
This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.
This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.
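A sketch of the factory-style converters (a small, simplified subset; the real `TypeConverters` covers many more types):

```python
class TypeConverters:
    """Illustrative subset of factory type converters."""

    @staticmethod
    def toInt(value):
        if isinstance(value, bool):
            raise TypeError("Could not convert %r to int" % (value,))
        if isinstance(value, int):
            return int(value)
        if isinstance(value, float) and value.is_integer():
            return int(value)
        raise TypeError("Could not convert %r to int" % (value,))

    @staticmethod
    def toFloat(value):
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            return float(value)
        raise TypeError("Could not convert %r to float" % (value,))
```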
## How was this patch tested?
Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #11663 from sethah/SPARK-13068-tc.
***
Adds support for saving and loading nested ML Pipelines from Python. Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.
Also:
* Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
* Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java
Added new unit test for nested Pipelines. Abstracted validity check into a helper method for the 2 unit tests.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #11866 from jkbradley/nested-pipeline-io.
Closes #11835
***
## What changes were proposed in this pull request?
https://issues.apache.org/jira/browse/SPARK-13993
## How was this patch tested?
doctest
Author: Xusen Yin <yinxusen@gmail.com>
Closes #11807 from yinxusen/SPARK-13993.
***
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.
In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #11203 from yinxusen/SPARK-13036.
***
This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top. Since we keep it alphabetized, it often creates a lot more changes than needed. It is also easy to add the Estimator and forget the Model. I'm going to switch it to have one algorithm per line.
This also alphabetizes a few out-of-place classes in pyspark.ml.feature. No changes have been made to the moved classes.
CC: thunterdb
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #10927 from jkbradley/ml-python-all-list.
***
## What changes were proposed in this pull request?
After SPARK-13028, we should add Python API for MaxAbsScaler.
## How was this patch tested?
unit test
Author: zlpmichelle <zlpmichelle@gmail.com>
Closes #11393 from zlpmichelle/master.
***
PySpark
## What changes were proposed in this pull request?
QuantileDiscretizer in Python should also specify a random seed.
## How was this patch tested?
unit tests
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits:
02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
***
and other tuning for Word2Vec
Add support for arbitrary-length sentences by using the natural representation of sentences in the input.
Add new similarity functions and a normalization option for distances in synonym finding.
Add new accessors for internal structures (the vocabulary and word index) for convenience.
Need instructions on how to set the value of the Since annotation for newly added public functions. 1.5.3?
jira link: https://issues.apache.org/jira/browse/SPARK-12153
Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>
Closes #10152 from ygcao/improvementForSentenceBoundary.
***
https://issues.apache.org/jira/browse/SPARK-12780
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10724 from yinxusen/SPARK-12780.
***
The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).
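The idea, reduced to a sketch (a simplified `Param` and a hypothetical estimator; attribute names are illustrative):

```python
import copy

class Param:
    def __init__(self, parent, name, doc):
        self.parent, self.name, self.doc = parent, name, doc

    def _copy_new_parent(self, parent):
        # rebind a class-level param to an instance, so the declaration
        # does not have to be cut-and-pasted into __init__
        param = copy.copy(self)
        param.parent = parent
        return param

class MyEstimator:
    # declared once, at class level, with a placeholder parent
    maxIter = Param(None, "maxIter", "max number of iterations")

    def __init__(self):
        # copied onto the instance instead of being re-declared
        self.maxIter = MyEstimator.maxIter._copy_new_parent(self)
```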
Author: Holden Karau <holden@us.ibm.com>
Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
***
https://issues.apache.org/jira/browse/SPARK-11923
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10186 from yinxusen/SPARK-11923.
***
Add Python API for ml.feature.QuantileDiscretizer.
One open question: do we want to re-use the Java model, create a new model, or use a different wrapper around the Java model?
cc brkyvz & mengxr
Author: Holden Karau <holden@us.ibm.com>
Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
***
```PCAModel``` can output ```explainedVariance``` on the Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10830 from yanboliang/spark-12905.
***
Spark 1.6 QA
Add PySpark missing methods and params for ml.feature:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support outputting ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support outputting ```pc```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9908 from yanboliang/spark-11925.
***
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
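To illustrate the issue (this is not the actual Scala fix): if only the low 32 bits of the seed get mixed, the upper half of the resulting long is far from random. One way to spread entropy over all 64 bits is to hash the full 8-byte representation:

```python
import hashlib
import struct

def hash_seed(seed):
    # hash all eight bytes of the long so every output bit depends on
    # the whole seed, not just the low word
    data = struct.pack(">q", seed)
    digest = hashlib.md5(data).digest()
    return struct.unpack(">q", digest[:8])[0]
```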
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes #8314 from squito/SPARK-10116.
***
and pyspark.ml.*
Author: lihao <lihaowhu@gmail.com>
Closes #9275 from lidinghao/SPARK-10286.
***
Without an empty line, Sphinx will treat the doctest as part of the docstring. holdenk
~~~
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
~~~
Author: Xiangrui Meng <meng@databricks.com>
Closes #9188 from mengxr/py-count-vec-doc-fix.
***
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.
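The `:` interaction itself amounts to an outer product of the participating feature vectors, flattened; a pure-Python sketch (illustrative, not the Spark transformer):

```python
def interact(*feature_vectors):
    # the interaction of several vectors is their outer product, flattened;
    # for one-hot encoded factors this yields the usual R-style crossing
    out = [1.0]
    for vec in feature_vectors:
        out = [o * v for o in out for v in vec]
    return out
```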
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
***
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
***
jira: https://issues.apache.org/jira/browse/SPARK-8530
add python API for MinMaxScaler
jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514
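The rescaling behind MinMaxScaler, as a standalone per-value sketch (the real transformer works column-wise on vectors):

```python
def min_max_scale(x, col_min, col_max, lo=0.0, hi=1.0):
    # rescale a value into [lo, hi]; a constant column maps to the midpoint
    if col_max == col_min:
        return (lo + hi) / 2.0
    return lo + (x - col_min) * (hi - lo) / (col_max - col_min)
```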
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7150 from hhbyyh/pythonMinMax.
***
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8679 from jkbradley/doc-fixes-1.5.
***
Missing methods of ml.feature are listed here:
```StringIndexer``` lacks the parameter ```handleInvalid```.
```StringIndexerModel``` lacks the method ```labels```.
```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8313 from yanboliang/spark-10027.
***
Add Python API for ml.feature.VectorSlicer.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8102 from yanboliang/SPARK-9772.
***
Adds IndexToString to PySpark.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
***
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
Author: noelsmith <mail@noelsmith.com>
Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
***
Add a python API for the Stop Words Remover.
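The transformer's core behavior, sketched without Spark (the case-insensitive default assumed here matches the usual setup):

```python
def remove_stop_words(tokens, stop_words, case_sensitive=False):
    # drop every token that appears in the stop-word list
    if case_sensitive:
        stops = set(stop_words)
        return [t for t in tokens if t not in stops]
    stops = {w.lower() for w in stop_words}
    return [t for t in tokens if t.lower() not in stops]
```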
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
***
Add Python API for SQLTransformer
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8527 from yanboliang/spark-10355.
***
Add Python API for ml.feature.DCT.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8485 from yanboliang/spark-8472.
***
ml.feature.ElementwiseProduct
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
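The transformer's math is just a Hadamard (element-wise) product; a minimal sketch:

```python
def elementwise_product(vector, scaling_vector):
    # multiply each element by the corresponding scaling weight
    if len(vector) != len(scaling_vector):
        raise ValueError("vectors must have the same length")
    return [v * w for v, w in zip(vector, scaling_vector)]
```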
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8061 from yanboliang/SPARK-9768.
***
Check and add missing docs for PySpark ML (this issue only checks missing docs for o.a.s.ml, not o.a.s.mllib).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8059 from yanboliang/SPARK-9766.
***
After https://github.com/apache/spark/pull/7263 it is pretty straightforward to add Python wrappers.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #7930 from MechCoder/spark-9533 and squashes the following commits:
1bea394 [MechCoder] make getVectors a lazy val
5522756 [MechCoder] [SPARK-9533] [PySpark] [ML] Add missing methods in Word2Vec ML
***
Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder
Author: Xiangrui Meng <meng@databricks.com>
Closes #7879 from mengxr/SPARK-9544 and squashes the following commits:
3d5ff03 [Xiangrui Meng] add an doctest for . and -
5e969a5 [Xiangrui Meng] fix pydoc
1cd41f8 [Xiangrui Meng] organize imports
3c18b10 [Xiangrui Meng] add Python API for RFormula
***
This is #7791 for Python. hhbyyh
Author: Xiangrui Meng <meng@databricks.com>
Closes #7798 from mengxr/regex-tok-py and squashes the following commits:
baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
***
Add Python API for PCA transformer
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7190 from yanboliang/spark-8792 and squashes the following commits:
8f4ac31 [Yanbo Liang] address comments
8a79cc0 [Yanbo Liang] Add Python API for PCA transformer
***
Add std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:
9fbae90 [MechCoder] Add type
6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
***
Python API for N-gram feature transformer
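The n-gram featurization itself is a sliding window over a token sequence; a minimal sketch:

```python
def ngrams(tokens, n=2):
    # slide a window of n tokens and join each window with spaces
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```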
Author: Feynman Liang <fliang@databricks.com>
Closes #6960 from feynmanliang/ngram-featurizer-python and squashes the following commits:
f9e37c9 [Feynman Liang] Remove debugging code
4dd81f4 [Feynman Liang] Fix typo and doctest
06c79ac [Feynman Liang] Style guide
26c1175 [Feynman Liang] Add python NGram API
***
attributes and change includeFirst to dropLast
This PR contains two major changes to `OneHotEncoder`:
1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default as `true`. There are a couple of benefits:
a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
c. If users use `StringIndexer`, the last element is the least frequent one.
Sorry for including two changes in one PR! I'll update the user guide in another PR.
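The `dropLast` encoding can be sketched as follows (an illustration of the semantics described above, not the Spark implementation):

```python
def one_hot(category_index, num_categories, drop_last=True):
    # with drop_last, the last category becomes the all-zeros vector and
    # the indices of the remaining categories are left unmodified
    size = num_categories - 1 if drop_last else num_categories
    vec = [0.0] * size
    if category_index < size:
        vec[category_index] = 1.0
    return vec
```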
jkbradley sryza
Author: Xiangrui Meng <meng@databricks.com>
Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:
a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast