| Commit message | Author | Age | Files | Lines |
| |
## What changes were proposed in this pull request?
This PR fixes an issue where Bucketizer fails on a dataset containing NaN values.
NaN values can sometimes be useful to users, so in these cases Bucketizer should
reserve one extra bucket for NaN values instead of throwing an exception.
Before:
```
Bucketizer.transform on a NaN value threw an exception.
```
After:
```
NaN values will be grouped in an extra bucket.
```
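A minimal plain-Python sketch of the behaviour described above, not Spark's actual implementation. The function name `bucketize` and the `keep_nan` flag are hypothetical, introduced only for illustration; the sketch assumes `n + 1` sorted split points define `n` regular buckets, with NaN mapped to one extra bucket at index `n`.

```python
import bisect
import math


def bucketize(value, splits, keep_nan=True):
    """Return the bucket index of `value` for sorted split points `splits`.

    Hypothetical helper: buckets are [splits[i], splits[i + 1]), except the
    last one, which also includes its upper bound. NaN goes to an extra
    bucket whose index equals the number of regular buckets.
    """
    n_buckets = len(splits) - 1
    if math.isnan(value):
        if keep_nan:
            return n_buckets  # the extra bucket reserved for NaN
        raise ValueError("NaN value encountered")
    if value < splits[0] or value > splits[-1]:
        raise ValueError("value %r is outside the split range" % value)
    if value == splits[-1]:
        return n_buckets - 1  # the upper bound belongs to the last bucket
    return bisect.bisect_right(splits, value) - 1


splits = [float("-inf"), 0.0, 10.0, float("inf")]
print([bucketize(v, splits) for v in (-5.0, 0.0, 10.0, float("nan"))])
```

With `keep_nan=False` the sketch raises instead, which mirrors the "Before" behaviour.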
## How was this patch tested?
New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xie@intel.com>
Author: VinceShieh <vincent.xie@intel.com>
Closes #14858 from VinceShieh/spark-17219.
| |
withMean=True
## What changes were proposed in this pull request?
Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
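To illustrate why centering matters for sparse input, here is a small plain-Python sketch, not the StandardScaler code. The helper `center_sparse` and its dict-based sparse representation are assumptions made for this example. Subtracting a per-feature mean generally turns implicit zeros into non-zero values, so the result is dense.

```python
def center_sparse(size, entries, means):
    """Subtract per-feature means from a sparse vector.

    Hypothetical representation: `entries` maps index -> non-zero value and
    every other position is an implicit 0.0. The result is a dense list,
    because 0.0 - mean is usually non-zero.
    """
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return [x - m for x, m in zip(dense, means)]


# One stored non-zero out of four features before centering.
print(center_sparse(4, {1: 2.0}, [0.5, 1.0, 0.0, 0.25]))
```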
## How was this patch tested?
Jenkins tests, including new cases to reflect the new behavior.
Author: Sean Owen <sowen@cloudera.com>
Closes #14663 from srowen/SPARK-17001.
| |
## What changes were proposed in this pull request?
The ```ml.feature.Normalizer``` examples illustrate the L1 norm rather than L2, so we should correct the corresponding documentation.
![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png)
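For reference, the difference between the two norms can be sketched in plain Python. This is an illustration only, not the `Normalizer` implementation; the function `normalize` is a hypothetical helper.

```python
def normalize(vec, p):
    """Scale `vec` to unit p-norm (p=1 gives L1, p=2 gives L2)."""
    norm = sum(abs(x) ** p for x in vec) ** (1.0 / p)
    if norm == 0:
        return list(vec)  # leave an all-zero vector unchanged
    return [x / norm for x in vec]


vec = [3.0, 4.0]
print(normalize(vec, 1))  # entries sum to 1 in absolute value
print(normalize(vec, 2))  # squared entries sum to 1
```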
## How was this patch tested?
Doc change, no test.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #14787 from yanboliang/normalizer.
| |
## What changes were proposed in this pull request?
Fixed several inline formatting issues in the ML features doc.
Before:
<img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png">
After:
<img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png">
## How was this patch tested?
Generated the docs locally with `SKIP_API=1 jekyll build` and viewed them in the browser.
Author: Shuai Lin <linshuai2012@gmail.com>
Closes #14194 from lins05/fix-docs-formatting.
| |
## What changes were proposed in this pull request?
Made DataFrame-based API primary
* Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
* mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
* ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
* **Reviewers: please check this carefully**
* (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
* Moved migration guide to ml-guide from mllib-guide
* Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
* **Reviewers**: I did not change any of the content of the migration guides.
Reorganized DataFrame-based guide:
* ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
* Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
* **Reviewers**: I did not change the content of these guides, except some intro text.
* Sidebar remains the same, but with pipeline and tuning sections added
Other:
* ml-classification-regression.html: Moved text about linear methods to new section in page
## How was this patch tested?
Generated docs locally
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #14213 from jkbradley/ml-guide-2.0.
| |
and CountVectorizer
## What changes were proposed in this pull request?
Made changes to HashingTF, QuantileVectorizer and CountVectorizer
Author: GayathriMurali <gayathri.m@intel.com>
Closes #13745 from GayathriMurali/SPARK-15997.
| |
binarizer
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-16045
2.0 Audit: Update document for StopWordsRemover and Binarizer.
## How was this patch tested?
manual review for doc
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #13375 from hhbyyh/stopdoc.
| |
## What changes were proposed in this pull request?
Correct some typos and incorrectly worded sentences.
## How was this patch tested?
Doc changes only.
Note that many of these changes were identified by whomfire01
Author: sethah <seth.hendrickson16@gmail.com>
Closes #13180 from sethah/ml_guide_audit.
| |
## What changes were proposed in this pull request?
We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.
## How was this patch tested?
manual review for doc.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #12957 from hhbyyh/tfidfdoc.
| |
## What changes were proposed in this pull request?
Fixed some minor errors found when reviewing feature.ml user guide
## How was this patch tested?
built docs locally
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #12940 from BryanCutler/feature.ml-doc_fixes-DOCS-MINOR.
| |
## What changes were proposed in this pull request?
Add the missing python example for QuantileDiscretizer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12281 from zhengruifeng/discret_pe.
| |
## What changes were proposed in this pull request?
Add the missing python example for VectorSlicer
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12282 from zhengruifeng/vecslicer_pe.
| |
## What changes were proposed in this pull request?
Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.
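The two ways of producing term-frequency vectors can be contrasted with a rough plain-Python sketch. It does not use the Spark classes; `hashing_tf` and `count_vectorize` are hypothetical helpers, and `zlib.crc32` merely stands in for whatever hash function the real transformer uses.

```python
import zlib
from collections import Counter


def hashing_tf(tokens, num_features):
    """Hashing approach: map each token to an index via a hash.

    No vocabulary is needed, but distinct tokens may collide.
    """
    counts = Counter()
    for token in tokens:
        index = zlib.crc32(token.encode("utf-8")) % num_features
        counts[index] += 1
    return dict(counts)


def count_vectorize(tokens, vocabulary):
    """Vocabulary approach: count only the tokens present in `vocabulary`."""
    index = {term: i for i, term in enumerate(vocabulary)}
    counts = Counter(index[t] for t in tokens if t in index)
    return dict(counts)


tokens = ["spark", "ml", "spark", "docs"]
print(hashing_tf(tokens, 16))
print(count_vectorize(tokens, ["spark", "ml"]))
```

Either output can then be fed to an IDF step; the vocabulary-based variant keeps the index-to-term mapping, while the hashing variant does not.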
## How was this patch tested?
unit tests and doc generation
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #12454 from hhbyyh/tfdoc.
| |
## What changes were proposed in this pull request?
Add the missing python example for ChiSqSelector
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12283 from zhengruifeng/chi2_pe.
| |
## What changes were proposed in this pull request?
Add python CountVectorizerExample
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #11917 from zhengruifeng/cv_pe.
| |
## What changes were proposed in this pull request?
add three python examples
## How was this patch tested?
manual tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #12063 from zhengruifeng/dct_pe.
| |
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13512
Add example and doc for ml.feature.MaxAbsScaler.
## How was this patch tested?
unit tests
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #11392 from hhbyyh/maxabsdoc.
| |
in other comments
## What changes were proposed in this pull request?
This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.
## How was this patch tested?
manual tests.
Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #11300 from dongjoon-hyun/minor_fix_typos.
| |
Update the user guide for RFormula feature interactions. We also document other new features in Spark 1.6, such as support for string labels.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10222 from yanboliang/spark-11965.
| |
Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10282 from BenFradet/SPARK-12199.
| |
https://issues.apache.org/jira/browse/SPARK-12199
Follow-up PR of SPARK-11551. Fix some errors in ml-features.md
mengxr
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10193 from yinxusen/SPARK-12199.
| |
Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation.
I wonder if I should also add a snippet to the code example, input welcome.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10257 from BenFradet/SPARK-12217.
| |
spark.mllib and mllib in the documentation.
Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in Spark).
It also removes some files that I forgot to delete with #10207
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10234 from thunterdb/12212.
| |
PR on behalf of somideshmukh, thanks!
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10219 from yinxusen/SPARK-11551.
| |
This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested.
<img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png">
Author: Timothy Hunter <timhunter@databricks.com>
Closes #10207 from thunterdb/spark-8517.
| |
Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10166 from BenFradet/SPARK-12159.
| |
This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819.
The original PR wasn't tested on Jenkins before being merged.
Author: Cheng Lian <lian@databricks.com>
Closes #10200 from liancheng/revert-pr-10002.
| |
Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10006 from yanboliang/spark-11958.
| |
include_example
Made a new patch containing only the markdown examples moved to the example folder.
Only three Java examples were not moved, since they contained compilation errors. These classes are:
1) StandardScale 2) NormalizerExample 3) VectorIndexer
Author: Xusen Yin <yinxusen@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>
Closes #10002 from somideshmukh/SomilBranch1.33.
| |
https://issues.apache.org/jira/browse/SPARK-11963
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9962 from yinxusen/SPARK-11963.
| |
\cc mengxr
Author: Jeff Zhang <zjffdu@apache.org>
Closes #10093 from zjffdu/mllib_typo.
| |
https://issues.apache.org/jira/browse/SPARK-11961
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9965 from yinxusen/SPARK-11961.
| |
MLUtils.loadLibSVMFile to load DataFrame
Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include:
* Use libSVM data source for all example codes under examples/ml, and remove unused import.
* Use libSVM data source for user guides under ml-*** which were omitted by #8697.
* Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides mistakenly used ```sqlContext.read.format("libsvm").load(path)```.
* Code cleanup.
mengxr
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9690 from yanboliang/spark-11723.
| |
include_example
mengxr https://issues.apache.org/jira/browse/SPARK-11289
I made some changes to the ML feature extractors, i.e. TF-IDF, Word2Vec, and CountVectorizer. I added new example code in spark/examples; I hope that is the right place for those examples.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #9266 from yinxusen/SPARK-11289.
| |
jira: https://issues.apache.org/jira/browse/SPARK-10670
In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md
This JIRA is just for spark.ml, not spark.mllib
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8901 from hhbyyh/docAPI.
| |
Various ML guide cleanups.
* ml-guide.md: Make it easier to access the algorithm-specific guides.
* LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
* mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
* Clean up Binarizer user guide a little.
* Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
* spark.ml Word2Vec user guide: clean up grammar/writing
* Chi Sq Feature Selector docs: Improve text in doc.
CC: mengxr feynmanliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8752 from jkbradley/mlguide-fixes-1.5.
| |
LIBSVM data source instead of MLUtils
I changed the example code in spark.ml to use the LIBSVM data source instead of MLUtils.
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes #8697 from y-shimizu/SPARK-10518.
| |
jira: https://issues.apache.org/jira/browse/SPARK-10249
Update the user guide now that Python support has been added.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8620 from hhbyyh/swPyDocExample.
| |
jira: https://issues.apache.org/jira/browse/SPARK-9890
Document with Scala and Java examples.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8487 from hhbyyh/cvDoc.
| |
compatibility test
* Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes #8436 from feynmanliang/SPARK-10230.
| |
jira: https://issues.apache.org/jira/browse/SPARK-8531
Update ML user guide for MinMaxScaler
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: unknown <yuhaoyan@yuhaoyan-MOBL1.ccr.corp.intel.com>
Closes #7211 from hhbyyh/minmaxdoc.
| |
Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer.
Note that the Python version does not yet support selecting by names.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #8267 from yinxusen/SPARK-9893.
| |
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8293 from ericl/docs-2.
| |
New user guide section ml-decision-tree.md, including code examples.
I have run all examples, including the Java ones.
CC: manishamde yanboliang mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8244 from jkbradley/ml-dt-docs.
| |
Using `StringIndexer`, we obtain the indexed label in a new column, so a subsequent estimator in the pipeline should read this new column if it wants to use the string-indexed label.
I think it is better to make this explicit in the documentation.
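A rough plain-Python sketch of the point above, not the Spark API. The helpers `fit_string_indexer` and `add_indexed_column` and the column names are hypothetical. Labels are indexed by descending frequency; breaking ties alphabetically is an assumption of this sketch.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each distinct label to an index; the most frequent label gets 0.0.

    Ties are broken alphabetically here, which is an assumption of the sketch.
    """
    counts = Counter(labels)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}


def add_indexed_column(rows, label_col, output_col):
    """Return copies of `rows` with the indexed label stored in `output_col`."""
    mapping = fit_string_indexer(row[label_col] for row in rows)
    return [dict(row, **{output_col: mapping[row[label_col]]}) for row in rows]


rows = [{"label": "a"}, {"label": "b"}, {"label": "a"}]
indexed = add_indexed_column(rows, "label", "indexedLabel")
# A later stage must read "indexedLabel", not the original "label" column.
print(indexed)
```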
Author: lewuathe <lewuathe@me.com>
Closes #8205 from Lewuathe/SPARK-9977.
| |
`Lists.newArrayList` -> `Arrays.asList`
CC jkbradley feynmanliang
Is anybody interested in replacing usages of `Lists.newArrayList` in the examples / source code too? This method isn't useful in Java 7 and beyond.
Author: Sean Owen <sowen@cloudera.com>
Closes #8272 from srowen/SPARK-10070.
| |
mengxr jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8184 from feynmanliang/SPARK-9889-DCT-docs.
| |
ml.feature.ElementwiseProduct
Add Python API, user guide and example for ml.feature.ElementwiseProduct.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8061 from yanboliang/SPARK-9768.
| |
jira: https://issues.apache.org/jira/browse/SPARK-7583
User guide update for RegexTokenizer
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #7828 from hhbyyh/regexTokenizerDoc.
| |
Add ml.PCA user guide document and code examples for Scala/Java/Python.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #7522 from yanboliang/ml-pca-md and squashes the following commits:
60dec05 [Yanbo Liang] address comments
f992abe [Yanbo Liang] Add ml.PCA doc and examples
|