path: root/docs/ml-features.md
Commit message | Author | Age | Files | Lines
* [SPARK-19969][ML] Imputer doc and example | Yuhao Yang | 2017-04-03 | 1 | -0/+66

  ## What changes were proposed in this pull request?
  Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. The Python example will be added after https://github.com/apache/spark/pull/17316

  ## How was this patch tested?
  local doc generation and example execution

  Author: Yuhao Yang <yuhao.yang@intel.com>
  Closes #17324 from hhbyyh/imputerdoc.
* [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels | VinceShieh | 2017-03-07 | 1 | -2/+20

  ## What changes were proposed in this pull request?
  This PR is an enhancement to ML StringIndexer. Before this PR, StringIndexer only supported the "skip"/"error" options for dealing with unseen records. But those unseen records might still be useful, and users would like to keep the unseen labels in certain use cases. This PR enables StringIndexer to support keeping unseen labels as indices [numLabels].

  '''Before
  StringIndexer().setHandleInvalid("skip")
  StringIndexer().setHandleInvalid("error")
  '''After support the third option "keep"
  StringIndexer().setHandleInvalid("keep")

  ## How was this patch tested?
  Test added in StringIndexerSuite

  Signed-off-by: VinceShieh <vincent.xieintel.com>

  Author: VinceShieh <vincent.xie@intel.com>
  Closes #16883 from VinceShieh/spark-17498.
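The three `handleInvalid` options described in the commit message above can be illustrated with a minimal plain-Python sketch. This is not the Spark implementation: `index_labels` and its arguments are hypothetical names chosen for illustration, and the only behavior taken from the message is that "keep" maps unseen labels to the index `numLabels`.

```python
def index_labels(values, labels, handle_invalid="error"):
    """Map each value to its position in `labels` (hypothetical helper).

    handle_invalid:
      "error" - raise on a value that is not in `labels`
      "skip"  - drop such values from the output
      "keep"  - map every unseen value to the extra index len(labels)
    """
    if handle_invalid not in ("error", "skip", "keep"):
        raise ValueError(f"unsupported option: {handle_invalid}")
    index = {label: i for i, label in enumerate(labels)}
    out = []
    for v in values:
        if v in index:
            out.append(index[v])
        elif handle_invalid == "keep":
            out.append(len(labels))  # all unseen labels share index numLabels
        elif handle_invalid == "skip":
            continue
        else:
            raise ValueError(f"unseen label: {v}")
    return out


fitted = ["a", "b", "c"]  # labels seen at fit time
print(index_labels(["a", "d", "b"], fitted, "keep"))
print(index_labels(["a", "d", "b"], fitted, "skip"))
```

With "keep", the unseen value `"d"` is assigned index 3 because three labels were fitted; with "skip" it is dropped.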
* [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing | Yun Ni | 2017-02-15 | 1 | -0/+17

  ## What changes were proposed in this pull request?
  This pull request includes the Python API and examples for LSH. The API changes were based on yanboliang's PR #15768, with conflicts and API changes on the Scala API resolved. The examples are consistent with the Scala examples of MinHashLSH and BucketedRandomProjectionLSH.

  ## How was this patch tested?
  API and examples are tested using spark-submit:
  `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
  `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`

  User guide changes are generated and manually inspected:
  `SKIP_API=1 jekyll build`

  Author: Yun Ni <yunn@uber.com>
  Author: Yanbo Liang <ybliang8@gmail.com>
  Author: Yunni <Euler57721@gmail.com>
  Closes #16715 from Yunni/spark-18080.
* [SPARK-17645][MLLIB][ML][FOLLOW-UP] document minor change | Peng, Meng | 2017-01-10 | 1 | -2/+2

  ## What changes were proposed in this pull request?
  Add FDR test case in ml/feature/ChiSqSelectorSuite. Improve some comments in the code. This is a follow-up pr for #15212.

  ## How was this patch tested?
  ut

  Author: Peng, Meng <peng.meng@intel.com>
  Closes #16434 from mpjlu/fdr_fwe_update.
* [MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo | Niranjan Padmanabhan | 2017-01-04 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

  ## How was this patch tested?
  N/A since only docs or comments were updated.

  Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>
  Closes #16455 from neurons/np.structure_streaming_doc.
* [SPARK-17645][MLLIB][ML] add feature selector method based on: False Discovery Rate (FDR) and Family wise error rate (FWE) | Peng | 2016-12-28 | 1 | -3/+3

  ## What changes were proposed in this pull request?
  Univariate feature selection works by selecting the best features based on univariate statistical tests. FDR and FWE are popular univariate statistical tests for feature selection. In 2005, the Benjamini and Hochberg paper on FDR was identified as one of the 25 most-cited statistical papers. The FDR uses the Benjamini-Hochberg procedure in this PR. https://en.wikipedia.org/wiki/False_discovery_rate

  In statistics, FWE is the probability of making one or more false discoveries, or type I errors, among all the hypotheses when performing multiple hypotheses tests. https://en.wikipedia.org/wiki/Family-wise_error_rate

  We add FDR and FWE methods for ChiSqSelector in this PR, like it is implemented in scikit-learn. http://scikit-learn.org/stable/modules/feature_selection.html#univariate-feature-selection

  ## How was this patch tested?
  ut will be added soon

  Author: Peng <peng.meng@intel.com>
  Author: Peng, Meng <peng.meng@intel.com>
  Closes #15212 from mpjlu/fdr_fwe.
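As a rough illustration of the two selection rules named in the commit message above, the sketch below applies the Benjamini-Hochberg step-up procedure (FDR) and a family-wise error threshold to a list of per-feature p-values. It is plain Python, not the ChiSqSelector code. The function names are hypothetical, and the FWE rule shown (a Bonferroni-style cutoff of `alpha / numFeatures`) is an assumption, since the message does not say which correction is used.

```python
def select_fdr(p_values, alpha=0.05):
    """Benjamini-Hochberg: return sorted indices of the selected features."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    # Find the largest rank k (1-based) with p_(k) <= alpha * k / m.
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= alpha * rank / m:
            cutoff = rank
    # Every feature ranked at or below the cutoff is selected.
    return sorted(order[:cutoff])


def select_fwe(p_values, alpha=0.05):
    """Assumed Bonferroni-style rule: keep features with p < alpha / m."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p < alpha / m]


p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
print(select_fdr(p))
print(select_fwe(p))
```

Because Benjamini-Hochberg is a step-up procedure, a feature whose own p-value misses its rank threshold is still selected when a higher-ranked feature passes; for example, `select_fdr([0.02, 0.03, 0.04])` selects all three.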
* [SPARK-18081][ML][DOCS] Add user guide for Locality Sensitive Hashing(LSH) | Yunni | 2016-12-03 | 1 | -0/+111

  ## What changes were proposed in this pull request?
  The user guide for LSH is added to ml-features.md, with several scala/java examples in spark-examples.

  ## How was this patch tested?
  Doc has been generated through Jekyll, and checked through manual inspection.

  Author: Yunni <Euler57721@gmail.com>
  Author: Yun Ni <yunn@uber.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: Yun Ni <Euler57721@gmail.com>
  Closes #15795 from Yunni/SPARK-18081-lsh-guide.
* [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs | Yanbo Liang | 2016-11-30 | 1 | -1/+3

  ## What changes were proposed in this pull request?
  API review for 2.1, except ```LSH``` related classes which are still under development.

  ## How was this patch tested?
  Only doc changes, no new tests.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #16009 from yanboliang/spark-18318.
* [SPARK-18480][DOCS] Fix wrong links for ML guide docs | Zheng RuiFeng | 2016-11-17 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  1. There are two `[Graph.partitionBy]` in `graphx-programming-guide.md`; the first one had no effect.
  2. `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
  3. `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because class `PythonMLLibAPI` is private.
  4. Other link updates.

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #15912 from zhengruifeng/md_fix.
* [MINOR][DOC] Unify example marks | Zheng RuiFeng | 2016-11-08 | 1 | -0/+30

  ## What changes were proposed in this pull request?
  1. `**Example**` => `**Examples**`, because more algos use `**Examples**`.
  2. Delete `### Examples` in `Isotonic regression`, because it's not that special in http://spark.apache.org/docs/latest/ml-classification-regression.html
  3. Add missing marks for `LDA` and other algos.

  ## How was this patch tested?
  No tests, since this only modifies docs.

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #15783 from zhengruifeng/doc_fix.
* [SPARK-13770][DOCUMENTATION][ML] Document the ML feature Interaction | chie8842 | 2016-11-08 | 1 | -0/+52

  I created Scala and Java example and added documentation.

  Author: chie8842 <hayashidac@nttdata.co.jp>
  Closes #15658 from hayashidac/SPARK-13770.
* [SPARK-18088][ML] Various ChiSqSelector cleanups | Joseph K. Bradley | 2016-11-01 | 1 | -6/+6

  ## What changes were proposed in this pull request?
  - Renamed kbest to numTopFeatures
  - Renamed alpha to fpr
  - Added missing Since annotations
  - Doc cleanups

  ## How was this patch tested?
  Added new standardized unit tests for spark.ml. Improved existing unit test coverage a bit.

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #15647 from jkbradley/chisqselector-follow-ups.
* [SPARK-17219][ML] enhanced NaN value handling in Bucketizer | VinceShieh | 2016-10-27 | 1 | -5/+10

  ## What changes were proposed in this pull request?
  This PR is an enhancement of the PR with commit ID 57dc326bd00cf0a49da971e9c573c48ae28acaa2. NaN is a special type of value which is commonly seen as invalid. But we find that there are certain cases where NaN values are also valuable and thus need special handling. We give users 3 options for dealing with NaN values: reserve an extra bucket for NaN values, remove the NaN values, or report an error, by setting handleNaN to "keep", "skip", or "error" (default) respectively.

  '''Before:
  val bucketizer: Bucketizer = new Bucketizer()
    .setInputCol("feature")
    .setOutputCol("result")
    .setSplits(splits)
  '''After:
  val bucketizer: Bucketizer = new Bucketizer()
    .setInputCol("feature")
    .setOutputCol("result")
    .setSplits(splits)
    .setHandleNaN("keep")

  ## How was this patch tested?
  Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite

  Signed-off-by: VinceShieh <vincent.xieintel.com>

  Author: VinceShieh <vincent.xie@intel.com>
  Author: Vincent Xie <vincent.xie@intel.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #15428 from VinceShieh/spark-17219_followup.
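The "keep", "skip" and "error" options described in the commit message above can be sketched in plain Python as follows. This is an illustration of the described semantics only, not the Bucketizer code: `bucketize` is a hypothetical helper, and the bucket convention used (each bucket is `[lower, upper)`, with the last bucket also including its upper bound) is an assumption not stated in the message.

```python
import math
from bisect import bisect_right


def bucketize(values, splits, handle_nan="error"):
    """Assign each value a bucket index given sorted split points.

    With n+1 splits there are n regular buckets, indexed 0..n-1.
    handle_nan:
      "keep"  - NaN goes into one extra bucket with index n
      "skip"  - NaN values are dropped from the output
      "error" - NaN raises ValueError
    """
    if handle_nan not in ("keep", "skip", "error"):
        raise ValueError(f"unsupported option: {handle_nan}")
    num_buckets = len(splits) - 1
    out = []
    for x in values:
        if math.isnan(x):
            if handle_nan == "keep":
                out.append(num_buckets)  # the reserved NaN bucket
            elif handle_nan == "error":
                raise ValueError("NaN value found")
            continue  # "skip" falls through to here and drops the value
        if x < splits[0] or x > splits[-1]:
            raise ValueError(f"value {x} is outside the splits")
        # bisect_right gives [lower, upper) buckets; clamp so that the
        # top split itself lands in the last regular bucket.
        out.append(min(bisect_right(splits, x) - 1, num_buckets - 1))
    return out


splits = [float("-inf"), 0.0, 10.0, float("inf")]
data = [-5.0, 3.0, float("nan"), 12.0]
print(bucketize(data, splits, "keep"))
print(bucketize(data, splits, "skip"))
```

With three regular buckets, "keep" places the NaN in bucket 3, while "skip" returns one fewer element than the input.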
* [SPARK-17017][ML][MLLIB][ML][DOC] Updated the ml/mllib feature selection docs for ChiSqSelector | Shuai Lin | 2016-09-28 | 1 | -4/+10

  ## What changes were proposed in this pull request?
  A follow up for #14597 to update feature selection docs about ChiSqSelector.

  ## How was this patch tested?
  Generated html docs. It can be previewed at:
  * ml: http://sparkdocs.lins05.pw/spark-17017/ml-features.html#chisqselector
  * mllib: http://sparkdocs.lins05.pw/spark-17017/mllib-feature-extraction.html#chisqselector

  Author: Shuai Lin <linshuai2012@gmail.com>
  Closes #15236 from lins05/spark-17017-update-docs-for-chisq-selector-fpr.
* [SPARK-17219][ML] Add NaN value handling in Bucketizer | VinceShieh | 2016-09-21 | 1 | -1/+5

  ## What changes were proposed in this pull request?
  This PR fixes an issue when Bucketizer is called to handle a dataset containing NaN value. Sometimes, null value might also be useful to users, so in these cases, Bucketizer should reserve one extra bucket for NaN values, instead of throwing an illegal exception.

  Before:
  ```
  Bucketizer.transform on NaN value threw an illegal exception.
  ```
  After:
  ```
  NaN values will be grouped in an extra bucket.
  ```

  ## How was this patch tested?
  New test cases added in `BucketizerSuite`.

  Signed-off-by: VinceShieh <vincent.xieintel.com>

  Author: VinceShieh <vincent.xie@intel.com>
  Closes #14858 from VinceShieh/spark-17219.
* [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True | Sean Owen | 2016-08-27 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.

  ## How was this patch tested?
  Jenkins tests, including new cases to reflect the new behavior.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #14663 from srowen/SPARK-17001.
* [MINOR][DOC] Fix wrong ml.feature.Normalizer document. | Yanbo Liang | 2016-08-24 | 1 | -1/+1

  ## What changes were proposed in this pull request?
  The ```ml.feature.Normalizer``` examples illustrate L1 norm rather than L2, we should correct corresponding document.
  ![image](https://cloud.githubusercontent.com/assets/1962026/17928637/85aec284-69b0-11e6-9b13-d465ee560581.png)

  ## How was this patch tested?
  Doc change, no test.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #14787 from yanboliang/normalizer.
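The L1 versus L2 distinction that the commit message above corrects can be seen in a short plain-Python sketch of p-norm normalization. This is not the Normalizer implementation; `normalize` is a hypothetical helper, and the sample vector is made up for illustration.

```python
def normalize(vector, p=2.0):
    """Scale `vector` to unit p-norm (p=1.0 gives L1, p=2.0 gives L2)."""
    norm = sum(abs(x) ** p for x in vector) ** (1.0 / p)
    if norm == 0.0:
        return list(vector)  # an all-zero vector is returned unchanged
    return [x / norm for x in vector]


v = [1.0, 0.5, -1.0]
print(normalize(v, p=1.0))  # absolute values of the result sum to 1
print(normalize(v, p=2.0))  # squares of the result sum to 1
```

The two calls give different outputs for the same input, which is why a guide describing one norm while the example code uses the other is misleading.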
* [SPARK-16485][DOC][ML] Fixed several inline formatting in ml features doc | Shuai Lin | 2016-07-25 | 1 | -2/+2

  ## What changes were proposed in this pull request?
  Fixed several inline formatting in ml features doc.

  Before:
  <img width="475" alt="screen shot 2016-07-14 at 12 24 57 pm" src="https://cloud.githubusercontent.com/assets/717363/16827974/1e1b6e04-49be-11e6-8aa9-4a0cb6cd3b4e.png">

  After:
  <img width="404" alt="screen shot 2016-07-14 at 12 25 48 pm" src="https://cloud.githubusercontent.com/assets/717363/16827976/2576510a-49be-11e6-96dd-92a1fa464d36.png">

  ## How was this patch tested?
  Generate the docs locally by `SKIP_API=1 jekyll build` and view it in the browser.

  Author: Shuai Lin <linshuai2012@gmail.com>
  Closes #14194 from lins05/fix-docs-formatting.
* [SPARK-14817][ML][MLLIB][DOC] Made DataFrame-based API primary in MLlib guide | Joseph K. Bradley | 2016-07-15 | 1 | -2/+2

  ## What changes were proposed in this pull request?
  Made DataFrame-based API primary
  * Spark doc menu bar and other places now link to ml-guide.html, not mllib-guide.html
  * mllib-guide.html keeps RDD-specific list of features, with a link at the top redirecting people to ml-guide.html
  * ml-guide.html includes a "maintenance mode" announcement about the RDD-based API
  * **Reviewers: please check this carefully**
  * (minor) Titles for DF API no longer include "- spark.ml" suffix. Titles for RDD API have "- RDD-based API" suffix
  * Moved migration guide to ml-guide from mllib-guide
  * Also moved past guides from mllib-migration-guides to ml-migration-guides, with a redirect link on mllib-migration-guides
  * **Reviewers**: I did not change any of the content of the migration guides.

  Reorganized DataFrame-based guide:
  * ml-guide.html mimics the old mllib-guide.html page in terms of content: overview, migration guide, etc.
  * Moved Pipeline description into ml-pipeline.html and moved tuning into ml-tuning.html
  * **Reviewers**: I did not change the content of these guides, except some intro text.
  * Sidebar remains the same, but with pipeline and tuning sections added

  Other:
  * ml-classification-regression.html: Moved text about linear methods to new section in page

  ## How was this patch tested?
  Generated docs locally

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #14213 from jkbradley/ml-guide-2.0.
* [SPARK-15997][DOC][ML] Update user guide for HashingTF, QuantileVectorizer and CountVectorizer | GayathriMurali | 2016-06-24 | 1 | -12/+17

  ## What changes were proposed in this pull request?
  Made changes to HashingTF, QuantileVectorizer and CountVectorizer

  Author: GayathriMurali <gayathri.m@intel.com>
  Closes #13745 from GayathriMurali/SPARK-15997.
* [SPARK-16045][ML][DOC] Spark 2.0 ML.feature: doc update for stopwords and binarizer | Yuhao Yang | 2016-06-21 | 1 | -6/+10

  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-16045
  2.0 Audit: Update document for StopWordsRemover and Binarizer.

  ## How was this patch tested?
  manual review for doc

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Author: Yuhao Yang <yuhao.yang@intel.com>
  Closes #13375 from hhbyyh/stopdoc.
* [SPARK-15394][ML][DOCS] User guide typos and grammar audit | sethah | 2016-05-19 | 1 | -24/+23

  ## What changes were proposed in this pull request?
  Correct some typos and incorrectly worded sentences.

  ## How was this patch tested?
  Doc changes only. Note that many of these changes were identified by whomfire01

  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #13180 from sethah/ml_guide_audit.
* [SPARK-15182][ML] Copy MLlib doc to ML: ml.feature.tf, idf | Yuhao Yang | 2016-05-17 | 1 | -9/+42

  ## What changes were proposed in this pull request?
  We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.

  ## How was this patch tested?
  manual review for doc.

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Author: Yuhao Yang <yuhao.yang@intel.com>
  Closes #12957 from hhbyyh/tfidfdoc.
* [DOC][MINOR] Fixed minor errors in feature.ml user guide doc | Bryan Cutler | 2016-05-07 | 1 | -3/+5

  ## What changes were proposed in this pull request?
  Fixed some minor errors found when reviewing feature.ml user guide

  ## How was this patch tested?
  built docs locally

  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #12940 from BryanCutler/feature.ml-doc_fixes-DOCS-MINOR.
* [SPARK-14512] [DOC] Add python example for QuantileDiscretizer | Zheng RuiFeng | 2016-05-06 | 1 | -0/+9

  ## What changes were proposed in this pull request?
  Add the missing python example for QuantileDiscretizer

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12281 from zhengruifeng/discret_pe.
* [SPARK-14514][DOC] Add python example for VectorSlicer | Zheng RuiFeng | 2016-04-26 | 1 | -0/+8

  ## What changes were proposed in this pull request?
  Add the missing python example for VectorSlicer

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12282 from zhengruifeng/vecslicer_pe.
* [SPARK-14635][ML] Documentation and Examples for TF-IDF only refer to HashingTF | Yuhao Yang | 2016-04-20 | 1 | -3/+12

  ## What changes were proposed in this pull request?
  Currently, the docs for TF-IDF only refer to using HashingTF with IDF. However, CountVectorizer can also be used. We should probably amend the user guide and examples to show this.

  ## How was this patch tested?
  unit tests and doc generation

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #12454 from hhbyyh/tfdoc.
* [SPARK-14515][DOC] Add python example for ChiSqSelector | Zheng RuiFeng | 2016-04-18 | 1 | -0/+8

  ## What changes were proposed in this pull request?
  Add the missing python example for ChiSqSelector

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12283 from zhengruifeng/chi2_pe.
* [SPARK-14509][DOC] Add python CountVectorizerExample | Zheng RuiFeng | 2016-04-13 | 1 | -0/+9

  ## What changes were proposed in this pull request?
  Add python CountVectorizerExample

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #11917 from zhengruifeng/cv_pe.
* [SPARK-14339][DOC] Add python examples for DCT,MinMaxScaler,MaxAbsScaler | Zheng RuiFeng | 2016-04-09 | 1 | -0/+24

  ## What changes were proposed in this pull request?
  add three python examples

  ## How was this patch tested?
  manual tests

  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Closes #12063 from zhengruifeng/dct_pe.
* [SPARK-13512][ML] add example and doc for MaxAbsScaler | Yuhao Yang | 2016-03-11 | 1 | -0/+32

  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-13512
  Add example and doc for ml.feature.MaxAbsScaler.

  ## How was this patch tested?
  unit tests

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #11392 from hhbyyh/maxabsdoc.
* [MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments | Dongjoon Hyun | 2016-02-22 | 1 | -3/+3

  ## What changes were proposed in this pull request?
  This PR tries to fix all typos in all markdown files under `docs` module, and fixes similar typos in other comments, too.

  ## How was this patch tested?
  manual tests.

  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11300 from dongjoon-hyun/minor_fix_typos.
* [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions | Yanbo Liang | 2016-01-25 | 1 | -1/+19

  Update user guide for RFormula feature interactions. Meanwhile we also update other new features such as supporting string label in Spark 1.6.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10222 from yanboliang/spark-11965.
* [MINOR][DOC] Fix broken word2vec link | BenFradet | 2015-12-14 | 1 | -1/+1

  Follow-up of [SPARK-12199](https://issues.apache.org/jira/browse/SPARK-12199) and #10193 where a broken link has been left as is.

  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #10282 from BenFradet/SPARK-12199.
* [SPARK-12199][DOC] Follow-up: Refine example code in ml-features.md | Xusen Yin | 2015-12-12 | 1 | -11/+11

  https://issues.apache.org/jira/browse/SPARK-12199

  Follow-up PR of SPARK-11551. Fix some errors in ml-features.md

  mengxr

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #10193 from yinxusen/SPARK-12199.
* [SPARK-12217][ML] Document invalid handling for StringIndexer | BenFradet | 2015-12-11 | 1 | -0/+36

  Added a paragraph regarding StringIndexer#setHandleInvalid to the ml-features documentation. I wonder if I should also add a snippet to the code example, input welcome.

  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #10257 from BenFradet/SPARK-12217.
* [SPARK-12212][ML][DOC] Clarifies the difference between spark.ml, spark.mllib and mllib in the documentation. | Timothy Hunter | 2015-12-10 | 1 | -2/+2

  Replaces a number of occurrences of `MLlib` in the documentation that were meant to refer to the `spark.mllib` package instead. It should clarify for new users the difference between `spark.mllib` (the package) and MLlib (the umbrella project for ML in spark).

  It also removes some files that I forgot to delete with #10207

  Author: Timothy Hunter <timhunter@databricks.com>
  Closes #10234 from thunterdb/12212.
* [SPARK-11551][DOC] Replace example code in ml-features.md using include_example | Xusen Yin | 2015-12-09 | 1 | -1061/+51

  PR on behalf of somideshmukh, thanks!

  Author: Xusen Yin <yinxusen@gmail.com>
  Author: somideshmukh <somilde@us.ibm.com>
  Closes #10219 from yinxusen/SPARK-11551.
* [SPARK-8517][ML][DOC] Reorganizes the spark.ml user guide | Timothy Hunter | 2015-12-08 | 1 | -2/+2

  This PR moves pieces of the spark.ml user guide to reflect suggestions in SPARK-8517. It does not introduce new content, as requested.

  <img width="192" alt="screen shot 2015-12-08 at 11 36 00 am" src="https://cloud.githubusercontent.com/assets/7594753/11666166/e82b84f2-9d9f-11e5-8904-e215424d8444.png">

  Author: Timothy Hunter <timhunter@databricks.com>
  Closes #10207 from thunterdb/spark-8517.
* [SPARK-12159][ML] Add user guide section for IndexToString transformer | BenFradet | 2015-12-08 | 1 | -16/+88

  Documentation regarding the `IndexToString` label transformer with code snippets in Scala/Java/Python.

  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #10166 from BenFradet/SPARK-12159.
* [SPARK-11551][DOC][EXAMPLE] Revert PR #10002 | Cheng Lian | 2015-12-08 | 1 | -51/+1058

  This reverts PR #10002, commit 78209b0ccaf3f22b5e2345dfb2b98edfdb746819.

  The original PR wasn't tested on Jenkins before being merged.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10200 from liancheng/revert-pr-10002.
* [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code | Yanbo Liang | 2015-12-07 | 1 | -0/+59

  Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear.

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #10006 from yanboliang/spark-11958.
* [SPARK-11551][DOC][EXAMPLE] Replace example code in ml-features.md using include_example | somideshmukh | 2015-12-07 | 1 | -1058/+51

  Made a new patch containing only the markdown examples moved to the example folder. Only three Java examples were not shifted, since they contained compilation errors. These classes are:
  1) StandardScale
  2) NormalizerExample
  3) VectorIndexer

  Author: Xusen Yin <yinxusen@gmail.com>
  Author: somideshmukh <somilde@us.ibm.com>
  Closes #10002 from somideshmukh/SomilBranch1.33.
* [SPARK-11963][DOC] Add docs for QuantileDiscretizer | Xusen Yin | 2015-12-07 | 1 | -0/+65

  https://issues.apache.org/jira/browse/SPARK-11963

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9962 from yinxusen/SPARK-11963.
* [DOCUMENTATION][MLLIB] typo in mllib doc | Jeff Zhang | 2015-12-03 | 1 | -1/+1

  \cc mengxr

  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #10093 from zjffdu/mllib_typo.
* [SPARK-11961][DOC] Add docs of ChiSqSelector | Xusen Yin | 2015-12-01 | 1 | -0/+50

  https://issues.apache.org/jira/browse/SPARK-11961

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9965 from yinxusen/SPARK-11961.
* [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame | Yanbo Liang | 2015-11-13 | 1 | -4/+4

  Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include:
  * Use libSVM data source for all example codes under examples/ml, and remove unused import.
  * Use libSVM data source for user guides under ml-*** which were omitted by #8697.
  * Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` at Java side, but the API doc and user guides misuse as ```sqlContext.read.format("libsvm").load(path)```.
  * Code cleanup.

  mengxr

  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9690 from yanboliang/spark-11723.
* [SPARK-11289][DOC] Substitute code examples in ML features extractors with include_example | Xusen Yin | 2015-10-26 | 1 | -209/+8

  mengxr https://issues.apache.org/jira/browse/SPARK-11289

  I make some changes in ML feature extractors. I.e. TF-IDF, Word2Vec, and CountVectorizer. I add new example code in spark/examples, hope it is the right place to add those examples.

  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9266 from yinxusen/SPARK-11289.
* [SPARK-10670] [ML] [Doc] add api reference for ml doc | Yuhao Yang | 2015-09-28 | 1 | -64/+195

  jira: https://issues.apache.org/jira/browse/SPARK-10670

  In the Markdown docs for the spark.ml Programming Guide, we have code examples with codetabs for each language. We should link to each language's API docs within the corresponding codetab, but we are inconsistent about this. For an example of what we want to do, see the "Word2Vec" section in https://github.com/apache/spark/blob/64743870f23bffb8d96dcc8a0181c1452782a151/docs/ml-features.md

  This JIRA is just for spark.ml, not spark.mllib

  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #8901 from hhbyyh/docAPI.
* [SPARK-10595] [ML] [MLLIB] [DOCS] Various ML guide cleanups | Joseph K. Bradley | 2015-09-15 | 1 | -4/+30

  Various ML guide cleanups.
  * ml-guide.md: Make it easier to access the algorithm-specific guides.
  * LDA user guide: EM often begins with useless topics, but running longer generally improves them dramatically. E.g., 10 iterations on a Wikipedia dataset produces useless topics, but 50 iterations produces very meaningful topics.
  * mllib-feature-extraction.html#elementwiseproduct: “w” parameter should be “scalingVec”
  * Clean up Binarizer user guide a little.
  * Document in Pipeline that users should not put an instance into the Pipeline in more than 1 place.
  * spark.ml Word2Vec user guide: clean up grammar/writing
  * Chi Sq Feature Selector docs: Improve text in doc.

  CC: mengxr feynmanliang

  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #8752 from jkbradley/mlguide-fixes-1.5.