path: root/mllib/src
Commit message, Author, Age, Files, Lines
* [SPARK-10260] [ML] Add @Since annotation to ml.clusteringYu ISHIKAWA2015-08-281-3/+29
| | | | | | | | | ### JIRA [[SPARK-10260] Add Since annotation to ml.clustering - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10260) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8455 from yu-iskw/SPARK-10260.
* [SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java ↵Feynman Liang2015-08-271-0/+72
| | | | | | | | | | | | | compatibility test * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230.
* [SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached dataVyacheslav Baranov2015-08-271-0/+5
| | | | | | | | | | | | `GeneralizedLinearModel` creates a cached RDD when building a model. This is inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache. The proposed solution is to always cache the dataset & remove the warning. There's a caveat though: the input dataset gets evaluated twice, first in line 270 when fitting `StandardScaler`, and again when running the optimizer. So it might be worth restoring the removed warning. Another possible solution is to disable caching entirely & restore the removed warning. I don't really know which approach is better. Author: Vyacheslav Baranov <slavik.baranov@gmail.com> Closes #8395 from SlavikBaranov/SPARK-10182.
* [SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java testsFeynman Liang2015-08-2714-74/+71
| | | | | | | | | | | | * Replaces instances of `Lists.newArrayList` with `Arrays.asList` * Uses `commons.lang.StringUtils` instead of `com.google.collections.Strings` * Uses the `List` interface instead of `ArrayList` implementations This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests. Author: Feynman Liang <fliang@databricks.com> Closes #8451 from feynmanliang/SPARK-10257.
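The substitution described in this commit can be sketched as follows. This is a minimal illustration, not code from the Spark test suites: the class name `ArraysAsListExample` and the sample data are hypothetical, and it assumes the test only needs a fixed-size list.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical example; not taken from the PR.
final class ArraysAsListExample {
    // Before (Guava): List<String> words = Lists.newArrayList("spark", "mllib", "test");
    // After (JDK only): declare against the List interface and build with Arrays.asList.
    static List<String> sampleWords() {
        return Arrays.asList("spark", "mllib", "test");
    }

    public static void main(String[] args) {
        List<String> words = sampleWords();
        System.out.println(words.size() + " words, first = " + words.get(0));
    }
}
```

One difference to keep in mind: `Arrays.asList` returns a fixed-size list, so a test that later calls `add` or `remove` on the result would need a `new ArrayList<>(...)` wrapper instead.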
* [SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11Jacek Laskowski2015-08-271-1/+1
| | | | | | | | | | | | | | | | | Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases. Build for 2.10: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install and 2.11: ./dev/change-scala-version.sh 2.11 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install Author: Jacek Laskowski <jacek@japila.pl> Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
* [SPARK-10256] [ML] Removes guava dependency from spark.ml.classification ↵Feynman Liang2015-08-271-2/+2
| | | | | | | | JavaTests Author: Feynman Liang <fliang@databricks.com> Closes #8447 from feynmanliang/SPARK-10256.
* [SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTestsFeynman Liang2015-08-272-6/+6
| | | | | | Author: Feynman Liang <fliang@databricks.com> Closes #8446 from feynmanliang/SPARK-10255.
* [SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTestsFeynman Liang2015-08-2711-30/+35
| | | | | | | | | * Replaces `com.google.common` dependencies with `java.util.Arrays` * Small clean up in `JavaNormalizerSuite` Author: Feynman Liang <fliang@databricks.com> Closes #8445 from feynmanliang/SPARK-10254.
* [SPARK-10241] [MLLIB] update since versions in mllib.recommendationXiangrui Meng2015-08-262-5/+25
| | | | | | | | | | Same as #8421 but for `mllib.recommendation`. cc srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #8432 from mengxr/SPARK-10241.
* [SPARK-9665] [MLLIB] audit MLlib API annotationsXiangrui Meng2015-08-261-4/+8
| | | | | | | | | | I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8452 from mengxr/SPARK-9665.
* [SPARK-10236] [MLLIB] update since versions in mllib.featureXiangrui Meng2015-08-258-16/+21
| | | | | | | | | | | | | Same as #8421 but for `mllib.feature`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits: 0e8d658 [Xiangrui Meng] remove unnecessary comment ad70b03 [Xiangrui Meng] update since versions in mllib.feature
* [SPARK-10235] [MLLIB] update since versions in mllib.regressionXiangrui Meng2015-08-258-29/+47
| | | | | | | | | | | | Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8426 from mengxr/SPARK-10235 and squashes the following commits: 6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
* [SPARK-10243] [MLLIB] update since versions in mllib.treeXiangrui Meng2015-08-2512-44/+57
| | | | | | | | | | Same as #8421 but for `mllib.tree`. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8442 from mengxr/SPARK-10236.
* [SPARK-10234] [MLLIB] update since version in mllib.clusteringXiangrui Meng2015-08-257-23/+44
| | | | | | | | | | Same as #8421 but for `mllib.clustering`. cc feynmanliang yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8435 from mengxr/SPARK-10234.
* [SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random ↵Xiangrui Meng2015-08-254-25/+117
| | | | | | | | | | | | and mllib.stat The same as #8241 but for `mllib.stat` and `mllib.random`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8439 from mengxr/SPARK-10242.
* [SPARK-10238] [MLLIB] update since versions in mllib.linalgXiangrui Meng2015-08-258-31/+64
| | | | | | | | | | | | Same as #8421 but for `mllib.linalg`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8440 from mengxr/SPARK-10238 and squashes the following commits: b38437e [Xiangrui Meng] update since versions in mllib.linalg
* [SPARK-10233] [MLLIB] update since version in mllib.evaluationXiangrui Meng2015-08-254-7/+27
| | | | | | | | | | Same as #8421 but for `mllib.evaluation`. cc avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8423 from mengxr/SPARK-10233.
* [SPARK-9888] [MLLIB] User guide for new LDA featuresFeynman Liang2015-08-252-1/+1
| | | | | | | | | | | | * Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparam optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.
* [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and ↵Xiangrui Meng2015-08-259-11/+41
| | | | | | | | | | | | | | mllib.util Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8430 from mengxr/SPARK-10239 and squashes the following commits: a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
* [SPARK-9797] [MLLIB] [DOC] ↵Feynman Liang2015-08-251-1/+1
| | | | | | | | | | StreamingLinearRegressionWithSGD.setConvergenceTol default value Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc Author: Feynman Liang <fliang@databricks.com> Closes #8424 from feynmanliang/SPARK-9797.
* [SPARK-10237] [MLLIB] update since versions in mllib.fpmXiangrui Meng2015-08-253-7/+32
| | | | | | | | | | Same as #8421 but for `mllib.fpm`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8429 from mengxr/SPARK-10237.
* [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD aliasFeynman Liang2015-08-251-1/+4
| | | | | | | | | * Adds doc for alias of runMiniBatchSGD documenting default value for convergenceTol * Cleans up a note in code Author: Feynman Liang <fliang@databricks.com> Closes #8425 from feynmanliang/SPARK-9800.
* [SPARK-10231] [MLLIB] update @Since annotation for mllib.classificationXiangrui Meng2015-08-255-21/+58
| | | | | | | | | | | | | | | | Update `Since` annotation in `mllib.classification`: 1. add version to classes, objects, constructors, and public variables declared in constructors 2. correct some versions 3. remove `Since` on `toString` MechCoder dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8421 from mengxr/SPARK-10231 and squashes the following commits: b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
* [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentrationFeynman Liang2015-08-252-9/+9
| | | | | | | | | | See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230.
* [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses ↵Sean Owen2015-08-256-13/+14
| | | | | | | | | | | | to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
* [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bugJoseph K. Bradley2015-08-232-9/+35
| | | | | | | | | | | | GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix.
* [SPARK-9893] User guide with Java test suite for VectorSlicerXusen Yin2015-08-211-0/+85
| | | | | | | | | | Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer. Note that Python version does not support selecting by names now. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893.
* [SPARK-10163] [ML] Allow single-category features for GBT modelsJoseph K. Bradley2015-08-211-5/+0
| | | | | | | | | | | | Removed categorical feature info validation since no longer needed This is needed to make the ML user guide examples work (in another current PR). CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8367 from jkbradley/gbt-single-cat.
* [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc to Since annotationMechCoder2015-08-2168-862/+692
| | | | | | Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8352 from MechCoder/since.
* [SPARK-9245] [MLLIB] LDA topic assignmentsJoseph K. Bradley2015-08-204-7/+74
| | | | | | | | | | For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments.
* [SPARK-10108] Add since tags to mllib.featureMechCoder2015-08-209-11/+76
| | | | | | Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8309 from MechCoder/tags_feature.
* [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add ↵Xiangrui Meng2015-08-202-27/+101
| | | | | | | | | | Java test suite Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138.
* [SPARK-9895] User Guide for RFormula Feature TransformerEric Liang2015-08-191-2/+2
| | | | | | | | mengxr Author: Eric Liang <ekl@databricks.com> Closes #8293 from ericl/docs-2.
* [SPARK-8918] [MLLIB] [DOC] Add @since tags to mllib.clusteringXiangrui Meng2015-08-199-52/+338
| | | | | | | | | | | | This continues the work from #8256. I removed `since` tags from private/protected/local methods/variables (see https://github.com/apache/spark/commit/72fdeb64630470f6f46cf3eed8ffbfe83a7c4659). MechCoder Closes #8256 Author: Xiangrui Meng <meng@databricks.com> Author: Xiaoqing Wang <spark445@126.com> Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8288 from mengxr/SPARK-8918.
* [SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator`Feynman Liang2015-08-199-20/+50
| | | | | | | | | | | | | Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097.
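The idea behind the direction flag can be sketched in plain Java. This is a simplified illustration under stated assumptions, not Spark's implementation: the class `MetricSelection` and the method `bestIndex` are hypothetical, and only the flag name `isLargerBetter` comes from the commit message.

```java
// Hypothetical sketch; these names are not part of Spark's API.
final class MetricSelection {
    /** Returns the index of the best metric, honoring the direction flag. */
    static int bestIndex(double[] metrics, boolean isLargerBetter) {
        if (metrics.length == 0) {
            throw new IllegalArgumentException("no metrics to compare");
        }
        int best = 0;
        for (int i = 1; i < metrics.length; i++) {
            // Compare in the direction the evaluator asks for, instead of
            // negating "smaller is better" metrics and always maximizing.
            boolean better = isLargerBetter
                ? metrics[i] > metrics[best]
                : metrics[i] < metrics[best];
            if (better) {
                best = i;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        double[] rmse = {0.9, 0.4, 0.7};        // smaller is better
        double[] accuracy = {0.81, 0.88, 0.93}; // larger is better
        System.out.println("best RMSE index: " + bestIndex(rmse, false));
        System.out.println("best accuracy index: " + bestIndex(accuracy, true));
    }
}
```

With a flag like this, the caller reports the metric value unchanged, which avoids the negative values the commit message describes.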
* [SPARK-10012] [ML] Missing test case for Params#arrayLengthGtlewuathe2015-08-181-0/+3
| | | | | | | | Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012.
* [SPARK-8924] [MLLIB, DOCUMENTATION] Added @since tags to mllib.treeBryan Cutler2015-08-1824-1/+157
| | | | | | | | Added since tags to mllib.tree Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #7380 from BryanCutler/sinceTag-mllibTree-8924.
* [SPARK-9900] [MLLIB] User guide for Association RulesFeynman Liang2015-08-181-1/+1
| | | | | | | | Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.
* [SPARK-9028] [ML] Add CountVectorizer as an estimator to generate ↵Yuhao Yang2015-08-184-155/+402
| | | | | | | | | | | | | | | CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount as a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.
* [SPARK-10076] [ML] make MultilayerPerceptronClassifier layers and weights publicYanbo Liang2015-08-171-2/+2
| | | | | | | | Fix the issue that ```layers``` and ```weights``` should be public variables of ```MultilayerPerceptronClassificationModel```. Users can not get ```layers``` and ```weights``` from a ```MultilayerPerceptronClassificationModel``` currently. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8263 from yanboliang/mlp-public.
* [SPARK-7808] [ML] add package doc for ml.featureXiangrui Meng2015-08-171-0/+89
| | | | | | | | This PR adds a short description of `ml.feature` package with code example. The Java package doc will come in a separate PR. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8260 from mengxr/SPARK-7808.
* SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regressionPrayag Chandran2015-08-179-12/+168
| | | | | | | | | | | | | Added since tags to mllib.regression Author: Prayag Chandran <prayagchandran@gmail.com> Closes #7518 from prayagchandran/sinceTags and squashes the following commits: fa4dda2 [Prayag Chandran] Re-formatting 6c6d584 [Prayag Chandran] Corrected a few tags. Removed few unnecessary tags 1a0365f [Prayag Chandran] Reformating and adding a few more tags 89fdb66 [Prayag Chandran] SPARK-8916 [Documentation, MLlib] Add @since tags to mllib.regression
* [SPARK-8920] [MLLIB] Add @since tags to mllib.linalgSameer Abhyankar2015-08-178-17/+227
| | | | | | | Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.Samavihome> Author: Sameer Abhyankar <sabhyankar@sabhyankar-MBP.local> Closes #7729 from sabhyankar/branch_8920.
* [SPARK-9959] [MLLIB] Association Rules Java CompatibilityFeynman Liang2015-08-171-2/+28
| | | | | | | | mengxr Author: Feynman Liang <fliang@databricks.com> Closes #8206 from feynmanliang/SPARK-9959-arules-java.
* [HOTFIX] fix duplicated bracesDavies Liu2015-08-141-1/+1
| | | | | | Author: Davies Liu <davies@databricks.com> Closes #8219 from davies/fix_typo.
* [SPARK-9981] [ML] Made labels public for StringIndexerModelJoseph K. Bradley2015-08-142-1/+22
| | | | | | | | | | Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.
* [SPARK-9929] [SQL] support metadata in withColumnWenchen Fan2015-08-144-7/+6
| | | | | | | | In MLlib we sometimes need to set metadata for the new column, so we alias the new column with metadata before calling `withColumn`, and `withColumn` then aliases this column again. Here I overloaded `withColumn` to allow users to set metadata, just like what we did for `Column.as`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8159 from cloud-fan/withColumn.
* [SPARK-8744] [ML] Add a public constructor to StringIndexerHolden Karau2015-08-142-1/+5
| | | | | | | | It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model. Author: Holden Karau <holden@pigscanfly.ca> Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
* [SPARK-9956] [ML] Make trees work with one-category featuresJoseph K. Bradley2015-08-142-10/+30
| | | | | | | | | | | | | | This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical. As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing. Targeted for 1.5 and master CC: manishamde mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8187 from jkbradley/tree-1cat.
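The rule described in this commit, treating 1-category features as continuous, can be sketched as below. How `DecisionTreeMetadata` actually implements it is not shown in this log, so this is an assumption-based illustration: the class `OneCategoryFeatures`, the method `keepMultiCategory`, and the map from feature index to number of categories are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch; not the DecisionTreeMetadata code.
final class OneCategoryFeatures {
    /**
     * Keeps only features with more than one category. Any feature missing
     * from the returned map is assumed to be handled as continuous.
     */
    static Map<Integer, Integer> keepMultiCategory(Map<Integer, Integer> categoricalArity) {
        Map<Integer, Integer> kept = new HashMap<>();
        for (Map.Entry<Integer, Integer> entry : categoricalArity.entrySet()) {
            if (entry.getValue() > 1) {
                kept.put(entry.getKey(), entry.getValue());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, Integer> arity = new HashMap<>();
        arity.put(0, 1); // single category: falls back to continuous
        arity.put(2, 3); // three categories: stays categorical
        System.out.println(keepMultiCategory(arity));
    }
}
```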
* [SPARK-9661] [MLLIB] minor clean-up of SPARK-9661Xiangrui Meng2015-08-144-25/+28
| | | | | | | | Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8190 from mengxr/SPARK-9661-fix.