path: root/mllib
Commit message | Author | Age | Files | Lines
* [SPARK-10573] [ML] IndexToString output schema should be StringTypeNick Pritchard2015-09-142-3/+10
| | | | | | | | Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8751 from pnpritchard/SPARK-10573.
* [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter ↵Yanbo Liang2015-09-141-5/+15
| | | | | | | | | | in Python [SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8457 from yanboliang/spark-10194.
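To illustrate what a convergence tolerance controls, here is a minimal plain-Python sketch (not PySpark code). The function name `gradient_descent`, its parameters, and the stopping rule (stop once the change in the solution falls below `convergence_tol * max(|w|, 1)`) are assumptions for illustration; the actual MLlib criterion may differ in detail.

```python
def gradient_descent(grad, w0, step_size=0.1, num_iterations=100, convergence_tol=0.001):
    """Minimize a 1-D function given its gradient; hypothetical helper, not an MLlib API.

    Returns (w, iterations_used).
    """
    w = w0
    for i in range(1, num_iterations + 1):
        w_new = w - step_size * grad(w)
        # Assumed stopping rule: relative change in the solution is below the tolerance.
        if abs(w_new - w) < convergence_tol * max(abs(w_new), 1.0):
            return w_new, i
        w = w_new
    return w, num_iterations


# f(w) = (w - 3)^2 has gradient 2 * (w - 3) and its minimum at w = 3.
w_loose, iters_loose = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, convergence_tol=1e-2)
w_tight, iters_tight = gradient_descent(lambda w: 2 * (w - 3), w0=0.0, convergence_tol=1e-6)
```

A looser tolerance stops earlier and further from the minimum, which is why users need access to the parameter to trade accuracy for run time.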
* [SPARK-9720] [ML] Identifiable types need UID in toString methodsBertrand Dechoux2015-09-148-9/+9
| | | | | | | | | | | | | | A few Identifiable types overrode their toString method without using the parent implementation. As a consequence, the uid was no longer present in the toString result, although including it is the default behaviour. This patch is a quick fix. The question of enforcement is still open. No tests have been written to verify the toString method behaviour. That would take a long time because all types should be tested, not only those that currently have a regression. It is possible to enforce the condition using the compiler by making the toString method final, but that would introduce unwanted, potentially API-breaking changes (see JIRA). Author: Bertrand Dechoux <BertrandDechoux@users.noreply.github.com> Closes #8062 from BertrandDechoux/SPARK-9720.
* [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtilsJoseph K. Bradley2015-09-112-21/+12
| | | | | | | | | | | | Changes: * Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited. * MetadataUtils.scala: "Helper utilities for tree-based algorithms" -> not just trees anymore CC: holdenk mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8679 from jkbradley/doc-fixes-1.5.
* [SPARK-10537] [ML] document LIBSVM source options in public API doc and some ↵Xiangrui Meng2015-09-113-43/+66
| | | | | | | | | | | | | | | | | minor improvements We should document the options in the public API doc. Otherwise, it is hard to find the options without looking at the code. I tried to make `DefaultSource` private and put the documentation in the package doc. However, since there would then be no public class under `source.libsvm`, the Java package doc doesn't show up in the generated HTML file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc in `DefaultSource` instead. There are several minor updates in this PR: 1. Do `vectorType == "sparse"` only once. 2. Update `hashCode` and `equals`. 3. Remove inherited doc. 4. Delete temp dir in `afterAll`. Lewuathe Author: Xiangrui Meng <meng@databricks.com> Closes #8699 from mengxr/SPARK-10537.
* [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifierYanbo Liang2015-09-111-0/+9
| | | | | | | | Add Python API for ```MultilayerPerceptronClassifier```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8067 from yanboliang/SPARK-9773.
* [SPARK-10023] [ML] [PySpark] Unified DecisionTreeParams checkpointInterval ↵Yanbo Liang2015-09-104-24/+16
| | | | | | | | | | | | | | | | between Scala and Python API. "checkpointInterval" is a member of DecisionTreeParams in the Scala API, which is inconsistent with the Python API; we should unify them. ``` member of DecisionTreeParams <-> Scala API shared param for all ML Transformer/Estimator <-> Python API ``` Proposal: "checkpointInterval" is also used by ALS, so we make it a shared param in Scala. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8528 from yanboliang/spark-10023.
* [SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSVM datalewuathe2015-09-094-0/+256
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is convenient to implement the data source API for the LIBSVM format to have better integration with DataFrames and the ML pipeline API. Two options are implemented. * `numFeatures`: Specify the dimension of the features vector * `featuresType`: Specify the type of the output vector. `sparse` is the default. Author: lewuathe <lewuathe@me.com> Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits: 986999d [lewuathe] Change unit test phrase 11d513f [lewuathe] Fix some reviews 21600a4 [lewuathe] Merge branch 'master' into SPARK-10117 9ce63c7 [lewuathe] Rewrite service loader file 1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117 ba3657c [lewuathe] Merge branch 'master' into SPARK-10117 0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF 4f40891 [lewuathe] Improve test suites 5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117 8660d0e [lewuathe] Fix Java unit test b56a948 [lewuathe] Merge branch 'master' into SPARK-10117 2c12894 [lewuathe] Remove unnecessary tag 7d693c2 [lewuathe] Resolv conflict 62010af [lewuathe] Merge branch 'master' into SPARK-10117 a97ee97 [lewuathe] Fix some points aef9564 [lewuathe] Fix 70ee4dd [lewuathe] Add Java test 3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data 40d3027 [lewuathe] Add Java test 7056d4a [lewuathe] Merge branch 'master' into SPARK-10117 99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
* [SPARK-10227] fatal warnings with sbt on Scala 2.11Luc Bourlier2015-09-091-6/+6
| | | | | | | | | | | The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary. But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations. The remainder are some potential bugs and deprecated syntax. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #8433 from skyluc/issue/sbt-2.11.
* [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySparkHolden Karau2015-09-081-1/+1
| | | | | | | | Adds IndexToString to PySpark. Author: Holden Karau <holden@pigscanfly.ca> Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
* [SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGeneratorYanbo Liang2015-09-082-3/+40
| | | | | | | | | Add WeibullGenerator for RandomDataGenerator. #8611 needs WeibullGenerator to generate random data from a Weibull distribution. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8622 from yanboliang/spark-10464.
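For readers unfamiliar with how such a generator can work, here is a small plain-Python sketch of inverse-transform sampling from a Weibull distribution. It is not the Spark `WeibullGenerator`; the function name `weibull_sample` and the parameter names `shape` and `scale` are assumptions for illustration.

```python
import math
import random


def weibull_sample(rng, shape, scale):
    """Draw one Weibull(shape, scale) value via the inverse CDF (hypothetical helper)."""
    u = rng.random()  # uniform in [0, 1)
    # CDF: F(x) = 1 - exp(-(x / scale) ** shape), so x = scale * (-ln(1 - u)) ** (1 / shape).
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)


rng = random.Random(42)  # fixed seed so runs are repeatable
samples = [weibull_sample(rng, shape=1.0, scale=2.0) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)
# With shape = 1 the distribution is exponential, so the mean should be close to `scale`.
```

Python's standard library also offers `random.weibullvariate(alpha, beta)`, where `alpha` is the scale and `beta` is the shape; the explicit formula is shown here only to make the transformation visible.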
* [SPARK-9834] [MLLIB] implement weighted least squares via normal equationXiangrui Meng2015-09-084-1/+438
| | | | | | | | | | | | | | | | | The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet. There are a couple of TODOs that can be addressed in future PRs: * consolidate summary statistics aggregators * move `dspr` to `BLAS` * etc It would be nice to have this merged first because it blocks a couple of other features. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8588 from mengxr/SPARK-9834.
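As a rough illustration of the normal equation approach, the sketch below fits a one-feature weighted least squares model in plain Python by solving (A^T W A) b = A^T W y directly, where each row of A is [1, x]. It is not the Spark implementation; the function name `weighted_least_squares` and its arguments are hypothetical, and it omits regularization and summary statistics.

```python
def weighted_least_squares(xs, ys, ws):
    """Fit y ~ intercept + slope * x by solving the weighted normal equations.

    Hypothetical helper for illustration; not the Spark implementation.
    """
    # Accumulate the entries of A^T W A (2x2) and A^T W y (2x1) for A = [1, x].
    sw = sum(ws)
    swx = sum(w * x for w, x in zip(ws, xs))
    swxx = sum(w * x * x for w, x in zip(ws, xs))
    swy = sum(w * y for w, y in zip(ws, ys))
    swxy = sum(w * x * y for w, x, y in zip(ws, xs, ys))

    # Solve [[sw, swx], [swx, swxx]] @ [intercept, slope] = [swy, swxy] by Cramer's rule.
    det = sw * swxx - swx * swx
    if det == 0:
        raise ValueError("normal equations are singular; need at least two distinct x values")
    intercept = (swy * swxx - swx * swxy) / det
    slope = (sw * swxy - swx * swy) / det
    return intercept, slope


# The points lie exactly on y = 2x + 1, so any positive weights recover that line.
intercept, slope = weighted_least_squares(
    xs=[0.0, 1.0, 2.0, 3.0], ys=[1.0, 3.0, 5.0, 7.0], ws=[1.0, 2.0, 1.0, 4.0]
)
```

Only the weighted sums need to be accumulated over the data, which is what makes this approach convenient when the number of features is small.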
* [SPARK-10468] [ MLLIB ] Verify schema before Dataframe select API callVinod K C2015-09-082-5/+2
| | | | | | | | | Loader.checkSchema was called to verify the schema after dataframe.select(...). Schema verification should be done before dataframe.select(...) Author: Vinod K C <vinod.kc@huawei.com> Closes #8636 from vinodkc/fix_GaussianMixtureModel_load_verification.
* [SPARK-10470] [ML] ml.IsotonicRegressionModel.copy should set parentYanbo Liang2015-09-082-1/+6
| | | | | | | | | A copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set the parent. This PR fixes that and adds a test case. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8637 from yanboliang/spark-10470.
* [SPARK-10480] [ML] Fix ML.LinearRegressionModel.copy()Yanbo Liang2015-09-082-2/+4
| | | | | | | | | | | | This PR fixes two model ```copy()``` related issues: [SPARK-10480](https://issues.apache.org/jira/browse/SPARK-10480) ```ML.LinearRegressionModel.copy()``` ignored the argument ```extra```, so it did not take effect when users set this parameter. [SPARK-10479](https://issues.apache.org/jira/browse/SPARK-10479) ```ML.LogisticRegressionModel.copy()``` should copy the model summary if available. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8641 from yanboliang/linear-regression-copy.
* [SPARK-10013] [ML] [JAVA] [TEST] remove java assert from java unit testsHolden Karau2015-09-054-52/+54
| | | | | | | | From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests. Author: Holden Karau <holden@pigscanfly.ca> Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
* [SPARK-10402] [DOCS] [ML] Add defaults to the scaladoc for params in ml/Holden Karau2015-09-0410-2/+16
| | | | | | | | We should make sure the scaladoc for params includes their default values through the models in ml/ Author: Holden Karau <holden@pigscanfly.ca> Closes #8591 from holdenk/SPARK-10402-add-scaladoc-for-default-values-of-params-in-ml.
* [SPARK-9723] [ML] params getordefault should throw more useful errorHolden Karau2015-09-023-8/+15
| | | | | | | | Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup. Author: Holden Karau <holden@pigscanfly.ca> Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
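The intent can be sketched in plain Python as follows. This is a hypothetical stand-in, not the Spark `Params` trait: the class name, the `get_or_default` method, and the error text are all assumptions chosen for illustration.

```python
class Params:
    """Tiny stand-in for a parameter container (hypothetical, not the Spark class)."""

    def __init__(self, defaults=None):
        self._defaults = dict(defaults or {})
        self._values = {}

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        if name in self._values:
            return self._values[name]
        if name in self._defaults:
            return self._defaults[name]
        # Raise a descriptive error instead of letting a bare dictionary lookup fail.
        raise KeyError(f"Failed to find a value or default for param '{name}'")


params = Params(defaults={"maxIter": 100})
```

Naming the missing parameter in the message tells the caller what to set, which a raw key-lookup failure does not.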
* [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words RemoverHolden Karau2015-09-012-4/+4
| | | | | | | | Add a python API for the Stop Words Remover. Author: Holden Karau <holden@pigscanfly.ca> Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
* [SPARK-10349] [ML] OneVsRest use 'when ... otherwise' not UDF to generate ↵Yanbo Liang2015-08-311-8/+2
| | | | | | | | | | | new label at binary reduction Currently OneVsRest uses a UDF to generate the new binary label during training. Now that [SPARK-7321](https://issues.apache.org/jira/browse/SPARK-7321) has been merged, we can use ```when ... otherwise```, which will be more efficient. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8519 from yanboliang/spark-10349.
* [SPARK-9954] [MLLIB] use first 128 nonzeros to compute Vector.hashCodeXiangrui Meng2015-08-311-17/+21
| | | | | | | | This could help reduce hash collisions, e.g., in `RDD[Vector].repartition`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8182 from mengxr/SPARK-9954.
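The idea in the subject line can be sketched in plain Python as below. This is not the MLlib implementation: the function name `vector_hash`, the mixing constant 31, and the 32-bit mask are assumptions for illustration. Only the cap of 128 nonzero entries comes from the commit subject.

```python
def vector_hash(size, values, max_nonzeros=128):
    """Hash a dense list of floats using its size and first `max_nonzeros` nonzero entries.

    Hypothetical sketch; not the MLlib implementation.
    """
    result = 31 + size
    used = 0
    for index, value in enumerate(values):
        if used >= max_nonzeros:
            break  # bound the work per vector
        if value != 0.0:
            # Mix in both the position and the value of each nonzero entry.
            result = 31 * result + index
            result = 31 * result + hash(value)
            used += 1
    return result & 0xFFFFFFFF  # keep the result in 32 bits
```

Skipping zeros means that vectors with long runs of leading zeros still contribute distinguishing entries to the hash, while the cap keeps the cost bounded for very long vectors.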
* [SPARK-10354] [MLLIB] fix some apparent memory issues in k-means|| ↵Xiangrui Meng2015-08-301-7/+14
| | | | | | | | | | | | | | | | initialization * do not cache first cost RDD * change following cost RDD cache level to MEMORY_AND_DISK * remove Vector wrapper to save an object per instance Further improvements will be addressed in SPARK-10329 cc: yu-iskw HuJiayin Author: Xiangrui Meng <meng@databricks.com> Closes #8526 from mengxr/SPARK-10354.
* [SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset ↵Burak Yavuz2015-08-302-16/+15
| | | | | | | | | | | | of matrix multiplications mengxr jkbradley rxin It would be great if this fix made it into RC3! Author: Burak Yavuz <brkyvz@gmail.com> Closes #8525 from brkyvz/blas-scaling.
* [SPARK-10260] [ML] Add @Since annotation to ml.clusteringYu ISHIKAWA2015-08-281-3/+29
| | | | | | | | | ### JIRA [[SPARK-10260] Add Since annotation to ml.clustering - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10260) Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #8455 from yu-iskw/SPARK-10260.
* [SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java ↵Feynman Liang2015-08-271-0/+72
| | | | | | | | | | | | | compatibility test * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230.
* [SPARK-10182] [MLLIB] GeneralizedLinearModel doesn't unpersist cached dataVyacheslav Baranov2015-08-271-0/+5
| | | | | | | | | | | | `GeneralizedLinearModel` creates a cached RDD when building a model. This is inconvenient, since these RDDs flood the memory when building several models in a row, so useful data might get evicted from the cache. The proposed solution is to always cache the dataset and remove the warning. There's a caveat though: the input dataset gets evaluated twice, first in line 270 when fitting `StandardScaler`, and a second time when running the optimizer. So it might be worth restoring the removed warning. Another possible solution is to disable caching entirely and restore the removed warning. I don't really know which approach is better. Author: Vyacheslav Baranov <slavik.baranov@gmail.com> Closes #8395 from SlavikBaranov/SPARK-10182.
* [SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java testsFeynman Liang2015-08-2714-74/+71
| | | | | | | | | | | | * Replaces instances of `Lists.newArrayList` with `Arrays.asList` * Uses `commons.lang.StringUtils` in place of `com.google.collections.Strings` * Uses the `List` interface in place of `ArrayList` implementations This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests. Author: Feynman Liang <fliang@databricks.com> Closes #8451 from feynmanliang/SPARK-10257.
* [SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11Jacek Laskowski2015-08-271-1/+1
| | | | | | | | | | | | | | | | | Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases. Build for 2.10: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install and 2.11: ./dev/change-scala-version.sh 2.11 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install Author: Jacek Laskowski <jacek@japila.pl> Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
* [SPARK-10256] [ML] Removes guava dependency from spark.ml.classification ↵Feynman Liang2015-08-271-2/+2
| | | | | | | | JavaTests Author: Feynman Liang <fliang@databricks.com> Closes #8447 from feynmanliang/SPARK-10256.
* [SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTestsFeynman Liang2015-08-272-6/+6
| | | | | | Author: Feynman Liang <fliang@databricks.com> Closes #8446 from feynmanliang/SPARK-10255.
* [SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTestsFeynman Liang2015-08-2711-30/+35
| | | | | | | | | * Replaces `com.google.common` dependencies with `java.util.Arrays` * Small clean up in `JavaNormalizerSuite` Author: Feynman Liang <fliang@databricks.com> Closes #8445 from feynmanliang/SPARK-10254.
* [SPARK-10241] [MLLIB] update since versions in mllib.recommendationXiangrui Meng2015-08-262-5/+25
| | | | | | | | | | Same as #8421 but for `mllib.recommendation`. cc srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #8432 from mengxr/SPARK-10241.
* [SPARK-9665] [MLLIB] audit MLlib API annotationsXiangrui Meng2015-08-261-4/+8
| | | | | | | | | | I only found `ml.NaiveBayes` missing `Experimental` annotation. This PR doesn't cover Python APIs. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8452 from mengxr/SPARK-9665.
* [SPARK-10236] [MLLIB] update since versions in mllib.featureXiangrui Meng2015-08-258-16/+21
| | | | | | | | | | | | | Same as #8421 but for `mllib.feature`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8449 from mengxr/SPARK-10236.feature and squashes the following commits: 0e8d658 [Xiangrui Meng] remove unnecessary comment ad70b03 [Xiangrui Meng] update since versions in mllib.feature
* [SPARK-10235] [MLLIB] update since versions in mllib.regressionXiangrui Meng2015-08-258-29/+47
| | | | | | | | | | | | Same as #8421 but for `mllib.regression`. cc freeman-lab dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8426 from mengxr/SPARK-10235 and squashes the following commits: 6cd28e4 [Xiangrui Meng] update since versions in mllib.regression
* [SPARK-10243] [MLLIB] update since versions in mllib.treeXiangrui Meng2015-08-2512-44/+57
| | | | | | | | | | Same as #8421 but for `mllib.tree`. cc jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8442 from mengxr/SPARK-10236.
* [SPARK-10234] [MLLIB] update since version in mllib.clusteringXiangrui Meng2015-08-257-23/+44
| | | | | | | | | | Same as #8421 but for `mllib.clustering`. cc feynmanliang yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8435 from mengxr/SPARK-10234.
* [SPARK-10240] [SPARK-10242] [MLLIB] update since versions in mllib.random ↵Xiangrui Meng2015-08-254-25/+117
| | | | | | | | | | | | and mllib.stat The same as #8421 but for `mllib.stat` and `mllib.random`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8439 from mengxr/SPARK-10242.
* [SPARK-10238] [MLLIB] update since versions in mllib.linalgXiangrui Meng2015-08-258-31/+64
| | | | | | | | | | | | Same as #8421 but for `mllib.linalg`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8440 from mengxr/SPARK-10238 and squashes the following commits: b38437e [Xiangrui Meng] update since versions in mllib.linalg
* [SPARK-10233] [MLLIB] update since version in mllib.evaluationXiangrui Meng2015-08-254-7/+27
| | | | | | | | | | Same as #8421 but for `mllib.evaluation`. cc avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8423 from mengxr/SPARK-10233.
* [SPARK-9888] [MLLIB] User guide for new LDA featuresFeynman Liang2015-08-252-1/+1
| | | | | | | | | | | | * Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.
* [SPARK-10239] [SPARK-10244] [MLLIB] update since versions in mllib.pmml and ↵Xiangrui Meng2015-08-259-11/+41
| | | | | | | | | | | | | | mllib.util Same as #8421 but for `mllib.pmml` and `mllib.util`. cc dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8430 from mengxr/SPARK-10239 and squashes the following commits: a189acf [Xiangrui Meng] update since versions in mllib.pmml and mllib.util
* [SPARK-9797] [MLLIB] [DOC] ↵Feynman Liang2015-08-251-1/+1
| | | | | | | | | | StreamingLinearRegressionWithSGD.setConvergenceTol default value Adds default convergence tolerance (0.001, set in `GradientDescent.convergenceTol`) to `setConvergenceTol`'s scaladoc Author: Feynman Liang <fliang@databricks.com> Closes #8424 from feynmanliang/SPARK-9797.
* [SPARK-10237] [MLLIB] update since versions in mllib.fpmXiangrui Meng2015-08-253-7/+32
| | | | | | | | | | Same as #8421 but for `mllib.fpm`. cc feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #8429 from mengxr/SPARK-10237.
* [SPARK-9800] Adds docs for GradientDescent$.runMiniBatchSGD aliasFeynman Liang2015-08-251-1/+4
| | | | | | | | | * Adds doc for alias of runMiniBatchSGD documenting default value for convergenceTol * Cleans up a note in code Author: Feynman Liang <fliang@databricks.com> Closes #8425 from feynmanliang/SPARK-9800.
* [SPARK-10231] [MLLIB] update @Since annotation for mllib.classificationXiangrui Meng2015-08-255-21/+58
| | | | | | | | | | | | | | | | Update `Since` annotation in `mllib.classification`: 1. add version to classes, objects, constructors, and public variables declared in constructors 2. correct some versions 3. remove `Since` on `toString` MechCoder dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8421 from mengxr/SPARK-10231 and squashes the following commits: b2dce80 [Xiangrui Meng] update @Since annotation for mllib.classification
* [SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentrationFeynman Liang2015-08-252-9/+9
| | | | | | | | | | See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230.
* [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses ↵Sean Owen2015-08-256-13/+14
| | | | | | | | | | | | to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
* [SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bugJoseph K. Bradley2015-08-232-9/+35
| | | | | | | | | | | | GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix.
* [SPARK-9893] User guide with Java test suite for VectorSlicerXusen Yin2015-08-211-0/+85
| | | | | | | | | | Add user guide for `VectorSlicer`, along with a Java test suite and a Python version of VectorSlicer. Note that the Python version does not yet support selecting by names. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893.