path: root/mllib
Commit history (message, author, date, files changed, lines -/+):
* [SPARK-12363][MLLIB] Remove setRuns and fix PowerIterationClustering failed test (Liang-Chi Hsieh, 2016-02-13, 2 files, -47/+56)
    JIRA: https://issues.apache.org/jira/browse/SPARK-12363
    This issue was pointed out by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests fails. I found that some `dstAttr`s of the normalized graph are not the correct values but 0.0. Setting `TripletFields.All` in `mapTriplets` makes it work.
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #10539 from viirya/fix-poweriter.
* [SPARK-12746][ML] ArrayType(_, true) should also accept ArrayType(_, false) (Earthson Lu, 2016-02-11, 1 file, -1/+2)
    https://issues.apache.org/jira/browse/SPARK-12746
    Author: Earthson Lu <Earthson.Lu@gmail.com>
    Closes #10697 from Earthson/SPARK-12746.
* [SPARK-12765][ML][COUNTVECTORIZER] fix CountVectorizer.transform's lost transformSchema (Liu Xiang, 2016-02-11, 1 file, -0/+1)
    https://issues.apache.org/jira/browse/SPARK-12765
    Author: Liu Xiang <lxmtlab@gmail.com>
    Closes #10720 from sloth2012/sloth.
* [SPARK-11515][ML] QuantileDiscretizer should take random seed (Yu ISHIKAWA, 2016-02-11, 2 files, -6/+11)
    cc jkbradley
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    Closes #9535 from yu-iskw/SPARK-11515.
* [SPARK-13265][ML] Refactoring of basic ML import/export for other file systems besides HDFS (Yu ISHIKAWA, 2016-02-11, 1 file, -6/+7)
    jkbradley I tried to improve the model export function. When I tried to export a model to S3 under Spark 1.6, it wasn't possible. So it should support S3 besides HDFS. Can you review it when you have time? Thanks!
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    Closes #11151 from yu-iskw/SPARK-13265.
* [SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template (Sasaki Toru, 2016-02-11, 1 file, -1/+1)
    spark-env.sh.template contains multi-byte characters; this PR removes them.
    Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
    Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
* [SPARK-10524][ML] Use the soft prediction to order categories' bins (Liang-Chi Hsieh, 2016-02-09, 4 files, -133/+194)
    JIRA: https://issues.apache.org/jira/browse/SPARK-10524
    Currently we use the hard prediction (`ImpurityCalculator.predict`) to order categories' bins. But we should use the soft prediction.
    Author: Liang-Chi Hsieh <viirya@gmail.com>
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #8734 from viirya/dt-soft-centroids.
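To illustrate the distinction drawn in the commit above, here is a hedged, self-contained sketch (not the Spark decision-tree code): with binary labels, the soft prediction for a categorical value is the fraction of positive labels in its bin, and bins are ordered by that fraction rather than by the 0/1 majority vote, which can tie.

```scala
// Illustration only: order categorical bins by their soft prediction
// (fraction of positive labels) instead of the hard majority-vote label.
object SoftOrderingDemo {
  def main(args: Array[String]): Unit = {
    // category -> binary labels observed for that category
    val bins: Map[Int, Seq[Double]] = Map(
      0 -> Seq(1.0, 1.0, 0.0), // soft prediction 0.67, hard prediction 1
      1 -> Seq(1.0, 0.0, 0.0), // soft prediction 0.33, hard prediction 0
      2 -> Seq(1.0, 1.0, 1.0)) // soft prediction 1.00, hard prediction 1

    // Soft ordering distinguishes categories 0 and 2; hard ordering would tie them.
    val ordered = bins.toSeq
      .map { case (cat, labels) => (cat, labels.sum / labels.size) }
      .sortBy(_._2)
    println(ordered)
  }
}
```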
* [SPARK-13201][SPARK-13200] Deprecation warning cleanups: KMeans & MFDataGenerator (Holden Karau, 2016-02-09, 3 files, -5/+13)
    KMeans: make a private, non-deprecated version of the setRuns API so that we can call it from the Python API without deprecation warnings in our own build. Also use it internally when called from train, and add a logWarning for values other than 1.
    MFDataGenerator: apparently we are calling round on an integer, which in Scala 2.11 results in a warning (it didn't make any sense before either). Figure out whether this is a mistake we can just remove or whether we got the types wrong somewhere.
    I put these two together since they are both deprecation fixes in MLlib and pretty small, but I can split them up if we would prefer it that way.
    Author: Holden Karau <holden@us.ibm.com>
    Closes #11112 from holdenk/SPARK-13201-non-deprecated-setRuns-SPARK-mathround-integer.
* [SPARK-13132][MLLIB] Cache standardization param value in LogisticRegression (Gary King, 2016-02-07, 2 files, -2/+5)
    Cache the value of the standardization Param in LogisticRegression rather than re-fetching it from the ParamMap for every index and every optimization step in the quasi-Newton optimizer. Also, fix Param#toString to cache the stringified representation rather than re-interpolating it on every call, so any other implementations with similar repeated access patterns will see a benefit. This change improves training times for one of my test sets from ~7m30s to ~4m30s.
    Author: Gary King <gary@idibon.com>
    Closes #11027 from idigary/spark-13132-optimize-logistic-regression.
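As a rough sketch of the toString caching idea above (a hypothetical, simplified Param class, not the actual spark.ml implementation): compute the stringified form once and reuse it on every call.

```scala
// Hypothetical, simplified illustration of caching a derived string representation.
class CachedParam[T](val parent: String, val name: String, val doc: String) {
  // Computed at most once; later toString calls reuse the cached value
  // instead of re-interpolating the string on every access.
  @transient private lazy val stringRepresentation: String = s"${parent}__$name"
  override def toString: String = stringRepresentation
}

object CachedParamDemo {
  def main(args: Array[String]): Unit = {
    val p = new CachedParam[Boolean]("logreg", "standardization", "whether to standardize features")
    // Repeated lookups in a hot loop no longer pay the interpolation cost.
    (1 to 3).foreach(_ => println(p.toString))
  }
}
```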
* [SPARK-12732][ML] Bug fix in linear regression train (Imran Younus, 2016-02-02, 2 files, -25/+146)
    Fixed the bug in linear regression train for the case when the target variable is constant. The two cases, `fitIntercept=true` and `fitIntercept=false`, should be treated differently.
    Author: Imran Younus <iyounus@us.ibm.com>
    Closes #10702 from iyounus/SPARK-12732_bug_fix_in_linear_regression_train.
* [SPARK-12711][ML] ML StopWordsRemover does not protect itself from column name duplication (Grzegorz Chilkiewicz, 2016-02-02, 3 files, -8/+19)
    Fixes the problem and verifies the fix with a test suite. Also adds an optional parameter, nullable (Boolean), to SchemaUtils.appendColumn and deduplicates the SchemaUtils.appendColumn functions.
    Author: Grzegorz Chilkiewicz <grzegorz.chilkiewicz@codilime.com>
    Closes #10741 from grzegorz-chilkiewicz/master.
* [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent format (Bryan Cutler, 2016-02-02, 5 files, -29/+37)
    Part of the task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the clustering module.
    Author: Bryan Cutler <cutlerb@gmail.com>
    Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
* [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version (Josh Rosen, 2016-01-30, 1 file, -2/+2)
    This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds). The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance). After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #10608 from JoshRosen/SPARK-6363.
* [SPARK-9835][ML] Implement IterativelyReweightedLeastSquares solver (Yanbo Liang, 2016-01-28, 3 files, -1/+314)
    Implement the ```IterativelyReweightedLeastSquares``` solver for GLMs. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```. There are two limitations in the current implementation compared with R:
    * It cannot support a ```Tuple``` as the response for the ```Binomial``` family, such as the following code:
      ```
      glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial)
      ```
    * It does not support ```offset```.
    Because ```RFormula``` does not support a ```Tuple``` as label or the ```offset``` keyword, I simplified the implementation. Adding support for these two features is not very hard, and I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistics summary for IRLS. The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM). Please focus on the main structure and overlook the minor issues/docs that I will update later. Any comments and opinions will be appreciated. cc mengxr jkbradley
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10639 from yanboliang/spark-9835.
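For background on the commit above, the textbook IRLS iteration for a GLM with link function g is sketched below (standard formulation; the exact form used in the patch may differ): each step solves a weighted least squares problem against a working response.

```latex
\begin{aligned}
\eta_i &= x_i^{\top}\beta^{(t)}, \qquad \mu_i = g^{-1}(\eta_i), \\
z_i &= \eta_i + (y_i - \mu_i)\, g'(\mu_i), \qquad
w_i = \frac{1}{\operatorname{Var}(\mu_i)\,\bigl[g'(\mu_i)\bigr]^{2}}, \\
\beta^{(t+1)} &= \arg\min_{\beta}\; \sum_i w_i \bigl(z_i - x_i^{\top}\beta\bigr)^{2}.
\end{aligned}
```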
* [SPARK-7780][MLLIB] Intercept in LogisticRegressionWithLBFGS should not be regularized (Holden Karau, 2016-01-26, 6 files, -28/+179)
    The intercept in Logistic Regression represents a prior on categories and should not be regularized. In MLlib, the regularization is handled through the Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization. The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib since the majority of users are still using the MLlib API. Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they will converge to the same solution. Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424; re-opening for dbtsai to review.
    Author: Holden Karau <holden@us.ibm.com>
    Author: Holden Karau <holden@pigscanfly.ca>
    Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
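For reference, the objective the commit above describes is the standard L2-regularized logistic loss with the intercept excluded from the penalty (textbook formulation, not quoted from the patch):

```latex
\min_{w,\; b}\;\;
\frac{1}{n}\sum_{i=1}^{n} \log\!\Bigl(1 + \exp\bigl(-y_i\,(w^{\top}x_i + b)\bigr)\Bigr)
\;+\; \frac{\lambda}{2}\,\lVert w \rVert_2^{2},
\qquad y_i \in \{-1, +1\},
```

where the intercept b carries the class prior and is deliberately left unpenalized; with lambda = 0 the regularized and unregularized solutions coincide, matching the note above.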
* [SPARK-11622][MLLIB] Make LibSVMRelation extend HadoopFsRelation and add LibSVMOutputWriter (Jeff Zhang, 2016-01-26, 2 files, -16/+109)
    The behavior of LibSVMRelation is not changed except for adding LibSVMOutputWriter.
    * Partitioning is still not supported.
    * Multiple input paths are not supported.
    Author: Jeff Zhang <zjffdu@apache.org>
    Closes #9595 from zjffdu/SPARK-11622.
* [SPARK-12952] EMLDAOptimizer initialize() should return EMLDAOptimizer rather than its parent class (Xusen Yin, 2016-01-26, 1 file, -1/+3)
    https://issues.apache.org/jira/browse/SPARK-12952
    Author: Xusen Yin <yinxusen@gmail.com>
    Closes #10863 from yinxusen/SPARK-12952.
* [SPARK-12834] Change ser/de of JavaArray and JavaList (Xusen Yin, 2016-01-25, 1 file, -1/+5)
    https://issues.apache.org/jira/browse/SPARK-12834
    We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion to convert them, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I said in https://issues.apache.org/jira/browse/SPARK-12780.
    Author: Xusen Yin <yinxusen@gmail.com>
    Closes #10772 from yinxusen/SPARK-12834.
* [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark (Yanbo Liang, 2016-01-25, 1 file, -0/+2)
    ```PCAModel``` can output ```explainedVariance``` on the Python side. cc mengxr srowen
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10830 from yanboliang/spark-12905.
* [SPARK-11965][ML][DOC] Update user guide for RFormula feature interactions (Yanbo Liang, 2016-01-25, 1 file, -0/+21)
    Update the user guide for RFormula feature interactions. Meanwhile, we also document other new features such as string label support in Spark 1.6.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10222 from yanboliang/spark-11965.
* [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming (Shixiong Zhu, 2016-01-22, 2 files, -2/+2)
    - Remove the Akka dependency from core. Note: the streaming-akka project still uses Akka.
    - Remove HttpFileServer.
    - Remove Akka configs from SparkConf and SSLOptions.
    - Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it.
    - Update comments and docs.
    Author: Shixiong Zhu <shixiong@databricks.com>
    Closes #10854 from zsxwing/remove-akka.
* [SPARK-12908][ML] Add warning message for LogisticRegression for potential convergence issue (DB Tsai, 2016-01-21, 1 file, -0/+8)
    When all labels are the same, it is dangerous ground for LogisticRegression without an intercept to converge. GLMNET doesn't support this case and will just exit. GLM can train, but will emit a warning message saying the algorithm doesn't converge.
    Author: DB Tsai <dbt@netflix.com>
    Closes #10862 from dbtsai/add-tests.
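A minimal sketch of the kind of guard the commit above adds (hypothetical names and wording; the actual check and message in the patch may differ):

```scala
// Hypothetical guard: warn when every training label has the same value and no intercept
// is being fit, since the optimizer can then push the coefficients without converging.
object ConstantLabelWarning {
  def check(distinctLabelCount: Long, fitIntercept: Boolean, logWarning: String => Unit): Unit = {
    if (distinctLabelCount == 1 && !fitIntercept) {
      logWarning("All labels have the same value and fitIntercept=false; the algorithm may not " +
        "converge. Consider setting fitIntercept=true or checking the training data.")
    }
  }

  def main(args: Array[String]): Unit =
    check(distinctLabelCount = 1, fitIntercept = false, logWarning = msg => println(s"WARN: $msg"))
}
```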
* [SPARK-10263][ML] Add @Since annotation to ml.param and ml.* (Takahashi Hiroshi, 2016-01-20, 2 files, -5/+42)
    Add Since annotations to ml.param and ml.*.
    Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
    Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp>
    Closes #8935 from taishi-oss/issue10263.
* [SPARK-12230][ML] WeightedLeastSquares.fit() should handle division by zero properly if standard deviation of target variable is zero (Imran Younus, 2016-01-20, 2 files, -7/+83)
    This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train.
    Author: Imran Younus <iyounus@us.ibm.com>
    Closes #10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
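A hedged sketch of the degenerate case described above (hypothetical helper, not the real WeightedLeastSquares code): when the label has zero variance and an intercept is being fit, the closed-form answer is intercept = mean(y) with all coefficients zero, so no iterative training is needed.

```scala
// Hypothetical guard for a constant target variable in a least-squares solver.
object ConstantLabelGuard {
  /** Returns Some((coefficients, intercept)) when training can be skipped entirely. */
  def trivialSolution(yMean: Double, yStd: Double, numFeatures: Int,
                      fitIntercept: Boolean): Option[(Array[Double], Double)] = {
    if (yStd == 0.0 && fitIntercept) {
      // Constant label with an intercept: intercept = mean(y), all coefficients zero.
      Some((Array.fill(numFeatures)(0.0), yMean))
    } else {
      // Either the label varies or there is no intercept to absorb it; train normally.
      None
    }
  }

  def main(args: Array[String]): Unit =
    println(trivialSolution(yMean = 3.0, yStd = 0.0, numFeatures = 2, fitIntercept = true))
}
```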
* [SPARK-6519][ML] Add spark.ml API for bisecting k-means (Yu ISHIKAWA, 2016-01-20, 2 files, -0/+281)
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    Closes #9604 from yu-iskw/SPARK-6519.
* [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column (BenFradet, 2016-01-19, 3 files, -3/+55)
    This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
    Author: BenFradet <benjamin.fradet@gmail.com>
    Closes #10472 from BenFradet/SPARK-9716.
* [SPARK-12804][ML] Fix LogisticRegression with FitIntercept on all same label training data (Feynman Liang, 2016-01-19, 2 files, -95/+148)
    CC jkbradley mengxr dbtsai
    Author: Feynman Liang <feynman.liang@gmail.com>
    Closes #10743 from feynmanliang/SPARK-12804.
* [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means (Holden Karau, 2016-01-19, 1 file, -0/+17)
    From the coverage issues for 1.6: Add Python API for mllib.clustering.BisectingKMeans.
    Author: Holden Karau <holden@us.ibm.com>
    Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
* [MLLIB] Fix CholeskyDecomposition assertion's message (Wojciech Jurczyk, 2016-01-19, 1 file, -1/+1)
    Change the assertion's message so it is consistent with the code. The old message says that the invoked method was lapack.dports, when in fact it was the lapack.dppsv method.
    Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
    Closes #10818 from wjur/wjur/rename_error_message.
* [SPARK-12346][ML] Missing attribute names in GLM for vector-type features (Eric Liang, 2016-01-18, 3 files, -5/+43)
    Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names. cc mengxr
    Author: Eric Liang <ekl@databricks.com>
    Closes #10323 from ericl/spark-12346.
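A tiny, hypothetical sketch of the fallback-naming idea mentioned above (illustrative helper only; the real VectorAssembler logic is more involved):

```scala
// Hypothetical: synthesize an attribute name when an assembled input slot carries none,
// so downstream summaries always have a label to display.
object FallbackAttrName {
  def resolve(inputCol: String, slotIndex: Int, existingName: Option[String]): String =
    existingName.getOrElse(s"${inputCol}_$slotIndex")

  def main(args: Array[String]): Unit = {
    println(resolve("features", 0, Some("age"))) // age
    println(resolve("features", 1, None))        // features_1
  }
}
```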
* [SPARK-10264][DOCUMENTATION] Added @Since to ml.recommendation (Tommy YU, 2016-01-18, 1 file, -3/+30)
    I created a new PR since the original PR had not been updated for a long time. Please help review it. srowen
    Author: Tommy YU <tummyyu@163.com>
    Closes #10756 from Wenpei/add_since_to_recomm.
* [SPARK-12830] Java style: disallow trailing whitespaces. (Reynold Xin, 2016-01-14, 1 file, -1/+1)
    Author: Reynold Xin <rxin@databricks.com>
    Closes #10764 from rxin/SPARK-12830.
* [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large (Yuhao Yang, 2016-01-13, 1 file, -2/+4)
    JIRA: https://issues.apache.org/jira/browse/SPARK-12026
    The issue is valid: features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger. I tested locally, and the change improves performance with stable running times.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #10146 from hhbyyh/chiSq.
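A rough, self-contained sketch of the access pattern described above (not the ChiSqTest code; exact timings depend on the Scala version's view implementation): repeatedly slicing a lazy view can re-walk the skipped prefix, while direct index ranges cost the same for every batch.

```scala
// Illustration only: compare repeated view slices against direct index ranges.
object SliceCostDemo {
  def main(args: Array[String]): Unit = {
    val features = Array.tabulate(200000)(_.toDouble)
    val batch = 1000

    def time[A](label: String)(body: => A): A = {
      val t0 = System.nanoTime()
      val result = body
      println(f"$label: ${(System.nanoTime() - t0) / 1e6}%.1f ms")
      result
    }

    // Each slice of the lazy view may skip `start` elements one by one before yielding.
    time("view slices") {
      var sum = 0.0
      var start = 0
      while (start < features.length) {
        features.view.zipWithIndex.slice(start, start + batch).foreach { case (v, _) => sum += v }
        start += batch
      }
      sum
    }

    // Direct indexing touches exactly `batch` elements per iteration.
    time("index ranges") {
      var sum = 0.0
      var start = 0
      while (start < features.length) {
        val end = math.min(start + batch, features.length)
        var i = start
        while (i < end) { sum += features(i); i += 1 }
        start += batch
      }
      sum
    }
  }
}
```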
* [SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero (Sean Owen, 2016-01-12, 1 file, -1/+6)
    Cosine similarity with a 0 vector should be 0.
    Related to https://github.com/apache/spark/pull/10152
    Author: Sean Owen <sowen@cloudera.com>
    Closes #10696 from srowen/SPARK-7615.
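A hedged, standalone sketch of the rule stated above (not the patched Word2VecModel code): treat the cosine similarity as 0 whenever either vector has zero Euclidean norm, instead of dividing by zero.

```scala
// Cosine similarity that returns 0.0 instead of NaN/Infinity when a vector has zero norm.
object SafeCosine {
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    println(cosineSimilarity(Array(1.0, 0.0), Array(0.0, 0.0))) // 0.0 rather than NaN
    println(cosineSimilarity(Array(1.0, 0.0), Array(1.0, 1.0))) // ~0.7071
  }
}
```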
* [SPARK-10809][MLLIB] Single-document topicDistributions method for LocalLDAModel (Yuhao Yang, 2016-01-11, 2 files, -3/+38)
    JIRA: https://issues.apache.org/jira/browse/SPARK-10809
    We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries that avoid RDD operations. Currently, the user must use an RDD of documents. Also adds some missing asserts.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #9484 from hhbyyh/ldaTopicPre.
* [SPARK-12685][MLLIB] word2vec trainWordsCount gets overflow (Yuhao Yang, 2016-01-11, 1 file, -4/+4)
    JIRA: https://issues.apache.org/jira/browse/SPARK-12685
    The log of `word2vec` reports trainWordsCount = -785727483 during computation over a large dataset. Updating the priority, as it affects the computation process:
    `alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #10627 from hhbyyh/w2voverflow.
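A simplified sketch of the overflow reported above (illustration only; the real Word2Vec code accumulates counts during vocabulary construction): summing per-word counts into an Int wraps around once the corpus exceeds roughly 2.1 billion tokens, which then corrupts the learning-rate schedule, so the total should be kept in a Long.

```scala
// Simplified illustration: an Int running total of word counts overflows on large corpora,
// while a Long total keeps the learning-rate schedule well-defined.
object TrainWordsCountDemo {
  def main(args: Array[String]): Unit = {
    val vocabCounts = Array.fill(3)(1000000000)          // three words, 1e9 occurrences each

    val intTotal: Int = vocabCounts.sum                   // wraps around to a negative value
    val longTotal: Long = vocabCounts.map(_.toLong).sum   // 3000000000

    println(s"Int total:  $intTotal")
    println(s"Long total: $longTotal")

    // Learning-rate schedule of the form quoted in the commit message:
    // alpha = learningRate * (1 - numPartitions * wordCount / (trainWordsCount + 1))
    val learningRate = 0.025
    val numPartitions = 4
    val wordCount = 100000000L
    val alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (longTotal + 1))
    println(s"alpha = $alpha")
  }
}
```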
* [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft (Yanbo Liang, 2016-01-11, 2 files, -1/+5)
    PySpark MLlib ```GaussianMixtureModel``` should support single-instance ```predict/predictSoft``` just as the Scala API does.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10552 from yanboliang/spark-12603.
* [SPARK-3873][BUILD] Enable import ordering error checking. (Marcelo Vanzin, 2016-01-10, 5 files, -8/+7)
    Turn import ordering violations into build errors, plus a few adjustments to account for how the checker behaves. I'm a little on the fence about whether the existing code is right, but it's easier to appease the checker than to discuss what's the more correct order here. Plus a few fixes to imports that cropped in since my recent cleanups.
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #10612 from vanzin/SPARK-3873-enable.
* [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (Space before "," or ":") (Kousuke Saruta, 2016-01-10, 14 files, -19/+19)
    Fix the style violation (space before , and :). This PR is a followup for #10643.
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    Closes #10684 from sarutak/SPARK-12692-followup-mllib.
* [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition (Sean Owen, 2016-01-08, 3 files, -15/+15)
    Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #10570 from srowen/SPARK-12618.
* [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile (Robert Dodier, 2016-01-06, 1 file, -1/+2)
    This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)
    Author: Robert Dodier <robert_dodier@users.sourceforge.net>
    Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
* [SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName (BenFradet, 2016-01-06, 1 file, -2/+1)
    For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC". Also, the documentation says: "The default metric used to choose the best ParamMap can be overriden by the setMetric method in each of these evaluators." However, the method is called setMetricName. This PR aims to fix both issues.
    Author: BenFradet <benjamin.fradet@gmail.com>
    Closes #10328 from BenFradet/SPARK-12368.
* [SPARK-3873][TESTS] Import ordering fixes. (Marcelo Vanzin, 2016-01-05, 47 files, -62/+58)
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #10582 from vanzin/SPARK-3873-tests.
* [SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans (RJ Nowling, 2016-01-05, 1 file, -0/+8)
    SPARK-12450. Un-persist broadcasted variables in KMeans.
    Author: RJ Nowling <rnowling@gmail.com>
    Closes #10415 from rnowling/spark-12450.
* [SPARK-6724][MLLIB] Support model save/load for FPGrowthModel (Yanbo Liang, 2016-01-05, 3 files, -3/+205)
    Support model save/load for FPGrowthModel.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #9267 from yanboliang/spark-6724.
* [SPARK-12331][ML] R^2 for regression through the origin. (Imran Younus, 2016-01-05, 3 files, -71/+112)
    Modified the definition of R^2 for regression through the origin. Added a modified test for regression metrics.
    Author: Imran Younus <iyounus@us.ibm.com>
    Author: Imran Younus <imranyounus@gmail.com>
    Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
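For reference, the commonly used definition of R^2 for regression through the origin compares the residual sum of squares against the uncentered total sum of squares (standard statistics convention; the exact weighted form adopted in the patch may differ):

```latex
R^{2}_{\text{origin}} \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^{2}}{\sum_i y_i^{2}}
\qquad\text{versus}\qquad
R^{2} \;=\; 1 - \frac{\sum_i (y_i - \hat{y}_i)^{2}}{\sum_i (y_i - \bar{y})^{2}}
\quad\text{when an intercept is fit.}
```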
* [SPARK-9622][ML] DecisionTreeRegressor: provide variance of prediction (Yanbo Liang, 2016-01-04, 5 files, -4/+92)
    DecisionTreeRegressor will provide variance of prediction as a Double column.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #8866 from yanboliang/spark-9622.
* [SPARK-11259][ML] Params.validateParams() should be called automatically (Yanbo Liang, 2016-01-04, 30 files, -1/+63)
    See JIRA: https://issues.apache.org/jira/browse/SPARK-11259
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #9224 from yanboliang/spark-11259.
* [SPARK-12599][MLLIB][SQL] Remove the use of callUDF in MLlib (Reynold Xin, 2016-01-02, 1 file, -2/+2)
    callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that and replaces the invocation of the deprecated callUDF with it.
    Author: Reynold Xin <rxin@databricks.com>
    Closes #10547 from rxin/SPARK-12599.
* [SPARK-3873][MLLIB] Import order fixes. (Marcelo Vanzin, 2015-12-31, 94 files, -167/+158)
    A slight adjustment to the checker configuration was needed; there is a handful of warnings still left, but those are because of a bug in the checker that I'll fix separately (before enabling errors for the checker, of course).
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #10535 from vanzin/SPARK-3873-mllib.