path: root/mllib
Commit message | Author | Age | Files | Lines
* [SPARK-6660][MLLIB] pythonToJava doesn't recognize object arrays | Xiangrui Meng | 2015-04-01 | 1 | -1/+4
  davies
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5318 from mengxr/SPARK-6660 and squashes the following commits:
  0f66ec2 [Xiangrui Meng] recognize object arrays
  ad8c42f [Xiangrui Meng] add a test for SPARK-6660
* [SPARK-6580] [MLLIB] Optimize LogisticRegressionModel.predictPoint | Yanbo Liang | 2015-04-01 | 1 | -29/+26
  https://issues.apache.org/jira/browse/SPARK-6580
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5249 from yanboliang/spark-6580 and squashes the following commits:
  6f47f21 [Yanbo Liang] address comments
  4e0bd0f [Yanbo Liang] fix typos
  04e2e2a [Yanbo Liang] trigger jenkins
  cad5bcd [Yanbo Liang] Optimize LogisticRegressionModel.predictPoint
* [SPARK-6642][MLLIB] use 1.2 lambda scaling and remove addImplicit from NormalEquation | Xiangrui Meng | 2015-04-01 | 2 | -81/+57
  This PR changes lambda scaling from number of users/items to number of explicit ratings. The latter is the behavior in 1.2. Slight refactor of NormalEquation to make it independent of ALS models. srowen codexiang
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5314 from mengxr/SPARK-6642 and squashes the following commits:
  dc655a1 [Xiangrui Meng] relax python tests
  f410df2 [Xiangrui Meng] use 1.2 scaling and remove addImplicit from NormalEquation
* [SPARK-5692] [MLlib] Word2Vec save/load | MechCoder | 2015-03-31 | 2 | -3/+110
  Word2Vec model now supports saving and loading.
  a] The metadata, stored in JSON format, consists of "version", "classname", "vectorSize" and "numWords".
  b] The data, stored in Parquet file format, consists of an Array of rows, each row having 2 columns: the first is the word (String) and the second is an Array of Floats.
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5291 from MechCoder/spark-5692 and squashes the following commits:
  1142f3a [MechCoder] Add numWords to metaData
  bfe4c39 [MechCoder] [SPARK-5692] Word2Vec save/load
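The save format described above can be sketched in plain Python. This is only an illustration of the metadata/data split: the function names, the `"1.0"` version string used in the usage note, and the consistency checks are assumptions made for the example, not the code in the PR.

```python
import json


def build_word2vec_metadata(class_name, version, vector_size, num_words):
    """Serialize the four metadata fields named in the commit message as JSON."""
    return json.dumps(
        {
            "classname": class_name,
            "version": version,
            "vectorSize": vector_size,
            "numWords": num_words,
        },
        sort_keys=True,
    )


def check_loaded_data(metadata_json, rows):
    """Check (word, vector) rows against the metadata and return a word -> vector dict.

    The checks are hypothetical; they only show why vectorSize and numWords
    are useful to keep next to the data.
    """
    meta = json.loads(metadata_json)
    if len(rows) != meta["numWords"]:
        raise ValueError("expected %d words, found %d" % (meta["numWords"], len(rows)))
    for word, vector in rows:
        if len(vector) != meta["vectorSize"]:
            raise ValueError("vector for %r has the wrong size" % word)
    return dict(rows)
```

For example, `build_word2vec_metadata("Word2VecModel", "1.0", 2, 2)` yields a JSON string that `check_loaded_data` can validate against two rows of two floats each.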
* [SPARK-6255] [MLLIB] Support multiclass classification in Python API | Yanbo Liang | 2015-03-31 | 1 | -4/+18
  Python API parity check for classification, plus multiclass classification support. The major disparities that need to be added for Python:
  ```scala
  LogisticRegressionWithLBFGS
      setNumClasses
      setValidateData
  LogisticRegressionModel
      getThreshold
      numClasses
      numFeatures
  SVMWithSGD
      setValidateData
  SVMModel
      getThreshold
  ```
  For users, the greatest benefit of this PR is that the Python API now supports multiclass classification. Users can train a multiclass classification model and use it to predict in PySpark.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5137 from yanboliang/spark-6255 and squashes the following commits:
  0bd531e [Yanbo Liang] address comments
  444d5e2 [Yanbo Liang] LogisticRegressionModel.predict() optimization
  fc7990b [Yanbo Liang] address comments
  b0d9c63 [Yanbo Liang] Support Mulinomial LR model predict in Python API
  ded847c [Yanbo Liang] Python API parity check for classification (support multiclass classification)
* [SPARK-4894][mllib] Added Bernoulli option to NaiveBayes model in mllib | leahmcguire | 2015-03-31 | 3 | -84/+312
  Added an optional model type parameter for NaiveBayes training. It can be either Multinomial or Bernoulli. When Bernoulli is given, Bernoulli smoothing is used for fitting and for prediction, as per: http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html. The default model type is the original Multinomial fit and predict. Added additional testing for Bernoulli and Multinomial models.
  Author: leahmcguire <lmcguire@salesforce.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: Leah McGuire <lmcguire@salesforce.com>
  Closes #4087 from leahmcguire/master and squashes the following commits:
  f3c8994 [leahmcguire] changed checks on model type to requires
  acb69af [leahmcguire] removed enum type and replaces all modelType parameters with strings
  2224b15 [Leah McGuire] Merge pull request #2 from jkbradley/leahmcguire-master
  9ad89ca [Joseph K. Bradley] removed old code
  6a8f383 [Joseph K. Bradley] Added new model save/load format 2.0 for NaiveBayesModel after modelType parameter was added. Updated tests. Also updated ModelType enum-like type.
  852a727 [leahmcguire] merged with upstream master
  a22d670 [leahmcguire] changed NaiveBayesModel modelType parameter back to NaiveBayes.ModelType, made NaiveBayes.ModelType serializable, fixed getter method in NavieBayes
  18f3219 [leahmcguire] removed private from naive bayes constructor for lambda only
  bea62af [leahmcguire] put back in constructor for NaiveBayes
  01baad7 [leahmcguire] made fixes from code review
  fb0a5c7 [leahmcguire] removed typo
  e2d925e [leahmcguire] fixed nonserializable error that was causing naivebayes test failures
  2d0c1ba [leahmcguire] fixed typo in NaiveBayes
  c298e78 [leahmcguire] fixed scala style errors
  b85b0c9 [leahmcguire] Merge remote-tracking branch 'upstream/master'
  900b586 [leahmcguire] fixed model call so that uses type argument
  ea09b28 [leahmcguire] Merge remote-tracking branch 'upstream/master'
  e016569 [leahmcguire] updated test suite with model type fix
  85f298f [leahmcguire] Merge remote-tracking branch 'upstream/master'
  dc65374 [leahmcguire] integrated model type fix
  7622b0c [leahmcguire] added comments and fixed style as per rb
  b93aaf6 [Leah McGuire] Merge pull request #1 from jkbradley/nb-model-type
  3730572 [Joseph K. Bradley] modified NB model type to be more Java-friendly
  b61b5e2 [leahmcguire] added back compatable constructor to NaiveBayesModel to fix MIMA test failure
  5a4a534 [leahmcguire] fixed scala style error in NaiveBayes
  3891bf2 [leahmcguire] synced with apache spark and resolved merge conflict
  d9477ed [leahmcguire] removed old inaccurate comment from test suite for mllib naive bayes
  76e5b0f [leahmcguire] removed unnecessary sort from test
  0313c0c [leahmcguire] fixed style error in NaiveBayes.scala
  4a3676d [leahmcguire] Updated changes re-comments. Got rid of verbose populateMatrix method. Public api now has string instead of enumeration. Docs are updated."
  ce73c63 [leahmcguire] added Bernoulli option to niave bayes model in mllib, added optional model type parameter for training. When Bernoulli is given the Bernoulli smoothing is used for fitting and for prediction http://nlp.stanford.edu/IR-book/html/htmledition/the-bernoulli-model-1.html
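To make the difference between the two model types concrete, here is a small plain-Python sketch of additive smoothing for each, following the formulas on the Stanford IR-book page linked in the commit message. It is not the MLlib implementation: the function names, the dict-based model layout, and the unsmoothed class prior are assumptions chosen to keep the example short.

```python
import math


def fit_naive_bayes(samples, labels, model_type="multinomial", lam=1.0):
    """Fit per-class log priors and smoothed feature probabilities (lam > 0 assumed)."""
    if model_type not in ("multinomial", "bernoulli"):
        raise ValueError("unknown model type: %r" % model_type)
    num_features = len(samples[0])
    model = {}
    for label in sorted(set(labels)):
        rows = [x for x, y in zip(samples, labels) if y == label]
        feature_sums = [sum(row[j] for row in rows) for j in range(num_features)]
        if model_type == "multinomial":
            # Term counts, normalized by all term occurrences in the class.
            denom = sum(feature_sums) + lam * num_features
        else:
            # Document counts (0/1 features), normalized by documents in the class.
            denom = len(rows) + 2.0 * lam
        probs = [(s + lam) / denom for s in feature_sums]
        model[label] = (math.log(len(rows) / len(samples)), probs)
    return model


def predict(model, features, model_type="multinomial"):
    """Return the label with the highest log posterior."""
    def score(label):
        log_prior, probs = model[label]
        if model_type == "multinomial":
            return log_prior + sum(x * math.log(p) for x, p in zip(features, probs))
        # Bernoulli: absent features contribute log(1 - p) as well.
        return log_prior + sum(
            math.log(p) if x else math.log(1.0 - p) for x, p in zip(features, probs)
        )
    return max(model, key=score)
```

The visible difference is in the denominator used while fitting and in how absent features are scored: the Bernoulli variant penalizes a class for terms that are expected but missing, while the multinomial variant ignores them.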
* [SPARK-6571][MLLIB] use wrapper in MatrixFactorizationModel.load | Xiangrui Meng | 2015-03-28 | 2 | -18/+40
  This fixes `predictAll` after load. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5243 from mengxr/SPARK-6571 and squashes the following commits:
  82dcaa7 [Xiangrui Meng] use wrapper in MatrixFactorizationModel.load
* [SPARK-6526][ML] Add Normalizer transformer in ML package | Xusen Yin | 2015-03-27 | 2 | -0/+162
  See [SPARK-6526](https://issues.apache.org/jira/browse/SPARK-6526).
  mengxr Should we add a test suite for this transformer? There is currently no test suite for any of the feature transformers in the ML package.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #5181 from yinxusen/SPARK-6526 and squashes the following commits:
  6faa7bf [Xusen Yin] fix style
  8a462da [Xusen Yin] remove duplications
  ab35ab0 [Xusen Yin] add test suite
  bc8cd0f [Xusen Yin] fix comment
  79774c9 [Xusen Yin] add Normalizer transformer in ML package
* [SPARK-6341][mllib] Upgrade breeze from 0.11.1 to 0.11.2 | Yu ISHIKAWA | 2015-03-27 | 1 | -1/+1
  There are some bugs in breeze's SparseVector at 0.11.1, and Spark 1.3 depends on breeze 0.11.1, so I think we should upgrade it to 0.11.2.
  https://issues.apache.org/jira/browse/SPARK-6341
  Thank you for your great cooperation, David Hall (dlwh).
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #5222 from yu-iskw/upgrade-breeze and squashes the following commits:
  ad8a688 [Yu ISHIKAWA] Upgrade breeze from 0.11.1 to 0.11.2 because of a bug of SparseVector. Thanks you for your great cooperation, David Hall(@dlwh)
* [MLlib]remove unused import | Yuhao Yang | 2015-03-26 | 1 | -1/+1
  minor thing. Let me know if jira is required.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #5207 from hhbyyh/adjustImport and squashes the following commits:
  2240121 [Yuhao Yang] remove unused import
* [SPARK-5987] [MLlib] Save/load for GaussianMixtureModels | MechCoder | 2015-03-25 | 2 | -20/+128
  Should be self explanatory.
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #4986 from MechCoder/spark-5987 and squashes the following commits:
  7d2cd56 [MechCoder] Iterate over dataframe in a better way
  e7a14cb [MechCoder] Minor
  33c84f9 [MechCoder] Store as Array[Data] instead of Data[Array]
  505bd57 [MechCoder] Rebased over master and used MatrixUDT
  7422bb4 [MechCoder] Store sigmas as Array[Double] instead of Array[Array[Double]]
  b9794e4 [MechCoder] Minor
  cb77095 [MechCoder] [SPARK-5987] Save/load for GaussianMixtureModels
* [SPARK-6256] [MLlib] MLlib Python API parity check for regression | Yanbo Liang | 2015-03-25 | 1 | -3/+13
  MLlib Python API parity check for regression. The major disparities that need to be added for Python are listed below:
  ```scala
  LinearRegressionWithSGD
      setValidateData
  LassoWithSGD
      setIntercept
      setValidateData
  RidgeRegressionWithSGD
      setIntercept
      setValidateData
  ```
  setFeatureScaling is an MLlib-private function that does not need to be exposed in PySpark.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #4997 from yanboliang/spark-6256 and squashes the following commits:
  102f498 [Yanbo Liang] fix intercept issue & add doc test
  1fb7b4f [Yanbo Liang] change 'intercept' to 'addIntercept'
  de5ecbc [Yanbo Liang] MLlib Python API parity check for regression
* [ML][FEATURE] SPARK-5566: RegEx Tokenizer | Augustin Borsu | 2015-03-25 | 3 | -1/+221
  Added a regex-based tokenizer for ml. Currently the regex is fixed, but if I could add a regex type parameter to the paramMap, the tokenizer regex could become a parameter used in cross-validation. I also wonder what the best way to add a stop word list would be.
  Author: Augustin Borsu <augustin@sagacify.com>
  Author: Augustin Borsu <a.borsu@gmail.com>
  Author: Augustin Borsu <aborsu@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #4504 from aborsu985/master and squashes the following commits:
  716d257 [Augustin Borsu] Merge branch 'mengxr-SPARK-5566'
  cb07021 [Augustin Borsu] Merge branch 'SPARK-5566' of git://github.com/mengxr/spark into mengxr-SPARK-5566
  5f09434 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  a164800 [Xiangrui Meng] remove tabs
  556aa27 [Xiangrui Meng] Merge branch 'aborsu985-master' into SPARK-5566
  9651aec [Xiangrui Meng] update test
  f96526d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5566
  2338da5 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  e88d7b8 [Xiangrui Meng] change pattern to a StringParameter; update tests
  148126f [Augustin Borsu] Added return type to public functions
  12dddb4 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  daf685e [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  6a85982 [Augustin Borsu] Style corrections
  38b95a1 [Augustin Borsu] Added Java unit test for RegexTokenizer
  b66313f [Augustin Borsu] Modified the pattern Param so it is compiled when given to the Tokenizer
  e262bac [Augustin Borsu] Added unit tests in scala
  cd6642e [Augustin Borsu] Changed regex to pattern
  132b00b [Augustin Borsu] Changed matching to gaps and removed case folding
  201a107 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  cb9c9a7 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  d3ef6d3 [Augustin Borsu] Added doc to RegexTokenizer
  9082fc3 [Augustin Borsu] Removed stopwords parameters and updated doc
  19f9e53 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  f6a5002 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  7f930bb [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  77ff9ca [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  2e89719 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  196cd7a [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  11ca50f [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  9f8685a [Augustin Borsu] RegexTokenizer
  9e07a78 [Augustin Borsu] Merge remote-tracking branch 'upstream/master'
  9547e9d [Augustin Borsu] RegEx Tokenizer
  01cd26f [Augustin Borsu] RegExTokenizer
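One of the squashed commits mentions switching the pattern from matching tokens to matching gaps and removing case folding. The difference between the two readings of a pattern can be sketched with Python's `re` module. The function name and the `gaps` flag are hypothetical and used only for this illustration; they are not taken from the PR.

```python
import re


def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """Tokenize text with a regex, without changing case.

    gaps=True:  the pattern describes the separators between tokens.
    gaps=False: the pattern describes the tokens themselves.
    """
    if gaps:
        # Note: capturing groups in the pattern would also be returned by re.split.
        tokens = re.split(pattern, text)
    else:
        tokens = [match.group(0) for match in re.finditer(pattern, text)]
    # Splitting can produce empty strings at the edges; drop them.
    return [token for token in tokens if token]
```

With the default whitespace gap pattern, `regex_tokenize("Hello  Spark ML")` keeps the original casing of each token, and `regex_tokenize("a,b;c", r"\w+", gaps=False)` extracts the word characters directly.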
* [SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) should initialize numFeatures | Yanbo Liang | 2015-03-25 | 2 | -0/+10
  In GeneralizedLinearAlgorithm, `numFeatures` defaults to -1, so it must be updated to the correct value when run() is called to train a model. `LogisticRegressionWithLBFGS.run(input)` works well, but calling `LogisticRegressionWithLBFGS.run(input, initialWeights)` to train a multiclass classification model throws an exception because numFeatures is not updated. This PR updates numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and adds a test case.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5167 from yanboliang/spark-6496 and squashes the following commits:
  8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, initialWeights) should initialize numFeatures
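The shape of the fix can be illustrated with a toy class. Everything here is hypothetical (the class name, the tuple it returns, and the size check); it only shows why a `-1` placeholder has to be replaced before the initial weights are interpreted.

```python
class GeneralizedLinearSketch:
    """Toy stand-in for an algorithm object whose feature count starts unset."""

    def __init__(self):
        self.num_features = -1  # placeholder until training data is seen

    def run(self, data, initial_weights):
        """Validate initial_weights against data given as (label, features) pairs.

        Returns (num_features, number_of_weight_blocks); no training is done.
        """
        if not data:
            raise ValueError("input must not be empty")
        # The fix: derive num_features from the input before anything uses it.
        if self.num_features < 0:
            self.num_features = len(data[0][1])
        # With the -1 placeholder this check would be meaningless, because
        # any length is divisible by -1.
        if len(initial_weights) % self.num_features != 0:
            raise ValueError("initial weights do not match the number of features")
        return self.num_features, len(initial_weights) // self.num_features
```

In this sketch, a 3-feature input with 6 initial weights is read as two blocks of weights, which loosely mirrors how a multiclass model carries more than one weight vector.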
* [SPARK-6308] [MLlib] [Sql] Override TypeName in VectorUDT and MatrixUDT | MechCoder | 2015-03-23 | 4 | -0/+8
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5118 from MechCoder/spark-6308 and squashes the following commits:
  6c8ffab [MechCoder] Add test for simpleString
  b966242 [MechCoder] [SPARK-6308] [MLlib][Sql] VectorUDT is displayed as vecto in dtypes
* [SPARK-6337][Documentation, SQL]Spark 1.3 doc fixes | vinodkc | 2015-03-22 | 1 | -1/+1
  Author: vinodkc <vinod.kc.in@gmail.com>
  Closes #5112 from vinodkc/spark_1.3_doc_fixes and squashes the following commits:
  2c6aee6 [vinodkc] Spark 1.3 doc fixes
* [SPARK-6025] [MLlib] Add helper method evaluateEachIteration to extract learning curve | MechCoder | 2015-03-20 | 6 | -44/+94
  Added evaluateEachIteration to allow the user to manually extract the error for each iteration of GradientBoosting. The internal optimisation can be dealt with later.
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #4906 from MechCoder/spark-6025 and squashes the following commits:
  67146ab [MechCoder] Minor
  352001f [MechCoder] Minor
  6e8aa10 [MechCoder] Made the following changes Used mapPartition instead of map Refactored computeError and unpersisted broadcast variables
  bc99ac6 [MechCoder] Refactor the method and stuff
  dbda033 [MechCoder] [SPARK-6025] Add helper method evaluateEachIteration to extract learning curve
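The idea of extracting a learning curve from a boosted ensemble can be sketched in a few lines of plain Python. This is an assumption-laden illustration, not the MLlib method: trees are modelled as plain callables, and the function names and the choice of squared error are made up for the example.

```python
def squared_error(prediction, label):
    """Example loss; any function of (prediction, label) can be passed instead."""
    return (prediction - label) ** 2


def evaluate_each_iteration(trees, tree_weights, data, loss=squared_error):
    """Return the mean loss of the ensemble after each boosting iteration.

    trees:        callables mapping a feature list to a float
    tree_weights: one weight per tree
    data:         list of (features, label) pairs
    """
    running = [0.0] * len(data)  # ensemble prediction so far, per data point
    errors = []
    for tree, weight in zip(trees, tree_weights):
        for i, (features, _) in enumerate(data):
            running[i] += weight * tree(features)
        total = sum(loss(pred, label) for pred, (_, label) in zip(running, data))
        errors.append(total / len(data))
    return errors
```

Keeping a running prediction per point means each tree is evaluated once, rather than re-evaluating the first m trees for every m.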
* [SPARK-6309] [SQL] [MLlib] Implement MatrixUDT | MechCoder | 2015-03-20 | 2 | -0/+103
  Utilities to serialize and deserialize Matrices in MLlib
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #5048 from MechCoder/spark-6309 and squashes the following commits:
  05dc6f2 [MechCoder] Hashcode and organize imports
  16d5d47 [MechCoder] Test some more
  6e67020 [MechCoder] TST: Test using Array conversion instead of equals
  7fa7a2c [MechCoder] [SPARK-6309] [SQL] [MLlib] Implement MatrixUDT
* [SPARK-5955][MLLIB] add checkpointInterval to ALS | Xiangrui Meng | 2015-03-20 | 4 | -5/+82
  Add checkpointInterval to ALS to prevent:
  1. StackOverflow exceptions caused by long lineage,
  2. large shuffle files generated during iterations,
  3. slow recovery when some nodes fail.
  srowen coderxiang
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:
  df56791 [Xiangrui Meng] update impl to reuse code
  29affcb [Xiangrui Meng] do not materialize factors in implicit
  20d3f7f [Xiangrui Meng] add checkpointInterval to ALS
* [Spark 6096][MLlib] Add Naive Bayes load save methods in Python | Xusen Yin | 2015-03-20 | 1 | -0/+11
  See [SPARK-6096](https://issues.apache.org/jira/browse/SPARK-6096).
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #5090 from yinxusen/SPARK-6096 and squashes the following commits:
  bd0fea5 [Xusen Yin] fix style problem, etc.
  3fd41f2 [Xusen Yin] use hanging indent in Python style
  e83803d [Xusen Yin] fix Python style
  d6dbde5 [Xusen Yin] fix python call java error
  a054bb3 [Xusen Yin] add save load for NaiveBayes python
* [MLlib] SPARK-5954: Top by key | Shuo Xiang | 2015-03-20 | 2 | -0/+96
  This PR implements two functions
  - `topByKey(num: Int): RDD[(K, Array[V])]` finds the top-k values for each key in a pair RDD. This can be used, e.g., in computing top recommendations.
  - `takeOrderedByKey(num: Int): RDD[(K, Array[V])]` does the opposite of `topByKey`
  The `sorted` is used here as the `toArray` method of the PriorityQueue does not return a necessarily sorted array.
  Author: Shuo Xiang <shuoxiangpub@gmail.com>
  Closes #5075 from coderxiang/topByKey and squashes the following commits:
  1611c37 [Shuo Xiang] code clean up
  6f565c0 [Shuo Xiang] naming
  a80e0ec [Shuo Xiang] typo and warning
  82dded9 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
  d202745 [Shuo Xiang] move to MLPairRDDFunctions
  901b0af [Shuo Xiang] style check
  70c6e35 [Shuo Xiang] remove takeOrderedByKey, update doc and test
  0895c17 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
  b10e325 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into topByKey
  debccad [Shuo Xiang] topByKey
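The per-key top-k idea, including the final sort that the commit message calls out, can be sketched locally with Python's `heapq`. This is a single-machine illustration under assumptions: the function name is hypothetical, the input is a plain list of pairs rather than an RDD, and descending output order is assumed.

```python
import heapq
from collections import defaultdict


def top_by_key(pairs, num):
    """Return {key: the num largest values for that key, in descending order}."""
    if num < 1:
        raise ValueError("num must be at least 1")
    heaps = defaultdict(list)  # key -> min-heap holding at most num values
    for key, value in pairs:
        heap = heaps[key]
        if len(heap) < num:
            heapq.heappush(heap, value)
        elif value > heap[0]:
            # The smallest kept value is evicted in favor of the new one.
            heapq.heapreplace(heap, value)
    # A heap's underlying list is only partially ordered, so sort explicitly,
    # which is the same reason the commit message gives for calling sorted.
    return {key: sorted(heap, reverse=True) for key, heap in heaps.items()}
```

A bounded min-heap keeps memory per key at `num` values, which is what makes this approach attractive for cases such as top recommendations per user.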
* [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT. | Marcelo Vanzin | 2015-03-20 | 1 | -1/+1
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #5056 from vanzin/SPARK-6371 and squashes the following commits:
  63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371
  6506f75 [Marcelo Vanzin] Use more fine-grained exclusion.
  178ba71 [Marcelo Vanzin] Oops.
  75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA.
  a45a62c [Marcelo Vanzin] Work around MIMA warning.
  1d8a670 [Marcelo Vanzin] Re-group jetty exclusion.
  0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx.
  cef4603 [Marcelo Vanzin] Indentation.
  296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
* [SPARK-6428][MLlib] Added explicit type for public methods and implemented hashCode when equals is defined. | Reynold Xin | 2015-03-20 | 21 | -58/+97
  I want to add a checker to turn public type checking on, since future pull requests can accidentally expose a non-public type. This is the first cleanup task.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5102 from rxin/mllib-hashcode-publicmethodtypes and squashes the following commits:
  617f19e [Reynold Xin] Fixed Scala compilation error.
  52bc2d5 [Reynold Xin] [MLlib] Added explicit type for public methods and implemented hashCode when equals is defined.
* [SPARK-6291] [MLLIB] GLM toString & toDebugString | Yanbo Liang | 2015-03-19 | 3 | -1/+14
  GLM toString prints out the intercept and numFeatures. For LogisticRegression and SVM models, toString also prints out numClasses and threshold. GLM toDebugString prints out the full weights and the intercept.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #5038 from yanboliang/spark-6291 and squashes the following commits:
  2f578b0 [Yanbo Liang] code format
  78b33f2 [Yanbo Liang] fix typos
  1e8a023 [Yanbo Liang] GLM toString & toDebugString
* [SPARK-6374] [MLlib] add get for GeneralizedLinearAlgo | Yuhao Yang | 2015-03-18 | 1 | -0/+10
  In actual usage I find it better to have getters for numFeatures and addIntercept within GeneralizedLinearAlgorithm; otherwise I have to read the values through a debugger.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #5058 from hhbyyh/addGetLinear and squashes the following commits:
  9dc90e8 [Yuhao Yang] add get for GeneralizedLinearAlgo
* [SPARK-6226][MLLIB] add save/load in PySpark's KMeansModel | Xiangrui Meng | 2015-03-17 | 1 | -0/+5
  Use `_py2java` and `_java2py` to convert the Python model to/from the Java model. yinxusen
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5049 from mengxr/SPARK-6226-mengxr and squashes the following commits:
  570ba81 [Xiangrui Meng] fix python style
  b10b911 [Xiangrui Meng] add save/load in PySpark's KMeansModel
* [SPARK-6336] LBFGS should document what convergenceTol means | lewuathe | 2015-03-17 | 1 | -1/+5
  LBFGS uses a convergence tolerance. This value should be described in the documentation as an argument.
  Author: lewuathe <lewuathe@me.com>
  Closes #5033 from Lewuathe/SPARK-6336 and squashes the following commits:
  e738b33 [lewuathe] Modify text to be more natural
  ac03c3a [lewuathe] Modify documentations
  6ccb304 [lewuathe] [SPARK-6336] LBFGS should document what convergenceTol means
* [SPARK-6252] [mllib] Added getLambda to Scala NaiveBayes | Joseph K. Bradley | 2015-03-13 | 2 | -0/+11
  Note: not relevant for Python API since it only has a static train method
  Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #4969 from jkbradley/SPARK-6252 and squashes the following commits:
  a471d90 [Joseph K. Bradley] small edits from review
  63eff48 [Joseph K. Bradley] Added getLambda to Scala NaiveBayes
* [SPARK-4588] ML Attributes | Xiangrui Meng | 2015-03-12 | 10 | -0/+1306
  This continues the work in #4460 from srowen. The design doc is published on the JIRA page with some minor changes.
  Short description of ML attributes: https://github.com/apache/spark/pull/4925/files?diff=unified#diff-95e7f5060429f189460b44a3f8731a35R24
  More details can be found in the design doc.
  srowen Could you help review this PR? There are many lines but most of them are boilerplate code.
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Sean Owen <sowen@cloudera.com>
  Closes #4925 from mengxr/SPARK-4588-new and squashes the following commits:
  71d1bd0 [Xiangrui Meng] add JavaDoc for package ml.attribute
  617be40 [Xiangrui Meng] remove final; rename cardinality to numValues
  393ffdc [Xiangrui Meng] forgot to include Java attribute group tests
  b1aceef [Xiangrui Meng] more tests
  e7ab467 [Xiangrui Meng] update ML attribute impl
  7c944da [Sean Owen] Add FeatureType hierarchy and categorical cardinality
  2a21d6d [Sean Owen] Initial draft of FeatureAttributes class
* [SPARK-6268][MLlib] KMeans parameter getter methods | Yuhao Yang | 2015-03-12 | 1 | -0/+37
  jira: https://issues.apache.org/jira/browse/SPARK-6268
  KMeans has many setters for parameters. It should have matching getters.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #4974 from hhbyyh/get4Kmeans and squashes the following commits:
  f44d4dc [Yuhao Yang] add experimental to getRuns
  f94a3d7 [Yuhao Yang] add get for KMeans
* [SPARK-5814][MLLIB][GRAPHX] Remove JBLAS from runtime | Xiangrui Meng | 2015-03-12 | 9 | -86/+84
  The issue is discussed in https://issues.apache.org/jira/browse/SPARK-5669. Replacing all JBLAS usage by netlib-java gives us a simpler dependency tree and fewer license issues to worry about. I didn't touch the test scope in this PR. The user guide is not modified to avoid merge conflicts with branch-1.3. srowen ankurdave pwendell
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #4699 from mengxr/SPARK-5814 and squashes the following commits:
  48635c6 [Xiangrui Meng] move netlib-java version to parent pom
  ca21c74 [Xiangrui Meng] remove jblas from ml-guide
  5f7767a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5814
  c5c4183 [Xiangrui Meng] merge master
  0f20cad [Xiangrui Meng] add mima excludes
  e53e9f4 [Xiangrui Meng] remove jblas from mllib runtime
  ceaa14d [Xiangrui Meng] replace jblas by netlib-java in graphx
  fa7c2ca [Xiangrui Meng] move jblas to test scope
* SPARK-6225 [CORE] [SQL] [STREAMING] Resolve most build warnings, 1.3.0 edition | Sean Owen | 2015-03-11 | 1 | -4/+4
  Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #4950 from srowen/SPARK-6225 and squashes the following commits:
  3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark
  c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
* [SPARK-5986][MLLib] Add save/load for k-means | Xusen Yin | 2015-03-11 | 2 | -4/+108
  This PR adds save/load for K-means as described in SPARK-5986. The Python version will be added in another PR.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #4951 from yinxusen/SPARK-5986 and squashes the following commits:
  6dd74a0 [Xusen Yin] rewrite some functions and classes
  cd390fd [Xusen Yin] add indexed point
  b144216 [Xusen Yin] remove invalid comments
  dce7055 [Xusen Yin] add save/load for k-means for SPARK-5986
* [SPARK-6090][MLLIB] add a basic BinaryClassificationMetrics to PySpark/MLlib | Xiangrui Meng | 2015-03-05 | 1 | -0/+8
  A simple wrapper around the Scala implementation. `DataFrame` is used for serialization/deserialization. Methods that return `RDD`s are not supported in this PR.
  davies If we recognize Scala's `Product`s in Py4J, we can easily add wrappers for Scala methods that return `RDD[(Double, Double)]`. Is it easy to register a serializer for `Product` in PySpark?
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #4863 from mengxr/SPARK-6090 and squashes the following commits:
  009a3a3 [Xiangrui Meng] provide schema
  dcddab5 [Xiangrui Meng] add a basic BinaryClassificationMetrics to PySpark/MLlib
* SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 | Sean Owen | 2015-03-05 | 1 | -1/+1
  Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
  Author: Sean Owen <sowen@cloudera.com>
  Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits:
  eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
* [SPARK-6141][MLlib] Upgrade Breeze from 0.10 to 0.11 to fix convergence bug | Xiangrui Meng | 2015-03-03 | 2 | -1/+5
  LBFGS and OWLQN in Breeze 0.10 have a convergence check bug. This is fixed in 0.11; see the description in the Breeze project for details: https://github.com/scalanlp/breeze/pull/373#issuecomment-76879760
  Author: Xiangrui Meng <meng@databricks.com>
  Author: DB Tsai <dbtsai@alpinenow.com>
  Author: DB Tsai <dbtsai@dbtsai.com>
  Closes #4879 from dbtsai/breeze and squashes the following commits:
  d848f65 [DB Tsai] Merge pull request #1 from mengxr/AlpineNow-breeze
  c2ca6ac [Xiangrui Meng] upgrade to breeze-0.11.1
  35c2f26 [Xiangrui Meng] fix LRSuite
  397a208 [DB Tsai] upgrade breeze
* [SPARK-6120] [mllib] Warnings about memory in tree, ensemble model save | Joseph K. Bradley | 2015-03-02 | 2 | -4/+50
  Issue: When the Python DecisionTree example in the programming guide is run, it runs out of Java Heap Space when using the default memory settings for the spark shell.
  This prints a warning.
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #4864 from jkbradley/dt-save-heap and squashes the following commits:
  02e8daf [Joseph K. Bradley] fixed based on code review
  7ecb1ed [Joseph K. Bradley] Added warnings about memory when calling tree and ensemble model save with too small a Java heap size
* [SPARK-5950][SQL]Insert array into a metastore table saved as parquet should work when using datasource api | Yin Huai | 2015-03-02 | 2 | -1/+3
  This PR contains the following changes:
  1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it checks `equalsIgnoreNullability` and also whether the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values).
  2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types.
  3. For our data source write path, when appending data, we always use the schema of the existing table to write the data. This is important for parquet, since nullability directly impacts the way values are encoded/decoded. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings.
  4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations where we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust.
  5. Update the equality check of JSON relation. Since JSON does not really care about nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from two JSON tables.
  JIRA: https://issues.apache.org/jira/browse/SPARK-5950
  Thanks viirya for the initial work in #4729.
  cc marmbrus liancheng
  Author: Yin Huai <yhuai@databricks.com>
  Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits:
  3b61a04 [Yin Huai] Revert change on equals.
  80e487e [Yin Huai] asNullable in UDT.
  587d88b [Yin Huai] Make methods private.
  0cb7ea2 [Yin Huai] marmbrus's comments.
  3cec464 [Yin Huai] Cheng's comments.
  486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
  d3747d1 [Yin Huai] Remove unnecessary change.
  8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck
  8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check.
  0eb5578 [Yin Huai] Fix tests.
  f6ed813 [Yin Huai] Update old parquet path.
  e4f397c [Yin Huai] Unit tests.
  b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check.
  8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data.
  bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data.
  0a703e7 [Yin Huai] Test failed again since we cannot read correct content.
  9a26611 [Yin Huai] Make InsertIntoTable happy.
  8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability
  4ec17fd [Yin Huai] Failed test.
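The one-directional nature of "compatible nullability" in item 1 can be sketched for array types in plain Python. The `ArrayType` tuple and the function below are hypothetical stand-ins written for this illustration; they cover only arrays and atomic type names, not the full Spark SQL type system.

```python
from collections import namedtuple

# Hypothetical miniature of an array type: an element type plus a null flag.
ArrayType = namedtuple("ArrayType", ["element_type", "contains_null"])


def equals_ignore_compatible_nullability(from_type, to_type):
    """True if data of from_type can be written where to_type is expected.

    Element types must match; nullability may only be widened, so
    non-nullable -> nullable is fine and nullable -> non-nullable is not.
    """
    if isinstance(from_type, ArrayType) and isinstance(to_type, ArrayType):
        nullability_ok = to_type.contains_null or not from_type.contains_null
        return nullability_ok and equals_ignore_compatible_nullability(
            from_type.element_type, to_type.element_type
        )
    # Atomic types (plain strings here) must be equal.
    return from_type == to_type
```

The asymmetry is the point: an array known to hold no nulls can always be treated as one that might, but not the other way around.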
* [SPARK-6053][MLLIB] support save/load in PySpark's ALS | Xiangrui Meng | 2015-03-01 | 1 | -1/+1
  A simple wrapper to save/load `MatrixFactorizationModel` in Python. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #4811 from mengxr/SPARK-5991 and squashes the following commits:
  f135dac [Xiangrui Meng] update save doc
  57e5200 [Xiangrui Meng] address comments
  06140a4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-5991
  282ec8d [Xiangrui Meng] support save/load in PySpark's ALS
* SPARK-6063 MLlib doesn't pass mvn scalastyle check due to UTF chars in LDAModel.scala | Michael Griffiths | 2015-02-28 | 1 | -1/+1
  Remove unicode characters from MLlib file.
  Author: Michael Griffiths <msjgriffiths@gmail.com>
  Author: Griffiths, Michael (NYC-RPM) <michael.griffiths@reprisemedia.com>
  Closes #4815 from msjgriffiths/SPARK-6063 and squashes the following commits:
  bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 'theta' to standard single apostrophe (\x27)
  38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
  b08e865 [Michael Griffiths] Merge pull request #1 from apache/master
* [SPARK-6004][MLlib] Pick the best model when training GradientBoostedTrees with validation | Liang-Chi Hsieh | 2015-02-26 | 1 | -3/+9
| | | | | | | | | | | | | Since the validation error does not change monotonically, in practice it is better to pick the model with the best validation error when training GradientBoostedTrees with validation, instead of simply stopping early. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4763 from viirya/gbt_record_model and squashes the following commits: 452e049 [Liang-Chi Hsieh] Address comment. ea2fae2 [Liang-Chi Hsieh] Pick the best model when training GradientBoostedTrees with validation.
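The difference between stopping at the first increase and picking the best model can be illustrated with a small, self-contained sketch. This is not the MLlib implementation; the function names and the error values are hypothetical, and the list stands in for the validation error measured after each boosting iteration.

```python
def stop_at_first_increase(validation_errors):
    """Return (num_trees, error) when training halts at the first increase.

    validation_errors[m] is assumed to be the validation error of the
    ensemble built from the first m + 1 trees (non-empty list).
    """
    for m in range(1, len(validation_errors)):
        if validation_errors[m] > validation_errors[m - 1]:
            # Halt and keep the m trees trained before the increase.
            return m, validation_errors[m - 1]
    return len(validation_errors), validation_errors[-1]


def pick_best_prefix(validation_errors):
    """Return (num_trees, error) for the prefix with the lowest validation error."""
    if not validation_errors:
        raise ValueError("need at least one validation error")
    best_m = min(range(len(validation_errors)), key=validation_errors.__getitem__)
    return best_m + 1, validation_errors[best_m]


# Hypothetical, non-monotonic validation errors for six boosting iterations.
errors = [0.40, 0.31, 0.27, 0.29, 0.25, 0.26]
early = stop_at_first_increase(errors)
best = pick_best_prefix(errors)
```

With these made-up numbers, halting at the first increase keeps three trees, while tracking the best error keeps five, because the error dips again after a temporary rise.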
* [SPARK-5976][MLLIB] Add partitioner to factors returned by ALS | Xiangrui Meng | 2015-02-25 | 2 | -23/+64
| | | | | | | | | | | | | The model trained by ALS requires partitioning information to do quick lookups of a user/item factor when making recommendations on individual requests. The new implementation did not set partitioners on the factors returned by ALS, which would cause a performance regression. srowen coderxiang Author: Xiangrui Meng <meng@databricks.com> Closes #4748 from mengxr/SPARK-5976 and squashes the following commits: 9373a09 [Xiangrui Meng] add partitioner to factors returned by ALS 260f183 [Xiangrui Meng] add a test for partitioner
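Why a known partitioner matters for single-key lookups can be sketched with a toy model in plain Python. This is only an illustration, not Spark code: the class `PartitionedFactors`, its fields, and the modulo partitioning scheme are assumptions made for the example. With a known partitioner only the partition that owns the key is searched; without one, every partition has to be scanned.

```python
class PartitionedFactors:
    """Toy stand-in for a keyed, partitioned collection of factor vectors."""

    def __init__(self, factors, num_partitions, keep_partitioner=True):
        self.num_partitions = num_partitions
        self.has_partitioner = keep_partitioner
        self.partitions_scanned = 0
        # Place each (id, vector) pair in the partition chosen by id % n.
        self.partitions = [dict() for _ in range(num_partitions)]
        for key, vector in factors.items():
            self.partitions[key % num_partitions][key] = vector

    def lookup(self, key):
        if self.has_partitioner:
            # The partitioner tells us which single partition owns the key.
            candidates = [self.partitions[key % self.num_partitions]]
        else:
            # Without partitioning information, all partitions are scanned.
            candidates = self.partitions
        result = None
        for part in candidates:
            self.partitions_scanned += 1
            if key in part:
                result = part[key]
        return result


factors = {i: [float(i)] for i in range(8)}

with_partitioner = PartitionedFactors(factors, num_partitions=4)
without_partitioner = PartitionedFactors(factors, num_partitions=4,
                                         keep_partitioner=False)

found_fast = with_partitioner.lookup(5)
found_slow = without_partitioner.lookup(5)
```

Both lookups return the same vector, but the first touches one partition and the second touches all four.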
* [SPARK-5436] [MLlib] Validate GradientBoostedTrees using runWithValidation | MechCoder | 2015-02-24 | 3 | -6/+111
| | | | | | | | | | | | | | | | | | | Training can stop early if the decrease in validation error is less than a given tolerance (tol), or if the error increases because the model overfits the training data. This introduces a new method, runWithValidation, which takes a pair of RDDs, one for the training data and the other for validation. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4677 from MechCoder/spark-5436 and squashes the following commits: 1bb21d4 [MechCoder] Combine regression and classification tests into a single one e4d799b [MechCoder] Addresses indentation and doc comments b48a70f [MechCoder] COSMIT b928a19 [MechCoder] Move validation while training section under usage tips fad9b6e [MechCoder] Made the following changes 1. Add section to documentation 2. Return corresponding to bestValidationError 3. Allow negative tolerance. 55e5c3b [MechCoder] One liner for prevValidateError 3e74372 [MechCoder] TST: Add test for classification 77549a9 [MechCoder] [SPARK-5436] Validate GradientBoostedTrees using runWithValidation
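The stopping rule described in this commit can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual `runWithValidation` code: the function name `iterations_before_stop` is hypothetical, and the list of errors stands in for the validation error computed after each boosting iteration. An increase in error counts as a negative decrease, so it triggers the stop for any non-negative tolerance, while a negative tolerance lets small increases pass.

```python
def iterations_before_stop(validation_errors, tol):
    """Return how many iterations are kept under a tolerance-based stop rule.

    Training stops at iteration m when the error decrease relative to the
    previous iteration is smaller than tol. Assumes a non-empty list.
    """
    prev = validation_errors[0]
    for m in range(1, len(validation_errors)):
        current = validation_errors[m]
        if prev - current < tol:
            # Improvement too small (or error went up): keep the first m learners.
            return m
        prev = current
    return len(validation_errors)


# Hypothetical validation errors for five boosting iterations.
errors = [0.40, 0.31, 0.27, 0.29, 0.25]

kept_default = iterations_before_stop(errors, tol=0.01)
kept_strict = iterations_before_stop(errors, tol=0.05)
kept_lenient = iterations_before_stop(errors, tol=-0.05)
```

A larger tolerance stops sooner, and a negative tolerance rides out the temporary rise at the fourth iteration.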
* [SPARK-5867] [SPARK-5892] [doc] [ml] [mllib] Doc cleanups for 1.3 release | Joseph K. Bradley | 2015-02-20 | 4 | -11/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For SPARK-5867: * The spark.ml programming guide needs to be updated to use the new SQL DataFrame API instead of the old SchemaRDD API. * It should also include Python examples now. For SPARK-5892: * Fix Python docs * Various other cleanups BTW, I accidentally merged this with master. If you want to compile it on your own, use this branch which is based on spark/branch-1.3 and cherry-picks the commits from this PR: [https://github.com/jkbradley/spark/tree/doc-review-1.3-check] CC: mengxr (ML), davies (Python docs) Author: Joseph K. Bradley <joseph@databricks.com> Closes #4675 from jkbradley/doc-review-1.3 and squashes the following commits: f191bb0 [Joseph K. Bradley] small cleanups e786efa [Joseph K. Bradley] small doc corrections 6b1ab4a [Joseph K. Bradley] fixed python lint test 946affa [Joseph K. Bradley] Added sample data for ml.MovieLensALS example. Changed spark.ml Java examples to use DataFrames API instead of sql() da81558 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into doc-review-1.3 629dbf5 [Joseph K. Bradley] Updated based on code review: * made new page for old migration guides * small fixes * moved inherit_doc in python b9df7c4 [Joseph K. Bradley] Small cleanups: toDF to toDF(), adding s for string interpolation 34b067f [Joseph K. Bradley] small doc correction da16aef [Joseph K. Bradley] Fixed python mllib docs 8cce91c [Joseph K. Bradley] GMM: removed old imports, added some doc 695f3f6 [Joseph K. Bradley] partly done trying to fix inherit_doc for class hierarchies in python docs a72c018 [Joseph K. Bradley] made ChiSqTestResult appear in python docs b05a80d [Joseph K. Bradley] organize imports. doc cleanups e572827 [Joseph K. Bradley] updated programming guide for ml and mllib
* [SPARK-5900][MLLIB] make PIC and FPGrowth Java-friendly | Xiangrui Meng | 2015-02-19 | 5 | -47/+75
| | | | | | | | | | | | | | | | | | | In the previous version, PIC stores clustering assignments as an `RDD[(Long, Int)]`. This is mapped to `RDD<Tuple2<Object, Object>>` in Java and hence Java users have to cast types manually. We should either create a new method called `javaAssignments` that returns `JavaRDD[(java.lang.Long, java.lang.Integer)]` or wrap the result pair in a class. I chose the latter approach in this PR. Now assignments are stored as an `RDD[Assignment]`, where `Assignment` is a class with `id` and `cluster`. Similarly, in FPGrowth, the frequent itemsets are stored as an `RDD[(Array[Item], Long)]`, which is mapped to `RDD<Tuple2<Object, Object>>`. Though we provide a "Java-friendly" method `javaFreqItemsets` that returns `JavaRDD[(Array[Item], java.lang.Long)]`, it doesn't really work because `Array[Item]` is mapped to `Object` in Java. So in this PR I created a class `FreqItemset` to wrap the results. It has `items` and `freq`, as well as a `javaItems` method that returns `List<Item>` in Java. I'm not certain that the names I chose are proper: `Assignment`/`id`/`cluster` and `FreqItemset`/`items`/`freq`. Please let me know if there are better suggestions. CC: jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #4695 from mengxr/SPARK-5900 and squashes the following commits: 865b5ca [Xiangrui Meng] make Assignment serializable cffa96e [Xiangrui Meng] fix test 9c0e590 [Xiangrui Meng] remove unused Tuple2 1b9db3d [Xiangrui Meng] make PIC and FPGrowth Java-friendly
* SPARK-4682 [CORE] Consolidate various 'Clock' classes | Sean Owen | 2015-02-19 | 2 | -2/+2
| | | | | | | | | | | | | | | Another one from JoshRosen's wish list. The first commit is much smaller and removes 2 of the 4 Clock classes. The second is much larger, necessary for consolidating the streaming one. I put together implementations in the way that seemed simplest. Almost all the change is standardizing class and method names. Author: Sean Owen <sowen@cloudera.com> Closes #4514 from srowen/SPARK-4682 and squashes the following commits: 5ed3a03 [Sean Owen] Javadoc Clock classes; make ManualClock private[spark] 169dd13 [Sean Owen] Add support for legacy org.apache.spark.streaming clock class names 277785a [Sean Owen] Reduce the net change in this patch by reversing some unnecessary syntax changes along the way b5e53df [Sean Owen] FakeClock -> ManualClock; getTime() -> getTimeMillis() 160863a [Sean Owen] Consolidate Streaming Clock class into common util Clock 7c956b2 [Sean Owen] Consolidate Clocks except for Streaming Clock
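The idea behind a single clock abstraction with a manual variant for tests can be sketched in a few lines of Python. This is an analogy only, not the Spark classes: the names `SystemClock` and `ManualClock` echo the commit, but `get_time_millis`, `advance`, and `is_expired` are hypothetical names chosen for this example.

```python
import time


class SystemClock:
    """Clock backed by the real wall-clock time."""

    def get_time_millis(self):
        return int(time.time() * 1000)


class ManualClock:
    """Clock whose time only changes when a test moves it forward."""

    def __init__(self, start_millis=0):
        self._now = start_millis

    def get_time_millis(self):
        return self._now

    def advance(self, delta_millis):
        # Move time forward deterministically, with no sleeping involved.
        self._now += delta_millis


def is_expired(clock, started_at_millis, timeout_millis):
    """Time-dependent logic written against the clock interface only."""
    return clock.get_time_millis() - started_at_millis >= timeout_millis


clock = ManualClock(start_millis=1000)
before = is_expired(clock, started_at_millis=1000, timeout_millis=500)
clock.advance(500)
after = is_expired(clock, started_at_millis=1000, timeout_millis=500)
```

Because `is_expired` depends only on the shared method, the same code runs against `SystemClock` in production and `ManualClock` in tests, which is the practical benefit of standardizing class and method names.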
* [SPARK-5902] [ml] Made PipelineStage.transformSchema public instead of private to ml | Joseph K. Bradley | 2015-02-19 | 5 | -12/+20
| | | | | | | | | | | | | | | | For users to implement their own PipelineStages, we need to make PipelineStage.transformSchema public instead of private to ml. This would be nice to include in Spark 1.3. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #4682 from jkbradley/SPARK-5902 and squashes the following commits: 6f02357 [Joseph K. Bradley] Made transformSchema public 0e6d0a0 [Joseph K. Bradley] made implementations of transformSchema protected as well fdaf26a [Joseph K. Bradley] Made PipelineStage.transformSchema protected instead of private[ml]
* [SPARK-5879][MLLIB] update PIC user guide and add a Java example | Xiangrui Meng | 2015-02-18 | 1 | -0/+9
| | | | | | | | | | | | Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue. Author: Xiangrui Meng <meng@databricks.com> Closes #4680 from mengxr/SPARK-5897 and squashes the following commits: 847d216 [Xiangrui Meng] apache header 87719a2 [Xiangrui Meng] remove PIC image 2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
* [Minor] [SQL] Cleans up DataFrame variable names and toDF() calls | Cheng Lian | 2015-02-17 | 6 | -7/+7
| | | | | | | | | | | | | | Although we've migrated to the DataFrame API, lots of code still uses `rdd` or `srdd` as local variable names. This PR tries to address these naming inconsistencies and some other minor DataFrame related style issues. Author: Cheng Lian <lian@databricks.com> Closes #4670 from liancheng/df-cleanup and squashes the following commits: 3e14448 [Cheng Lian] Cleans up DataFrame variable names and toDF() calls
* [SPARK-3381] [MLlib] Eliminate bins for unordered features in DecisionTrees | MechCoder | 2015-02-17 | 3 | -73/+15
| | | | | | | | | | | | | For unordered features, it is sufficient to use splits, since the threshold of the split corresponds to the threshold of the HighSplit of the bin and the LowSplit is never used. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #4231 from MechCoder/spark-3381 and squashes the following commits: 58c19a5 [MechCoder] COSMIT c274b74 [MechCoder] Remove unordered feature calculation in labeledPointToTreePoint b2b9b89 [MechCoder] COSMIT d3ee042 [MechCoder] [SPARK-3381] [MLlib] Eliminate bins for unordered features