aboutsummaryrefslogtreecommitdiff
path: root/mllib/src
Commit message (Collapse)AuthorAgeFilesLines
* Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"Xiangrui Meng2015-07-062-80/+10
| | | | | | | | | | This reverts commit 25f574eb9a3cb9b93b7d9194a8ec16e00ce2c036. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it. feynmanliang Author: Xiangrui Meng <meng@databricks.com> Closes #7240 from mengxr/SPARK-7212.revert and squashes the following commits: 2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
* [SPARK-7137] [ML] Update SchemaUtils checkInputColumn to print more info if ↵Joshi2015-07-051-2/+7
| | | | | | | | | | | | | | | needed Author: Joshi <rekhajoshm@gmail.com> Author: Rekha Joshi <rekhajoshm@gmail.com> Closes #5992 from rekhajoshm/fix/SPARK-7137 and squashes the following commits: 8c42b57 [Joshi] update checkInputColumn to print more info if needed 33ddd2e [Joshi] update checkInputColumn to print more info if needed acf3e17 [Joshi] update checkInputColumn to print more info if needed 8993c0e [Joshi] SPARK-7137: Add checkInputColumn back to Params and print more info e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
* [SPARK-7104] [MLLIB] Support model save/load in Python's Word2VecYu ISHIKAWA2015-07-021-0/+3
| | | | | | | | | | | Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits: 975136b [Yu ISHIKAWA] Organize import 0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save` 1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
* [SPARK-3382] [MLLIB] GradientDescent convergence tolerancelewuathe2015-07-027-22/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | GrandientDescent can receive convergence tolerance value. Default value is 0.0. When loss value becomes less than the tolerance which is set by user, iteration is terminated. Author: lewuathe <lewuathe@me.com> Closes #3636 from Lewuathe/gd-convergence-tolerance and squashes the following commits: 0b8a9a8 [lewuathe] Update doc ce91b15 [lewuathe] Merge branch 'master' into gd-convergence-tolerance 4f22c2b [lewuathe] Modify based on SPARK-1503 5e47b82 [lewuathe] Merge branch 'master' into gd-convergence-tolerance abadb7e [lewuathe] Fix LassoSuite 8fadebd [lewuathe] Fix failed unit tests ee5de46 [lewuathe] Merge branch 'master' into gd-convergence-tolerance 8313ba2 [lewuathe] Fix styles 0ead94c [lewuathe] Merge branch 'master' into gd-convergence-tolerance a94cfd5 [lewuathe] Modify some styles 3aef0a2 [lewuathe] Modify converged logic to do relative comparison f7b19d5 [lewuathe] [SPARK-3382] Clarify comparison logic e6c9cd2 [lewuathe] [SPARK-3382] Compare with the diff of solution vector 4b125d2 [lewuathe] [SPARK3382] Fix scala style e7c10dd [lewuathe] [SPARK-3382] format improvements f867eea [lewuathe] [SPARK-3382] Modify warning message statements b9d5e61 [lewuathe] [SPARK-3382] should compare diff inside loss history and convergence tolerance 5433f71 [lewuathe] [SPARK-3382] GradientDescent convergence tolerance
* [SPARK-8479] [MLLIB] Add numNonzeros and numActives to linalg.MatricesMechCoder2015-07-022-0/+29
| | | | | | | | | | | | | Matrices allow zeros to be stored in values. Sometimes a method is handy to check if the numNonZeros are same as number of Active values. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6904 from MechCoder/nnz_matrix and squashes the following commits: 252c6b7 [MechCoder] Add to MiMa excludes e2390f5 [MechCoder] Use count instead of foreach 2f62b2f [MechCoder] Add to MiMa excludes d6e96ef [MechCoder] [SPARK-8479] Add numNonzeros and numActives to linalg.Matrices
* [SPARK-8708] [MLLIB] Paritition ALS ratings based on both users and productsLiang-Chi Hsieh2015-07-021-6/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | JIRA: https://issues.apache.org/jira/browse/SPARK-8708 Previously the partitions of ratings are only based on the given products. So if the `usersProducts` given for prediction contains only few products or even one product, the generated ratings will be pushed into few or single partition and can't use high parallelism. The following codes are the example reported in the JIRA. Because it asks the predictions for users on product 2. There is only one partition in the result. >>> r1 = (1, 1, 1.0) >>> r2 = (1, 2, 2.0) >>> r3 = (2, 1, 2.0) >>> r4 = (2, 2, 2.0) >>> r5 = (3, 1, 1.0) >>> ratings = sc.parallelize([r1, r2, r3, r4, r5], 5) >>> users = ratings.map(itemgetter(0)).distinct() >>> model = ALS.trainImplicit(ratings, 1, seed=10) >>> predictions_for_2 = model.predictAll(users.map(lambda u: (u, 2))) >>> predictions_for_2.glom().map(len).collect() [0, 0, 3, 0, 0] This PR uses user and product instead of only product to partition the ratings. Author: Liang-Chi Hsieh <viirya@gmail.com> Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7121 from viirya/mfm_fix_partition and squashes the following commits: 779946d [Liang-Chi Hsieh] Calculate approximate numbers of users and products in one pass. 4336dc2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into mfm_fix_partition 83e56c1 [Liang-Chi Hsieh] Instead of additional join, use the numbers of users and products to decide how to perform join. b534dc8 [Liang-Chi Hsieh] Paritition ratings based on both users and products.
* [SPARK-8647] [MLLIB] Potential issue with constant hashCodeAlok Singh2015-07-022-2/+4
| | | | | | | | | | | | | | | I added the code, // see [SPARK-8647], this achieves the needed constant hash code without constant no. override def hashCode(): Int = this.getClass.getName.hashCode() does getting the constant hash code as per jira Author: Alok Singh <singhal@Aloks-MacBook-Pro.local> Closes #7146 from aloknsingh/aloknsingh_SPARK-8647 and squashes the following commits: e58bccf [Alok Singh] [SPARK-8647][MLlib] to avoid the class derivation issues, change the constant hashCode to override def hashCode(): Int = classOf[MatrixUDT].getName.hashCode() 43cdb89 [Alok Singh] [SPARK-8647][MLlib] Potential issue with constant hashCode
* [SPARK-3071] Increase default driver memoryIlya Ganelin2015-07-012-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | I've updated default values in comments, documentation, and in the command line builder to be 1g based on comments in the JIRA. I've also updated most usages to point at a single variable defined in the Utils.scala and JavaUtils.java files. This wasn't possible in all cases (R, shell scripts etc.) but usage in most code is now pointing at the same place. Please let me know if I've missed anything. Will the spark-shell use the value within the command line builder during instantiation? Author: Ilya Ganelin <ilya.ganelin@capitalone.com> Closes #7132 from ilganeli/SPARK-3071 and squashes the following commits: 4074164 [Ilya Ganelin] String fix 271610b [Ilya Ganelin] Merge branch 'SPARK-3071' of github.com:ilganeli/spark into SPARK-3071 273b6e9 [Ilya Ganelin] Test fix fd67721 [Ilya Ganelin] Update JavaUtils.java 26cc177 [Ilya Ganelin] test fix e5db35d [Ilya Ganelin] Fixed test failure 39732a1 [Ilya Ganelin] merge fix a6f7deb [Ilya Ganelin] Created default value for DRIVER MEM in Utils that's now used in almost all locations instead of setting manually in each 09ad698 [Ilya Ganelin] Update SubmitRestProtocolSuite.scala 19b6f25 [Ilya Ganelin] Missed one doc update 2698a3d [Ilya Ganelin] Updated default value for driver memory
* [SPARK-8660] [MLLIB] removed > symbols from comments in ↵Rosstin2015-07-011-54/+63
| | | | | | | | | | | | | | | | | | | | | | LogisticRegressionSuite.scala for ease of copypaste '>' symbols removed from comments in LogisticRegressionSuite.scala, for ease of copypaste also single-lined the multiline commands (is this desirable, or does it violate style?) Author: Rosstin <asterazul@gmail.com> Closes #7167 from Rosstin/SPARK-8660-2 and squashes the following commits: f4b9bc8 [Rosstin] SPARK-8660 restored character limit on multiline comments in LogisticRegressionSuite.scala fe6b112 [Rosstin] SPARK-8660 > symbols removed from LogisticRegressionSuite.scala for easy of copypaste 39ddd50 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8661 5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code. bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660 242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala 2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
* [SPARK-6263] [MLLIB] Python MLlib API missing items: Utilslewuathe2015-07-011-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement missing API in pyspark. MLUtils * appendBias * loadVectors `kFold` is also missing however I am not sure `ClassTag` can be passed or restored through python. Author: lewuathe <lewuathe@me.com> Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits: 16863ea [lewuathe] Merge master 3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263 6084e9c [lewuathe] Resolv conflict d2aa2a0 [lewuathe] Resolv conflict 9c329d8 [lewuathe] Fix efficiency 3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263 1d4714b [lewuathe] Fix style b29e2bc [lewuathe] Remove scipy dependencies e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263 25d3c9d [lewuathe] Remove unnecessary imports 7ec04db [lewuathe] Resolv conflict 1502d13 [lewuathe] Resolv conflict d6bd416 [lewuathe] Check existence of scipy.sparse 5d555b1 [lewuathe] Construct scipy.sparse matrix c345a44 [lewuathe] Merge branch 'master' into SPARK-6263 b8b5ef7 [lewuathe] Fix unnecessary sort method d254be7 [lewuathe] Merge branch 'master' into SPARK-6263 62a9c7e [lewuathe] Fix appendBias return type 454c73d [lewuathe] Merge branch 'master' into SPARK-6263 a353354 [lewuathe] Remove unnecessary appendBias implementation 44295c2 [lewuathe] Merge branch 'master' into SPARK-6263 64f72ad [lewuathe] Merge branch 'master' into SPARK-6263 c728046 [lewuathe] Fix style 2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
* [SPARK-8471] [ML] Rename DiscreteCosineTransformer to DCTFeynman Liang2015-06-303-8/+8
| | | | | | | | | | | | | | | | | | | | Rename DiscreteCosineTransformer and related classes to DCT. Author: Feynman Liang <fliang@databricks.com> Closes #7138 from feynmanliang/dct-features and squashes the following commits: e547b3e [Feynman Liang] Fix renaming bug 9d5c9e4 [Feynman Liang] Lowercase JavaDCTSuite variable f9a8958 [Feynman Liang] Remove old files f8fe794 [Feynman Liang] Merge branch 'master' into dct-features 894d0b2 [Feynman Liang] Rename DiscreteCosineTransformer to DCT 433dbc7 [Feynman Liang] Test refactoring 91e9636 [Feynman Liang] Style guide and test helper refactor b5ac19c [Feynman Liang] Use Vector types, add Java test 530983a [Feynman Liang] Tests for other numeric datatypes 195d7aa [Feynman Liang] Implement support for arbitrary numeric types 95d4939 [Feynman Liang] Working DCT for 1D Doubles
* [SPARK-8563] [MLLIB] Fixed a bug so that ↵lee192015-06-302-1/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IndexedRowMatrix.computeSVD().U.numCols = k I'm sorry that I made https://github.com/apache/spark/pull/6949 closed by mistake. I pushed codes again. And, I added a test code. > There is a bug that `U.numCols() = self.nCols` in `IndexedRowMatrix.computeSVD()` It should have been `U.numCols() = k = svd.U.numCols()` > ``` self = U * sigma * V.transpose (m x n) = (m x n) * (k x k) * (k x n) //ASIS --> (m x n) = (m x k) * (k x k) * (k x n) //TOBE ``` Author: lee19 <lee19@live.co.kr> Closes #6953 from lee19/MLlibBugfix and squashes the following commits: c1812a0 [lee19] [SPARK-8563] [MLlib] Used nRows instead of numRows() to reduce a burden. 4b9803b [lee19] [SPARK-8563] [MLlib] Fixed a build error. c2ccd89 [lee19] Added a unit test that validates matrix sizes of svd for [SPARK-8563][MLlib] 8373424 [lee19] [SPARK-8563][MLlib] Fixed a bug so that IndexedRowMatrix.computeSVD().U.numCols = k
* [SPARK-8736] [ML] GBTRegressor should not threshold predictionJoseph K. Bradley2015-06-302-3/+23
| | | | | | | | | | | | Changed GBTRegressor so it does NOT threshold the prediction. Added test which fails with bug but works after fix. CC: feynmanliang mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7134 from jkbradley/gbrt-fix and squashes the following commits: 613b90e [Joseph K. Bradley] Changed GBTRegressor so it does NOT threshold the prediction
* [SPARK-7514] [MLLIB] Add MinMaxScaler to feature transformationYuhao Yang2015-06-302-0/+238
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-7514 Add a popular scaling method to feature component, which is commonly known as min-max normalization or Rescaling. Core function is, Normalized(x) = (x - min) / (max - min) * scale + newBase where `newBase` and `scale` are parameters (type Double) of the `VectorTransformer`. `newBase` is the new minimum number for the features, and `scale` controls the ranges after transformation. This is a little complicated than the basic MinMax normalization, yet it provides flexibility so that users can control the range more specifically. like [0.1, 0.9] in some NN application. For case that `max == min`, 0.5 is used as the raw value. (0.5 * scale + newBase) I'll add UT once the design got settled ( and this is not considered as too naive) reference: http://en.wikipedia.org/wiki/Feature_scaling http://stn.spotfire.com/spotfire_client_help/index.htm#norm/norm_scale_between_0_and_1.htm Author: Yuhao Yang <hhbyyh@gmail.com> Closes #6039 from hhbyyh/minMaxNorm and squashes the following commits: f942e9f [Yuhao Yang] add todo for metadata 8b37bbc [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 4894dbc [Yuhao Yang] add copy fa2989f [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 29db415 [Yuhao Yang] add clue and minor adjustment 5b8f7cc [Yuhao Yang] style fix 9b133d0 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 22f20f2 [Yuhao Yang] style change and bug fix 747c9bb [Yuhao Yang] add ut and remove mllib version a5ba0aa [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 585cc07 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 1c6dcb1 [Yuhao Yang] minor change 0f1bc80 [Yuhao Yang] add MinMaxScaler to ml 8e7436e [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 3663165 [Yuhao Yang] Merge remote-tracking branch 'upstream/master' into minMaxNorm 1247c27 [Yuhao Yang] some comments improvement d285a19 [Yuhao Yang] initial checkin for minMaxNorm
* [SPARK-8471] [ML] Discrete Cosine Transform Feature TransformerFeynman Liang2015-06-303-0/+223
| | | | | | | | | | | | | | | Implementation and tests for Discrete Cosine Transformer. Author: Feynman Liang <fliang@databricks.com> Closes #6894 from feynmanliang/dct-features and squashes the following commits: 433dbc7 [Feynman Liang] Test refactoring 91e9636 [Feynman Liang] Style guide and test helper refactor b5ac19c [Feynman Liang] Use Vector types, add Java test 530983a [Feynman Liang] Tests for other numeric datatypes 195d7aa [Feynman Liang] Implement support for arbitrary numeric types 95d4939 [Feynman Liang] Working DCT for 1D Doubles
* [SPARK-8664] [ML] Add PCA transformerYanbo Liang2015-06-303-1/+195
| | | | | | | | | | | Add PCA transformer for ML pipeline Author: Yanbo Liang <ybliang8@gmail.com> Closes #7065 from yanboliang/spark-8664 and squashes the following commits: 4afae45 [Yanbo Liang] address comments e9effd7 [Yanbo Liang] Add PCA transformer
* [SPARK-8661][ML] for LinearRegressionSuite.scala, changed javadoc-style ↵Rosstin2015-06-291-96/+96
| | | | | | | | | | | | | | | | | comments to regular multiline comments, to make copy-pasting R code more simple for mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments, to make copy-pasting R code more simple Author: Rosstin <asterazul@gmail.com> Closes #7098 from Rosstin/SPARK-8661 and squashes the following commits: 5a05dee [Rosstin] SPARK-8661 for LinearRegressionSuite.scala, changed javadoc-style comments to regular multiline comments to make it easier to copy-paste the R code. bb9a4b1 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8660 242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala 2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
* [SPARK-8660][ML] Convert JavaDoc style comments ↵Rosstin2015-06-291-171/+171
| | | | | | | | | | | | | | | inLogisticRegressionSuite.scala to regular multiline comments, to make copy-pasting R commands easier Converted JavaDoc style comments in mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala to regular multiline comments, to make copy-pasting R commands easier. Author: Rosstin <asterazul@gmail.com> Closes #7096 from Rosstin/SPARK-8660 and squashes the following commits: 242aedd [Rosstin] SPARK-8660, changed comment style from JavaDoc style to normal multiline comment in order to make copypaste into R easier, in file classification/LogisticRegressionSuite.scala 2cd2985 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 21ac1e5 [Rosstin] Merge branch 'master' of github.com:apache/spark into SPARK-8639 6c18058 [Rosstin] fixed minor typos in docs/README.md and docs/api.md
* [SPARK-8575] [SQL] Deprecate callUDF in favor of udfBenFradet2015-06-285-32/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | Follow up of [SPARK-8356](https://issues.apache.org/jira/browse/SPARK-8356) and #6902. Removes the unit test for the now deprecated ```callUdf``` Unit test in SQLQuerySuite now uses ```udf``` instead of ```callUDF``` Replaced ```callUDF``` by ```udf``` where possible in mllib Author: BenFradet <benjamin.fradet@gmail.com> Closes #6993 from BenFradet/SPARK-8575 and squashes the following commits: 26f5a7a [BenFradet] 2 spaces instead of 1 1ddb452 [BenFradet] renamed initUDF in order to be consistent in OneVsRest 48ca15e [BenFradet] used vector type tag for udf call in VectorIndexer 0ebd0da [BenFradet] replace the now deprecated callUDF by udf in VectorIndexer 8013409 [BenFradet] replaced the now deprecated callUDF by udf in Predictor 94345b5 [BenFradet] unifomized udf calls in ProbabilisticClassifier 1305492 [BenFradet] uniformized udf calls in Classifier a672228 [BenFradet] uniformized udf calls in OneVsRest 49e4904 [BenFradet] Revert "removal of the unit test for the now deprecated callUdf" bbdeaf3 [BenFradet] fixed syntax for init udf in OneVsRest fe2a10b [BenFradet] callUDF => udf in ProbabilisticClassifier 0ea30b3 [BenFradet] callUDF => udf in Classifier where possible 197ec82 [BenFradet] callUDF => udf in OneVsRest 84d6780 [BenFradet] modified unit test in SQLQuerySuite to use udf instead of callUDF 477709f [BenFradet] removal of the unit test for the now deprecated callUdf
* [SPARK-5962] [MLLIB] Python support for Power Iteration ClusteringYanbo Liang2015-06-282-0/+59
| | | | | | | | | | | | Python support for Power Iteration Clustering https://issues.apache.org/jira/browse/SPARK-5962 Author: Yanbo Liang <ybliang8@gmail.com> Closes #6992 from yanboliang/pyspark-pic and squashes the following commits: 6b03d82 [Yanbo Liang] address comments 4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
* [SPARK-7212] [MLLIB] Add sequence learning flagFeynman Liang2015-06-282-10/+80
| | | | | | | | | | | | | | | | Support mining of ordered frequent item sequences. Author: Feynman Liang <fliang@databricks.com> Closes #6997 from feynmanliang/fp-sequence and squashes the following commits: 7c14e15 [Feynman Liang] Improve scalatests with R code and Seq 0d3e4b6 [Feynman Liang] Fix python test ce987cb [Feynman Liang] Backwards compatibility aux constructor 34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq 648d4d4 [Feynman Liang] Test case for frequent item sequences 252a36a [Feynman Liang] Add sequence learning flag
* [SPARK-8613] [ML] [TRIVIAL] add param to disable linear feature scalingHolden Karau2015-06-262-0/+20
| | | | | | | | | | | | | Add a param to disable linear feature scaling (to be implemented later in linear & logistic regression). Done as a seperate PR so we can use same param & not conflict while working on the sub-tasks. Author: Holden Karau <holden@pigscanfly.ca> Closes #7024 from holdenk/SPARK-8522-Disable-Linear_featureScaling-Spark-8613-Add-param and squashes the following commits: ce8931a [Holden Karau] Regenerate the sharedParams code fa6427e [Holden Karau] update text for standardization param. 7b24a2b [Holden Karau] generate the new standardization param 3c190af [Holden Karau] Add the standardization param to sharedparamscodegen
* [MINOR] [MLLIB] rename some functions of PythonMLLibAPIYanbo Liang2015-06-251-3/+3
| | | | | | | | | | | | | | | | | | | | | | | Keep the same naming conventions for PythonMLLibAPI. Only the following three functions is different from others ```scala trainNaiveBayes trainGaussianMixture trainWord2Vec ``` So change them to ```scala trainNaiveBayesModel trainGaussianMixtureModel trainWord2VecModel ``` It does not affect any users and public APIs, only to make better understand for developer and code hacker. Author: Yanbo Liang <ybliang8@gmail.com> Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits: 771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
* [SPARK-8525] [MLLIB] fix LabeledPoint parser when there is a whitespace ↵Oleksiy Dyagilev2015-06-233-0/+14
| | | | | | | | | | | | | | between label and features vector fix LabeledPoint parser when there is a whitespace between label and features vector, e.g. (y, [x1, x2, x3]) Author: Oleksiy Dyagilev <oleksiy_dyagilev@epam.com> Closes #6954 from fe2s/SPARK-8525 and squashes the following commits: 0755b9d [Oleksiy Dyagilev] [SPARK-8525][MLLIB] addressing comment, removing dep on commons-lang c1abc2b [Oleksiy Dyagilev] [SPARK-8525][MLLIB] fix LabeledPoint parser when there is a whitespace on specific position
* [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utilsMechCoder2015-06-231-1/+31
| | | | | | | | | | | | It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6715 from MechCoder/generate_linear_input and squashes the following commits: 6182884 [MechCoder] Minor changes 8bda047 [MechCoder] Minor style fixes 0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
* [SPARK-7888] Be able to disable intercept in linear regression in ml packageHolden Karau2015-06-232-12/+167
| | | | | | | | | | | | | | | | | | | Author: Holden Karau <holden@pigscanfly.ca> Closes #6927 from holdenk/SPARK-7888-Be-able-to-disable-intercept-in-Linear-Regression-in-ML-package and squashes the following commits: 0ad384c [Holden Karau] Add MiMa excludes 4016fac [Holden Karau] Switch to wild card import, remove extra blank lines ae5baa8 [Holden Karau] CR feedback, move the fitIntercept down rather than changing ymean and etc above f34971c [Holden Karau] Fix some more long lines 319bd3f [Holden Karau] Fix long lines 3bb9ee1 [Holden Karau] Update the regression suite tests 7015b9f [Holden Karau] Our code performs the same with R, except we need more than one data point but that seems reasonable 0b0c8c0 [Holden Karau] fix the issue with the sample R code e2140ba [Holden Karau] Add a test, it fails! 5e84a0b [Holden Karau] Write out thoughts and use the correct trait 91ffc0a [Holden Karau] more murh 006246c [Holden Karau] murp?
* [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max binsHolden Karau2015-06-221-1/+3
| | | | | | | | | | | Author: Holden Karau <holden@pigscanfly.ca> Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits: 2894695 [Holden Karau] remove extra blank line 2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too 3a09170 [Holden Karau] add maxBins to to the train method as well af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
* [SPARK-8455] [ML] Implement n-gram feature transformerFeynman Liang2015-06-222-0/+163
| | | | | | | | | | | | Implementation of n-gram feature transformer for ML. Author: Feynman Liang <fliang@databricks.com> Closes #6887 from feynmanliang/ngram-featurizer and squashes the following commits: d2c839f [Feynman Liang] Make n > input length yield empty output 9fadd36 [Feynman Liang] Add empty and corner test cases, fix names and spaces fe93873 [Feynman Liang] Implement n-gram feature transformer
* [SPARK-7426] [MLLIB] [ML] Updated Attribute.fromStructField to allow any ↵Mike Dusenberry2015-06-212-2/+7
| | | | | | | | | | | | NumericType. Updated `Attribute.fromStructField` to allow any `NumericType`, rather than just `DoubleType`, and added unit tests for a few of the other NumericTypes. Author: Mike Dusenberry <dusenberrymw@gmail.com> Closes #6540 from dusenberrymw/SPARK-7426_AttributeFactory.fromStructField_Should_Allow_NumericTypes and squashes the following commits: 87fecb3 [Mike Dusenberry] Updated Attribute.fromStructField to allow any NumericType, rather than just DoubleType, and added unit tests for a few of the other NumericTypes.
* [SPARK-7604] [MLLIB] Python API for PCA and PCAModelYanbo Liang2015-06-211-0/+10
| | | | | | | | | | | Python API for PCA and PCAModel Author: Yanbo Liang <ybliang8@gmail.com> Closes #6315 from yanboliang/spark-7604 and squashes the following commits: 1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior 4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
* [SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator ↵Liang-Chi Hsieh2015-06-204-8/+43
| | | | | | | | | | | | | | | | to get correct cross validation JIRA: https://issues.apache.org/jira/browse/SPARK-8468 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6905 from viirya/cv_min and squashes the following commits: 930d3db [Liang-Chi Hsieh] Fix python unit test and add document. d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min 16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal. c3dd8d9 [Liang-Chi Hsieh] For comments. b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value.
* [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeansMechCoder2015-06-191-0/+15
| | | | | | | | | | | | | | | | | | | | | | Python bindings for StreamingKMeans Will change status to MRG once docs, tests and examples are updated. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6499 from MechCoder/spark-4118 and squashes the following commits: 7722d16 [MechCoder] minor style fixes 51052d3 [MechCoder] Doc fixes 2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes 81482fd [MechCoder] minor 5d9fe61 [MechCoder] predictOn should take into account the latest model 8ab9e89 [MechCoder] Fix Python3 error a9817df [MechCoder] Better tests and minor fixes c80e451 [MechCoder] Add ignore_unicode_prefix ee8ce16 [MechCoder] Update tests, doc and examples 4b1481f [MechCoder] Some changes and tests d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
* [SPARK-8151] [MLLIB] pipeline components should correctly implement copyXiangrui Meng2015-06-1960-55/+343
| | | | | | | | | | | | | | | | | | Otherwise, extra params get ignored in `PipelineModel.transform`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6622 from mengxr/SPARK-8087 and squashes the following commits: 0e4c8c4 [Xiangrui Meng] fix merge issues 26fc1f0 [Xiangrui Meng] address comments e607a04 [Xiangrui Meng] merge master b85b57e [Xiangrui Meng] fix examples/compile d6f7891 [Xiangrui Meng] rename defaultCopyWithParams to defaultCopy 84ec278 [Xiangrui Meng] remove setter checks due to generics 2cf2ed0 [Xiangrui Meng] snapshot 291814f [Xiangrui Meng] OneVsRest.copy 1dfe3bd [Xiangrui Meng] PipelineModel.copy should copy stages
* [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProductMechCoder2015-06-171-0/+8
| | | | | | | | | | | Python API for org.apache.spark.mllib.feature.ElementwiseProduct Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6346 from MechCoder/spark-7605 and squashes the following commits: 79d1ef5 [MechCoder] Consistent and support list / array types 5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
* [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySparkMechCoder2015-06-171-0/+2
| | | | | | | | | | | | | MatrixUDT was recently coded in scala. This has been ported to PySpark Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6354 from MechCoder/spark-6390 and squashes the following commits: fc4dc1e [MechCoder] Better error message c940a44 [MechCoder] Added test aa9c391 [MechCoder] Add pyUDT to MatrixUDT 62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
* [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and ↵Yanbo Liang2015-06-161-1/+1
| | | | | | | | | | | | | | | regression Check then make the MLlib Python classification and regression doc to be as complete as the Scala doc. Author: Yanbo Liang <ybliang8@gmail.com> Closes #6460 from yanboliang/spark-7916 and squashes the following commits: f8deda4 [Yanbo Liang] trigger jenkins 6dc4d99 [Yanbo Liang] address comments ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse 3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
* [SPARK-8314][MLlib] improvement in performance of MLUtils.appendBiasRoger Menezes2015-06-122-7/+23
| | | | | | | | | | | | | | | | | | | | | MLUtils.appendBias method is heavily used in creating intercepts for linear models. This method uses Breeze's vector concatenation which is very slow compared to the plain System.arrayCopy. This improvement is to change the implementation to use System.arrayCopy. I saw the following performance improvements after the change: Benchmark with mnist dataset for 50 times: MLUtils.appendBias (SparseVector Before): 47320 ms MLUtils.appendBias (SparseVector After): 1935 ms MLUtils.appendBias (DenseVector Before): 5340 ms MLUtils.appendBias (DenseVector After): 4080 ms This is almost a 24 times performance boost for SparseVectors. Author: Roger Menezes <rmenezes@netflix.com> Closes #6768 from rogermenezes/improve-append-bias and squashes the following commits: 4e42f75 [Roger Menezes] address feedback e999d79 [Roger Menezes] first commit
* [SPARK-8200] [MLLIB] Check for empty RDDs in StreamingLinearAlgorithmPaavo2015-06-103-6/+43
| | | | | | | | | | | | | | | | | Test cases for both StreamingLinearRegression and StreamingLogisticRegression, and code fix. Edit: This contribution is my original work and I license the work to the project under the project's open source license. Author: Paavo <pparkkin@gmail.com> Closes #6713 from pparkkin/streamingmodel-empty-rdd and squashes the following commits: ff5cd78 [Paavo] Update strings to use interpolation. db234cf [Paavo] Use !rdd.isEmpty. 54ad89e [Paavo] Test case for empty stream. 393e36f [Paavo] Ignore empty RDDs. 0bfc365 [Paavo] Test case for empty stream.
* [SPARK-8140] [MLLIB] Remove construct to get weights in StreamingLinearAlgorithmMechCoder2015-06-091-6/+1
| | | | | | | | Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6720 from MechCoder/empty_model_check and squashes the following commits: 3a07de5 [MechCoder] Remove construct to get weights in StreamingLinearAlgorithm
* [SPARK-8168] [MLLIB] Add Python friendly constructor to PipelineModelXiangrui Meng2015-06-082-0/+25
| | | | | | | | | | This makes the constructor callable in Python. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #6709 from mengxr/SPARK-8168 and squashes the following commits: f871de4 [Xiangrui Meng] Add Python friendly constructor to PipelineModel
* [SPARK-8140] [MLLIB] Remove empty model check in StreamingLinearAlgorithmMechCoder2015-06-084-8/+5
| | | | | | | | | | | 1. Prevent creating a map of data to find numFeatures 2. If model is empty, then initialize with a zero vector of numFeature Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6684 from MechCoder/spark-8140 and squashes the following commits: 7fbf5f9 [MechCoder] [SPARK-8140] Remove empty model check in StreamingLinearAlgorithm And other minor cosmits
* [SPARK-7639] [PYSPARK] [MLLIB] Python API for KernelDensityMechCoder2015-06-061-1/+11
| | | | | | | | | | | | | Python API for KernelDensity Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6387 from MechCoder/spark-7639 and squashes the following commits: 17abc62 [MechCoder] add tests 2de6540 [MechCoder] style tests bf4acc0 [MechCoder] Added doctests 84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
* [SPARK-6164] [ML] CrossValidatorModel should keep stats from fittingleahmcguire2015-06-032-3/+8
| | | | | | | | | | | | | | | | | Added stats from cross validation as a val in the cross validation model to save them for user access. Author: leahmcguire <lmcguire@salesforce.com> Closes #5915 from leahmcguire/saveCVmetrics and squashes the following commits: 49b507b [leahmcguire] fixed tyle error 67537b1 [leahmcguire] rebased 85907f0 [leahmcguire] fixed name 59987cc [leahmcguire] changed param name and test according to comments 36e71e3 [leahmcguire] rebasing 4b8223e [leahmcguire] fixed name 4ddffc6 [leahmcguire] changed param name and test according to comments 3a995da [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
* [SPARK-8051] [MLLIB] make StringIndexerModel silent if input column does not ↵Xiangrui Meng2015-06-032-1/+23
| | | | | | | | | | | | | | | exist This is just a workaround to a bigger problem. Some pipeline stages may not be effective during prediction, and they should not complain about missing required columns, e.g. `StringIndexerModel`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #6595 from mengxr/SPARK-8051 and squashes the following commits: b6a36b9 [Xiangrui Meng] add doc f143fd4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-8051 8ee7c7e [Xiangrui Meng] use SparkFunSuite e112394 [Xiangrui Meng] make StringIndexerModel silent if input column does not exist
* [SPARK-8054] [MLLIB] Added several Java-friendly APIs + unit testsJoseph K. Bradley2015-06-0313-19/+284
| | | | | | | | | | | | | | | | | | | | | | | | | | Java-friendly APIs added: * GaussianMixture.run() * GaussianMixtureModel.predict() * DistributedLDAModel.javaTopicDistributions() * StreamingKMeans: trainOn, predictOn, predictOnValues * Statistics.corr * params * added doc to w() since Java docs do not inherit doc * removed non-Java-friendly w() from StringArrayParam and DoubleArrayParam * made DoubleArrayParam Java-friendly w() actually Java-friendly I generated the doc and verified all changes. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6562 from jkbradley/java-api-1.4 and squashes the following commits: c16821b [Joseph K. Bradley] Small fixes based on code review. d955581 [Joseph K. Bradley] unit test fixes 29b6b0d [Joseph K. Bradley] small fixes fe6dcfe [Joseph K. Bradley] Added several Java-friendly APIs + unit tests: NaiveBayes, GaussianMixture, LDA, StreamingKMeans, Statistics.corr, params
* [SPARK-7983] [MLLIB] Add require for one-based indices in loadLibSVMFileYuhao Yang2015-06-032-0/+47
| | | | | | | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-7983 Customers frequently use zero-based indices in their LIBSVM files. No warnings or errors from Spark will be reported during their computation afterwards, and usually it will lead to wired result for many algorithms (like GBDT). add a quick check. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #6538 from hhbyyh/loadSVM and squashes the following commits: 79d9c11 [Yuhao Yang] optimization as respond to comments 4310710 [Yuhao Yang] merge conflict 96460f1 [Yuhao Yang] merge conflict 20a2811 [Yuhao Yang] use require 6e4f8ca [Yuhao Yang] add check for ascending order 9956365 [Yuhao Yang] add ut for 0-based loadlibsvm exception 5bd1f9a [Yuhao Yang] add require for one-based in loadLIBSVM
* [SPARK-8053] [MLLIB] renamed scalingVector to scalingVecJoseph K. Bradley2015-06-022-8/+8
| | | | | | | | | | | | I searched the Spark codebase for all occurrences of "scalingVector" CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #6596 from jkbradley/scalingVec-rename and squashes the following commits: d3812f8 [Joseph K. Bradley] renamed scalingVector to scalingVec
* [SPARK-7691] [SQL] Refactor CatalystTypeConverter to use type-specific row ↵Josh Rosen2015-06-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | accessors This patch significantly refactors CatalystTypeConverters to both clean up the code and enable these conversions to work with future Project Tungsten features. At a high level, I've reorganized the code so that all functions dealing with the same type are grouped together into type-specific subclasses of `CatalystTypeConveter`. In addition, I've added new methods that allow the Catalyst Row -> Scala Row conversions to access the Catalyst row's fields through type-specific `getTYPE()` methods rather than the generic `get()` / `Row.apply` methods. This refactoring is a blocker to being able to unit test new operators that I'm developing as part of Project Tungsten, since those operators may output `UnsafeRow` instances which don't support the generic `get()`. The stricter type usage of types here has uncovered some bugs in other parts of Spark SQL: - #6217: DescribeCommand is assigned wrong output attributes in SparkStrategies - #6218: DataFrame.describe() should cast all aggregates to String - #6400: Use output schema, not relation schema, for data source input conversion Spark SQL current has undefined behavior for what happens when you try to create a DataFrame from user-specified rows whose values don't match the declared schema. According to the `createDataFrame()` Scaladoc: > It is important to make sure that the structure of every [[Row]] of the provided RDD matches the provided schema. Otherwise, there will be runtime exception. Given this, it sounds like it's technically not a break of our API contract to fail-fast when the data types don't match. However, there appear to be many cases where we don't fail even though the types don't match. For example, `JavaHashingTFSuite.hasingTF` passes a column of integers values for a "label" column which is supposed to contain floats. This column isn't actually read or modified as part of query processing, so its actual concrete type doesn't seem to matter. In other cases, there could be situations where we have generic numeric aggregates that tolerate being called with different numeric types than the schema specified, but this can be okay due to numeric conversions. In the long run, we will probably want to come up with precise semantics for implicit type conversions / widening when converting Java / Scala rows to Catalyst rows. Until then, though, I think that failing fast with a ClassCastException is a reasonable behavior; this is the approach taken in this patch. Note that certain optimizations in the inbound conversion functions for primitive types mean that we'll probably preserve the old undefined behavior in a majority of cases. Author: Josh Rosen <joshrosen@databricks.com> Closes #6222 from JoshRosen/catalyst-converters-refactoring and squashes the following commits: 740341b [Josh Rosen] Optimize method dispatch for primitive type conversions befc613 [Josh Rosen] Add tests to document Option-handling behavior. 5989593 [Josh Rosen] Use new SparkFunSuite base in CatalystTypeConvertersSuite 6edf7f8 [Josh Rosen] Re-add convertToScala(), since a Hive test still needs it 3f7b2d8 [Josh Rosen] Initialize converters lazily so that the attributes are resolved first 6ad0ebb [Josh Rosen] Fix JavaHashingTFSuite ClassCastException 677ff27 [Josh Rosen] Fix null handling bug; add tests. 8033d4c [Josh Rosen] Fix serialization error in UserDefinedGenerator. 85bba9d [Josh Rosen] Fix wrong input data in InMemoryColumnarQuerySuite 9c0e4e1 [Josh Rosen] Remove last use of convertToScala(). ae3278d [Josh Rosen] Throw ClassCastException errors during inbound conversions. 7ca7fcb [Josh Rosen] Comments and cleanup 1e87a45 [Josh Rosen] WIP refactoring of CatalystTypeConverters
* [SPARK-7547] [ML] Scala Example code for ElasticNetDB Tsai2015-06-025-9/+13
| | | | | | | | | | | | | | This is scala example code for both linear and logistic regression. Python and Java versions are to be added. Author: DB Tsai <dbt@netflix.com> Closes #6576 from dbtsai/elasticNetExample and squashes the following commits: e7ca406 [DB Tsai] fix test 6bb6d77 [DB Tsai] fix suite and remove duplicated setMaxIter 136e0dd [DB Tsai] address feedback 1ec29d4 [DB Tsai] fix style 9462f5f [DB Tsai] add example
* [SPARK-8049] [MLLIB] drop tmp col from OneVsRest outputXiangrui Meng2015-06-022-0/+10
| | | | | | | | | | | The temporary column should be dropped after we get the prediction column. harsha2010 Author: Xiangrui Meng <meng@databricks.com> Closes #6592 from mengxr/SPARK-8049 and squashes the following commits: 1d89107 [Xiangrui Meng] use SparkFunSuite 6ee70de [Xiangrui Meng] drop tmp col from OneVsRest output