aboutsummaryrefslogtreecommitdiff
path: root/mllib
Commit message (Collapse)AuthorAgeFilesLines
* SPARK-1791 - SVM implementation does not use threshold parameterAndrew Tulloch2014-05-132-1/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | Summary: https://issues.apache.org/jira/browse/SPARK-1791 Simple fix, and backward compatible, since - anyone who set the threshold was getting completely wrong answers. - anyone who did not set the threshold had the default 0.0 value for the threshold anyway. Test Plan: Unit test added that is verified to fail under the old implementation, and pass under the new implementation. Reviewers: CC: Author: Andrew Tulloch <andrew@tullo.ch> Closes #725 from ajtulloch/SPARK-1791-SVM and squashes the following commits: 770f55d [Andrew Tulloch] SPARK-1791 - SVM implementation does not use threshold parameter (cherry picked from commit d1e487473fd509f28daf28dcda856f3c2f1194ec) Signed-off-by: Reynold Xin <rxin@apache.org>
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-05-131-1/+1
|
* [maven-release-plugin] prepare release v1.0.0-rc5Patrick Wendell2014-05-131-1/+1
|
* Revert "[maven-release-plugin] prepare release v1.0.0-rc4"Patrick Wendell2014-05-121-1/+1
| | | | This reverts commit 3d0a44833ab50360bf9feccc861cb5e8c44a4866.
* Revert "[maven-release-plugin] prepare for next development iteration"Patrick Wendell2014-05-121-1/+1
| | | | This reverts commit 9772d85c6f3893d42044f4bab0e16f8b6287613a.
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-05-131-1/+1
|
* [maven-release-plugin] prepare release v1.0.0-rc4Patrick Wendell2014-05-131-1/+1
|
* Rollback versions for 1.0.0-rc4Patrick Wendell2014-05-121-1/+1
|
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-05-121-1/+1
|
* [maven-release-plugin] prepare release v1.0.0-rc4Patrick Wendell2014-05-121-1/+1
|
* SPARK-1798. Tests should clean up temp filesSean Owen2014-05-122-14/+5
| | | | | | | | | | | | | | | | | | | | | | | Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent. Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of former. The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules. Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method. _If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._ Author: Sean Owen <sowen@cloudera.com> Closes #732 from srowen/SPARK-1798 and squashes the following commits: 5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests (cherry picked from commit 7120a2979d0a9f0f54a88b2416be7ca10e74f409) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Bug fix of sparse vector conversionFunes2014-05-082-1/+14
| | | | | | | | | | | | | | | | | | Fixed a small bug caused by the inconsistency of index/data array size and vector length. Author: Funes <tianshaocun@gmail.com> Author: funes <tianshaocun@gmail.com> Closes #661 from funes/bugfix and squashes the following commits: edb2b9d [funes] remove unused import 75dced3 [Funes] update test case d129a66 [Funes] Add test for sparse breeze by vector builder 64e7198 [Funes] Copy data only when necessary b85806c [Funes] Bug fix of sparse vector conversion (cherry picked from commit 191279ce4edb940821d11a6b25cd33c8ad0af054) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1157][MLlib] Bug fix: lossHistory should exclude rejection steps, and ↵DB Tsai2014-05-082-48/+30
| | | | | | | | | | | | | | | | | remove miniBatch Getting the lossHistory from Breeze's API which already excludes the rejection steps in line search. Also, remove the miniBatch in LBFGS since those quasi-Newton methods approximate the inverse of Hessian. It doesn't make sense if the gradients are computed from a varying objective. Author: DB Tsai <dbtsai@alpinenow.com> Closes #582 from dbtsai/dbtsai-lbfgs-bug and squashes the following commits: 9cc6cf9 [DB Tsai] Removed the miniBatch in LBFGS. 1ba6a33 [DB Tsai] Formatting the code. d72c679 [DB Tsai] Using Breeze's states to get the loss. (cherry picked from commit 910a13b3c52a6309068b4997da6df6b7d6058a1b) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1544 Add support for deep decision trees.Manish Amde2014-05-073-23/+170
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | @etrain and I came with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels. To summarize: 1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver). 2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth. cc: @atalwalkar, @hirakendu, @mengxr Author: Manish Amde <manish9ue@gmail.com> Author: manishamde <manish9ue@gmail.com> Author: Evan Sparks <sparks@cs.berkeley.edu> Closes #475 from manishamde/deep_tree and squashes the following commits: 968ca9d [Manish Amde] merged master 7fc9545 [Manish Amde] added docs ce004a1 [Manish Amde] minor formatting b27ad2c [Manish Amde] formatting 426bb28 [Manish Amde] programming guide blurb 8053fed [Manish Amde] more formatting 5eca9e4 [Manish Amde] grammar 4731cda [Manish Amde] formatting 5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation cbd9f14 [Manish Amde] modified scala.math to math dad9652 [Manish Amde] removed unused imports e0426ee [Manish Amde] renamed parameter 718506b [Manish Amde] added unit test 1517155 [Manish Amde] updated documentation 9dbdabe [Manish Amde] merge from master 719d009 [Manish Amde] updating user documentation fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree 0287772 [Evan Sparks] Fixing scalastyle issue. 2f1e093 [Manish Amde] minor: added doc for maxMemory parameter 2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree abc5a23 [Evan Sparks] Parameterizing max memory. 50b143a [Manish Amde] adding support for very deep trees (cherry picked from commit f269b016acb17b24d106dc2b32a1be389489bb01) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Update GradientDescentSuite.scalabaishuo(白硕)2014-05-071-3/+3
| | | | | | | | | | | | | | | use more faster way to construct an array Author: baishuo(白硕) <vc_java@hotmail.com> Closes #588 from baishuo/master and squashes the following commits: 45b95fb [baishuo(白硕)] Update GradientDescentSuite.scala c03b61c [baishuo(白硕)] Update GradientDescentSuite.scala b666d27 [baishuo(白硕)] Update GradientDescentSuite.scala (cherry picked from commit 0c19bb161b9b2b96c0c55d3ea09e81fd798cbec0) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1727. Correct small compile errors, typos, and markdown issues in ↵Sean Owen2014-05-061-6/+6
| | | | | | | | | | | | | | | | | | | | | (primarly) MLlib docs While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs. Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown. Author: Sean Owen <sowen@cloudera.com> Closes #653 from srowen/SPARK-1727 and squashes the following commits: 6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count 8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output) 99966a9 [Sean Owen] Update issue tracker URL in docs 23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak) 8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs (cherry picked from commit 25ad8f93012730115a8a1fac649fe3e842c045b3) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guideXiangrui Meng2014-05-0534-320/+381
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Final pass before the v1.0 release. * Remove `VectorRDDs` * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation` * Change default value of `addIntercept` to false and allow to add intercept in Ridge and Lasso. * Clean `DecisionTree` package doc and test suite. * Mark model constructors `private[spark]` * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users. * Add `saveAsLibSVMFile`. * Add `appendBias` to `MLUtils`. Author: Xiangrui Meng <meng@databricks.com> Closes #524 from mengxr/mllib-cleaning and squashes the following commits: 295dc8b [Xiangrui Meng] update loadLibSVMFile doc 1977ac1 [Xiangrui Meng] fix doc of appendBias 649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs 54b812c [Xiangrui Meng] add appendBias a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib] 9b02b93 [Xiangrui Meng] minor code style update a593ddc [Xiangrui Meng] fix python tests fc28c18 [Xiangrui Meng] mark more classes experimental f6cbbff [Xiangrui Meng] fix Java tests 0af70b0 [Xiangrui Meng] minor 6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning 94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext c81807f [Xiangrui Meng] set the default value of AddIntercept to false 03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso c66c56f [Xiangrui Meng] move tree md to package object doc a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics 9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up 1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain (cherry picked from commit 98750a74daf7e2b873da85d2d5067f47e3bbdc4e) Signed-off-by: Matei Zaharia <matei@databricks.com>
* [SPARK-1646] Micro-optimisation of ALSTor Myklebust2014-04-291-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This change replaces some Scala `for` and `foreach` constructs with `while` constructs. There may be a slight performance gain on the order of 1-2% when training an ALS model. I trained an ALS model on the Movielens 10M-rating dataset repeatedly both with and without these changes. All 7 runs in both columns were done in a Scala `for` loop like this: for (iter <- 0 to 10) { val before = System.currentTimeMillis() val model = ALS.train(rats, 20, 10) val after = System.currentTimeMillis() println("%d ms".format(after-before)) println("rmse %g".format(computeRmse(model, rats, numRatings))) } The timings were done on a multiuser machine, and I stopped one set of timings after 7 had been completed. It would be nice if somebody with dedicated hardware could confirm my timings. After Before 121980 ms 122041 ms 117069 ms 117127 ms 115332 ms 117523 ms 115381 ms 117402 ms 114635 ms 116550 ms 114140 ms 114076 ms 112993 ms 117200 ms Ratios are about 1.0005, 1.0005, 1.019, 1.0175, 1.01671, 0.99944, and 1.03723. I therefore suspect these changes make for a slight performance gain on the order of 1-2%. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #568 from tmyklebu/alsopt and squashes the following commits: 5ded80f [Tor Myklebust] Fix style. 79595ff [Tor Myklebust] Fix style error. 4ef0313 [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsopt 114fb74 [Tor Myklebust] Turn some 'for' loops into 'while' loops. dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go. 23d6f91 [Tor Myklebust] Stop making the partitioner configurable. 495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark 674933a [Tor Myklebust] Fix style. 40edc23 [Tor Myklebust] Fix missing space. f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach. 5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'. 36a0f43 [Tor Myklebust] Make the partitioner private. d872b09 [Tor Myklebust] Add negative id ALS test. df27697 [Tor Myklebust] Support custom partitioners. Currently we use the same partitioner for users and products. c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing. c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly. (cherry picked from commit 5c0cd5c1a594c181a3f7536639122ab7d97b271b) Signed-off-by: Reynold Xin <rxin@apache.org>
* [SPARK-1636][MLLIB] Move main methods to examplesXiangrui Meng2014-04-299-304/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * `NaiveBayes` -> `SparseNaiveBayes` * `KMeans` -> `DenseKMeans` * `SVMWithSGD` and `LogisticRegerssionWithSGD` -> `BinaryClassification` * `ALS` -> `MovieLensALS` * `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression` * `DecisionTree` -> `DecisionTreeRunner` `scopt` is used for parsing command-line parameters. `scopt` has MIT license and it only depends on `scala-library`. Example help message: ~~~ BinaryClassification: an example app for binary classification. Usage: BinaryClassification [options] <input> --numIterations <value> number of iterations --stepSize <value> initial step size, default: 1.0 --algorithm <value> algorithm (SVM,LR), default: LR --regType <value> regularization type (L1,L2), default: L2 --regParam <value> regularization parameter, default: 0.1 <input> input paths to labeled examples in LIBSVM format ~~~ Author: Xiangrui Meng <meng@databricks.com> Closes #584 from mengxr/mllib-main and squashes the following commits: 7b58c60 [Xiangrui Meng] minor 6e35d7e [Xiangrui Meng] make imports explicit and fix code style c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit 6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner be86069 [Xiangrui Meng] use main instead of extending App b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples 8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example 67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params 577945b [Xiangrui Meng] remove unused imports from NB 3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app 01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main 9420692 [Xiangrui Meng] add scopt to examples dependencies (cherry picked from commit 3f38334f441940ed0a5bbf5588ca7f22d3940359) Signed-off-by: Reynold Xin <rxin@apache.org>
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-04-291-1/+1
|
* [maven-release-plugin] prepare release v1.0.0-rc3Patrick Wendell2014-04-291-1/+1
|
* Manual revert of rc2 version changes.Patrick Wendell2014-04-281-1/+1
|
* Improved build configurationwitgo2014-04-281-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1, Fix SPARK-1441: compile spark core error with hadoop 0.23.x 2, Fix SPARK-1491: maven hadoop-provided profile fails to build 3, Fix org.scala-lang: * ,org.apache.avro:* inconsistent versions dependency 4, A modified on the sql/catalyst/pom.xml,sql/hive/pom.xml,sql/core/pom.xml (Four spaces formatted into two spaces) Author: witgo <witgo@qq.com> Closes #480 from witgo/format_pom and squashes the following commits: 03f652f [witgo] review commit b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence 7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence 0da4bc3 [witgo] merge master d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom e345919 [witgo] add avro dependency to yarn-alpha 77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency 1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom 934f24d [witgo] review commit cf46edc [witgo] exclude jruby 06e7328 [witgo] Merge branch 'SparkBuild' into format_pom 99464d2 [witgo] fix maven hadoop-provided profile fails to build 0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x 6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml (cherry picked from commit 030f2c2126d5075576cd6d83a1ee7462c48b953b) Conflicts: sql/catalyst/pom.xml sql/core/pom.xml sql/hive/pom.xml
* [Fix #79] Replace Breakable For Loops By While LoopsSandeep2014-04-231-29/+31
| | | | | | | | | | | | | Author: Sandeep <sandeep@techaddict.me> Closes #503 from techaddict/fix-79 and squashes the following commits: e3f6746 [Sandeep] Style changes 07a4f6b [Sandeep] for loop to While loop 0a6d8e9 [Sandeep] Breakable for loop to While loop (cherry picked from commit bb68f47745eec2954814d3da277a672d5cf89980) Signed-off-by: Reynold Xin <rxin@apache.org>
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-04-241-1/+1
|
* [maven-release-plugin] prepare release v1.0.0-rc2Patrick Wendell2014-04-241-1/+1
|
* [Fix #274] Document + fix annotation usagesAndrew Or2014-04-222-6/+3
| | | | | | | | | | | | | | | | | | ... so that we don't follow an unspoken set of forbidden rules for adding **@AlphaComponent**, **@DeveloperApi**, and **@Experimental** annotations in the code. In addition, this PR (1) removes unnecessary `:: * ::` tags, (2) adds missing `:: * ::` tags, and (3) removes annotations for internal APIs. Author: Andrew Or <andrewor14@gmail.com> Closes #470 from andrewor14/annotations-fix and squashes the following commits: 92a7f42 [Andrew Or] Document + fix annotation usages (cherry picked from commit b3e5366f696c463f1c2f033b0d5c7365e5d6b0f8) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1281] Improve partitioning in ALSTor Myklebust2014-04-222-23/+54
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ALS was using HashPartitioner and explicit uses of `%` together. Further, the naked use of `%` meant that, if the number of partitions corresponded with the stride of arithmetic progressions appearing in user and product ids, users and products could be mapped into buckets in an unfair or unwise way. This pull request: 1) Makes the Partitioner an instance variable of ALS. 2) Replaces the direct uses of `%` with calls to a Partitioner. 3) Defines an anonymous Partitioner that scrambles the bits of the object's hashCode before reducing to the number of present buckets. This pull request does not make the partitioner user-configurable. I'm not all that happy about the way I did (1). It introduces an icky lifetime issue and dances around it by nulling something. However, I don't know a better way to make the partitioner visible everywhere it needs to be visible. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #407 from tmyklebu/master and squashes the following commits: dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go. 23d6f91 [Tor Myklebust] Stop making the partitioner configurable. 495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark 674933a [Tor Myklebust] Fix style. 40edc23 [Tor Myklebust] Fix missing space. f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach. 5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'. 36a0f43 [Tor Myklebust] Make the partitioner private. d872b09 [Tor Myklebust] Add negative id ALS test. df27697 [Tor Myklebust] Support custom partitioners. Currently we use the same partitioner for users and products. c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing. c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly. (cherry picked from commit bf9d49b6d1f668b49795c2d380ab7d64ec0029da) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1506][MLLIB] Documentation improvements for MLlib 1.0Xiangrui Meng2014-04-221-0/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Preview: http://54.82.240.23:4000/mllib-guide.html Table of contents: * Basics * Data types * Summary statistics * Classification and regression * linear support vector machine (SVM) * logistic regression * linear linear squares, Lasso, and ridge regression * decision tree * naive Bayes * Collaborative Filtering * alternating least squares (ALS) * Clustering * k-means * Dimensionality reduction * singular value decomposition (SVD) * principal component analysis (PCA) * Optimization * stochastic gradient descent * limited-memory BFGS (L-BFGS) Author: Xiangrui Meng <meng@databricks.com> Closes #422 from mengxr/mllib-doc and squashes the following commits: 944e3a9 [Xiangrui Meng] merge master f9fda28 [Xiangrui Meng] minor 9474065 [Xiangrui Meng] add alpha to ALS examples 928e630 [Xiangrui Meng] initialization_mode -> initializationMode 5bbff49 [Xiangrui Meng] add imports to labeled point examples c17440d [Xiangrui Meng] fix python nb example 28f40dc [Xiangrui Meng] remove localhost:4000 369a4d3 [Xiangrui Meng] Merge branch 'master' into mllib-doc 7dc95cc [Xiangrui Meng] update linear methods 053ad8a [Xiangrui Meng] add links to go back to the main page abbbf7e [Xiangrui Meng] update ALS argument names 648283e [Xiangrui Meng] level down statistics 14e2287 [Xiangrui Meng] add sample libsvm data and use it in guide 8cd2441 [Xiangrui Meng] minor updates 186ab07 [Xiangrui Meng] update section names 6568d65 [Xiangrui Meng] update toc, level up lr and svm 162ee12 [Xiangrui Meng] rename section names 5c1e1b1 [Xiangrui Meng] minor 8aeaba1 [Xiangrui Meng] wrap long lines 6ce6a6f [Xiangrui Meng] add summary statistics to toc 5760045 [Xiangrui Meng] claim beta cc604bf [Xiangrui Meng] remove classification and regression 92747b3 [Xiangrui Meng] make section titles consistent e605dd6 [Xiangrui Meng] add LIBSVM loader f639674 [Xiangrui Meng] add python section to migration guide c82ffb4 [Xiangrui Meng] clean optimization 31660eb [Xiangrui Meng] update linear algebra and stat 0a40837 [Xiangrui Meng] first pass over linear methods 1fc8271 [Xiangrui Meng] update toc 906ed0a [Xiangrui Meng] add a python example to naive bayes 5f0a700 [Xiangrui Meng] update collaborative filtering 656d416 [Xiangrui Meng] update mllib-clustering 86e143a [Xiangrui Meng] remove data types section from main page 8d1a128 [Xiangrui Meng] move part of linear algebra to data types and add Java/Python examples d1b5cbf [Xiangrui Meng] merge master 72e4804 [Xiangrui Meng] one pass over tree guide 64f8995 [Xiangrui Meng] move decision tree guide to a separate file 9fca001 [Xiangrui Meng] add first version of linear algebra guide 53c9552 [Xiangrui Meng] update dependencies f316ec2 [Xiangrui Meng] add migration guide f399f6c [Xiangrui Meng] move linear-algebra to dimensionality-reduction 182460f [Xiangrui Meng] add guide for naive Bayes 137fd1d [Xiangrui Meng] re-organize toc a61e434 [Xiangrui Meng] update mllib's toc (cherry picked from commit 26d35f3fd942761b0adecd1a720e1fa834db4de9) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Revert "[maven-release-plugin] prepare release v1.0.0-rc1"Patrick Wendell2014-04-211-1/+1
| | | | This reverts commit 6cc698fc378256fee9111f66c691ced27f54e973.
* Revert "[maven-release-plugin] prepare for next development iteration"Patrick Wendell2014-04-211-1/+1
| | | | This reverts commit 188f7c3f68e93b3e9347ec02e21f5943874b4741.
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-04-211-1/+1
|
* [maven-release-plugin] prepare release v1.0.0-rc1Patrick Wendell2014-04-211-1/+1
|
* Revert "[maven-release-plugin] prepare release v1.0.0"Patrick Wendell2014-04-211-1/+1
| | | | This reverts commit c3c6ea05d9d02a38b97388d583828ca82c5181db.
* Revert "[maven-release-plugin] prepare for next development iteration"Patrick Wendell2014-04-211-1/+1
| | | | This reverts commit 0b49305297033b70cbb525bef54b70a14deeb238.
* [maven-release-plugin] prepare for next development iterationPatrick Wendell2014-04-211-1/+1
|
* [maven-release-plugin] prepare release v1.0.0Patrick Wendell2014-04-211-1/+1
|
* [SPARK-1535] ALS: Avoid the garbage-creating ctor of DoubleMatrixTor Myklebust2014-04-191-2/+11
| | | | | | | | | | | | | | `new DoubleMatrix(double[])` creates a garbage `double[]` of the same length as its argument and immediately throws it away. This pull request avoids that constructor in the ALS code. Author: Tor Myklebust <tmyklebu@gmail.com> Closes #442 from tmyklebu/foo2 and squashes the following commits: 2784fc5 [Tor Myklebust] Mention that this is probably fixed as of jblas 1.2.4; repunctuate. a09904f [Tor Myklebust] Helper function for wrapping Array[Double]'s with DoubleMatrix's. (cherry picked from commit 25fc31884b0382b2d43c55e1f55e305a73dfae91) Signed-off-by: Matei Zaharia <matei@databricks.com>
* SPARK-1357 (addendum). More Experimental items in MLlibSean Owen2014-04-183-0/+10
| | | | | | | | | | | | | | | Per discussion, this is my suggestion to make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0. See what you think of this much. Author: Sean Owen <sowen@cloudera.com> Closes #372 from srowen/SPARK-1357Addendum and squashes the following commits: 17cf1ea [Sean Owen] Remove (another) blank line after ":: Experimental ::" 6800e4c [Sean Owen] Remove blank line after ":: Experimental ::" b3a88d2 [Sean Owen] Make ALS Rating, ClassificationModel, RegressionModel experimental for now, to reserve the right to possibly change after 1.0 (cherry picked from commit 8aa1f4c4f6d60168737699b5a9eafd6a05660976) Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-1483: Rename minSplits to minPartitions in public APIsCodingCat2014-04-181-6/+6
| | | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-1483 From the original JIRA: " The parameter name is part of the public API in Scala and Python, since you can pass named parameters to a method, so we should name it to this more descriptive term. Everywhere else we refer to "splits" as partitions." - @mateiz Author: CodingCat <zhunansjtu@gmail.com> Closes #430 from CodingCat/SPARK-1483 and squashes the following commits: 4b60541 [CodingCat] deprecate defaultMinSplits ba2c663 [CodingCat] Rename minSplits to minPartitions in public APIs (cherry picked from commit e31c8ffca65e0e5cd5f1a6229f3d654a24b7b18c) Signed-off-by: Reynold Xin <rxin@apache.org>
* SPARK-1310: Start adding k-fold cross validation to MLLib [adds kFold to ↵Holden Karau2014-04-162-0/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MLUtils & fixes bug in BernoulliSampler] Author: Holden Karau <holden@pigscanfly.ca> Closes #18 from holdenk/addkfoldcrossvalidation and squashes the following commits: 208db9b [Holden Karau] Fix a bad space e84f2fc [Holden Karau] Fix the test, we should be looking at the second element instead 6ddbf05 [Holden Karau] swap training and validation order 7157ae9 [Holden Karau] CR feedback 90896c7 [Holden Karau] New line 150889c [Holden Karau] Fix up error messages in the MLUtilsSuite 2cb90b3 [Holden Karau] Fix the names in kFold c702a96 [Holden Karau] Fix imports in MLUtils e187e35 [Holden Karau] Move { up to same line as whenExecuting(random) in RandomSamplerSuite.scala c5b723f [Holden Karau] clean up 7ebe4d5 [Holden Karau] CR feedback, remove unecessary learners (came back during merge mistake) and insert an empty line bb5fa56 [Holden Karau] extra line sadness 163c5b1 [Holden Karau] code review feedback 1.to -> 1 to and folds -> numFolds 5a33f1d [Holden Karau] Code review follow up. e8741a7 [Holden Karau] CR feedback b78804e [Holden Karau] Remove cross validation [TODO in another pull request] 91eae64 [Holden Karau] Consolidate things in mlutils 264502a [Holden Karau] Add a test for the bug that was found with BernoulliSampler not copying the complement param dd0b737 [Holden Karau] Wrap long lines (oops) c0b7fa4 [Holden Karau] Switch FoldedRDD to use BernoulliSampler and PartitionwiseSampledRDD 08f8e4d [Holden Karau] Fix BernoulliSampler to respect complement a751ec6 [Holden Karau] Add k-fold cross validation to MLLib (cherry picked from commit c3527a333a0877f4b49614f3fd1f041b01749651) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [WIP] SPARK-1430: Support sparse data in Python MLlibMatei Zaharia2014-04-153-50/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type. On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models. Some to-do items left: - [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector. - [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling. - [x] Explain how to use these in the Python MLlib docs. CC @mengxr, @joshrosen Author: Matei Zaharia <matei@databricks.com> Closes #341 from mateiz/py-ml-update and squashes the following commits: d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge b9f97a3 [Matei Zaharia] Fix test 1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python 88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs 37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script. a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights 74eefe7 [Matei Zaharia] Added LabeledPoint class in Python 889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict a5d6426 [Matei Zaharia] Add linalg.py to run-tests script 0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data 2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data 154f45d [Matei Zaharia] Update docs, name some magic values 881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact (cherry picked from commit 63ca581d9c84176549b1ea0a1d8d7c0cca982acc) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* Decision Tree documentation for MLlib programming guideManish Amde2014-04-151-0/+569
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Added documentation for user to use the decision tree algorithms for classification and regression in Spark 1.0 release. Apart from a general review, I need specific input on the following: * I had to move a lot of the existing documentation under the *linear methods* umbrella to accommodate decision trees. I wonder if there is a better way to organize the programming guide given we are so close to the release. * I have not looked closely at pyspark but I am wondering new mllib algorithms are automatically plugged in or do we need to some extra work to call mllib functions from pyspark. I will add to the pyspark examples based upon the advice I get. cc: @mengxr, @hirakendu, @etrain, @atalwalkar Author: Manish Amde <manish9ue@gmail.com> Closes #402 from manishamde/tree_doc and squashes the following commits: 022485a [Manish Amde] more documentation 865826e [Manish Amde] minor: grammar dbb0e5e [Manish Amde] minor improvements to text b9ef6c4 [Manish Amde] basic decision tree code examples 6e297d7 [Manish Amde] added subsections f427e84 [Manish Amde] renaming sections 9c0c4be [Manish Amde] split candidate 6925275 [Manish Amde] impurity and information gain 94fd2f9 [Manish Amde] more reorg b93125c [Manish Amde] more subsection reorg 3ecb2ad [Manish Amde] minor text addition 1537dd3 [Manish Amde] added placeholders and some doc d06511d [Manish Amde] basic skeleton (cherry picked from commit 07d72fe6965aaf299d61bf6156d48bcfebc41b32) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [SPARK-1157][MLlib] L-BFGS Optimizer based on Breeze's implementation.DB Tsai2014-04-153-14/+480
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR uses Breeze's L-BFGS implement, and Breeze dependency has already been introduced by Xiangrui's sparse input format work in SPARK-1212. Nice work, @mengxr ! When use with regularized updater, we need compute the regVal and regGradient (the gradient of regularized part in the cost function), and in the currently updater design, we can compute those two values by the following way. Let's review how updater works when returning newWeights given the input parameters. w' = w - thisIterStepSize * (gradient + regGradient(w)) Note that regGradient is function of w! If we set gradient = 0, thisIterStepSize = 1, then regGradient(w) = w - w' As a result, for regVal, it can be computed by val regVal = updater.compute( weights, new DoubleMatrix(initialWeights.length, 1), 0, 1, regParam)._2 and for regGradient, it can be obtained by val regGradient = weights.sub( updater.compute(weights, new DoubleMatrix(initialWeights.length, 1), 1, 1, regParam)._1) The PR includes the tests which compare the result with SGD with/without regularization. We did a comparison between LBFGS and SGD, and often we saw 10x less steps in LBFGS while the cost of per step is the same (just computing the gradient). The following is the paper by Prof. Ng at Stanford comparing different optimizers including LBFGS and SGD. They use them in the context of deep learning, but worth as reference. http://cs.stanford.edu/~jngiam/papers/LeNgiamCoatesLahiriProchnowNg2011.pdf Author: DB Tsai <dbtsai@alpinenow.com> Closes #353 from dbtsai/dbtsai-LBFGS and squashes the following commits: 984b18e [DB Tsai] L-BFGS Optimizer based on Breeze's implementation. Also fixed indentation issue in GradientDescent optimizer. (cherry picked from commit 6843d637e72e5262d05cfa2b1935152743f2bd5a) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* SPARK-1488. Resolve scalac feature warnings during buildSean Owen2014-04-143-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For your consideration: scalac currently notes a number of feature warnings during compilation: ``` [warn] there were 65 feature warning(s); re-run with -feature for details ``` Warnings are like: ``` [warn] /Users/srowen/Documents/spark/core/src/main/scala/org/apache/spark/SparkContext.scala:1261: implicit conversion method rddToPairRDDFunctions should be enabled [warn] by making the implicit value scala.language.implicitConversions visible. [warn] This can be achieved by adding the import clause 'import scala.language.implicitConversions' [warn] or by setting the compiler option -language:implicitConversions. [warn] See the Scala docs for value scala.language.implicitConversions for a discussion [warn] why the feature should be explicitly enabled. [warn] implicit def rddToPairRDDFunctions[K: ClassTag, V: ClassTag](rdd: RDD[(K, V)]) = [warn] ^ ``` scalac is suggesting that it's just best practice to explicitly enable certain language features by importing them where used. This PR simply adds the imports it suggests (and squashes one other Java warning along the way). This leaves just deprecation warnings in the build. Author: Sean Owen <sowen@cloudera.com> Closes #404 from srowen/SPARK-1488 and squashes the following commits: 8598980 [Sean Owen] Quiet scalac warnings about language features by explicitly importing language features. 39bc831 [Sean Owen] Enable -feature in scalac to emit language feature warnings (cherry picked from commit 0247b5c5467ca1b0d03ba929a78fa4d805582d84) Signed-off-by: Patrick Wendell <pwendell@gmail.com>
* [WIP] [SPARK-1328] Add vector statisticsXusen Yin2014-04-115-76/+230
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As with the new vector system in MLlib, we find that it is good to add some new APIs to precess the `RDD[Vector]`. Beside, the former implementation of `computeStat` is not stable which could loss precision, and has the possibility to cause `Nan` in scientific computing, just as said in the [SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328). APIs contain: * rowMeans(): RDD[Double] * rowNorm2(): RDD[Double] * rowSDs(): RDD[Double] * colMeans(): Vector * colMeans(size: Int): Vector * colNorm2(): Vector * colNorm2(size: Int): Vector * colSDs(): Vector * colSDs(size: Int): Vector * maxOption((Vector, Vector) => Boolean): Option[Vector] * minOption((Vector, Vector) => Boolean): Option[Vector] * rowShrink(): RDD[Vector] * colShrink(): RDD[Vector] This is working in process now, and some more APIs will add to `LabeledPoint`. Moreover, the implicit declaration will move from `MLUtils` to `MLContext` later. Author: Xusen Yin <yinxusen@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #268 from yinxusen/vector-statistics and squashes the following commits: d61363f [Xusen Yin] rebase to latest master 16ae684 [Xusen Yin] fix minor error and remove useless method 10cf5d3 [Xusen Yin] refine some return type b064714 [Xusen Yin] remove computeStat in MLUtils cbbefdb [Xiangrui Meng] update multivariate statistical summary interface and clean tests 4eaf28a [Xusen Yin] merge VectorRDDStatistics into RowMatrix 48ee053 [Xusen Yin] fix minor error e624f93 [Xusen Yin] fix scala style error 1fba230 [Xusen Yin] merge while loop together 69e1f37 [Xusen Yin] remove lazy eval, and minor memory footprint 548e9de [Xusen Yin] minor revision 86522c4 [Xusen Yin] add comments on functions dc77e38 [Xusen Yin] test sparse vector RDD 18cf072 [Xusen Yin] change def to lazy val to make sure that the computations in function be evaluated only once f7a3ca2 [Xusen Yin] fix the corner case of maxmin 967d041 [Xusen Yin] full revision with Aggregator class 138300c [Xusen Yin] add new Aggregator class 1376ff4 [Xusen Yin] rename variables and adjust code 4a5c38d [Xusen Yin] add scala doc, refine code and comments 036b7a5 [Xusen Yin] fix the bug of Nan occur f6e8e9a [Xusen Yin] add sparse vectors test 4cfbadf [Xusen Yin] fix bug of min max 4e4fbd1 [Xusen Yin] separate seqop and combop out as independent functions a6d5a2e [Xusen Yin] rewrite for only computing non-zero elements 3980287 [Xusen Yin] rename variables 62a2c3e [Xusen Yin] use axpy and in-place if possible 9a75ebd [Xusen Yin] add case class to wrap return values d816ac7 [Xusen Yin] remove useless APIs c4651bb [Xusen Yin] remove row-wise APIs and refine code 1338ea1 [Xusen Yin] all-in-one version test passed cc65810 [Xusen Yin] add parallel mean and variance 9af2e95 [Xusen Yin] refine the code style ad6c82d [Xusen Yin] add shrink test e09d5d2 [Xusen Yin] add scala docs and refine shrink method 8ef3377 [Xusen Yin] pass all tests 28cf060 [Xusen Yin] fix error of column means 54b19ab [Xusen Yin] add new API to shrink RDD[Vector] 8c6c0e1 [Xusen Yin] add basic statistics
* [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationMetricsXiangrui Meng2014-04-119-0/+671
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It also contains refactoring of https://github.com/apache/spark/pull/160 for binary classification evaluation. Author: Xiangrui Meng <meng@databricks.com> Closes #364 from mengxr/auc and squashes the following commits: a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names 3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics b1b7dab [Xiangrui Meng] fix code styles 9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator ca31da5 [Xiangrui Meng] remove PredictionAndResponse 3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary 8f78958 [Xiangrui Meng] add PredictionAndResponse dda82d5 [Xiangrui Meng] add confusion matrix aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator 221ebce [Xiangrui Meng] add a new test to sliding a920865 [Xiangrui Meng] Merge branch 'sliding' into auc a9b250a [Xiangrui Meng] move sliding to mllib cab9a52 [Xiangrui Meng] use last for the last element db6cb30 [Xiangrui Meng] remove unnecessary toSeq 9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]] 284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD c1c6c22 [Xiangrui Meng] add AreaUnderCurve 65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc 5ee6001 [Xiangrui Meng] add TODO d2a600d [Xiangrui Meng] add sliding to rdd (cherry picked from commit f5ace8da34c58d1005c7c377cfe3df21102c1dd6) Signed-off-by: Matei Zaharia <matei@databricks.com>
* Remove Unnecessary Whitespace'sSandeep2014-04-102-3/+3
| | | | | | | | | | stack these together in a commit else they show up chunk by chunk in different commits. Author: Sandeep <sandeep@techaddict.me> Closes #380 from techaddict/white_space and squashes the following commits: b58f294 [Sandeep] Remove Unnecessary Whitespace's
* [SPARK-1357 (fix)] remove empty line after :: DeveloperApi/Experimental ::Xiangrui Meng2014-04-0933-71/+21
| | | | | | | | | | Remove empty line after :: DeveloperApi/Experimental :: in comments to make the original doc show up in the preview of the generated html docs. Thanks @andrewor14 ! Author: Xiangrui Meng <meng@databricks.com> Closes #373 from mengxr/api and squashes the following commits: 9c35bdc [Xiangrui Meng] remove the empty line after :: DeveloperApi/Experimental ::
* [SPARK-1357] [MLLIB] Annotate developer and experimental APIsXiangrui Meng2014-04-0942-122/+355
| | | | | | | | | | | | | | | | | | | | | Annotate developer and experimental APIs in MLlib. Author: Xiangrui Meng <meng@databricks.com> Closes #298 from mengxr/api and squashes the following commits: 13390e8 [Xiangrui Meng] Merge branch 'master' into api dc4cbb3 [Xiangrui Meng] mark distribute matrices experimental 6b9f8e2 [Xiangrui Meng] add Experimental annotation 8773d0d [Xiangrui Meng] add DeveloperApi annotation da31733 [Xiangrui Meng] update developer and experimental tags 555e0fe [Xiangrui Meng] Merge branch 'master' into api ef1a717 [Xiangrui Meng] mark some constructors private add default parameters to JavaDoc 00ffbcc [Xiangrui Meng] update tree API annotation 0b674fa [Xiangrui Meng] mark decision tree APIs 86b9e34 [Xiangrui Meng] one pass over APIs of GLMs, NaiveBayes, and ALS f21d862 [Xiangrui Meng] Merge branch 'master' into api 2b133d6 [Xiangrui Meng] intial annotation of developer and experimental apis