path: root/mllib
Commit message (Author, Age, Files, Lines)
* Merge pull request #528 from mengxr/sample. Closes #528. (Xiangrui Meng, 2014-02-03, 1 file, -2/+1)
|
|     Refactor RDD sampling and add randomSplit to RDD (update)
|
|     Replace SampledRDD by PartitionwiseSampledRDD, which accepts a RandomSampler
|     instance as input. The current sample with/without replacement can be easily
|     integrated via BernoulliSampler and PoissonSampler. The benefits are:
|
|     1) RDD.randomSplit is implemented in the same way, related to
|        https://github.com/apache/incubator-spark/pull/513
|     2) Stratified sampling and importance sampling can be implemented in the
|        same manner as well.
|
|     Unit tests are included for samplers and RDD.randomSplit.
|
|     This should perform better than my previous request, where the
|     BernoulliSampler creates many Iterator instances:
|     https://github.com/apache/incubator-spark/pull/513
|
|     Author: Xiangrui Meng <meng@databricks.com>
|
|     == Merge branch commits ==
|
|     commit e8ce957e5f0a600f2dec057924f4a2ca6adba373
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Mon Feb 3 12:21:08 2014 -0800
|
|         more docs to PartitionwiseSampledRDD
|
|     commit fbb4586d0478ff638b24bce95f75ff06f713d43b
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Mon Feb 3 00:44:23 2014 -0800
|
|         move XORShiftRandom to util.random and use it in BernoulliSampler
|
|     commit 987456b0ee8612fd4f73cb8c40967112dc3c4c2d
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sat Feb 1 11:06:59 2014 -0800
|
|         relax assertions in SortingSuite because the RangePartitioner has large variance in this case
|
|     commit 3690aae416b2dc9b2f9ba32efa465ba7948477f4
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sat Feb 1 09:56:28 2014 -0800
|
|         test split ratio of RDD.randomSplit
|
|     commit 8a410bc933a60c4d63852606f8bbc812e416d6ae
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sat Feb 1 09:25:22 2014 -0800
|
|         add a test to ensure seed distribution and minor style update
|
|     commit ce7e866f674c30ab48a9ceb09da846d5362ab4b6
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Fri Jan 31 18:06:22 2014 -0800
|
|         minor style change
|
|     commit 750912b4d77596ed807d361347bd2b7e3b9b7a74
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Fri Jan 31 18:04:54 2014 -0800
|
|         fix some long lines
|
|     commit c446a25c38d81db02821f7f194b0ce5ab4ed7ff5
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Fri Jan 31 17:59:59 2014 -0800
|
|         add complement to BernoulliSampler and minor style changes
|
|     commit dbe2bc2bd888a7bdccb127ee6595840274499403
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Fri Jan 31 17:45:08 2014 -0800
|
|         switch to partition-wise sampling for better performance
|
|     commit a1fca5232308feb369339eac67864c787455bb23
|     Merge: ac712e4 cf6128f
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Fri Jan 31 16:33:09 2014 -0800
|
|         Merge branch 'sample' of github.com:mengxr/incubator-spark into sample
|
|     commit cf6128fb672e8c589615adbd3eaa3cbdb72bd461
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sun Jan 26 14:40:07 2014 -0800
|
|         set SampledRDD deprecated in 1.0
|
|     commit f430f847c3df91a3894687c513f23f823f77c255
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sun Jan 26 14:38:59 2014 -0800
|
|         update code style
|
|     commit a8b5e2021a9204e318c80a44d00c5c495f1befb6
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sun Jan 26 12:56:27 2014 -0800
|
|         move package random to util.random
|
|     commit ab0fa2c4965033737a9e3a9bf0a59cbb0df6a6f5
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sun Jan 26 12:50:35 2014 -0800
|
|         add Apache headers and update code style
|
|     commit 985609fe1a55655ad11966e05a93c18c138a403d
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sun Jan 26 11:49:25 2014 -0800
|
|         add new lines
|
|     commit b21bddf29850a2c006a868869b8f91960a029322
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sun Jan 26 11:46:35 2014 -0800
|
|         move samplers to random.IndependentRandomSampler and add tests
|
|     commit c02dacb4a941618e434cefc129c002915db08be6
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Sat Jan 25 15:20:24 2014 -0800
|
|         add RandomSampler
|
|     commit 8ff7ba3c5cf1fc338c29ae8b5fa06c222640e89c
|     Author: Xiangrui Meng <meng@databricks.com>
|     Date:   Fri Jan 24 13:23:22 2014 -0800
|
|         init impl of IndependentlySampledRDD
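Note on the sampling refactor in #528: the idea of expressing sample-without-replacement, sample-with-replacement and randomSplit through interchangeable samplers can be sketched in plain Scala without Spark. Everything named here (`SamplerSketch`, `bernoulli`, `poisson`, `randomSplit`) is hypothetical and is not the API added by the pull request; the sketch works on an in-memory `Seq`, uses `scala.util.Random` rather than XORShiftRandom, and draws Poisson counts with Knuth's multiplication method, which is an assumption of this example.

```scala
import scala.util.Random

// Hypothetical stand-alone sketch; these are not the Spark classes named in the commit.
object SamplerSketch {

  // Bernoulli-style sampling (without replacement): draw one uniform number
  // per item and keep the item when the draw falls in [lb, ub).
  def bernoulli[T](items: Seq[T], lb: Double, ub: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    items.filter { _ =>
      val x = rng.nextDouble()
      x >= lb && x < ub
    }
  }

  // Poisson-style sampling (with replacement): each item is emitted k times,
  // where k is a Poisson(mean) draw (Knuth's method, suitable for small means).
  def poisson[T](items: Seq[T], mean: Double, seed: Long): Seq[T] = {
    val rng = new Random(seed)
    val limit = math.exp(-mean)
    def draw(): Int = {
      var k = 0
      var p = rng.nextDouble()
      while (p > limit) {
        k += 1
        p *= rng.nextDouble()
      }
      k
    }
    items.flatMap(item => Seq.fill(draw())(item))
  }

  // Random split: reuse one seed with disjoint ranges [b_i, b_{i+1}), so each
  // item gets the same draw in every pass and lands in exactly one part.
  def randomSplit[T](items: Seq[T], weights: Seq[Double], seed: Long): Seq[Seq[T]] = {
    val total = weights.sum
    val cumulative = weights.map(_ / total).scanLeft(0.0)(_ + _)
    // Pin the last bound to 1.0 so rounding cannot leave a gap at the top.
    val bounds = cumulative.init :+ 1.0
    bounds.sliding(2).map(b => bernoulli(items, b(0), b(1), seed)).toList
  }
}
```

For example, `SamplerSketch.randomSplit((1 to 1000).toList, Seq(0.6, 0.4), 42L)` yields two parts that together contain every input item exactly once, because both passes replay the same sequence of draws.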
* Merge pull request #460 from srowen/RandomInitialALSVectors (Sean Owen, 2014-01-27, 1 file, -1/+9)
|
|     Choose initial user/item vectors uniformly on the unit sphere
|
|     ...rather than within the unit square to possibly avoid bias in the initial
|     state and improve convergence.
|
|     The current implementation picks the N vector elements uniformly at random
|     from [0,1). This means they all point into one quadrant of the vector space.
|     As N gets just a little large, the vectors tend strongly to point into the
|     "corner", towards (1,1,1...,1). The vectors are not unit vectors either.
|
|     I suggest choosing the elements as Gaussian ~ N(0,1) and normalizing. This
|     gets you uniform random choices on the unit sphere, which is more what's of
|     interest here. It has worked a little better for me in the past.
|
|     This is pretty minor but wanted to warm up suggesting a few tweaks to ALS.
|     Please excuse my Scala, pretty new to it.
|
|     Author: Sean Owen <sowen@cloudera.com>
|
|     == Merge branch commits ==
|
|     commit 492b13a7469e5a4ed7591ee8e56d8bd7570dfab6
|     Author: Sean Owen <sowen@cloudera.com>
|     Date:   Mon Jan 27 08:05:25 2014 +0000
|
|         Style: spaces around binary operators
|
|     commit ce2b5b5a4fefa0356875701f668f01f02ba4d87e
|     Author: Sean Owen <sowen@cloudera.com>
|     Date:   Sun Jan 19 22:50:03 2014 +0000
|
|         Generate factors with all positive components, per discussion in https://github.com/apache/incubator-spark/pull/460
|
|     commit b6f7a8a61643a8209e8bc662e8e81f2d15c710c7
|     Author: Sean Owen <sowen@cloudera.com>
|     Date:   Sat Jan 18 15:54:42 2014 +0000
|
|         Choose initial user/item vectors uniformly on the unit sphere rather than within the unit square to possibly avoid bias in the initial state and improve convergence
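Note on #460: the proposal (Gaussian components, then normalize) can be sketched in a few lines of plain Scala. The object and method names below are hypothetical and are not taken from the ALS source; the second method is only a guess at what "all positive components" in the follow-up commit amounts to, namely taking absolute values before use.

```scala
import scala.util.Random

// Hypothetical helper names; a sketch of the idea, not the ALS code itself.
object UnitSphereSketch {

  // Draw each component from N(0, 1), then scale the vector to length 1.
  // The joint Gaussian density depends only on the vector's length, so the
  // resulting direction is uniform on the unit sphere. Assumes rank >= 1.
  def randomUnitVector(rank: Int, rng: Random): Array[Double] = {
    val v = Array.fill(rank)(rng.nextGaussian())
    val norm = math.sqrt(v.map(x => x * x).sum)
    v.map(_ / norm)
  }

  // Assumed variant: same direction folded into the non-negative orthant.
  // Taking absolute values does not change the length, so it stays a unit vector.
  def randomNonNegativeUnitVector(rank: Int, rng: Random): Array[Double] =
    randomUnitVector(rank, rng).map(x => math.abs(x))
}
```

By contrast, components drawn uniformly from [0,1) give a vector whose expected squared length is N/3, which is why the original vectors were neither unit length nor evenly spread in direction.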
* Merge pull request #315 from rezazadeh/sparsesvd (Matei Zaharia, 2014-01-22, 5 files, -0/+433)
|\
| |     Sparse SVD
| |
| |     # Singular Value Decomposition
| |     Given an *m x n* matrix *A*, compute matrices *U, S, V* such that
| |
| |     *A = U * S * V^T*
| |
| |     There is no restriction on m, but we require n^2 doubles to fit in memory.
| |     Further, n should be less than m.
| |
| |     The decomposition is computed by first computing *A^TA = V S^2 V^T*,
| |     computing svd locally on that (since n x n is small),
| |     from which we recover S and V.
| |     Then we compute U via easy matrix multiplication
| |     as *U = A * V * S^-1*
| |
| |     Only singular vectors associated with the largest k singular values are returned.
| |     If there are k such values, then the dimensions of the return will be:
| |
| |     * *S* is *k x k* and diagonal, holding the singular values on diagonal.
| |     * *U* is *m x k* and satisfies U^T*U = eye(k).
| |     * *V* is *n x k* and satisfies V^TV = eye(k).
| |
| |     All input and output is expected in sparse matrix format, 0-indexed
| |     as tuples of the form ((i,j),value) all in RDDs.
| |
| |     # Testing
| |     Tests included. They test:
| |     - Decomposition promise (A = USV^T)
| |     - For small matrices, output is compared to that of jblas
| |     - Rank 1 matrix test included
| |     - Full Rank matrix test included
| |     - Middle-rank matrix forced via k included
| |
| |     # Example Usage
| |
| |         import org.apache.spark.SparkContext
| |         import org.apache.spark.mllib.linalg.SVD
| |         import org.apache.spark.mllib.linalg.SparseMatrix
| |         import org.apache.spark.mllib.linalg.MatrixEntry
| |
| |         // Load and parse the data file
| |         val data = sc.textFile("mllib/data/als/test.data").map { line =>
| |           val parts = line.split(',')
| |           MatrixEntry(parts(0).toInt, parts(1).toInt, parts(2).toDouble)
| |         }
| |         val m = 4
| |         val n = 4
| |
| |         // recover top 1 singular vector
| |         val decomposed = SVD.sparseSVD(SparseMatrix(data, m, n), 1)
| |
| |         println("singular values = " + decomposed.S.data.toArray.mkString)
| |
| |     # Documentation
| |     Added to docs/mllib-guide.md
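Note on #315: the route described above (form the small n x n matrix A^T A, decompose it locally, then recover U as A * V * S^-1) can be illustrated for k = 1 on a small dense matrix. This is only a sketch under stated assumptions: `GramSvdSketch` and its methods are hypothetical names, the matrix is a dense in-memory array rather than an RDD of entries, and power iteration stands in for the local SVD that the pull request performs. It also assumes the all-ones starting vector is not orthogonal to the top singular vector and that the top singular value is non-zero.

```scala
// Hypothetical stand-alone sketch for k = 1 on a small dense matrix.
object GramSvdSketch {
  type Matrix = Array[Array[Double]]

  // G = A^T A, an n x n matrix built from the columns of A.
  def gram(a: Matrix): Matrix = {
    val n = a(0).length
    Array.tabulate(n, n)((i, j) => a.map(row => row(i) * row(j)).sum)
  }

  private def norm(v: Array[Double]): Double =
    math.sqrt(v.map(x => x * x).sum)

  // Matrix-vector product m * v.
  private def times(m: Matrix, v: Array[Double]): Array[Double] =
    m.map(row => row.zip(v).map { case (x, y) => x * y }.sum)

  // Returns (sigma, u, v) for the largest singular value of A.
  def top1(a: Matrix, iterations: Int = 200): (Double, Array[Double], Array[Double]) = {
    val g = gram(a)
    // Power iteration on G converges to its top eigenvector, which is the
    // top right singular vector v of A (since A^T A = V S^2 V^T).
    var v = Array.fill(g.length)(1.0)
    for (_ <- 0 until iterations) {
      val w = times(g, v)
      val len = norm(w)
      v = w.map(_ / len)
    }
    val av = times(a, v)
    val sigma = norm(av)       // |A v| equals the top singular value
    val u = av.map(_ / sigma)  // one column of U = A * V * S^-1
    (sigma, u, v)
  }
}
```

For the 3 x 2 matrix with rows (3, 0), (0, 2), (0, 0), the Gram matrix is diag(9, 4), so the sketch returns a singular value of 3 with v along the first coordinate axis.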
| * rename to MatrixSVD (Reza Zadeh, 2014-01-17, 1 file, -2/+2)
| |
| * rename to MatrixSVD (Reza Zadeh, 2014-01-17, 2 files, -4/+4)
| |
| * Merge remote-tracking branch 'upstream/master' into sparsesvd (Reza Zadeh, 2014-01-17, 17 files, -48/+45)
| |\
| * | prettify (Reza Zadeh, 2014-01-17, 1 file, -2/+2)
| | |
| * | add rename computeSVD (Reza Zadeh, 2014-01-17, 1 file, -1/+1)
| | |
| * | replace this.type with SVD (Reza Zadeh, 2014-01-17, 1 file, -1/+1)
| | |
| * | use 0-indexing (Reza Zadeh, 2014-01-17, 4 files, -12/+12)
| | |
| * | Merge remote-tracking branch 'upstream/master' into sparsesvd (Reza Zadeh, 2014-01-13, 11 files, -35/+191)
| |\ \
| * \ \ Merge remote-tracking branch 'upstream/master' into sparsesvd (Reza Zadeh, 2014-01-09, 7 files, -5/+443)
| |\ \ \
| | | | |     Conflicts:
| | | | |         docs/mllib-guide.md
| * | | | More sparse matrix usage. (Reza Zadeh, 2014-01-07, 1 file, -1/+2)
| | | | |
| * | | | use SparseMatrix everywhere (Reza Zadeh, 2014-01-04, 4 files, -71/+84)
| | | | |
| * | | | prettify (Reza Zadeh, 2014-01-04, 2 files, -21/+22)
| | | | |
| * | | | new example file (Reza Zadeh, 2014-01-04, 1 file, -1/+0)
| | | | |
| * | | | fix tests (Reza Zadeh, 2014-01-04, 2 files, -20/+36)
| | | | |
| * | | | set methods (Reza Zadeh, 2014-01-04, 1 file, -7/+52)
| | | | |
| * | | | add k parameter (Reza Zadeh, 2014-01-04, 2 files, -14/+13)
| | | | |
| * | | | using decomposed matrix struct now (Reza Zadeh, 2014-01-04, 3 files, -17/+16)
| | | | |
| * | | | new return struct (Reza Zadeh, 2014-01-04, 1 file, -0/+33)
| | | | |
| * | | | start using matrixentry (Reza Zadeh, 2014-01-03, 1 file, -9/+14)
| | | | |
| * | | | rename sparsesvd.scala (Reza Zadeh, 2014-01-03, 1 file, -0/+0)
| | | | |
| * | | | New matrix entry file (Reza Zadeh, 2014-01-03, 1 file, -0/+27)
| | | | |
| * | | | fix error message (Reza Zadeh, 2014-01-02, 1 file, -1/+1)
| | | | |
| * | | | Merge remote-tracking branch 'upstream/master' into sparsesvd (Reza Zadeh, 2014-01-02, 1 file, -7/+6)
| |\ \ \ \
| * | | | | more docs yay (Reza Zadeh, 2014-01-01, 1 file, -1/+4)
| | | | | |
| * | | | | javadoc for sparsesvd (Reza Zadeh, 2014-01-01, 1 file, -3/+7)
| | | | | |
| * | | | | tweaks to docs (Reza Zadeh, 2014-01-01, 1 file, -5/+4)
| | | | | |
| * | | | | large scale considerations (Reza Zadeh, 2013-12-27, 1 file, -2/+2)
| | | | | |
| * | | | | initial large scale testing begin (Reza Zadeh, 2013-12-27, 1 file, -4/+4)
| | | | | |
| * | | | | cleanup documentation (Reza Zadeh, 2013-12-27, 1 file, -2/+2)
| | | | | |
| * | | | | add all tests (Reza Zadeh, 2013-12-27, 1 file, -0/+142)
| | | | | |
| * | | | | test for truncated svd (Reza Zadeh, 2013-12-27, 1 file, -51/+50)
| | | | | |
| * | | | | full rank matrix test added (Reza Zadeh, 2013-12-26, 1 file, -1/+9)
| | | | | |
| * | | | | Main method added for svd (Reza Zadeh, 2013-12-26, 1 file, -4/+4)
| | | | | |
| * | | | | new main file (Reza Zadeh, 2013-12-26, 1 file, -10/+19)
| | | | | |
| * | | | | Object to hold the svd methods (Reza Zadeh, 2013-12-26, 1 file, -58/+74)
| | | | | |
| * | | | | Some documentation (Reza Zadeh, 2013-12-26, 1 file, -0/+47)
| | | | | |
| * | | | | Initial files - no tests (Reza Zadeh, 2013-12-26, 1 file, -0/+68)
| | | | | |
* | | | | | Fixed import order (Andrew Tulloch, 2014-01-21, 5 files, -7/+4)
| | | | | |
* | | | | | LocalSparkContext for MLlib (Andrew Tulloch, 2014-01-19, 10 files, -109/+42)
| | | | | |
* | | | | | Correct L2 regularized weight update with canonical form (Sean Owen, 2014-01-18, 1 file, -1/+5)
| |_|_|_|/
|/| | | |
* | | | | Merge pull request #414 from soulmachine/code-style (Reynold Xin, 2014-01-15, 16 files, -48/+28)
|\ \ \ \ \
| | | | | |     Code clean up for mllib
| | | | | |
| | | | | |     * Removed unnecessary parentheses
| | | | | |     * Removed unused imports
| | | | | |     * Simplified `filter...size()` to `count ...`
| | | | | |     * Removed obsoleted parameters' comments
| * | | | | Added parentheses for that getDouble() also has side effect (Frank Dai, 2014-01-14, 1 file, -1/+1)
| | | | | |
| * | | | | Merge remote-tracking branch 'upstream/master' into code-style (Frank Dai, 2014-01-14, 10 files, -12/+168)
| |\ \ \ \ \
| | | |_|_|/
| | |/| | |
| * | | | | Indent two spaces (Frank Dai, 2014-01-14, 4 files, -6/+6)
| | | | | |
| * | | | | Since getLong() and getInt() have side effect, get back parentheses, and remove an empty line (Frank Dai, 2014-01-14, 2 files, -10/+9)
| | | | | |
| * | | | | Code clean up for mllib (Frank Dai, 2014-01-14, 16 files, -63/+44)
| | | | | |
* | | | | | Add missing header files (Patrick Wendell, 2014-01-14, 1 file, -0/+17)
| |/ / / /
|/| | | |