path: root/mllib/src/main/scala/org
* [SPARK-11769][ML] Add save, load to all basic Transformers
  Joseph K. Bradley, 2015-11-17 (16 files changed, -62/+279)
  This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds:
  * Bucketizer
  * DCT
  * HashingTF
  * Interaction
  * NGram
  * Normalizer
  * OneHotEncoder
  * PolynomialExpansion
  * QuantileDiscretizer
  * RFormula
  * SQLTransformer
  * StopWordsRemover
  * StringIndexer
  * Tokenizer
  * VectorAssembler
  * VectorSlicer
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9755 from jkbradley/transformer-io.
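A minimal sketch of the persistence API these transformers gained (param values and path are illustrative):

```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCol("raw")
  .setOutputCol("bucketed")
  .setSplits(Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity))

// Save the transformer, then read it back with identical params.
bucketizer.write.overwrite().save("/tmp/bucketizer")  // illustrative path
val restored = Bucketizer.load("/tmp/bucketizer")
```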
* [SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors
  Xiangrui Meng, 2015-11-17 (1 file changed, -0/+45)
  This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9751 from mengxr/SPARK-11766.
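A short sketch of the round trip this adds (the JSON layout follows `VectorUDT`, so treat the comment as approximate):

```scala
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(4, Array(0, 2), Array(1.0, -1.5))
val json = v.toJson            // VectorUDT-style fields: type/size/indices/values
val parsed = Vectors.fromJson(json)
assert(parsed == v)            // the round trip preserves the vector
```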
* [SPARK-11612][ML] Pipeline and PipelineModel persistence
  Joseph K. Bradley, 2015-11-16 (2 files changed, -5/+174)
  Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable.
  Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed.
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9674 from jkbradley/pipeline-io.
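A sketch of how pipeline persistence composes with per-stage persistence (stage choices and path are illustrative):

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

val pipeline = new Pipeline().setStages(Array(
  new Tokenizer().setInputCol("text").setOutputCol("words"),
  new HashingTF().setInputCol("words").setOutputCol("features")))

// Succeeds because every stage is itself Writable.
pipeline.write.overwrite().save("/tmp/pipeline")   // illustrative path
val restored = Pipeline.load("/tmp/pipeline")
```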
* [SPARK-11672][ML] Set active SQLContext in MLlibTestSparkContext.beforeAll
  Xiangrui Meng, 2015-11-13 (1 file changed, -2/+5)
  Still saw some error messages caused by `SQLContext.getOrCreate`: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/
  This PR sets the active SQLContext in beforeAll, which is not automatically set in `new SQLContext`. This makes `SQLContext.getOrCreate` return the right SQLContext.
  cc: yhuai
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9694 from mengxr/SPARK-11672.3.
* [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame
  Yanbo Liang, 2015-11-13 (1 file changed, -1/+1)
  Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrames, including:
  * Use the libSVM data source for all example code under examples/ml, and remove unused imports.
  * Use the libSVM data source for the user guides under ml-*** that were omitted by #8697.
  * Fix bug: we should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides mistakenly use ```sqlContext.read.format("libsvm").load(path)```.
  * Code cleanup.
  mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9690 from yanboliang/spark-11723.
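For reference, the data-source call on the Scala side (the path points at Spark's standard sample data, assumed present):

```scala
// Scala: load a libSVM file directly into a DataFrame with a
// "label" column and a "features" vector column.
val df = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

// Java needs the explicit method call on the reader:
// DataFrame df = sqlContext.read().format("libsvm").load(path);
```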
* [SPARK-11712][ML] Make spark.ml LDAModel be abstract
  Joseph K. Bradley, 2015-11-12 (1 file changed, -86/+94)
  Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases.
  CC: feynmanliang mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9678 from jkbradley/lda-pipelines-2.
* [SPARK-11674][ML] add private val after @transient in Word2VecModel
  Xiangrui Meng, 2015-11-11 (1 file changed, -1/+1)
  Without it, this causes a compile failure with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11. I tested the compile locally.)
  JoshRosen
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9644 from mengxr/SPARK-11674.
* [SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow
  Yuming Wang, 2015-11-11 (1 file changed, -18/+16)
  org.apache.spark.ml.feature.Word2Vec.transform() was very slow: we should not read the broadcast variable for every sentence.
  Author: Yuming Wang <q79969786@gmail.com>
  Author: yuming.wang <q79969786@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9592 from 979969786/master.
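A hypothetical sketch of the pattern behind this fix (names invented): materialize the word-to-vector map from the broadcast once per partition instead of once per sentence.

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD

// Assumed shapes: sentences as word sequences, a broadcast word -> vector map.
def transform(sentences: RDD[Seq[String]],
              bcVectors: Broadcast[Map[String, Array[Float]]]): RDD[Seq[Array[Float]]] =
  sentences.mapPartitions { iter =>
    val vectors = bcVectors.value          // one broadcast read per partition
    iter.map(_.flatMap(vectors.get))       // no broadcast lookup per sentence
  }
```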
* [SPARK-6726][ML] Import/export for spark.ml LogisticRegressionModel
  Joseph K. Bradley, 2015-11-10 (2 files changed, -8/+134)
  This PR adds model save/load for spark.ml's LogisticRegressionModel. It also does minor refactoring of the default save/load classes to reuse code.
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9606 from jkbradley/logreg-io2.
* [SPARK-11566][MLLIB][PYTHON] Refactoring GaussianMixtureModel.gaussians in Python
  Yu ISHIKAWA, 2015-11-10 (1 file changed, -15/+6)
  cc jkbradley
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9534 from yu-iskw/SPARK-11566.
* [SPARK-5565][ML] LDA wrapper for Pipelines API
  Joseph K. Bradley, 2015-11-10 (2 files changed, -5/+725)
  This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA: [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
  * I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
  Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR.
  CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6.
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9513 from jkbradley/lda-pipelines.
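A minimal sketch of the resulting spark.ml API, assuming a DataFrame `dataset` with a "features" column of term-count vectors (note: no doc ID column is needed):

```scala
import org.apache.spark.ml.clustering.LDA

// k and maxIter are illustrative.
val lda = new LDA().setK(10).setMaxIter(20)
val model = lda.fit(dataset)

// Top terms per topic, returned as a DataFrame.
model.describeTopics(5).show()
```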
* [SPARK-7316][MLLIB] RDD sliding window with step
  unknown, 2015-11-10 (2 files changed, -35/+47)
  Implementation of a step capability for the sliding window function in MLlib's RDD. Though one can use the current sliding window with step 1 and then filter every Nth window, it will take more time and space (N*data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:

  Window | Step | Time  | Windows produced
  ------ | ---- | ----- | ----------------
  128    | 1    | 6.38  | 9999873
  128    | 10   | 0.9   | 999988
  128    | 100  | 0.41  | 99999
  1024   | 1    | 44.67 | 9998977
  1024   | 10   | 4.74  | 999898
  1024   | 100  | 0.78  | 99990

  ```
  import org.apache.spark.mllib.rdd.RDDFunctions._
  val rdd = sc.parallelize(1 to 10000000, 10)
  rdd.count
  val window = 1024
  val step = 1
  val t = System.nanoTime()
  val windows = rdd.sliding(window, step)
  println(windows.count)
  println((System.nanoTime() - t) / 1e9)
  ```

  Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
  Author: Alexander Ulanov <nashb@yandex.ru>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5855 from avulanov/SPARK-7316-sliding.
* [SPARK-11618][ML] Minor refactoring of basic ML import/export
  Joseph K. Bradley, 2015-11-10 (1 file changed, -27/+30)
  Refactoring:
  * separated overwrite and param save logic in DefaultParamsWriter
  * added sparkVersion to DefaultParamsWriter
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9587 from jkbradley/logreg-io.
* [SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase
  Yuhao Yang, 2015-11-09 (1 file changed, -2/+17)
  jira: https://issues.apache.org/jira/browse/SPARK-11069
  Quoting the JIRA: Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal:
  * call the Boolean Param "toLowercase"
  * set the default to false (so behavior does not change)
  Actually, sklearn converts to lowercase before tokenizing too.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9092 from hhbyyh/tokenLower.
* [SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering
  Yu ISHIKAWA, 2015-11-09 (2 files changed, -0/+586)
  I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation or spark.ml APIs; I am going to send other PRs later. https://issues.apache.org/jira/browse/SPARK-6517
  - This implementation is based on bisecting K-means clustering.
  - It derives from freeman-lab's implementation.
  - The basic idea is not changed from the previous version (#2906).
  - However, it is 1000x faster than the previous version through parallel processing.
  Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
  Closes #5267 from yu-iskw/new-hierarchical-clustering.
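A hedged sketch of the bisecting k-means API this work introduced in spark.mllib (class and method names match what the feature became; toy data and k are illustrative):

```scala
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy 1-D points; real inputs would be feature vectors.
val data = sc.parallelize(Seq(0.0, 1.0, 9.0, 10.0).map(Vectors.dense(_)))

val model = new BisectingKMeans().setK(2).run(data)
model.clusterCenters.foreach(println)
```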
* [SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model
  fazlan-nazeem, 2015-11-09 (1 file changed, -0/+1)
  The pmml models currently generated do not specify the pmml version in their root node. This is a problem when using such a pmml model in other tools, because they expect the version attribute to be set explicitly. This fix adds the pmml version attribute to the generated pmml models and specifies its value as 4.2.
  Author: fazlan-nazeem <fazlann@wso2.com>
  Closes #9558 from fazlan-nazeem/master.
* [SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression
  Yanbo Liang, 2015-11-09 (1 file changed, -4/+46)
  Expose R-like summary statistics in SparkR::glm for linear regression. The output of ```summary``` looks like:
  ```
  $DevianceResiduals
   Min        Max
   -0.9509607 0.7291832

  $Coefficients
                     Estimate   Std. Error  t value   Pr(>|t|)
  (Intercept)        1.6765     0.2353597   7.123139  4.456124e-11
  Sepal_Length       0.3498801  0.04630128  7.556598  4.187317e-12
  Species_versicolor -0.9833885 0.07207471  -13.64402 0
  Species_virginica  -1.00751   0.09330565  -10.79796 0
  ```
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9561 from yanboliang/spark-11494.
* [SPARK-8467][MLLIB][PYSPARK] Add LDAModel.describeTopics() in Python
  Yu ISHIKAWA, 2015-11-06 (2 files changed, -2/+57)
  Could jkbradley and davies review it?
  - Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly: `Array[(Array[Int], Array[Double])]` is too complicated to convert.
  - Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
  [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8643 from yu-iskw/SPARK-8467-2.
* [SPARK-11217][ML] save/load for non-meta estimators and transformers
  Xiangrui Meng, 2015-11-06 (3 files changed, -3/+230)
  This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
  * class name
  * uid
  * timestamp
  * paramMap
  The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
  ~~~scala
  instance.save("path")
  instance.write.context(sqlContext).overwrite().save("path")
  Instance.load("path")
  ~~~
  The param handling is different from the design doc. We didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues, and it would also cause other issues if we modify the default params.
  TODOs:
  * [x] Java test
  * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
  cc jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9454 from mengxr/SPARK-11217.
* [SPARK-11514][ML] Pass random seed to spark.ml DecisionTree*
  Yu ISHIKAWA, 2015-11-05 (3 files changed, -7/+12)
  cc jkbradley
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9486 from yu-iskw/SPARK-11514.
* [SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation to ml.regression
  Ehsan M.Kermani, 2015-11-05 (5 files changed, -18/+119)
  Here is my first commit.
  Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
  Closes #8728 from ehsanmok/SinceAnn.
* [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver
  Yanbo Liang, 2015-11-05 (2 files changed, -25/+32)
  Follow-up to [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836): we should also support summary statistics for the ```intercept```.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9485 from yanboliang/spark-11473.
* [SPARK-11506][MLLIB] Removed redundant operation in Online LDA implementation
  a1singh, 2015-11-05 (1 file changed, -1/+1)
  In LDAOptimizer.scala, line 441: since "idx" was never used, the unneeded zipWithIndex.foreach is replaced with foreach.
  - nonEmptyDocs.zipWithIndex.foreach { case ((_, termCounts: Vector), idx: Int) =>
  + nonEmptyDocs.foreach { case (_, termCounts: Vector) =>
  Author: a1singh <a1singh@ucsd.edu>
  Closes #9456 from a1singh/master.
* [SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan
  Yu ISHIKAWA, 2015-11-04 (2 files changed, -1/+54)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9469 from yu-iskw/SPARK-10028.
* [SPARK-9492][ML][R] LogisticRegression in R should provide model statistics
  Yanbo Liang, 2015-11-04 (2 files changed, -7/+17)
  Like ml ```LinearRegression```, ```LogisticRegression``` should provide a training summary including feature names and their coefficients.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9303 from yanboliang/spark-9492.
* [SPARK-11349][ML] Support transform string label for RFormula
  Yanbo Liang, 2015-11-03 (1 file changed, -1/+9)
  Currently ```RFormula``` can only handle labels of ```NumericType``` or ```BinaryType``` (cast to ```DoubleType``` as the label for Linear Regression training); we should also support labels of ```StringType```, which is needed for Logistic Regression (glm with family = "binomial"). For a label of ```StringType```, we should use ```StringIndexer``` to transform it to a 0-based index.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9302 from yanboliang/spark-11349.
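A hedged sketch of how a string label flows through RFormula after this change (column names follow the iris-style example earlier in this log; `dataset` is an assumed DataFrame):

```scala
import org.apache.spark.ml.feature.RFormula

// "Species" is a StringType column; it gets indexed into a 0-based
// "label" column, ready for logistic regression.
val formula = new RFormula()
  .setFormula("Species ~ Sepal_Length + Sepal_Width")
  .setFeaturesCol("features")
  .setLabelCol("label")

val prepared = formula.fit(dataset).transform(dataset)
```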
* [MINOR][ML] Fix naming conventions of AFTSurvivalRegression coefficients
  Yanbo Liang, 2015-11-03 (1 file changed, -19/+19)
  Rename ```regressionCoefficients``` back to ```coefficients```, and rename ```weights``` to ```parameters```. See discussion [here](https://github.com/apache/spark/pull/9311/files#diff-e277fd0bc21f825d3196b4551c01fe5fR230).
  mengxr vectorijk dbtsai
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9431 from yanboliang/aft-coefficients.
* [SPARK-9836][ML] Provide R-like summary statistics for OLS via normal equation solver
  Yanbo Liang, 2015-11-03 (3 files changed, -7/+114)
  https://issues.apache.org/jira/browse/SPARK-9836
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9413 from yanboliang/spark-9836.
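A minimal sketch of reading those statistics from the Scala side, assuming a DataFrame `dataset` with "label" and "features" columns (accessor names as in the 1.6-era API, so treat as a sketch):

```scala
import org.apache.spark.ml.regression.LinearRegression

// The normal-equation solver is what enables these R-like statistics.
val model = new LinearRegression().setSolver("normal").fit(dataset)

val summary = model.summary
println(summary.coefficientStandardErrors.mkString(", "))  // Std. Error
println(summary.tValues.mkString(", "))                    // t value
println(summary.pValues.mkString(", "))                    // Pr(>|t|)
```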
* [MINOR][ML] removed the old `getModelWeights` function
  DB Tsai, 2015-11-02 (1 file changed, -10/+0)
  Removed the old `getModelWeights` function, which was private and had been renamed to `getModelCoefficients`.
  Author: DB Tsai <dbt@netflix.com>
  Closes #9426 from dbtsai/feature-minor.
* [SPARK-10592][ML][PySpark] Deprecate weights and use coefficients instead in ML models
  vectorijk, 2015-11-02 (5 files changed, -31/+46)
  Deprecated in `LogisticRegression` and `LinearRegression`.
  Author: vectorijk <jiangkai@gmail.com>
  Closes #9311 from vectorijk/spark-10592.
* [SPARK-11343][ML] Allow float and double prediction/label columns in RegressionEvaluator
  Dominik Dahlem, 2015-11-02 (1 file changed, -4/+8)
  mengxr, felixcheung
  This pull request relaxes the type of the prediction/label columns to allow float and double. Internally, these columns are cast to double. The other evaluators might need to be changed as well.
  Author: Dominik Dahlem <dominik.dahlem@gmail.combination>
  Closes #9296 from dahlem/ddahlem_regression_evaluator_double_predictions_27102015.
* [SPARK-11358][MLLIB] deprecate runs in k-means
  Xiangrui Meng, 2015-11-02 (1 file changed, -2/+2)
  This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.
  cc: srowen
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9322 from mengxr/SPARK-11358.
* [SPARK-9722][ML] Pass random seed to spark.ml DecisionTree*
  Yu ISHIKAWA, 2015-11-01 (1 file changed, -3/+5)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9402 from yu-iskw/SPARK-9722.
* [SPARK-11385][ML] foreachActive made public in MLLib's vector API
  Nakul Jindal, 2015-10-30 (1 file changed, -3/+6)
  Made foreachActive public in MLLib's vector API.
  Author: Nakul Jindal <njindal@us.ibm.com>
  Closes #9362 from nakul02/SPARK-11385_foreach_for_mllib_linalg_vector.
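A small example of the newly public method; it visits only the explicitly stored entries, so zeros in a sparse vector are skipped:

```scala
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.sparse(5, Array(1, 3), Array(2.0, 4.0))

// Only the two active entries are visited.
v.foreachActive { (index, value) =>
  println(s"v($index) = $value")
}
```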
* [SPARK-11207][ML] Add test cases for solver selection of LinearRegression as follow-up
  Lewuathe, 2015-10-30 (1 file changed, -7/+47)
  This is the follow-up work of SPARK-10668.
  * Fix minor style issues.
  * Add a test case for checking whether the solver is selected properly.
  Author: Lewuathe <lewuathe@me.com>
  Author: lewuathe <lewuathe@me.com>
  Closes #9180 from Lewuathe/SPARK-11207.
* [SPARK-11369][ML][R] SparkR glm should support setting standardize
  Yanbo Liang, 2015-10-28 (1 file changed, -0/+3)
  SparkR glm currently supports: ```formula, family = c("gaussian", "binomial"), data, lambda = 0, alpha = 0```. We should also support setting standardize, which has been defined in the [design documentation](https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit).
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9331 from yanboliang/spark-11369.
* [SPARK-11332][ML] Refactored to use ml.feature.Instance instead of WeightedLeastSquare.Instance
  Nakul Jindal, 2015-10-28 (2 files changed, -19/+10)
  WeightedLeastSquares now uses the common Instance class in ml.feature instead of a private one.
  Author: Nakul Jindal <njindal@us.ibm.com>
  Closes #9325 from nakul02/SPARK-11332_refactor_WeightedLeastSquares_dot_Instance.
* [MINOR][ML] fix compile warns
  Xiangrui Meng, 2015-10-27 (1 file changed, -1/+1)
  This fixes some compile-time warnings.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9319 from mengxr/mllib-compile-warn-20151027.
* [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases
  Sean Owen, 2015-10-27 (1 file changed, -4/+4)
  Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix the related Python mixture model test. Supersedes https://github.com/apache/spark/pull/9293
  Author: Sean Owen <sowen@cloudera.com>
  Closes #9309 from srowen/SPARK-11302.2.
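For context on what "root-sigma-inverse" refers to: the density being evaluated and, under the standard eigendecomposition approach (my reading of the implementation, so treat this as a sketch), the factor the code precomputes:

```latex
% Multivariate Gaussian density over x in R^k.
% With the eigendecomposition Sigma = U D U^T, the precomputed
% "root sigma inverse" R = D^{-1/2} U^T satisfies R^T R = Sigma^{-1}, so the
% quadratic form reduces to a squared norm:
\mathrm{pdf}(x) = (2\pi)^{-k/2} \det(\Sigma)^{-1/2}
  \exp\!\Big(-\tfrac{1}{2}\,\big\lVert D^{-1/2} U^{\top}(x-\mu)\big\rVert^{2}\Big)
```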
* [SPARK-10654][MLLIB] Add columnSimilarities to IndexedRowMatrix
  Reza Zadeh, 2015-10-26 (1 file changed, -0/+13)
  Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix. With a test.
  Author: Reza Zadeh <reza@databricks.com>
  Closes #8792 from rezazadeh/colsims.
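A small sketch of the delegated call (row indices and values are illustrative):

```scala
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}

val rows = sc.parallelize(Seq(
  IndexedRow(0L, Vectors.dense(1.0, 0.0)),
  IndexedRow(3L, Vectors.dense(0.0, 2.0))))
val mat = new IndexedRowMatrix(rows)

// Delegates to RowMatrix.columnSimilarities; yields pairwise cosine
// similarities of columns as a CoordinateMatrix.
val sims = mat.columnSimilarities()
sims.entries.collect().foreach(println)
```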
* [SPARK-11184][MLLIB] Declare most of .mllib code not-Experimental
  Sean Owen, 2015-10-26 (45 files changed, -208/+43)
  Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #9169 from srowen/SPARK-11184.
* [SPARK-6723][MLLIB] Model import/export for ChiSqSelector
  Jayant Shekar, 2015-10-23 (1 file changed, -1/+69)
  This is a PR for Parquet-based model import/export.
  * Added save/load for ChiSqSelectorModel
  * Updated the test suite ChiSqSelectorSuite
  Author: Jayant Shekar <jayant@user-MBPMBA-3.local>
  Closes #6785 from jayantshekhar/SPARK-6723.
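A hedged sketch of the spark.mllib-style persistence this adds (the training RDD `labeledPoints` and the path are assumed):

```scala
import org.apache.spark.mllib.feature.{ChiSqSelector, ChiSqSelectorModel}

// labeledPoints: RDD[LabeledPoint] of categorical features (assumed to exist).
val model = new ChiSqSelector(50).fit(labeledPoints)

// Parquet-backed save/load takes the SparkContext plus a path.
model.save(sc, "/tmp/chisq-model")            // illustrative path
val sameModel = ChiSqSelectorModel.load(sc, "/tmp/chisq-model")
```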
* [SPARK-11273][SQL] Move ArrayData/MapData/DataTypeParser to catalyst.util package
  Reynold Xin, 2015-10-23 (2 files changed, -0/+2)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #9239 from rxin/types-private.
* [SPARK-10082][MLLIB] minor style updates for matrix indexing after #8271
  Xiangrui Meng, 2015-10-20 (1 file changed, -4/+4)
  * `>=0` => `>= 0`
  * print `i`, `j` in the log message
  MechCoder
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9189 from mengxr/SPARK-10082.
* [SPARK-10082][MLLIB] Validate i, j in apply DenseMatrices and SparseMatrices
  MechCoder, 2015-10-20 (1 file changed, -0/+4)
  The given row index should be less than the number of rows, and the given column index less than the number of columns; the current code in master gives unpredictable behavior for such cases.
  Author: MechCoder <manojkumarsivaraj334@gmail.com>
  Closes #8271 from MechCoder/hash_code_matrices.
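A hedged sketch of the kind of bounds check this describes, for a column-major dense layout (exact messages and structure are illustrative, not the committed code):

```scala
// Validate indices before reading from the backing array, instead of
// letting an out-of-range (i, j) silently read the wrong element.
def apply(values: Array[Double], numRows: Int, numCols: Int, i: Int, j: Int): Double = {
  require(i >= 0 && i < numRows, s"Expected 0 <= i < $numRows, got i = $i.")
  require(j >= 0 && j < numCols, s"Expected 0 <= j < $numCols, got j = $j.")
  values(i + j * numRows)   // column-major indexing
}
```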
* [SPARK-10261][DOCUMENTATION, ML] Fixed @Since annotation to ml.evaluation
  Tijo Thomas, 2015-10-20 (4 files changed, -7/+43)
  Author: Tijo Thomas <tijoparacka@gmail.com>
  Author: tijo <tijo@ezzoft.com>
  Closes #8554 from tijoparacka/SPARK-10261-2.
* [SPARK-10668][ML] Use WeightedLeastSquares in LinearRegression with L2 regularization if the number of features is small
  lewuathe, 2015-10-19 (4 files changed, -5/+70)
  Author: lewuathe <lewuathe@me.com>
  Author: Lewuathe <sasaki@treasure-data.com>
  Author: Kai Sasaki <sasaki@treasure-data.com>
  Author: Lewuathe <lewuathe@me.com>
  Closes #8884 from Lewuathe/SPARK-10668.
* [SPARK-9963][ML] RandomForest cleanup: replace predictNodeIndex with predictImpl
  Luvsandondov Lkhamsuren, 2015-10-17 (2 files changed, -43/+38)
  predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl.
  Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com>
  Closes #8609 from lkhamsurenl/SPARK-9963.
* [SPARK-11029][ML] Add computeCost to KMeansModel in spark.ml
  Yuhao Yang, 2015-10-17 (1 file changed, -0/+12)
  jira: https://issues.apache.org/jira/browse/SPARK-11029
  We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel. This will be a temporary fix until we have proper evaluators defined for clustering.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
  Closes #9073 from hhbyyh/computeCost.
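A minimal sketch of the added method in use (k and the column name are illustrative; `dataset` is an assumed DataFrame of feature vectors):

```scala
import org.apache.spark.ml.clustering.KMeans

val model = new KMeans().setK(3).setFeaturesCol("features").fit(dataset)

// Within-set sum of squared distances to the nearest center,
// mirroring spark.mllib.clustering.KMeansModel.computeCost.
val wssse = model.computeCost(dataset)
println(s"Within Set Sum of Squared Errors = $wssse")
```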
* [SPARK-10599][MLLIB] Lower communication for block matrix multiplication
  Burak Yavuz, 2015-10-16 (1 file changed, -22/+58)
  This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
  - Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
  - Send each block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
  **NOTE**: One important note is that right now, the old behavior of checking for multiple blocks with the same index is lost. This is not hard to add, but is a little more expensive than how it was. Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
  Size A: 1e5 x 1e5
  Size B: 1e5 x 1e5
  Block Sizes: 1024 x 1024
  Sparsity: 0.01
  Old implementation: 1m 13s
  New implementation: 9s
  cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?), and the old implementation didn't even run, but the new implementation completed in 268s in a 120 GB / 16 core cluster.
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #8757 from brkyvz/opt-bmm.
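For orientation, a sketch of the call path these optimizations sit behind (`coordMatA`/`coordMatB` are assumed CoordinateMatrix inputs; the block size matches the benchmark above):

```scala
import org.apache.spark.mllib.linalg.distributed.BlockMatrix

// Convert to 1024 x 1024 blocks and cache before multiplying.
val a: BlockMatrix = coordMatA.toBlockMatrix(1024, 1024).cache()
val b: BlockMatrix = coordMatB.toBlockMatrix(1024, 1024).cache()

// multiply is where the shuffle planning described above happens.
val c: BlockMatrix = a.multiply(b)
```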