path: root/mllib
Commit message | Author | Date | Files | Lines
* [SPARK-12026][MLLIB] ChiSqTest gets slower and slower over time when number of features is large (Yuhao Yang, 2016-01-13; 1 file, -2/+4)
    jira: https://issues.apache.org/jira/browse/SPARK-12026
    The issue is valid: features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol grows. I tested locally; the change improves performance and the running time stays stable.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #10146 from hhbyyh/chiSq.
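A minimal sketch of the failure mode (illustrative only, not the actual patch; the view behavior shown is that of Scala 2.11, which Spark used at the time): slicing a zipped view is not random-access, so every traversal walks past the first startCol elements, while direct indexing touches only the requested range.

```scala
object SliceCost {
  def main(args: Array[String]): Unit = {
    val features: Array[Double] = Array.fill(1000000)(scala.util.Random.nextDouble())

    // The zipped view is not indexed, so slice(startCol, endCol) has to
    // discard the first startCol elements on every traversal: O(endCol).
    def sliceView(startCol: Int, endCol: Int): Seq[(Double, Int)] =
      features.view.zipWithIndex.slice(startCol, endCol).toSeq

    // Direct indexing touches only the requested range: O(endCol - startCol).
    def sliceIndexed(startCol: Int, endCol: Int): Seq[(Double, Int)] =
      (startCol until endCol).map(i => (features(i), i))

    println(sliceView(999000, 999010) == sliceIndexed(999000, 999010)) // true
  }
}
```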
* [SPARK-7615][MLLIB] MLLIB Word2Vec wordVectors divided by Euclidean Norm equals to zero (Sean Owen, 2016-01-12; 1 file, -1/+6)
    Cosine similarity with a zero vector should be 0. Related to https://github.com/apache/spark/pull/10152
    Author: Sean Owen <sowen@cloudera.com>
    Closes #10696 from srowen/SPARK-7615.
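A minimal sketch of the guarded computation (an assumed shape, not the exact Spark code): when either vector has zero Euclidean norm, return 0 rather than dividing by zero and producing NaN or Infinity.

```scala
object CosineSim {
  def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
    require(a.length == b.length, "vectors must have the same length")
    val dot = a.zip(b).map { case (x, y) => x * y }.sum
    val normA = math.sqrt(a.map(x => x * x).sum)
    val normB = math.sqrt(b.map(x => x * x).sum)
    // Guard: similarity against a zero vector is defined as 0.
    if (normA == 0.0 || normB == 0.0) 0.0 else dot / (normA * normB)
  }

  def main(args: Array[String]): Unit = {
    println(cosineSimilarity(Array(1.0, 0.0), Array(0.0, 0.0))) // 0.0, not NaN
  }
}
```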
* [SPARK-10809][MLLIB] Single-document topicDistributions method for LocalLDAModel (Yuhao Yang, 2016-01-11; 2 files, -3/+38)
    jira: https://issues.apache.org/jira/browse/SPARK-10809
    We could provide a single-document topicDistributions method for LocalLDAModel to allow for quick queries that avoid RDD operations. Currently, the user must use an RDD of documents. Also adds some missing asserts.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #9484 from hhbyyh/ldaTopicPre.
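A usage sketch (the single-document method name, topicDistribution, is assumed from the PR description; the trained model is assumed to be passed in): score one document without building an RDD.

```scala
import org.apache.spark.mllib.clustering.LocalLDAModel
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// localLdaModel: an already-trained LocalLDAModel (supplied by the caller).
def scoreOne(localLdaModel: LocalLDAModel, vocabSize: Int): Vector = {
  // One document as term counts: word 0 appears twice, word 5 once.
  val doc: Vector = Vectors.sparse(vocabSize, Seq((0, 2.0), (5, 1.0)))
  localLdaModel.topicDistribution(doc) // per-topic proportions for this doc
}
```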
* [SPARK-12685][MLLIB] word2vec trainWordsCount overflows (Yuhao Yang, 2016-01-11; 1 file, -4/+4)
    jira: https://issues.apache.org/jira/browse/SPARK-12685
    The log of word2vec reports trainWordsCount = -785727483 during computation over a large dataset. This matters for the computation process because the learning rate is derived from it: alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #10627 from hhbyyh/w2voverflow.
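A minimal sketch of the overflow (assuming the counter was a 32-bit Int, as the negative value above suggests): summing word counts over a large corpus wraps past Int.MaxValue and goes negative, which then corrupts the alpha schedule.

```scala
object CountOverflow {
  def main(args: Array[String]): Unit = {
    var countInt: Int = Int.MaxValue
    countInt += 1
    println(countInt)   // -2147483648: wrapped negative

    var countLong: Long = Int.MaxValue.toLong
    countLong += 1
    println(countLong)  // 2147483648: correct with a Long counter

    // With a negative trainWordsCount the ratio flips sign, so alpha grows
    // instead of decaying toward zero as training progresses.
    val learningRate = 0.025
    def alpha(numPartitions: Int, wordCount: Long, trainWordsCount: Long): Double =
      learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))
    println(alpha(4, 1000000L, -785727483L)) // larger than learningRate
  }
}
```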
* [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft (Yanbo Liang, 2016-01-11; 2 files, -1/+5)
    PySpark MLlib GaussianMixtureModel should support single-instance predict/predictSoft just like the Scala API does.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10552 from yanboliang/spark-12603.
* [SPARK-3873][BUILD] Enable import ordering error checking. (Marcelo Vanzin, 2016-01-10; 5 files, -8/+7)
    Turn import ordering violations into build errors, plus a few adjustments to account for how the checker behaves. I'm a little on the fence about whether the existing code is right, but it's easier to appease the checker than to discuss what's the more correct order here. Plus a few fixes to imports that crept in since my recent cleanups.
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #10612 from vanzin/SPARK-3873-enable.
* [SPARK-12692][BUILD][MLLIB] Scala style: Fix the style violation (space before "," or ":") (Kousuke Saruta, 2016-01-10; 14 files, -19/+19)
    Fix the style violation (space before "," and ":"). This PR is a follow-up to #10643.
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    Closes #10684 from sarutak/SPARK-12692-followup-mllib.
* [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition (Sean Owen, 2016-01-08; 3 files, -15/+15)
    Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #10570 from srowen/SPARK-12618.
* [SPARK-12663][MLLIB] More informative error message in MLUtils.loadLibSVMFile (Robert Dodier, 2016-01-06; 1 file, -1/+2)
    This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663). For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)
    Author: Robert Dodier <robert_dodier@users.sourceforge.net>
    Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
* [SPARK-12368][ML][DOC] Better doc for the binary classification evaluator's metricName (BenFradet, 2016-01-06; 1 file, -2/+1)
    For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC". Also, the documentation says: "The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators." However, the method is called setMetricName. This PR aims to fix both issues.
    Author: BenFradet <benjamin.fradet@gmail.com>
    Closes #10328 from BenFradet/SPARK-12368.
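A usage sketch of the documented behavior (column names are placeholders): both metrics are supported, and the setter is setMetricName, not setMetric.

```scala
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator

val evaluator = new BinaryClassificationEvaluator()
  .setLabelCol("label")                  // placeholder column name
  .setRawPredictionCol("rawPrediction")  // placeholder column name
  .setMetricName("areaUnderPR")          // default is "areaUnderROC"
// val auPR = evaluator.evaluate(predictions) // predictions: a scored DataFrame
```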
* [SPARK-3873][TESTS] Import ordering fixes. (Marcelo Vanzin, 2016-01-05; 47 files, -62/+58)
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #10582 from vanzin/SPARK-3873-tests.
* [SPARK-12450][MLLIB] Un-persist broadcasted variables in KMeans (RJ Nowling, 2016-01-05; 1 file, -0/+8)
    Un-persist broadcasted variables in KMeans.
    Author: RJ Nowling <rnowling@gmail.com>
    Closes #10415 from rnowling/spark-12450.
* [SPARK-6724][MLLIB] Support model save/load for FPGrowthModel (Yanbo Liang, 2016-01-05; 3 files, -3/+205)
    Support model save/load for FPGrowthModel.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #9267 from yanboliang/spark-6724.
* [SPARK-12331][ML] R^2 for regression through the origin. (Imran Younus, 2016-01-05; 3 files, -71/+112)
    Modified the definition of R^2 for regression through the origin. Added a modified test for regression metrics.
    Author: Imran Younus <iyounus@us.ibm.com>
    Author: Imran Younus <imranyounus@gmail.com>
    Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
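For reference, the standard statistical definitions this change points at (taken from the usual literature, not copied from the patch): without an intercept the mean-centered variance decomposition no longer holds, so the total sum of squares is computed about zero rather than about the mean.

```latex
% Ordinary R^2 (model with intercept): SS_tot is centered on \bar{y}.
R^2 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i}(y_i - \bar{y})^2}

% Regression through the origin: use the uncentered total sum of squares.
R^2_0 = 1 - \frac{\sum_{i}(y_i - \hat{y}_i)^2}{\sum_{i} y_i^2}
```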
* [SPARK-9622][ML] DecisionTreeRegressor: provide variance of prediction (Yanbo Liang, 2016-01-04; 5 files, -4/+92)
    DecisionTreeRegressor will provide variance of prediction as a Double column.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #8866 from yanboliang/spark-9622.
* [SPARK-11259][ML] Params.validateParams() should be called automatically (Yanbo Liang, 2016-01-04; 30 files, -1/+63)
    See JIRA: https://issues.apache.org/jira/browse/SPARK-11259
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #9224 from yanboliang/spark-11259.
* [SPARK-12599][MLLIB][SQL] Remove the use of callUDF in MLlib (Reynold Xin, 2016-01-02; 1 file, -2/+2)
    callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that, and replaces the invocations of the deprecated callUDF with it.
    Author: Reynold Xin <rxin@databricks.com>
    Closes #10547 from rxin/SPARK-12599.
* [SPARK-3873][MLLIB] Import order fixes. (Marcelo Vanzin, 2015-12-31; 94 files, -167/+158)
    A slight adjustment to the checker configuration was needed; a handful of warnings are still left, but those are because of a bug in the checker that I'll fix separately (before enabling errors for the checker, of course).
    Author: Marcelo Vanzin <vanzin@cloudera.com>
    Closes #10535 from vanzin/SPARK-3873-mllib.
* [SPARK-12349][ML] Fix typo in Spark version regex introduced in PR 10327 (Sean Owen, 2015-12-29; 1 file, -1/+1)
    Sorry jkbradley. Ref: https://github.com/apache/spark/pull/10327#discussion_r48502942
    Author: Sean Owen <sowen@cloudera.com>
    Closes #10508 from srowen/SPARK-12349.2.
* [SPARK-12489][CORE][SQL][MLLIB] Fix minor issues found by FindBugs (Shixiong Zhu, 2015-12-28; 1 file, -2/+2)
    Include the following changes:
    1. Close java.sql.Statement.
    2. Fix incorrect asInstanceOf.
    3. Remove unnecessary synchronized and ReentrantLock.
    Author: Shixiong Zhu <shixiong@databricks.com>
    Closes #10440 from zsxwing/findbugs.
* [SPARK-12424][ML] The implementation of ParamMap#filter is wrong. (Kousuke Saruta, 2015-12-29; 2 files, -2/+34)
    ParamMap#filter uses mutable.Map#filterKeys. The return type of filterKeys is collection.Map, not mutable.Map, but the result is cast to mutable.Map using asInstanceOf, so we get a ClassCastException. Also, the value returned by Map#filterKeys is not Serializable; that is a known Scala issue (https://issues.scala-lang.org/browse/SI-6654).
    Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
    Closes #10381 from sarutak/SPARK-12424.
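A minimal sketch of the failure mode (an illustrative Scala 2.11/2.12 snippet, not the Spark source): filterKeys returns a read-only view, so the cast blows up at runtime; rebuilding a fresh mutable.Map from the entries is one safe fix.

```scala
import scala.collection.mutable

object FilterKeysPitfall {
  def main(args: Array[String]): Unit = {
    val m: mutable.Map[String, Int] = mutable.Map("a" -> 1, "b" -> 2)

    // filterKeys returns a lazy collection.Map view, not a mutable.Map.
    val filtered: scala.collection.Map[String, Int] = m.filterKeys(_ == "a")

    // Throws ClassCastException if uncommented: the view is not a mutable.Map.
    // val bad = filtered.asInstanceOf[mutable.Map[String, Int]]

    // Safe: materialize the entries into a real (and serializable) mutable.Map.
    val good: mutable.Map[String, Int] = mutable.Map(filtered.toSeq: _*)
    println(good) // Map(a -> 1)
  }
}
```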
* [SPARK-12311][CORE] Restore previous value of "os.arch" property in test suites after forcing a specific value for the "os.arch" property (Kazuaki Ishizaki, 2015-12-24; 4 files, -14/+26)
    Restore the original value of the os.arch property after each test. Since some tests force a specific value for the os.arch property, we need to restore the original value afterwards.
    Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
    Closes #10289 from kiszk/SPARK-12311.
* [SPARK-12349][ML] Make spark.ml PCAModel load backwards compatible (Sean Owen, 2015-12-21; 1 file, -5/+28)
    Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x. jkbradley, is this kind of what you had in mind?
    Author: Sean Owen <sowen@cloudera.com>
    Closes #10327 from srowen/SPARK-12349.
* [SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDs (Bryan Cutler, 2015-12-20; 1 file, -1/+11)
    Added a catch for the casting exception raised when PySpark ALS Ratings with Long IDs are serialized. It is easy to accidentally use Long IDs for user/product, and previously this failed with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."
    Author: Bryan Cutler <bjcutler@us.ibm.com>
    Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
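A sketch of the kind of guard described (a hypothetical helper; the real check lives in the Python-to-JVM Ratings deserializer and raises a PickleException): validate the ID range up front and fail with a message that names the offending value.

```scala
object RatingIdCheck {
  // Hypothetical helper mirroring the message quoted above.
  def checkRatingId(id: Long): Int = {
    if (id > Int.MaxValue || id < Int.MinValue) {
      throw new IllegalArgumentException(
        s"Ratings id $id exceeds max integer value of ${Int.MaxValue}")
    }
    id.toInt
  }

  def main(args: Array[String]): Unit = {
    println(checkRatingId(42L))         // fine
    checkRatingId(1205640308657491975L) // fails with a clear message
  }
}
```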
* Bump master version to 2.0.0-SNAPSHOT. (Reynold Xin, 2015-12-19; 1 file, -1/+1)
    Author: Reynold Xin <rxin@databricks.com>
    Closes #10387 from rxin/version-bump.
* [SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml test suites (Yanbo Liang, 2015-12-16; 5 files, -11/+5)
    Use the sqlContext from MLlibTestSparkContext rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need updating. cc mengxr jkbradley
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10279 from yanboliang/spark-12309.
* [SPARK-9694][ML] Add random seed Param to Scala CrossValidator (Yanbo Liang, 2015-12-16; 2 files, -3/+16)
    Add random seed Param to Scala CrossValidator.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #9108 from yanboliang/spark-9694.
* [SPARK-12016][MLLIB][PYSPARK] Wrap Word2VecModel when loading it in pyspark (Liang-Chi Hsieh, 2015-12-14; 2 files, -33/+62)
    JIRA: https://issues.apache.org/jira/browse/SPARK-12016
    We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.
    Author: Liang-Chi Hsieh <viirya@appier.com>
    Closes #10100 from viirya/fix-load-py-wordvecmodel.
* [SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure Issue (Mike Dusenberry, 2015-12-11; 1 file, -1/+1)
    As noted in PR #9441, implementing tallSkinnyQR uncovered a bug with our PySpark RowMatrix constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a RowMatrix from an RDD[Vector] in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the Vector type is erased, resulting in an RDD[Object]. Thus, when calling Scala's tallSkinnyQR from PySpark, we get a Java ClassCastException in which an Object cannot be cast to a Spark Vector.
    As noted in the aforementioned dev list thread, this issue was also encountered with DecisionTrees, and the fix involved an explicit retag of the RDD with a Vector type. IndexedRowMatrix and CoordinateMatrix do not appear to have this issue, likely because their related helper functions in PythonMLlibAPI create the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.
    This PR currently contains that retagging fix applied to the createRowMatrix helper function in PythonMLlibAPI. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk
    Author: Mike Dusenberry <mwdusenb@us.ibm.com>
    Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
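A sketch of the retag fix described above (RDD.retag is a Spark-internal helper, so this only compiles inside Spark's own source tree, as PythonMLlibAPI does; the helper's shape here is assumed from the commit description): re-attach the erased element type before constructing the RowMatrix.

```scala
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// The RDD arrives from Python/Java with its element type erased to Object;
// retag re-attaches the Vector class tag so Scala-side casts succeed.
def createRowMatrix(rows: JavaRDD[Vector], numRows: Long, numCols: Int): RowMatrix =
  new RowMatrix(rows.rdd.retag(classOf[Vector]), numRows, numCols)
```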
* [SPARK-10991][ML] logistic regression training summary handle empty prediction col (Holden Karau, 2015-12-11; 2 files, -2/+29)
    The LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718).
    Author: Holden Karau <holden@pigscanfly.ca>
    Author: Holden Karau <holden@us.ibm.com>
    Closes #9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.
* [SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit (Yuhao Yang, 2015-12-10; 4 files, -5/+5)
    jira: https://issues.apache.org/jira/browse/SPARK-11602
    Made a pass on the API changes of 1.6. Opening the PR for efficient discussion.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #9939 from hhbyyh/auditScala.
* [SPARK-11530][MLLIB] Return eigenvalues with PCA model (Sean Owen, 2015-12-10; 6 files, -25/+64)
    Add computePrincipalComponentsAndVariance to also compute PCA's explained variance. CC mengxr
    Author: Sean Owen <sowen@cloudera.com>
    Closes #9736 from srowen/SPARK-11530.
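A usage sketch (the method name is taken from the commit message above; the final released API name and return shape may differ): obtain the principal components together with the proportion of variance each one explains.

```scala
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// mat: an existing RowMatrix of feature vectors (supplied by the caller).
def pcaWithVariance(mat: RowMatrix, k: Int) = {
  val (pc, explainedVariance) = mat.computePrincipalComponentsAndVariance(k)
  // pc: n x k matrix of components; explainedVariance: k variance proportions.
  (pc, explainedVariance)
}
```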
* [SPARK-10299][ML] word2vec should allow users to specify the window size (Holden Karau, 2015-12-09; 3 files, -4/+65)
    Currently word2vec has the window hard-coded at 5; some users may want different sizes (for example when using n-gram input or similar). User request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size .
    Author: Holden Karau <holden@us.ibm.com>
    Author: Holden Karau <holden@pigscanfly.ca>
    Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.
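A usage sketch of the new parameter on the spark.ml Word2Vec estimator (column names are placeholders): override the previously hard-coded window of 5.

```scala
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("tokens")   // placeholder column name
  .setOutputCol("vectors") // placeholder column name
  .setVectorSize(100)
  .setWindowSize(10)       // previously fixed at 5
```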
* [SPARK-11343][ML] Documentation of float and double prediction/label columns in RegressionEvaluator (Dominik Dahlem, 2015-12-08; 1 file, -2/+7)
    felixcheung, mengxr: just added a message to require().
    Author: Dominik Dahlem <dominik.dahlem@gmail.com>
    Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.
* [SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs (Yuhao Yang, 2015-12-08; 4 files, -25/+94)
    jira: https://issues.apache.org/jira/browse/SPARK-11605
    Check Java compatibility for MLlib for this release. Fixes:
    1. StreamingTest.registerStream needs a Java-friendly interface.
    2. GradientBoostedTreesModel.computeInitialPredictionAndError and GradientBoostedTreesModel.updatePredictionError have Java compatibility issues. Mark them as DeveloperApi.
    TBD: [updated] no fix for now per discussion. org.apache.spark.mllib.classification.LogisticRegressionModel's public scala.Option<java.lang.Object> getThreshold() has the wrong return type for Java invocation, and SVMModel has a similar issue. Yet adding a scala.Option<java.lang.Double> getThreshold() would result in an overloading error due to the same function signature, and adding a new function with a different name seems unnecessary. cc jkbradley feynmanliang
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #10102 from hhbyyh/javaAPI.
* [SPARK-11439][ML] Optimization of creating sparse feature without dense one (Nakul Jindal, 2015-12-08; 3 files, -122/+142)
    Sparse features generated in LinearDataGenerator no longer create dense vectors as an intermediate.
    Author: Nakul Jindal <njindal@us.ibm.com>
    Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
* [SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code (Yanbo Liang, 2015-12-07; 1 file, -2/+9)
    Add SQLTransformer user guide and example code, and make the Scala API doc clearer.
    Author: Yanbo Liang <ybliang8@gmail.com>
    Closes #10006 from yanboliang/spark-11958.
* [SPARK-10259][ML] Add @since annotation to ml.classification (Takahashi Hiroshi, 2015-12-07; 7 files, -44/+185)
    Add @since annotations to ml.classification.
    Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
    Closes #8534 from taishi-oss/issue10259.
* [SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib (Joseph K. Bradley, 2015-12-07; 13 files, -29/+29)
    Switched from using the SQLContext constructor to using getOrCreate, mainly in model save/load methods. This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml. CC: mengxr yhuai
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #10161 from jkbradley/mllib-sqlcontext-fix.
* [SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7 (Sean Owen, 2015-12-05; 5 files, -63/+58)
    Update JPMML pmml-model to 1.2.7.
    Author: Sean Owen <sowen@cloudera.com>
    Closes #9972 from srowen/SPARK-11988.
* [SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when model is bigger than spark.kryoserializer.buffer.max (Antonio Murgia, 2015-12-05; 2 files, -4/+31)
    Author: Antonio Murgia <antonio.murgia2@studio.unibo.it>
    Closes #9989 from tmnd1991/SPARK-11932.
* [SPARK-12096][MLLIB] remove the old constraint in word2vec (Yuhao Yang, 2015-12-05; 1 file, -2/+2)
    jira: https://issues.apache.org/jira/browse/SPARK-12096
    word2vec can now handle a much bigger vocabulary. The old constraint, vocabSize.toLong * vectorSize < Int.MaxValue / 8, should be removed; the new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue). I tested with a vocabSize over 18M and vectorSize = 100. srowen jkbradley, sorry to miss this in the last PR; I was reminded today.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #10103 from hhbyyh/w2vCapacity.
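A sketch of the relaxed check described above (illustrative; the exact bound and message in the patch may differ): JVM arrays hold a little under Int.MaxValue elements, and the flat weight arrays have vocabSize * vectorSize entries.

```scala
object Word2VecCapacityCheck {
  // Common JVM headroom below Int.MaxValue for array lengths (assumed value).
  private val MaxArraySize = Int.MaxValue - 8

  def checkCapacity(vocabSize: Int, vectorSize: Int): Unit =
    require(vocabSize.toLong * vectorSize < MaxArraySize,
      s"vocabSize ($vocabSize) * vectorSize ($vectorSize) must be smaller " +
      s"than the max array length ($MaxArraySize)")

  def main(args: Array[String]): Unit = {
    checkCapacity(18000000, 100)   // ~1.8e9 < Int.MaxValue: passes
    // checkCapacity(30000000, 100) // would fail: 3e9 exceeds the array limit
  }
}
```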
* [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 (Josh Rosen, 2015-12-05; 5 files, -14/+16)
    We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
    Author: Josh Rosen <joshrosen@databricks.com>
    Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
* [SPARK-6990][BUILD] Add Java linting script; fix minor warnings (Dmitry Erastov, 2015-12-04; 2 files, -4/+4)
    This replaces https://github.com/apache/spark/pull/9696.
    Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide. Some important checks are disabled (see TODOs in checkstyle.xml) due to multiple violations being present in the codebase. Suggest fixing those TODOs in a separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/).
    Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)), duplicated because I run the build twice with different profiles:
    > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
    > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
    > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause.
    > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions.
    > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1
    Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. cr: JoshRosen, pwendell
    > I state that the contribution is my original work, and I license the work to the project under the project's open source license.
    Author: Dmitry Erastov <derastov@gmail.com>
    Closes #9867 from dskrvk/master.
* [SPARK-12000] do not specify arg types when referencing a method in ScalaDoc (Xiangrui Meng, 2015-12-02; 2 files, -3/+3)
    This fixes SPARK-12000, verified on my local machine with JDK 7. It seems that scaladoc tries to match method names and gets confused by annotations. cc: JoshRosen jkbradley
    Author: Xiangrui Meng <meng@databricks.com>
    Closes #10114 from mengxr/SPARK-12000.2.
* [SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tuning (Yu ISHIKAWA, 2015-12-02; 3 files, -16/+58)
    cc mengxr noel-smith
    I worked on this issue based on https://github.com/apache/spark/pull/8729. ehsanmok, thank you for your contribution!
    Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
    Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
    Closes #9338 from yu-iskw/JIRA-10266.
* [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues (Cheng Lian, 2015-12-01; 1 file, -5/+7)
    This PR backports PR #10039 to master.
    Author: Cheng Lian <lian@databricks.com>
    Closes #10063 from liancheng/spark-12046.doc-fix.master.
* [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec (Yuhao Yang, 2015-12-01; 1 file, -1/+6)
    jira: https://issues.apache.org/jira/browse/SPARK-11898
    syn0Global and syn1Global in word2vec are quite large objects (vocab * vectorSize * 8 bytes), yet they are shipped to workers using plain task serialization. Using broadcast greatly improves performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast: 1. decreases worker memory consumption by 45%; 2. decreases running time by 40%. This will also help extend the upper limit for Word2Vec.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #9878 from hhbyyh/w2vBC.
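A minimal, self-contained sketch of the pattern (stand-in arrays and a placeholder computation, not the Word2Vec source): broadcast the shared tables once instead of capturing them in every task closure, then release them when done.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastTables {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("w2v-bc").setMaster("local[*]"))

    // Stand-ins for the real tables (vocabSize * vectorSize floats each).
    val syn0Global = Array.fill(1000 * 100)(0.1f)
    val syn1Global = Array.fill(1000 * 100)(0.1f)

    // Ship each table to executors once, instead of once per task closure.
    val bcSyn0 = sc.broadcast(syn0Global)
    val bcSyn1 = sc.broadcast(syn1Global)

    val sentences = sc.parallelize(Seq(Array(1, 5, 9), Array(2, 3)))
    val partial = sentences.mapPartitions { iter =>
      val syn0 = bcSyn0.value // one fetch per executor
      val syn1 = bcSyn1.value
      iter.map(words => words.map(w => syn0(w) + syn1(w)).sum) // placeholder work
    }
    println(partial.collect().mkString(", "))

    // Release executor-side copies once training is finished.
    bcSyn0.destroy()
    bcSyn1.destroy()
    sc.stop()
  }
}
```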
* [SPARK-11847][ML] Model export/import for spark.ml: LDA (Yuhao Yang, 2015-11-24; 3 files, -8/+150)
    Add read/write support to LDA, similar to ALS. save/load for ml.LocalLDAModel is done. For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send an update after some testing.
    Author: Yuhao Yang <hhbyyh@gmail.com>
    Closes #9894 from hhbyyh/ldaMLsave.
* [SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col (Joseph K. Bradley, 2015-11-24; 2 files, -0/+33)
    Documents for 1.6 that the summaries mostly ignore the weight column; to be corrected for 1.7. CC: mengxr thunterdb
    Author: Joseph K. Bradley <joseph@databricks.com>
    Closes #9927 from jkbradley/linregsummary-doc.