aboutsummaryrefslogtreecommitdiff
path: root/mllib
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-14164][MLLIB] Improve input layer validation of ↵Dongjoon Hyun2016-03-312-2/+18
| | | | | | | | | | | | | | | | | | | | | | MultilayerPerceptronClassifier ## What changes were proposed in this pull request? This issue improves an input layer validation and adds related testcases to MultilayerPerceptronClassifier. ```scala - // TODO: how to check ALSO that all elements are greater than 0? - ParamValidators.arrayLengthGt(1) + (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1 ``` ## How was this patch tested? Pass the Jenkins tests including the new testcases. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11964 from dongjoon-hyun/SPARK-14164.
* [SPARK-11507][MLLIB] add compact in Matrices fromBreezeYuhao Yang2016-03-302-1/+21
| | | | | | | | | | | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-11507 "In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem." root cause: colPtr.last does NOT always equal to values.length in breeze SCSMatrix, which fails the require in SparseMatrix. easy step to repro: ``` val m1: BM[Double] = new CSCMatrix[Double] (Array (1.0, 1, 1), 3, 3, Array (0, 1, 2, 3), Array (0, 1, 2) ) val m2: BM[Double] = new CSCMatrix[Double] (Array (1.0, 2, 2, 4), 3, 3, Array (0, 0, 2, 4), Array (1, 2, 1, 2) ) val sum = m1 + m2 Matrices.fromBreeze(sum) ``` Solution: By checking the code in [CSCMatrix](https://github.com/scalanlp/breeze/blob/28000a7b901bc3cfbbbf5c0bce1d0a5dda8281b0/math/src/main/scala/breeze/linalg/CSCMatrix.scala), CSCMatrix in breeze can have extra zeros in the end of data array. Invoking compact will make sure it aligns with the require of SparseMatrix. This should add limited overhead as the actual compact operation is only performed when necessary. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9520 from hhbyyh/matricesFromBreeze.
* [MINOR][ML] Fix the wrong param name of LDA topicDistributionColYanbo Liang2016-03-301-1/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix the wrong param name of LDA ```topicDistributionCol```. ## How was this patch tested? No tests. cc jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #12065 from yanboliang/lda-topicDistributionCol.
* [SPARK-14181] TrainValidationSplit should have HasSeedXusen Yin2016-03-302-5/+14
| | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-14181 TrainValidationSplit should have HasSeed for the random split of RDD. I also changed the random split from the RDD function to the DataFrame function. Author: Xusen Yin <yinxusen@gmail.com> Closes #11985 from yinxusen/SPARK-14181.
* [SPARK-14154][MLLIB] Simplify the implementation for Kolmogorov–Smirnov testYuhao Yang2016-03-291-73/+4
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-14154 I just read the code for KolmogorovSmirnovTest and find it could be much simplified following the original definition. Send a PR for discussion ## How was this patch tested? unit test Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11954 from hhbyyh/ksoptimize.
* [SPARK-13963][ML] Adding binary toggle param to HashingTFBryan Cutler2016-03-294-5/+69
| | | | | | | | | | | | ## What changes were proposed in this pull request? Adding binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF since the former wraps this functionality. This parameter, if true, will set non-zero valued term counts to 1 to transform term count features to binary values that are well suited for discrete probability models. ## How was this patch tested? Added unit tests for ML and MLlib Author: Bryan Cutler <cutlerb@gmail.com> Closes #11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
* [SPARK-11730][ML] Add feature importances for GBTs.sethah2016-03-2812-135/+213
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each. GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf). ## How was this patch tested? Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances. Author: sethah <seth.hendrickson16@gmail.com> Closes #11961 from sethah/SPARK-11730.
* [SPARK-11893] Model export/import for spark.ml: TrainValidationSplitXusen Yin2016-03-285-142/+310
| | | | | | | | | | | | | | | https://issues.apache.org/jira/browse/SPARK-11893 jkbradley In order to share read/write with `TrainValidationSplit`, I move the `SharedReadWrite` out of `CrossValidator` into a new trait `SharedReadWrite` in the tunning package. To reduce the repeated tests, I move the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and create a fake validator called `MyValidator` to test the shared code. With `SharedReadWrite`, potential newly added `Validator` can share the read/write common part, and only need to implement their extra params save/load. Author: Xusen Yin <yinxusen@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #9971 from yinxusen/SPARK-11893.
* [SPARK-14187][MLLIB] Fix incorrect use of binarySearch in SparseMatrixChenliang Xu2016-03-282-1/+5
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix incorrect use of binarySearch in SparseMatrix ## How was this patch tested? Unit test added. Author: Chenliang Xu <chexu@groupon.com> Closes #11992 from luckyrandom/SPARK-14187.
* [SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn ModeSean Owen2016-03-281-0/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Better error message with k-means init can't be enough samples from input (because it is perhaps empty) ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #11979 from srowen/SPARK-12494.
* [SPARK-10691][ML] Make LogisticRegressionModel, LinearRegressionModel ↵Joseph K. Bradley2016-03-272-9/+11
| | | | | | | | | | | | | | | | evaluate() public ## What changes were proposed in this pull request? Made evaluate method public. Fixed LogisticRegressionModel evaluate to handle case when probabilityCol is not specified. ## How was this patch tested? There were already unit tests for these methods. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11928 from jkbradley/public-evaluate.
* [MINOR][MLLIB] Remove TODO comment DecisionTreeModel.scalaDongjoon Hyun2016-03-271-1/+1
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). After [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) is fixed now. Now, we had better remove the comment without changing persistent code. ```scala - categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed + categories: Seq[Double]) { ``` ## How was this patch tested? Pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11966 from dongjoon-hyun/change_categories_type.
* [SPARK-14089][CORE][MLLIB] Remove methods that has been deprecated since ↵Liwei Lin2016-03-266-137/+0
| | | | | | | | | | | | | | | | | | 1.1, 1.2, 1.3, 1.4, and 1.5 ## What changes were proposed in this pull request? Removed methods that has been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5. ## How was this patch tested? - manully checked that no codes in Spark call these methods any more - existing test suits Author: Liwei Lin <lwlin7@gmail.com> Author: proflin <proflin.me@gmail.com> Closes #11910 from lw-lin/remove-deprecates.
* [SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormulaJoseph K. Bradley2016-03-253-13/+22
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? StringIndexerModel.transform sets the output column metadata to use name inputCol. It should not. Fixing this causes a problem with the metadata produced by RFormula. Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes" since attributes collect multiple prefixes from StringIndexer + Interaction. Note that "prefixes" is no longer accurate since internal strings may be replaced. ## How was this patch tested? Unit test which failed before this fix. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11965 from jkbradley/StringIndexer-fix.
* [SPARK-13010][ML][SPARKR] Implement a simple wrapper of ↵Yanbo Liang2016-03-241-0/+99
| | | | | | | | | | | | | | | | | | AFTSurvivalRegression in SparkR ## What changes were proposed in this pull request? This PR continues the work in #11447, we implemented the wrapper of ```AFTSurvivalRegression``` named ```survreg``` in SparkR. ## How was this patch tested? Test against output from R package survival's survreg. cc mengxr felixcheung Close #11447 Author: Yanbo Liang <ybliang8@gmail.com> Closes #11932 from yanboliang/spark-13010-new.
* [SPARK-11871] Add save/load for MLPCXusen Yin2016-03-242-9/+103
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-11871 Add save/load for MLPC ## How was this patch tested? Test with Scala unit test Author: Xusen Yin <yinxusen@gmail.com> Closes #9854 from yinxusen/SPARK-11871.
* [SPARK-14030][MLLIB] Add parameter check to MLLIBRuifeng Zheng2016-03-2413-13/+83
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? add parameter verification to MLLIB, like numCorrections > 0 tolerance >= 0 iters > 0 regParam >= 0 ## How was this patch tested? manual tests Author: Ruifeng Zheng <ruifengz@foxmail.com> Author: Zheng RuiFeng <mllabs@datanode1.(none)> Author: mllabs <mllabs@datanode1.(none)> Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #11852 from zhengruifeng/lbfgs_check.
* Fix typo in ALS.scalaJuarez Bochi2016-03-241-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Just a typo ## How was this patch tested? N/A Author: Juarez Bochi <jbochi@gmail.com> Closes #11896 from jbochi/patch-1.
* [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml oneJoseph K. Bradley2016-03-2313-1778/+538
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Primary change: * Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning. * spark.mllib now calls the spark.ml implementation. * Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed. ml.tree.DecisionTreeModel * Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses. These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation. ml.tree.Node * Added ```private[tree] def deepCopy```, used by unit tests Copied developer comments from spark.mllib implementation to spark.ml one. Moving unit tests * Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite. * Those tests were all moved to spark.ml.tree.impl.RandomForestSuite. The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side. * I made minimal changes to each test to allow it to run. Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values. * No new unit tests were added. * mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in. Those same split calculations were already being tested in other unit tests, for each dataset type. **Changes of behavior** (to be noted in SPARK-13448 once this PR is merged) * spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB. This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit(). Once this PR is merged, I will note the change of behavior on SPARK-13448. * spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats. This does not remove information from the tree, and it will save a bit of storage. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11855 from jkbradley/remove-mllib-tree-impl.
* [SPARK-13952][ML] Add random seed to GBTsethah2016-03-239-39/+66
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? `GBTClassifier` and `GBTRegressor` should use random seed for reproducible results. Because of the nature of current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them as private so as to not change the API unnecessarily. ## How was this patch tested? Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed. Author: sethah <seth.hendrickson16@gmail.com> Closes #11903 from sethah/SPARK-13952.
* [SPARK-14035][MLLIB] Make error message more verbose for mllib NaiveBayesSuiteJoseph K. Bradley2016-03-231-10/+18
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Print more info about failed NaiveBayesSuite tests which have exhibited flakiness. ## How was this patch tested? Ran locally with incorrect check to cause failure. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11858 from jkbradley/naive-bayes-bug-log.
* [SPARK-13449] Naive Bayes wrapper in SparkRXusen Yin2016-03-221-0/+75
| | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli. I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes. I removed the preprocess part that omit NA values because we don't know which columns to process. ## How was this patch tested? Test against output from R package e1071's naiveBayes. cc: yanboliang yinxusen Closes #11486 Author: Xusen Yin <yinxusen@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #11890 from mengxr/SPARK-13449.
* [SPARK-13986][CORE][MLLIB] Remove `DeveloperApi`-annotations for non-publicsDongjoon Hyun2016-03-216-11/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Spark uses `DeveloperApi` annotation, but sometimes it seems to conflict with visibility. This PR tries to fix those conflict by removing annotations for non-publics. The following is the example. **JobResult.scala** ```scala DeveloperApi sealed trait JobResult DeveloperApi case object JobSucceeded extends JobResult -DeveloperApi private[spark] case class JobFailed(exception: Exception) extends JobResult ``` ## How was this patch tested? Pass the existing Jenkins test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11797 from dongjoon-hyun/SPARK-13986.
* [SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle ruleDongjoon Hyun2016-03-213-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) has 100-character limit on lines, but it's disabled for Java since 11/09/15. This PR enables **LineLength** checkstyle again. To help that, this also introduces **RedundantImport** and **RedundantModifier**, too. The following is the diff on `checkstyle.xml`. ```xml - <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places --> - <!-- <module name="LineLength"> <property name="max" value="100"/> <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/> </module> - --> <module name="NoLineWrap"/> <module name="EmptyBlock"> <property name="option" value="TEXT"/> -167,5 +164,7 </module> <module name="CommentsIndentation"/> <module name="UnusedImports"/> + <module name="RedundantImport"/> + <module name="RedundantModifier"/> ``` ## How was this patch tested? Currently, `lint-java` is disabled in Jenkins. It needs a manual test. After passing the Jenkins tests, `dev/lint-java` should passes locally. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11831 from dongjoon-hyun/SPARK-14011.
* [SPARK-12182][ML] Distributed binning for trees in spark.mlsethah2016-03-202-61/+60
| | | | | | | | This PR changes the `findSplits` method in spark.ml to perform split calculations on the workers. This PR is meant to copy [PR-8246](https://github.com/apache/spark/pull/8246) which added the same feature for MLlib. Author: sethah <seth.hendrickson16@gmail.com> Closes #10231 from sethah/SPARK-12182.
* [SPARK-13629][ML] Add binary toggle Param to CountVectorizerYuhao Yang2016-03-181-14/+9
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is a continued work for https://github.com/apache/spark/pull/11536#issuecomment-198511013, containing some comment update and style adjustment. jkbradley ## How was this patch tested? unit tests. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11830 from hhbyyh/cvToggle.
* [MINOR][ML] When trainingSummary is None, it should throw RuntimeException.Yanbo Liang2016-03-184-20/+8
| | | | | | | | | | | | ## What changes were proposed in this pull request? When trainingSummary is None, it should throw ```RuntimeException```. cc mengxr ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11784 from yanboliang/fix-summary.
* [SPARK-10788][MLLIB][ML] Remove duplicate bins for decision treessethah2016-03-179-49/+54
| | | | | | | | | | | | | | | | | | Decision trees in spark.ml (RandomForest.scala) communicate twice as much data as needed for unordered categorical features. Here's an example. Say there are 3 categories A, B, C. We consider 3 splits: * A vs. B, C * A, B vs. C * A, C vs. B Currently, we collect statistics for each of the 6 subsets of categories (3 * 2 = 6). However, we could instead collect statistics for the 3 subsets on the left-hand side of the 3 possible splits: A and A,B and A,C. If we also have stats for the entire node, then we can compute the stats for the 3 subsets on the right-hand side of the splits. In pseudomath: stats(B,C) = stats(A,B,C) - stats(A). This patch adds a parent stats array to the `DTStatsAggregator` so that the right child stats do not need to be stored. The right child stats are computed by subtracting left child stats from the parent stats for unordered categorical features. Author: sethah <seth.hendrickson16@gmail.com> Closes #9474 from sethah/SPARK-10788.
* [SPARK-13761][ML] Remove remaining uses of validateParamsJoseph K. Bradley2016-03-177-61/+27
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Cleanups from [https://github.com/apache/spark/pull/11620]: remove remaining uses of validateParams, and put functionality into transformSchema ## How was this patch tested? Existing unit tests, modified to check using transformSchema instead of validateParams Author: Joseph K. Bradley <joseph@databricks.com> Closes #11790 from jkbradley/SPARK-13761-cleanup.
* [SPARK-11891] Model export/import for RFormula and RFormulaModelXusen Yin2016-03-173-17/+207
| | | | | | | | https://issues.apache.org/jira/browse/SPARK-11891 Author: Xusen Yin <yinxusen@gmail.com> Closes #9884 from yinxusen/SPARK-11891.
* [SPARK-13928] Move org.apache.spark.Logging into ↵Wenchen Fan2016-03-1766-69/+90
| | | | | | | | | | | | | | | | org.apache.spark.internal.Logging ## What changes were proposed in this pull request? Logging was made private in Spark 2.0. If we move it, then users would be able to create a Logging trait themselves to avoid changing their own code. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #11764 from cloud-fan/logger.
* [SPARK-13629][ML] Add binary toggle Param to CountVectorizerYuhao Yang2016-03-172-2/+46
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? It would be handy to add a binary toggle Param to CountVectorizer, as in the scikit-learn one: http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html If set, then all non-zero counts will be set to 1. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11536 from hhbyyh/cvToggle.
* [SPARK-13761][ML] Deprecate validateParamsYuhao Yang2016-03-1633-89/+36
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Deprecate validateParams() method here: https://github.com/apache/spark/blob/035d3acdf3c1be5b309a861d5c5beb803b946b5e/mllib/src/main/scala/org/apache/spark/ml/param/params.scala#L553 Move all functionality in overridden methods to transformSchema(). Check docs to make sure they indicate complex Param interaction checks should be done in transformSchema. ## How was this patch tested? unit tests Author: Yuhao Yang <hhbyyh@gmail.com> Closes #11620 from hhbyyh/depreValid.
* [SPARK-11011][SQL] Narrow type of UDT serializationJakob Odersky2016-03-162-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Narrow down the parameter type of `UserDefinedType#serialize()`. Currently, the parameter type is `Any`, however it would logically make more sense to narrow it down to the type of the actual user defined type. ## How was this patch tested? Existing tests were successfully run on local machine. Author: Jakob Odersky <jakob@odersky.com> Closes #11379 from jodersky/SPARK-11011-udt-types.
* [SPARK-13927][MLLIB] add row/column iterator to local matricesXiangrui Meng2016-03-162-1/+76
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add row/column iterator to local matrices to simplify tasks like BlockMatrix => RowMatrix conversion. It handles dense and sparse matrices properly. ## How was this patch tested? Unit tests on sparse and dense matrix. cc: dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #11757 from mengxr/SPARK-13927.
* [SPARK-11888][ML] Decision tree persistence in spark.mlJoseph K. Bradley2016-03-1623-71/+428
| | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? Made these MLReadable and MLWritable: DecisionTreeClassifier, DecisionTreeClassificationModel, DecisionTreeRegressor, DecisionTreeRegressionModel * The shared implementation is in treeModels.scala * I use case classes to create a DataFrame to save, and I use the Dataset API to parse loaded files. Other changes: * Made CategoricalSplit.numCategories public (to use in persistence) * Fixed a bug in DefaultReadWriteTest.testEstimatorAndModelReadWrite, where it did not call the checkModelData function passed as an argument. This caused an error in LDASuite, which I fixed. ### How was this patch tested? Persistence is tested via unit tests. For each algorithm, there are 2 non-trivial trees (depth 2). One is built with continuous features, and one with categorical; this ensures that both types of splits are tested. Author: Joseph K. Bradley <joseph@databricks.com> Closes #11581 from jkbradley/dt-io.
* [SPARK-13613][ML] Provide ignored tests to export test dataset into CSV formatYanbo Liang2016-03-164-33/+97
| | | | | | | | | | | | ## What changes were proposed in this pull request? Provide ignored test cases to export the test dataset into CSV format in ```LinearRegressionSuite```, ```LogisticRegressionSuite```, ```AFTSurvivalRegressionSuite``` and ```GeneralizedLinearRegressionSuite```, so users can validate the training accuracy compared with R's glm, glmnet and survival package. cc mengxr ## How was this patch tested? The test suite is ignored, but I have enabled all these cases offline and it works as expected. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11463 from yanboliang/spark-13613.
* [SPARK-13894][SQL] SqlContext.range return type from DataFrame to DataSetCheng Hao2016-03-161-1/+1
| | | | | | | | | | | | | ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-13894 Change the return type of the `SQLContext.range` API from `DataFrame` to `Dataset`. ## How was this patch tested? No additional unit test required. Author: Cheng Hao <hao.cheng@intel.com> Closes #11730 from chenghao-intel/range.
* [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset ↵Sean Owen2016-03-161-2/+2
| | | | | | | | | | | | | | | | | | | | follow up ## What changes were proposed in this pull request? Follow up to https://github.com/apache/spark/pull/11657 - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8` - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests) - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11725 from srowen/SPARK-13823.2.
* [SPARK-9837][ML] R-like summary statistics for GLMs via iteratively ↵Yanbo Liang2016-03-153-11/+796
| | | | | | | | | | | | | reweighted least squares ## What changes were proposed in this pull request? Provide R-like summary statistics for GLMs via iteratively reweighted least squares. ## How was this patch tested? unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #11694 from yanboliang/spark-9837.
* [SPARK-12379][ML][MLLIB] Copy GBT implementation to spark.mlsethah2016-03-1512-15/+306
| | | | | | | | | | | Currently, GBTs in spark.ml wrap the implementation in spark.mllib. This is preventing several improvements to GBTs in spark.ml, so we need to move the implementation to ml and use spark.ml decision trees in the implementation. At first, we should make minimal changes to the implementation. Performance testing should be done to ensure there were no regressions. Performance testing results are [here](https://docs.google.com/document/d/1dYd2mnfGdUKkQ3vZe2BpzsTnI5IrpSLQ-NNKDZhUkgw/edit?usp=sharing) Author: sethah <seth.hendrickson16@gmail.com> Closes #10607 from sethah/SPARK-12379.
* [SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed ↵Michael Armbrust2016-03-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | scans of files This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed. Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties: - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat` - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf) - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning. - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm. Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API. A stub for `FileScanRDD` is also added, but most methods remain unimplemented. Other minor cleanups: - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore) - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out. - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes. Author: Michael Armbrust <michael@databricks.com> Closes #11646 from marmbrus/fileStrategy.
* [SPARK-11826][MLLIB] Refactor add() and subtract() methodsEhsan M.Kermani2016-03-142-13/+88
| | | | | | | | srowen Could you please check this when you have time? Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #9916 from ehsanmok/JIRA-11826.
* [SPARK-13686][MLLIB][STREAMING] Add a constructor parameter `reqParam` to ↵Dongjoon Hyun2016-03-142-6/+18
| | | | | | | | | | | | | | | | | (Streaming)LinearRegressionWithSGD ## What changes were proposed in this pull request? `LinearRegressionWithSGD` and `StreamingLinearRegressionWithSGD` does not have `regParam` as their constructor arguments. They just depends on GradientDescent's default reqParam values. To be consistent with other algorithms, we had better add them. The same default value is used. ## How was this patch tested? Pass the existing unit test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11527 from dongjoon-hyun/SPARK-13686.
* [MINOR][DOCS] Fix more typos in comments/strings.Dongjoon Hyun2016-03-149-9/+9
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR fixes 135 typos over 107 files: * 121 typos in comments * 11 typos in testcase name * 3 typos in log messages ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11689 from dongjoon-hyun/fix_more_typos.
* [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> ↵Sean Owen2016-03-134-10/+11
| | | | | | | | | | | | | | | | | | | | byte[] conversions (and remaining Coverity items) ## What changes were proposed in this pull request? - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8 - Same for `InputStreamReader` and `OutputStreamWriter` constructors - Standardizes on UTF-8 everywhere - Standardizes specifying the encoding with `StandardCharsets.UTF-8`, not the Guava constant or "UTF-8" (which means handling `UnuspportedEncodingException`) - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c ) ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #11657 from srowen/SPARK-13823.
* [MINOR][DOCS] Replace `DataFrame` with `Dataset` in Javadoc.Dongjoon Hyun2016-03-132-7/+7
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? SPARK-13817 (PR #11656) replaces `DataFrame` with `Dataset` from Java. This PR fixes the remaining broken links and sample Java code in `package-info.java`. As a result, it will update the following Javadoc. * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/attribute/package-summary.html * http://spark.apache.org/docs/latest/api/java/org/apache/spark/ml/feature/package-summary.html ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11675 from dongjoon-hyun/replace_dataframe_with_dataset_in_javadoc.
* [SPARK-13841][SQL] Removes Dataset.collectRows()/takeRows()Cheng Lian2016-03-131-1/+1
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR removes two methods, `collectRows()` and `takeRows()`, from `Dataset[T]`. These methods were added in PR #11443, and were later considered not useful. ## How was this patch tested? Existing tests should do the work. Author: Cheng Lian <lian@databricks.com> Closes #11678 from liancheng/remove-collect-rows-and-take-rows.
* [SPARK-13244][SQL] Migrates DataFrame to DatasetCheng Lian2016-03-1029-99/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`. Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`). There are several noticeable API changes related to those returning arrays: 1. `collect`/`take` - Old APIs in class `DataFrame`: ```scala def collect(): Array[Row] def take(n: Int): Array[Row] ``` - New APIs in class `Dataset[T]`: ```scala def collect(): Array[T] def take(n: Int): Array[T] def collectRows(): Array[Row] def takeRows(n: Int): Array[Row] ``` Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side. Normally, Java users may fall back to `collectAsList` and `takeAsList`. The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here). 1. `randomSplit` - Old APIs in class `DataFrame`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame] def randomSplit(weights: Array[Double]): Array[DataFrame] ``` - New APIs in class `Dataset[T]`: ```scala def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]] def randomSplit(weights: Array[Double]): Array[Dataset[T]] ``` Similar problem as above, but hasn't been addressed for Java API yet. We can probably add `randomSplitAsList` to fix this one. 1. `groupBy` Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods. To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`. Other noticeable changes: 1. Dataset always do eager analysis now We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure. However, Dataset encoders requires eager analysi during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`. ## How was this patch tested? Existing tests do the work. ## TODO - [ ] Fix all tests - [ ] Re-enable MiMA check - [ ] Update ScalaDoc (`since`, `group`, and example code) Author: Cheng Lian <lian@databricks.com> Author: Yin Huai <yhuai@databricks.com> Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Closes #11443 from liancheng/ds-to-df.
* [SPARK-3854][BUILD] Scala style: require spaces before `{`.Dongjoon Hyun2016-03-1011-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule to prevent '){' pattern for the following majority pattern and fixes the code accordingly. If we enforce this in ScalaStyle from now, it will improve the Scala code quality and reduce review time. ``` // Correct: if (true) { println("Wow!") } // Incorrect: if (true){ println("Wow!") } ``` IntelliJ also shows new warnings based on this. ## How was this patch tested? Pass the Jenkins ScalaStyle test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #11637 from dongjoon-hyun/SPARK-3854.