path: root/mllib
Commit message | Author | Age | Files | Lines
* [SPARK-14497][ML] Use top instead of sortBy() to get top N frequent words as dict in CountVectorizer | fwang1 | 2016-04-10 | 2 | -13/+8
  ## What changes were proposed in this pull request?
  Replace sortBy() with top() to calculate the top N frequent words as the dictionary.
  ## How was this patch tested?
  Existing unit tests. Terms with the same TF are sorted in descending order; the test would fail if terms with the same TF, like "c" and "d", were hardcoded in the dictionary.
  Author: fwang1 <desperado.wf@gmail.com>
  Closes #12265 from lionelfeng/master.
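  A minimal sketch of the substitution, assuming a hypothetical `wordCounts: RDD[(String, Long)]` pairing each term with its frequency:
  ```scala
  import org.apache.spark.rdd.RDD

  def topTerms(wordCounts: RDD[(String, Long)], vocabSize: Int): Array[(String, Long)] = {
    // Before: wordCounts.sortBy(_._2, ascending = false).take(vocabSize)
    // sorts every term globally (a full shuffle) just to keep the head.
    // After: top() keeps only the vocabSize largest entries per partition
    // and merges those partial results, avoiding the full sort.
    wordCounts.top(vocabSize)(Ordering.by(_._2))
  }
  ```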
* Revert "[SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom" | Xiangrui Meng | 2016-04-09 | 1 | -12/+0
  This reverts commit 1598d11bb0248384872cf88bc2b16f3b238046ad.
* [SPARK-14462][ML][MLLIB] add the mllib-local build to maven pom | DB Tsai | 2016-04-09 | 1 | -0/+12
  ## What changes were proposed in this pull request?
  In order to separate the linear algebra and vector/matrix classes into a standalone jar, we need to set up the build first. This PR creates a new jar called mllib-local with minimal dependencies. The test scope will still depend on spark-core and spark-core-test in order to use the common utilities, but the runtime will avoid any platform dependency. A couple of platform-independent classes will be moved to this package to demonstrate how this works.
  ## How was this patch tested?
  Unit tests.
  Author: DB Tsai <dbt@netflix.com>
  Closes #12241 from dbtsai/dbtsai-mllib-local-build.
* [SPARK-14392][ML] CountVectorizer Estimator should include binary toggle Param | wm624@hotmail.com | 2016-04-09 | 2 | -20/+34
  ## What changes were proposed in this pull request?
  CountVectorizerModel has a binary toggle param. This PR adds the binary toggle param to the estimator CountVectorizer. As discussed in the JIRA, instead of adding a param to CountVectorizer directly, I moved the binary param into CountVectorizerParams, so the estimator inherits it.
  ## How was this patch tested?
  Added a new test case, which fits the model with the binary flag set to true and then checks that all non-zero counts in the trained model are set to 1.0. All tests in CountVectorizerSuite.scala pass.
  Author: wm624@hotmail.com <wm624@hotmail.com>
  Closes #12200 from wangmiao1981/binary_param.
* [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs | Joseph K. Bradley | 2016-04-08 | 7 | -40/+80
  ## What changes were proposed in this pull request?
  Cleanups to documentation. No changes to code.
  * GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier, GBTRegressor
  * GLM regParam: needs doc saying it is for L2 only
  * TrainValidationSplitModel: add .. versionadded:: 2.0.0
  * Rename "_transformer_params_from_java" to "_transfer_params_from_java"
  * LogReg Summary classes: "probability" col should not say "calibrated"
  * LR summaries: coefficientStandardErrors -> document that intercept stderr comes last. Same for t-values and p-values
  * approxCountDistinct: Document meaning of "rsd" argument
  * LDA: note which params are for online LDA only
  ## How was this patch tested?
  Doc build.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12266 from jkbradley/ml-doc-cleanups.
* [SPARK-14298][ML][MLLIB] LDA should support disable checkpoint | Yanbo Liang | 2016-04-08 | 2 | -3/+6
  ## What changes were proposed in this pull request?
  In the doc of [`checkpointInterval`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/param/shared/sharedParams.scala#L241), we told users that they can disable checkpointing by setting `checkpointInterval = -1`, but we did not actually handle this situation for LDA. This patch fixes that bug.
  ## How was this patch tested?
  Existing tests. cc jkbradley
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12089 from yanboliang/spark-14298.
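  A minimal sketch of the intended guard (names hypothetical): any non-positive interval means the optimizer never checkpoints.
  ```scala
  // Checkpoint only when the feature is enabled (interval > 0) and the
  // current iteration lands on the interval boundary.
  def shouldCheckpoint(checkpointInterval: Int, iteration: Int): Boolean =
    checkpointInterval > 0 && iteration % checkpointInterval == 0
  ```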
* [SPARK-13048][ML][MLLIB] keepLastCheckpoint option for LDA EM optimizer | Joseph K. Bradley | 2016-04-07 | 7 | -32/+191
  ## What changes were proposed in this pull request?
  The EMLDAOptimizer should generally not delete its last checkpoint, since that can cause failures when DistributedLDAModel methods are called (if any partitions need to be recovered from the checkpoint). This PR adds a "deleteLastCheckpoint" option which defaults to false. This is a change in behavior from Spark 1.6, in that the last checkpoint will not be removed by default. This involves adding the deleteLastCheckpoint option to both spark.ml and spark.mllib, and modifying PeriodicCheckpointer to support the option. This also:
  * Makes MLlibTestSparkContext extend TempDirectory and set the checkpointDir to tempDir
  * Updates LibSVMRelationSuite because of a name conflict with "tempDir" (and fixes a bug where it failed to delete a temp directory)
  * Adds a MIMA exclude for the DistributedLDAModel constructor, which is already `private[clustering]`
  ## How was this patch tested?
  Added 2 new unit tests to spark.ml LDASuite, which call into spark.mllib.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12166 from jkbradley/emlda-save-checkpoint.
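  A usage sketch, assuming the option landed under the title's name `keepLastCheckpoint` on the spark.ml side:
  ```scala
  import org.apache.spark.ml.clustering.LDA

  // Keep the final checkpoint so DistributedLDAModel methods can recover
  // lost partitions from it after training finishes.
  val lda = new LDA()
    .setK(10)
    .setOptimizer("em")
    .setKeepLastCheckpoint(true)
  ```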
* [SPARK-14134][CORE] Change the package name used for shading classes. | Marcelo Vanzin | 2016-04-06 | 1 | -1/+1
  The current package name uses a dash, which is a little weird but seemed to work. That is, until a new test tried to mock a class that references one of those shaded types, and then things started failing. Most changes are just noise to fix the logging configs. For reference, SPARK-8815 also raised this issue, although at the time it did not cause any issues in Spark, so it was not addressed.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #11941 from vanzin/SPARK-14134.
* [SPARK-12382][ML] Remove mllib GBT implementation and wrap ml | sethah | 2016-04-06 | 6 | -226/+207
  ## What changes were proposed in this pull request?
  This patch removes the implementation of gradient boosted trees in mllib/tree/GradientBoostedTrees.scala and changes mllib GBTs to call the implementation in spark.ml.
  Primary changes:
  * Removed `boost` method in mllib GradientBoostedTrees.scala
  * Created new test suite GradientBoostedTreesSuite in ML, which contains unit tests that were specific to GBT internals from mllib
  Other changes:
  * Added an `updatePrediction` method in the GradientBoostedTrees package. This method provides consistency for methods that build predictions from boosted models. Several methods hard-code the prediction as sum_{i=1}^{numTrees} (treePrediction * treeWeight); calling this function ensures that test methods that check accuracy use the same prediction method that the algorithm uses during training.
  * Added methods that were previously only used in testing, but were public methods, to GradientBoostedTrees. This includes `computeError` (previously part of the `Loss` trait) and `evaluateEachIteration`. These are used in the new spark.ml unit tests. They are left in mllib as well so as not to break the API.
  ## How was this patch tested?
  Existing unit tests which compare ML and MLlib ensure that mllib GBTs have not changed. Only a single unit test was moved to ML, which verifies that `runWithValidation` performs as expected.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #12050 from sethah/SPARK-12382.
* [SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ScalaDoc-style multiline comments | Dongjoon Hyun | 2016-04-06 | 1 | -4/+6
  ## What changes were proposed in this pull request?
  According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the following:
  ```
  /** In Spark, we don't use the ScalaDoc style so this
   * is not correct.
   */
  ```
  ## How was this patch tested?
  Pass the Jenkins tests (including `lint-scala`).
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12221 from dongjoon-hyun/SPARK-14444.
* [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression | Bryan Cutler | 2016-04-06 | 2 | -12/+37
  ## What changes were proposed in this pull request?
  Adding a Python API for the training summaries of LogisticRegression and LinearRegression in PySpark ML.
  ## How was this patch tested?
  Added unit tests to exercise the API calls for the summary classes. Also manually verified that values are as expected and match those from Scala directly.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
* [SPARK-13538][ML] Add GaussianMixture to ML | Zheng RuiFeng | 2016-04-06 | 2 | -0/+444
  JIRA: https://issues.apache.org/jira/browse/SPARK-13538
  ## What changes were proposed in this pull request?
  Add GaussianMixture and GaussianMixtureModel to the ML package.
  ## How was this patch tested?
  Unit tests and manual tests were done. Local Scalastyle checks passed.
  Author: Zheng RuiFeng <ruifengz@foxmail.com>
  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11419 from zhengruifeng/mlgmm.
* [SPARK-14322][MLLIB] Use treeAggregate instead of reduce in OnlineLDAOptimizer | Yuhao Yang | 2016-04-06 | 1 | -2/+3
  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-14322
  OnlineLDAOptimizer uses RDD.reduce in two places where it could use treeAggregate, which can cause scalability issues. This should be an easy fix. It is also a bug, since it modifies the first argument to reduce, so we should use aggregate or treeAggregate. See this line: https://github.com/apache/spark/blob/f12f11e578169b47e3f8b18b299948c0670ba585/mllib/src/main/scala/org/apache/spark/mllib/clustering/LDAOptimizer.scala#L452 and a few lines below it.
  ## How was this patch tested?
  Unit tests.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #12106 from hhbyyh/ldaTreeReduce.
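  The bug and the fix in one sketch, assuming a hypothetical `stats: RDD[BDV[Double]]` of partial sufficient statistics: `reduce((a, b) => a += b)` mutates its first argument, which Spark may reuse, whereas `treeAggregate` folds into a fresh zero value.
  ```scala
  import breeze.linalg.{DenseVector => BDV}
  import org.apache.spark.rdd.RDD

  def sumStats(stats: RDD[BDV[Double]], dim: Int): BDV[Double] =
    // treeAggregate starts every partition from a fresh zero value, so no
    // input record is mutated, and it merges partial sums in a tree pattern
    // across executors instead of funneling everything to the driver.
    stats.treeAggregate(BDV.zeros[Double](dim))(
      seqOp = (acc, v) => acc += v,
      combOp = (a, b) => a += b)
  ```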
* [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning | Xusen Yin | 2016-04-06 | 3 | -0/+29
  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-13786
  Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.
  ## How was this patch tested?
  Tested with Python doctests.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #12020 from yinxusen/SPARK-13786.
* [SPARK-14284][ML] KMeansSummary deprecating size; adding clusterSizes | Shally Sangal | 2016-04-05 | 2 | -2/+3
  ## What changes were proposed in this pull request?
  KMeansSummary class: deprecated size and added clusterSizes.
  Author: Shally Sangal <shallysangal@gmail.com>
  Closes #12084 from shallys/master.
* [SPARK-14386][ML] Changed spark.ml ensemble trees methods to return concrete types | Joseph K. Bradley | 2016-04-04 | 6 | -19/+22
  ## What changes were proposed in this pull request?
  In spark.ml, GBT and RandomForest expose the trait DecisionTreeModel in the trees method, but they should not, since it is a private trait (and not ready to be made public). It will also be more useful to users if we return the concrete types. This PR returns the concrete types. The MIMA checks appear to be OK with this change.
  ## How was this patch tested?
  Existing unit tests.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #12158 from jkbradley/hide-dtm.
* [SPARK-13784][ML] Persistence for RandomForestClassifier, RandomForestRegressor | Joseph K. Bradley | 2016-04-04 | 9 | -103/+424
  ## What changes were proposed in this pull request?
  **Main change**: Added save/load for RandomForestClassifier, RandomForestRegressor (implementation details below).
  Modified numTrees method (*deprecation*)
  * Goal: Use default implementations of unit tests which assume Estimators and Models share the same set of Params.
  * What this PR does: Moves method numTrees outside of trait TreeEnsembleModel. Adds it to GBT and RF Models. Deprecates it in RF Models in favor of new method getNumTrees. In Spark 2.1, we can have RF Models include Param numTrees.
  Minor items
  * Fixes bugs in GBTClassificationModel, GBTRegressionModel fromOld methods where they assign the wrong old UID.
  **Implementation details**
  * Split DecisionTreeModelReadWrite.loadTreeNodes into 2 methods in order to reuse some code for ensembles.
  * Added EnsembleModelReadWrite object with save/load implementations usable for RFs and GBTs. These store all trees' nodes in a single DataFrame, and all trees' metadata in a second DataFrame.
  * Split trait RandomForestParams into parts in order to add more Estimator Params to RF models.
  * Split DefaultParamsWriter.saveMetadata into two methods to allow ensembles to store sub-models' metadata in a single DataFrame. Same for DefaultParamsReader.loadMetadata.
  ## How was this patch tested?
  Adds standard unit tests for RF save/load.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Author: GayathriMurali <gayathri.m.softie@gmail.com>
  Closes #12118 from jkbradley/GayathriMurali-SPARK-13784.
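  A usage sketch of the added persistence; `trainingData` and the path are assumed for illustration:
  ```scala
  import org.apache.spark.ml.classification.{RandomForestClassificationModel, RandomForestClassifier}

  // Fit a small forest, write it out, and read it back.
  val rf = new RandomForestClassifier().setNumTrees(20)
  val model = rf.fit(trainingData) // trainingData: DataFrame with label/features columns
  model.write.overwrite().save("/tmp/rf-model")
  val restored = RandomForestClassificationModel.load("/tmp/rf-model")
  ```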
* [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results | Dongjoon Hyun | 2016-04-03 | 12 | -17/+17
  ## What changes were proposed in this pull request?
  This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines).
  - Fix typos (exception/log strings, test case names, comments) in 44 lines.
  - Fix lint-java errors (MaxLineLength) in 6 lines. (New code after SPARK-14011)
  - Use diamond operators in 40 lines. (New code after SPARK-13702)
  - Fix redundant semicolons in 5 lines.
  - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
  ## How was this patch tested?
  Manual and pass the Jenkins tests.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12139 from dongjoon-hyun/SPARK-14355.
* [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code. | Dongjoon Hyun | 2016-04-02 | 5 | -17/+21
  ## What changes were proposed in this pull request?
  This PR converts all Scala-style multiline comments into Java-style multiline comments in Scala code. (All comment-only changes over 77 files: +786 lines, -747 lines)
  ## How was this patch tested?
  Manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
* [MINOR] Typo fixes | Jacek Laskowski | 2016-04-02 | 2 | -2/+2
  ## What changes were proposed in this pull request?
  Typo fixes. No functional changes.
  ## How was this patch tested?
  Built the sources and ran with samples.
  Author: Jacek Laskowski <jacek@japila.pl>
  Closes #11802 from jaceklaskowski/typo-fixes.
* [SPARK-14308][ML][MLLIB] Remove unused mllib tree classes and move private classes to ML | sethah | 2016-04-01 | 18 | -409/+15
  ## What changes were proposed in this pull request?
  Decision tree helper classes will be migrated to ML. This patch moves those internal classes that are not part of the public API and removes ones that are no longer used, after [SPARK-12183](https://github.com/apache/spark/pull/11855). No functional changes are made.
  Details:
  * Bin.scala is removed, as the ML implementation does not require bins.
  * mllib NodeIdCache is removed. It was only used by the mllib implementation, which no longer exists.
  * mllib TreePoint is removed. It was only used by the mllib implementation, which no longer exists.
  * BaggedPoint, DTStatsAggregator, DecisionTreeMetadata, BaggedPointSuite and TimeTracker are all moved to ML.
  ## How was this patch tested?
  No functional changes are made. Existing unit tests ensure behavior is unchanged.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #12097 from sethah/cleanup_mllib_tree.
* [SPARK-7425][ML] spark.ml Predictor should support other numeric types for label | BenFradet | 2016-04-01 | 24 | -49/+294
  Currently, the Predictor abstraction expects the input labelCol type to be DoubleType, but we should support other numeric types. This will involve updating the PredictorParams.validateAndTransformSchema method.
  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #10355 from BenFradet/SPARK-7425.
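  A sketch of the idea (not the exact Predictor code): accept any numeric label column and cast it to double before fitting.
  ```scala
  import org.apache.spark.sql.DataFrame
  import org.apache.spark.sql.functions.col
  import org.apache.spark.sql.types.{DoubleType, NumericType}

  // Hypothetical helper: validate that the label column is numeric, then
  // normalize it to DoubleType so downstream training code sees doubles.
  def normalizeLabel(df: DataFrame, labelCol: String): DataFrame = {
    require(df.schema(labelCol).dataType.isInstanceOf[NumericType],
      s"Label column $labelCol must be of a numeric type.")
    df.withColumn(labelCol, col(labelCol).cast(DoubleType))
  }
  ```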
* [SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure | Cheng Lian | 2016-04-01 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  Fixes a compilation failure introduced in PR #12088 under Scala 2.10.
  ## How was this patch tested?
  Compilation.
  Author: Cheng Lian <lian@databricks.com>
  Closes #12107 from liancheng/spark-14295-hotfix.
* [SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans | Yanbo Liang | 2016-03-31 | 2 | -51/+86
  ## What changes were proposed in this pull request?
  Define and use `KMeansWrapper` for `SparkR::kmeans`. This is only a code refactor of the original `KMeans` wrapper.
  ## How was this patch tested?
  Existing tests. cc mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12039 from yanboliang/spark-14059.
* [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron | Alexander Ulanov | 2016-03-31 | 7 | -386/+595
  1. Implement LossFunction trait and implement squared error and cross entropy loss with it
  2. Implement unit test for gradient and loss
  3. Implement InPlace trait and in-place layer evaluation
  4. Refactor interface for ActivationFunction
  5. Update of Layer and LayerModel interfaces
  6. Fix random weights assignment
  7. Implement memory allocation by MLP model instead of individual layers
  These features decreased the memory usage and increased flexibility of the internal API.
  Author: Alexander Ulanov <nashb@yandex.ru>
  Author: avulanov <avulanov@gmail.com>
  Closes #9229 from avulanov/mlp-refactoring.
* [SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM | Cheng Lian | 2016-03-31 | 2 | -33/+127
  ## What changes were proposed in this pull request?
  This PR implements `FileFormat.buildReader()` for the LibSVM data source. Besides that, a new interface method `prepareRead()` is added to `FileFormat`:
  ```scala
  def prepareRead(
      sqlContext: SQLContext,
      options: Map[String, String],
      files: Seq[FileStatus]): Map[String, String] = options
  ```
  After migrating from `buildInternalScan()` to `buildReader()`, we lost the opportunity to collect necessary global information, since `buildReader()` works in a per-partition manner. For example, LibSVM needs to infer the total number of features if the `numFeatures` data source option is not set. Any necessary collected global information should be returned using the data source options map. By default, this method just returns the original options untouched.
  An alternative approach is to absorb `inferSchema()` into `prepareRead()`, since schema inference is also a kind of global information gathering. However, this approach wasn't chosen because schema inference is optional, while `prepareRead()` must be called whenever a `HadoopFsRelation` based data source relation is instantiated.
  One unaddressed problem is that, when `numFeatures` is absent, the input data will now be scanned twice. The `buildInternalScan()` code path doesn't need to do this because it caches the raw parsed RDD in memory before computing the total number of features. However, with `FileScanRDD`, the raw parsed RDD is created in a different way (e.g. partitioning) from the final RDD.
  ## How was this patch tested?
  Tested using existing test suites.
  Author: Cheng Lian <lian@databricks.com>
  Closes #12088 from liancheng/spark-14295-libsvm-build-reader.
* [SPARK-11892][ML] Model export/import for spark.ml: OneVsRest | Xusen Yin | 2016-03-31 | 3 | -18/+223
  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-11892
  Add save/load for spark ml.OneVsRest and its model. Also add OneVsRest and OneVsRestModel in MetaAlgorithmReadWrite.
  ## How was this patch tested?
  Tested with Scala unit tests.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9934 from yinxusen/SPARK-11892.
* [SPARK-13782][ML] Model export/import for spark.ml: BisectingKMeans | Yuhao Yang | 2016-03-31 | 5 | -9/+190
  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-13782
  Model export/import for BisectingKMeans in spark.ml and mllib.
  ## How was this patch tested?
  Unit tests.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #11933 from hhbyyh/bisectingsave.
* [SPARK-14164][MLLIB] Improve input layer validation of MultilayerPerceptronClassifier | Dongjoon Hyun | 2016-03-31 | 2 | -2/+18
  ## What changes were proposed in this pull request?
  This PR improves input layer validation and adds related test cases to MultilayerPerceptronClassifier.
  ```scala
  - // TODO: how to check ALSO that all elements are greater than 0?
  - ParamValidators.arrayLengthGt(1)
  + (t: Array[Int]) => t.forall(ParamValidators.gt(0)) && t.length > 1
  ```
  ## How was this patch tested?
  Pass the Jenkins tests, including the new test cases.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11964 from dongjoon-hyun/SPARK-14164.
* [SPARK-11507][MLLIB] add compact in Matrices fromBreeze | Yuhao Yang | 2016-03-30 | 2 | -1/+21
  jira: https://issues.apache.org/jira/browse/SPARK-11507
  "In certain situations when adding two block matrices, I get an error regarding colPtr and the operation fails. External issue URL includes full error and code for reproducing the problem."
  Root cause: colPtr.last does NOT always equal values.length in a breeze CSCMatrix, which fails the require in SparseMatrix.
  Easy steps to reproduce:
  ```
  val m1: BM[Double] = new CSCMatrix[Double](Array(1.0, 1, 1), 3, 3, Array(0, 1, 2, 3), Array(0, 1, 2))
  val m2: BM[Double] = new CSCMatrix[Double](Array(1.0, 2, 2, 4), 3, 3, Array(0, 0, 2, 4), Array(1, 2, 1, 2))
  val sum = m1 + m2
  Matrices.fromBreeze(sum)
  ```
  Solution: Checking the code in [CSCMatrix](https://github.com/scalanlp/breeze/blob/28000a7b901bc3cfbbbf5c0bce1d0a5dda8281b0/math/src/main/scala/breeze/linalg/CSCMatrix.scala) shows that a CSCMatrix in breeze can have extra zeros at the end of its data array. Invoking compact will make sure it aligns with the require of SparseMatrix. This should add limited overhead, as the actual compact operation is only performed when necessary.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9520 from hhbyyh/matricesFromBreeze.
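  A minimal sketch of the alignment check (helper name hypothetical; breeze's `CSCMatrix` exposes `colPtrs`, `data`, and `compact()`):
  ```scala
  import breeze.linalg.CSCMatrix

  // Trim trailing slack in the data array so that colPtrs.last equals
  // data.length, which is what SparseMatrix's require() expects.
  def align(sm: CSCMatrix[Double]): CSCMatrix[Double] = {
    if (sm.colPtrs.last != sm.data.length) sm.compact()
    sm
  }
  ```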
* [MINOR][ML] Fix the wrong param name of LDA topicDistributionCol | Yanbo Liang | 2016-03-30 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  Fix the wrong param name of LDA `topicDistributionCol`.
  ## How was this patch tested?
  No tests. cc jkbradley
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #12065 from yanboliang/lda-topicDistributionCol.
* [SPARK-14181] TrainValidationSplit should have HasSeed | Xusen Yin | 2016-03-30 | 2 | -5/+14
  https://issues.apache.org/jira/browse/SPARK-14181
  TrainValidationSplit should have HasSeed for the random split of the RDD. I also changed the random split from the RDD function to the DataFrame function.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #11985 from yinxusen/SPARK-14181.
* [SPARK-14154][MLLIB] Simplify the implementation for Kolmogorov–Smirnov test | Yuhao Yang | 2016-03-29 | 1 | -73/+4
  ## What changes were proposed in this pull request?
  jira: https://issues.apache.org/jira/browse/SPARK-14154
  I just read the code for KolmogorovSmirnovTest and found it could be much simplified by following the original definition. Sending a PR for discussion.
  ## How was this patch tested?
  Unit tests.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #11954 from hhbyyh/ksoptimize.
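  For reference, the statistic computed is D = sup_x |F_n(x) - F(x)|. A local, non-distributed sketch of the definition-based computation:
  ```scala
  // Evaluate the KS statistic at the sorted sample points. The empirical CDF
  // jumps from i/n to (i+1)/n at the i-th sorted point, so the supremum of
  // |F_n(x) - F(x)| is attained at one of those two endpoints.
  def ksStatistic(sample: Array[Double], cdf: Double => Double): Double = {
    val n = sample.length.toDouble
    sample.sorted.zipWithIndex.map { case (x, i) =>
      val f = cdf(x)
      math.max(f - i / n, (i + 1) / n - f)
    }.max
  }
  ```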
* [SPARK-13963][ML] Adding binary toggle param to HashingTF | Bryan Cutler | 2016-03-29 | 4 | -5/+69
  ## What changes were proposed in this pull request?
  Adding a binary toggle parameter to ml.feature.HashingTF, as well as mllib.feature.HashingTF, since the former wraps this functionality. This parameter, if true, will set non-zero valued term counts to 1 to transform term count features into binary values that are well suited for discrete probability models.
  ## How was this patch tested?
  Added unit tests for ML and MLlib.
  Author: Bryan Cutler <cutlerb@gmail.com>
  Closes #11832 from BryanCutler/binary-param-HashingTF-SPARK-13963.
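  A usage sketch of the new toggle on the spark.ml side (column names illustrative):
  ```scala
  import org.apache.spark.ml.feature.HashingTF

  // With binary = true, every non-zero term count in the output vector
  // becomes 1.0, which suits models like Bernoulli naive Bayes.
  val hashingTF = new HashingTF()
    .setInputCol("words")
    .setOutputCol("features")
    .setBinary(true)
  ```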
* [SPARK-11730][ML] Add feature importances for GBTs. | sethah | 2016-03-28 | 12 | -135/+213
  ## What changes were proposed in this pull request?
  Now that GBTs have been moved to ML, they can use the implementation of feature importance for random forests. This patch simply adds a `featureImportances` attribute to `GBTClassifier` and `GBTRegressor` and adds tests for each. GBT feature importances here simply average the feature importances for each tree in its ensemble. This follows the implementation from scikit-learn. This method is also suggested by J. Friedman in [this paper](https://statweb.stanford.edu/~jhf/ftp/trebst.pdf).
  ## How was this patch tested?
  Unit tests were added to `GBTClassifierSuite` and `GBTRegressorSuite` to validate feature importances.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #11961 from sethah/SPARK-11730.
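  A minimal sketch of the averaging described above (helper hypothetical; the real implementation also accounts for normalization of each tree's importance vector):
  ```scala
  // Ensemble importance = element-wise mean of per-tree importance vectors.
  def averageImportances(perTree: Seq[Array[Double]]): Array[Double] = {
    val summed = perTree.reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
    summed.map(_ / perTree.size)
  }
  ```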
* [SPARK-11893] Model export/import for spark.ml: TrainValidationSplit | Xusen Yin | 2016-03-28 | 5 | -142/+310
  https://issues.apache.org/jira/browse/SPARK-11893
  jkbradley In order to share read/write with `TrainValidationSplit`, I moved the `SharedReadWrite` out of `CrossValidator` into a new trait `SharedReadWrite` in the tuning package. To reduce the repeated tests, I moved the complex tests from `CrossValidatorSuite` to `SharedReadWriteSuite`, and created a fake validator called `MyValidator` to test the shared code. With `SharedReadWrite`, a potential newly added `Validator` can share the read/write common part, and only needs to implement the save/load of its extra params.
  Author: Xusen Yin <yinxusen@gmail.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9971 from yinxusen/SPARK-11893.
* [SPARK-14187][MLLIB] Fix incorrect use of binarySearch in SparseMatrix | Chenliang Xu | 2016-03-28 | 2 | -1/+5
  ## What changes were proposed in this pull request?
  Fix incorrect use of binarySearch in SparseMatrix.
  ## How was this patch tested?
  Unit test added.
  Author: Chenliang Xu <chexu@groupon.com>
  Closes #11992 from luckyrandom/SPARK-14187.
* [SPARK-12494][MLLIB] Array out of bound Exception in KMeans Yarn Mode | Sean Owen | 2016-03-28 | 1 | -0/+2
  ## What changes were proposed in this pull request?
  Produce a better error message when k-means initialization can't take enough samples from the input (perhaps because it is empty).
  ## How was this patch tested?
  Jenkins tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #11979 from srowen/SPARK-12494.
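  A sketch of the kind of guard added (message wording hypothetical): fail fast with a descriptive error instead of an ArrayIndexOutOfBoundsException.
  ```scala
  // Hypothetical check before picking initial centers from a sample.
  def requireEnoughSamples(sampleSize: Int, k: Int): Unit = {
    require(sampleSize >= k,
      s"Could only sample $sampleSize points for k-means initialization, " +
      s"but k = $k was requested. Is the input empty or smaller than k?")
  }
  ```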
* [SPARK-10691][ML] Make LogisticRegressionModel, LinearRegressionModel evaluate() public | Joseph K. Bradley | 2016-03-27 | 2 | -9/+11
  ## What changes were proposed in this pull request?
  Made the evaluate method public. Fixed LogisticRegressionModel.evaluate to handle the case when probabilityCol is not specified.
  ## How was this patch tested?
  There were already unit tests for these methods.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11928 from jkbradley/public-evaluate.
* [MINOR][MLLIB] Remove TODO comment in DecisionTreeModel.scala | Dongjoon Hyun | 2016-03-27 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  This PR fixes the following line and the related code. Historically, this code was added in [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597). After [SPARK-5597](https://issues.apache.org/jira/browse/SPARK-5597) was committed, [SPARK-3365](https://issues.apache.org/jira/browse/SPARK-3365) was fixed. Now we can remove the comment without changing the persistence code.
  ```scala
  - categories: Seq[Double]) { // TODO: Change to List once SPARK-3365 is fixed
  + categories: Seq[Double]) {
  ```
  ## How was this patch tested?
  Pass the Jenkins tests.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #11966 from dongjoon-hyun/change_categories_type.
* [SPARK-14089][CORE][MLLIB] Remove methods that have been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5 | Liwei Lin | 2016-03-26 | 6 | -137/+0
  ## What changes were proposed in this pull request?
  Removed methods that have been deprecated since 1.1, 1.2, 1.3, 1.4, and 1.5.
  ## How was this patch tested?
  - Manually checked that no code in Spark calls these methods any more
  - Existing test suites
  Author: Liwei Lin <lwlin7@gmail.com>
  Author: proflin <proflin.me@gmail.com>
  Closes #11910 from lw-lin/remove-deprecates.
* [SPARK-14159][ML] Fixed bug in StringIndexer + related issue in RFormula | Joseph K. Bradley | 2016-03-25 | 3 | -13/+22
  ## What changes were proposed in this pull request?
  StringIndexerModel.transform sets the output column metadata to use the name inputCol; it should not. Fixing this causes a problem with the metadata produced by RFormula.
  Fix in RFormula: I added the StringIndexer columns to prefixesToRewrite, and I modified VectorAttributeRewriter to find and replace all "prefixes", since attributes collect multiple prefixes from StringIndexer + Interaction. Note that "prefixes" is no longer accurate, since internal strings may be replaced.
  ## How was this patch tested?
  Unit test which failed before this fix.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11965 from jkbradley/StringIndexer-fix.
* [SPARK-13010][ML][SPARKR] Implement a simple wrapper of AFTSurvivalRegression in SparkR | Yanbo Liang | 2016-03-24 | 1 | -0/+99
  ## What changes were proposed in this pull request?
  This PR continues the work in #11447: we implement the wrapper of `AFTSurvivalRegression`, named `survreg`, in SparkR.
  ## How was this patch tested?
  Tested against output from the R package survival's survreg. cc mengxr felixcheung
  Close #11447
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #11932 from yanboliang/spark-13010-new.
* [SPARK-11871] Add save/load for MLPC | Xusen Yin | 2016-03-24 | 2 | -9/+103
  ## What changes were proposed in this pull request?
  https://issues.apache.org/jira/browse/SPARK-11871
  Add save/load for MLPC.
  ## How was this patch tested?
  Tested with Scala unit tests.
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9854 from yinxusen/SPARK-11871.
* [SPARK-14030][MLLIB] Add parameter check to MLLIB | Ruifeng Zheng | 2016-03-24 | 13 | -13/+83
  ## What changes were proposed in this pull request?
  Add parameter verification to MLLIB, such as:
  * numCorrections > 0
  * tolerance >= 0
  * iters > 0
  * regParam >= 0
  ## How was this patch tested?
  Manual tests.
  Author: Ruifeng Zheng <ruifengz@foxmail.com>
  Author: Zheng RuiFeng <mllabs@datanode1.(none)>
  Author: mllabs <mllabs@datanode1.(none)>
  Closes #11852 from zhengruifeng/lbfgs_check.
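  A sketch of the fail-fast setter style such checks follow (class and field are hypothetical stand-ins):
  ```scala
  // Hypothetical optimizer params class: reject bad values at set time
  // instead of failing later inside the optimizer.
  class OptimizerParams {
    private var numCorrections: Int = 10

    def setNumCorrections(corrections: Int): this.type = {
      require(corrections > 0,
        s"Number of corrections must be positive but got $corrections")
      numCorrections = corrections
      this
    }

    def getNumCorrections: Int = numCorrections
  }
  ```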
* Fix typo in ALS.scala | Juarez Bochi | 2016-03-24 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  Just a typo.
  ## How was this patch tested?
  N/A
  Author: Juarez Bochi <jbochi@gmail.com>
  Closes #11896 from jbochi/patch-1.
* [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one | Joseph K. Bradley | 2016-03-23 | 13 | -1778/+538
  Primary change:
  * Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
  * spark.mllib now calls the spark.ml implementation.
  * Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.
  ml.tree.DecisionTreeModel
  * Added toOld and made it `private[spark]`, implemented for Classifier and Regressor in subclasses. These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.
  ml.tree.Node
  * Added `private[tree] def deepCopy`, used by unit tests.
  Copied developer comments from the spark.mllib implementation to the spark.ml one.
  Moving unit tests:
  * Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite or spark.mllib.tree.RandomForestSuite. Those tests were all moved to spark.ml.tree.impl.RandomForestSuite. The order in the file and the test names are the same, so you should be able to compare them by opening them in two windows side by side.
  * I made minimal changes to each test to allow it to run. Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values. No new unit tests were added.
  * mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in. Those same split calculations were already being tested in other unit tests, for each dataset type.
  **Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
  * spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB. This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().
  * spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats. This does not remove information from the tree, and it will save a bit of storage.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11855 from jkbradley/remove-mllib-tree-impl.
* [SPARK-13952][ML] Add random seed to GBT | sethah | 2016-03-23 | 9 | -39/+66
  ## What changes were proposed in this pull request?
  `GBTClassifier` and `GBTRegressor` should use a random seed for reproducible results. Because of the nature of the current unit tests, which compare GBTs in ML and GBTs in MLlib for equality, I also added a random seed to the MLlib GBT algorithm. I made alternate constructors in `mllib.tree.GradientBoostedTrees` to accept a random seed, but left them private so as not to change the API unnecessarily.
  ## How was this patch tested?
  Existing unit tests verify that functionality did not change. Other ML algorithms do not seem to have unit tests that directly test the functionality of random seeding, but reproducibility with seeding for GBTs is effectively verified in existing tests. I can add more tests if needed.
  Author: sethah <seth.hendrickson16@gmail.com>
  Closes #11903 from sethah/SPARK-13952.
* [SPARK-14035][MLLIB] Make error message more verbose for mllib NaiveBayesSuite | Joseph K. Bradley | 2016-03-23 | 1 | -10/+18
  ## What changes were proposed in this pull request?
  Print more info about failed NaiveBayesSuite tests which have exhibited flakiness.
  ## How was this patch tested?
  Ran locally with an incorrect check to cause failure.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #11858 from jkbradley/naive-bayes-bug-log.
* [SPARK-13449] Naive Bayes wrapper in SparkR | Xusen Yin | 2016-03-22 | 1 | -0/+75
  ## What changes were proposed in this pull request?
  This PR continues the work in #11486 from yinxusen with some code refactoring. In the R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous (Gaussian) features, while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.
  I moved the implementation out of SparkRWrappers into NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes. I removed the preprocessing part that omits NA values, because we don't know which columns to process.
  ## How was this patch tested?
  Tested against output from R package e1071's naiveBayes. cc: yanboliang yinxusen
  Closes #11486
  Author: Xusen Yin <yinxusen@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #11890 from mengxr/SPARK-13449.