path: root/mllib
Commit message | Author | Age | Files | Lines
* [SPARK-16241][ML] model loading backward compatibility for ml NaiveBayes | zlpmichelle | 2016-06-30 | 1 | -4/+7
## What changes were proposed in this pull request? model loading backward compatibility for ml NaiveBayes. ## How was this patch tested? existing ut and manual test for loading models saved by Spark 1.6. Author: zlpmichelle <zlpmichelle@gmail.com> Closes #13940 from zlpmichelle/naivebayes.
* [SPARK-15858][ML] Fix calculating error by tree stack overflow prob… | Mahmoud Rawas | 2016-06-29 | 2 | -43/+34
## What changes were proposed in this pull request? Improve the evaluateEachIteration function in mllib, as it fails when trying to calculate the error by tree for a model that has more than 500 trees. ## How was this patch tested? The patch was tested on a production data set (2K rows x 2K features), training a gradient boosted model without validation with maxIterations set to 1000, then producing the error by tree. The new patch was able to perform the calculation within 30 seconds, while previously it would take hours and then fail. **PS**: It would be better if this PR could be cherry-picked into release branches 1.6.1 and 2.0. Author: Mahmoud Rawas <mhmoudr@gmail.com> Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au> Closes #13624 from mhmoudr/SPARK-15858.master.
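For context, a minimal sketch of the call this commit speeds up; `gbtModel` (a trained mllib `GradientBoostedTreesModel`) and `validationData` (an `RDD[LabeledPoint]`) are placeholders, not names from the patch:
```scala
import org.apache.spark.mllib.tree.loss.LogLoss

// one error value per boosting iteration, over the validation set
val errorPerIteration: Array[Double] =
  gbtModel.evaluateEachIteration(validationData, LogLoss)
```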
* [SPARK-16245][ML] model loading backward compatibility for ml.feature.PCA | Yanbo Liang | 2016-06-28 | 1 | -10/+8
## What changes were proposed in this pull request? model loading backward compatibility for ml.feature.PCA. ## How was this patch tested? existing ut and manual test for loading models saved by Spark 1.6. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13937 from yanboliang/spark-16245.
* [SPARK-16242][MLLIB][PYSPARK] Conversion between old/new matrix columns in a DataFrame (Python) | Yanbo Liang | 2016-06-28 | 1 | -0/+14
## What changes were proposed in this pull request? This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame. ## How was this patch tested? Doctest in python. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13935 from yanboliang/spark-16242.
* [SPARK-16187][ML] Implement util method for ML Matrix conversion in scala/java | Yuhao Yang | 2016-06-27 | 4 | -7/+187
## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16187 This is to provide conversion utils between old/new matrix columns in a DataFrame, so users can use them to migrate their datasets and pipelines manually. ## How was this patch tested? Java and Scala unit tests. Author: Yuhao Yang <yuhao.yang@intel.com> Closes #13888 from hhbyyh/matComp.
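A minimal usage sketch of the conversion utilities described above; the DataFrame `df` and its column name are assumptions for illustration:
```scala
import org.apache.spark.mllib.util.MLUtils

// `df` is assumed to hold an old mllib.linalg.Matrix column named "features"
val migrated = MLUtils.convertMatrixColumnsToML(df, "features")
// the reverse direction is available for pipelines that still need the old types
val restored = MLUtils.convertMatrixColumnsFromML(migrated, "features")
```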
* [MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ArrayIndexOutOfBoundsException | José Antonio | 2016-06-25 | 1 | -1/+1
I have found the bug and tested the solution. ## What changes were proposed in this pull request? Just adjust the size of an array in line 58 so it does not cause an ArrayIndexOutOfBoundsException in line 66. ## How was this patch tested? Manual tests. I have recompiled the entire project with the fix, it has been built successfully, and I have run the code, also with good results. Line 66, `val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1`, crashes because trueWeights has length "nfeatures + 1" while "x" has length "nfeatures", and they should have the same length. To fix this, just make trueWeights the same length as x. I have recompiled the project with the change and it is working now: `[spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test` It now generates the data successfully in the specified folder. Author: José Antonio <joseanmunoz@gmail.com> Closes #13895 from j4munoz/patch-2.
* [SPARK-16133][ML] model loading backward compatibility for ml.feature | Yuhao Yang | 2016-06-23 | 3 | -5/+11
## What changes were proposed in this pull request? model loading backward compatibility for ml.feature. ## How was this patch tested? existing ut and manual test for loading 1.6 models. Author: Yuhao Yang <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13844 from hhbyyh/featureComp.
* [SPARK-16177][ML] model loading backward compatibility for ml.regression | Yuhao Yang | 2016-06-23 | 2 | -7/+10
## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16177 model loading backward compatibility for ml.regression. ## How was this patch tested? existing ut and manual test for loading 1.6 models. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13879 from hhbyyh/regreComp.
* [SPARK-16130][ML] model loading backward compatibility for ml.classification.LogisticRegression | Yuhao Yang | 2016-06-23 | 1 | -5/+5
## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16130 model loading backward compatibility for ml.classification.LogisticRegression. ## How was this patch tested? existing ut and manual test for loading old models. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13841 from hhbyyh/lrcomp.
* [SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs | Xiangrui Meng | 2016-06-23 | 5 | -9/+72
## What changes were proposed in this pull request? Since we decided to switch the spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change. ## How was this patch tested? Manually checked generated APIs. Author: Xiangrui Meng <meng@databricks.com> Closes #13859 from mengxr/SPARK-16154.
* [SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug | Xiangrui Meng | 2016-06-22 | 1 | -1/+3
## What changes were proposed in this pull request? We recently deprecated setLabelCol in ChiSqSelectorModel (#13823):
~~~scala
/** group setParam */
Since("1.6.0")
deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0")
def setLabelCol(value: String): this.type = set(labelCol, value)
~~~
This unfortunately hit a genjavadoc bug and broke doc generation. This is the generated Java code:
~~~java
/** group setParam */
public org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value) { throw new RuntimeException(); }
*
* deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0.
*/
public org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value) { throw new RuntimeException(); }
~~~
Switching to multiline is a workaround. Author: Xiangrui Meng <meng@databricks.com> Closes #13855 from mengxr/SPARK-16153.
* [MINOR][MLLIB] DefaultParamsReadable/Writable should be DeveloperApi | Xiangrui Meng | 2016-06-22 | 1 | -8/+5
## What changes were proposed in this pull request? `DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use it. So this PR changes the annotation to `DeveloperApi`. Author: Xiangrui Meng <meng@databricks.com> Closes #13828 from mengxr/default-readable-should-be-developer-api.
* [SPARK-16127][ML][PYSPARK] Audit @Since annotations related to ml.linalg | Nick Pentreath | 2016-06-22 | 12 | -31/+31
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
* [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs | Holden Karau | 2016-06-22 | 1 | -3/+2
## What changes were proposed in this pull request? Mark ml.classification algorithms as experimental to match the Scala algorithms, update the PyDoc for thresholds on `LogisticRegression` to have the same level of info as Scala, and enable mathjax for PyDoc. ## How was this patch tested? Built docs locally & PySpark SQL tests. Author: Holden Karau <holden@us.ibm.com> Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
* [SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib | gatorsmile | 2016-06-21 | 30 | -80/+99
#### What changes were proposed in this pull request? This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`. Also fix a test case issue in `BroadcastJoinSuite`. BTW, `SQLContext` is not being used in the `MLlib` test suites. #### How was this patch tested? Existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13380 from gatorsmile/sqlContextML.
* [MINOR][MLLIB] deprecate setLabelCol in ChiSqSelectorModel | Xiangrui Meng | 2016-06-21 | 1 | -0/+1
## What changes were proposed in this pull request? Deprecate `labelCol`, which is not used by ChiSqSelectorModel. Author: Xiangrui Meng <meng@databricks.com> Closes #13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
* [SPARK-16118][MLLIB] add getDropLast to OneHotEncoder | Xiangrui Meng | 2016-06-21 | 2 | -1/+7
## What changes were proposed in this pull request? We forgot the getter of `dropLast` in `OneHotEncoder`. ## How was this patch tested? Unit test. Author: Xiangrui Meng <meng@databricks.com> Closes #13821 from mengxr/SPARK-16118.
* [SPARK-16117][MLLIB] hide LibSVMFileFormat and move its doc to LibSVMDataSource | Xiangrui Meng | 2016-06-21 | 2 | -38/+59
## What changes were proposed in this pull request? LibSVMFileFormat implements the data source for the LIBSVM format. However, users do not really need to call its APIs to use it, so we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal is to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124. ## How was this patch tested? Manually checked the generated API doc and tested loading LIBSVM data. Author: Xiangrui Meng <meng@databricks.com> Closes #13819 from mengxr/SPARK-16117.
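For reference, a minimal sketch of loading LIBSVM data through the data source API rather than through `LibSVMFileFormat` directly; the path and `numFeatures` value are illustrative, and `spark` is assumed to be an existing SparkSession:
```scala
// yields a DataFrame with "label" and "features" columns
val data = spark.read.format("libsvm")
  .option("numFeatures", "780")
  .load("data/mllib/sample_libsvm_data.txt")
```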
* [MINOR][MLLIB] move setCheckpointInterval to non-expert setters | Xiangrui Meng | 2016-06-21 | 1 | -1/+1
## What changes were proposed in this pull request? The `checkpointInterval` is a non-expert param. This PR moves its setter to the non-expert group. Author: Xiangrui Meng <meng@databricks.com> Closes #13813 from mengxr/checkpoint-non-expert.
* [SPARK-15177][.1][R] make SparkR model params and default values consistent with MLlib | Xiangrui Meng | 2016-06-21 | 2 | -6/+6
## What changes were proposed in this pull request? This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation. Main changes:
* `spark.glm`: epsilon -> tol, maxit -> maxIter
* `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means||"
* `spark.naiveBayes`: laplace -> smoothing, default 1.0
## How was this patch tested? Existing unit tests. Author: Xiangrui Meng <meng@databricks.com> Closes #13801 from mengxr/SPARK-15177.1.
* [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature | Nick Pentreath | 2016-06-21 | 27 | -63/+357
This PR adds missing `Since` annotations to the `ml.feature` package. Closes #8505. ## How was this patch tested? Existing tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13641 from MLnick/add-since-annotations.
* [SPARK-16074][MLLIB] expose VectorUDT/MatrixUDT in a public API | Xiangrui Meng | 2016-06-20 | 3 | -0/+93
## What changes were proposed in this pull request? Both VectorUDT and MatrixUDT are private APIs, because UserDefinedType itself is private in Spark. However, in order to let developers implement their own transformers and estimators, we should expose both types in a public API to simplify the implementation of transformSchema, transform, etc. Otherwise, they need to get the data types using reflection. ## How was this patch tested? Unit tests in Scala and Java. Author: Xiangrui Meng <meng@databricks.com> Closes #13789 from mengxr/SPARK-16074.
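A sketch of how a custom transformer can declare a vector-typed output column once these types are public, assuming the `org.apache.spark.ml.linalg.SQLDataTypes` helper exposed by this change; the "features" column name is illustrative:
```scala
import org.apache.spark.ml.linalg.SQLDataTypes
import org.apache.spark.sql.types.{StructField, StructType}

// declare an output column typed as an ml Vector without resorting to reflection
def transformSchema(schema: StructType): StructType =
  StructType(schema.fields :+ StructField("features", SQLDataTypes.VectorType, nullable = false))
```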
* [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python) | Xiangrui Meng | 2016-06-17 | 1 | -0/+14
## What changes were proposed in this pull request? This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame. ## How was this patch tested? doctest in Python. cc: yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #13731 from mengxr/SPARK-15946.
* [SPARK-16008][ML] Remove unnecessary serialization in logistic regression | sethah | 2016-06-17 | 1 | -28/+29
JIRA: [SPARK-16008](https://issues.apache.org/jira/browse/SPARK-16008) ## What changes were proposed in this pull request? `LogisticAggregator` stores references to two arrays of dimension `numFeatures` which are serialized before the combine op, unnecessarily. This results in the shuffle write being ~3x larger than it should be (for multiclass logistic regression this number will go up; in MLlib, for instance, it is 3x smaller). This patch modifies `LogisticAggregator.add` to accept the two arrays as method parameters, which avoids the serialization. ## How was this patch tested? I tested this locally and verified the serialization reduction. ![image](https://cloud.githubusercontent.com/assets/7275795/16140387/d2974bac-3404-11e6-94f9-268860c931a2.png) Additionally, I ran some tests on a 4-node cluster (4x48 cores, 4x128 GB RAM). A data set of 2M rows and 10k features showed >2x iteration speedup. Author: sethah <seth.hendrickson16@gmail.com> Closes #13729 from sethah/lr_improvement.
* [SPARK-15922][MLLIB] `toIndexedRowMatrix` should consider the case `cols < offset+colsPerBlock` | Dongjoon Hyun | 2016-06-16 | 2 | -1/+6
## What changes were proposed in this pull request? SPARK-15922 reports the following scenario throwing an exception due to the mismatched vector sizes. This PR handles the exceptional case, `cols < (offset + colsPerBlock)`.
**Before**
```scala
scala> import org.apache.spark.mllib.linalg.distributed._
scala> import org.apache.spark.mllib.linalg._
scala> val rows = IndexedRow(0L, new DenseVector(Array(1,2,3))) :: IndexedRow(1L, new DenseVector(Array(1,2,3))) :: IndexedRow(2L, new DenseVector(Array(1,2,3))) :: Nil
scala> val rdd = sc.parallelize(rows)
scala> val matrix = new IndexedRowMatrix(rdd, 3, 3)
scala> val bmat = matrix.toBlockMatrix
scala> val imat = bmat.toIndexedRowMatrix
scala> imat.rows.collect
... // java.lang.IllegalArgumentException: requirement failed: Vectors must be the same length!
```
**After**
```scala
...
scala> imat.rows.collect
res0: Array[org.apache.spark.mllib.linalg.distributed.IndexedRow] = Array(IndexedRow(0,[1.0,2.0,3.0]), IndexedRow(1,[1.0,2.0,3.0]), IndexedRow(2,[1.0,2.0,3.0]))
```
## How was this patch tested? Pass the Jenkins tests (including the above case). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13643 from dongjoon-hyun/SPARK-15922.
* [SPARK-15983][SQL] Removes FileFormat.prepareRead | Cheng Lian | 2016-06-16 | 1 | -16/+17
## What changes were proposed in this pull request? Interface method `FileFormat.prepareRead()` was added in #12088 to handle a special case in the LibSVM data source. However, the semantics of this interface method isn't intuitive: it returns a modified version of the data source options map. Considering that the LibSVM case can be easily handled using schema metadata inside `inferSchema`, we can remove this interface method to keep the `FileFormat` interface clean. ## How was this patch tested? Existing tests. Author: Cheng Lian <lian@databricks.com> Closes #13698 from liancheng/remove-prepare-read.
* [SPARK-15979][SQL] Rename various Parquet support classes | Reynold Xin | 2016-06-15 | 1 | -5/+1
## What changes were proposed in this pull request? This patch renames various Parquet support classes from CatalystAbc to ParquetAbc. This new naming makes more sense for two reasons: 1. These are not optimizer related (i.e. Catalyst) classes. 2. We are in the Spark code base, and as a result it'd be more clear to call out that these are Parquet support classes, rather than some Spark classes. ## How was this patch tested? Renamed test cases as well. Author: Reynold Xin <rxin@databricks.com> Closes #13696 from rxin/parquet-rename.
* [DOCS] Fix Gini and Entropy scaladocs in context of multiclass classification | Wojciech Jurczyk | 2016-06-15 | 2 | -3/+2
The PR changes outdated scaladocs for the Gini and Entropy classes. Since PR #886 Spark has supported multiclass classification, but the docs only describe binary classification. Author: Wojciech Jurczyk <wojciech.jurczyk@codilime.com> Closes #11252 from wjur/wjur/docs_multiclass.
* [SPARK-15945][MLLIB] Conversion between old/new vector columns in a DataFrame (Scala/Java) | Xiangrui Meng | 2016-06-14 | 3 | -8/+218
## What changes were proposed in this pull request? This PR provides conversion utils between old/new vector columns in a DataFrame, so users can use them to migrate their datasets and pipelines manually. The methods are implemented under `MLUtils` and called `convertVectorColumnsToML` and `convertVectorColumnsFromML`. Both take a DataFrame and a list of vector columns to be converted. It is a no-op on vector columns that are already converted. A warning message is logged if actual conversion happens. This is the first sub-task under SPARK-15944 to make it easier to migrate existing pipelines to Spark 2.0. ## How was this patch tested? Unit tests in Scala and Java. cc: yanboliang Author: Xiangrui Meng <meng@databricks.com> Closes #13662 from mengxr/SPARK-15945.
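A minimal sketch of the migration call described above; the DataFrame `dataset` and the column name are placeholders:
```scala
import org.apache.spark.mllib.util.MLUtils

// converts the old mllib Vector column "features" to the new ml Vector type;
// columns already using the new type are left untouched (no-op)
val migrated = MLUtils.convertVectorColumnsToML(dataset, "features")
```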
* [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python | Liang-Chi Hsieh | 2016-06-13 | 3 | -241/+364
## What changes were proposed in this pull request? Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of the new vector/matrix under `spark.ml.python` instead. ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #13219 from viirya/pyspark-pickler-ml.
* [SPARK-15892][ML] Incorrectly merged AFTAggregator with zero total count | hyukjinkwon | 2016-06-12 | 2 | -1/+13
## What changes were proposed in this pull request? Currently, `AFTAggregator` is not being merged correctly. For example, if there is any single empty partition in the data, this creates an `AFTAggregator` with zero total count which causes the exception below: `IllegalArgumentException: u'requirement failed: The number of instances should be greater than 0.0, but got 0.'` Please see [AFTSurvivalRegression.scala#L573-L575](https://github.com/apache/spark/blob/6ecedf39b44c9acd58cdddf1a31cf11e8e24428c/mllib/src/main/scala/org/apache/spark/ml/regression/AFTSurvivalRegression.scala#L573-L575) as well. Just to be clear, the python example `aft_survival_regression.py` seems to use 5 rows, so if there are more than 5 partitions it throws the exception above, since the empty partitions result in an incorrectly merged `AFTAggregator`. Executing `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py` on a machine with more than 5 CPU cores fails because it creates tasks with some empty partitions under default configurations (AFAIK, the parallelism level is set to the number of CPU cores). ## How was this patch tested? A unit test in `AFTSurvivalRegressionSuite.scala` and manual testing with `bin/spark-submit examples/src/main/python/ml/aft_survival_regression.py`. Author: hyukjinkwon <gurwls223@gmail.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Closes #13619 from HyukjinKwon/SPARK-15892.
* [SPARK-15654][SQL] fix non-splittable files for text based file formats | Davies Liu | 2016-06-10 | 1 | -1/+1
## What changes were proposed in this pull request? Currently, we always split files when they are bigger than maxSplitBytes, but Hadoop's LineRecordReader does not respect the splits for compressed files correctly; we should have an API for FileFormat to check whether a file can be split or not. This PR is based on #13442, closes #13442. ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #13531 from davies/fix_split.
* [SPARK-15875] Try to use Seq.isEmpty and Seq.nonEmpty instead of Seq.length == 0 and Seq.length > 0 | wangyang | 2016-06-10 | 1 | -1/+1
## What changes were proposed in this pull request? In Scala, immutable.List.length is an expensive operation, so we should avoid using Seq.length == 0 or Seq.length > 0 and use Seq.isEmpty and Seq.nonEmpty instead. ## How was this patch tested? Existing tests. Author: wangyang <wangyang@haizhi.com> Closes #13601 from yangw1234/isEmpty.
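A small illustration of why the replacement matters on a linked-list-backed Seq; variable names and sizes are illustrative:
```scala
val xs: Seq[Int] = List.fill(1000000)(1)

// walks the entire list to count elements before comparing
if (xs.length > 0) println("non-empty (O(n) check)")

// answers after looking at the head only
if (xs.nonEmpty) println("non-empty (O(1) check)")
```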
* [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API | Bryan Cutler | 2016-06-10 | 2 | -2/+14
## What changes were proposed in this pull request? Adding __str__ to RFormula and its model, which will show the set formula param and the resolved formula. This is currently present in the Scala API and was found missing in PySpark during the Spark 2.0 coverage review. ## How was this patch tested? Ran pyspark-ml tests locally. Author: Bryan Cutler <cutlerb@gmail.com> Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
* [SPARK-15793][ML] Add maxSentenceLength for ml.Word2Vec | yinxusen | 2016-06-08 | 2 | -0/+20
## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-15793 Word2Vec in the ML package should have a maxSentenceLength method for feature parity. ## How was this patch tested? Tested with Spark unit tests. Author: yinxusen <yinxusen@gmail.com> Closes #13536 from yinxusen/SPARK-15793.
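A minimal sketch of the new param in use, assuming it follows the standard ml setter convention; column names and values are illustrative:
```scala
import org.apache.spark.ml.feature.Word2Vec

val word2Vec = new Word2Vec()
  .setInputCol("text")           // column of Seq[String] sentences
  .setOutputCol("result")
  .setVectorSize(100)
  .setMaxSentenceLength(1000)    // sentences longer than this are split up
```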
* [SPARK-13590][ML][DOC] Document spark.ml LiR, LoR and AFTSurvivalRegression behavior difference | Yanbo Liang | 2016-06-07 | 3 | -1/+22
## What changes were proposed in this pull request? When fitting `LinearRegressionModel` (by the "l-bfgs" solver) and `LogisticRegressionModel` without intercept on a dataset with a constant nonzero column, spark.ml produces the same model as R glmnet but a different one from LIBSVM. When fitting `AFTSurvivalRegressionModel` without intercept on a dataset with a constant nonzero column, spark.ml produces a different model compared with R survival::survreg. We should output a warning message and clarify this condition in the documentation. ## How was this patch tested? Document change, no unit test. cc mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #12731 from yanboliang/spark-13590.
* [SPARK-15721][ML] Make DefaultParamsReadable, DefaultParamsWritable public | Joseph K. Bradley | 2016-06-06 | 1 | -3/+41
## What changes were proposed in this pull request? Made DefaultParamsReadable and DefaultParamsWritable public. Also added relevant doc and annotations. Added UnaryTransformerExample to demonstrate use of UnaryTransformer and DefaultParamsReadable/Writable. ## How was this patch tested? Wrote an example making use of the now-public APIs; compiled and ran it locally. Author: Joseph K. Bradley <joseph@databricks.com> Closes #13461 from jkbradley/defaultparamswritable.
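A sketch of the pattern this makes possible: a params-only custom transformer gains save/load support just by mixing in the now-public traits. The `Squarer` class below is hypothetical, not part of Spark:
```scala
import org.apache.spark.ml.UnaryTransformer
import org.apache.spark.ml.util.{DefaultParamsReadable, DefaultParamsWritable, Identifiable}
import org.apache.spark.sql.types.{DataType, DoubleType}

// a toy transformer that squares a Double column; persistence comes for free
class Squarer(override val uid: String)
  extends UnaryTransformer[Double, Double, Squarer] with DefaultParamsWritable {

  def this() = this(Identifiable.randomUID("squarer"))

  override protected def createTransformFunc: Double => Double = x => x * x

  override protected def outputDataType: DataType = DoubleType
}

// the companion object supplies Squarer.load(path) via DefaultParamsReadable
object Squarer extends DefaultParamsReadable[Squarer]
```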
* [MINOR] Fix Typos 'an -> a' | Zheng RuiFeng | 2016-06-06 | 9 | -10/+10
## What changes were proposed in this pull request? `an -> a`. Use commands like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one. ## How was this patch tested? Manual tests. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13515 from zhengruifeng/an_a.
* [SPARK-15748][SQL] Replace inefficient foldLeft() call with flatMap() in PartitionStatistics | Josh Rosen | 2016-06-05 | 1 | -1/+1
`PartitionStatistics` uses `foldLeft` and list concatenation (`++`) to flatten an iterator of lists, but this is extremely inefficient compared to simply doing `flatMap`/`flatten` because it performs many unnecessary object allocations. Simply replacing this `foldLeft` with a `flatMap` results in decent performance gains when constructing PartitionStatistics instances for tables with many columns. This patch fixes this and also makes two similar changes in MLlib and streaming to try to fix all known occurrences of this pattern. Author: Josh Rosen <joshrosen@databricks.com> Closes #13491 from JoshRosen/foldleft-to-flatmap.
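For illustration, the pattern being replaced versus the cheaper equivalent, on a placeholder collection:
```scala
val nested: Seq[Seq[Int]] = Seq(Seq(1, 2), Seq(3), Seq(4, 5))

// repeated list concatenation allocates a new sequence on every step
val slow = nested.foldLeft(Seq.empty[Int])(_ ++ _)

// a single pass with no intermediate concatenations
val fast = nested.flatten
```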
* [SPARK-15770][ML] Annotation audit for Experimental and DeveloperApi | Zheng RuiFeng | 2016-06-05 | 13 | -3/+50
## What changes were proposed in this pull request? 1. Remove `:: Experimental ::` comments for non-experimental APIs. 2. Add `:: Experimental ::` comments for experimental APIs. 3. Add `:: DeveloperApi ::` comments for DeveloperApi APIs. ## How was this patch tested? Manual tests. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13514 from zhengruifeng/del_experimental.
* [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score | Ruifeng Zheng | 2016-06-04 | 2 | -8/+6
## What changes were proposed in this pull request? 1. Remove precision and recall from `ml.MulticlassClassificationEvaluator`. 2. Update the user guide for `mllib.weightedFMeasure`. ## How was this patch tested? Local build. Author: Ruifeng Zheng <ruifengz@foxmail.com> Closes #13390 from zhengruifeng/clarify_f1.
* [SPARK-15494][SQL] encoder code cleanup | Wenchen Fan | 2016-06-03 | 1 | -1/+1
## What changes were proposed in this pull request? Our encoder framework has evolved a lot; this PR tries to clean up the code to make it more readable and emphasise the concept that an encoder should be used as a container of serde expressions.
1. Move validation logic to the analyzer instead of the encoder.
2. Only have a `resolveAndBind` method in the encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore.
3. `Dataset` doesn't need to keep a resolved encoder, as there is no such concept anymore. A bound encoder is still needed to do serialization outside of the query framework.
4. Using `BoundReference` to represent an unresolved field in a deserializer expression is kind of weird; this PR adds a `GetColumnByOrdinal` for this purpose. (Serializer expressions still use `BoundReference`; we can replace it with `GetColumnByOrdinal` in follow-ups.)
## How was this patch tested? Existing tests. Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #13269 from cloud-fan/clean-encoder.
* [SPARK-15740][MLLIB] ignore big model load / save in Word2VecSuite | Xiangrui Meng | 2016-06-02 | 1 | -1/+1
## What changes were proposed in this pull request? andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed. This PR disables the test. I will leave the JIRA open for a proper fix. ## How was this patch tested? No new features. Author: Xiangrui Meng <meng@databricks.com> Closes #13478 from mengxr/SPARK-15740.
* [SPARK-15668][ML] ml.feature: update check schema to avoid confusion when users use MLlib.Vector as input type | Yuhao Yang | 2016-06-02 | 4 | -36/+25
## What changes were proposed in this pull request? ml.feature: update schema checks to avoid confusion when users use an MLlib Vector as the input type. ## How was this patch tested? Existing unit tests. Author: Yuhao Yang <yuhao.yang@intel.com> Closes #13411 from hhbyyh/schemaCheck.
* [MINOR] clean up style for storage param setters in ALS | Nick Pentreath | 2016-06-02 | 1 | -6/+2
Clean up style for param setter methods in ALS to match the standard style and the other setters in the class (this is an artefact of one of my previous PRs that wasn't cleaned up). ## How was this patch tested? Existing tests - no functionality change. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13480 from MLnick/als-param-minor-cleanup.
* [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature | Yanbo Liang | 2016-06-01 | 4 | -16/+10
## What changes were proposed in this pull request? ML 2.0 QA: Scala APIs audit for ml.feature. Mainly includes:
* Remove seed for `QuantileDiscretizer`, since we use `approxQuantile` to produce bins and `seed` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.
## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13410 from yanboliang/spark-15587.
* [SPARK-15664][MLLIB] Replace FileSystem.get(conf) with path.getFileSystem(conf) when removing CheckpointFile in MLlib | Lianhui Wang | 2016-06-01 | 5 | -20/+31
## What changes were proposed in this pull request? If sparkContext.setCheckpointDir points to a directory on a FileSystem other than the default one, removing a CheckpointFile in MLlib throws an exception. So we should always get the FileSystem from the Path to avoid the wrong-FS problem. ## How was this patch tested? N/A Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #13408 from lianhuiwang/SPARK-15664.
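A sketch of the safer pattern described above; `checkpointFile` and `sc` are placeholders for the checkpoint path string and the active SparkContext:
```scala
import org.apache.hadoop.fs.Path

// resolve the FileSystem from the checkpoint path itself, so a non-default
// scheme (e.g. another HDFS namenode or s3a) is honoured when deleting
val path = new Path(checkpointFile)
val fs = path.getFileSystem(sc.hadoopConfiguration)
if (fs.exists(path)) {
  fs.delete(path, true) // recursive delete
}
```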
* [SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable | Dongjoon Hyun | 2016-05-31 | 22 | -51/+49
## What changes were proposed in this pull request? This PR changes function `SparkSession.builder.sparkContext(..)` from **private[sql]** into **private[spark]**, and uses it where applicable, like the following.
```
- val spark = SparkSession.builder().config(sc.getConf).getOrCreate()
+ val spark = SparkSession.builder().sparkContext(sc).getOrCreate()
```
## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13365 from dongjoon-hyun/SPARK-15618.
* [MINOR] Resolve a number of miscellaneous build warnings | Sean Owen | 2016-05-29 | 2 | -1/+5
## What changes were proposed in this pull request? This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately. ## How was this patch tested? Jenkins. Author: Sean Owen <sowen@cloudera.com> Closes #13377 from srowen/BuildWarnings.
* [SPARK-15610][ML] update error message for k in pca | Zheng RuiFeng | 2016-05-27 | 2 | -4/+3
## What changes were proposed in this pull request? Fix the wrong bound of `k` in `PCA`: `require(k <= sources.first().size, ...)` -> `require(k < sources.first().size, ...)`. BTW, remove an unused import in `ml.ElementwiseProduct`. ## How was this patch tested? Manual tests. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13356 from zhengruifeng/fix_pca.