spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	[SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions ↵	Sean Owen	2016-07-30	4	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	whose side effects are required ## What changes were proposed in this pull request? Use foreach/for instead of map where operation requires execution of body, not actually defining a transformation ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #14332 from srowen/SPARK-16694.
*	[SPARK-16750][ML] Fix GaussianMixture training failed due to feature column ↵	Yanbo Liang	2016-07-29	10	-7/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	type mistake ## What changes were proposed in this pull request? ML ```GaussianMixture``` training failed due to feature column type mistake. The feature column type should be ```ml.linalg.VectorUDT``` but got ```mllib.linalg.VectorUDT``` by mistake. See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug. Why the unit tests did not complain this errors? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` firstly during ```fit``` or ```transform```. I will also add this function to all estimators/transformers who missed in this PR. ## How was this patch tested? No new tests, should pass existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14378 from yanboliang/spark-16750.
*	[SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc	krishnakalyan3	2016-07-27	1	-2/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Updated ML pipeline Cross Validation Scaladoc & PyDoc. ## How was this patch tested? Documentation update (If this patch involves UI changes, please attach a screenshot; otherwise, remove this) Author: krishnakalyan3 <krishnakalyan3@gmail.com> Closes #13894 from krishnakalyan3/kfold-cv.
*	[MINOR][ML] Fix some mistake in LinearRegression formula.	Yanbo Liang	2016-07-27	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix some mistake in ```LinearRegression``` formula. ## How was this patch tested? Documents change, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14369 from yanboliang/LiR-formula.
*	[SPARK-16697][ML][MLLIB] improve LDA submitMiniBatch method to avoid ↵	WeichenXu	2016-07-26	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	redundant RDD computation ## What changes were proposed in this pull request? In `LDAOptimizer.submitMiniBatch`, do persist on `stats: RDD[(BDM[Double], List[BDV[Double]])]` and also move the place of unpersisting `expElogbetaBc` broadcast variable, to avoid the `expElogbetaBc` broadcast variable to be unpersisted too early, and update previous `expElogbetaBc.unpersist()` into `expElogbetaBc.destroy(false)` ## How was this patch tested? Existing test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14335 from WeichenXu123/improve_LDA.
*	[SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default ↵	WeichenXu	2016-07-25	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	to 1e-6 ## What changes were proposed in this pull request? replace ANN convergence tolerance param default from 1e-4 to 1e-6 so that it will be the same with other algorithms in MLLib which use LBFGS as optimizer. ## How was this patch tested? Existing Test. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14286 from WeichenXu123/update_ann_tol.
*	[SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug	WeichenXu	2016-07-23	2	-28/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? renaming var names to make code more clear: nnz => weightSum weightSum => totalWeightSum and add a new member vector `nnz` (not `nnz` in previous code, which renamed to `weightSum`) to count each dimensions non-zero value number. using `nnz` which I added above instead of `weightSum` when calculating min/max so that it fix several numerical error in some extreme case. ## How was this patch tested? A new testcase added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14216 from WeichenXu123/multivarOnlineSummary.
*	[SPARK-16440][MLLIB] Destroy broadcasted variables even on driver	Anthony Truchet	2016-07-20	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Forgotten broadcasted variables were persisted into a previous #PR 14153). This PR turns those `unpersist()` into `destroy()` so that memory is freed even on the driver. ## How was this patch tested? Unit Tests in Word2VecSuite were run locally. This contribution is done on behalf of Criteo, according to the terms of the Apache license 2.0. Author: Anthony Truchet <a.truchet@criteo.com> Closes #14268 from AnthonyTruchet/SPARK-16440.
*	[SPARK-16494][ML] Upgrade breeze version to 0.12	Yanbo Liang	2016-07-19	9	-30/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? breeze 0.12 has been released for more than half a year, and it brings lots of new features, performance improvement and bug fixes. One of the biggest features is ```LBFGS-B``` which is an implementation of ```LBFGS``` with box constraints and much faster for some special case. We would like to implement Huber loss function for ```LinearRegression``` ([SPARK-3181](https://issues.apache.org/jira/browse/SPARK-3181)) and it requires ```LBFGS-B``` as the optimization solver. So we should bump up the dependent breeze version to 0.12. For more features, improvements and bug fixes of breeze 0.12, you can refer the following link: https://groups.google.com/forum/#!topic/scala-breeze/nEeRi_DcY5c ## How was this patch tested? No new tests, should pass the existing ones. Author: Yanbo Liang <ybliang8@gmail.com> Closes #14150 from yanboliang/spark-16494.
*	[SPARK-16600][MLLIB] fix some latex formula syntax error	WeichenXu	2016-07-19	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `\partial\x` ==> `\partial x` `har{x_i}` ==> `hat{x_i}` ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14246 from WeichenXu123/fix_formular_err.
*	[SPARK-16535][BUILD] In pom.xml, remove groupId which is redundant ↵	Xin Ren	2016-07-19	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	definition and inherited from the parent https://issues.apache.org/jira/browse/SPARK-16535 ## What changes were proposed in this pull request? When I scan through the pom.xml of sub projects, I found this warning as below and attached screenshot ``` Definition of groupId is redundant, because it's inherited from the parent ``` ![screen shot 2016-07-13 at 3 13 11 pm](https://cloud.githubusercontent.com/assets/3925641/16823121/744f893e-4916-11e6-8a52-042f83b9db4e.png) I've tried to remove some of the lines with groupId definition, and the build on my local machine is still ok. ``` <groupId>org.apache.spark</groupId> ``` As I just find now `<maven.version>3.3.9</maven.version>` is being used in Spark 2.x, and Maven-3 supports versionless parent elements: Maven 3 will remove the need to specify the parent version in sub modules. THIS is great (in Maven 3.1). ref: http://stackoverflow.com/questions/3157240/maven-3-worth-it/3166762#3166762 ## How was this patch tested? I've tested by re-building the project, and build succeeded. Author: Xin Ren <iamshrek@126.com> Closes #14189 from keypointt/SPARK-16535.
*	[MINOR][TYPO] fix fininsh typo	WeichenXu	2016-07-18	4	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? fininsh => finish ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14238 from WeichenXu123/fix_fininsh_typo.
*	[SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java	Reynold Xin	2016-07-17	1	-2/+2
\| \| \| \| \| \|	This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python. This patch was originally written by HyukjinKwon. Closes #14236.
*	[SPARK-3359][DOCS] More changes to resolve javadoc 8 errors that will help ↵	Sean Owen	2016-07-16	19	-57/+55
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	unidoc/genjavadoc compatibility ## What changes were proposed in this pull request? These are yet more changes that resolve problems with unidoc/genjavadoc and Java 8. It does not fully resolve the problem, but gets rid of as many errors as we can from this end. ## How was this patch tested? Jenkins build of docs Author: Sean Owen <sowen@cloudera.com> Closes #14221 from srowen/SPARK-3359.3.
*	[SPARK-16426][MLLIB] Fix bug that caused NaNs in IsotonicRegression	z001qdp	2016-07-15	2	-3/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fixed a bug that caused `NaN`s in `IsotonicRegression`. The problem occurs when training rows with the same feature value but different labels end up on different partitions. This patch changes a `sortBy` call to a `partitionBy(RangePartitioner)` followed by a `mapPartitions(sortBy)` in order to ensure that all rows with the same feature value end up on the same partition. ## How was this patch tested? Added a unit test. Author: z001qdp <Nicholas.Eggert@target.com> Closes #14140 from neggert/SPARK-16426-isotonic-nan.
*	[SPARK-16500][ML][MLLIB][OPTIMIZER] add LBFGS convergence warning for all ↵	WeichenXu	2016-07-14	3	-0/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	used place in MLLib ## What changes were proposed in this pull request? Add warning_for the following case when LBFGS training not actually convergence: 1) LogisticRegression 2) AFTSurvivalRegression 3) LBFGS algorithm wrapper in mllib package ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14157 from WeichenXu123/add_lbfgs_convergence_warning_for_all_used_place.
*	[SPARK-16485][ML][DOC] Fix privacy of GLM members, rename sqlDataTypes for ↵	Joseph K. Bradley	2016-07-13	6	-14/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ML, doc fixes ## What changes were proposed in this pull request? Fixing issues found during 2.0 API checks: * GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should not be exposed * sqlDataTypes: name does not follow conventions. Do we need to expose it? * Evaluator: inconsistent doc between evaluate and isLargerBetter * MinMaxScaler: math rendering --> hard to make it great, but I'll change it a little * GeneralizedLinearRegressionSummary: aic doc is incorrect --> will change to use more common name ## How was this patch tested? Existing unit tests. Docs generated locally. (MinMaxScaler is improved a tiny bit.) Author: Joseph K. Bradley <joseph@databricks.com> Closes #14187 from jkbradley/final-api-check-2.0.
*	[SPARK-14812][ML][MLLIB][PYTHON] Experimental, DeveloperApi annotation audit ↵	Joseph K. Bradley	2016-07-13	60	-271/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	for ML ## What changes were proposed in this pull request? General decisions to follow, except where noted: * spark.mllib, pyspark.mllib: Remove all Experimental annotations. Leave DeveloperApi annotations alone. * spark.ml, pyspark.ml Annotate Estimator-Model pairs of classes and companion objects the same way. For all algorithms marked Experimental with Since tag <= 1.6, remove Experimental annotation. ** For all algorithms marked Experimental with Since tag = 2.0, leave Experimental annotation. * DeveloperApi annotations are left alone, except where noted. * No changes to which types are sealed. Exceptions where I am leaving items Experimental in spark.ml, pyspark.ml, mainly because the items are new: * Model Summary classes * MLWriter, MLReader, MLWritable, MLReadable * Evaluator and subclasses: There is discussion of changes around evaluating multiple metrics at once for efficiency. * RFormula: Its behavior may need to change slightly to match R in edge cases. * AFTSurvivalRegression * MultilayerPerceptronClassifier DeveloperApi changes: * ml.tree.Node, ml.tree.Split, and subclasses should no longer be DeveloperApi ## How was this patch tested? N/A Note to reviewers: * spark.ml.clustering.LDA underwent significant changes (additional methods), so let me know if you want me to leave it Experimental. * Be careful to check for cases where a class should no longer be Experimental but has an Experimental method, val, or other feature. I did not find such cases, but please verify. Author: Joseph K. Bradley <joseph@databricks.com> Closes #14147 from jkbradley/experimental-audit.
*	[SPARK-16469] enhanced simulate multiply	oraviv	2016-07-13	1	-4/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We have a use case of multiplying very big sparse matrices. we have about 1000x1000 distributed block matrices multiplication and the simulate multiply goes like O(n^4) (n being 1000). it takes about 1.5 hours. We modified it slightly with classical hashmap and now run in about 30 seconds O(n^2). ## How was this patch tested? We have added a performance test and verified the reduced time. Author: oraviv <oraviv@paypal.com> Closes #14068 from uzadude/master.
*	[SPARK-16440][MLLIB] Undeleted broadcast variables in Word2Vec causing OoM ↵	Sean Owen	2016-07-13	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	for long runs ## What changes were proposed in this pull request? Unpersist broadcasted vars in Word2Vec.fit for more timely / reliable resource cleanup ## How was this patch tested? Jenkins tests Author: Sean Owen <sowen@cloudera.com> Closes #14153 from srowen/SPARK-16440.
*	[SPARK-16470][ML][OPTIMIZER] Check linear regression training whether ↵	WeichenXu	2016-07-12	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	actually reach convergence and add warning if not ## What changes were proposed in this pull request? In `ml.regression.LinearRegression`, it use breeze `LBFGS` and `OWLQN` optimizer to do data training, but do not check whether breeze's optimizer returned result actually reached convergence. The `LBFGS` and `OWLQN` optimizer in breeze finish iteration may result the following situations: 1) reach max iteration number 2) function reach value convergence 3) objective function stop improving 4) gradient reach convergence 5) search failed(due to some internal numerical error) I add warning printing code so that if the iteration result is (1) or (3) or (5) in above, it will print a warning with respective reason string. ## How was this patch tested? Manual. Author: WeichenXu <WeichenXu123@outlook.com> Closes #14122 from WeichenXu123/add_lr_not_convergence_warn.
*	[MINOR][ML] update comment where is inconsistent with code in ↵	WeichenXu	2016-07-12	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ml.regression.LinearRegression ## What changes were proposed in this pull request? In `train` method of `ml.regression.LinearRegression` when handling situation `std(label) == 0` the code replace `std(label)` with `mean(label)` but the relative comment is inconsistent, I update it. ## How was this patch tested? N/A Author: WeichenXu <WeichenXu123@outlook.com> Closes #14121 from WeichenXu123/update_lr_comment.
*	[SPARK-16477] Bump master version to 2.1.0-SNAPSHOT	Reynold Xin	2016-07-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number. ## How was this patch tested? N/A Author: Reynold Xin <rxin@databricks.com> Closes #14130 from rxin/SPARK-16477.
*	[SPARK-16369][MLLIB] tallSkinnyQR of RowMatrix should aware of empty partition	Xusen Yin	2016-07-08	2	-2/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? tallSkinnyQR of RowMatrix should aware of empty partition, which could cause exception from Breeze qr decomposition. See the [archived dev mail](https://mail-archives.apache.org/mod_mbox/spark-dev/201510.mbox/%3CCAF7ADNrycvPL3qX-VZJhq4OYmiUUhoscut_tkOm63Cm18iK1tQmail.gmail.com%3E) for more details. ## How was this patch tested? Scala unit test. Author: Xusen Yin <yinxusen@gmail.com> Closes #14049 from yinxusen/SPARK-16369.
*	[SPARK-16372][MLLIB] Retag RDD to tallSkinnyQR of RowMatrix	Xusen Yin	2016-07-07	3	-2/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The following Java code because of type erasing: ```Java JavaRDD<Vector> rows = jsc.parallelize(...); RowMatrix mat = new RowMatrix(rows.rdd()); QRDecomposition<RowMatrix, Matrix> result = mat.tallSkinnyQR(true); ``` We should use retag to restore the type to prevent the following exception: ```Java java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to [Lorg.apache.spark.mllib.linalg.Vector; ``` ## How was this patch tested? Java unit test Author: Xusen Yin <yinxusen@gmail.com> Closes #14051 from yinxusen/SPARK-16372.
*	[SPARK-15740][MLLIB] Word2VecSuite "big model load / save" caused OOM in ↵	tmnd1991	2016-07-06	2	-10/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	maven jenkins builds ## What changes were proposed in this pull request? "test big model load / save" in Word2VecSuite, lately resulted into OOM. Therefore we decided to make the partitioning adaptive (not based on spark default "spark.kryoserializer.buffer.max" conf) and then testing it using a small buffer size in order to trigger partitioning without allocating too much memory for the test. ## How was this patch tested? It was tested running the following unit test: org.apache.spark.mllib.feature.Word2VecSuite Author: tmnd1991 <antonio.murgia2@studio.unibo.it> Closes #13509 from tmnd1991/SPARK-15740.
*	[SPARK-16307][ML] Add test to verify the predicted variances of a DT on toy data	MechCoder	2016-07-06	2	-0/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The current tests assumes that `impurity.calculate()` returns the variance correctly. It should be better to make the tests independent of this assumption. In other words verify that the variance computed equals the variance computed manually on a small tree. ## How was this patch tested? The patch is a test.... Author: MechCoder <mks542@nyu.edu> Closes #13981 from MechCoder/dt_variance.
*	[SPARK-16249][ML] Change visibility of Object ml.clustering.LDA to public ↵	Yuhao Yang	2016-07-06	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	for loading ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16249 Change visibility of Object ml.clustering.LDA to public for loading, thus users can invoke LDA.load("path"). ## How was this patch tested? existing ut and manually test for load ( saved with current code) Author: Yuhao Yang <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13941 from hhbyyh/ldapublic.
*	[SPARK-14608][ML] transformSchema needs better documentation	Yuhao Yang	2016-06-30	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-14608 PipelineStage.transformSchema currently has minimal documentation. It should have more to explain it can: check schema check parameter interactions ## How was this patch tested? unit test Author: Yuhao Yang <hhbyyh@gmail.com> Author: Yuhao Yang <yuhao.yang@intel.com> Closes #12384 from hhbyyh/transformSchemaDoc.
*	[SPARK-16241][ML] model loading backward compatibility for ml NaiveBayes	zlpmichelle	2016-06-30	1	-4/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? model loading backward compatibility for ml NaiveBayes ## How was this patch tested? existing ut and manual test for loading models saved by Spark 1.6. Author: zlpmichelle <zlpmichelle@gmail.com> Closes #13940 from zlpmichelle/naivebayes.
*	[SPARK-15858][ML] Fix calculating error by tree stack over flow prob…	Mahmoud Rawas	2016-06-29	2	-43/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? What changes were proposed in this pull request? Improving evaluateEachIteration function in mllib as it fails when trying to calculate error by tree for a model that has more than 500 trees ## How was this patch tested? the batch tested on productions data set (2K rows x 2K features) training a gradient boosted model without validation with 1000 maxIteration settings, then trying to produce the error by tree, the new patch was able to perform the calculation within 30 seconds, while previously it was take hours then fail. PS: It would be better if this PR can be cherry picked into release branches 1.6.1 and 2.0 Author: Mahmoud Rawas <mhmoudr@gmail.com> Author: Mahmoud Rawas <Mahmoud.Rawas@quantium.com.au> Closes #13624 from mhmoudr/SPARK-15858.master.
*	[SPARK-16245][ML] model loading backward compatibility for ml.feature.PCA	Yanbo Liang	2016-06-28	1	-10/+8
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? model loading backward compatibility for ml.feature.PCA. ## How was this patch tested? existing ut and manual test for loading models saved by Spark 1.6. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13937 from yanboliang/spark-16245.
*	[SPARK-16242][MLLIB][PYSPARK] Conversion between old/new matrix columns in a ↵	Yanbo Liang	2016-06-28	1	-0/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	DataFrame (Python) ## What changes were proposed in this pull request? This PR implements python wrappers for #13888 to convert old/new matrix columns in a DataFrame. ## How was this patch tested? Doctest in python. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13935 from yanboliang/spark-16242.
*	[SPARK-16187][ML] Implement util method for ML Matrix conversion in scala/java	Yuhao Yang	2016-06-27	4	-7/+187
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16187 This is to provide conversion utils between old/new vector columns in a DataFrame. So users can use it to migrate their datasets and pipelines manually. ## How was this patch tested? java and scala ut Author: Yuhao Yang <yuhao.yang@intel.com> Closes #13888 from hhbyyh/matComp.
*	[MLLIB] org.apache.spark.mllib.util.SVMDataGenerator generates ↵	José Antonio	2016-06-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ArrayIndexOutOfBoundsException. I have found the bug and tested the solution. ## What changes were proposed in this pull request? Just adjust the size of an array in line 58 so it does not cause an ArrayOutOfBoundsException in line 66. ## How was this patch tested? Manual tests. I have recompiled the entire project with the fix, it has been built successfully and I have run the code, also with good results. line 66: val yD = blas.ddot(trueWeights.length, x, 1, trueWeights, 1) + rnd.nextGaussian() * 0.1 crashes because trueWeights has length "nfeatures + 1" while "x" has length "features", and they should have the same length. To fix this just make trueWeights be the same length as x. I have recompiled the project with the change and it is working now: [spark-1.6.1]$ spark-submit --master local[*] --class org.apache.spark.mllib.util.SVMDataGenerator mllib/target/spark-mllib_2.11-1.6.1.jar local /home/user/test And it generates the data successfully now in the specified folder. Author: José Antonio <joseanmunoz@gmail.com> Closes #13895 from j4munoz/patch-2.
*	[SPARK-16133][ML] model loading backward compatibility for ml.feature	Yuhao Yang	2016-06-23	3	-5/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? model loading backward compatibility for ml.feature, ## How was this patch tested? existing ut and manual test for loading 1.6 models. Author: Yuhao Yang <yuhao.yang@intel.com> Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13844 from hhbyyh/featureComp.
*	[SPARK-16177][ML] model loading backward compatibility for ml.regression	Yuhao Yang	2016-06-23	2	-7/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16177 model loading backward compatibility for ml.regression ## How was this patch tested? existing ut and manual test for loading 1.6 models. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13879 from hhbyyh/regreComp.
*	[SPARK-16130][ML] model loading backward compatibility for ↵	Yuhao Yang	2016-06-23	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	ml.classfication.LogisticRegression ## What changes were proposed in this pull request? jira: https://issues.apache.org/jira/browse/SPARK-16130 model loading backward compatibility for ml.classfication.LogisticRegression ## How was this patch tested? existing ut and manual test for loading old models. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #13841 from hhbyyh/lrcomp.
*	[SPARK-16154][MLLIB] Update spark.ml and spark.mllib package docs	Xiangrui Meng	2016-06-23	5	-9/+72
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Since we decided to switch spark.mllib package into maintenance mode in 2.0, it would be nice to update the package docs to reflect this change. ## How was this patch tested? Manually checked generated APIs. Author: Xiangrui Meng <meng@databricks.com> Closes #13859 from mengxr/SPARK-16154.
*	[SPARK-16153][MLLIB] switch to multi-line doc to avoid a genjavadoc bug	Xiangrui Meng	2016-06-22	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We recently deprecated setLabelCol in ChiSqSelectorModel (#13823): ~~~scala /** group setParam / Since("1.6.0") deprecated("labelCol is not used by ChiSqSelectorModel.", "2.0.0") def setLabelCol(value: String): this.type = set(labelCol, value) ~~~ This unfortunately hit a genjavadoc bug and broken doc generation. This is the generated Java code: ~~~java /* group setParam / public org.apache.spark.ml.feature.ChiSqSelectorModel setOutputCol (java.lang.String value) { throw new RuntimeException(); } * deprecated labelCol is not used by ChiSqSelectorModel. Since 2.0.0. */ public org.apache.spark.ml.feature.ChiSqSelectorModel setLabelCol (java.lang.String value) { throw new RuntimeException(); } ~~~ Switching to multiline is a workaround. Author: Xiangrui Meng <meng@databricks.com> Closes #13855 from mengxr/SPARK-16153.
*	[MINOR][MLLIB] DefaultParamsReadable/Writable should be DeveloperApi	Xiangrui Meng	2016-06-22	1	-8/+5
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `DefaultParamsReadable/Writable` are not user-facing. Only developers who implement `Transformer/Estimator` would use it. So this PR changes the annotation to `DeveloperApi`. Author: Xiangrui Meng <meng@databricks.com> Closes #13828 from mengxr/default-readable-should-be-developer-api.
*	[SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg	Nick Pentreath	2016-06-22	12	-31/+31
\| \| \| \| \| \| \| \| \| \| \| \|	[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them. ## How was this patch tested? Existing unit tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
*	[SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs	Holden Karau	2016-06-22	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc. ## How was this patch tested? Built docs locally & PySpark SQL tests Author: Holden Karau <holden@us.ibm.com> Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
*	[SPARK-15644][MLLIB][SQL] Replace SQLContext with SparkSession in MLlib	gatorsmile	2016-06-21	30	-80/+99
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	#### What changes were proposed in this pull request? This PR is to use the latest `SparkSession` to replace the existing `SQLContext` in `MLlib`. `SQLContext` is removed from `MLlib`. Also fix a test case issue in `BroadcastJoinSuite`. BTW, `SQLContext` is not being used in the `MLlib` test suites. #### How was this patch tested? Existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #13380 from gatorsmile/sqlContextML.
*	[MINOR][MLLIB] deprecate setLabelCol in ChiSqSelectorModel	Xiangrui Meng	2016-06-21	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Deprecate `labelCol`, which is not used by ChiSqSelectorModel. Author: Xiangrui Meng <meng@databricks.com> Closes #13823 from mengxr/deprecate-setLabelCol-in-ChiSqSelectorModel.
*	[SPARK-16118][MLLIB] add getDropLast to OneHotEncoder	Xiangrui Meng	2016-06-21	2	-1/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We forgot the getter of `dropLast` in `OneHotEncoder` ## How was this patch tested? unit test Author: Xiangrui Meng <meng@databricks.com> Closes #13821 from mengxr/SPARK-16118.
*	[SPARK-16117][MLLIB] hide LibSVMFileFormat and move its doc to LibSVMDataSource	Xiangrui Meng	2016-06-21	2	-38/+59
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? LibSVMFileFormat implements data source for LIBSVM format. However, users do not really need to call its APIs to use it. So we should hide it in the public API docs. The main issue is that we still need to put the documentation and example code somewhere. The proposal it to have a dummy class to hold the documentation, as a workaround to https://issues.scala-lang.org/browse/SI-8124. ## How was this patch tested? Manually checked the generated API doc and tested loading LIBSVM data. Author: Xiangrui Meng <meng@databricks.com> Closes #13819 from mengxr/SPARK-16117.
*	[MINOR][MLLIB] move setCheckpointInterval to non-expert setters	Xiangrui Meng	2016-06-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? The `checkpointInterval` is a non-expert param. This PR moves its setter to non-expert group. Author: Xiangrui Meng <meng@databricks.com> Closes #13813 from mengxr/checkpoint-non-expert.
*	[SPARK-15177][.1][R] make SparkR model params and default values consistent ↵	Xiangrui Meng	2016-06-21	2	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	with MLlib ## What changes were proposed in this pull request? This PR is a subset of #13023 by yanboliang to make SparkR model param names and default values consistent with MLlib. I tried to avoid other changes from #13023 to keep this PR minimal. I will send a follow-up PR to improve the documentation. Main changes: * `spark.glm`: epsilon -> tol, maxit -> maxIter * `spark.kmeans`: default k -> 2, default maxIter -> 20, default initMode -> "k-means\|\|" * `spark.naiveBayes`: laplace -> smoothing, default 1.0 ## How was this patch tested? Existing unit tests. Author: Xiangrui Meng <meng@databricks.com> Closes #13801 from mengxr/SPARK-15177.1.
*	[SPARK-10258][DOC][ML] Add @Since annotations to ml.feature	Nick Pentreath	2016-06-21	27	-63/+357
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR adds missing `Since` annotations to `ml.feature` package. Closes #8505. ## How was this patch tested? Existing tests. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13641 from MLnick/add-since-annotations.