spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	[SPARK-9681] [ML] Support R feature interactions in RFormula	Eric Liang	2015-09-25	2	-5/+160
\| \| \| \| \| \| \| \| \| \| \| \|	This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.
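The `:` interaction described in the SPARK-9681 message above can be illustrated with a minimal Python sketch. It is not the Spark `Interaction` transformer: the function names, the output ordering, and the `:`-joined attribute names are assumptions chosen for illustration only.

```python
from itertools import product


def interact(*features):
    """Multiply one value from each input vector, for every combination.

    Each argument is a list of floats (a one-hot encoded category or a
    numeric column wrapped in a list). The ordering follows
    itertools.product and is an assumption of this sketch.
    """
    result = []
    for combo in product(*features):
        value = 1.0
        for x in combo:
            value *= x
        result.append(value)
    return result


def interaction_names(*name_groups):
    """Build hypothetical attribute names by joining the inputs with ':'."""
    return [":".join(combo) for combo in product(*name_groups)]
```

For example, `interact([1.0, 2.0], [3.0, 4.0])` yields one product per pair of input values, and `interaction_names(["a_x", "a_y"], ["b"])` yields the matching labels.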
*	[SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use ↵	Holden Karau	2015-09-23	10	-39/+42
\| \| \| \| \| \| \| \| \| \|	simplified dataframe construction As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those. Author: Holden Karau <holden@pigscanfly.ca> Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
*	[SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegression	Yanbo Liang	2015-09-23	1	-25/+49
\| \| \| \| \| \| \| \|	By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8836 from yanboliang/spark-10686.
*	[SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types	sethah	2015-09-23	11	-15/+38
\| \| \| \| \| \| \| \|	All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility. Author: sethah <seth.hendrickson16@gmail.com> Closes #8675 from sethah/SPARK-9715.
*	[SPARK-10706] [MLLIB] Add java wrapper for random vector rdd	Meihua Wu	2015-09-22	1	-0/+17
\| \| \| \| \| \| \| \| \| \|	Add java wrapper for random vector rdd holdenk srowen Author: Meihua Wu <meihuawu@umich.edu> Closes #8841 from rotationsymmetry/SPARK-10706.
*	[SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance ↵	Feynman Liang	2015-09-21	1	-0/+243
\| \| \| \| \| \| \| \| \| \| \|	testing Implementation of significance testing using Streaming API. Author: Feynman Liang <fliang@databricks.com> Author: Feynman Liang <feynman.liang@gmail.com> Closes #4716 from feynmanliang/ab_testing.
*	[SPARK-9642] [ML] LinearRegression should support weighted data	Meihua Wu	2015-09-21	1	-0/+88
\| \| \| \| \| \| \| \| \| \|	In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting to account for over- or under-sampling. Work in progress. Author: Meihua Wu <meihuawu@umich.edu> Closes #8631 from rotationsymmetry/SPARK-9642.
*	[SPARK-10626] [MLLIB] create java friendly method for random rdd	Holden Karau	2015-09-21	1	-0/+30
\| \| \| \| \| \| \| \|	SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method. Author: Holden Karau <holden@pigscanfly.ca> Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.
*	[SPARK-8518] [ML] Log-linear models for survival analysis	Yanbo Liang	2015-09-17	1	-0/+311
\| \| \| \| \| \| \| \| \|	[Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is the most commonly used and easily parallelized method of survival analysis for censored survival data. It is a log-linear model based on the Weibull distribution of the survival time. Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the predictions. There are different kinds of model prediction; I have selected only the type ```response```, which is the default in R. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8611 from yanboliang/spark-8518.
*	[SPARK-9698] [ML] Add RInteraction transformer for supporting R-style ↵	Eric Liang	2015-09-17	1	-0/+165
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	feature interactions This is a pre-req for supporting the ":" operator in the RFormula feature transformer. Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit mengxr Author: Eric Liang <ekl@databricks.com> Closes #7987 from ericl/interaction.
*	[SPARK-7685] [ML] Apply weights to different samples in Logistic Regression	DB Tsai	2015-09-15	2	-9/+120
\| \| \| \| \| \| \| \| \| \| \|	In fraud detection datasets, almost all the samples are negative while only a couple of them are positive. This type of highly imbalanced data will bias the models toward the negative class, resulting in poor performance. scikit-learn provides a correction allowing users to over-/undersample the samples of each class according to the given weights. In auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into the loss and gradient instead of doing actual over-/undersampling in the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data may be more important, like the training samples from tenured users, while the training samples from new users may be less important. We should be able to provide another "weight: Double" field in the LabeledPoint to weight them differently in the learning algorithm. Author: DB Tsai <dbt@netflix.com> Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com> Closes #7884 from dbtsai/SPARK-7685.
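The idea in the SPARK-7685 message above, multiplying weights into the loss and gradient rather than physically resampling, can be sketched in plain Python. This is an illustration under stated assumptions, not the Spark implementation: the function name and the `(features, label, weight)` tuple layout are hypothetical, labels are assumed to be 0 or 1, and regularization and the intercept are omitted.

```python
import math


def weighted_logistic_loss_and_gradient(coef, samples):
    """Return (loss, gradient) of the weighted binary log loss.

    samples is a list of (features, label, weight) tuples, where
    features is a list of floats and label is 0 or 1.
    """
    loss = 0.0
    grad = [0.0] * len(coef)
    for features, label, weight in samples:
        margin = sum(c * x for c, x in zip(coef, features))
        prob = 1.0 / (1.0 + math.exp(-margin))
        # The weight scales this sample's contribution to the loss ...
        loss += weight * -(label * math.log(prob)
                           + (1 - label) * math.log(1.0 - prob))
        # ... and to the gradient, so no rows need to be duplicated.
        for j, x in enumerate(features):
            grad[j] += weight * (prob - label) * x
    return loss, grad
```

Giving a sample weight 3 produces the same loss and gradient (up to floating-point rounding) as listing that sample three times with weight 1, which is why explicit oversampling can be avoided.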
*	[SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS	Yuhao Yang	2015-09-15	1	-0/+25
\| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-10491 We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`. Let me know if a new unit test is needed. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8663 from hhbyyh/movedspr.
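For readers unfamiliar with `dspr`, the BLAS routine performs the symmetric rank-1 update A := alpha * x * x^T + A on a matrix stored in packed form. The sketch below is a dense-only Python illustration. It assumes upper-triangular, column-major packing, and it does not reproduce the sparse vector support mentioned in the commit message.

```python
def dspr(alpha, x, packed_upper):
    """Apply A := alpha * x * x^T + A in place.

    packed_upper holds the upper triangle of the symmetric matrix A in
    column-major order, so entry (i, j) with i <= j lives at index
    i + j * (j + 1) // 2.
    """
    n = len(x)
    if len(packed_upper) != n * (n + 1) // 2:
        raise ValueError("packed array has the wrong length")
    for j in range(n):
        col_start = j * (j + 1) // 2
        for i in range(j + 1):
            packed_upper[col_start + i] += alpha * x[i] * x[j]
    return packed_upper
```

Only n(n+1)/2 entries are touched, which is the reason packed storage is attractive when accumulating Gramian-style sums.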
*	[SPARK-10573] [ML] IndexToString output schema should be StringType	Nick Pritchard	2015-09-14	1	-0/+8
\| \| \| \| \| \| \| \|	Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8751 from pnpritchard/SPARK-10573.
*	[SPARK-10537] [ML] document LIBSVM source options in public API doc and some ↵	Xiangrui Meng	2015-09-11	2	-16/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	minor improvements We should document options in the public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation in the package doc. However, since there would then be no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc in `DefaultSource` instead. There are several minor updates in this PR: 1. Do `vectorType == "sparse"` only once. 2. Update `hashCode` and `equals`. 3. Remove inherited doc. 4. Delete temp dir in `afterAll`. Lewuathe Author: Xiangrui Meng <meng@databricks.com> Closes #8699 from mengxr/SPARK-10537.
*	[SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSVM data	lewuathe	2015-09-09	2	-0/+156
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is convenient to implement the data source API for the LIBSVM format to have better integration with DataFrames and the ML pipeline API. Two options are implemented. * `numFeatures`: Specify the dimension of the features vector * `featuresType`: Specify the type of output vector. `sparse` is default. Author: lewuathe <lewuathe@me.com> Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits: 986999d [lewuathe] Change unit test phrase 11d513f [lewuathe] Fix some reviews 21600a4 [lewuathe] Merge branch 'master' into SPARK-10117 9ce63c7 [lewuathe] Rewrite service loader file 1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117 ba3657c [lewuathe] Merge branch 'master' into SPARK-10117 0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF 4f40891 [lewuathe] Improve test suites 5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117 8660d0e [lewuathe] Fix Java unit test b56a948 [lewuathe] Merge branch 'master' into SPARK-10117 2c12894 [lewuathe] Remove unnecessary tag 7d693c2 [lewuathe] Resolv conflict 62010af [lewuathe] Merge branch 'master' into SPARK-10117 a97ee97 [lewuathe] Fix some points aef9564 [lewuathe] Fix 70ee4dd [lewuathe] Add Java test 3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data 40d3027 [lewuathe] Add Java test 7056d4a [lewuathe] Merge branch 'master' into SPARK-10117 99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
*	[SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGenerator	Yanbo Liang	2015-09-08	1	-1/+15
\| \| \| \| \| \| \| \| \|	Add WeibullGenerator for RandomDataGenerator. #8611 needs WeibullGenerator to generate random data from the Weibull distribution. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8622 from yanboliang/spark-10464.
*	[SPARK-9834] [MLLIB] implement weighted least squares via normal equation	Xiangrui Meng	2015-09-08	1	-0/+133
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet. There are a couple of TODOs that can be addressed in future PRs: * consolidate summary statistics aggregators * move `dspr` to `BLAS` * etc It would be nice to have this merged first because it blocks a couple of other features. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8588 from mengxr/SPARK-9834.
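The normal equation approach named in the SPARK-9834 message above solves (X^T W X) beta = X^T W y directly. The following pure-Python sketch shows the idea for small dense inputs. It is an illustration, not the Spark `WeightedLeastSquares` code: the function names are hypothetical, there is no regularization or standardization, and a singular system simply raises `ZeroDivisionError`.

```python
def solve_linear_system(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - rest) / m[i][i]
    return x


def weighted_least_squares(rows, labels, weights):
    """Accumulate X^T W X and X^T W y in one pass, then solve for beta.

    Each row should include a leading 1.0 if an intercept is wanted.
    """
    k = len(rows[0])
    xtwx = [[0.0] * k for _ in range(k)]
    xtwy = [0.0] * k
    for x, y, w in zip(rows, labels, weights):
        for i in range(k):
            xtwy[i] += w * x[i] * y
            for j in range(k):
                xtwx[i][j] += w * x[i] * x[j]
    return solve_linear_system(xtwx, xtwy)
```

Because the accumulated statistics are only k by k, a single pass over the data suffices. The trade-off is that the approach is practical only when the number of features is modest.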
*	[SPARK-10470] [ML] ml.IsotonicRegressionModel.copy should set parent	Yanbo Liang	2015-09-08	1	-0/+5
\| \| \| \| \| \| \| \| \|	Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent. This fixes it and adds a test case. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8637 from yanboliang/spark-10470.
*	[SPARK-10013] [ML] [JAVA] [TEST] remove java assert from java unit tests	Holden Karau	2015-09-05	4	-52/+54
\| \| \| \| \| \| \| \|	From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests. Author: Holden Karau <holden@pigscanfly.ca> Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
*	[SPARK-9723] [ML] params getordefault should throw more useful error	Holden Karau	2015-09-02	2	-7/+13
\| \| \| \| \| \| \| \|	Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup. Author: Holden Karau <holden@pigscanfly.ca> Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
*	[SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover	Holden Karau	2015-09-01	1	-1/+1
\| \| \| \| \| \| \| \|	Add a python API for the Stop Words Remover. Author: Holden Karau <holden@pigscanfly.ca> Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
*	[SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset ↵	Burak Yavuz	2015-08-30	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	of matrix multiplications mengxr jkbradley rxin It would be great if this fix made it into RC3! Author: Burak Yavuz <brkyvz@gmail.com> Closes #8525 from brkyvz/blas-scaling.
*	[SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java ↵	Feynman Liang	2015-08-27	1	-0/+72
\| \| \| \| \| \| \| \| \| \| \| \| \|	compatibility test * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230.
*	[SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java tests	Feynman Liang	2015-08-27	14	-74/+71
\| \| \| \| \| \| \| \| \| \| \| \|	* Replaces instances of `Lists.newArrayList` with `Arrays.asList` * Uses `commons.lang.StringUtils` instead of `com.google.collections.Strings` * Uses the `List` interface instead of `ArrayList` implementations This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests. Author: Feynman Liang <fliang@databricks.com> Closes #8451 from feynmanliang/SPARK-10257.
*	[SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11	Jacek Laskowski	2015-08-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases. Build for 2.10: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install and 2.11: ./dev/change-scala-version.sh 2.11 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install Author: Jacek Laskowski <jacek@japila.pl> Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
*	[SPARK-10256] [ML] Removes guava dependency from spark.ml.classification ↵	Feynman Liang	2015-08-27	1	-2/+2
\| \| \| \| \| \| \| \|	JavaTests Author: Feynman Liang <fliang@databricks.com> Closes #8447 from feynmanliang/SPARK-10256.
*	[SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTests	Feynman Liang	2015-08-27	2	-6/+6
\| \| \| \| \| \|	Author: Feynman Liang <fliang@databricks.com> Closes #8446 from feynmanliang/SPARK-10255.
*	[SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTests	Feynman Liang	2015-08-27	11	-30/+35
\| \| \| \| \| \| \| \| \|	* Replaces `com.google.common` dependencies with `java.util.Arrays` * Small clean up in `JavaNormalizerSuite` Author: Feynman Liang <fliang@databricks.com> Closes #8445 from feynmanliang/SPARK-10254.
*	[SPARK-9888] [MLLIB] User guide for new LDA features	Feynman Liang	2015-08-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	* Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.
*	[SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration	Feynman Liang	2015-08-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230.
*	[SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses ↵	Sean Owen	2015-08-25	5	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \|	to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
*	[SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug	Joseph K. Bradley	2015-08-23	1	-2/+20
\| \| \| \| \| \| \| \| \| \| \| \|	GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix.
*	[SPARK-9893] User guide with Java test suite for VectorSlicer	Xusen Yin	2015-08-21	1	-0/+85
\| \| \| \| \| \| \| \| \| \|	Add a user guide for `VectorSlicer`, with a Java test suite and a Python version of VectorSlicer. Note that the Python version does not yet support selecting by names. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893.
*	[SPARK-9245] [MLLIB] LDA topic assignments	Joseph K. Bradley	2015-08-20	2	-2/+26
\| \| \| \| \| \| \| \| \| \|	For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments.
*	[SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add ↵	Xiangrui Meng	2015-08-20	1	-0/+74
\| \| \| \| \| \| \| \| \| \|	Java test suite Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138.
*	[SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator`	Feynman Liang	2015-08-19	3	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097.
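The effect of the `isLargerBetter` attribute described in the SPARK-10097 message above can be shown with a small Python sketch of model selection. The function name and signature are hypothetical, and this is not the `CrossValidator` code. It only shows how a direction flag removes the need to negate metrics.

```python
def select_best(metrics, is_larger_better):
    """Return (index, metric) of the best candidate.

    metrics holds one evaluation result per candidate model. The flag
    says whether the metric should be maximized (e.g. accuracy) or
    minimized (e.g. RMSE), so the reported value keeps its true sign.
    """
    chooser = max if is_larger_better else min
    best_index = chooser(range(len(metrics)), key=lambda i: metrics[i])
    return best_index, metrics[best_index]
```

With an error metric such as RMSE, calling `select_best(rmse_values, False)` picks the smallest value and reports it unchanged, instead of reporting a negated number.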
*	[SPARK-10012] [ML] Missing test case for Params#arrayLengthGt	lewuathe	2015-08-18	1	-0/+3
\| \| \| \| \| \| \| \|	Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012.
*	[SPARK-9900] [MLLIB] User guide for Association Rules	Feynman Liang	2015-08-18	1	-1/+1
\| \| \| \| \| \| \| \|	Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.
*	[SPARK-9028] [ML] Add CountVectorizer as an estimator to generate ↵	Yuhao Yang	2015-08-18	2	-73/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.
*	[SPARK-9981] [ML] Made labels public for StringIndexerModel	Joseph K. Bradley	2015-08-14	1	-0/+18
\| \| \| \| \| \| \| \| \| \|	Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.
*	[SPARK-8744] [ML] Add a public constructor to StringIndexer	Holden Karau	2015-08-14	1	-0/+2
\| \| \| \| \| \| \| \|	It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model. Author: Holden Karau <holden@pigscanfly.ca> Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
*	[SPARK-9956] [ML] Make trees work with one-category features	Joseph K. Bradley	2015-08-14	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical. As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing. Targeted for 1.5 and master CC: manishamde mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8187 from jkbradley/tree-1cat.
*	[SPARK-9661] [MLLIB] minor clean-up of SPARK-9661	Xiangrui Meng	2015-08-14	2	-18/+24
\| \| \| \| \| \| \| \|	Some minor clean-ups after SPARK-9661. See my inline comments. MechCoder jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #8190 from mengxr/SPARK-9661-fix.
*	[SPARK-9922] [ML] rename StringIndexerReverse to IndexToString	Xiangrui Meng	2015-08-13	1	-15/+35
\| \| \| \| \| \| \| \| \| \| \| \| \|	What `StringIndexerInverse` does is not strictly associated with `StringIndexer`, and the name is not clearly describing the transformation. Renaming to `IndexToString` might be better. ~~I also changed `invert` to `inverse` without arguments. `inputCol` and `outputCol` could be set after.~~ I also removed `invert`. jkbradley holdenk Author: Xiangrui Meng <meng@databricks.com> Closes #8152 from mengxr/SPARK-9922.
*	[SPARK-9661] [MLLIB] [ML] Java compatibility	MechCoder	2015-08-13	3	-0/+59
\| \| \| \| \| \| \| \| \| \| \| \|	I skimmed through the docs for various instances of Object and replaced them with Java-compatible versions of the same. 1. Some methods in LDAModel. 2. runMiniBatchSGD 3. kolmogorovSmirnovTest Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #8126 from MechCoder/java_incop.
*	[SPARK-9073] [ML] spark.ml Models copy() should call setParent when there is ↵	lewuathe	2015-08-13	19	-2/+113
\| \| \| \| \| \| \| \| \| \| \|	a parent Copied ML models must have the same parent as the original ones. Author: lewuathe <lewuathe@me.com> Author: Lewuathe <lewuathe@me.com> Closes #7447 from Lewuathe/SPARK-9073.
*	[SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol	Xiangrui Meng	2015-08-12	1	-9/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues. This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters. jkbradley yu-iskw Author: Xiangrui Meng <meng@databricks.com> Closes #8148 from mengxr/SPARK-9918 and squashes the following commits: 149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol 3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
*	[SPARK-9789] [ML] Added logreg threshold param back	Joseph K. Bradley	2015-08-12	2	-13/+27
\| \| \| \| \| \| \| \| \| \|	Reinstated LogisticRegression.threshold Param for binary compatibility. Param thresholds overrides threshold, if set. CC: mengxr dbtsai feynmanliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8079 from jkbradley/logreg-reinstate-threshold.
*	[SPARK-9847] [ML] Modified copyValues to distinguish between default, ↵	Joseph K. Bradley	2015-08-12	1	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \|	explicit param values From JIRA: Currently, Params.copyValues copies default parameter values to the paramMap of the target instance, rather than the defaultParamMap. It should copy to the defaultParamMap because explicitly setting a parameter can change the semantics. This issue arose in SPARK-9789, where 2 params "threshold" and "thresholds" for LogisticRegression can have mutually exclusive values. If thresholds is set, then fit() will copy the default value of threshold as well, easily resulting in inconsistent settings for the 2 params. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8115 from jkbradley/copyvalues-fix.
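The distinction drawn in the SPARK-9847 message above, between defaults and explicitly set values, can be modelled with a toy Python class. This is a hedged sketch rather than the Spark `Params` API: the class, its method names, and the `KeyError` used for a missing parameter are all assumptions for illustration.

```python
class ToyParams:
    """Minimal stand-in that keeps defaults and explicit values apart."""

    def __init__(self):
        self.default_param_map = {}
        self.param_map = {}

    def set_default(self, name, value):
        self.default_param_map[name] = value
        return self

    def set(self, name, value):
        self.param_map[name] = value
        return self

    def is_set(self, name):
        # Only explicit values count as "set".
        return name in self.param_map

    def get_or_default(self, name):
        if name in self.param_map:
            return self.param_map[name]
        if name in self.default_param_map:
            return self.default_param_map[name]
        raise KeyError("no value or default for param %r" % name)

    def copy_values(self, target):
        # Defaults stay defaults and explicit values stay explicit, so
        # the target never sees a default as if the user had set it.
        target.default_param_map.update(self.default_param_map)
        target.param_map.update(self.param_map)
        return target
```

In the `threshold`/`thresholds` scenario from the message, copying this way leaves `threshold` unset on the target while its default remains readable, so the two params cannot silently conflict.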
*	[SPARK-9788] [MLLIB] Fix LDA Binary Compatibility	Feynman Liang	2015-08-11	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1. Add `asymmetricDocConcentration` and revert docConcentration changes. If the (internal) doc concentration vector is a single value, `getDocConcentration` returns it. If it is a constant vector, `getDocConcentration` returns the first item, and fails otherwise. 2. Give `LDAModel.gammaShape` a default value in `LDAModel` concrete class constructors. jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8077 from feynmanliang/SPARK-9788 and squashes the following commits: 6b07bc8 [Feynman Liang] Code review changes 9d6a71e [Feynman Liang] Add asymmetricAlpha alias bf4e685 [Feynman Liang] Asymmetric docConcentration 4cab972 [Feynman Liang] Default gammaShape
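The getter behaviour described in item 1 of the SPARK-9788 message above can be expressed as a short Python sketch. The function name, the list representation of the vector, and the use of `ValueError` for the failure case are assumptions for illustration, not the actual LDA code.

```python
def get_doc_concentration(doc_concentration):
    """Collapse an internal concentration vector to a single value.

    Returns the value if the vector has one entry or if all entries
    are equal. Raises ValueError otherwise, because an asymmetric
    prior cannot be represented by one number.
    """
    if not doc_concentration:
        raise ValueError("doc concentration vector is empty")
    first = doc_concentration[0]
    if all(value == first for value in doc_concentration):
        return first
    raise ValueError("doc concentration is asymmetric; "
                     "read the full vector instead")
```

Callers that need asymmetric priors would read the whole vector (the `asymmetricDocConcentration` accessor in the message), while existing callers of the scalar getter keep working for symmetric priors.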