spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" ↵	Ruifeng Zheng	2016-06-04	2	-8/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	f1_score ## What changes were proposed in this pull request? 1, del precision,recall in `ml.MulticlassClassificationEvaluator` 2, update user guide for `mlllib.weightedFMeasure` ## How was this patch tested? local build Author: Ruifeng Zheng <ruifengz@foxmail.com> Closes #13390 from zhengruifeng/clarify_f1.
*	[SPARK-15494][SQL] encoder code cleanup	Wenchen Fan	2016-06-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Our encoder framework has been evolved a lot, this PR tries to clean up the code to make it more readable and emphasise the concept that encoder should be used as a container of serde expressions. 1. move validation logic to analyzer instead of encoder 2. only have a `resolveAndBind` method in encoder instead of `resolve` and `bind`, as we don't have the encoder life cycle concept anymore. 3. `Dataset` don't need to keep a resolved encoder, as there is no such concept anymore. bound encoder is still needed to do serialization outside of query framework. 4. Using `BoundReference` to represent an unresolved field in deserializer expression is kind of weird, this PR adds a `GetColumnByOrdinal` for this purpose. (serializer expression still use `BoundReference`, we can replace it with `GetColumnByOrdinal` in follow-ups) ## How was this patch tested? existing test Author: Wenchen Fan <wenchen@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #13269 from cloud-fan/clean-encoder.
*	[SPARK-15740][MLLIB] ignore big model load / save in Word2VecSuite	Xiangrui Meng	2016-06-02	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? andrewor14 noticed some OOM errors caused by "test big model load / save" in Word2VecSuite, e.g., https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test/job/spark-master-test-maven-hadoop-2.2/1168/consoleFull. It doesn't show up in the test result because it was OOMed. This PR disables the test. I will leave the JIRA open for a proper fix ## How was this patch tested? No new features. Author: Xiangrui Meng <meng@databricks.com> Closes #13478 from mengxr/SPARK-15740.
*	[SPARK-15668][ML] ml.feature: update check schema to avoid confusion when ↵	Yuhao Yang	2016-06-02	4	-36/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	user use MLlib.vector as input type ## What changes were proposed in this pull request? ml.feature: update check schema to avoid confusion when user use MLlib.vector as input type ## How was this patch tested? existing ut Author: Yuhao Yang <yuhao.yang@intel.com> Closes #13411 from hhbyyh/schemaCheck.
*	[MINOR] clean up style for storage param setters in ALS	Nick Pentreath	2016-06-02	1	-6/+2
\| \| \| \| \| \| \| \| \| \| \|	Clean up style for param setter methods in ALS to match standard style and the other setter in class (this is an artefact of one of my previous PRs that wasn't cleaned up). ## How was this patch tested? Existing tests - no functionality change. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13480 from MLnick/als-param-minor-cleanup.
*	[SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature	Yanbo Liang	2016-06-01	4	-16/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include: * Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless. * Scala API docs update. * Sync Scala and Python API docs for these changes. ## How was this patch tested? Exist tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13410 from yanboliang/spark-15587.
*	[SPARK-15664][MLLIB] Replace FileSystem.get(conf) with ↵	Lianhui Wang	2016-06-01	5	-20/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	path.getFileSystem(conf) when removing CheckpointFile in MLlib ## What changes were proposed in this pull request? if sparkContext.set CheckpointDir to another Dir that is not default FileSystem, it will throw exception when removing CheckpointFile in MLlib. So we should always get the FileSystem from Path to avoid wrong FS problem. ## How was this patch tested? N/A Author: Lianhui Wang <lianhuiwang09@gmail.com> Closes #13408 from lianhuiwang/SPARK-15664.
*	[SPARK-15618][SQL][MLLIB] Use SparkSession.builder.sparkContext if applicable.	Dongjoon Hyun	2016-05-31	22	-51/+49
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR changes function `SparkSession.builder.sparkContext(..)` from private[sql] into private[spark], and uses it if applicable like the followings. ``` - val spark = SparkSession.builder().config(sc.getConf).getOrCreate() + val spark = SparkSession.builder().sparkContext(sc).getOrCreate() ``` ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13365 from dongjoon-hyun/SPARK-15618.
*	[MINOR] Resolve a number of miscellaneous build warnings	Sean Owen	2016-05-29	2	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately. ## How was this patch tested? Jenkins Author: Sean Owen <sowen@cloudera.com> Closes #13377 from srowen/BuildWarnings.
*	[SPARK-15610][ML] update error message for k in pca	Zheng RuiFeng	2016-05-27	2	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Fix the wrong bound of `k` in `PCA` `require(k <= sources.first().size, ...` -> `require(k < sources.first().size` BTW, remove unused import in `ml.ElementwiseProduct` ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13356 from zhengruifeng/fix_pca.
*	[SPARK-15413][ML][MLLIB] Change `toBreeze` to `asBreeze` in Vector and Matrix	DB Tsai	2016-05-27	41	-125/+125
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? We're using `asML` to convert the mllib vector/matrix to ml vector/matrix now. Using `as` is more correct given that this conversion actually shares the same underline data structure. As a result, in this PR, `toBreeze` will be changed to `asBreeze`. This is a private API, as a result, it will not affect any user's application. ## How was this patch tested? unit tests Author: DB Tsai <dbt@netflix.com> Closes #13198 from dbtsai/minor.
*	[SPARK-11959][SPARK-15484][DOC][ML] Document WLS and IRLS	Yanbo Liang	2016-05-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? * Document ```WeightedLeastSquares```(normal equation) and ```IterativelyReweightedLeastSquares```. * Copy ```L-BFGS``` documents from ```spark.mllib``` to ```spark.ml```. Due to the session ```Optimization of linear methods``` is used for developers, I think we should provide the brief introduction of the optimization method, necessary references and how it implements in Spark. It's not necessary to paste all mathematical formula and derivation here. If developers/users want to learn more, they can track reference. ## How was this patch tested? Document update, no tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13262 from yanboliang/spark-15484.
*	[HOTFIX] Scala 2.10 compile GaussianMixtureModel	Andrew Or	2016-05-27	1	-1/+1
\|
*	[SPARK-15584][SQL] Abstract duplicate code: `spark.sql.sources.` properties	Dongjoon Hyun	2016-05-27	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR replaces `spark.sql.sources.` strings with `CreateDataSourceTableUtils.*` constant variables. ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13349 from dongjoon-hyun/SPARK-15584.
*	[SPARK-15603][MLLIB] Replace SQLContext with SparkSession in ML/MLLib	Dongjoon Hyun	2016-05-27	23	-195/+160
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR replaces all deprecated `SQLContext` occurrences with `SparkSession` in `ML/MLLib` module except the following two classes. These two classes use `SQLContext` in their function signatures. - ReadWrite.scala - TreeModels.scala ## How was this patch tested? Pass the existing Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13352 from dongjoon-hyun/SPARK-15603.
*	[MINOR] Fix Typos 'a -> an'	Zheng RuiFeng	2016-05-26	7	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? `a` -> `an` I use regex to generate potential error lines: `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml//scala` and review them line by line. ## How was this patch tested? local build `lint-java` checking Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13317 from zhengruifeng/a_an.
*	[SPARK-15532][SQL] SQLContext/HiveContext's public constructors should use ↵	Yin Huai	2016-05-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	SparkSession.build.getOrCreate ## What changes were proposed in this pull request? This PR changes SQLContext/HiveContext's public constructor to use SparkSession.build.getOrCreate and removes isRootContext from SQLContext. ## How was this patch tested? Existing tests. Author: Yin Huai <yhuai@databricks.com> Closes #13310 from yhuai/SPARK-15532.
*	[SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations	Sean Owen	2016-05-26	7	-39/+44
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items: * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples. * Use in PythonMLlibAPI: Change to using private constructors * Streaming algs: No warnings after we un-deprecate the classes * Examples: Deprecate or change ones which use deprecated APIs * MulticlassMetrics fields (precision, etc.) * LinearRegressionSummary.model field ## How was this patch tested? Existing tests. Checked for warnings manually. Author: Sean Owen <sowen@cloudera.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #13314 from jkbradley/warning-cleanups.
*	[SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15	Villu Ruusmann	2016-05-26	1	-7/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? See https://issues.apache.org/jira/browse/SPARK-15523 This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes. ## How was this patch tested? 1. Executed `mvn clean package` in `mllib` directory 2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory. Author: Villu Ruusmann <villu.ruusmann@gmail.com> Closes #13297 from vruusmann/update-jpmml.
*	[SPARK-15543][SQL] Rename DefaultSources to make them more self-describing	Reynold Xin	2016-05-25	2	-3/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names. They are now named: - LibSVMFileFormat - CSVFileFormat - JdbcRelationProvider - JsonFileFormat - ParquetFileFormat - TextFileFormat Backward compatibility is maintained through aliasing. ## How was this patch tested? Updated relevant test cases too. Author: Reynold Xin <rxin@databricks.com> Closes #13311 from rxin/SPARK-15543.
*	Log warnings for numIterations * miniBatchFraction < 1.0	Gio Borje	2016-05-25	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add a warning log for the case that `numIterations * miniBatchFraction <1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.2` and `numIterations = 3`. Then, 3 iterations will occur each sampling approximately 6 examples each. In the best case, each of the 6 examples are unique; hence 18/100 examples are used. This may be counter-intuitive to most users and led to the issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not require the training data set, it would be easier and more intuitive to use `RDD.sample`. ## How was this patch tested? `build/mvn -DskipTests clean package` build succeeds Author: Gio Borje <gborje@linkedin.com> Closes #13265 from Hydrotoast/master.
*	[SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS	Nick Pentreath	2016-05-25	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice. We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields. Tests N/A. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
*	[MINOR][MLLIB][STREAMING][SQL] Fix typos	lfzCarlosC	2016-05-25	2	-2/+2
\| \| \| \| \| \| \| \| \| \|	fixed typos for source code for components [mllib] [streaming] and [SQL] None and obvious. Author: lfzCarlosC <lfz.carlos@gmail.com> Closes #13298 from lfzCarlosC/master.
*	[SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark ↵	Nick Pentreath	2016-05-24	1	-5/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	QuantileDiscretizer This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala. Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`). Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`. ## How was this patch tested? A little doctest and built API docs locally to check HTML doc generation. Author: Nick Pentreath <nickp@za.ibm.com> Closes #13228 from MLnick/SPARK-15442-py-relerror-param.
*	[SPARK-15339][ML] ML 2.0 QA: Scala APIs and code audit for regression	Yanbo Liang	2016-05-19	5	-47/+58
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? * ```GeneralizedLinearRegression``` API docs enhancement. * The default value of ```GeneralizedLinearRegression``` ```linkPredictionCol``` is not set rather than empty. This will consistent with other similar params such as ```weightCol``` * Make some methods more private. * Fix a minor bug of LinearRegression. * Fix some other issues. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13129 from yanboliang/spark-15339.
*	[SPARK-15075][SPARK-15345][SQL] Clean up SparkSession builder and propagate ↵	Reynold Xin	2016-05-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	config options to existing sessions if specified ## What changes were proposed in this pull request? Currently SparkSession.Builder use SQLContext.getOrCreate. It should probably the the other way around, i.e. all the core logic goes in SparkSession, and SQLContext just calls that. This patch does that. This patch also makes sure config options specified in the builder are propagated to the existing (and of course the new) SparkSession. ## How was this patch tested? Updated tests to reflect the change, and also introduced a new SparkSessionBuilderSuite that should cover all the branches. Author: Reynold Xin <rxin@databricks.com> Closes #13200 from rxin/SPARK-15075.
*	[SPARK-15296][MLLIB] Refactor All Java Tests that use SparkSession	Sandeep Singh	2016-05-19	58	-1147/+206
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Refactor All Java Tests that use SparkSession, to extend SharedSparkSesion ## How was this patch tested? Existing Tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13101 from techaddict/SPARK-15296.
*	[MINOR][ML][PYSPARK] ml.evaluation Scala and Python API sync	Yanbo Liang	2016-05-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? ```ml.evaluation``` Scala and Python API sync. ## How was this patch tested? Only API docs change, no new tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13195 from yanboliang/evaluation-doc.
*	[SPARK-15341][DOC][ML] Add documentation for "model.write" to clarify ↵	Yanbo Liang	2016-05-19	5	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	"summary" was not saved ## What changes were proposed in this pull request? Currently in ```model.write```, we don't save ```summary```(if applicable). We should add documentation to clarify it. We fixed the incorrect link ```[[MLWriter]]``` to ```[[org.apache.spark.ml.util.MLWriter]]``` BTW. ## How was this patch tested? Documentation update, no unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13131 from yanboliang/spark-15341.
*	[SPARK-15414][MLLIB] Make the mllib,ml linalg type conversion APIs public	Sandeep Singh	2016-05-19	2	-18/+42
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Open up APIs for converting between new, old linear algebra types (in spark.mllib.linalg): `Sparse`/`Dense` X `Vector`/`Matrices` `.asML` and `.fromML` ## How was this patch tested? Existing Tests Author: Sandeep Singh <sandeep@techaddict.me> Closes #13202 from techaddict/SPARK-15414.
*	[SPARK-15361][ML] ML 2.0 QA: Scala APIs audit for ml.clustering	Yanbo Liang	2016-05-19	4	-21/+43
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Audit Scala API for ml.clustering. Fix some wrong API documentations and update outdated one. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13148 from yanboliang/spark-15361.
*	[SPARK-15411][ML] Add @since to ml.stat.MultivariateOnlineSummarizer.scala	DB Tsai	2016-05-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Add since to ml.stat.MultivariateOnlineSummarizer.scala ## How was this patch tested? unit tests Author: DB Tsai <dbt@netflix.com> Closes #13197 from dbtsai/cleanup.
*	[SPARK-15292][ML] ML 2.0 QA: Scala APIs audit for classification	Yanbo Liang	2016-05-19	2	-35/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Audit Scala API for classification, almost all issues were related ```MultilayerPerceptronClassifier``` in this section. * Fix one wrong param getter function: ```getOptimizer``` -> ```getSolver``` * Add missing setter function for ```solver``` and ```stepSize```. * Make ```GD``` solver take effect. * Update docs, annotations and fix other minor issues. ## How was this patch tested? Existing unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13076 from yanboliang/spark-15292.
*	[SPARK-15362][ML] Make spark.ml KMeansModel load backwards compatible	Yanbo Liang	2016-05-19	1	-5/+17
\| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? [SPARK-14646](https://issues.apache.org/jira/browse/SPARK-14646) makes ```KMeansModel``` store the cluster centers one per row. ```KMeansModel.load()``` method needs to be updated in order to load models saved with Spark 1.6. ## How was this patch tested? Since ```save/load``` is ```Experimental``` for 1.6, I think offline test for backwards compatibility is enough. Author: Yanbo Liang <ybliang8@gmail.com> Closes #13149 from yanboliang/spark-15362.
*	[DOC][MINOR] ml.feature Scala and Python API sync	Bryan Cutler	2016-05-19	4	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? I reviewed Scala and Python APIs for ml.feature and corrected discrepancies. ## How was this patch tested? Built docs locally, ran style checks Author: Bryan Cutler <cutlerb@gmail.com> Closes #13159 from BryanCutler/ml.feature-api-sync.
*	[SPARK-15192][SQL] null check for SparkSession.createDataFrame	Wenchen Fan	2016-05-18	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This PR adds null check in `SparkSession.createDataFrame`, so that we can make sure the passed in rows matches the given schema. ## How was this patch tested? new tests in `DatasetSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #13008 from cloud-fan/row-encoder.
*	[SPARK-14891][ML] Add schema validation for ALS	Nick Pentreath	2016-05-18	3	-18/+143
\| \| \| \| \| \| \| \| \| \| \| \| \|	This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown. With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue). ## How was this patch tested? New test cases in `ALSSuite`. Author: Nick Pentreath <nickp@za.ibm.com> Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
*	[SPARK-15346][MLLIB] Reduce duplicate computation in picking initial points	DLucky	2016-05-18	1	-6/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mateiz srowen I state that the contribution is my original work and that I license the work to the project under the project's open source license There's some format problems with my last PR, with HyukjinKwon 's help I read the guidance, re-check my code and PR, then run the tests, finally re-submit the PR request here. The related JIRA issue though marked as resolved, this change may relate to it I think. ## Proposed Change After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old ones. Instead this change keeps the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them. ## Test result One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706) I test the KMeans++ method for a small dataset with 16k points, and the whole KMeans\|\| with a large one with 240k points. The data has 4096 features and I tunes K from 100 to 500. The test environment was on my 4 machine cluster, I also tested a 3M points data on a larger cluster with 25 machines and got similar results, which I would not draw the detail curve. The result of the first two exps are shown below ### Local KMeans++ test: Dataset:4m_ini_center Data_size:16234 Dimension:4096 Lloyd's Iteration = 10 The y-axis is time in sec, the x-axis is tuning the K. ![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png) ![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg) ### On a larger dataset An improve show in the graph but not commit in this file: In this experiment I also have an improvement for calculation in normalization data (the distance is convert to the cosine distance). As if the data is normalized into (0,1), one improvement in the original vesion for util.MLUtils.fastSauaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never less then precision in this case). Therefore I design an early terminal method when comparing two distance (used for findClosest). But I don't include this improve in this file, you may only refer to the curves without "normalize" for comparing the results. Dataset:4k24 Data_size:243960 Dimension:4096 Normlize Enlarge Initialize Lloyd's_Iteration NO 1 3 5 YES 10000 3 5 Notice: the normlized data is enlarged to ensure precision The cost time: x-for value of K, y-for time in sec ![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg) SE for unnormalized data between two version, to ensure the correctness ![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg) Here is the SE between normalized data just for reference, it's also correct. ![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg) Author: DLucky <mouendless@gmail.com> Closes #13133 from mouendless/patch-2.
*	[SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into ↵	WeichenXu	2016-05-18	3	-9/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	accumulatorV2 in spark project ## What changes were proposed in this pull request? I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself) ## How was this patch tested? Exisiting unit tests Author: WeichenXu <WeichenXu123@outlook.com> Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
*	[SPARK-15171][SQL] Remove the references to deprecated method ↵	Sean Zhong	2016-05-18	3	-8/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	dataset.registerTempTable ## What changes were proposed in this pull request? Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`. ## How was this patch tested? This PR only changes the unit test code, examples, and comments. It should be safe. This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged. Author: Sean Zhong <seanzhong@databricks.com> Closes #13098 from clockfly/spark-15171-remove-deprecation.
*	[SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based ↵	DB Tsai	2016-05-17	152	-345/+1007
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	algorithms ## What changes were proposed in this pull request? Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis. ## How was this patch tested? Unit tests Author: DB Tsai <dbt@netflix.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Author: Xiangrui Meng <meng@databricks.com> Closes #12627 from dbtsai/SPARK-14615-NewML.
*	[MINOR][DOCS] Replace remaining 'sqlContext' in ScalaDoc/JavaDoc.	Dongjoon Hyun	2016-05-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`. ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #13125 from dongjoon-hyun/minor_doc_sparksession.
*	[SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into ↵	Sean Owen	2016-05-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	spark-tags ## What changes were proposed in this pull request? (See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.) Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags` ## How was this patch tested? Jenkins tests. Author: Sean Owen <sowen@cloudera.com> Closes #13074 from srowen/SPARK-15290.
*	[MINOR] Fix Typos	Zheng RuiFeng	2016-05-15	2	-33/+36
\| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? 1,Rename matrix args in BreezeUtil to upper to match the doc 2,Fix several typos in ML and SQL ## How was this patch tested? manual tests Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #13078 from zhengruifeng/fix_ann.
*	[SPARK-15096][ML] LogisticRegression MultiClassSummarizer numClasses can ↵	wm624@hotmail.com	2016-05-14	2	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fail if no valid labels are found ## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Throw better exception when numClasses is empty and empty.max is thrown. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Add a new unit test, which calls histogram with empty numClasses. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12969 from wangmiao1981/logisticR.
*	[SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change ↵	hyukjinkwon	2016-05-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	default compression for ORC ## What changes were proposed in this pull request? Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`). It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`. Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`. ## How was this patch tested? Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC. Author: hyukjinkwon <gurwls223@gmail.com> Closes #13048 from HyukjinKwon/SPARK-15267.
*	[SPARK-14900][ML] spark.ml classification metrics should include accuracy	wm624@hotmail.com	2016-05-13	3	-17/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? (Please fill in changes proposed in this fix) Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator. ## How was this patch tested? (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests) Scala Unit tests in ml.evaluation Author: wm624@hotmail.com <wm624@hotmail.com> Closes #12882 from wangmiao1981/accuracy.
*	[SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other ↵	BenFradet	2016-05-13	19	-27/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	numeric types for label ## What changes were proposed in this pull request? Made ChiSqSelector and RFormula accept all numeric types for label ## How was this patch tested? Unit tests Author: BenFradet <benjamin.fradet@gmail.com> Closes #12467 from BenFradet/SPARK-13961.
*	[SPARK-15181][ML][PYSPARK] Python API for GLR summaries.	sethah	2016-05-13	1	-9/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs. ## How was this patch tested? Added a unit test to `pyspark.ml.tests` Author: sethah <seth.hendrickson16@gmail.com> Closes #12961 from sethah/GLR_summary.
*	[SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView	Sean Zhong	2016-05-12	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	## What changes were proposed in this pull request? Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView. ## How was this patch tested? Unit tests. Author: Sean Zhong <seanzhong@databricks.com> Closes #12945 from clockfly/spark-15171.