spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-12309][ML] Use sqlContext from MLlibTestSparkContext for spark.ml ↵	Yanbo Liang	2015-12-16	5	-11/+5
\| \| \| \| \| \| \| \| \| \| \| \|	test suites Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need to be updated. cc mengxr jkbradley Author: Yanbo Liang <ybliang8@gmail.com> Closes #10279 from yanboliang/spark-12309.
*	[SPARK-9694][ML] Add random seed Param to Scala CrossValidator	Yanbo Liang	2015-12-16	2	-3/+16
\| \| \| \| \| \| \| \|	Add random seed Param to Scala CrossValidator Author: Yanbo Liang <ybliang8@gmail.com> Closes #9108 from yanboliang/spark-9694.
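The point of a seed parameter is that the fold assignment becomes reproducible. A minimal plain-Python sketch of that idea follows; `k_fold_indices` is a hypothetical helper for illustration only and is not how Spark's CrossValidator splits data.

```python
import random


def k_fold_indices(n_rows, num_folds, seed):
    """Assign row indices to num_folds folds, reproducibly for a given seed."""
    rng = random.Random(seed)          # private generator, so the global RNG is untouched
    indices = list(range(n_rows))
    rng.shuffle(indices)
    # Fold i takes every num_folds-th shuffled index, starting at offset i.
    return [sorted(indices[i::num_folds]) for i in range(num_folds)]


folds_a = k_fold_indices(10, 3, seed=42)
folds_b = k_fold_indices(10, 3, seed=42)
print(folds_a == folds_b)  # same seed, same folds
```

With a fixed seed, two runs evaluate each parameter setting on identical train/validation splits, which makes metric comparisons between runs meaningful.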
*	[SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark	Liang-Chi Hsieh	2015-12-14	2	-33/+62
\| \| \| \| \| \| \| \| \| \|	JIRA: https://issues.apache.org/jira/browse/SPARK-12016 We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #10100 from viirya/fix-load-py-wordvecmodel.
*	[SPARK-11497][MLLIB][PYTHON] PySpark RowMatrix Constructor Has Type Erasure ↵	Mike Dusenberry	2015-12-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Issue As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types. This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased. cc holdenk Author: Mike Dusenberry <mwdusenb@us.ibm.com> Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
*	[SPARK-10991][ML] logistic regression training summary handle empty ↵	Holden Karau	2015-12-11	2	-2/+29
\| \| \| \| \| \| \| \| \| \| \|	prediction col LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718 ) Author: Holden Karau <holden@pigscanfly.ca> Author: Holden Karau <holden@us.ibm.com> Closes #9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.
*	[SPARK-11602][MLLIB] Refine visibility for 1.6 scala API audit	Yuhao Yang	2015-12-10	4	-5/+5
\| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-11602 Made a pass on the API change of 1.6. Open the PR for efficient discussion. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9939 from hhbyyh/auditScala.
*	[SPARK-11530][MLLIB] Return eigenvalues with PCA model	Sean Owen	2015-12-10	6	-25/+64
\| \| \| \| \| \| \| \| \| \|	Add `computePrincipalComponentsAndVariance` to also compute PCA's explained variance. CC mengxr Author: Sean Owen <sowen@cloudera.com> Closes #9736 from srowen/SPARK-11530.
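As a rough illustration of what "explained variance" means here, the sketch below assumes it is each eigenvalue's share of the total of the supplied eigenvalues. That definition is an assumption for illustration; the commit message does not spell out the exact normalization used by `computePrincipalComponentsAndVariance`.

```python
def explained_variance(eigenvalues):
    """Return each eigenvalue as a proportion of the sum of all eigenvalues.

    Assumption: eigenvalues are non-negative and sorted in descending order,
    as they would be for a covariance matrix.
    """
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


ratios = explained_variance([4.0, 3.0, 2.0, 1.0])
print(ratios)
```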
*	[SPARK-10299][ML] word2vec should allow users to specify the window size	Holden Karau	2015-12-09	3	-4/+65
\| \| \| \| \| \| \| \| \|	Currently word2vec has the window hard coded at 5, some users may want different sizes (for example if using on n-gram input or similar). User request comes from http://stackoverflow.com/questions/32231975/spark-word2vec-window-size . Author: Holden Karau <holden@us.ibm.com> Author: Holden Karau <holden@pigscanfly.ca> Closes #8513 from holdenk/SPARK-10299-word2vec-should-allow-users-to-specify-the-window-size.
*	[SPARK-11343][ML] Documentation of float and double prediction/label columns ↵	Dominik Dahlem	2015-12-08	1	-2/+7
\| \| \| \| \| \| \| \| \| \| \| \|	in RegressionEvaluator felixcheung , mengxr Just added a message to require() Author: Dominik Dahlem <dominik.dahlem@gmail.combination> Closes #9598 from dahlem/ddahlem_regression_evaluator_double_predictions_message_04112015.
*	[SPARK-11605][MLLIB] ML 1.6 QA: API: Java compatibility, docs	Yuhao Yang	2015-12-08	4	-25/+94
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-11605 Check Java compatibility for MLlib for this release. fix: 1. `StreamingTest.registerStream` needs a Java-friendly interface. 2. `GradientBoostedTreesModel.computeInitialPredictionAndError` and `GradientBoostedTreesModel.updatePredictionError` have Java compatibility issues. Mark them as `developerAPI`. TBD: [updated] no fix for now per discussion. `org.apache.spark.mllib.classification.LogisticRegressionModel` `public scala.Option<java.lang.Object> getThreshold();` has the wrong return type for Java invocation. `SVMModel` has a similar issue. Yet adding a `scala.Option<java.util.Double> getThreshold()` would result in an overloading error due to the same function signature, and adding a new function with a different name does not seem necessary. cc jkbradley feynmanliang Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10102 from hhbyyh/javaAPI.
*	[SPARK-11439][ML] Optimization of creating sparse feature without dense one	Nakul Jindal	2015-12-08	3	-122/+142
\| \| \| \| \| \| \| \|	Sparse feature generated in LinearDataGenerator does not create dense vectors as an intermediate any more. Author: Nakul Jindal <njindal@us.ibm.com> Closes #9756 from nakul02/SPARK-11439_sparse_without_creating_dense_feature.
*	[SPARK-11958][SPARK-11957][ML][DOC] SQLTransformer user guide and example code	Yanbo Liang	2015-12-07	1	-2/+9
\| \| \| \| \| \| \| \|	Add ```SQLTransformer``` user guide, example code and make Scala API doc more clear. Author: Yanbo Liang <ybliang8@gmail.com> Closes #10006 from yanboliang/spark-11958.
*	[SPARK-10259][ML] Add @since annotation to ml.classification	Takahashi Hiroshi	2015-12-07	7	-44/+185
\| \| \| \| \| \| \| \|	Add since annotation to ml.classification Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp> Closes #8534 from taishi-oss/issue10259.
*	[SPARK-12160][MLLIB] Use SQLContext.getOrCreate in MLlib	Joseph K. Bradley	2015-12-07	13	-29/+29
\| \| \| \| \| \| \| \| \| \| \| \|	Switched from using SQLContext constructor to using getOrCreate, mainly in model save/load methods. This covers all instances in spark.mllib. There were no uses of the constructor in spark.ml. CC: mengxr yhuai Author: Joseph K. Bradley <joseph@databricks.com> Closes #10161 from jkbradley/mllib-sqlcontext-fix.
*	[SPARK-11988][ML][MLLIB] Update JPMML to 1.2.7	Sean Owen	2015-12-05	5	-63/+58
\| \| \| \| \| \| \| \|	Update JPMML pmml-model to 1.2.7 Author: Sean Owen <sowen@cloudera.com> Closes #9972 from srowen/SPARK-11988.
*	[SPARK-11994][MLLIB] Word2VecModel load and save cause SparkException when ↵	Antonio Murgia	2015-12-05	2	-4/+31
\| \| \| \| \| \| \| \|	model is bigger than spark.kryoserializer.buffer.max Author: Antonio Murgia <antonio.murgia2@studio.unibo.it> Closes #9989 from tmnd1991/SPARK-11932.
*	[SPARK-12096][MLLIB] remove the old constraint in word2vec	Yuhao Yang	2015-12-05	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-12096 word2vec can now handle a much bigger vocabulary. The old constraint vocabSize.toLong * vectorSize < Int.max / 8 should be removed. The new constraint is vocabSize.toLong * vectorSize < max array length (usually a little less than Int.MaxValue). I tested with vocabsize over 18M and vectorsize = 100. srowen jkbradley Sorry to miss this in last PR. I was reminded today. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #10103 from hhbyyh/w2vCapacity.
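The two constraints can be compared with a little arithmetic. The sketch below is plain Python, not Spark code; `max_array_length` defaulting to `Int.MaxValue - 8` is an assumed stand-in for "a little less than Int.MaxValue", and 18,000,000 is used as a round figure for "over 18M".

```python
INT_MAX = 2**31 - 1  # Scala's Int.MaxValue


def fits_old_constraint(vocab_size, vector_size):
    """Old rule from the message: vocabSize * vectorSize < Int.max / 8."""
    return vocab_size * vector_size < INT_MAX // 8


def fits_new_constraint(vocab_size, vector_size, max_array_length=INT_MAX - 8):
    """New rule: vocabSize * vectorSize < max array length (assumed value)."""
    return vocab_size * vector_size < max_array_length


vocab_size, vector_size = 18_000_000, 100   # 1.8e9 elements in total
print(fits_old_constraint(vocab_size, vector_size))  # False
print(fits_new_constraint(vocab_size, vector_size))  # True
```

Under the old rule the limit is roughly 268 million elements, so the tested configuration would have been rejected; under the new rule it fits.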
*	[SPARK-12112][BUILD] Upgrade to SBT 0.13.9	Josh Rosen	2015-12-05	5	-14/+16
\| \| \| \| \| \| \| \| \| \|	We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin). I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations. Author: Josh Rosen <joshrosen@databricks.com> Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
*	[SPARK-6990][BUILD] Add Java linting script; fix minor warnings	Dmitry Erastov	2015-12-04	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This replaces https://github.com/apache/spark/pull/9696 Invoke Checkstyle and print any errors to the console, failing the step. Use Google's style rules modified according to https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide Some important checks are disabled (see TODOs in `checkstyle.xml`) due to multiple violations being present in the codebase. Suggest fixing those TODOs in a separate PR(s). More on Checkstyle can be found on the [official website](http://checkstyle.sourceforge.net/). Sample output (from [build 46345](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46345/consoleFull)) (duplicated because I run the build twice with different profiles): > Checkstyle checks failed at following occurrences: [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/UnsafeRowParquetRecordReader.java:[217,7] (coding) MissingSwitchDefault: switch without "default" clause. > [ERROR] src/main/java/org/apache/spark/sql/execution/datasources/parquet/SpecificParquetRecordReaderBase.java:[198,10] (modifier) ModifierOrder: 'protected' modifier out of order with the JLS suggestions. > [error] running /home/jenkins/workspace/SparkPullRequestBuilder2/dev/lint-java ; received return code 1 Also fix some of the minor violations that didn't require sweeping changes. Apologies for the previous botched PRs - I finally figured out the issue. 
cr: JoshRosen, pwendell > I state that the contribution is my original work, and I license the work to the project under the project's open source license. Author: Dmitry Erastov <derastov@gmail.com> Closes #9867 from dskrvk/master.
*	[SPARK-12000] do not specify arg types when reference a method in ScalaDoc	Xiangrui Meng	2015-12-02	2	-3/+3
\| \| \| \| \| \| \| \| \| \|	This fixes SPARK-12000, verified on my local machine with JDK 7. It seems that `scaladoc` tries to match method names and gets confused by annotations. cc: JoshRosen jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #10114 from mengxr/SPARK-12000.2.
*	[SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tunning	Yu ISHIKAWA	2015-12-02	3	-16/+58
\| \| \| \| \| \| \| \| \| \| \| \|	cc mengxr noel-smith I worked on this issue based on https://github.com/apache/spark/pull/8729. ehsanmok thank you for your contribution! Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Author: Ehsan M.Kermani <ehsanmo1367@gmail.com> Closes #9338 from yu-iskw/JIRA-10266.
*	[SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues	Cheng Lian	2015-12-01	1	-5/+7
\| \| \| \| \| \| \| \|	This PR backports PR #10039 to master Author: Cheng Lian <lian@databricks.com> Closes #10063 from liancheng/spark-12046.doc-fix.master.
*	[SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec	Yuhao Yang	2015-12-01	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-11898 syn0Global and syn1Global in word2vec are quite large objects with size (vocab * vectorSize * 8), yet they are passed to workers using basic task serialization. Using broadcast can greatly improve performance. My benchmark shows that, for 1M vocabulary and default vectorSize 100, changing to broadcast can help: 1. decrease the worker memory consumption by 45%. 2. decrease running time by 40%. This will also help extend the upper limit for Word2Vec. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9878 from hhbyyh/w2vBC.
*	[SPARK-11847][ML] Model export/import for spark.ml: LDA	Yuhao Yang	2015-11-24	3	-8/+150
\| \| \| \| \| \| \| \| \| \| \|	Add read/write support to LDA, similar to ALS. save/load for ml.LocalLDAModel is done. For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send update after some test. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9894 from hhbyyh/ldaMLsave.
*	[SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ↵	Joseph K. Bradley	2015-11-24	2	-0/+33
\| \| \| \| \| \| \| \| \| \| \| \| \|	ignore weight col Doc for 1.6 that the summaries mostly ignore the weight column. To be corrected for 1.7 CC: mengxr thunterdb Author: Joseph K. Bradley <joseph@databricks.com> Closes #9927 from jkbradley/linregsummary-doc.
*	[SPARK-11902][ML] Unhandled case in VectorAssembler#transform	BenFradet	2015-11-22	2	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types DoubleType, NumericType, BooleanType or VectorUDT. So, if you try to transform a column of StringType you get a cryptic "scala.MatchError: StringType". This PR aims to fix this, throwing a SparkException when dealing with an unknown column type. Author: BenFradet <benjamin.fradet@gmail.com> Closes #9885 from BenFradet/SPARK-11902.
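The general pattern is to add an explicit fallback branch that names the offending column and type instead of letting the dispatch fail on its own. The sketch below is a plain-Python analogue, not the Scala change itself; `UnsupportedColumnTypeError`, `check_input_type` and the error wording are hypothetical.

```python
# Type names taken from the commit message; represented as strings for illustration.
SUPPORTED_TYPES = {"DoubleType", "NumericType", "BooleanType", "VectorUDT"}


class UnsupportedColumnTypeError(Exception):
    """Hypothetical stand-in for the SparkException mentioned in the message."""


def check_input_type(column_name, type_name):
    """Raise a descriptive error for column types the assembler cannot handle."""
    if type_name not in SUPPORTED_TYPES:
        raise UnsupportedColumnTypeError(
            f"Column {column_name!r} has unsupported type {type_name}; "
            f"expected one of {sorted(SUPPORTED_TYPES)}"
        )


check_input_type("features", "VectorUDT")   # passes silently
try:
    check_input_type("name", "StringType")
except UnsupportedColumnTypeError as err:
    print(err)
```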
*	[SPARK-11912][ML] ml.feature.PCA minor refactor	Yanbo Liang	2015-11-22	2	-30/+24
\| \| \| \| \| \| \| \|	Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is params and we should save it under ```metadata/``` rather than both under ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc``` but construct ```mllib.feature.PCAModel``` inside ```transform```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9897 from yanboliang/spark-11912.
*	[SPARK-6791][ML] Add read/write for CrossValidator and Evaluators	Joseph K. Bradley	2015-11-22	12	-85/+522
\| \| \| \| \| \| \| \| \| \| \| \|	I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.) Added read/write for all 3 Evaluators as well. CC: mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #9848 from jkbradley/cv-io.
*	[SPARK-11852][ML] StandardScaler minor refactor	Yanbo Liang	2015-11-20	2	-39/+32
\| \| \| \| \| \| \| \|	```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```. Author: Yanbo Liang <ybliang8@gmail.com> Closes #9839 from yanboliang/standardScaler-refactor.
*	[SPARK-11867] Add save/load for kmeans and naive bayes	Xusen Yin	2015-11-19	4	-28/+195
\| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-11867 Author: Xusen Yin <yinxusen@gmail.com> Closes #9849 from yinxusen/SPARK-11867.
*	[SPARK-11869][ML] Clean up TempDirectory properly in ML tests	Joseph K. Bradley	2015-11-19	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Need to remove parent directory (```className```) rather than just tempDir (```className/random_name```) I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem. CC: mengxr Can you confirm this is fine? I believe it is since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting. Author: Joseph K. Bradley <joseph@databricks.com> Closes #9851 from jkbradley/tempdir-cleanup.
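The nesting problem can be reproduced with the standard library alone. In this sketch the layout `className/random_name` follows the message, while `make_suite_temp_dir` and the throwaway `root` directory are hypothetical names used only for the demonstration.

```python
import os
import shutil
import tempfile


def make_suite_temp_dir(root, class_name):
    """Create root/class_name/<random_name> and return (parent, temp_dir)."""
    parent = os.path.join(root, class_name)
    os.makedirs(parent, exist_ok=True)
    temp_dir = tempfile.mkdtemp(dir=parent)   # the random_name level
    return parent, temp_dir


root = tempfile.mkdtemp()                     # scratch area for this demo
parent, temp_dir = make_suite_temp_dir(root, "IDFSuite")

# Removing only the innermost directory leaves the className level behind.
shutil.rmtree(temp_dir)
parent_left_behind = os.path.isdir(parent)

# Removing the parent cleans up both levels.
shutil.rmtree(parent)
parent_removed = not os.path.exists(parent)

shutil.rmtree(root)
print(parent_left_behind, parent_removed)
```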
*	[SPARK-11829][ML] Add read/write to estimators under ml.feature (II)	Yanbo Liang	2015-11-19	9	-33/+338
\| \| \| \| \| \| \| \| \| \| \| \|	Add read/write support to the following estimators under spark.ml: * ChiSqSelector * PCA * VectorIndexer * Word2Vec Author: Yanbo Liang <ybliang8@gmail.com> Closes #9838 from yanboliang/spark-11829.
*	[SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression	Xusen Yin	2015-11-19	4	-22/+210
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-11846 mengxr Author: Xusen Yin <yinxusen@gmail.com> Closes #9836 from yinxusen/SPARK-11846.
*	[SPARK-11842][ML] Small cleanups to existing Readers and Writers	Joseph K. Bradley	2015-11-18	10	-25/+38
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Updates: * Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel. * Strengthen privacy to class and companion object for Writers and Readers * Change LogisticRegressionSuite read/write test to fit intercept * Add Since versions for read/write methods in Pipeline, LogisticRegression * Switch from hand-written class names in Readers to using getClass CC: mengxr CC: yanboliang Would you mind taking a look at this PR? mengxr might not be able to soon. Thank you! Author: Joseph K. Bradley <joseph@databricks.com> Closes #9829 from jkbradley/ml-io-cleanups.
*	[SPARK-11839][ML] refactor save/write traits	Xiangrui Meng	2015-11-18	27	-321/+190
\| \| \| \| \| \| \| \| \| \| \| \|	* add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.* * define `DefaultParamsReadable/Writable` and use them to save some code * use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9827 from mengxr/SPARK-11839.
*	[SPARK-6787][ML] add read/write to estimators under ml.feature (1)	Xiangrui Meng	2015-11-18	10	-47/+467
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add read/write support to the following estimators under spark.ml: * CountVectorizer * IDF * MinMaxScaler * StandardScaler (a little awkward because we store some params in spark.mllib model) * StringIndexer Added some necessary method for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9798 from mengxr/SPARK-6787.
*	[SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example ↵	Yanbo Liang	2015-11-18	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	codes This PR includes: * Update SparkR:::glm, SparkR:::summary API docs. * Update SparkR machine learning user guide and example codes to show: * supporting feature interaction in R formula. * summary for gaussian GLM model. * coefficients for binomial GLM model. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9727 from yanboliang/spark-11684.
*	[SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec	Yuhao Yang	2015-11-18	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-11813 I found the problem while training on a large corpus. Avoiding serialization of vocab in Word2Vec has 2 benefits. 1. Performance improvement from less serialization. 2. Increases the capacity of Word2Vec a lot. Currently in the fit of word2vec, the closure mainly includes serialization of Word2Vec and 2 global tables. The main part of Word2Vec is the vocab, of size: vocab * 40 * 2 * 4 = 320 vocab. The 2 global tables: vocab * vectorSize * 8. If vectorSize = 20, that's 160 vocab. Their sum cannot exceed Int.max due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of vocab helps decrease the size of the closure serialization, especially when vectorSize is small, thus allowing a larger vocabulary. Actually there's another possible fix: make a local copy of fields to avoid including Word2Vec in the closure. Let me know if that's preferred. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #9803 from hhbyyh/w2vVocab.
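The size estimate in the message can be turned into a rough capacity calculation. The per-word byte counts below (320 for the vocab, `vectorSize * 8` for the two tables) come from the message; treating `Int.max` as a hard cap on their sum and ignoring all other closure contents is a simplifying assumption.

```python
INT_MAX = 2**31 - 1                      # limit imposed by ByteArrayOutputStream
VOCAB_BYTES_PER_WORD = 40 * 2 * 4        # 320 bytes per vocabulary word


def tables_bytes_per_word(vector_size):
    """Bytes per word taken by the two global tables together."""
    return vector_size * 8


def max_vocab(vector_size, serialize_vocab):
    """Largest vocabulary whose serialized closure stays within INT_MAX bytes."""
    per_word = tables_bytes_per_word(vector_size)
    if serialize_vocab:
        per_word += VOCAB_BYTES_PER_WORD
    return INT_MAX // per_word


before = max_vocab(20, serialize_vocab=True)    # 480 bytes per word
after = max_vocab(20, serialize_vocab=False)    # 160 bytes per word
print(before, after)  # 4473924 13421772
```

For vectorSize = 20, dropping the vocab from the closure roughly triples the vocabulary that fits; the gain shrinks as vectorSize grows, because the tables then dominate.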
*	[SPARK-6789][ML] Add Readable, Writable support for spark.ml ALS, ALSModel	Joseph K. Bradley	2015-11-18	3	-17/+150
\| \| \| \| \| \| \| \| \| \|	Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata. CC: mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #9786 from jkbradley/als-io.
*	[SPARK-6790][ML] Add spark.ml LinearRegression import/export	Wenjian Huang	2015-11-18	2	-5/+106
\| \| \| \| \| \| \| \| \| \| \| \| \|	This replaces [https://github.com/apache/spark/pull/9656] with updates. fayeshine should be the main author when this PR is committed. CC: mengxr fayeshine Author: Wenjian Huang <nextrush@163.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #9814 from jkbradley/fayeshine-patch-6790.
*	[SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler	RoyGaoVLIS	2015-11-17	1	-0/+108
\| \| \| \| \| \| \| \| \|	I have added a unit test for ML's StandardScaler that compares against R's output. Please review. Thanks. Author: RoyGaoVLIS <roygao@zju.edu.cn> Closes #6665 from RoyGao/7013.
*	[SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector	Xiangrui Meng	2015-11-17	2	-6/+28
\| \| \| \| \| \| \| \|	This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9776 from mengxr/SPARK-11764.
*	[SPARK-11763][ML] Add save,load to LogisticRegression Estimator	Joseph K. Bradley	2015-11-17	7	-59/+173
\| \| \| \| \| \| \| \| \| \| \| \|	Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs. Moved LogisticRegressionReader/Writer to within LogisticRegressionModel CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9749 from jkbradley/lr-io-2.
*	[SPARK-11769][ML] Add save, load to all basic Transformers	Joseph K. Bradley	2015-11-17	32	-84/+453
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds: * Bucketizer * DCT * HashingTF * Interaction * NGram * Normalizer * OneHotEncoder * PolynomialExpansion * QuantileDiscretizer * RFormula * SQLTransformer * StopWordsRemover * StringIndexer * Tokenizer * VectorAssembler * VectorSlicer CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9755 from jkbradley/transformer-io.
*	[SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors	Xiangrui Meng	2015-11-17	2	-0/+62
\| \| \| \| \| \| \| \|	This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9751 from mengxr/SPARK-11766.
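A rough sketch of what such a round trip could look like follows. The field names (`type`, `size`, `indices`, `values`) and the codes 0 for sparse and 1 for dense are assumed to mirror the `VectorUDT` schema mentioned above; the functions are hypothetical plain-Python helpers, not Spark's `toJson`/`fromJson`.

```python
import json


def dense_to_json(values):
    """Encode a dense vector; type code 1 is assumed to mean dense."""
    return json.dumps({"type": 1, "values": list(values)})


def sparse_to_json(size, indices, values):
    """Encode a sparse vector; type code 0 is assumed to mean sparse."""
    return json.dumps(
        {"type": 0, "size": size, "indices": list(indices), "values": list(values)}
    )


def vector_from_json(text):
    """Decode either encoding back into a plain tuple description."""
    obj = json.loads(text)
    if obj["type"] == 1:
        return ("dense", obj["values"])
    if obj["type"] == 0:
        return ("sparse", obj["size"], obj["indices"], obj["values"])
    raise ValueError(f"unknown vector type code: {obj['type']}")


print(vector_from_json(dense_to_json([1.0, 0.0, 3.0])))
print(vector_from_json(sparse_to_json(3, [0, 2], [1.0, 3.0])))
```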
*	[SPARK-11612][ML] Pipeline and PipelineModel persistence	Joseph K. Bradley	2015-11-16	4	-18/+306
\| \| \| \| \| \| \| \| \| \| \| \|	Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable. Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #9674 from jkbradley/pipeline-io.
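The rule "persistence succeeds only when all stages are Writable" amounts to validating every stage before writing anything. The sketch below shows that check in plain Python; the `Writable` base class, the stage classes and the `stages/<index>` path layout are hypothetical and do not reflect Spark's on-disk format.

```python
class Writable:
    """Hypothetical marker for stages that can save themselves."""

    def save(self, path):
        return f"{type(self).__name__} -> {path}"


class Tokenizer(Writable):
    pass


class HashingTF(Writable):
    pass


class CustomStage:
    """A stage without save support."""


def save_pipeline(stages, path):
    """Save all stages, or fail up front if any stage is not Writable."""
    not_writable = [type(s).__name__ for s in stages if not isinstance(s, Writable)]
    if not_writable:
        raise TypeError(f"stages without write support: {not_writable}")
    return [s.save(f"{path}/stages/{i}") for i, s in enumerate(stages)]


print(save_pipeline([Tokenizer(), HashingTF()], "/tmp/pipeline"))
```

Checking before writing avoids leaving a half-saved pipeline on disk when a later stage turns out not to be persistable.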
*	[SPARK-11672][ML] set active SQLContext in JavaDefaultReadWriteSuite	Xiangrui Meng	2015-11-15	1	-1/+6
\| \| \| \| \| \| \| \|	The same as #9694, but for Java test suite. yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9719 from mengxr/SPARK-11672.4.
*	[MINOR][ML] remove MLlibTestsSparkContext from ImpuritySuite	Xiangrui Meng	2015-11-13	1	-2/+1
\| \| \| \| \| \| \| \|	ImpuritySuite doesn't need SparkContext. Author: Xiangrui Meng <meng@databricks.com> Closes #9698 from mengxr/remove-mllib-test-context-in-impurity-suite.
*	[SPARK-11672][ML] Set active SQLContext in MLlibTestSparkContext.beforeAll	Xiangrui Meng	2015-11-13	2	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Still saw some error messages caused by `SQLContext.getOrCreate`: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/ This PR sets the active SQLContext in beforeAll, which is not automatically set in `new SQLContext`. This makes `SQLContext.getOrCreate` return the right SQLContext. cc: yhuai Author: Xiangrui Meng <meng@databricks.com> Closes #9694 from mengxr/SPARK-11672.3.
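The failure mode described here can be shown with a toy get-or-create registry. `ToyContext` below is a hypothetical, heavily simplified stand-in: it models only the behaviour stated in the message (constructing a context does not make it active) and ignores everything else `SQLContext.getOrCreate` does.

```python
class ToyContext:
    """Toy context with an explicit 'active' slot (illustration only)."""

    _active = None

    def __init__(self, name):
        self.name = name          # note: construction does not set _active

    @classmethod
    def set_active(cls, ctx):
        cls._active = ctx

    @classmethod
    def get_or_create(cls):
        if cls._active is None:
            cls._active = cls("implicitly-created")
        return cls._active


test_ctx = ToyContext("test-suite")

# Without set_active, library code gets a different context than the suite's.
mismatch = ToyContext.get_or_create() is not test_ctx

# Setting the active context first, as in beforeAll, fixes the lookup.
ToyContext.set_active(test_ctx)
match = ToyContext.get_or_create() is test_ctx
print(mismatch, match)
```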
*	[SPARK-11723][ML][DOC] Use LibSVM data source rather than ↵	Yanbo Liang	2015-11-13	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MLUtils.loadLibSVMFile to load DataFrame Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame, include: * Use libSVM data source for all example codes under examples/ml, and remove unused import. * Use libSVM data source for user guides under ml-*** which were omitted by #8697. * Fix bug: We should use ```sqlContext.read().format("libsvm").load(path)``` at Java side, but the API doc and user guides misuse as ```sqlContext.read.format("libsvm").load(path)```. * Code cleanup. mengxr Author: Yanbo Liang <ybliang8@gmail.com> Closes #9690 from yanboliang/spark-11723.