path: root/mllib
Commit message (Author, Date, Files changed, Lines -deleted/+added)
* [SPARK-12000] do not specify arg types when referencing a method in ScalaDoc (Xiangrui Meng, 2015-12-02, 2 files, -3/+3)
  This fixes SPARK-12000, verified locally with JDK 7. It seems that `scaladoc` tries to match method names and gets confused by annotations. cc: JoshRosen jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #10114 from mengxr/SPARK-12000.2.
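  A hedged sketch of the pattern this fix targets (the class and method here are stand-ins, not the actual lines changed): ScalaDoc links to overloaded methods are written without argument types, letting `scaladoc` resolve the name itself.

  ```scala
  /**
   * Fragile form (spelling out argument types can confuse scaladoc):
   * [[org.apache.spark.mllib.linalg.Vectors.dense(values:Array[Double])]]
   *
   * Robust form, in the spirit of this fix:
   * [[org.apache.spark.mllib.linalg.Vectors.dense]]
   */
  object ScalaDocLinkSketch
  ```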
* [SPARK-10266][DOCUMENTATION, ML] Fixed @Since annotation for ml.tuning (Yu ISHIKAWA, 2015-12-02, 3 files, -16/+58)
  cc mengxr noel-smith
  I worked on this issue based on https://github.com/apache/spark/pull/8729. ehsanmok, thank you for your contribution!
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
  Closes #9338 from yu-iskw/JIRA-10266.
* [SPARK-12046][DOC] Fixes various ScalaDoc/JavaDoc issues (Cheng Lian, 2015-12-01, 1 file, -5/+7)
  This PR backports PR #10039 to master.
  Author: Cheng Lian <lian@databricks.com>
  Closes #10063 from liancheng/spark-12046.doc-fix.master.
* [SPARK-11898][MLLIB] Use broadcast for the global tables in Word2Vec (Yuhao Yang, 2015-12-01, 1 file, -1/+6)
  jira: https://issues.apache.org/jira/browse/SPARK-11898
  syn0Global and syn1Global in Word2Vec are quite large objects, of size (vocab * vectorSize * 8) bytes, yet they are passed to workers using basic task serialization. Using broadcast can greatly improve the performance. My benchmark shows that, for a 1M vocabulary and the default vectorSize of 100, changing to broadcast can:
  1. decrease worker memory consumption by 45%;
  2. decrease running time by 40%.
  This will also help extend the upper limit for Word2Vec.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9878 from hhbyyh/w2vBC.
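  A minimal sketch of the pattern described above, with hypothetical names and a scaled-down table: a broadcast variable ships the shared array to each executor once, instead of serializing it into every task closure.

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}

  object W2VBroadcastSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("w2v-broadcast-sketch").setMaster("local[2]"))
      val vocabSize = 1000 // ~1M in the benchmark above
      val vectorSize = 100
      val syn0Global = Array.fill(vocabSize * vectorSize)(0.1f) // large shared table
      val bcSyn0 = sc.broadcast(syn0Global) // shipped once per executor, not per task

      val words = sc.parallelize(0 until vocabSize, 4)
      val sum = words.map { w =>
        val syn0 = bcSyn0.value // cheap local access inside the task
        syn0(w * vectorSize)
      }.sum()
      println(s"touched first components, sum = $sum")

      bcSyn0.destroy()
      sc.stop()
    }
  }
  ```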
* [SPARK-11847][ML] Model export/import for spark.ml: LDA (Yuhao Yang, 2015-11-24, 3 files, -8/+150)
  Add read/write support to LDA, similar to ALS. save/load for ml.LocalLDAModel is done. For DistributedLDAModel, I'm not sure if we can invoke save on the mllib.DistributedLDAModel directly. I'll send an update after some testing.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9894 from hhbyyh/ldaMLsave.
* [SPARK-11521][ML][DOC] Document that Logistic, Linear Regression summaries ignore weight col (Joseph K. Bradley, 2015-11-24, 2 files, -0/+33)
  Documents for 1.6 that the summaries mostly ignore the weight column; to be corrected in 1.7. CC: mengxr thunterdb
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9927 from jkbradley/linregsummary-doc.
* [SPARK-11902][ML] Unhandled case in VectorAssembler#transform (BenFradet, 2015-11-22, 2 files, -0/+13)
  There is an unhandled case in the transform method of VectorAssembler if one of the input columns doesn't have one of the supported types: DoubleType, NumericType, BooleanType, or VectorUDT. So, if you try to transform a column of StringType, you get a cryptic "scala.MatchError: StringType". This PR fixes this by throwing a SparkException when dealing with an unknown column type.
  Author: BenFradet <benjamin.fradet@gmail.com>
  Closes #9885 from BenFradet/SPARK-11902.
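  A hedged sketch of the shape of the fix (the method name and message are illustrative, and the VectorUDT branch is elided): the type match gains a catch-all case that raises a descriptive SparkException instead of letting a MatchError escape.

  ```scala
  import org.apache.spark.SparkException
  import org.apache.spark.sql.types._

  object AssemblerTypeCheckSketch {
    // Illustrative validation in the spirit of the fix described above.
    def validateInputType(dataType: DataType): Unit = dataType match {
      case DoubleType | BooleanType => // supported as-is
      case _: NumericType           => // numeric columns are cast to Double
      case other =>
        throw new SparkException(
          s"VectorAssembler does not support the ${other.simpleString} type")
    }

    def main(args: Array[String]): Unit = {
      validateInputType(IntegerType)    // fine
      try validateInputType(StringType) // now a clear error, not a MatchError
      catch { case e: SparkException => println(e.getMessage) }
    }
  }
  ```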
* [SPARK-11912][ML] ml.feature.PCA minor refactor (Yanbo Liang, 2015-11-22, 2 files, -30/+24)
  Like [SPARK-11852](https://issues.apache.org/jira/browse/SPARK-11852), ```k``` is a param and we should save it under ```metadata/``` rather than under both ```data/``` and ```metadata/```. Refactor the constructor of ```ml.feature.PCAModel``` to take only ```pc```, and construct ```mllib.feature.PCAModel``` inside ```transform```.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9897 from yanboliang/spark-11912.
* [SPARK-6791][ML] Add read/write for CrossValidator and Evaluators (Joseph K. Bradley, 2015-11-22, 12 files, -85/+522)
  I believe this works for general estimators within CrossValidator, including compound estimators. (See the complex unit test.) Added read/write for all 3 Evaluators as well. CC: mengxr yanboliang
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9848 from jkbradley/cv-io.
* [SPARK-11852][ML] StandardScaler minor refactor (Yanbo Liang, 2015-11-20, 2 files, -39/+32)
  ```withStd``` and ```withMean``` should be params of ```StandardScaler``` and ```StandardScalerModel```.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9839 from yanboliang/standardScaler-refactor.
* [SPARK-11867] Add save/load for kmeans and naive bayes (Xusen Yin, 2015-11-19, 4 files, -28/+195)
  https://issues.apache.org/jira/browse/SPARK-11867
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9849 from yinxusen/SPARK-11867.
* [SPARK-11869][ML] Clean up TempDirectory properly in ML tests (Joseph K. Bradley, 2015-11-19, 1 file, -1/+1)
  Need to remove the parent directory (```className```) rather than just tempDir (```className/random_name```). I tested this with IDFSuite, which has 2 read/write tests, and it fixes the problem. CC: mengxr Can you confirm this is fine? I believe it is, since the same ```random_name``` is used for all tests in a suite; we basically have an extra unneeded level of nesting.
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9851 from jkbradley/tempdir-cleanup.
* [SPARK-11829][ML] Add read/write to estimators under ml.feature (II) (Yanbo Liang, 2015-11-19, 9 files, -33/+338)
  Add read/write support to the following estimators under spark.ml:
  * ChiSqSelector
  * PCA
  * VectorIndexer
  * Word2Vec
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9838 from yanboliang/spark-11829.
* [SPARK-11846] Add save/load for AFTSurvivalRegression and IsotonicRegression (Xusen Yin, 2015-11-19, 4 files, -22/+210)
  https://issues.apache.org/jira/browse/SPARK-11846 mengxr
  Author: Xusen Yin <yinxusen@gmail.com>
  Closes #9836 from yinxusen/SPARK-11846.
* [SPARK-11842][ML] Small cleanups to existing Readers and Writers (Joseph K. Bradley, 2015-11-18, 10 files, -25/+38)
  Updates:
  * Add repartition(1) to save() methods' saving of data for LogisticRegressionModel, LinearRegressionModel.
  * Strengthen privacy to class and companion object for Writers and Readers.
  * Change LogisticRegressionSuite read/write test to fit intercept.
  * Add Since versions for read/write methods in Pipeline, LogisticRegression.
  * Switch from hand-written class names in Readers to using getClass.
  CC: mengxr yanboliang Would you mind taking a look at this PR? mengxr might not be able to soon. Thank you!
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9829 from jkbradley/ml-io-cleanups.
* [SPARK-11839][ML] refactor save/write traits (Xiangrui Meng, 2015-11-18, 27 files, -321/+190)
  * add "ML" prefix to reader/writer/readable/writable to avoid name collision with java.util.*
  * define `DefaultParamsReadable/Writable` and use them to save some code
  * use `super.load` instead so people can jump directly to the doc of `Readable.load`, which documents the Java compatibility issues
  jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9827 from mengxr/SPARK-11839.
* [SPARK-6787][ML] add read/write to estimators under ml.feature (1) (Xiangrui Meng, 2015-11-18, 10 files, -47/+467)
  Add read/write support to the following estimators under spark.ml:
  * CountVectorizer
  * IDF
  * MinMaxScaler
  * StandardScaler (a little awkward because we store some params in the spark.mllib model)
  * StringIndexer
  Added some necessary methods for read/write. Maybe we should add `private[ml] trait DefaultParamsReadable` and `DefaultParamsWritable` to save some boilerplate code, though we still need to override `load` for Java compatibility. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9798 from mengxr/SPARK-6787.
* [SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes (Yanbo Liang, 2015-11-18, 1 file, -0/+3)
  This PR includes:
  * Update SparkR:::glm, SparkR:::summary API docs.
  * Update SparkR machine learning user guide and example codes to show:
    * supporting feature interaction in R formula;
    * summary for gaussian GLM model;
    * coefficients for binomial GLM model.
  mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9727 from yanboliang/spark-11684.
* [SPARK-11813][MLLIB] Avoid serialization of vocab in Word2Vec (Yuhao Yang, 2015-11-18, 1 file, -2/+2)
  jira: https://issues.apache.org/jira/browse/SPARK-11813
  I found the problem while training a large corpus. Avoiding serialization of the vocab in Word2Vec has 2 benefits:
  1. performance improvement from less serialization;
  2. a large increase in the capacity of Word2Vec.
  Currently, in the fit of Word2Vec, the closure mainly includes the serialization of Word2Vec and the 2 global tables. The main part of Word2Vec is the vocab, of size vocab * 40 * 2 * 4 = 320 * vocab bytes; the 2 global tables together take vocab * vectorSize * 8 bytes. If vectorSize = 20, that's 160 * vocab bytes. Their sum cannot exceed Int.MaxValue due to the restriction of ByteArrayOutputStream. In any case, avoiding serialization of the vocab helps decrease the size of the closure serialization, especially when vectorSize is small, and thus allows a larger vocabulary.
  Actually there's another possible fix: make local copies of fields to avoid including Word2Vec in the closure. Let me know if that's preferred.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9803 from hhbyyh/w2vVocab.
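  A minimal sketch of the fix's mechanism, with illustrative names: a field marked `@transient` is skipped by Java serialization, so a huge driver-side vocab no longer inflates (or overflows) the serialized closure.

  ```scala
  import java.io.{ByteArrayOutputStream, ObjectOutputStream}

  // Illustrative only: `vocab` lives on the driver and is excluded from
  // serialization; workers never receive the full table in the closure.
  class VocabHolder(words: Array[String]) extends Serializable {
    @transient private val vocab: Array[String] = words
    val vectorSize: Int = 20
  }

  object TransientSketch {
    def main(args: Array[String]): Unit = {
      val holder = new VocabHolder(Array.fill(1000000)("word"))
      val bytes = new ByteArrayOutputStream()
      new ObjectOutputStream(bytes).writeObject(holder)
      println(s"serialized size: ${bytes.size()} bytes") // tiny: vocab is skipped
    }
  }
  ```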
* [SPARK-6789][ML] Add Readable, Writable support for spark.ml ALS, ALSModel (Joseph K. Bradley, 2015-11-18, 3 files, -17/+150)
  Also modifies DefaultParamsWriter.saveMetadata to take optional extra metadata. CC: mengxr yanboliang
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9786 from jkbradley/als-io.
* [SPARK-6790][ML] Add spark.ml LinearRegression import/export (Wenjian Huang, 2015-11-18, 2 files, -5/+106)
  This replaces [https://github.com/apache/spark/pull/9656] with updates. fayeshine should be the main author when this PR is committed. CC: mengxr fayeshine
  Author: Wenjian Huang <nextrush@163.com>
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9814 from jkbradley/fayeshine-patch-6790.
* [SPARK-7013][ML][TEST] Add unit test for spark.ml StandardScaler (RoyGaoVLIS, 2015-11-17, 1 file, -0/+108)
  I have added a unit test for ML's StandardScaler by comparing with R's output. Please review. Thanks.
  Author: RoyGaoVLIS <roygao@zju.edu.cn>
  Closes #6665 from RoyGao/7013.
* [SPARK-11764][ML] make Param.jsonEncode/jsonDecode support Vector (Xiangrui Meng, 2015-11-17, 2 files, -6/+28)
  This PR makes the default read/write work with simple transformers/estimators that have params of type `Param[Vector]`. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9776 from mengxr/SPARK-11764.
* [SPARK-11763][ML] Add save, load to LogisticRegression Estimator (Joseph K. Bradley, 2015-11-17, 7 files, -59/+173)
  Add save/load to LogisticRegression Estimator, and refactor tests a little to make it easier to add similar support to other Estimator, Model pairs. Moved LogisticRegressionReader/Writer to within LogisticRegressionModel. CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9749 from jkbradley/lr-io-2.
* [SPARK-11769][ML] Add save, load to all basic Transformers (Joseph K. Bradley, 2015-11-17, 32 files, -84/+453)
  This excludes Estimators and ones which include Vector and other non-basic types for Params or data. This adds:
  * Bucketizer
  * DCT
  * HashingTF
  * Interaction
  * NGram
  * Normalizer
  * OneHotEncoder
  * PolynomialExpansion
  * QuantileDiscretizer
  * RFormula
  * SQLTransformer
  * StopWordsRemover
  * StringIndexer
  * Tokenizer
  * VectorAssembler
  * VectorSlicer
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9755 from jkbradley/transformer-io.
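  A hedged usage sketch of the save/load API these transformers gain (the path and param values are illustrative): params are persisted as JSON metadata and restored into an equivalent instance.

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.ml.feature.Bucketizer

  object TransformerIOSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("transformer-io-sketch").setMaster("local[2]"))

      val bucketizer = new Bucketizer()
        .setInputCol("raw")
        .setOutputCol("bucketed")
        .setSplits(Array(Double.NegativeInfinity, 0.0, 1.0, Double.PositiveInfinity))

      // Persist params as JSON metadata, then restore an equivalent transformer.
      bucketizer.write.overwrite().save("/tmp/bucketizer-sketch")
      val restored = Bucketizer.load("/tmp/bucketizer-sketch")
      assert(restored.getSplits.sameElements(bucketizer.getSplits))
      sc.stop()
    }
  }
  ```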
* [SPARK-11766][MLLIB] add toJson/fromJson to Vector/Vectors (Xiangrui Meng, 2015-11-17, 2 files, -0/+62)
  This is to support JSON serialization of Param[Vector] in the pipeline API. It could be used for other purposes too. The schema is the same as `VectorUDT`. jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9751 from mengxr/SPARK-11766.
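  A minimal round-trip sketch of the API this commit adds (values illustrative):

  ```scala
  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  object VectorJsonSketch {
    def main(args: Array[String]): Unit = {
      val v: Vector = Vectors.sparse(4, Array(1, 3), Array(2.0, 5.0))
      val json: String = v.toJson // schema matches VectorUDT
      val restored: Vector = Vectors.fromJson(json)
      assert(restored == v)
      println(json)
    }
  }
  ```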
* [SPARK-11612][ML] Pipeline and PipelineModel persistence (Joseph K. Bradley, 2015-11-16, 4 files, -18/+306)
  Pipeline and PipelineModel extend Readable and Writable. Persistence succeeds only when all stages are Writable. Note: This PR reinstates tests for other read/write functionality. It should probably not get merged until [https://issues.apache.org/jira/browse/SPARK-11672] gets fixed. CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9674 from jkbradley/pipeline-io.
* [SPARK-11672][ML] set active SQLContext in JavaDefaultReadWriteSuite (Xiangrui Meng, 2015-11-15, 1 file, -1/+6)
  The same as #9694, but for the Java test suite. yhuai
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9719 from mengxr/SPARK-11672.4.
* [MINOR][ML] remove MLlibTestsSparkContext from ImpuritySuite (Xiangrui Meng, 2015-11-13, 1 file, -2/+1)
  ImpuritySuite doesn't need SparkContext.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9698 from mengxr/remove-mllib-test-context-in-impurity-suite.
* [SPARK-11672][ML] Set active SQLContext in MLlibTestSparkContext.beforeAll (Xiangrui Meng, 2015-11-13, 2 files, -2/+6)
  Still saw some error messages caused by `SQLContext.getOrCreate`: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3997/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.3,label=spark-test/testReport/junit/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/
  This PR sets the active SQLContext in beforeAll, which is not automatically set in `new SQLContext`. This makes `SQLContext.getOrCreate` return the right SQLContext. cc: yhuai
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9694 from mengxr/SPARK-11672.3.
* [SPARK-11723][ML][DOC] Use LibSVM data source rather than MLUtils.loadLibSVMFile to load DataFrame (Yanbo Liang, 2015-11-13, 1 file, -1/+1)
  Use the LibSVM data source rather than MLUtils.loadLibSVMFile to load a DataFrame. This includes:
  * Use the libSVM data source for all example code under examples/ml, and remove unused imports.
  * Use the libSVM data source for the user guides under ml-*** which were omitted by #8697.
  * Fix bug: we should use ```sqlContext.read().format("libsvm").load(path)``` on the Java side, but the API doc and user guides misuse it as ```sqlContext.read.format("libsvm").load(path)```.
  * Code cleanup.
  mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9690 from yanboliang/spark-11723.
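  For reference, a short sketch of the Scala-side call contrasted above (the path is illustrative); Java goes through the `read()` method instead of the parameterless `read` access.

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.sql.SQLContext

  object LibSVMReadSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("libsvm-sketch").setMaster("local[2]"))
      val sqlContext = new SQLContext(sc)

      // Scala side: `read` without parentheses; in Java, sqlContext.read()...
      val df = sqlContext.read.format("libsvm")
        .load("data/mllib/sample_libsvm_data.txt")
      df.select("label", "features").show(5)
      sc.stop()
    }
  }
  ```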
* [SPARK-11672][ML] flaky spark.ml read/write tests (Xiangrui Meng, 2015-11-12, 5 files, -5/+7)
  We set `sqlContext = null` in `afterAll`. However, this doesn't change `SQLContext.activeContext`, so `SQLContext.getOrCreate` might use the `SparkContext` from a previous test suite, which causes the error. This PR calls `clearActive` in `beforeAll` and `afterAll` to avoid using an old context from other test suites. cc: yhuai
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9677 from mengxr/SPARK-11672.2.
* [SPARK-11712][ML] Make spark.ml LDAModel be abstract (Joseph K. Bradley, 2015-11-12, 2 files, -88/+96)
  Per discussion in the initial Pipelines LDA PR [https://github.com/apache/spark/pull/9513], we should make LDAModel abstract and create a LocalLDAModel. This code simplification should be done before the 1.6 release to ensure API compatibility in future releases. CC: feynmanliang mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9678 from jkbradley/lda-pipelines-2.
* [SPARK-11674][ML] add private val after @transient in Word2VecModel (Xiangrui Meng, 2015-11-11, 1 file, -1/+1)
  Without it, compilation fails with Scala 2.11. See https://issues.scala-lang.org/browse/SI-8813. (Jenkins won't test Scala 2.11; I tested the compile locally.) JoshRosen
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9644 from mengxr/SPARK-11674.
* [SPARK-11672][ML] disable spark.ml read/write tests (Xiangrui Meng, 2015-11-11, 4 files, -5/+5)
  Saw several failures on Jenkins, e.g., https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2040/testReport/org.apache.spark.ml.util/JavaDefaultReadWriteSuite/testDefaultReadWrite/. This is the first failure in the master build: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/3982/ I cannot reproduce it locally, so temporarily disable the tests, and I will look into the issue under the same JIRA. I'm going to merge the PR after Jenkins passes compile.
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9641 from mengxr/SPARK-11672.
* [SPARK-11626][ML] ml.feature.Word2Vec.transform() function very slow (Yuming Wang, 2015-11-11, 1 file, -18/+16)
  org.apache.spark.ml.feature.Word2Vec.transform() is very slow; we should not read the broadcast variable for every sentence.
  Author: Yuming Wang <q79969786@gmail.com>
  Author: yuming.wang <q79969786@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9592 from 979969786/master.
* [SPARK-6726][ML] Import/export for spark.ml LogisticRegressionModel (Joseph K. Bradley, 2015-11-10, 4 files, -11/+152)
  This PR adds model save/load for spark.ml's LogisticRegressionModel. It also does minor refactoring of the default save/load classes to reuse code. CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9606 from jkbradley/logreg-io2.
* [SPARK-11566][MLLIB][PYTHON] Refactoring GaussianMixtureModel.gaussians in Python (Yu ISHIKAWA, 2015-11-10, 1 file, -15/+6)
  cc jkbradley
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9534 from yu-iskw/SPARK-11566.
* [SPARK-5565][ML] LDA wrapper for Pipelines API (Joseph K. Bradley, 2015-11-10, 3 files, -5/+946)
  This adds LDA to spark.ml, the Pipelines API. It follows the design doc in the JIRA [https://issues.apache.org/jira/browse/SPARK-5565], with one major change:
  * I eliminated doc IDs. These are not necessary with DataFrames since the user can add an ID column as needed.
  Note: This will conflict with [https://github.com/apache/spark/pull/9484], but I'll try to merge [https://github.com/apache/spark/pull/9484] first and then rebase this PR. CC: hhbyyh feynmanliang If you have a chance to make a pass, that'd be really helpful--thanks! Now that I'm done traveling & this PR is almost ready, I'll see about reviewing other PRs critical for 1.6. CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9513 from jkbradley/lda-pipelines.
* [SPARK-7316][MLLIB] RDD sliding window with step (unknown, 2015-11-10, 3 files, -39/+54)
  Implementation of step capability for the sliding window function in MLlib's RDD. Though one can use the current sliding window with step 1 and then filter every Nth window, it will take more time and space (N * data.count times more than needed). For example, below are the results for various windows and steps on 10M data points:

  Window | Step | Time (s) | Windows produced
  ------ | ---- | -------- | ----------------
  128    | 1    | 6.38     | 9999873
  128    | 10   | 0.9      | 999988
  128    | 100  | 0.41     | 99999
  1024   | 1    | 44.67    | 9998977
  1024   | 10   | 4.74     | 999898
  1024   | 100  | 0.78     | 99990

  ```
  import org.apache.spark.mllib.rdd.RDDFunctions._
  val rdd = sc.parallelize(1 to 10000000, 10)
  rdd.count
  val window = 1024
  val step = 1
  val t = System.nanoTime()
  val windows = rdd.sliding(window, step)
  println(windows.count)
  println((System.nanoTime() - t) / 1e9)
  ```
  Author: unknown <ulanov@ULANOV3.americas.hpqcorp.net>
  Author: Alexander Ulanov <nashb@yandex.ru>
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #5855 from avulanov/SPARK-7316-sliding.
* [SPARK-11618][ML] Minor refactoring of basic ML import/export (Joseph K. Bradley, 2015-11-10, 1 file, -27/+30)
  Refactoring:
  * separated overwrite and param save logic in DefaultParamsWriter
  * added sparkVersion to DefaultParamsWriter
  CC: mengxr
  Author: Joseph K. Bradley <joseph@databricks.com>
  Closes #9587 from jkbradley/logreg-io.
* [SPARK-11069][ML] Add RegexTokenizer option to convert to lowercase (Yuhao Yang, 2015-11-09, 3 files, -7/+35)
  jira: https://issues.apache.org/jira/browse/SPARK-11069
  Quoting the JIRA: Tokenizer converts strings to lowercase automatically, but RegexTokenizer does not. It would be nice to add an option to RegexTokenizer to convert to lowercase. Proposal: call the Boolean Param "toLowercase" and set its default to false (so behavior does not change). Actually, sklearn converts to lowercase before tokenizing too.
  Author: Yuhao Yang <hhbyyh@gmail.com>
  Closes #9092 from hhbyyh/tokenLower.
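  A hedged usage sketch of the new option (the param name follows the proposal quoted above; the default actually merged may differ from the JIRA text):

  ```scala
  import org.apache.spark.ml.feature.RegexTokenizer

  object RegexTokenizerSketch {
    def main(args: Array[String]): Unit = {
      val tokenizer = new RegexTokenizer()
        .setInputCol("text")
        .setOutputCol("tokens")
        .setToLowercase(true) // lowercase before matching, like Tokenizer does
      println(tokenizer.explainParams())
    }
  }
  ```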
* [SPARK-6517][MLLIB] Implement the Algorithm of Hierarchical Clustering (Yu ISHIKAWA, 2015-11-09, 4 files, -0/+841)
  I implemented a hierarchical clustering algorithm again. This PR doesn't include examples, documentation or spark.ml APIs; I am going to send other PRs later. https://issues.apache.org/jira/browse/SPARK-6517
  - This implementation is based on bisecting k-means clustering.
  - It derives from freeman-lab's implementation.
  - The basic idea is not changed from the previous version (#2906).
  - However, it is 1000x faster than the previous version through parallel processing.
  Thank you for your great cooperation, RJ Nowling (rnowling), Jeremy Freeman (freeman-lab), Xiangrui Meng (mengxr) and Sean Owen (srowen).
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Author: Xiangrui Meng <meng@databricks.com>
  Author: Yu ISHIKAWA <yu-iskw@users.noreply.github.com>
  Closes #5267 from yu-iskw/new-hierarchical-clustering.
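  A hedged usage sketch of the bisecting k-means this introduces (class and method names as they appear in spark.mllib around this change; the data is illustrative):

  ```scala
  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.mllib.clustering.BisectingKMeans
  import org.apache.spark.mllib.linalg.Vectors

  object BisectingKMeansSketch {
    def main(args: Array[String]): Unit = {
      val sc = new SparkContext(
        new SparkConf().setAppName("bisecting-kmeans-sketch").setMaster("local[2]"))
      val data = sc.parallelize(Seq(
        Vectors.dense(0.1, 0.1), Vectors.dense(0.3, 0.2),
        Vectors.dense(9.0, 9.2), Vectors.dense(9.2, 9.0)))

      // Repeatedly bisect clusters until k leaf clusters remain.
      val model = new BisectingKMeans().setK(2).run(data)
      model.clusterCenters.foreach(println)
      sc.stop()
    }
  }
  ```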
* [SPARK-11582][MLLIB] specifying pmml version attribute =4.2 in the root node of pmml model (fazlan-nazeem, 2015-11-09, 1 file, -0/+1)
  The PMML models currently generated do not specify the PMML version in their root node. This is a problem when using these PMML models in other tools, because they expect the version attribute to be set explicitly. This fix adds the PMML version attribute to the generated PMML models and specifies its value as 4.2.
  Author: fazlan-nazeem <fazlann@wso2.com>
  Closes #9558 from fazlan-nazeem/master.
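  A one-line sketch of the change's effect, using the JPMML model API that Spark's PMML export is built on (the exact call site in the exporter is hedged):

  ```scala
  import org.dmg.pmml.PMML

  object PmmlVersionSketch {
    def main(args: Array[String]): Unit = {
      val pmml = new PMML()
      pmml.setVersion("4.2") // root element now reads <PMML version="4.2" ...>
      println(pmml.getVersion)
    }
  }
  ```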
* [SPARK-11494][ML][R] Expose R-like summary statistics in SparkR::glm for linear regression (Yanbo Liang, 2015-11-09, 1 file, -4/+46)
  Expose R-like summary statistics in SparkR::glm for linear regression. The output of ```summary``` looks like:

  ```
  $DevianceResiduals
   Min        Max
   -0.9509607 0.7291832

  $Coefficients
                     Estimate   Std. Error t value   Pr(>|t|)
  (Intercept)        1.6765     0.2353597  7.123139  4.456124e-11
  Sepal_Length       0.3498801  0.04630128 7.556598  4.187317e-12
  Species_versicolor -0.9833885 0.07207471 -13.64402 0
  Species_virginica  -1.00751   0.09330565 -10.79796 0
  ```
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9561 from yanboliang/spark-11494.
* [SPARK-8467][MLLIB][PYSPARK] Add LDAModel.describeTopics() in Python (Yu ISHIKAWA, 2015-11-06, 2 files, -2/+57)
  Could jkbradley and davies review it?
  - Create a wrapper class `LDAModelWrapper` for `LDAModel`, because we can't deal with the return value of `describeTopics` in Scala from pyspark directly; `Array[(Array[Int], Array[Double])]` is too complicated to convert.
  - Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
  [[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #8643 from yu-iskw/SPARK-8467-2.
* [SPARK-11217][ML] save/load for non-meta estimators and transformers (Xiangrui Meng, 2015-11-06, 7 files, -4/+469)
  This PR implements the default save/load for non-meta estimators and transformers using the JSON serialization of param values. The saved metadata includes:
  * class name
  * uid
  * timestamp
  * paramMap
  The save/load interface is similar to DataFrames. We use the current active context by default, which should be sufficient for most use cases.
  ~~~scala
  instance.save("path")
  instance.write.context(sqlContext).overwrite().save("path")
  Instance.load("path")
  ~~~
  The param handling is different from the design doc: we didn't save default and user-set params separately, and when we load it back, all parameters are user-set. This does cause issues, but it would also cause other issues if we modified the default params.
  TODOs:
  * [x] Java test
  * [ ] a follow-up PR to implement default save/load for all non-meta estimators and transformers
  cc jkbradley
  Author: Xiangrui Meng <meng@databricks.com>
  Closes #9454 from mengxr/SPARK-11217.
* [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits (Imran Rashid, 2015-11-06, 3 files, -8/+26)
  https://issues.apache.org/jira/browse/SPARK-10116
  This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`. mengxr mkolod
  Author: Imran Rashid <irashid@cloudera.com>
  Closes #8314 from squito/SPARK-10116.
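  A hedged sketch of the idea behind the fix (not the exact patch): hash the seed bytes twice with different hash seeds so both the low and the high 32 bits of the resulting long are randomized, instead of an int-to-long conversion that leaves the high bits constant.

  ```scala
  import java.nio.ByteBuffer
  import scala.util.hashing.MurmurHash3

  object HashSeedSketch {
    def hashSeed(seed: Long): Long = {
      val bytes = ByteBuffer.allocate(8).putLong(seed).array()
      val low  = MurmurHash3.bytesHash(bytes)      // randomizes the low 32 bits
      val high = MurmurHash3.bytesHash(bytes, low) // different hash seed -> high bits
      (high.toLong << 32) | (low.toLong & 0xFFFFFFFFL)
    }

    def main(args: Array[String]): Unit = {
      Seq(0L, 1L, 2L).foreach(s => println(f"${hashSeed(s)}%016x"))
    }
  }
  ```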
* [SPARK-11514][ML] Pass random seed to spark.ml DecisionTree* (Yu ISHIKAWA, 2015-11-05, 5 files, -7/+14)
  cc jkbradley
  Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
  Closes #9486 from yu-iskw/SPARK-11514.
* [SPARK-10265][DOCUMENTATION, ML] Fixed @Since annotation for ml.regression (Ehsan M.Kermani, 2015-11-05, 5 files, -18/+119)
  Here is my first commit.
  Author: Ehsan M.Kermani <ehsanmo1367@gmail.com>
  Closes #8728 from ehsanmok/SinceAnn.