spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	[SPARK-7402] [ML] JSON SerDe for standard param types	Xiangrui Meng	2015-10-13	1	-0/+114
\| \| \| \| \| \| \| \|	This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations of `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #9090 from mengxr/SPARK-7402.
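A minimal sketch of why the floating-point types need special handling: standard JSON has no literal for `NaN` or the infinities, so one option is to encode them as strings. The function names and the string spellings `"NaN"`, `"Inf"`, and `"-Inf"` are illustrative assumptions, not necessarily the encoding this PR uses.

```python
import json
import math


def encode_double(value):
    """Encode a float as JSON text; non-finite values become JSON strings."""
    if math.isnan(value):
        return json.dumps("NaN")
    if math.isinf(value):
        return json.dumps("Inf" if value > 0 else "-Inf")
    return json.dumps(value)


def decode_double(text):
    """Inverse of encode_double (hypothetical helper)."""
    parsed = json.loads(text)
    if isinstance(parsed, str):
        return {"NaN": math.nan, "Inf": math.inf, "-Inf": -math.inf}[parsed]
    return float(parsed)
```

Finite values round-trip as ordinary JSON numbers, so only the three special cases need a dedicated branch.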
*	[SPARK-10875] [MLLIB] Computed covariance matrix should be symmetric	Nick Pritchard	2015-10-08	1	-0/+18
\| \| \| \| \| \| \| \|	Compute upper triangular values of the covariance matrix, then copy to lower triangular values. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8940 from pnpritchard/SPARK-10875.
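The idea can be sketched in plain Python, assuming a small dense dataset held in memory (the actual change operates on a distributed `RowMatrix`; `covariance_matrix` is a hypothetical name). Each entry is computed once for the upper triangle and mirrored, so the result is symmetric by construction rather than up to rounding error.

```python
def covariance_matrix(rows):
    """Sample covariance of a list of equal-length rows (needs n >= 2)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):  # upper triangle only, including the diagonal
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            cov[i][j] = s / (n - 1)
            cov[j][i] = cov[i][j]  # mirror into the lower triangle
    return cov
```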
*	[SPARK-9718] [ML] linear regression training summary all columns	Holden Karau	2015-10-08	1	-0/+13
\| \| \| \| \| \| \| \|	LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful. Author: Holden Karau <holden@pigscanfly.ca> Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
*	[SPARK-10064] [ML] Parallelize decision tree bin split calculations	Nathan Howell	2015-10-07	2	-8/+2
\| \| \| \| \| \| \| \| \| \|	Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation. With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi terabyte dataset in less than a minute instead of multiple hours. Author: Nathan Howell <nhowell@godaddy.com> Closes #8246 from NathanHowell/SPARK-10064.
*	[SPARK-10738] [ML] Refactoring `Instance` out from LOR and LIR, and also ↵	DB Tsai	2015-10-07	2	-0/+2
\| \| \| \| \| \| \| \| \| \|	cleaning up some code Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code. Author: DB Tsai <dbt@netflix.com> Closes #8853 from dbtsai/refactoring.
*	[SPARK-9841] [ML] Make clear public	Holden Karau	2015-10-07	1	-0/+5
\| \| \| \| \| \| \| \|	It is currently impossible to clear Param values once set. It would be helpful to be able to. Author: Holden Karau <holden@pigscanfly.ca> Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.
*	[SPARK-6530] [ML] Add chi-square selector for ml package	Xusen Yin	2015-10-02	1	-0/+61
\| \| \| \| \| \| \| \|	See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530). Author: Xusen Yin <yinxusen@gmail.com> Closes #5742 from yinxusen/SPARK-6530.
*	[SPARK-5890] [ML] Add feature discretizer	Xusen Yin	2015-10-02	1	-0/+98
\| \| \| \| \| \| \| \| \| \|	JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890). I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly. Author: Xusen Yin <yinxusen@gmail.com> Closes #5779 from yinxusen/SPARK-5890.
*	[SPARK-9681] [ML] Support R feature interactions in RFormula	Eric Liang	2015-09-25	2	-5/+160
\| \| \| \| \| \| \| \| \| \| \| \|	This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`). To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms. mengxr Author: Eric Liang <ekl@databricks.com> Closes #8830 from ericl/interaction-2.
*	[SPARK-10763] [ML] [JAVA] [TEST] Update Java MLLIB/ML tests to use ↵	Holden Karau	2015-09-23	10	-39/+42
\| \| \| \| \| \| \| \| \| \|	simplified dataframe construction As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use it. Author: Holden Karau <holden@pigscanfly.ca> Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
*	[SPARK-10686] [ML] Add quantilesCol to AFTSurvivalRegression	Yanbo Liang	2015-09-23	1	-25/+49
\| \| \| \| \| \| \| \|	By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector). Author: Yanbo Liang <ybliang8@gmail.com> Closes #8836 from yanboliang/spark-10686.
*	[SPARK-9715] [ML] Store numFeatures in all ML PredictionModel types	sethah	2015-09-23	11	-15/+38
\| \| \| \| \| \| \| \|	All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility. Author: sethah <seth.hendrickson16@gmail.com> Closes #8675 from sethah/SPARK-9715.
*	[SPARK-10706] [MLLIB] Add java wrapper for random vector rdd	Meihua Wu	2015-09-22	1	-0/+17
\| \| \| \| \| \| \| \| \| \|	Add java wrapper for random vector rdd holdenk srowen Author: Meihua Wu <meihuawu@umich.edu> Closes #8841 from rotationsymmetry/SPARK-10706.
*	[SPARK-3147] [MLLIB] [STREAMING] Streaming 2-sample statistical significance ↵	Feynman Liang	2015-09-21	1	-0/+243
\| \| \| \| \| \| \| \| \| \| \|	testing Implementation of significance testing using Streaming API. Author: Feynman Liang <fliang@databricks.com> Author: Feynman Liang <feynman.liang@gmail.com> Closes #4716 from feynmanliang/ab_testing.
*	[SPARK-9642] [ML] LinearRegression should support weighted data	Meihua Wu	2015-09-21	1	-0/+88
\| \| \| \| \| \| \| \| \| \|	In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting that accounts for over- or under-sampling. Work in progress. Author: Meihua Wu <meihuawu@umich.edu> Closes #8631 from rotationsymmetry/SPARK-9642.
*	[SPARK-10626] [MLLIB] create java friendly method for random rdd	Holden Karau	2015-09-21	1	-0/+30
\| \| \| \| \| \| \| \|	SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method. Author: Holden Karau <holden@pigscanfly.ca> Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.
*	[SPARK-8518] [ML] Log-linear models for survival analysis	Yanbo Liang	2015-09-17	1	-0/+311
\| \| \| \| \| \| \| \| \|	[Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is the most commonly used and easy-to-parallelize method of survival analysis for censored survival data. It is a log-linear model based on the Weibull distribution of the survival time. Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the prediction. There are different kinds of model prediction; I have selected only the type ```response```, which is the default in R. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8611 from yanboliang/spark-8518.
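For reference, the textbook form of the Weibull AFT model is written out below. This is general background rather than anything taken from the PR, and the symbols ($T_i$ for survival time, $x_i$ for covariates, $\beta$ for coefficients, $\sigma$ for the scale) are chosen here for illustration.

```latex
\log T_i = \beta^{\top} x_i + \sigma\,\varepsilon_i,
\qquad
f(\varepsilon) = \exp\!\left(\varepsilon - e^{\varepsilon}\right)
```

The error term $\varepsilon_i$ follows the standard extreme value (minimum) distribution with density $f$, which is what makes $T_i$ Weibull-distributed and the model linear on the log scale.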
*	[SPARK-9698] [ML] Add RInteraction transformer for supporting R-style ↵	Eric Liang	2015-09-17	1	-0/+165
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	feature interactions This is a pre-req for supporting the ":" operator in the RFormula feature transformer. Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit mengxr Author: Eric Liang <ekl@databricks.com> Closes #7987 from ericl/interaction.
*	[SPARK-7685] [ML] Apply weights to different samples in Logistic Regression	DB Tsai	2015-09-15	2	-9/+120
\| \| \| \| \| \| \| \| \| \| \|	In fraud detection datasets, almost all the samples are negative while only a couple of them are positive. This type of highly imbalanced data will bias the models toward negative, resulting in poor performance. In python-scikit, they provide a correction allowing users to over-/undersample the samples of each class according to the given weights. In auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done in a more efficient way by multiplying the weights into the loss and gradient instead of doing actual over/undersampling in the training dataset, which is very expensive. http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html On the other hand, some of the training data may be more important, like the training samples from tenured users, while the training samples from new users may be less important. We should be able to provide another "weight: Double" field in the LabeledPoint to weight them differently in the learning algorithm. Author: DB Tsai <dbt@netflix.com> Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com> Closes #7884 from dbtsai/SPARK-7685.
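A rough sketch of the two ideas in this message, in plain Python rather than Spark code. `balanced_class_weights` uses one common inverse-frequency heuristic (samples divided by classes times class count), and `weighted_log_loss_and_gradient` shows a weight multiplying each sample's loss and gradient term. Both function names are hypothetical, and the model has no intercept or regularization for brevity.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def weighted_log_loss_and_gradient(coef, features, labels, weights):
    """Weighted logistic loss and gradient for 0/1 labels."""
    loss = 0.0
    grad = [0.0] * len(coef)
    for x, y, w in zip(features, labels, weights):
        margin = sum(c * v for c, v in zip(coef, x))
        p = 1.0 / (1.0 + math.exp(-margin))
        # The weight scales this sample's contribution to the loss ...
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        for j, v in enumerate(x):
            # ... and to the gradient, with no resampling of the data.
            grad[j] += w * (p - y) * v
    return loss, grad
```

Giving a sample weight 2 yields the same loss and gradient as listing it twice, which is why the weighted form can replace physical oversampling.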
*	[SPARK-10491] [MLLIB] move RowMatrix.dspr to BLAS	Yuhao Yang	2015-09-15	1	-0/+25
\| \| \| \| \| \| \| \| \| \| \| \|	jira: https://issues.apache.org/jira/browse/SPARK-10491 We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`. Let me know if a new unit test is needed. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #8663 from hhbyyh/movedspr.
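For readers unfamiliar with the routine, `dspr` is the BLAS symmetric packed rank-1 update, A := alpha * x * x^T + A. The sketch below is a plain-Python illustration, not the Spark implementation: it assumes the standard column-major packed upper-triangular layout, and it uses a dict of index to value as a stand-in for a sparse vector.

```python
def dspr(alpha, x, packed_upper):
    """In place: A += alpha * x * x^T, with A stored as a packed upper triangle.

    Element A[i][j] with i <= j lives at index i + j * (j + 1) // 2.
    x is a dict {index: value}; absent indices are treated as zero.
    """
    for j, xj in x.items():
        col_start = j * (j + 1) // 2
        for i, xi in x.items():
            if i <= j:  # only the upper triangle is stored
                packed_upper[col_start + i] += alpha * xi * xj
    return packed_upper
```

With a sparse `x`, only the entries touching nonzero indices are visited, which is the point of supporting sparse vectors in this update.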
*	[SPARK-10573] [ML] IndexToString output schema should be StringType	Nick Pritchard	2015-09-14	1	-0/+8
\| \| \| \| \| \| \| \|	Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata. Author: Nick Pritchard <nicholas.pritchard@falkonry.com> Closes #8751 from pnpritchard/SPARK-10573.
*	[SPARK-10537] [ML] document LIBSVM source options in public API doc and some ↵	Xiangrui Meng	2015-09-11	2	-16/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	minor improvements We should document options in public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation to package doc. However, since then there exists no public class under `source.libsvm`, the Java package doc doesn't show up in the generated html file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc to `DefaultSource` instead. There are several minor updates in this PR: 1. Do `vectorType == "sparse"` only once. 2. Update `hashCode` and `equals`. 3. Remove inherited doc. 4. Delete temp dir in `afterAll`. Lewuathe Author: Xiangrui Meng <meng@databricks.com> Closes #8699 from mengxr/SPARK-10537.
*	[SPARK-10117] [MLLIB] Implement SQL data source API for reading LIBSVM data	lewuathe	2015-09-09	2	-0/+156
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It is convenient to implement the data source API for the LIBSVM format to have better integration with DataFrames and the ML pipeline API. Two options are implemented. * `numFeatures`: Specify the dimension of the features vector * `featuresType`: Specify the type of the output vector. `sparse` is the default. Author: lewuathe <lewuathe@me.com> Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits: 986999d [lewuathe] Change unit test phrase 11d513f [lewuathe] Fix some reviews 21600a4 [lewuathe] Merge branch 'master' into SPARK-10117 9ce63c7 [lewuathe] Rewrite service loader file 1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117 ba3657c [lewuathe] Merge branch 'master' into SPARK-10117 0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF 4f40891 [lewuathe] Improve test suites 5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117 8660d0e [lewuathe] Fix Java unit test b56a948 [lewuathe] Merge branch 'master' into SPARK-10117 2c12894 [lewuathe] Remove unnecessary tag 7d693c2 [lewuathe] Resolv conflict 62010af [lewuathe] Merge branch 'master' into SPARK-10117 a97ee97 [lewuathe] Fix some points aef9564 [lewuathe] Fix 70ee4dd [lewuathe] Add Java test 3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data 40d3027 [lewuathe] Add Java test 7056d4a [lewuathe] Merge branch 'master' into SPARK-10117 99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
*	[SPARK-10464] [MLLIB] Add WeibullGenerator for RandomDataGenerator	Yanbo Liang	2015-09-08	1	-1/+15
\| \| \| \| \| \| \| \| \|	Add WeibullGenerator for RandomDataGenerator. #8611 needs WeibullGenerator to generate random data based on the Weibull distribution. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8622 from yanboliang/spark-10464.
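One way such a generator can work is inverse-CDF sampling, sketched below in plain Python. The class and method names are hypothetical, and the PR may well delegate to an existing library distribution instead; this only illustrates the technique.

```python
import math
import random


class WeibullSampler:
    """Draws Weibull(shape, scale) samples by inverting the CDF."""

    def __init__(self, shape, scale, seed=None):
        self.shape = shape
        self.scale = scale
        self._rng = random.Random(seed)  # seeded for reproducible streams

    def next_value(self):
        u = self._rng.random()  # uniform in [0, 1)
        # CDF: F(x) = 1 - exp(-(x / scale) ** shape); solve F(x) = u for x.
        return self.scale * (-math.log(1.0 - u)) ** (1.0 / self.shape)
```

Seeding matters for a test-data generator: two samplers built with the same seed produce the same sequence.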
*	[SPARK-9834] [MLLIB] implement weighted least squares via normal equation	Xiangrui Meng	2015-09-08	1	-0/+133
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet. There are a couple of TODOs that can be addressed in future PRs: * consolidate summary statistics aggregators * move `dspr` to `BLAS` * etc It would be nice to have this merged first because it blocks a couple of other features. dbtsai Author: Xiangrui Meng <meng@databricks.com> Closes #8588 from mengxr/SPARK-9834.
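The normal equation approach can be sketched as follows: accumulate X^T W X and X^T W y in a single pass over the data, then solve the small d-by-d system locally. This plain-Python version is an illustration under simplifying assumptions (no regularization, no standardization, and Gaussian elimination swapped in as the solver); `weighted_least_squares` is a hypothetical name.

```python
def weighted_least_squares(features, labels, weights):
    """Solve (X^T W X) beta = X^T W y by Gaussian elimination."""
    d = len(features[0])
    # One pass over the data builds the d x d matrix and the d-vector.
    ata = [[0.0] * d for _ in range(d)]
    atb = [0.0] * d
    for x, y, w in zip(features, labels, weights):
        for i in range(d):
            atb[i] += w * x[i] * y
            for j in range(d):
                ata[i][j] += w * x[i] * x[j]
    # Augmented matrix [A | b], then elimination with partial pivoting.
    aug = [row + [rhs] for row, rhs in zip(ata, atb)]
    for col in range(d):
        pivot = max(range(col, d), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, d):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, d + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    beta = [0.0] * d
    for i in reversed(range(d)):
        tail = sum(aug[i][j] * beta[j] for j in range(i + 1, d))
        beta[i] = (aug[i][d] - tail) / aug[i][i]
    return beta
```

Because the accumulated statistics have a size that depends only on the number of features, the expensive pass parallelizes over rows while the solve stays cheap, which is what makes this approach attractive when the feature count is modest.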
*	[SPARK-10470] [ML] ml.IsotonicRegressionModel.copy should set parent	Yanbo Liang	2015-09-08	1	-0/+5
\| \| \| \| \| \| \| \| \|	A copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set the parent. This fixes it and adds a test case. Author: Yanbo Liang <ybliang8@gmail.com> Closes #8637 from yanboliang/spark-10470.
*	[SPARK-10013] [ML] [JAVA] [TEST] remove java assert from java unit tests	Holden Karau	2015-09-05	4	-52/+54
\| \| \| \| \| \| \| \|	From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests. Author: Holden Karau <holden@pigscanfly.ca> Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
*	[SPARK-9723] [ML] params getordefault should throw more useful error	Holden Karau	2015-09-02	2	-7/+13
\| \| \| \| \| \| \| \|	Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup. Author: Holden Karau <holden@pigscanfly.ca> Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
*	[SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover	Holden Karau	2015-09-01	1	-1/+1
\| \| \| \| \| \| \| \|	Add a python API for the Stop Words Remover. Author: Holden Karau <holden@pigscanfly.ca> Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
*	[SPARK-10353] [MLLIB] BLAS gemm not scaling when beta = 0.0 for some subset ↵	Burak Yavuz	2015-08-30	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	of matrix multiplications mengxr jkbradley rxin It would be great if this fix made it into RC3! Author: Burak Yavuz <brkyvz@gmail.com> Closes #8525 from brkyvz/blas-scaling.
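The message does not say which code paths were affected, so the following is only a sketch of the contract at stake: `gemm` computes C := alpha * A * B + beta * C, and with beta = 0 the previous contents of C must not influence the result. The function below is a plain-Python illustration over nested lists, not the Spark code.

```python
def gemm(alpha, a, b, beta, c):
    """In place: C := alpha * A * B + beta * C for row-major nested lists."""
    rows, inner, cols = len(a), len(b), len(b[0])
    for i in range(rows):
        for j in range(cols):
            dot = sum(a[i][k] * b[k][j] for k in range(inner))
            # beta == 0 means "ignore the old C", so stale or NaN values
            # already in C must not leak into the result.
            old = 0.0 if beta == 0.0 else beta * c[i][j]
            c[i][j] = alpha * dot + old
    return c
```

Treating beta = 0 as an explicit overwrite matters because multiplying a stale `NaN` by zero still yields `NaN`.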
*	[SPARK-9680] [MLLIB] [DOC] StopWordsRemovers user guide and Java ↵	Feynman Liang	2015-08-27	1	-0/+72
\| \| \| \| \| \| \| \| \| \| \| \| \|	compatibility test * Adds user guide for ml.feature.StopWordsRemovers, ran code examples on my machine * Cleans up scaladocs for public methods * Adds test for Java compatibility * Follow up Python user guide code example is tracked by SPARK-10249 Author: Feynman Liang <fliang@databricks.com> Closes #8436 from feynmanliang/SPARK-10230.
*	[SPARK-10257] [MLLIB] Removes Guava from all spark.mllib Java tests	Feynman Liang	2015-08-27	14	-74/+71
\| \| \| \| \| \| \| \| \| \| \| \|	* Replaces instances of `Lists.newArrayList` with `Arrays.asList` * Uses `commons.lang.StringUtils` in place of `com.google.collections.Strings` * Uses the `List` interface in place of `ArrayList` implementations This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests. Author: Feynman Liang <fliang@databricks.com> Closes #8451 from feynmanliang/SPARK-10257.
*	[SPARK-9613] [HOTFIX] Fix usage of JavaConverters removed in Scala 2.11	Jacek Laskowski	2015-08-27	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7 and hence the build fails with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases. Build for 2.10: ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install and 2.11: ./dev/change-scala-version.sh 2.11 ./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install Author: Jacek Laskowski <jacek@japila.pl> Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
*	[SPARK-10256] [ML] Removes guava dependency from spark.ml.classification ↵	Feynman Liang	2015-08-27	1	-2/+2
\| \| \| \| \| \| \| \|	JavaTests Author: Feynman Liang <fliang@databricks.com> Closes #8447 from feynmanliang/SPARK-10256.
*	[SPARK-10255] [ML] Removes Guava dependencies from spark.ml.param JavaTests	Feynman Liang	2015-08-27	2	-6/+6
\| \| \| \| \| \|	Author: Feynman Liang <fliang@databricks.com> Closes #8446 from feynmanliang/SPARK-10255.
*	[SPARK-10254] [ML] Removes Guava dependencies in spark.ml.feature JavaTests	Feynman Liang	2015-08-27	11	-30/+35
\| \| \| \| \| \| \| \| \|	* Replaces `com.google.common` dependencies with `java.util.Arrays` * Small clean up in `JavaNormalizerSuite` Author: Feynman Liang <fliang@databricks.com> Closes #8445 from feynmanliang/SPARK-10254.
*	[SPARK-9888] [MLLIB] User guide for new LDA features	Feynman Liang	2015-08-25	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	* Adds two new sections to LDA's user guide; one for each optimizer/model * Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization) * Cleans up a TODO and sets a default parameter in LDA code jkbradley hhbyyh Author: Feynman Liang <fliang@databricks.com> Closes #8254 from feynmanliang/SPARK-9888.
*	[SPARK-10230] [MLLIB] Rename optimizeAlpha to optimizeDocConcentration	Feynman Liang	2015-08-25	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770) CC jkbradley Author: Feynman Liang <fliang@databricks.com> Closes #8422 from feynmanliang/SPARK-10230.
*	[SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses ↵	Sean Owen	2015-08-25	5	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \|	to JavaConverters Replace `JavaConversions` implicits with `JavaConverters` Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
*	[SPARK-10164] [MLLIB] Fixed GMM distributed decomposition bug	Joseph K. Bradley	2015-08-23	1	-2/+20
\| \| \| \| \| \| \| \| \| \| \| \|	GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests. This PR adds a unit test which checks this. It failed previously but works with this fix. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8370 from jkbradley/gmm-fix.
*	[SPARK-9893] User guide with Java test suite for VectorSlicer	Xusen Yin	2015-08-21	1	-0/+85
\| \| \| \| \| \| \| \| \| \|	Add user guide for `VectorSlicer`, with Java test suite and Python version VectorSlicer. Note that Python version does not support selecting by names now. Author: Xusen Yin <yinxusen@gmail.com> Closes #8267 from yinxusen/SPARK-9893.
*	[SPARK-9245] [MLLIB] LDA topic assignments	Joseph K. Bradley	2015-08-20	2	-2/+26
\| \| \| \| \| \| \| \| \| \|	For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token. CC: rotationsymmetry mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8329 from jkbradley/lda-topic-assignments.
*	[SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add ↵	Xiangrui Meng	2015-08-20	1	-0/+74
\| \| \| \| \| \| \| \| \| \|	Java test suite Otherwise, setters do not return self type. jkbradley avulanov Author: Xiangrui Meng <meng@databricks.com> Closes #8342 from mengxr/SPARK-10138.
*	[SPARK-10097] Adds `shouldMaximize` flag to `ml.evaluation.Evaluator`	Feynman Liang	2015-08-19	3	-2/+6
\| \| \| \| \| \| \| \| \| \| \| \| \|	Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user. This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized. CC jkbradley Author: Feynman Liang <fliang@databricks.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #8290 from feynmanliang/SPARK-10097.
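The effect on model selection can be sketched in a few lines of plain Python. `select_best` is a hypothetical helper, not part of the Spark API: the flag picks between `max` and `min`, so a metric such as RMSE is reported unchanged instead of being negated.

```python
def select_best(candidates, metric_values, is_larger_better):
    """Return (candidate, metric) for the best metric, without negating it."""
    pick = max if is_larger_better else min
    best_index = pick(range(len(candidates)), key=lambda i: metric_values[i])
    return candidates[best_index], metric_values[best_index]
```

For example, with error values `[0.3, 0.1, 0.2]` and the flag set to false, the second candidate is chosen and its metric is reported as `0.1` rather than `-0.1`.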
*	[SPARK-10012] [ML] Missing test case for Params#arrayLengthGt	lewuathe	2015-08-18	1	-0/+3
\| \| \| \| \| \| \| \|	Currently there is no test case for `Params#arrayLengthGt`. Author: lewuathe <lewuathe@me.com> Closes #8223 from Lewuathe/SPARK-10012.
*	[SPARK-9900] [MLLIB] User guide for Association Rules	Feynman Liang	2015-08-18	1	-1/+1
\| \| \| \| \| \| \| \|	Updates FPM user guide to include Association Rules. Author: Feynman Liang <fliang@databricks.com> Closes #8207 from feynmanliang/SPARK-9900-arules.
*	[SPARK-9028] [ML] Add CountVectorizer as an estimator to generate ↵	Yuhao Yang	2015-08-18	2	-73/+167
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	CountVectorizerModel jira: https://issues.apache.org/jira/browse/SPARK-9028 Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency. I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in SKlearn. Author: Yuhao Yang <hhbyyh@gmail.com> Author: Joseph K. Bradley <joseph@databricks.com> Closes #7388 from hhbyyh/cvEstimator.
*	[SPARK-9981] [ML] Made labels public for StringIndexerModel	Joseph K. Bradley	2015-08-14	1	-0/+18
\| \| \| \| \| \| \| \| \| \|	Also added unit test for integration between StringIndexerModel and IndexToString CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #8211 from jkbradley/stridx-labels.
*	[SPARK-8744] [ML] Add a public constructor to StringIndexer	Holden Karau	2015-08-14	1	-0/+2
\| \| \| \| \| \| \| \|	It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model. Author: Holden Karau <holden@pigscanfly.ca> Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
*	[SPARK-9956] [ML] Make trees work with one-category features	Joseph K. Bradley	2015-08-14	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical. As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing. Targeted for 1.5 and master CC: manishamde mengxr yanboliang Author: Joseph K. Bradley <joseph@databricks.com> Closes #8187 from jkbradley/tree-1cat.
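A minimal sketch of the metadata adjustment described above, in plain Python with a hypothetical helper name. It assumes the categorical features arrive as a map from feature index to number of categories, and that any feature absent from the map is handled as continuous.

```python
def effective_categorical_features(categorical_arity):
    """Keep only features that have at least two categories.

    categorical_arity maps feature index -> number of categories. Features
    missing from the returned map are treated as continuous.
    """
    return {f: k for f, k in categorical_arity.items() if k > 1}
```

A one-category feature offers no split either way; demoting it to continuous simply keeps the metadata valid so that training does not fail.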