This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf` values. This will be used in pipeline persistence. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #9090 from mengxr/SPARK-7402.
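The `NaN`/`Inf` handling described in this entry can be illustrated with a minimal Python sketch. It is not the PR's Scala code: the string tokens `"NaN"`, `"Inf"`, and `"-Inf"` and all function names are assumptions chosen for illustration. The point is only that standard JSON has no literal for non-finite numbers, so they need a special encoding to round-trip.

```python
import json
import math


def encode_double(value):
    """Return a JSON-safe value; non-finite floats become string tokens (assumed tokens)."""
    if math.isnan(value):
        return "NaN"
    if value == math.inf:
        return "Inf"
    if value == -math.inf:
        return "-Inf"
    return value


def decode_double(raw):
    """Invert encode_double: map the string tokens back to non-finite floats."""
    if raw == "NaN":
        return math.nan
    if raw == "Inf":
        return math.inf
    if raw == "-Inf":
        return -math.inf
    return float(raw)


def encode_double_array(values):
    """Serialize a list of floats to a JSON array string."""
    return json.dumps([encode_double(v) for v in values])


def decode_double_array(text):
    """Parse a JSON array string produced by encode_double_array."""
    return [decode_double(raw) for raw in json.loads(text)]
```

Encoding the special values as strings keeps the output valid JSON for any parser, at the cost of a type check on decode.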
|
|
|
|
|
|
|
|
| |
Compute upper triangular values of the covariance matrix, then copy to lower triangular values.
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #8940 from pnpritchard/SPARK-10875.
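The idea in this entry (compute only the upper triangle of the covariance matrix, then mirror it) can be sketched in plain Python as follows. This is an illustrative single-machine version under the assumption of the usual unbiased sample covariance; the function name is hypothetical and it is not the distributed implementation from the PR.

```python
def covariance(rows):
    """Sample covariance of the columns of `rows` (a list of equal-length lists)."""
    n = len(rows)
    d = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    # Compute only the upper triangle, including the diagonal.
    for i in range(d):
        for j in range(i, d):
            total = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            cov[i][j] = total / (n - 1)
    # Mirror the upper triangle into the lower triangle (the matrix is symmetric).
    for i in range(d):
        for j in range(i):
            cov[i][j] = cov[j][i]
    return cov
```

Because the covariance matrix is symmetric, this roughly halves the number of inner products that have to be computed.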
|
|
|
|
|
|
|
|
| |
LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.
|
|
|
|
|
|
|
|
|
|
| |
Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.
With large feature spaces, the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
Author: Nathan Howell <nhowell@godaddy.com>
Closes #8246 from NathanHowell/SPARK-10064.
|
|
|
|
|
|
|
|
|
|
| |
cleaning up some code
Refactoring `Instance` case class out from LOR and LIR, and also cleaning up some code.
Author: DB Tsai <dbt@netflix.com>
Closes #8853 from dbtsai/refactoring.
|
|
|
|
|
|
|
|
| |
It is currently impossible to clear Param values once set. It would be helpful to be able to clear them.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.
|
|
|
|
|
|
|
|
| |
See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530).
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5742 from yinxusen/SPARK-6530.
|
|
|
|
|
|
|
|
|
|
| |
JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).
I borrow the code of `findSplits` from `RandomForest`. I don't think it's good to call it from `RandomForest` directly.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5779 from yinxusen/SPARK-5890.
|
|
|
|
|
|
|
|
|
|
|
|
| |
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
|
|
|
|
|
|
|
|
|
|
| |
simplified dataframe construction
As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use those.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.
|
|
|
|
|
|
|
|
| |
By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8836 from yanboliang/spark-10686.
|
|
|
|
|
|
|
|
| |
All prediction models should store `numFeatures` indicating the number of features the model was trained on. Default value of -1 added for backwards compatibility.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #8675 from sethah/SPARK-9715.
|
|
|
|
|
|
|
|
|
|
| |
Add java wrapper for random vector rdd
holdenk srowen
Author: Meihua Wu <meihuawu@umich.edu>
Closes #8841 from rotationsymmetry/SPARK-10706.
|
|
|
|
|
|
|
|
|
|
|
| |
testing
Implementation of significance testing using Streaming API.
Author: Feynman Liang <fliang@databricks.com>
Author: Feynman Liang <feynman.liang@gmail.com>
Closes #4716 from feynmanliang/ab_testing.
|
|
|
|
|
|
|
|
|
|
| |
In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support weighting that accounts for over- or under-sampling.
work in progress.
Author: Meihua Wu <meihuawu@umich.edu>
Closes #8631 from rotationsymmetry/SPARK-9642.
|
|
|
|
|
|
|
|
| |
SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people that want to use custom RandomDataGenerators we should make a Java friendly method.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.
|
|
|
|
|
|
|
|
|
| |
[Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is the most commonly used and easy-to-parallelize method of survival analysis for censored survival data. It is a log-linear model based on the Weibull distribution of the survival time.
Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the prediction. There are different kinds of model prediction; I have selected only the type ```response```, which is the default in R.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8611 from yanboliang/spark-8518.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
feature interactions
This is a pre-req for supporting the ":" operator in the RFormula feature transformer.
Design doc from umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #7987 from ericl/interaction.
|
|
|
|
|
|
|
|
|
|
|
| |
In a fraud detection dataset, almost all the samples are negative while only a couple of them are positive. This kind of highly imbalanced data biases models toward the negative class, resulting in poor performance. scikit-learn provides a correction that lets users over- or undersample the samples of each class according to the given weights; in auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over- or undersampling the training dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some of the training data may be more important than the rest: for example, training samples from tenured users may matter more than training samples from new users. We should be able to provide an additional "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.
Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
Closes #7884 from dbtsai/SPARK-7685.
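The approach described in this entry, folding per-sample weights into the loss and gradient rather than resampling, can be sketched as follows. This is a plain-Python illustration, not the PR's Scala code. The weighting formula `n_samples / (n_classes * class_count)` is an assumption modeled on scikit-learn's class-frequency heuristic, and both function names are hypothetical.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weights inversely proportional to class frequency (assumed formula)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * count) for c, count in counts.items()}


def weighted_logistic_loss_and_gradient(coef, xs, labels, weights):
    """Binary logistic loss and gradient with each sample scaled by its weight."""
    loss = 0.0
    grad = [0.0] * len(coef)
    for x, y, w in zip(xs, labels, weights):
        margin = sum(c * xi for c, xi in zip(coef, x))
        p = 1.0 / (1.0 + math.exp(-margin))
        # The weight multiplies this sample's contribution to the loss ...
        loss += -w * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        # ... and, consistently, its contribution to the gradient.
        for j, xi in enumerate(x):
            grad[j] += w * (p - y) * xi
    return loss, grad
```

A sample with weight 2 contributes exactly what two identical copies would, so the optimizer sees the same objective as with oversampling without the dataset growing.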
|
|
|
|
|
|
|
|
|
|
|
|
| |
jira: https://issues.apache.org/jira/browse/SPARK-10491
We implemented dspr with sparse vector support in `RowMatrix`. This method is also used in WeightedLeastSquares and other places. It would be useful to move it to `linalg.BLAS`.
Let me know if a new unit test is needed.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #8663 from hhbyyh/movedspr.
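For readers unfamiliar with `dspr`: it is the BLAS symmetric rank-1 update `A := alpha * x * x^T + A`, with `A` held in packed triangular storage. The sketch below shows what sparse-vector support means for that update. It is an illustration under stated assumptions (upper triangle, column-major packing, sorted zero-based indices) with a hypothetical function name, not the code moved by this PR.

```python
def spr_sparse(alpha, n, indices, values, packed):
    """In place: A += alpha * x * x^T for a sparse x.

    `packed` holds the upper triangle of the n-by-n symmetric matrix A in
    column-major order, so entry (i, j) with i <= j is at i + j * (j + 1) // 2.
    `indices` must be sorted ascending and zero-based; `values` are the
    matching non-zero entries of x.
    """
    assert len(packed) == n * (n + 1) // 2
    for b in range(len(indices)):
        col = indices[b]
        col_offset = col * (col + 1) // 2
        scaled = alpha * values[b]
        # Only rows at or above the diagonal are stored, so pair this
        # non-zero with the non-zeros at earlier (smaller) indices.
        for a in range(b + 1):
            packed[col_offset + indices[a]] += scaled * values[a]
```

With `k` non-zeros the update touches about `k * (k + 1) / 2` entries instead of `n * (n + 1) / 2`, which is where the benefit for sparse input comes from.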
|
|
|
|
|
|
|
|
| |
Fixes bug where IndexToString output schema was DoubleType. Correct me if I'm wrong, but it doesn't seem like the output needs to have any "ML Attribute" metadata.
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #8751 from pnpritchard/SPARK-10573.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
minor improvements
We should document options in the public API doc. Otherwise, it is hard to find out the options without looking at the code. I tried to make `DefaultSource` private and put the documentation in the package doc. However, there would then be no public class under `source.libsvm`, so the Java package doc would not show up in the generated HTML file (http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4492654). So I put the doc in `DefaultSource` instead. There are several minor updates in this PR:
1. Do `vectorType == "sparse"` only once.
2. Update `hashCode` and `equals`.
3. Remove inherited doc.
4. Delete temp dir in `afterAll`.
Lewuathe
Author: Xiangrui Meng <meng@databricks.com>
Closes #8699 from mengxr/SPARK-10537.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is convenient to implement the data source API for the LIBSVM format to get better integration with DataFrames and the ML pipeline API.
Two options are implemented.
* `numFeatures`: Specify the dimension of the features vector
* `featuresType`: Specify the type of the output vector. `sparse` is the default.
Author: lewuathe <lewuathe@me.com>
Closes #8537 from Lewuathe/SPARK-10117 and squashes the following commits:
986999d [lewuathe] Change unit test phrase
11d513f [lewuathe] Fix some reviews
21600a4 [lewuathe] Merge branch 'master' into SPARK-10117
9ce63c7 [lewuathe] Rewrite service loader file
1fdd2df [lewuathe] Merge branch 'SPARK-10117' of github.com:Lewuathe/spark into SPARK-10117
ba3657c [lewuathe] Merge branch 'master' into SPARK-10117
0ea1c1c [lewuathe] LibSVMRelation is registered into META-INF
4f40891 [lewuathe] Improve test suites
5ab62ab [lewuathe] Merge branch 'master' into SPARK-10117
8660d0e [lewuathe] Fix Java unit test
b56a948 [lewuathe] Merge branch 'master' into SPARK-10117
2c12894 [lewuathe] Remove unnecessary tag
7d693c2 [lewuathe] Resolv conflict
62010af [lewuathe] Merge branch 'master' into SPARK-10117
a97ee97 [lewuathe] Fix some points
aef9564 [lewuathe] Fix
70ee4dd [lewuathe] Add Java test
3fd8dce [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
40d3027 [lewuathe] Add Java test
7056d4a [lewuathe] Merge branch 'master' into SPARK-10117
99accaa [lewuathe] [SPARK-10117] Implement SQL data source API for reading LIBSVM data
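To make the two options of the LIBSVM data source entry above concrete, here is a minimal Python sketch of parsing one LIBSVM line. It is not the Spark reader: the function name, the `(size, indices, values)` tuple used for the sparse case, and the `"dense"` value are assumptions for illustration. The LIBSVM text format itself is `label index:value ...` with one-based indices.

```python
def parse_libsvm_line(line, num_features, features_type="sparse"):
    """Parse 'label idx:val idx:val ...' into (label, feature vector)."""
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        idx, val = item.split(":")
        indices.append(int(idx) - 1)  # LIBSVM indices are one-based
        values.append(float(val))
    if features_type == "sparse":
        # Sparse form: vector size plus the non-zero positions and values.
        return label, (num_features, indices, values)
    # Dense form: materialize every position, zeros included.
    dense = [0.0] * num_features
    for i, v in zip(indices, values):
        dense[i] = v
    return label, dense
```

The dimension has to be supplied (or inferred from a full pass over the data) because a single line only reveals the largest index it happens to contain.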
|
|
|
|
|
|
|
|
|
| |
Add WeibullGenerator for RandomDataGenerator.
#8611 needs WeibullGenerator to generate random data based on the Weibull distribution.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8622 from yanboliang/spark-10464.
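One common way to draw Weibull-distributed values is inverse transform sampling, sketched below in Python. This is only an illustration of the distribution; the entry does not say how `WeibullGenerator` samples internally, and the function name and parameter names here are hypothetical.

```python
import math
import random


def weibull_sample(shape, scale, rng):
    """Draw one Weibull(shape, scale) value by inverting the CDF.

    The CDF is F(x) = 1 - exp(-(x / scale) ** shape), so for a uniform
    u in [0, 1) the inverse is x = scale * (-ln(1 - u)) ** (1 / shape).
    """
    u = rng.random()
    return scale * (-math.log(1.0 - u)) ** (1.0 / shape)
```

With `shape = 1.0` the Weibull distribution reduces to an exponential distribution with mean `scale`, which gives a quick sanity check for generated data.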
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The goal of this PR is to have a weighted least squares implementation that takes the normal equation approach, and hence to be able to provide R-like summary statistics and support IRLS (used by GLMs). The tests match R's lm and glmnet.
There are a couple of TODOs that can be addressed in future PRs:
* consolidate summary statistics aggregators
* move `dspr` to `BLAS`
* etc
It would be nice to have this merged first because it blocks a couple of other features.
dbtsai
Author: Xiangrui Meng <meng@databricks.com>
Closes #8588 from mengxr/SPARK-9834.
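The normal equation approach mentioned in this entry solves `(X^T W X) beta = X^T W y` directly. The following plain-Python sketch shows the idea on small inputs. It is an assumption-laden illustration (dense Gaussian elimination, an always-on intercept, no regularization, hypothetical function names), not the PR's implementation.

```python
def solve(a, b):
    """Solve a * x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def weighted_least_squares(xs, ys, ws):
    """Return [intercept, coef_1, ...] minimizing sum_i w_i * (y_i - x_i . beta)^2."""
    d = len(xs[0]) + 1  # one extra column for the intercept
    xtwx = [[0.0] * d for _ in range(d)]
    xtwy = [0.0] * d
    # One pass over the data accumulates X^T W X and X^T W y.
    for x, y, w in zip(xs, ys, ws):
        row = [1.0] + list(x)
        for i in range(d):
            xtwy[i] += w * row[i] * y
            for j in range(d):
                xtwx[i][j] += w * row[i] * row[j]
    return solve(xtwx, xtwy)
```

Only the `d`-by-`d` matrix and the length-`d` vector need to be aggregated, so the data is read once; this also leaves the matrix available afterwards for summary statistics.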
|
|
|
|
|
|
|
|
|
| |
Copied model must have the same parent, but ml.IsotonicRegressionModel.copy did not set parent.
This PR fixes that and adds a test case.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8637 from yanboliang/spark-10470.
|
|
|
|
|
|
|
|
| |
From Jira: We should use assertTrue, etc. instead to make sure the asserts are not ignored in tests.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8607 from holdenk/SPARK-10013-remove-java-assert-from-java-unit-tests.
|
|
|
|
|
|
|
|
| |
Params.getOrDefault should throw a more meaningful exception than what you get from a bad key lookup.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8567 from holdenk/SPARK-9723-params-getordefault-should-throw-more-useful-error.
|
|
|
|
|
|
|
|
| |
Add a python API for the Stop Words Remover.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
|
|
|
|
|
|
|
|
|
|
|
|
| |
of matrix multiplications
mengxr jkbradley rxin
It would be great if this fix made it into RC3!
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8525 from brkyvz/blas-scaling.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
compatibility test
* Adds user guide for ml.feature.StopWordsRemover; ran code examples on my machine
* Cleans up scaladocs for public methods
* Adds test for Java compatibility
* Follow up Python user guide code example is tracked by SPARK-10249
Author: Feynman Liang <fliang@databricks.com>
Closes #8436 from feynmanliang/SPARK-10230.
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Replaces instances of `Lists.newArrayList` with `Arrays.asList`
* Uses `commons.lang.StringUtils` instead of `com.google.collections.Strings`
* Uses the `List` interface instead of `ArrayList` implementations
This PR along with #8445 #8446 #8447 completely removes all `com.google.collections.Lists` dependencies within mllib's Java tests.
Author: Feynman Liang <fliang@databricks.com>
Closes #8451 from feynmanliang/SPARK-10257.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Fix for [JavaConverters.asJavaListConverter](http://www.scala-lang.org/api/2.10.5/index.html#scala.collection.JavaConverters$) being removed in 2.11.7, which causes the build to fail with the 2.11 profile enabled. Tested with the default 2.10 and 2.11 profiles. BUILD SUCCESS in both cases.
Build for 2.10:
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -DskipTests clean install
and 2.11:
./dev/change-scala-version.sh 2.11
./build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.7.1 -Dscala-2.11 -DskipTests clean install
Author: Jacek Laskowski <jacek@japila.pl>
Closes #8479 from jaceklaskowski/SPARK-9613-hotfix.
|
|
|
|
|
|
|
|
| |
JavaTests
Author: Feynman Liang <fliang@databricks.com>
Closes #8447 from feynmanliang/SPARK-10256.
|
|
|
|
|
|
| |
Author: Feynman Liang <fliang@databricks.com>
Closes #8446 from feynmanliang/SPARK-10255.
|
|
|
|
|
|
|
|
|
| |
* Replaces `com.google.common` dependencies with `java.util.Arrays`
* Small clean up in `JavaNormalizerSuite`
Author: Feynman Liang <fliang@databricks.com>
Closes #8445 from feynmanliang/SPARK-10254.
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Adds two new sections to LDA's user guide; one for each optimizer/model
* Documents new features added to LDA (e.g. topXXXperXXX, asymmetric priors, hyperparameter optimization)
* Cleans up a TODO and sets a default parameter in LDA code
jkbradley hhbyyh
Author: Feynman Liang <fliang@databricks.com>
Closes #8254 from feynmanliang/SPARK-9888.
|
|
|
|
|
|
|
|
|
|
| |
See [discussion](https://github.com/apache/spark/pull/8254#discussion_r37837770)
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Closes #8422 from feynmanliang/SPARK-10230.
|
|
|
|
|
|
|
|
|
|
|
|
| |
to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`
Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.
Author: Sean Owen <sowen@cloudera.com>
Closes #8033 from srowen/SPARK-9613.
|
|
|
|
|
|
|
|
|
|
|
|
| |
GaussianMixture now distributes matrix decompositions for certain problem sizes. Distributed computation actually fails, but this was not tested in unit tests.
This PR adds a unit test which checks this. It failed previously but works with this fix.
CC: mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8370 from jkbradley/gmm-fix.
|
|
|
|
|
|
|
|
|
|
| |
Add a user guide for `VectorSlicer`, along with a Java test suite and a Python version of `VectorSlicer`.
Note that the Python version does not yet support selecting by name.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #8267 from yinxusen/SPARK-9893.
|
|
|
|
|
|
|
|
|
|
| |
For each (document, term) pair, return top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term, rather than per token.
CC: rotationsymmetry mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8329 from jkbradley/lda-topic-assignments.
|
|
|
|
|
|
|
|
|
|
| |
Java test suite
Otherwise, setters do not return self type. jkbradley avulanov
Author: Xiangrui Meng <meng@databricks.com>
Closes #8342 from mengxr/SPARK-10138.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.
This PR adds an `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.
CC jkbradley
Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8290 from feynmanliang/SPARK-10097.
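The effect of such a flag on model selection can be shown with a tiny Python sketch. The function name and signature are hypothetical; it only illustrates how a caller such as a cross-validator might pick the best candidate without negating metrics.

```python
def select_best(metrics, is_larger_better):
    """Return the index of the best metric, maximizing or minimizing as instructed."""
    pick = max if is_larger_better else min
    return pick(range(len(metrics)), key=lambda i: metrics[i])
```

For an accuracy-like metric the caller passes `True`; for an error metric such as RMSE it passes `False`, and the reported values stay positive.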
|
|
|
|
|
|
|
|
| |
Currently there is no test case for `Params#arrayLengthGt`.
Author: lewuathe <lewuathe@me.com>
Closes #8223 from Lewuathe/SPARK-10012.
|
|
|
|
|
|
|
|
| |
Updates FPM user guide to include Association Rules.
Author: Feynman Liang <fliang@databricks.com>
Closes #8207 from feynmanliang/SPARK-9900-arules.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
CountVectorizerModel
jira: https://issues.apache.org/jira/browse/SPARK-9028
Add an estimator for CountVectorizerModel. The estimator will extract a vocabulary from document collections according to the term frequency.
I changed the meaning of minCount to be a filter across the corpus. This aligns with Word2Vec and the similar parameter in scikit-learn.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #7388 from hhbyyh/cvEstimator.
|
|
|
|
|
|
|
|
|
|
| |
Also added unit test for integration between StringIndexerModel and IndexToString
CC: holdenk We realized we should have left in your unit test (to catch the issue with removing the inverse() method), so this adds it back. mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8211 from jkbradley/stridx-labels.
|
|
|
|
|
|
|
|
| |
It would be helpful to allow users to pass a pre-computed index to create an indexer, rather than always going through StringIndexer to create the model.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #7267 from holdenk/SPARK-8744-StringIndexerModel-should-have-public-constructor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This modifies DecisionTreeMetadata construction to treat 1-category features as continuous, so that trees do not fail with such features. It is important for the pipelines API, where VectorIndexer can automatically categorize certain features as categorical.
As stated in the JIRA, this is a temp fix which we can improve upon later by automatically filtering out those features. That will take longer, though, since it will require careful indexing.
Targeted for 1.5 and master
CC: manishamde mengxr yanboliang
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8187 from jkbradley/tree-1cat.
|