| Commit message | Author |

This fixes some compile-time warnings.
Author: Xiangrui Meng <meng@databricks.com>
Closes #9319 from mengxr/mllib-compile-warn-20151027.

returns incorrect answer in some cases
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
Supersedes https://github.com/apache/spark/pull/9293
Author: Sean Owen <sowen@cloudera.com>
Closes #9309 from srowen/SPARK-11302.2.

Add columnSimilarities to IndexedRowMatrix by delegating to functionality already in RowMatrix.
With a test.
Author: Reza Zadeh <reza@databricks.com>
Closes #8792 from rezazadeh/colsims.

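The functionality being delegated to can be sketched in plain Python: cosine similarity between every pair of matrix columns, accumulated in one pass over the rows. The function name and return shape here are illustrative, not the actual RowMatrix API.

```python
import math

def column_similarities(rows):
    """rows: list of equal-length row vectors; returns {(i, j): cosine} for i < j."""
    n = len(rows[0])
    dot = [[0.0] * n for _ in range(n)]   # pairwise column dot products
    norm = [0.0] * n                      # squared column norms
    for row in rows:
        for i in range(n):
            norm[i] += row[i] * row[i]
            for j in range(i + 1, n):
                dot[i][j] += row[i] * row[j]
    return {(i, j): dot[i][j] / math.sqrt(norm[i] * norm[j])
            for i in range(n) for j in range(i + 1, n)
            if norm[i] > 0 and norm[j] > 0}

# Columns are (1,0), (1,1), (0,1): only the upper triangle is returned.
sims = column_similarities([[1.0, 1.0, 0.0], [0.0, 1.0, 1.0]])
```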
Remove "Experimental" from .mllib code that has been around since 1.4.0 or earlier
Author: Sean Owen <sowen@cloudera.com>
Closes #9169 from srowen/SPARK-11184.

This is a PR for Parquet-based model import/export.
* Added save/load for ChiSqSelectorModel
* Updated the test suite ChiSqSelectorSuite
Author: Jayant Shekar <jayant@user-MBPMBA-3.local>
Closes #6785 from jayantshekhar/SPARK-6723.

package
Author: Reynold Xin <rxin@databricks.com>
Closes #9239 from rxin/types-private.

* `>=0` => `>= 0`
* print `i`, `j` in the log message
MechCoder
Author: Xiangrui Meng <meng@databricks.com>
Closes #9189 from mengxr/SPARK-10082.

The given row_ind should be less than the number of rows, and the given col_ind should be less than the number of cols. The current code in master gives unpredictable behavior in such cases.
Author: MechCoder <manojkumarsivaraj334@gmail.com>
Closes #8271 from MechCoder/hash_code_matrices.

Author: Tijo Thomas <tijoparacka@gmail.com>
Author: tijo <tijo@ezzoft.com>
Closes #8554 from tijoparacka/SPARK-10261-2.

…2 regularization if the number of features is small
Author: lewuathe <lewuathe@me.com>
Author: Lewuathe <sasaki@treasure-data.com>
Author: Kai Sasaki <sasaki@treasure-data.com>
Author: Lewuathe <lewuathe@me.com>
Closes #8884 from Lewuathe/SPARK-10668.

predictImpl
predictNodeIndex is moved to LearningNode and renamed predictImpl for consistency with Node.predictImpl.
Author: Luvsandondov Lkhamsuren <lkhamsurenl@gmail.com>
Closes #8609 from lkhamsurenl/SPARK-9963.

jira: https://issues.apache.org/jira/browse/SPARK-11029
We should add a method analogous to spark.mllib.clustering.KMeansModel.computeCost to spark.ml.clustering.KMeansModel.
This will be a temporary fix until we have proper evaluators defined for clustering.
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: yuhaoyang <yuhao@zhanglipings-iMac.local>
Closes #9073 from hhbyyh/computeCost.

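For reference, what such a method computes is the k-means cost: the sum of squared Euclidean distances from each point to its nearest cluster center. A minimal sketch (the name `compute_cost` is illustrative, not the spark.ml signature):

```python
def compute_cost(points, centers):
    """Sum of squared distances from each point to its closest center."""
    return sum(
        min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
        for p in points
    )

# First point sits on a center (cost 0); second is distance 1 from its nearest.
cost = compute_cost([(0.0, 0.0), (2.0, 0.0)], [(0.0, 0.0), (2.0, 1.0)])
```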
This PR aims to decrease communication costs in BlockMatrix multiplication in two ways:
- Simulate the multiplication on the driver, and figure out which blocks actually need to be shuffled
- Send each block once to a partition, and join inside the partition rather than sending multiple copies to the same partition
**NOTE**: One important note is that the old behavior of checking for multiple blocks with the same index is lost right now. This is not hard to add, but is a little more expensive than before.
Initial benchmarking showed promising results (see below); however, I did hit some `FileNotFound` exceptions with the new implementation after the shuffle.
Size A: 1e5 x 1e5
Size B: 1e5 x 1e5
Block Sizes: 1024 x 1024
Sparsity: 0.01
Old implementation: 1m 13s
New implementation: 9s
cc avulanov Would you be interested in helping me benchmark this? I used your code from the mailing list (which you sent about 3 months ago?); the old implementation didn't even run, but the new implementation completed in 268s on a 120 GB / 16 core cluster.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #8757 from brkyvz/opt-bmm.

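The driver-side simulation can be illustrated without Spark: from the block coordinates alone (no block contents), work out which A- and B-blocks each nonzero output block actually needs, so only those get shuffled. The names below are made up for the sketch.

```python
from collections import defaultdict

def simulate_multiply(a_blocks, b_blocks):
    """a_blocks, b_blocks: sets of (rowBlock, colBlock) coordinates.
    Returns {(i, k): (needed A coords, needed B coords)} for nonzero outputs."""
    a_by_col = defaultdict(set)   # inner index j -> A blocks (i, j)
    b_by_row = defaultdict(set)   # inner index j -> B blocks (j, k)
    for (i, j) in a_blocks:
        a_by_col[j].add((i, j))
    for (j, k) in b_blocks:
        b_by_row[j].add((j, k))
    plan = defaultdict(lambda: (set(), set()))
    # Only inner indices present in both matrices contribute to the product.
    for j in set(a_by_col) & set(b_by_row):
        for (i, _) in a_by_col[j]:
            for (_, k) in b_by_row[j]:
                need_a, need_b = plan[(i, k)]
                need_a.add((i, j))
                need_b.add((j, k))
    return dict(plan)

plan = simulate_multiply({(0, 0), (1, 1)}, {(0, 0), (1, 0)})
```

With 1% sparsity, most output blocks need only a handful of inputs, which is where the shuffle savings come from.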
AFTSurvivalRegression
The values of the quantile probabilities array should be in the range (0, 1) instead of [0, 1] in `AFTSurvivalRegression.scala`, according to the [discussion](https://github.com/apache/spark/pull/8926#discussion-diff-40698242).
Author: vectorijk <jiangkai@gmail.com>
Closes #9083 from vectorijk/spark-11059.

This PR implements the JSON SerDe for the following param types: `Boolean`, `Int`, `Long`, `Float`, `Double`, `String`, `Array[Int]`, `Array[Double]`, and `Array[String]`. The implementations for `Float`, `Double`, and `Array[Double]` are specialized to handle `NaN` and `Inf`s. This will be used in pipeline persistence. jkbradley
Author: Xiangrui Meng <meng@databricks.com>
Closes #9090 from mengxr/SPARK-7402.

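Why the `Float`/`Double` cases need specialization can be seen with a small sketch: standard JSON has no NaN/Infinity literals, so those values must round-trip as string tokens. The tokens and helper names here are illustrative, not necessarily Spark's exact encoding.

```python
import json
import math

def encode_double(v):
    """Map special doubles to string tokens so they survive JSON."""
    if math.isnan(v):
        return "NaN"
    if math.isinf(v):
        return "Inf" if v > 0 else "-Inf"
    return v

def decode_double(v):
    if isinstance(v, str):
        return {"NaN": float("nan"), "Inf": float("inf"), "-Inf": float("-inf")}[v]
    return float(v)

payload = json.dumps([encode_double(x) for x in [1.5, float("nan"), float("inf")]])
decoded = [decode_double(x) for x in json.loads(payload)]
```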
PySpark
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
Closes #8700 from smartkiwi/SPARK-10535_.

Compute upper triangular values of the covariance matrix, then copy to lower triangular values.
Author: Nick Pritchard <nicholas.pritchard@falkonry.com>
Closes #8940 from pnpritchard/SPARK-10875.

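The optimization amounts to exploiting symmetry: each covariance entry is computed once for the upper triangle and mirrored into the lower triangle, roughly halving the work. A small pure-Python sketch (not the RowMatrix implementation), using the unbiased n-1 normalization:

```python
def covariance(rows):
    """Sample covariance matrix of the columns of `rows`."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i, d):                        # upper triangle only
            s = sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows)
            cov[i][j] = s / (n - 1)
            cov[j][i] = cov[i][j]                    # mirror to lower triangle
    return cov

cov = covariance([[1.0, 2.0], [3.0, 6.0]])
```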
absolute error
GBT compares the validation error against a tolerance, switching between relative and absolute thresholds; the relative threshold is measured against the current loss on the training set.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8549 from yanboliang/spark-7770.

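The tolerance switch can be sketched as follows; the cutoff of 0.01 and the function name are illustrative assumptions, not the actual GBT internals.

```python
def should_stop(old_error, new_error, current_loss, tol):
    """Convergence check that normalizes by the current loss when it is
    large enough, and falls back to an absolute comparison otherwise."""
    if current_loss > 0.01:
        return abs(old_error - new_error) / current_loss < tol   # relative
    return abs(old_error - new_error) < tol                      # absolute
```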
LinearRegression training summary: The transformed dataset should hold all columns, not just selected ones like prediction and label. There is no real need to remove some, and the user may find them useful.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8564 from holdenk/SPARK-9718-LinearRegressionTrainingSummary-all-columns.

Reimplement `DecisionTree.findSplitsBins` via `RDD` to parallelize bin calculation.
With large feature spaces the current implementation is very slow. This change limits the features that are distributed (or collected) to just the continuous features, and performs the split calculations in parallel. It completes on a real multi-terabyte dataset in less than a minute instead of multiple hours.
Author: Nathan Howell <nhowell@godaddy.com>
Closes #8246 from NathanHowell/SPARK-10064.

cleaning up some code
Refactoring the `Instance` case class out of LOR and LIR, and also cleaning up some code.
Author: DB Tsai <dbt@netflix.com>
Closes #8853 from dbtsai/refactoring.

and ALS
Consolidate the Cholesky solvers in WeightedLeastSquares and ALS.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8936 from yanboliang/spark-10490.

(spark.mllib)
Provide initialModel param for pyspark.mllib.clustering.KMeans
Author: Evan Chen <chene@us.ibm.com>
Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8775 from vanzin/SPARK-10300.

It is currently impossible to clear Param values once set. It would be helpful to be able to.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8619 from holdenk/SPARK-9841-params-clear-needs-to-be-public.

https://github.com/apache/spark/pull/8882 broke our build.
Author: Yin Huai <yhuai@databricks.com>
Closes #8964 from yhuai/fixStyle.

See JIRA [here](https://issues.apache.org/jira/browse/SPARK-6530).
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5742 from yinxusen/SPARK-6530.

JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5890).
I borrowed the code of `findSplits` from `RandomForest`; I don't think it's good to call it from `RandomForest` directly.
Author: Xusen Yin <yinxusen@gmail.com>
Closes #5779 from yinxusen/SPARK-5890.

Document CrossValidatorModel members: bestModel and avgMetrics
Author: Rerngvit Yanggratoke <rerngvit@kth.se>
Closes #8882 from rerngvit/Spark-9798.

For some implicit datasets, ratings may not exist in the training data. In this case, we can assume all observed pairs to be positive and treat their ratings as 1. This should happen when users set ```ratingCol``` to an empty string.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8937 from yanboliang/spark-10736.

I implemented toString for AssociationRules.Rule, formatted like `[x, y] => {z}: 1.0`.
Author: y-shimizu <y.shimizu0429@gmail.com>
Closes #8904 from y-shimizu/master.

This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side benefit of cleaning up the double underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.

simplified dataframe construction
As introduced in https://issues.apache.org/jira/browse/SPARK-10630 we now have an easier way to create dataframes from local Java lists. Let's update the tests to use it.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8886 from holdenk/SPARK-10763-update-java-mllib-ml-tests-to-use-simplified-dataframe-construction.

Currently users can set ```checkpointInterval``` to specify how often the cache should be checkpointed, but we also need a way to disable it. This PR lets users disable checkpointing by setting ```checkpointInterval = -1```.
We also add documentation for the GBT ```cacheNodeIds``` param so that users can understand checkpointing more clearly.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8820 from yanboliang/spark-10699.

By default ```quantilesCol``` should be empty. If ```quantileProbabilities``` is set, we should append quantiles as a new column (of type Vector).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8836 from yanboliang/spark-10686.

All prediction models should store `numFeatures`, indicating the number of features the model was trained on. A default value of -1 is added for backwards compatibility.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #8675 from sethah/SPARK-9715.

Currently, when you set an illegal value for params of array type (such as IntArrayParam, DoubleArrayParam, StringArrayParam), it throws an IllegalArgumentException with incomprehensible error information.
Take ```VectorSlicer.setNames``` as an example:
```scala
val vectorSlicer = new VectorSlicer().setInputCol("features").setOutputCol("result")
// The value of setNames must contain distinct elements, so the next line will throw an exception.
vectorSlicer.setIndices(Array.empty).setNames(Array("f1", "f4", "f1"))
```
It throws an IllegalArgumentException as:
```
vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
java.lang.IllegalArgumentException: vectorSlicer_b3b4d1a10f43 parameter names given invalid value [Ljava.lang.String;798256c5.
```
We should distinguish values of array type from primitive types in Param.validate(value: T), so that we get better error information:
```
vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
java.lang.IllegalArgumentException: vectorSlicer_3b744ea277b2 parameter names given invalid value [f1,f4,f1].
```
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8863 from yanboliang/spark-10750.

prevNodeIdsForInstances.unpersist() at end of training
NodeIdCache: prevNodeIdsForInstances.unpersist() needs to be called at the end of training.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8541 from holdenk/SPARK-9962-decission-tree-training-prevNodeIdsForiNstances-unpersist-at-end-of-training.

Add a Java wrapper for the random vector RDD.
holdenk srowen
Author: Meihua Wu <meihuawu@umich.edu>
Closes #8841 from rotationsymmetry/SPARK-10706.

testing
Implementation of significance testing using the Streaming API.
Author: Feynman Liang <fliang@databricks.com>
Author: Feynman Liang <feynman.liang@gmail.com>
Closes #4716 from feynmanliang/ab_testing.

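One way to sketch the streaming idea: keep constant-size running statistics per group (Welford's method) so a Welch t-statistic can be emitted after every batch without storing the raw observations. This is an illustration of the concept, not the Spark implementation.

```python
import math

class RunningStats:
    """Welford's online mean/variance: O(1) state per group."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def add(self, x):
        self.n += 1
        d = x - self.mean
        self.mean += d / self.n
        self.m2 += d * (x - self.mean)

    @property
    def var(self):
        return self.m2 / (self.n - 1)

def welch_t(a, b):
    """Welch t-statistic for unequal-variance two-sample comparison."""
    se = math.sqrt(a.var / a.n + b.var / b.n)
    return (a.mean - b.mean) / se

a, b = RunningStats(), RunningStats()
for x in [1.0, 2.0, 3.0]:
    a.add(x)
for x in [2.0, 3.0, 4.0]:
    b.add(x)
```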
In many modeling applications, data points are not necessarily sampled with equal probabilities. Linear regression should support instance weights to account for over- or under-sampling.
Work in progress.
Author: Meihua Wu <meihuawu@umich.edu>
Closes #8631 from rotationsymmetry/SPARK-9642.

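How instance weights enter least squares can be shown with the weighted normal equations (X^T W X) beta = X^T W y, i.e. each squared residual scaled by its weight. A numpy sketch under that formulation, not Spark's solver:

```python
import numpy as np

def weighted_lstsq(X, y, w):
    """Solve the weighted normal equations (X^T W X) beta = X^T W y."""
    Xw = X * w[:, None]                      # scale each row by its weight
    return np.linalg.solve(X.T @ Xw, Xw.T @ y)

# Data with an exact fit (intercept 0, slope 1): any positive weights
# recover the same coefficients.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
beta = weighted_lstsq(X, y, np.array([1.0, 2.0, 0.5]))
```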
SPARK-3136 added a large number of functions for creating Java RandomRDDs, but for people who want to use custom RandomDataGenerators we should provide a Java-friendly method.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8782 from holdenk/SPARK-10626-create-java-friendly-method-for-randomRDD.

There is a duplicate setting of the initialization flag in `WeightedLeastSquares#add`: `initialized` is already set in `init(Int)`.
Author: lewuathe <lewuathe@me.com>
Closes #8837 from Lewuathe/duplicate-initialization-flag.

Note methods that fail for cols > 65535; note that SVD does not require n >= m.
CC mengxr
Author: Sean Owen <sowen@cloudera.com>
Closes #8839 from srowen/SPARK-5905.

This makes equality test failures much more readable.
mengxr
Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekhliang@gmail.com>
Closes #8826 from ericl/attrgroupstr.

The [Accelerated Failure Time (AFT) model](https://en.wikipedia.org/wiki/Accelerated_failure_time_model) is a commonly used and easily parallelized method of survival analysis for censored survival data. It is a log-linear model based on the Weibull distribution of the survival time.
Users can refer to the R function [```survreg```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/survreg.html) to compare the model and [```predict```](https://stat.ethz.ch/R-manual/R-devel/library/survival/html/predict.survreg.html) to compare the predictions. There are different kinds of model prediction; I have selected the type ```response```, which is the default used in R.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8611 from yanboliang/spark-8518.

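The log-linear form mentioned above implies a simple "response"-type prediction: the exponent of the linear predictor, mirroring the default type in R's predict.survreg. The coefficients below are made up for illustration; this is not the exact Spark API.

```python
import math

def predict_response(features, coefficients, intercept):
    """AFT response prediction: exp of the linear predictor,
    i.e. the predicted (median-scale) survival time."""
    return math.exp(intercept + sum(c * x for c, x in zip(coefficients, features)))

# Linear predictor: 1.0 + 0.5*1.0 + (-0.25)*2.0 = 1.0
t = predict_response([1.0, 2.0], coefficients=[0.5, -0.25], intercept=1.0)
```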
feature interactions
This is a prerequisite for supporting the ":" operator in the RFormula feature transformer.
Design doc from the umbrella task: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #7987 from ericl/interaction.

```GBTParams``` currently has ```stepSize``` as the learning rate.
ML has the shared param class ```HasStepSize```; ```GBTParams``` can extend it rather than duplicating the implementation.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8552 from yanboliang/spark-10394.

Should be the same as SPARK-7808 but using Java for the code example.
It would be great to add package docs for `spark.ml.feature`.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8740 from holdenk/SPARK-10077-JAVA-PACKAGE-DOC-FOR-SPARK.ML.FEATURE.

In fraud detection datasets, almost all the samples are negative while only a couple of them are positive. This type of highly imbalanced data biases models toward the negative class, resulting in poor performance. scikit-learn provides a correction allowing users to over-/under-sample the samples of each class according to given weights; in auto mode, it selects weights inversely proportional to class frequencies in the training set. This can be done more efficiently by multiplying the weights into the loss and gradient instead of actually over-/under-sampling the training dataset, which is very expensive.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html
On the other hand, some training data may be more important: for example, training samples from tenured users may matter more than samples from new users. We should be able to provide another "weight: Double" field in the LabeledPoint to weight samples differently in the learning algorithm.
Author: DB Tsai <dbt@netflix.com>
Author: DB Tsai <dbt@dbs-mac-pro.corp.netflix.com>
Closes #7884 from dbtsai/SPARK-7685.

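The "auto" mode described above can be sketched with the usual balanced-weights heuristic (as in scikit-learn's "balanced" option): weight each class by n_samples / (n_classes * class_count), and then multiply each sample's loss and gradient contribution by its class weight instead of resampling.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Class weights inversely proportional to class frequencies:
    n_samples / (n_classes * count(class))."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Highly imbalanced toy labels: the rare positive class gets the larger weight.
w = balanced_class_weights([0, 0, 0, 1])
```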