| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Improve PrefixSpan pre-processing efficency by preventing sequences of zero in the cleaned database.
The efficiency gain is reflected in the following graph : https://postimg.org/image/9x6ireuvn/
## How was this patch tested?
Using MLlib's PrefixSpan existing tests and tests of my own on the 8 datasets shown in the graph. All
result obtained were stricly the same as the original implementation (without this change).
dev/run-tests was also runned, no error were found.
Author : Cyril de Vogelaere <cyril.devogelaeregmail.com>
Author: Syrux <pokcyril@hotmail.com>
Closes #17575 from Syrux/SPARK-20265.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
There are several problems with it:
- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
(see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above.
There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
## How was this patch tested?
Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
This was tested via manually adding `time.time()` as below:
```diff
profiles_and_goals = build_profiles + sbt_goals
print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
" ".join(profiles_and_goals))
+ import time
+ st = time.time()
exec_sbt(profiles_and_goals)
+ print("Elapsed :[%s]" % str(time.time() - st))
```
produces
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17477 from HyukjinKwon/SPARK-18692.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
degreesOfFreedom in LR and GLR
## What changes were proposed in this pull request?
- made `numInstances` public in GLR
- made `degreesOfFreedom` public in LR
## How was this patch tested?
reran the concerned test suites
Author: Benjamin Fradet <benjamin.fradet@gmail.com>
Closes #17431 from BenFradet/SPARK-20097.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
locale bug" causes Spark problems
## What changes were proposed in this pull request?
Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #17527 from srowen/SPARK-20156.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This error message doesn't get properly formatted because of a missing `s`. Currently the error looks like:
```
Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
```
(note the literal `$current` instead of the interpolated value)
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Vijay Ramesh <vramesh@demandbase.com>
Closes #17572 from vijaykramesh/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
The Dataframes-based support for the correlation statistics is added in #17108. This patch adds the Python interface for it.
## How was this patch tested?
Python unit test.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #17494 from viirya/correlation-python-api.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator parent UID when being fit. This change fixes that so the models can be properly be identified with their parents.
## How was this patch tested?Existing tests.
Added check to verify that model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it was called by one test for each ML algorithm.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
generation and transform
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-20003
I was doing some test and found the issue. ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
Currently associationRules in FPGrowthModel is a lazy val and `setMinConfidence` in FPGrowthModel has no impact once associationRules got computed .
I try to cache the associationRules to avoid re-computation if `minConfidence` is not changed, but this makes FPGrowthModel somehow stateful. Let me know if there's any concern.
## How was this patch tested?
new unit test and I strength the unit test for model save/load to ensure the cache mechanism.
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #17336 from hhbyyh/fpmodelminconf.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
MLTestingUtils.testOutliersWithSmallWeights
## What changes were proposed in this pull request?
This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees. This is to allow more flexibility in testing outliers since linear models and trees behave differently.
Note: The primary author when this is committed should be sethah since this is taken from his code.
## How was this patch tested?
Existing tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #17501 from jkbradley/SPARK-20183.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
- `spark.fpGrowth` -model training.
- `freqItemsets` and `associationRules` methods with new corresponding generics.
- Scala helper: `org.apache.spark.ml.r. FPGrowthWrapper`
- unit tests.
## How was this patch tested?
Feature specific unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes #17170 from zero323/SPARK-19825.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Add docs and examples for spark.ml.feature.Imputer. Currently scala and Java examples are included. Python example will be added after https://github.com/apache/spark/pull/17316
## How was this patch tested?
local doc generation and example execution
Author: Yuhao Yang <yuhao.yang@intel.com>
Closes #17324 from hhbyyh/imputerdoc.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Some ML Models were using `defaultCopy` which expects a default constructor, and others were not setting the parent estimator. This change fixes these by creating a new instance of the model and explicitly setting values and parent.
## How was this patch tested?
Added `MLTestingUtils.checkCopy` to the offending models to tests to verify the copy is made and parent is set.
Author: Bryan Cutler <cutlerb@gmail.com>
Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
…adoc
## What changes were proposed in this pull request?
Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
## How was this patch tested?
Local build
Author: Jacek Laskowski <jacek@japila.pl>
Closes #17417 from jaceklaskowski/window-expression-scaladoc.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
A pyspark wrapper for spark.ml.stat.ChiSquareTest
## How was this patch tested?
unit tests
doctests
Author: Bago Amirbekian <bago@databricks.com>
Closes #17421 from MrBago/chiSquareTestWrapper.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
uppercase impurity type Gini
Fix bug: DecisionTreeModel can't recongnize Impurity "Gini" when loading
TODO:
+ [x] add unit test
+ [x] fix the bug
Author: 颜发才(Yan Facai) <facai.yan@gmail.com>
Closes #17407 from facaiy/BUG/decision_tree_loader_failer_with_Gini_impurity.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Use the new `compressed` method on matrices to store the logistic regression coefficients as sparse or dense - whichever is requires less memory.
Marked as WIP so we can add some performance test results. Basically, we should see if prediction is slower because of using a sparse matrix over a dense one. This can happen since sparse matrices do not use native BLAS operations when computing the margins.
## How was this patch tested?
Unit tests added.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #17426 from sethah/SPARK-17137.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add Python wrapper for `Imputer` feature transformer.
## How was this patch tested?
New doc tests and tweak to PySpark ML `tests.py`
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This patch adds the Dataframes-based support for the correlation statistics found in the `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.
The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
## How was this patch tested?
```
build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
```
Author: Timothy Hunter <timhunter@databricks.com>
Closes #17108 from thunterdb/19636.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Several javadoc8 breaks have been introduced. This PR proposes fix those instances so that we can build Scala/Java API docs.
```
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
[error] * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
[error] * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
[error] * {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
[error] * {link GroupStateTimeout.EventTimeTimeout}).
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
[error] * Spark SQL types (see {link Encoder} for more details).
[error] ^
[error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
[error] * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
[error] ^
[error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
[error] * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
[error] * See {link GroupState} for more details.
[error] ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error] * {link TaskMetrics} & {link MetricsSystem} objects are not thread safe.
[error] ^
[error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
[error] * {link TaskMetrics} & {link MetricsSystem} objects are not thread safe.
[error] ^
[info] 13 errors
```
```
jekyll 3.3.1 | Error: Unidoc generation failed
```
## How was this patch tested?
Manually via `jekyll build`
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17389 from HyukjinKwon/minor-javadoc8-fix.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
I realized that since ChiSquare is in the package stat, it's pretty unclear if it's the hypothesis test, distribution, or what. This PR renames it to ChiSquareTest to clarify this.
## How was this patch tested?
Existing unit tests
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #17368 from jkbradley/SPARK-20039.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Update docs for NaN handling in approxQuantile.
## How was this patch tested?
existing tests.
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #17369 from zhengruifeng/doc_quantiles_nan.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
- [DOCS] was previously: "rank is the number of latent factors in the model."
- [API] was previously: "rank - number of features to use"
This change describes rank in both places consistently as:
- "Number of features to use (also referred to as the number of latent factors)"
Author: Chris Snow <chris.snowuk.ibm.com>
Author: christopher snow <chsnow123@gmail.com>
Closes #17345 from snowch/SPARK-20011.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Replaces `featuresCol` `Param` with `itemsCol`. See [SPARK-19899](https://issues.apache.org/jira/browse/SPARK-19899).
## How was this patch tested?
Manual tests. Existing unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes #17321 from zero323/SPARK-19899.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Wrapper taking and return a DataFrame
## How was this patch tested?
Copied unit tests from RDD-based API
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #17110 from jkbradley/df-hypotests.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-13568
It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.
Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches, where possible existing DataFrame code can be used (e.g. for approximate quantiles etc).
Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
## How was this patch tested?
new unit tests and manual test
Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao <yuhao.yang@intel.com>
Closes #11601 from hhbyyh/imputer.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR is to enhance StringIndexer with NULL values handling.
Before the PR, StringIndexer will throw an exception when encounters NULL values.
With this PR:
- handleInvalid=error: Throw an exception as before
- handleInvalid=skip: Skip null values as well as unseen labels
- handleInvalid=keep: Give null values an additional index as well as unseen labels
BTW, I noticed someone was trying to solve the same problem ( #9920 ) but seems getting no progress or response for a long time. Would you mind to give me a chance to solve it ? I'm eager to help. :-)
## How was this patch tested?
new unit tests
Author: Menglong TAN <tanmenglong@renrenche.com>
Author: Menglong TAN <tanmenglong@gmail.com>
Closes #17233 from crackcell/11569_StringIndexer_NULL.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This commit moved `distinct` in its intended place to avoid duplicated predictions and adds unit test covering the issue.
## How was this patch tested?
Unit tests.
Author: zero323 <zero323@users.noreply.github.com>
Closes #17283 from zero323/SPARK-19940.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently generating synonyms using a large model (I've tested with 3m words) is very slow. These efficiencies have sped things up for us by ~17%
I wasn't sure if such small changes were worthy of a jira, but the guidelines seemed to suggest that that is the preferred approach
## What changes were proposed in this pull request?
Address a few small issues in the findSynonyms logic:
1) remove usage of ``Array.fill`` to zero out the ``cosineVec`` array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed
2) use Floats throughout. The conversion to Doubles before doing the ``priorityQueue`` is totally superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just serves to put extra strain on the GC
3) convert the slow ``for(i <- cosVec.indices)`` to an ugly, but faster, ``while`` loop
These efficiencies are really only apparent when working with a large model
## How was this patch tested?
Existing unit tests + some in-house tests to time the difference
cc jkbradley MLNick srowen
Author: Asher Krim <krim.asher@gmail.com>
Author: Asher Krim <krim.asher@gmail>
Closes #17263 from Krimit/fasterFindSynonyms.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Port Tweedie GLM #16344 to SparkR
felixcheung yanboliang
## How was this patch tested?
new test in SparkR
Author: actuaryzhang <actuaryzhang10@gmail.com>
Closes #16729 from actuaryzhang/sparkRTweedie.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Give proper syntax for Java and Python in addition to Scala.
## How was this patch tested?
Manually.
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #17215 from jkbradley/write-err-msg.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
"maxDepth" to R models
## What changes were proposed in this pull request?
RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.
Below 4 R wrappers are changed:
* `RandomForestClassificationWrapper`
* `RandomForestRegressionWrapper`
* `GBTClassificationWrapper`
* `GBTRegressionWrapper`
## How was this patch tested?
Test manually on my local machine.
Author: Xin Ren <iamshrek@126.com>
Closes #17207 from keypointt/SPARK-19282.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
of exception
## What changes were proposed in this pull request?
Ensure broadcasted variable are destroyed even in case of exception
## How was this patch tested?
Word2VecSuite was run locally
Author: Anthony Truchet <a.truchet@criteo.com>
Closes #14299 from AnthonyTruchet/SPARK-16440.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tweedie distribution.
## What changes were proposed in this pull request?
PySpark ```GeneralizedLinearRegression``` supports tweedie distribution.
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #17146 from yanboliang/spark-19806.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Since we allow ```Estimator``` and ```Model``` not always share same params (see ```ALSParams``` and ```ALSModelParams```), we should pass in test params for estimator and model separately in function ```testEstimatorAndModelReadWrite```.
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #17151 from yanboliang/test-rw.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
provide methods to return synonyms directly, without wrapping them in a dataframe
In performance sensitive applications (such as user facing apis) the roundtrip to and from dataframes is costly and unnecessary
The methods are named ``findSynonymsArray`` to make the return type clear, which also implies a local datastructure
## How was this patch tested?
updated word2vec tests
Author: Asher Krim <akrim@hubspot.com>
Closes #16811 from Krimit/w2vFindSynonymsLocal.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR is an enhancement to ML StringIndexer.
Before this PR, String Indexer only supports "skip"/"error" options to deal with unseen records.
But those unseen records might still be useful and user would like to keep the unseen labels in
certain use cases, This PR enables StringIndexer to support keeping unseen labels as
indices [numLabels].
'''Before
StringIndexer().setHandleInvalid("skip")
StringIndexer().setHandleInvalid("error")
'''After
support the third option "keep"
StringIndexer().setHandleInvalid("keep")
## How was this patch tested?
Test added in StringIndexerSuite
Signed-off-by: VinceShieh <vincent.xieintel.com>
(Please fill in changes proposed in this fix)
Author: VinceShieh <vincent.xie@intel.com>
Closes #16883 from VinceShieh/spark-17498.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Add unit tests for testing SparseVector.
We can't add mixed DenseVector and SparseVector test case, as discussed in JIRA 19382.
def merge(other: MultivariateOnlineSummarizer): this.type = {
if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
s"Expecting $n but got $
{other.n}
.")
## How was this patch tested?
Unit tests
Author: wm624@hotmail.com <wm624@hotmail.com>
Author: Miao Wang <wangmiao1981@users.noreply.github.com>
Closes #16784 from wangmiao1981/bk.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This is a simple implementation of RecommendForAllUsers & RecommendForAllItems for the Dataframe version of ALS. It uses Dataframe operations (not a wrapper on the RDD implementation). Haven't benchmarked against a wrapper, but unit test examples do work.
## How was this patch tested?
Unit tests
```
$ build/sbt
> mllib/testOnly *ALSSuite -- -z "recommendFor"
> mllib/testOnly
```
Author: Your Name <you@example.com>
Author: sueann <sueann@databricks.com>
Closes #17090 from sueann/SPARK-19535.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
JIRA: [SPARK-19745](https://issues.apache.org/jira/browse/SPARK-19745)
Reorganize SVCAggregator to avoid serializing coefficients. This patch also makes the gradient array a `lazy val` which will avoid materializing a large array on the driver before shipping the class to the executors. This improvement stems from https://github.com/apache/spark/pull/16037. Actually, probably all ML aggregators can benefit from this.
We can either: a.) separate the gradient improvement into another patch b.) keep what's here _plus_ add the lazy evaluation to all other aggregators in this patch or c.) keep it as is.
## How was this patch tested?
This is an interesting question! I don't know of a reasonable way to test this right now. Ideally, we could perform an optimization and look at the shuffle write data for each task, and we could compare the size to what it we know it should be: `numCoefficients * 8 bytes`. Not sure if there is a good way to do that right now? We could discuss this here or in another JIRA, but I suspect it would be a significant undertaking.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #17076 from sethah/svc_agg.
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
make `AFTSurvivalRegression` support numeric censorCol
## How was this patch tested?
existing tests and added tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #17034 from zhengruifeng/aft_numeric_censor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in ALS.
## What changes were proposed in this pull request?
The original ALS was performing unnecessary casting to the user and item ids because the protected checkedCast() method required a double. I removed the castings and refactored the method to receive Any and efficiently handle all permitted numeric values.
## How was this patch tested?
I tested it by running the unit-tests and by manually validating the result of checkedCast for various legal and illegal values.
Author: Vasilis Vryniotis <bbriniotis@datumbox.com>
Closes #17059 from datumbox/als_casting_fix.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
In the ALS method the default values of regParam do not match within the same file (lines [224](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L224) and [714](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala#L714)). In one place we set it to 1.0 and in the other to 0.1.
I changed the one of train() method to 0.1 and now it matches the default value which is visible to Spark users. The method is marked with DeveloperApi so it should not affect the users. Whenever we use the particular method we provide all parameters, so the default does not matter. Only exception is the unit-tests on ALSSuite but the change does not break them.
Note: This PR should get the award of the laziest commit in Spark history. Originally I wanted to correct this on another PR but MLnick [suggested](https://github.com/apache/spark/pull/17059#issuecomment-283333572) to create a separate PR & ticket. If you think this change is too insignificant/minor, you are probably right, so feel free to reject and close this. :)
## How was this patch tested?
Unit-tests
Author: Vasilis Vryniotis <vvryniotis@hotels.com>
Closes #17121 from datumbox/als_regparam.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14503
Function parity: Add FPGrowth and AssociationRules to ML.
design doc: https://docs.google.com/document/d/1bVhABn5DiEj8bw0upqGMJT2L4nvO_0_cXdwu4uMT6uU/pub
Currently I make FPGrowthModel a transformer. For each association rule, it will just examine the input items against antecedents and summarize the consequents.
Update:
Thinking again, FPGrowth is only the algorithm to find the frequent itemsets, and can be replaced by other algorithms. The frequent itemsets are used by AssociationRules to generate the association rules. Then we can use the association rules to predict with other records.
![drawing1](https://cloud.githubusercontent.com/assets/7981698/22489294/76b9302c-e7cb-11e6-8d2d-3fc53f407b2f.png)
**For reviewers**, Let's first decide if the current `transform` function meets your expectation.
Current options:
1. Current implementation: Use Estimator and Transformer pattern in ML, the `transform` function will examine the input items against all the association rules and summarize the consequents. Users can also access frequent items and association rules via other model members.
2. Keep the Estimator and Transformer pattern. But AssociationRulesModel and FPGrowthModel will have empty `transform` function, meaning DataFrame has no change after transform. But users can access frequent items and association rules via other model members.
3. (mentioned by zhengruifeng) Keep the Estimator and Transformer pattern. But `FPGrowthModel` and `AssociationRulesModel` will just return frequent itemsets and association rules DataFrame in the `transform` function. Meaning the resulting DataFrame after `transform` will not be related to the input DataFrame.
4. Discard the Estimator and Transformer pattern. Both FPGrowth and FPGrowthModel will directly extend from PipelineStage, thus we don't need to have a `transform` function.
I'd like to hear more concrete suggestions. I would prefer option 1 or 2.
update 2:
As discussed in the jira, we will not expose AssociationRules as a public API for now.
## How was this patch tested?
new unit test suites
Author: Yuhao <yuhao.yang@intel.com>
Author: Yuhao Yang <yuhao.yang@intel.com>
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #15415 from hhbyyh/mlfpm.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR adds a param to `ALS`/`ALSModel` to set the strategy used when encountering unknown users or items at prediction time in `transform`. This can occur in 2 scenarios: (a) production scoring, and (b) cross-validation & evaluation.
The current behavior returns `NaN` if a user/item is unknown. In scenario (b), this can easily occur when using `CrossValidator` or `TrainValidationSplit` since some users/items may only occur in the test set and not in the training set. In this case, the evaluator returns `NaN` for all metrics, making model selection impossible.
The new param, `coldStartStrategy`, defaults to `nan` (the current behavior). The other option supported initially is `drop`, which drops all rows with `NaN` predictions. This flag allows users to use `ALS` in cross-validation settings. It is made an `expertParam`. The param is made a string so that the set of strategies can be extended in future (some options are discussed in [SPARK-14489](https://issues.apache.org/jira/browse/SPARK-14489)).
## How was this patch tested?
New unit tests, and manual "before and after" tests for Scala & Python using MovieLens `ml-latest-small` as example data. Here, using `CrossValidator` or `TrainValidationSplit` with the default param setting results in metrics that are all `NaN`, while setting `coldStartStrategy` to `drop` results in valid metrics.
Author: Nick Pentreath <nickp@za.ibm.com>
Closes #12896 from MLnick/SPARK-14489-als-nan.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
JIRA: [SPARK-19746](https://issues.apache.org/jira/browse/SPARK-19746)
The following code is inefficient:
````scala
val localCoefficients: Vector = bcCoefficients.value
features.foreachActive { (index, value) =>
val stdValue = value / localFeaturesStd(index)
var j = 0
while (j < numClasses) {
margins(j) += localCoefficients(index * numClasses + j) * stdValue
j += 1
}
}
````
`localCoefficients(index * numClasses + j)` calls `Vector.apply` which creates a new Breeze vector and indexes that. Even if it is not that slow to create the object, we will generate a lot of extra garbage that may result in longer GC pauses. This is a hot inner loop, so we should optimize wherever possible.
## How was this patch tested?
I don't think there's a great way to test this patch. It's purely performance related, so unit tests should guarantee that we haven't made any unwanted changes. Empirically I observed between 10-40% speedups just running short local tests. I suspect the big differences will be seen when large data/coefficient sizes have to pause for GC more often. I welcome other ideas for testing.
Author: sethah <seth.hendrickson16@gmail.com>
Closes #17078 from sethah/logistic_agg_indexing.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR proposes to fix the lint-breaks as below:
```
[ERROR] src/test/java/org/apache/spark/network/TransportResponseHandlerSuite.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.network.buffer.ManagedBuffer.
[ERROR] src/main/java/org/apache/spark/unsafe/types/UTF8String.java:[156,10] (modifier) ModifierOrder: 'Nonnull' annotation modifier does not precede non-annotation modifiers.
[ERROR] src/main/java/org/apache/spark/SparkFirehoseListener.java:[122] (sizes) LineLength: Line is longer than 100 characters (found 105).
[ERROR] src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java:[164,78] (coding) OneStatementPerLine: Only one statement per line allowed.
[ERROR] src/test/java/test/org/apache/spark/JavaAPISuite.java:[1157] (sizes) LineLength: Line is longer than 100 characters (found 121).
[ERROR] src/test/java/org/apache/spark/streaming/JavaMapWithStateSuite.java:[149] (sizes) LineLength: Line is longer than 100 characters (found 113).
[ERROR] src/test/java/test/org/apache/spark/streaming/Java8APISuite.java:[146] (sizes) LineLength: Line is longer than 100 characters (found 122).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[32,8] (imports) UnusedImports: Unused import - org.apache.spark.streaming.Time.
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[611] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/streaming/JavaAPISuite.java:[1317] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetAggregatorSuite.java:[91] (sizes) LineLength: Line is longer than 100 characters (found 102).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[113] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[164] (sizes) LineLength: Line is longer than 100 characters (found 110).
[ERROR] src/test/java/test/org/apache/spark/sql/JavaDatasetSuite.java:[212] (sizes) LineLength: Line is longer than 100 characters (found 114).
[ERROR] src/test/java/org/apache/spark/mllib/tree/JavaDecisionTreeSuite.java:[36] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/streaming/JavaKinesisWordCountASL.java:[26,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[20,8] (imports) UnusedImports: Unused import - com.amazonaws.regions.RegionUtils.
[ERROR] src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisStreamSuite.java:[94] (sizes) LineLength: Line is longer than 100 characters (found 103).
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[30,8] (imports) UnusedImports: Unused import - org.apache.spark.sql.api.java.UDF1.
[ERROR] src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java:[72] (sizes) LineLength: Line is longer than 100 characters (found 104).
[ERROR] src/main/java/org/apache/spark/examples/mllib/JavaRankingMetricsExample.java:[121] (sizes) LineLength: Line is longer than 100 characters (found 101).
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[28,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaRDD.
[ERROR] src/main/java/org/apache/spark/examples/sql/JavaSQLDataSourceExample.java:[29,8] (imports) UnusedImports: Unused import - org.apache.spark.api.java.JavaSparkContext.
```
## How was this patch tested?
Manually via
```bash
./dev/lint-java
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17072 from HyukjinKwon/java-lint.
|
|
|
|
|
|
|
|
|
|
|
|
| |
GeneralizedLinearRegression.linkPower
Add Scaladoc for GeneralizedLinearRegression.linkPower default value
Follow-up to https://github.com/apache/spark/pull/16344
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #17069 from jkbradley/tweedie-comment.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
some SparkR APIs
## What changes were proposed in this pull request?
This is a follow-up PR of #16800
When doing SPARK-19456, we found that "" should be consider a NULL column name and should not be set. aggregationDepth should be exposed as an expert parameter.
## How was this patch tested?
Existing tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #16945 from wangmiao1981/svc.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Destroy broadcasted object without blocking
use `find mllib -name '*.scala' | xargs -i bash -c 'egrep "destroy" -n {} && echo {}'`
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #17016 from zhengruifeng/destroy_without_block.
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Add missing 'setTopicDistributionCol' for LDAModel
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #17021 from zhengruifeng/lda_outputCol.
|