| Commit message | Author | Age | Files | Lines |
| |
Implement ```IterativelyReweightedLeastSquares``` solver for GLM. I consider it a solver rather than an estimator; it is only used internally, so I keep it ```private[ml]```.
There are two limitations in the current implementation compared with R:
* It cannot support ```Tuple``` as the response for the ```Binomial``` family, such as in the following code:
```
glm( cbind(using, notUsing) ~ age + education + wantsMore , family = binomial)
```
* It does not support ```offset```.
Because ```RFormula``` does not support ```Tuple``` as the label or the ```offset``` keyword, I simplified the implementation. Adding support for these two features is not very hard; I can do it in a follow-up PR if necessary. Meanwhile, we can also add an R-like statistical summary for IRLS.
The implementation refers to R, [statsmodels](https://github.com/statsmodels/statsmodels) and [sparkGLM](https://github.com/AlteryxLabs/sparkGLM).
Please focus on the main structure and overlook minor issues/docs that I will update later. Any comments and opinions are appreciated.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10639 from yanboliang/spark-9835.
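For reference, IRLS fits a GLM by repeatedly solving a weighted least squares problem against a "working response" derived from the current coefficients. A minimal NumPy sketch for the binomial (logistic) case follows; it is an illustration of the algorithm only, not a mirror of the Spark implementation:

```python
import numpy as np

def irls_logistic(X, y, n_iter=25, tol=1e-8):
    """Fit logistic-regression coefficients by IRLS (Fisher scoring).

    Each iteration builds working weights w and a working response z from
    the current linear predictor, then solves the weighted normal equations
    (X^T W X) beta = X^T W z.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        eta = X @ beta
        mu = 1.0 / (1.0 + np.exp(-eta))           # mean function (sigmoid)
        w = np.maximum(mu * (1.0 - mu), 1e-12)    # working weights
        z = eta + (y - mu) / w                    # working response
        WX = X * w[:, None]
        beta_new = np.linalg.solve(X.T @ WX, WX.T @ z)
        if np.max(np.abs(beta_new - beta)) < tol:
            return beta_new
        beta = beta_new
    return beta
```

On well-behaved (non-separable) data this recovers the usual maximum-likelihood coefficients in a handful of iterations.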
| |
regularized
The intercept in Logistic Regression represents a prior on the classes and should not be regularized. In MLlib, regularization is handled through Updater, and the Updater penalizes all the components without excluding the intercept, which results in poor training accuracy with regularization.
The new implementation in the ML framework handles this properly, and we should call the ML implementation from MLlib, since the majority of users are still using the MLlib API.
Note that both of them do feature scaling to improve convergence, and the only difference is that the ML version doesn't regularize the intercept. As a result, when lambda is zero, they converge to the same solution.
Previously partially reviewed at https://github.com/apache/spark/pull/6386#issuecomment-168781424 re-opening for dbtsai to review.
Author: Holden Karau <holden@us.ibm.com>
Author: Holden Karau <holden@pigscanfly.ca>
Closes #10788 from holdenk/SPARK-7780-intercept-in-logisticregressionwithLBFGS-should-not-be-regularized.
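A toy sketch of the distinction: when adding the L2-penalty gradient, the intercept component is skipped. (The index-0 intercept layout and function name are conventions of this sketch, not Spark's actual code.)

```python
import numpy as np

def add_l2_penalty(grad_loss, beta, lam):
    """Add the L2 regularization gradient lam * beta to the loss gradient,
    leaving the intercept (index 0 in this sketch) unpenalized."""
    penalty = lam * beta
    penalty[0] = 0.0  # the intercept is a prior on classes; don't shrink it
    return grad_loss + penalty
```

An updater that penalizes every component would instead leave `penalty[0]` nonzero, pulling the intercept toward zero and hurting training accuracy.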
| |
… Add LibSVMOutputWriter
The behavior of LibSVMRelation is unchanged except for adding LibSVMOutputWriter.
* Partitioning is still not supported
* Multiple input paths are not supported
Author: Jeff Zhang <zjffdu@apache.org>
Closes #9595 from zjffdu/SPARK-11622.
| |
than its parent class
https://issues.apache.org/jira/browse/SPARK-12952
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10863 from yinxusen/SPARK-12952.
| |
https://issues.apache.org/jira/browse/SPARK-12834
We use `SerDe.dumps()` to serialize `JavaArray` and `JavaList` in `PythonMLLibAPI`, then deserialize them with `PickleSerializer` on the Python side. However, there is no need to transform them in such an inefficient way. Instead, we can use type conversion, e.g. `list(JavaArray)` or `list(JavaList)`. What's more, there is an issue with ser/de of Scala Array, as I noted in https://issues.apache.org/jira/browse/SPARK-12780
Author: Xusen Yin <yinxusen@gmail.com>
Closes #10772 from yinxusen/SPARK-12834.
| |
```PCAModel``` can output ```explainedVariance``` on the Python side.
cc mengxr srowen
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10830 from yanboliang/spark-12905.
| |
Update the user guide for RFormula feature interactions. We also document other new features, such as string label support in Spark 1.6.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10222 from yanboliang/spark-11965.
| |
- Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
- Remove HttpFileServer
- Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. I think it's still worth keeping this config, because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it.
- Update comments and docs
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10854 from zsxwing/remove-akka.
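The config matters because task results small enough to fit in one RPC message can be sent inline, while larger ones go through the block manager and only a handle is returned. A hedged sketch of that decision (the function and names are illustrative, not Spark's exact code):

```python
def choose_result_path(result_size_bytes, max_rpc_message_bytes):
    """Send small task results inline over RPC ('direct'); store oversized
    ones in the block manager and ship only a reference ('indirect')."""
    if result_size_bytes <= max_rpc_message_bytes:
        return "direct"
    return "indirect"
```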
| |
converge issue
When all labels are the same, it is difficult for LogisticRegression without an intercept to converge. GLMNET doesn't support this case and will just exit. GLM can train, but will emit a warning saying the algorithm did not converge.
Author: DB Tsai <dbt@netflix.com>
Closes #10862 from dbtsai/add-tests.
| |
Add Since annotations to ml.param and ml.*
Author: Takahashi Hiroshi <takahashi.hiroshi@lab.ntt.co.jp>
Author: Hiroshi Takahashi <takahashi.hiroshi@lab.ntt.co.jp>
Closes #8935 from taishi-oss/issue10263.
| |
properly if standard deviation of target variable is zero.
This fixes the behavior of WeightedLeastSquares.fit() when the standard deviation of the target variable is zero. If fitIntercept is true, there is no need to train.
Author: Imran Younus <iyounus@us.ibm.com>
Closes #10274 from iyounus/SPARK-12230_bug_fix_in_weighted_least_squares.
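In that special case the closed-form solution is a zero coefficient vector with the intercept equal to the (constant) label value, so the solver can be skipped entirely. A sketch of just that branch, with an illustrative signature rather than Spark's actual API:

```python
import numpy as np

def wls_constant_label_shortcut(X, y, fit_intercept=True):
    """If every label is identical and an intercept is fitted, the exact
    solution is coefficients = 0 and intercept = that constant label, so
    return it without running the solver. Returns None otherwise, meaning
    the normal WLS path (omitted in this sketch) should run."""
    if fit_intercept and np.std(y) == 0.0:
        coefficients = np.zeros(X.shape[1])
        intercept = float(y[0])
        return coefficients, intercept
    return None
```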
| |
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #9604 from yu-iskw/SPARK-6519.
| |
prediction column
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10472 from BenFradet/SPARK-9716.
| |
training data
CC jkbradley mengxr dbtsai
Author: Feynman Liang <feynman.liang@gmail.com>
Closes #10743 from feynmanliang/SPARK-12804.
| |
From the coverage issues for 1.6: add a Python API for mllib.clustering.BisectingKMeans.
Author: Holden Karau <holden@us.ibm.com>
Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
| |
Change the assertion's message so it's consistent with the code. The old message says that the invoked method was lapack.dports, where in fact it was the lapack.dppsv method.
Author: Wojciech Jurczyk <wojtek.jurczyk@gmail.com>
Closes #10818 from wjur/wjur/rename_error_message.
| |
Currently `summary()` fails on a GLM model fitted over a vector feature missing ML attrs, since the output feature attrs will also have no name. We can avoid this situation by forcing `VectorAssembler` to make up suitable names when inputs are missing names.
cc mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #10323 from ericl/spark-12346.
| |
I created a new PR since the original PR had not been updated for a long time.
Please help review.
srowen
Author: Tommy YU <tummyyu@163.com>
Closes #10756 from Wenpei/add_since_to_recomm.
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #10764 from rxin/SPARK-12830.
| |
of features is large
jira: https://issues.apache.org/jira/browse/SPARK-12026
The issue is valid, as features.toArray.view.zipWithIndex.slice(startCol, endCol) becomes slower as startCol gets larger.
I tested locally; the change improves performance and the running time is stable.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #10146 from hhbyyh/chiSq.
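The slowdown comes from the zipped view still traversing the prefix before the slice takes effect. Iterating over only the wanted index range avoids that prefix work; a small illustrative sketch of the pattern (not the patched Spark code):

```python
def pairs_in_range(features, start_col, end_col):
    """Enumerate (index, value) pairs only for [start_col, end_col),
    instead of zipping the whole array with indices and slicing afterwards."""
    return [(i, features[i]) for i in range(start_col, end_col)]
```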
| |
equals to zero
Cosine similarity with a zero vector should be 0.
Related to https://github.com/apache/spark/pull/10152
Author: Sean Owen <sowen@cloudera.com>
Closes #10696 from srowen/SPARK-7615.
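Without the guard, a zero-norm vector produces a 0/0 and hence NaN. A minimal sketch of the convention this fix adopts (plain Python, not the MLlib implementation):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity, defined as 0 when either vector has zero norm,
    avoiding the 0/0 NaN that a naive formula would produce."""
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    if na == 0.0 or nb == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (na * nb)
```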
| |
jira: https://issues.apache.org/jira/browse/SPARK-10809
We could provide a single-document topicDistributions method for LocalLDAModel to allow quick queries that avoid RDD operations. Currently, the user must use an RDD of documents.
This also adds some missing asserts.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #9484 from hhbyyh/ldaTopicPre.
| |
jira: https://issues.apache.org/jira/browse/SPARK-12685
the log of `word2vec` reports
trainWordsCount = -785727483
during computation over a large dataset, i.e. the Int word count overflowed.
Update the priority, as the overflow affects the computation process:
`alpha = learningRate * (1 - numPartitions * wordCount.toDouble / (trainWordsCount + 1))`
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #10627 from hhbyyh/w2voverflow.
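The logged negative value is exactly what a 32-bit signed wraparound of a word count just over 3.5 billion produces (any count with the same low 32 bits would match); widening the counter to 64 bits keeps it positive. A quick demonstration, emulating the wrap since Python ints don't overflow:

```python
def as_int32(n):
    """Interpret n modulo 2**32 as a signed 32-bit integer, the way an
    overflowing JVM Int would store it."""
    n &= 0xFFFFFFFF
    return n - 0x100000000 if n >= 0x80000000 else n

# One corpus size consistent with the logged value:
assert as_int32(3_509_239_813) == -785_727_483
```

Since the broken count feeds directly into the `alpha` formula above, the learning rate schedule is corrupted once the count goes negative.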
| |
single instance predict/predictSoft
PySpark MLlib ```GaussianMixtureModel``` should support single-instance ```predict/predictSoft``` just as the Scala API does.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10552 from yanboliang/spark-12603.
| |
Turn import ordering violations into build errors, plus a few adjustments
to account for how the checker behaves. I'm a little on the fence about
whether the existing code is right, but it's easier to appease the checker
than to discuss what's the more correct order here.
Plus a few fixes to imports that crept in since my recent cleanups.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #10612 from vanzin/SPARK-3873-enable.
| |
before "," or ":")
Fix the style violation (space before , and :).
This PR is a followup for #10643.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #10684 from sarutak/SPARK-12692-followup-mllib.
| |
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
Author: Sean Owen <sowen@cloudera.com>
Closes #10570 from srowen/SPARK-12618.
| |
This PR contains 1 commit which resolves [SPARK-12663](https://issues.apache.org/jira/browse/SPARK-12663).
For the record, I got a positive response from 2 people when I floated this idea on dev@spark.apache.org on 2015-10-23. [Link to archived discussion.](http://apache-spark-developers-list.1001551.n3.nabble.com/slightly-more-informative-error-message-in-MLUtils-loadLibSVMFile-td14764.html)
Author: Robert Dodier <robert_dodier@users.sourceforge.net>
Closes #10611 from robert-dodier/loadlibsvmfile-error-msg-branch.
| |
metricName
For the BinaryClassificationEvaluator, the scaladoc doesn't mention that "areaUnderPR" is supported, only that the default is "areaUnderROC".
Also, the documentation says:
"The default metric used to choose the best ParamMap can be overridden by the setMetric method in each of these evaluators."
However, the method is called setMetricName.
This PR aims to fix both issues.
Author: BenFradet <benjamin.fradet@gmail.com>
Closes #10328 from BenFradet/SPARK-12368.
| |
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #10582 from vanzin/SPARK-3873-tests.
| |
SPARK-12450. Un-persist broadcast variables in KMeans.
Author: RJ Nowling <rnowling@gmail.com>
Closes #10415 from rnowling/spark-12450.
| |
Support model save/load for FPGrowthModel
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9267 from yanboliang/spark-6724.
| |
Modified the definition of R^2 for regression through the origin. Added a modified test for regression metrics.
Author: Imran Younus <iyounus@us.ibm.com>
Author: Imran Younus <imranyounus@gmail.com>
Closes #10384 from iyounus/SPARK_12331_R2_for_regression_through_origin.
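For a no-intercept model, the usual centered total sum of squares is replaced by the uncentered one, Σy², since the baseline model is ŷ = 0 rather than ŷ = ȳ (this matches R's convention for no-intercept fits). A sketch of that definition, for illustration rather than the Spark code:

```python
def r2_through_origin(y, y_pred):
    """R^2 for regression through the origin: 1 - SS_res / SS_tot, with an
    uncentered total sum of squares (sum of y_i^2) because the model has
    no intercept."""
    ss_res = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))
    ss_tot = sum(yi ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot
```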
| |
DecisionTreeRegressor will provide variance of prediction as a Double column.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8866 from yanboliang/spark-9622.
| |
See JIRA: https://issues.apache.org/jira/browse/SPARK-11259
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9224 from yanboliang/spark-11259.
| |
callUDF has been deprecated. However, we do not have an alternative for users to specify the output data type without type tags. This pull request introduces a new API for that, and replaces the invocations of the deprecated callUDF with it.
Author: Reynold Xin <rxin@databricks.com>
Closes #10547 from rxin/SPARK-12599.
| |
A slight adjustment to the checker configuration was needed; there is
a handful of warnings still left, but those are because of a bug in
the checker that I'll fix separately (before enabling errors for the
checker, of course).
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #10535 from vanzin/SPARK-3873-mllib.
| |
/ PR 10327
Sorry jkbradley
Ref: https://github.com/apache/spark/pull/10327#discussion_r48502942
Author: Sean Owen <sowen@cloudera.com>
Closes #10508 from srowen/SPARK-12349.2.
| |
Include the following changes:
1. Close `java.sql.Statement`
2. Fix incorrect `asInstanceOf`.
3. Remove unnecessary `synchronized` and `ReentrantLock`.
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #10440 from zsxwing/findbugs.
| |
ParamMap#filter uses `mutable.Map#filterKeys`. The return type of `filterKeys` is collection.Map, not mutable.Map, but the result is cast to mutable.Map using `asInstanceOf`, so we get a `ClassCastException`.
Also, the result of Map#filterKeys is not Serializable. That is a Scala issue (https://issues.scala-lang.org/browse/SI-6654).
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #10381 from sarutak/SPARK-12424.
| |
suites after forcing to set specific value to "os.arch" property
Restore the original value of the os.arch property after each test.
Since some tests force a specific value of the os.arch property, we need to restore the original value afterwards.
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #10289 from kiszk/SPARK-12311.
| |
Only load explainedVariance in PCAModel if it was written with Spark > 1.6.x
jkbradley is this kind of what you had in mind?
Author: Sean Owen <sowen@cloudera.com>
Closes #10327 from srowen/SPARK-12349.
| |
Added a catch for the Long-to-Int casting exception when PySpark ALS Ratings are serialized. It is easy to accidentally use Long IDs for user/product, and before, this failed with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer." Now a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
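The check amounts to validating the id against Int.MaxValue before narrowing. A Python sketch of the idea, with illustrative names rather than the patched Scala code:

```python
INT_MAX = 2_147_483_647  # JVM Int.MaxValue

def check_rating_id(id_value):
    """Raise a descriptive error when an ALS user/product id does not fit
    in a 32-bit signed int, instead of surfacing a cryptic cast failure."""
    if id_value > INT_MAX:
        raise ValueError(
            f"Ratings id {id_value} exceeds max integer value of {INT_MAX}")
    return id_value
```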
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #10387 from rxin/version-bump.
| |
test suites
Use ```sqlContext``` from ```MLlibTestSparkContext``` rather than creating a new one for spark.ml test suites. I have checked thoroughly and found four test cases that need updating.
cc mengxr jkbradley
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10279 from yanboliang/spark-12309.
| |
Add random seed Param to Scala CrossValidator
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9108 from yanboliang/spark-9694.
| |
JIRA: https://issues.apache.org/jira/browse/SPARK-12016
We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #10100 from viirya/fix-load-py-wordvecmodel.
| |
Issue
As noted in PR #9441, implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor. As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark. Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`. Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`. As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type. `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.
This PR currently contains that retagging fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`. This PR blocks #9441, so once this is merged, the other can be rebased.
cc holdenk
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes #9458 from dusenberrymw/SPARK-11497_PySpark_RowMatrix_Constructor_Has_Type_Erasure_Issue.
| |
prediction col
LogisticRegression training summary should still function if the predictionCol is set to an empty string or otherwise unset (related to https://issues.apache.org/jira/browse/SPARK-9718 ).
Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>
Closes #9037 from holdenk/SPARK-10991-LogisticRegressionTrainingSummary-handle-empty-prediction-col.
| |
jira: https://issues.apache.org/jira/browse/SPARK-11602
Made a pass over the 1.6 API changes. Opening the PR for efficient discussion.
Author: Yuhao Yang <hhbyyh@gmail.com>
Closes #9939 from hhbyyh/auditScala.