author | Yanbo Liang <ybliang8@gmail.com> | 2016-12-07 00:31:11 -0800 |
---|---|---|
committer | Yanbo Liang <ybliang8@gmail.com> | 2016-12-07 00:31:11 -0800 |
commit | 90b59d1bf262b41c3a5f780697f504030f9d079c (patch) | |
tree | b52ec6f3b7a43040557ec882225f98aeccd336a6 /mllib-local | |
parent | 5c6bcdbda4dd23bbd112a7395cd9d1cfd04cf4bb (diff) | |
[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.
## What changes were proposed in this pull request?
Several cleanups and improvements for ```spark.logit```:
* ```summary``` should return the coefficients matrix, and should output the label of each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Moreover, these metrics currently ignore instance weights (treating every weight as 1.0), and this will change in a later Spark version. To avoid introducing breaking changes at that point, we do not expose them for now.
* SparkR test improvement: compare the training result with native R glmnet.
* Remove the argument ```aggregationDepth``` from ```spark.logit```, since it is an expert Param (related to Spark architecture and job execution) that R users would rarely use.
## How was this patch tested?
Unit tests.
The ```summary``` output after this change:
multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```
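To illustrate how the multinomial matrix can be read, here is a minimal base-R sketch (no Spark session needed). It retypes the coefficients printed above into a plain matrix and applies a softmax to the per-class linear scores. The observation `x` is a hypothetical flower chosen for illustration, and the softmax link is the usual multinomial logistic form, assumed here rather than taken from this patch.

```r
# Coefficient matrix copied from the summary() output above:
# one row per term, one column per class label.
coefs <- matrix(
  c( 1.514031,    -2.609108,    1.095077,
     0.02511006,   0.2649821,  -0.2900921,
    -0.5291215,   -0.02016446,  0.549286,
     0.03647411,   0.1544119,  -0.190886,
     0.000236092,  0.4195804,  -0.4198165),
  nrow = 5, byrow = TRUE,
  dimnames = list(
    c("(Intercept)", "Sepal_Length", "Sepal_Width", "Petal_Length", "Petal_Width"),
    c("versicolor", "virginica", "setosa")))

# Hypothetical observation (illustrative values only); the leading 1
# multiplies the intercept row.
x <- c(1, Sepal_Length = 5.1, Sepal_Width = 3.5, Petal_Length = 1.4, Petal_Width = 0.2)

# One linear score per class (column), then a softmax to turn the
# scores into probabilities that sum to 1.
scores <- drop(x %*% coefs)
probs <- exp(scores) / sum(exp(scores))

print(round(probs, 3))
print(names(which.max(probs)))  # label of the highest-probability class
```

In a live SparkR session the same matrix would come from `summary(model)$coefficients`, and `predict(model, df)` remains the normal way to score data; the manual computation is only meant to show why the class labels on the columns matter.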
binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```
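As a reading aid for the binomial case, the single ```Estimate``` column can be interpreted through the standard logistic link. The formula below is a sketch under that assumption; which of the two species Spark encodes as label 1 is not stated in this message, so the positive class is left generic.

$$
P(y = 1 \mid x) = \frac{1}{1 + \exp\!\bigl(-(\beta_0 + \beta^{\top} x)\bigr)}
$$

Here $\beta_0$ is the ```(Intercept)``` estimate, $\beta$ is the vector of the four feature estimates, and $x$ is the vector of feature values in the same order as the rows.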
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #16117 from yanboliang/spark-18686.