[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit. - spark

author	Yanbo Liang <ybliang8@gmail.com>	2016-12-07 00:31:11 -0800
committer	Yanbo Liang <ybliang8@gmail.com>	2016-12-07 00:31:11 -0800
commit	90b59d1bf262b41c3a5f780697f504030f9d079c (patch)
tree	b52ec6f3b7a43040557ec882225f98aeccd336a6 /mllib-local
parent	5c6bcdbda4dd23bbd112a7395cd9d1cfd04cf4bb (diff)

[SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.

## What changes were proposed in this pull request?

Several cleanup and improvements for ```spark.logit```:

* ```summary``` should return the coefficients matrix, and should output labels for each class if the model is a multinomial logistic regression model.
* ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrames, which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0), which will be changed in a later Spark version. Since that change may be breaking, we do not expose them for now.
* SparkR test improvement: comparing the training result with native R glmnet.
* Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param (related to Spark architecture and job execution) that would rarely be used by R users.

## How was this patch tested?

Unit tests.

The ```summary``` output after this change:

multinomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> model <- spark.logit(df, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             versicolor  virginica   setosa
(Intercept)  1.514031    -2.609108   1.095077
Sepal_Length 0.02511006  0.2649821   -0.2900921
Sepal_Width  -0.5291215  -0.02016446 0.549286
Petal_Length 0.03647411  0.1544119   -0.190886
Petal_Width  0.000236092 0.4195804   -0.4198165
```

binomial logistic regression:
```
> df <- suppressWarnings(createDataFrame(iris))
> training <- df[df$Species %in% c("versicolor", "virginica"), ]
> model <- spark.logit(training, Species ~ ., regParam = 0.5)
> summary(model)
$coefficients
             Estimate
(Intercept)  -6.053815
Sepal_Length 0.2449379
Sepal_Width  0.1648321
Petal_Length 0.4730718
Petal_Width  1.031947
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16117 from yanboliang/spark-18686.
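For readers who want to see what a comparison against native R glmnet could look like, here is a minimal sketch. It is not taken from the patch itself. It assumes the glmnet package is installed, that `alpha = 0` (pure L2 penalty) and `lambda = 0.5` are the glmnet settings that roughly correspond to `regParam = 0.5` with Spark's default elastic-net mixing, and that the built-in `iris` data frame is used. The names `fit`, `coefs` and `coef_matrix` are illustrative only.

```r
# Sketch only: fit a multinomial logistic regression with glmnet on iris
# so its coefficients can be compared by eye with summary(model)$coefficients.
library(glmnet)  # assumption: glmnet is installed

x <- as.matrix(iris[, 1:4])               # four numeric feature columns
y <- as.factor(as.character(iris[, 5]))   # three-class label (Species)

# Assumed mapping to the Spark call: alpha = 0 for an L2 penalty,
# lambda = 0.5 in place of regParam = 0.5.
fit <- glmnet(x, y, family = "multinomial", alpha = 0, lambda = 0.5)

# coef() returns one sparse column per class; bind them into a dense
# features-by-classes matrix shaped like the summary output.
coefs <- coef(fit)
coef_matrix <- sapply(coefs, function(m) as.numeric(as.matrix(m)))
rownames(coef_matrix) <- c("(Intercept)", colnames(x))

print(coef_matrix)
```

Note that glmnet keeps the original column names (`Sepal.Length`), whereas SparkR renames them with underscores (`Sepal_Length`), and the class columns may appear in a different order, so rows and columns should be matched by name rather than by position.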

Diffstat (limited to 'mllib-local')

0 files changed, 0 insertions, 0 deletions

