path: root/R/pkg/vignettes/sparkr-vignettes.Rmd
author: actuaryzhang <actuaryzhang10@gmail.com> 2017-03-14 00:50:38 -0700
committer: Felix Cheung <felixcheung@apache.org> 2017-03-14 00:50:38 -0700
commit: f6314eab4b494bd5b5e9e41c6f582d4f22c0967a
tree: ff067df4be9eb6f3b660abf8332136d778201146 /R/pkg/vignettes/sparkr-vignettes.Rmd
parent: 415f9f3423aacc395097e40427364c921a2ed7f1
[SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR
## What changes were proposed in this pull request?

Port Tweedie GLM #16344 to SparkR

felixcheung yanboliang

## How was this patch tested?

new test in SparkR

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16729 from actuaryzhang/sparkRTweedie.
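
Note on the parameters (not part of the original commit message): the vignette text in this patch refers to `var.power` and `link.power` without defining them. The following is a sketch of the standard Tweedie GLM parameterization, assuming these two arguments carry their usual meaning of variance power $p$ and link power $q$; the symbols $p$, $q$, $\mu$, and $\phi$ are introduced here for illustration only.

$$
\operatorname{Var}(Y) = \phi\,\mu^{p},
\qquad
g(\mu) =
\begin{cases}
\mu^{q}, & q \neq 0 \\
\log \mu, & q = 0
\end{cases}
$$

Under that reading, $p = 0$ gives constant variance (Gaussian), $p = 1$ Poisson, $p = 2$ gamma, and $1 < p < 2$ a compound Poisson-gamma distribution; $q = 1$ is the identity link and $q = 0$ the log link. This would be consistent with the two vignette examples in the diff: `var.power = 0.0` reproducing the Gaussian fit (provided the link power defaults to $1 - p$, i.e. identity, which is how the underlying Spark ML estimator documents its default), and `var.power = 1.2, link.power = 0.0` giving a compound Poisson model with a log link.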
Diffstat (limited to 'R/pkg/vignettes/sparkr-vignettes.Rmd')
-rw-r--r-- R/pkg/vignettes/sparkr-vignettes.Rmd 19
1 files changed, 18 insertions, 1 deletions
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index 43c255cff3..a6ff650c33 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -672,6 +672,7 @@ gaussian | identity, log, inverse
binomial | logit, probit, cloglog (complementary log-log)
poisson | log, identity, sqrt
gamma | inverse, identity, log
+tweedie | power link function
There are three ways to specify the `family` argument.
@@ -679,7 +680,11 @@ There are three ways to specify the `family` argument.
* Family function, e.g. `family = binomial`.
-* Result returned by a family function, e.g. `family = poisson(link = log)`
+* Result returned by a family function, e.g. `family = poisson(link = log)`.
+
+* Note that there are two ways to specify the tweedie family:
+ a) Set `family = "tweedie"` and specify the `var.power` and `link.power`
+ b) When package `statmod` is loaded, the tweedie family is specified using the family definition therein, i.e., `tweedie()`.
For more information regarding the families and their link functions, see the Wikipedia page [Generalized Linear Model](https://en.wikipedia.org/wiki/Generalized_linear_model).
@@ -695,6 +700,18 @@ gaussianFitted <- predict(gaussianGLM, carsDF)
head(select(gaussianFitted, "model", "prediction", "mpg", "wt", "hp"))
```
+The following is the same fit using the tweedie family:
+```{r}
+tweedieGLM1 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie", var.power = 0.0)
+summary(tweedieGLM1)
+```
+We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
+```{r}
+tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
+ var.power = 1.2, link.power = 0.0)
+summary(tweedieGLM2)
+```
+
#### Isotonic Regression
`spark.isoreg` fits an [Isotonic Regression](https://en.wikipedia.org/wiki/Isotonic_regression) model against a `SparkDataFrame`. It solves a weighted univariate regression problem under a complete order constraint. Specifically, given a set of real observed responses $y_1, \ldots, y_n$, corresponding real features $x_1, \ldots, x_n$, and optionally positive weights $w_1, \ldots, w_n$, we want to find a monotone (piecewise linear) function $f$ to minimize