path: root/R
author     zero323 <zero323@users.noreply.github.com>  2017-04-18 19:59:18 -0700
committer  Felix Cheung <felixcheung@apache.org>       2017-04-18 19:59:18 -0700
commit     702d85af2df9433254af6fa029683aa19c52a276 (patch)
tree       fea6ff94f7b3e2e78e71b22cc45a53be63783d15 /R
parent     e468a96c404eb54261ab219734f67dc2f5b06dc0 (diff)
[SPARK-20208][R][DOCS] Document R fpGrowth support
## What changes were proposed in this pull request?

Document fpGrowth in:

- vignettes
- programming guide
- code example

## How was this patch tested?

Manual tests.

Author: zero323 <zero323@users.noreply.github.com>

Closes #17557 from zero323/SPARK-20208.
Diffstat (limited to 'R')
-rw-r--r--  R/pkg/vignettes/sparkr-vignettes.Rmd  37
1 file changed, 36 insertions(+), 1 deletion(-)
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index a6ff650c33..f81dbab10b 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -505,6 +505,10 @@ SparkR supports the following machine learning models and algorithms.
* Alternating Least Squares (ALS)
+#### Frequent Pattern Mining
+
+* FP-growth
+
#### Statistics
* Kolmogorov-Smirnov Test
@@ -707,7 +711,7 @@ summary(tweedieGLM1)
```
We can try other distributions in the tweedie family, for example, a compound Poisson distribution with a log link:
```{r}
-tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
+tweedieGLM2 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
var.power = 1.2, link.power = 0.0)
summary(tweedieGLM2)
```
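For instance, `var.power = 2` with a log link corresponds to a Gamma-type model within the same tweedie family. A minimal sketch reusing `carsDF` from above (the variable name `tweedieGLM3` and the parameter values are illustrative):

```{r}
# Gamma-type member of the tweedie family: variance power 2, log link (link.power = 0)
tweedieGLM3 <- spark.glm(carsDF, mpg ~ wt + hp, family = "tweedie",
                         var.power = 2.0, link.power = 0.0)
summary(tweedieGLM3)
```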
@@ -906,6 +910,37 @@ predicted <- predict(model, df)
head(predicted)
```
+#### FP-growth
+
+`spark.fpGrowth` executes the FP-growth algorithm to mine frequent itemsets on a `SparkDataFrame`. The column referenced by `itemsCol` should be an array of values.
+
+```{r}
+df <- selectExpr(createDataFrame(data.frame(rawItems = c(
+ "T,R,U", "T,S", "V,R", "R,U,T,V", "R,S", "V,S,U", "U,R", "S,T", "V,R", "V,U,S",
+ "T,V,U", "R,V", "T,S", "T,S", "S,T", "S,U", "T,R", "V,R", "S,V", "T,S,U"
+))), "split(rawItems, ',') AS items")
+
+fpm <- spark.fpGrowth(df, minSupport = 0.2, minConfidence = 0.5)
+```
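`spark.fpGrowth` also accepts an `itemsCol` argument naming the array column (assumed here to default to `"items"`) and an optional `numPartitions` to control the parallelism of the run. A minimal sketch with these spelled out explicitly (the name `fpmExplicit` is illustrative):

```{r}
# Same fit with the array column and a partition count given explicitly;
# itemsCol is assumed to default to "items", numPartitions is optional
fpmExplicit <- spark.fpGrowth(df, minSupport = 0.2, minConfidence = 0.5,
                              itemsCol = "items", numPartitions = 2)
```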
+
+The `spark.freqItemsets` method can be used to retrieve a `SparkDataFrame` with the frequent itemsets.
+
+```{r}
+head(spark.freqItemsets(fpm))
+```
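Since the result is an ordinary `SparkDataFrame`, the usual DataFrame verbs apply. A small sketch that pulls the most frequent itemsets to the top, assuming the output has a `freq` column:

```{r}
# Sort the frequent itemsets by descending support count (assumed column name: freq)
itemsets <- spark.freqItemsets(fpm)
head(arrange(itemsets, desc(itemsets$freq)))
```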
+
+`spark.associationRules` returns a `SparkDataFrame` with the association rules.
+
+```{r}
+head(spark.associationRules(fpm))
+```
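The rules likewise come back as a `SparkDataFrame`, so they can be filtered like any other. A sketch that keeps only the higher-confidence rules, assuming a `confidence` column in the output (the 0.6 threshold is arbitrary):

```{r}
# Keep only rules whose confidence exceeds an arbitrary threshold
rules <- spark.associationRules(fpm)
head(filter(rules, rules$confidence > 0.6))
```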
+
+We can make predictions based on the `antecedent`.
+
+```{r}
+head(predict(fpm, df))
+```
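Like other SparkR ML models, a fitted FP-growth model can typically be saved with `write.ml` and loaded back with `read.ml`. A minimal sketch (the temporary path is illustrative):

```{r}
# Persist the fitted model and reload it; the path below is a throwaway temp location
modelPath <- tempfile(pattern = "fpm-model", fileext = ".tmp")
write.ml(fpm, modelPath)
savedFPM <- read.ml(modelPath)
head(spark.freqItemsets(savedFPM))
unlink(modelPath, recursive = TRUE)
```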
+
#### Kolmogorov-Smirnov Test
`spark.kstest` runs a two-sided, one-sample [Kolmogorov-Smirnov (KS) test](https://en.wikipedia.org/wiki/Kolmogorov%E2%80%93Smirnov_test).
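A minimal usage sketch, assuming we test a numeric column named `test` against a standard normal distribution (the data and column name are illustrative):

```{r}
# Draw a small sample, test it against N(0, 1), and summarize the KS statistic and p-value
testDF <- createDataFrame(data.frame(test = rnorm(25)))
ksResult <- spark.kstest(testDF, "test", "norm", c(0, 1))
summary(ksResult)
```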