author | wm624@hotmail.com <wm624@hotmail.com> | 2017-02-28 22:31:35 -0800 |
---|---|---|
committer | Felix Cheung <felixcheung@apache.org> | 2017-02-28 22:31:35 -0800 |
commit | 89cd3845b6edb165236a6498dcade033975ee276 (patch) | |
tree | 1aae82ffb40b20e0cd0befa89d816d2ad3671368 /R/pkg/vignettes/sparkr-vignettes.Rmd | |
parent | 7315880568fd07d4dfb9f76d538f220e9d320c6f (diff) | |
[SPARK-19460][SPARKR] Update dataset used in R documentation, examples to reduce warning noise and confusions
## What changes were proposed in this pull request?
Replace the `iris` dataset with `Titanic` or another dataset in the examples and documentation.
## How was this patch tested?
Manual testing and the existing tests.
Author: wm624@hotmail.com <wm624@hotmail.com>
Closes #17032 from wangmiao1981/example.
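For context on the dataset swap described in the commit message, here is a minimal base R sketch comparing the two built-in datasets. It assumes only the `datasets` package that ships with R; no Spark session is involved, and the variable names (`iris_cols`, `dotted_cols`, `titanic_df`, `titanic_cols`) are illustrative. The likely source of the warning noise is that `iris` column names contain dots, which is an inference from the diff (it refers to `Sepal_Length` and drops `warning=FALSE` from the chunks) rather than something the commit message states.

```r
# Base R only: no Spark session is needed to inspect the two datasets.

# iris column names contain dots, e.g. "Sepal.Length".
iris_cols <- names(iris)
dotted_cols <- iris_cols[grepl(".", iris_cols, fixed = TRUE)]

# Titanic ships as a 4-dimensional contingency table; as.data.frame()
# flattens it to one row per factor combination plus a Freq count column.
titanic_df <- as.data.frame(Titanic)
titanic_cols <- names(titanic_df)

print(dotted_cols)                  # the dotted iris column names
print(titanic_cols)                 # Titanic column names (no dots)
print(dim(titanic_df))              # rows and columns of the flattened table
print(levels(titanic_df$Survived))  # response used by the binomial example
print(levels(titanic_df$Class))     # response used by the multinomial example
```

Note that the flattened `Titanic` frame holds aggregated counts in `Freq`, not one row per passenger, so the examples in the diff fit models on the factor combinations themselves.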
Diffstat (limited to 'R/pkg/vignettes/sparkr-vignettes.Rmd')
-rw-r--r-- | R/pkg/vignettes/sparkr-vignettes.Rmd | 47 |
1 files changed, 25 insertions, 22 deletions
diff --git a/R/pkg/vignettes/sparkr-vignettes.Rmd b/R/pkg/vignettes/sparkr-vignettes.Rmd
index bc8bc3c26c..43c255cff3 100644
--- a/R/pkg/vignettes/sparkr-vignettes.Rmd
+++ b/R/pkg/vignettes/sparkr-vignettes.Rmd
@@ -565,11 +565,10 @@ We use a simple example to demonstrate `spark.logit` usage. In general, there ar
 and 3). Obtain the coefficient matrix of the fitted model using `summary` and use the model for prediction with `predict`.

 Binomial logistic regression
-```{r, warning=FALSE}
-df <- createDataFrame(iris)
-# Create a DataFrame containing two classes
-training <- df[df$Species %in% c("versicolor", "virginica"), ]
-model <- spark.logit(training, Species ~ ., regParam = 0.00042)
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+model <- spark.logit(training, Survived ~ ., regParam = 0.04741301)
 summary(model)
 ```

@@ -579,10 +578,11 @@ fitted <- predict(model, training)
 ```

 Multinomial logistic regression against three classes
-```{r, warning=FALSE}
-df <- createDataFrame(iris)
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
 # Note in this case, Spark infers it is multinomial logistic regression, so family = "multinomial" is optional.
-model <- spark.logit(df, Species ~ ., regParam = 0.056)
+model <- spark.logit(training, Class ~ ., regParam = 0.07815179)
 summary(model)
 ```

@@ -609,11 +609,12 @@ MLPC employs backpropagation for learning the model. We use the logistic loss fu

 `spark.mlp` requires at least two columns in `data`: one named `"label"` and the other one `"features"`. The `"features"` column should be in libSVM-format.

-We use iris data set to show how to use `spark.mlp` in classification.
-```{r, warning=FALSE}
-df <- createDataFrame(iris)
+We use Titanic data set to show how to use `spark.mlp` in classification.
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
 # fit a Multilayer Perceptron Classification Model
-model <- spark.mlp(df, Species ~ ., blockSize = 128, layers = c(4, 3), solver = "l-bfgs", maxIter = 100, tol = 0.5, stepSize = 1, seed = 1, initialWeights = c(0, 0, 0, 0, 0, 5, 5, 5, 5, 5, 9, 9, 9, 9, 9))
+model <- spark.mlp(training, Survived ~ Age + Sex, blockSize = 128, layers = c(2, 3), solver = "l-bfgs", maxIter = 100, tol = 0.5, stepSize = 1, seed = 1, initialWeights = c( 0, 0, 0, 5, 5, 5, 9, 9, 9))
 ```

 To avoid lengthy display, we only present partial results of the model summary. You can check the full result from your sparkR shell.
@@ -630,7 +631,7 @@ options(ops)
 ```
 ```{r}
 # make predictions use the fitted model
-predictions <- predict(model, df)
+predictions <- predict(model, training)
 head(select(predictions, predictions$prediction))
 ```

@@ -769,12 +770,13 @@ predictions <- predict(rfModel, df)

 `spark.bisectingKmeans` is a kind of [hierarchical clustering](https://en.wikipedia.org/wiki/Hierarchical_clustering) using a divisive (or "top-down") approach: all observations start in one cluster, and splits are performed recursively as one moves down the hierarchy.

-```{r, warning=FALSE}
-df <- createDataFrame(iris)
-model <- spark.bisectingKmeans(df, Sepal_Length ~ Sepal_Width, k = 4)
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+model <- spark.bisectingKmeans(training, Class ~ Survived, k = 4)
 summary(model)
-fitted <- predict(model, df)
-head(select(fitted, "Sepal_Length", "prediction"))
+fitted <- predict(model, training)
+head(select(fitted, "Class", "prediction"))
 ```

 #### Gaussian Mixture Model
@@ -912,9 +914,10 @@ testSummary

 ### Model Persistence
 The following example shows how to save/load an ML model by SparkR.
-```{r, warning=FALSE}
-irisDF <- createDataFrame(iris)
-gaussianGLM <- spark.glm(irisDF, Sepal_Length ~ Sepal_Width + Species, family = "gaussian")
+```{r}
+t <- as.data.frame(Titanic)
+training <- createDataFrame(t)
+gaussianGLM <- spark.glm(training, Freq ~ Sex + Age, family = "gaussian")

 # Save and then load a fitted MLlib model
 modelPath <- tempfile(pattern = "ml", fileext = ".tmp")
@@ -925,7 +928,7 @@ gaussianGLM2 <- read.ml(modelPath)
 summary(gaussianGLM2)

 # Check model prediction
-gaussianPredictions <- predict(gaussianGLM2, irisDF)
+gaussianPredictions <- predict(gaussianGLM2, training)
 head(gaussianPredictions)

 unlink(modelPath)