path: root/docs/sparkr.md
Commit message | Author | Age | Files | Lines
* [MINOR][SPARKR] Move 'Data type mapping between R and Spark' to right place in SparkR doc. | Yanbo Liang | 2017-03-27 | 1 | -69/+69
  Section ```Data type mapping between R and Spark``` was put in the wrong place in SparkR doc currently, we should move it to a separate section.
  ## What changes were proposed in this pull request?
  Before this PR:
  ![image](https://cloud.githubusercontent.com/assets/1962026/24340911/bc01a532-126a-11e7-9a08-0d60d13a547c.png)
  After this PR:
  ![image](https://cloud.githubusercontent.com/assets/1962026/24340938/d9d32a9a-126a-11e7-8891-d2f5b46e0c71.png)
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #17440 from yanboliang/sparkr-doc.
* [SPARK-18849][ML][SPARKR][DOC] vignettes final check reorg | Felix Cheung | 2016-12-17 | 1 | -12/+29
  ## What changes were proposed in this pull request?
  Reorganizing content (copy/paste)
  ## How was this patch tested?
  https://felixcheung.github.io/sparkr-vignettes.html
  Previous: https://felixcheung.github.io/sparkr-vignettes_old.html
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #16301 from felixcheung/rvignettespass2.
* [SPARK-18325][SPARKR][ML] SparkR ML wrappers example code and user guide | Yanbo Liang | 2016-12-08 | 1 | -26/+20
  ## What changes were proposed in this pull request?
  * Add all R examples for ML wrappers which were added during 2.1 release cycle.
  * Split the whole ```ml.R``` example file into individual example for each algorithm, which will be convenient for users to rerun them.
  * Add corresponding examples to ML user guide.
  * Update ML section of SparkR user guide.
  Note: MLlib Scala/Java/Python examples will be consistent, however, SparkR examples may different from them, since R users may use the algorithms in a different way, for example, using R ```formula``` to specify ```featuresCol``` and ```labelCol```.
  ## How was this patch tested?
  Run all examples manually.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #16148 from yanboliang/spark-18325.
* [SPARK-18643][SPARKR] SparkR hangs at session start when installed as a package without Spark | Felix Cheung | 2016-12-04 | 1 | -1/+3
  ## What changes were proposed in this pull request?
  If SparkR is running as a package and it has previously downloaded Spark Jar it should be able to run as before without having to set SPARK_HOME. Basically with this bug the auto install Spark will only work in the first session.
  This seems to be a regression on the earlier behavior.
  Fix is to always try to install or check for the cached Spark if running in an interactive session.
  As discussed before, we should probably only install Spark iff running in an interactive session (R shell, RStudio etc)
  ## How was this patch tested?
  Manually
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #16077 from felixcheung/rsessioninteractive.
* [SPARK-18073][DOCS][WIP] Migrate wiki to spark.apache.org web site | Sean Owen | 2016-11-23 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  Updates links to the wiki to links to the new location of content on spark.apache.org.
  ## How was this patch tested?
  Doc builds
  Author: Sean Owen <sowen@cloudera.com>
  Closes #15967 from srowen/SPARK-18073.1.
* [SQL][DOC] updating doc for JSON source to link to jsonlines.org | Felix Cheung | 2016-10-26 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  API and programming guide doc changes for Scala, Python and R.
  ## How was this patch tested?
  manual test
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #15629 from felixcheung/jsondoc.
* [SPARK-18013][SPARKR] add crossJoin API | Felix Cheung | 2016-10-21 | 1 | -0/+4
  ## What changes were proposed in this pull request?
  Add crossJoin and do not default to cross join if joinExpr is left out
  ## How was this patch tested?
  unit test
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #15559 from felixcheung/rcrossjoin.
* [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio | Jeff Zhang | 2016-09-23 | 1 | -0/+15
  ## What changes were proposed in this pull request?
  Spark will add sparkr.zip to archive only when it is yarn mode (SparkSubmit.scala).
  ```
  if (args.isR && clusterManager == YARN) {
    val sparkRPackagePath = RUtils.localSparkRPackagePath
    if (sparkRPackagePath.isEmpty) {
      printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
    }
    val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
    if (!sparkRPackageFile.exists()) {
      printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
    }
    val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString

    // Distribute the SparkR package.
    // Assigns a symbol link name "sparkr" to the shipped package.
    args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")

    // Distribute the R package archive containing all the built R packages.
    if (!RUtils.rPackages.isEmpty) {
      val rPackageFile =
        RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
      if (!rPackageFile.exists()) {
        printErrorAndExit("Failed to zip all the built R packages.")
      }

      val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
      // Assigns a symbol link name "rpkg" to the shipped package.
      args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
    }
  }
  ```
  So it is necessary to pass spark.master from R process to JVM. Otherwise sparkr.zip won't be distributed to executor. Besides that I also pass spark.yarn.keytab/spark.yarn.principal to spark side, because JVM process need them to access secured cluster.
  ## How was this patch tested?
  Verify it manually in R Studio using the following code.
  ```
  Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
  # Put the SparkR library bundled under SPARK_HOME on the library path
  .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
  library(SparkR)
  sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
  df <- as.DataFrame(mtcars)
  head(df)
  ```
  …
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #14784 from zjffdu/SPARK-17210.
* [SPARK-17445][DOCS] Reference an ASF page as the main place to find third-party packages | Sean Owen | 2016-09-14 | 1 | -1/+2
  ## What changes were proposed in this pull request?
  Point references to spark-packages.org to https://cwiki.apache.org/confluence/display/SPARK/Third+Party+Projects
  This will be accompanied by a parallel change to the spark-website repo, and additional changes to this wiki.
  ## How was this patch tested?
  Jenkins tests.
  Author: Sean Owen <sowen@cloudera.com>
  Closes #15075 from srowen/SPARK-17445.
* [SPARKR][DOCS] fix broken url in doc | Felix Cheung | 2016-07-25 | 1 | -54/+53
  ## What changes were proposed in this pull request?
  Fix broken url, also, sparkR.session.stop doc page should have it in the header, instead of saying "sparkR.stop"
  ![image](https://cloud.githubusercontent.com/assets/8969467/17080129/26d41308-50d9-11e6-8967-79d6c920313f.png)
  Data type section is in the middle of a list of gapply/gapplyCollect subsections:
  ![image](https://cloud.githubusercontent.com/assets/8969467/17080122/f992d00a-50d8-11e6-8f2c-fd5786213920.png)
  ## How was this patch tested?
  manual test
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #14329 from felixcheung/rdoclinkfix.
* [SPARKR][DOCS] minor code sample update in R programming guide | Felix Cheung | 2016-07-18 | 1 | -2/+2
  ## What changes were proposed in this pull request?
  Fix code style from ad hoc review of RC4 doc
  ## How was this patch tested?
  manual
  shivaram
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #14250 from felixcheung/rdocs2rc4.
* [SPARK-16112][SPARKR] Programming guide for gapply/gapplyCollect | Narine Kokhlikyan | 2016-07-16 | 1 | -4/+134
  ## What changes were proposed in this pull request?
  Updates programming guide for spark.gapply/spark.gapplyCollect.
  Similar to other examples I used `faithful` dataset to demonstrate gapply's functionality.
  Please, let me know if you prefer another example.
  ## How was this patch tested?
  Existing test cases in R
  Author: Narine Kokhlikyan <narine@slice.com>
  Closes #14090 from NarineK/gapplyProgGuide.
* [SPARKR][DOCS][MINOR] R programming guide to include csv data source example | Felix Cheung | 2016-07-13 | 1 | -9/+18
  ## What changes were proposed in this pull request?
  Minor documentation update for code example, code style, and missed reference to "sparkR.init"
  ## How was this patch tested?
  manual
  shivaram
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #14178 from felixcheung/rcsvprogrammingguide.
* [SPARKR][DOC] SparkR ML user guides update for 2.0 | Yanbo Liang | 2016-07-11 | 1 | -18/+25
  ## What changes were proposed in this pull request?
  * Update SparkR ML section to make them consistent with SparkR API docs.
  * Since #13972 adds labelling support for the ```include_example``` Jekyll plugin, so that we can split the single ```ml.R``` example file into multiple line blocks with different labels, and include them in different algorithms/models in the generated HTML page.
  ## How was this patch tested?
  Only docs update, manually check the generated docs.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #14011 from yanboliang/r-user-guide-update.
* [SPARK-16088][SPARKR] update setJobGroup, cancelJobGroup, clearJobGroup | Felix Cheung | 2016-06-23 | 1 | -0/+2
  ## What changes were proposed in this pull request?
  Updated setJobGroup, cancelJobGroup, clearJobGroup to not require sc/SparkContext as parameter.
  Also updated roxygen2 doc and R programming guide on deprecations.
  ## How was this patch tested?
  unit tests
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13838 from felixcheung/rjobgroup.
* [SPARK-15672][R][DOC] R programming guide update | Kai Jiang | 2016-06-22 | 1 | -0/+77
  ## What changes were proposed in this pull request?
  Guide for
  - UDFs with dapply, dapplyCollect
  - spark.lapply for running parallel R functions
  ## How was this patch tested?
  build locally
  <img width="654" alt="screen shot 2016-06-14 at 03 12 56" src="https://cloud.githubusercontent.com/assets/3419881/16039344/12a3b6a0-31de-11e6-8d77-fe23308075c0.png">
  Author: Kai Jiang <jiangkai@gmail.com>
  Closes #13660 from vectorijk/spark-15672-R-guide-update.
* [SPARK-15863][SQL][DOC][SPARKR] sql programming guide updates to include sparkSession in R | Felix Cheung | 2016-06-21 | 1 | -1/+1
  ## What changes were proposed in this pull request?
  Update doc as per discussion in PR #13592
  ## How was this patch tested?
  manual
  shivaram liancheng
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13799 from felixcheung/rsqlprogrammingguide.
* [SPARK-15159][SPARKR] SparkSession roxygen2 doc, programming guide, example updates | Felix Cheung | 2016-06-20 | 1 | -51/+48
  ## What changes were proposed in this pull request?
  roxygen2 doc, programming guide, example updates
  ## How was this patch tested?
  manual checks
  shivaram
  Author: Felix Cheung <felixcheung_m@hotmail.com>
  Closes #13751 from felixcheung/rsparksessiondoc.
* [SPARK-15129][R][DOC] R API changes in ML | GayathriMurali | 2016-06-17 | 1 | -58/+19
  ## What changes were proposed in this pull request?
  Make user guide changes to SparkR documentation for all changes that happened in 2.0 to Machine Learning APIs
  Author: GayathriMurali <gayathri.m@intel.com>
  Closes #13285 from GayathriMurali/SPARK-15129.
* [SPARK-10903] followup - update API doc for SqlContext | felixcheung | 2016-05-26 | 1 | -0/+1
  ## What changes were proposed in this pull request?
  Follow up on the earlier PR - in here we are fixing up roxygen2 doc examples.
  Also add to the programming guide migration section.
  ## How was this patch tested?
  SparkR tests
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #13340 from felixcheung/sqlcontextdoc.
* [SPARK-12071][DOC] Document the behaviour of NA in R | Krishna Kalyan | 2016-05-24 | 1 | -0/+1
  ## What changes were proposed in this pull request?
  Under Upgrading From SparkR 1.5.x to 1.6.x section added the information, SparkSQL converts `NA` in R to `null`.
  ## How was this patch tested?
  Document update, no tests.
  Author: Krishna Kalyan <krishnakalyan3@gmail.com>
  Closes #13268 from krishnakalyan3/spark-12071-1.
* [MINOR] [SPARKR] Update data-manipulation.R to use native csv reader | Yanbo Liang | 2016-05-09 | 1 | -2/+2
  ## What changes were proposed in this pull request?
  * Since Spark has supported native csv reader, it does not necessary to use the third party ```spark-csv``` in ```examples/src/main/r/data-manipulation.R```. Meanwhile, remove all ```spark-csv``` usage in SparkR.
  * Running R applications through ```sparkR``` is not supported as of Spark 2.0, so we change to use ```./bin/spark-submit``` to run the example.
  ## How was this patch tested?
  Offline test.
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #13005 from yanboliang/r-df-examples.
* [SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date | Dongjoon Hyun | 2016-04-24 | 1 | -6/+5
  ## What changes were proposed in this pull request?
  This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
  - Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
  - Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
  - Fix datatypes in `sparkr.md`.
  - Update a data result in `sparkr.md`.
  - Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
  - Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
  - Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
  - Other minor syntax fixes and a typo.
  ## How was this patch tested?
  Manual.
  Author: Dongjoon Hyun <dongjoon@apache.org>
  Closes #12649 from dongjoon-hyun/SPARK-14883.
* [SPARK-12148][SPARKR] fix doc after renaming DataFrame to SparkDataFrame | felixcheung | 2016-04-23 | 1 | -2/+3
  ## What changes were proposed in this pull request?
  Fixed inadvertent roxygen2 doc changes, added class name change to programming guide
  Follow up of #12621
  ## How was this patch tested?
  manually checked
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #12647 from felixcheung/rdataframe.
* [SPARK-12232][SPARKR] New R API for read.table to avoid name conflict | felixcheung | 2016-01-19 | 1 | -7/+4
  shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table`
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #10406 from felixcheung/readtable.
* [SPARKR][DOC] minor doc update for version in migration guide | felixcheung | 2016-01-05 | 1 | -3/+3
  checked that the change is in Spark 1.6.0.
  shivaram
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #10574 from felixcheung/rwritemodedoc.
* [SPARK-12318][SPARKR] Save mode in SparkR should be error by default | Jeff Zhang | 2015-12-16 | 1 | -1/+8
  shivaram Please help review.
  Author: Jeff Zhang <zjffdu@apache.org>
  Closes #10290 from zjffdu/SPARK-12318.
* [SPARK-12116][SPARKR][DOCS] document how to workaround function name conflicts with dplyr | felixcheung | 2015-12-03 | 1 | -1/+2
  shivaram
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #10119 from felixcheung/rdocdplyrmasked.
* [SPARK-11339][SPARKR] Document the list of functions in R base package that are masked by functions with same name in SparkR | felixcheung | 2015-11-18 | 1 | -1/+36
  Added tests for function that are reported as masked, to make sure the base:: or stats:: function can be called.
  For those we can't call, added them to SparkR programming guide.
  It would seem to me `table, sample, subset, filter, cov` not working are not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like as they are defined in base or stats they are missing the S3 generic, eg.
  ```
  > methods("transform")
  [1] transform,ANY-method       transform.data.frame
  [3] transform,DataFrame-method transform.default
  see '?methods' for accessing help and source code
  > methods("subset")
  [1] subset.data.frame       subset,DataFrame-method subset.default
  [4] subset.matrix
  see '?methods' for accessing help and source code
  Warning message:
  In .S3methods(generic.function, class, parent.frame()) :
    function 'subset' appears not to be S3 generic; found functions that look like S3 methods
  ```
  Any idea?
  More information on masking:
  http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
  http://www.sfu.ca/~sweldon/howTo/guide4.pdf
  This is what the output doc looks like (minus css):
  ![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9785 from felixcheung/rmasked.
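The masking behaviour this commit documents can be illustrated without Spark. The sketch below is only an assumption-laden stand-in: it defines a hypothetical global `filter` function to play the role of a package function that masks `stats::filter`, then shows that the namespace-qualified call still reaches the original. It does not load SparkR or reproduce its actual generics.

```r
# Hypothetical stand-in for a package function that masks stats::filter.
# Attaching a package with a same-named export has a similar effect on lookup.
filter <- function(x, ...) {
  stop("masked: this is not stats::filter")
}

x <- c(1, 2, 3, 4, 5)

# The namespace-qualified call bypasses the mask and reaches the original,
# here computing a centred 3-point moving average (NA at both ends).
smoothed <- stats::filter(x, rep(1 / 3, 3))

print(as.numeric(smoothed))
```

Qualifying the call with `stats::` (or `base::`) is a choice the caller makes at each call site, so it keeps working regardless of the order in which packages are attached.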
* [SPARK-11684][R][ML][DOC] Update SparkR glm API doc, user guide and example codes | Yanbo Liang | 2015-11-18 | 1 | -8/+42
  This PR includes:
  * Update SparkR:::glm, SparkR:::summary API docs.
  * Update SparkR machine learning user guide and example codes to show:
    * supporting feature interaction in R formula.
    * summary for gaussian GLM model.
    * coefficients for binomial GLM model.
  mengxr
  Author: Yanbo Liang <ybliang8@gmail.com>
  Closes #9727 from yanboliang/spark-11684.
* [SPARK-11407][SPARKR] Add doc for running from RStudio | felixcheung | 2015-11-03 | 1 | -3/+43
  ![image](https://cloud.githubusercontent.com/assets/8969467/10871746/612ba44a-80a4-11e5-99a0-40b9931dee52.png)
  (This is without css, but you get the idea)
  shivaram
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9401 from felixcheung/rstudioprogrammingguide.
* [SPARK-11340][SPARKR] Support setting driver properties when starting Spark from R programmatically or from RStudio | felixcheung | 2015-10-30 | 1 | -8/+20
  Mapping spark.driver.memory from sparkEnvir to spark-submit commandline arguments.
  shivaram suggested that we possibly add other spark.driver.* properties - do we want to add all of those? I thought those could be set in SparkConf?
  sun-rui
  Author: felixcheung <felixcheung_m@hotmail.com>
  Closes #9290 from felixcheung/rdrivermem.
* [SPARK-9713] [ML] Document SparkR MLlib glm() integration in Spark 1.5 | Eric Liang | 2015-08-11 | 1 | -1/+36
  This documents the use of R model formulae in the SparkR guide. Also fixes some bugs in the R api doc.
  mengxr
  Author: Eric Liang <ekl@databricks.com>
  Closes #8085 from ericl/docs.
* [SPARK-8900] [SPARKR] Fix sparkPackages in init documentation | Shivaram Venkataraman | 2015-07-08 | 1 | -1/+1
  cc pwendell
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #7293 from shivaram/sparkr-packages-doc and squashes the following commits:
  c91471d [Shivaram Venkataraman] Fix sparkPackages in init documentation
* [SPARK-8894] [SPARKR] [DOC] Example code errors in SparkR documentation. | Sun Rui | 2015-07-08 | 1 | -1/+1
  Author: Sun Rui <rui.sun@intel.com>
  Closes #7287 from sun-rui/SPARK-8894 and squashes the following commits:
  da63898 [Sun Rui] [SPARK-8894][SPARKR][DOC] Example code errors in SparkR documentation.
* [SPARK-8506] Add pakages to R context created through init. | Holden Karau | 2015-06-24 | 1 | -4/+13
  Author: Holden Karau <holden@pigscanfly.ca>
  Closes #6928 from holdenk/SPARK-8506-sparkr-does-not-provide-an-easy-way-to-depend-on-spark-packages-when-performing-init-from-inside-of-r and squashes the following commits:
  b60dd63 [Holden Karau] Add an example with the spark-csv package
  fa8bc92 [Holden Karau] typo: sparm -> spark
  865a90c [Holden Karau] strip spaces for comparision
  c7a4471 [Holden Karau] Add some documentation
  c1a9233 [Holden Karau] refactor for testing
  c818556 [Holden Karau] Add pakages to R
* [SPARK-6806] [SPARKR] [DOCS] Add a new SparkR programming guide | Shivaram Venkataraman | 2015-05-29 | 1 | -0/+223
  This PR adds a new SparkR programming guide at the top-level. This will be useful for R users as our APIs don't directly match the Scala/Python APIs and as we need to explain SparkR without using RDDs as examples etc.
  cc rxin davies pwendell
  cc cafreeman -- Would be great if you could also take a look at this !
  Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
  Closes #6490 from shivaram/sparkr-guide and squashes the following commits:
  d5ff360 [Shivaram Venkataraman] Add a section on HiveContext, HQL queries
  408dce5 [Shivaram Venkataraman] Fix link
  dbb86e3 [Shivaram Venkataraman] Fix minor typo
  9aff5e0 [Shivaram Venkataraman] Address comments, use dplyr-like syntax in example
  d09703c [Shivaram Venkataraman] Fix default argument in read.df
  ea816a1 [Shivaram Venkataraman] Add a new SparkR programming guide
  Also update write.df, read.df to handle defaults better