spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
*	[SPARK-13734][SPARKR] Added histogram function	Oscar D. Lara Yejas	2016-04-26	4	-0/+170
	## What changes were proposed in this pull request?
	Added method histogram() to compute the histogram of a Column.

	Usage:
	```
	## Create a DataFrame from the Iris dataset
	irisDF <- createDataFrame(sqlContext, iris)

	## Render a histogram for the Sepal_Length column
	histogram(irisDF, "Sepal_Length", nbins=12)
	```

	![histogram](https://cloud.githubusercontent.com/assets/13985649/13588486/e1e751c6-e484-11e5-85db-2fc2115c4bb2.png)

	Note: Usage will change once SPARK-9325 is figured out so that histogram() only takes a Column as a parameter, as opposed to a DataFrame and a name.

	## How was this patch tested?
	All unit tests pass. I added specific unit cases for different scenarios.

	Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
	Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
	Closes #11569 from olarayej/SPARK-13734.
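The binning idea behind a histogram with a fixed `nbins` can be sketched in plain Python. This is only an illustration of equal-width binning, not the SparkR implementation; the function name `equal_width_histogram` and its return shape are assumptions made for this example.

```python
def equal_width_histogram(values, nbins=10):
    """Count values into nbins equal-width bins spanning [min, max]."""
    if nbins < 1:
        raise ValueError("nbins must be at least 1")
    if not values:
        return []
    lo, hi = min(values), max(values)
    width = (hi - lo) / nbins
    counts = [0] * nbins
    for v in values:
        # The maximum value is clamped into the last bin; a zero width
        # (all values equal) sends everything to the first bin.
        idx = 0 if width == 0 else min(int((v - lo) / width), nbins - 1)
        counts[idx] += 1
    # Each entry is (left edge, right edge, count).
    return [(lo + i * width, lo + (i + 1) * width, counts[i]) for i in range(nbins)]


for left, right, count in equal_width_histogram([1, 2, 3, 4], nbins=2):
    print(f"[{left}, {right}): {count}")
```

A distributed version would typically compute the minimum and maximum first and then aggregate counts per bin index, but that part is left out here.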
*	[SPARK-14313][ML][SPARKR] AFTSurvivalRegression model persistence in SparkR	Yanbo Liang	2016-04-26	2	-0/+40
	## What changes were proposed in this pull request?
	`AFTSurvivalRegressionModel` supports `save/load` in SparkR.

	## How was this patch tested?
	Unit tests.

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #12685 from yanboliang/spark-14313.
*	[SPARK-14312][ML][SPARKR] NaiveBayes model persistence in SparkR	Yanbo Liang	2016-04-25	4	-2/+68
	## What changes were proposed in this pull request?
	SparkR `NaiveBayesModel` supports `save/load` by the following API:
	```
	df <- createDataFrame(sqlContext, infert)
	model <- naiveBayes(education ~ ., df, laplace = 0)
	ml.save(model, path)
	model2 <- ml.load(path)
	```

	## How was this patch tested?
	Add unit tests.

	cc mengxr

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #12573 from yanboliang/spark-14312.
*	[SPARK-14883][DOCS] Fix wrong R examples and make them up-to-date	Dongjoon Hyun	2016-04-24	1	-1/+1
	## What changes were proposed in this pull request?
	This issue aims to fix some errors in R examples and make them up-to-date in docs and example modules.
	- Remove the wrong usage of `map`. We need to use `lapply` in `sparkR` if needed. However, `lapply` is private so far. The corrected example will be added later.
	- Fix the wrong example in Section `Generic Load/Save Functions` of `docs/sql-programming-guide.md` for consistency
	- Fix datatypes in `sparkr.md`.
	- Update a data result in `sparkr.md`.
	- Replace deprecated functions to remove warnings: jsonFile -> read.json, parquetFile -> read.parquet
	- Use up-to-date R-like functions: loadDF -> read.df, saveDF -> write.df, saveAsParquetFile -> write.parquet
	- Replace `SparkR DataFrame` with `SparkDataFrame` in `dataframe.R` and `data-manipulation.R`.
	- Other minor syntax fixes and a typo.

	## How was this patch tested?
	Manual.

	Author: Dongjoon Hyun <dongjoon@apache.org>
	Closes #12649 from dongjoon-hyun/SPARK-14883.
*	[SPARK-12148][SPARKR] fix doc after renaming DataFrame to SparkDataFrame	felixcheung	2016-04-23	1	-8/+8
	## What changes were proposed in this pull request?
	Fixed inadvertent roxygen2 doc changes, added class name change to programming guide.
	Follow up of #12621

	## How was this patch tested?
	manually checked

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #12647 from felixcheung/rdataframe.
*	[SPARK-14869][SQL] Don't mask exceptions in ResolveRelations	Reynold Xin	2016-04-23	1	-1/+1
	## What changes were proposed in this pull request?
	In order to support running SQL directly on files, we added some code in ResolveRelations to catch the exception thrown by catalog.lookupRelation and ignore it. This unfortunately masks all the exceptions. This patch changes the logic to simply test the table's existence.

	## How was this patch tested?
	I manually hacked some bugs into Spark and made sure the exceptions were being propagated up.

	Author: Reynold Xin <rxin@databricks.com>
	Closes #12634 from rxin/SPARK-14869.
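The difference between catching every exception and testing for existence first can be sketched as follows. This is an illustration of the pattern only; the `Catalog` class and both resolver functions are hypothetical stand-ins, not Spark code.

```python
class Catalog:
    """Toy catalog: maps table names to plans; None marks broken metadata."""

    def __init__(self, tables):
        self._tables = dict(tables)

    def table_exists(self, name):
        return name in self._tables

    def lookup_relation(self, name):
        plan = self._tables[name]  # KeyError if the table is missing
        if plan is None:
            raise RuntimeError(f"corrupt metadata for table {name!r}")
        return plan


def resolve_masking(catalog, name):
    # Anti-pattern: a missing table and a real bug look identical.
    try:
        return catalog.lookup_relation(name)
    except Exception:
        return None


def resolve_checked(catalog, name):
    # Only the "table does not exist" case is treated as unresolved;
    # any other failure propagates to the caller.
    if not catalog.table_exists(name):
        return None
    return catalog.lookup_relation(name)


catalog = Catalog({"t": "plan-for-t", "broken": None})
print(resolve_masking(catalog, "broken"))   # the real error is hidden
try:
    resolve_checked(catalog, "broken")
except RuntimeError as err:
    print("propagated:", err)
```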
*	[SPARK-14594][SPARKR] check execution return status code	felixcheung	2016-04-23	1	-0/+3
	## What changes were proposed in this pull request?
	When the JVM backend fails without going through proper error handling (e.g. the process crashed), the R error message could be ambiguous.
	```
	Error in if (returnStatus != 0) { : argument is of length zero
	```
	This change attempts to make it clearer (however, one would still need to investigate why the JVM fails).

	## How was this patch tested?
	manually

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #12622 from felixcheung/rreturnstatus.
*	[SPARK-12148][SPARKR] SparkR: rename DataFrame to SparkDataFrame	felixcheung	2016-04-23	14	-468/+473
	## What changes were proposed in this pull request?
	Changed class name defined in R from "DataFrame" to "SparkDataFrame". A popular package, S4Vectors, already defines "DataFrame"; this change avoids the conflict.
	Aside from the class name and API/roxygen2 references, SparkR APIs like `createDataFrame` and `as.DataFrame` are not changed (S4Vectors does not define an "as.DataFrame").
	Since in R one rarely references the type/class directly, this change should have minimal to no impact on SparkR users in terms of backward compatibility.

	## How was this patch tested?
	SparkR tests, manually loading S4Vectors then SparkR package

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #12621 from felixcheung/rdataframe.
*	[SPARK-13178] RRDD faces with concurrency issue in case of rdd.zip(rdd).count().	Sun Rui	2016-04-22	1	-2/+0
	## What changes were proposed in this pull request?
	The concurrency issue reported in SPARK-13178 was fixed by the PR https://github.com/apache/spark/pull/10947 for SPARK-12792. This PR just removes a workaround that is no longer needed.

	## How was this patch tested?
	SparkR unit tests.

	Author: Sun Rui <rui.sun@intel.com>
	Closes #12606 from sun-rui/SPARK-13178.
*	[SPARK-14780] [R] Add `setLogLevel` to SparkR	Dongjoon Hyun	2016-04-21	3	-0/+25
	## What changes were proposed in this pull request?
	This PR aims to add a `setLogLevel` function to the SparkR shell.

	Spark Shell
	```scala
	scala> sc.setLogLevel("ERROR")
	```

	PySpark
	```python
	>>> sc.setLogLevel("ERROR")
	```

	SparkR (this PR)
	```r
	> setLogLevel(sc, "ERROR")
	NULL
	```

	## How was this patch tested?
	Pass the Jenkins tests including a new R testcase.

	Author: Dongjoon Hyun <dongjoon@apache.org>
	Closes #12547 from dongjoon-hyun/SPARK-14780.
*	[SPARK-14639] [PYTHON] [R] Add `bround` function in Python/R.	Dongjoon Hyun	2016-04-19	4	-1/+31
	## What changes were proposed in this pull request?
	This issue aims to expose the Scala `bround` function in the Python/R API. The `bround` function is implemented in SPARK-14614 by extending the current `round` function. We used the following semantics from Hive.
	```java
	public static double bround(double input, int scale) {
	  if (Double.isNaN(input) || Double.isInfinite(input)) {
	    return input;
	  }
	  return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
	}
	```
	After this PR, `pyspark` and `sparkR` also support the `bround` function.

	PySpark
	```python
	>>> from pyspark.sql.functions import bround
	>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
	[Row(r=2.0)]
	```

	SparkR
	```r
	> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
	> head(collect(select(df, bround(df$x, 0))))
	  bround(x, 0)
	1            2
	2            4
	```

	## How was this patch tested?
	Pass the Jenkins tests (including new testcases).

	Author: Dongjoon Hyun <dongjoon@apache.org>
	Closes #12509 from dongjoon-hyun/SPARK-14639.
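The HALF_EVEN ("banker's") rounding semantics quoted from Hive above can be mirrored in plain Python with the standard `decimal` module. This is a sketch for illustration, not Spark or Hive code; going through `repr` is an assumption meant to approximate what `BigDecimal.valueOf(double)` does with the shortest decimal representation.

```python
import math
from decimal import Decimal, ROUND_HALF_EVEN


def bround(value, scale=0):
    """Round half to even at the given number of decimal places."""
    if math.isnan(value) or math.isinf(value):
        return value
    # Decimal(1).scaleb(-scale) is 1, 0.1, 0.01, ... for scale 0, 1, 2, ...
    quantum = Decimal(1).scaleb(-scale)
    exact = Decimal(repr(float(value)))
    return float(exact.quantize(quantum, rounding=ROUND_HALF_EVEN))


print(bround(2.5), bround(3.5))  # ties go to the even neighbour
print(bround(2.675, 2))
```

Ties are resolved toward the even digit, so 2.5 rounds down and 3.5 rounds up, matching the PySpark and SparkR outputs shown in the commit message.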
*	[SPARK-13905][SPARKR] Change signature of as.data.frame() to be consistent ↵	Sun Rui	2016-04-19	4	-8/+10
	with the R base package.

	## What changes were proposed in this pull request?
	Change the signature of as.data.frame() to be consistent with that in the R base package, to match R users' conventions.

	## How was this patch tested?
	dev/lint-r
	SparkR unit tests

	Author: Sun Rui <rui.sun@intel.com>
	Closes #11811 from sun-rui/SPARK-13905.
*	[SPARK-12224][SPARKR] R support for JDBC source	felixcheung	2016-04-19	6	-1/+139
	Add R API for `read.jdbc`, `write.jdbc`.

	Tested this quite a bit manually with different combinations of parameters. It's not clear if we could have automated tests in R for this - Scala `JDBCSuite` depends on Java H2 in-memory database.

	Refactored some code into util so they could be tested.

	Core's R SerDe code needs to be updated to allow access to java.util.Properties as a `jobj` handle, which is required by DataFrameReader/Writer's `jdbc` method. It would be possible, though it would take more code, to add a `sql/r/SQLUtils` helper function.

	Tested:
	```
	# with postgresql
	../bin/sparkR --driver-class-path /usr/share/java/postgresql-9.4.1207.jre7.jar

	# read.jdbc
	df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
	df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", user = "user", password = 12345)

	# partitionColumn and numPartitions test
	df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, numPartitions = 4, user = "user", password = 12345)
	a <- SparkR:::toRDD(df)
	SparkR:::getNumPartitions(a)
	[1] 4
	SparkR:::collectPartition(a, 2L)

	# defaultParallelism test
	df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", partitionColumn = "did", lowerBound = 0, upperBound = 200, user = "user", password = 12345)
	SparkR:::getNumPartitions(a)
	[1] 2

	# predicates test
	df <- read.jdbc(sqlContext, "jdbc:postgresql://localhost/db", "films2", predicates = list("did<=105"), user = "user", password = 12345)
	count(df) == 1

	# write.jdbc, default save mode "error"
	irisDf <- as.DataFrame(sqlContext, iris)
	write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "films2", user = "user", password = "12345")
	"error, already exists"

	write.jdbc(irisDf, "jdbc:postgresql://localhost/db", "iris", user = "user", password = "12345")
	```

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10480 from felixcheung/rreadjdbc.
*	[SPARK-13925][ML][SPARKR] Expose R-like summary statistics in SparkR::glm ↵	Yanbo Liang	2016-04-15	3	-4/+97
	for more family and link functions

	## What changes were proposed in this pull request?
	Expose R-like summary statistics in SparkR::glm for more family and link functions.
	Note: Not all values in R [summary.glm](http://stat.ethz.ch/R-manual/R-patched/library/stats/html/summary.glm.html) are exposed; we only provide the most commonly used statistics in this PR. More statistics can be added in follow-up work.

	## How was this patch tested?
	Unit tests.

	SparkR Output:
	```
	Deviance Residuals:
	(Note: These are approximate quantiles with relative error <= 0.01)
	     Min        1Q    Median        3Q       Max
	-0.95096  -0.16585  -0.00232   0.17410   0.72918

	Coefficients:
	                    Estimate  Std. Error  t value  Pr(>|t|)
	(Intercept)         1.6765    0.23536     7.1231   4.4561e-11
	Sepal_Length        0.34988   0.046301    7.5566   4.1873e-12
	Species_versicolor  -0.98339  0.072075    -13.644  0
	Species_virginica   -1.0075   0.093306    -10.798  0

	(Dispersion parameter for gaussian family taken to be 0.08351462)

	    Null deviance: 28.307  on 149  degrees of freedom
	Residual deviance: 12.193  on 146  degrees of freedom
	AIC: 59.22

	Number of Fisher Scoring iterations: 1
	```

	R output:
	```
	Deviance Residuals:
	     Min        1Q    Median        3Q       Max
	-0.95096  -0.16522   0.00171   0.18416   0.72918

	Coefficients:
	                   Estimate Std. Error t value Pr(>|t|)
	(Intercept)         1.67650    0.23536   7.123 4.46e-11 ***
	Sepal.Length        0.34988    0.04630   7.557 4.19e-12 ***
	Speciesversicolor  -0.98339    0.07207 -13.644  < 2e-16 ***
	Speciesvirginica   -1.00751    0.09331 -10.798  < 2e-16 ***
	---
	Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

	(Dispersion parameter for gaussian family taken to be 0.08351462)

	    Null deviance: 28.307  on 149  degrees of freedom
	Residual deviance: 12.193  on 146  degrees of freedom
	AIC: 59.217

	Number of Fisher Scoring iterations: 2
	```

	cc mengxr

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #12393 from yanboliang/spark-13925.
*	[SPARK-12566][SPARK-14324][ML] GLM model family, link function support in ↵	Yanbo Liang	2016-04-12	2	-144/+90
	SparkR:::glm

	* SparkR glm supports families and link functions which match R's signature for family.
	* SparkR glm API refactor. The comparative standard of the new API is R glm, so I only expose the arguments that R glm supports: `formula, family, data, epsilon and maxit`.
	* This PR focuses on glm() and predict(); summary statistics will be done in a separate PR after this gets in.
	* This PR depends on #12287, which makes GLMs support link prediction on the Scala side. After that is merged, I will add more tests for predict() to this PR.

	Unit tests.

	cc mengxr jkbradley hhbyyh

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #12294 from yanboliang/spark-12566.
*	[SPARK-14362][SPARK-14406][SQL][FOLLOW-UP] DDL Native Support: Drop View and ↵	gatorsmile	2016-04-10	1	-1/+1
	Drop Table

	#### What changes were proposed in this pull request?
	This PR is to address the comment: https://github.com/apache/spark/pull/12146#discussion-diff-59092238. It removes the function `isViewSupported` from `SessionCatalog`. After the removal, we can still capture the user errors if users try to drop a table using `DROP VIEW`.

	#### How was this patch tested?
	Modified the existing test cases

	Author: gatorsmile <gatorsmile@gmail.com>
	Closes #12284 from gatorsmile/followupDropTable.
*	[SPARK-14353] Dataset Time Window `window` API for R	Burak Yavuz	2016-04-05	5	-1/+105
	## What changes were proposed in this pull request?
	The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
	This PR adds the R API for this function.

	With this PR, SQL, Java, and Scala will share the same APIs; users can use:
	- `window(timeColumn, windowDuration)`
	- `window(timeColumn, windowDuration, slideDuration)`
	- `window(timeColumn, windowDuration, slideDuration, startTime)`

	In Python and R, users can access all APIs above, but in addition they can do
	- In R: `window(timeColumn, windowDuration, startTime=...)`

	that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.

	## How was this patch tested?
	Unit tests + manual tests

	Author: Burak Yavuz <brkyvz@gmail.com>
	Closes #12141 from brkyvz/R-windows.
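How a timestamp maps to tumbling or sliding windows can be sketched with integer seconds. This is a simplified model and not Spark's implementation; the function name `windows_for`, the use of plain integers, and the half-open `[start, end)` intervals are assumptions made for this example.

```python
def windows_for(ts, window, slide=None, start=0):
    """Return every [begin, end) window that contains timestamp ts.

    Omitting slide gives tumbling windows (slide == window), which mirrors
    passing only windowDuration and startTime in the R call above.
    """
    if slide is None:
        slide = window
    if window <= 0 or slide <= 0:
        raise ValueError("window and slide must be positive")
    # Latest window start (aligned to start + k * slide) that is <= ts.
    last_start = ts - ((ts - start) % slide)
    result = []
    begin = last_start
    while begin + window > ts:
        result.append((begin, begin + window))
        begin -= slide
    return sorted(result)


print(windows_for(7, 10))            # tumbling: one window
print(windows_for(7, 10, slide=5))   # sliding: overlapping windows
print(windows_for(7, 10, start=2))   # tumbling with an offset
```

With a slide shorter than the window, a single timestamp belongs to several overlapping windows; with tumbling windows it belongs to exactly one.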
*	[SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans	Yanbo Liang	2016-03-31	1	-29/+62
	## What changes were proposed in this pull request?
	Define and use `KMeansWrapper` for `SparkR::kmeans`. This is only a code refactor of the original `KMeans` wrapper.

	## How was this patch tested?
	Existing tests.

	cc mengxr

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #12039 from yanboliang/spark-14059.
*	[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.	Sun Rui	2016-03-28	1	-0/+8
	## What changes were proposed in this pull request?
	Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
	Now RRDD relies on RRunner for RDD computation and RRDD could be removed if we want to remove RDD API in SparkR later.

	## How was this patch tested?
	dev/lint-r
	SparkR unit tests

	Author: Sun Rui <rui.sun@intel.com>
	Closes #12024 from sun-rui/SPARK-12792_new.
*	Revert "[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF."	Davies Liu	2016-03-28	1	-8/+0
	This reverts commit 40984f67065eeaea731940008e6677c2323dda3e.
*	[SPARK-12792] [SPARKR] Refactor RRDD to support R UDF.	Sun Rui	2016-03-28	1	-0/+8
	Refactor RRDD by separating the common logic interacting with the R worker to a new class RRunner, which can be used to evaluate R UDFs.
	Now RRDD relies on RRunner for RDD computation and RRDD could be removed if we want to remove RDD API in SparkR later.

	Author: Sun Rui <rui.sun@intel.com>
	Closes #10947 from sun-rui/SPARK-12792.
*	[SPARK-14014][SQL] Integrate session catalog (attempt #2)	Andrew Or	2016-03-24	1	-1/+2
	## What changes were proposed in this pull request?
	This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

	## How was this patch tested?
	See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

	Author: Andrew Or <andrew@databricks.com>
	Closes #11938 from andrewor14/session-catalog-again.
*	[SPARK-13010][ML][SPARKR] Implement a simple wrapper of ↵	Yanbo Liang	2016-03-24	5	-2/+132
	AFTSurvivalRegression in SparkR

	## What changes were proposed in this pull request?
	This PR continues the work in #11447; we implemented the wrapper of `AFTSurvivalRegression` named `survreg` in SparkR.

	## How was this patch tested?
	Test against output from R package survival's survreg.

	cc mengxr felixcheung

	Close #11447

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #11932 from yanboliang/spark-13010-new.
*	Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"	Andrew Or	2016-03-23	1	-2/+1
	This reverts commit 5dfc01976bb0d72489620b4f32cc12d620bb6260.
*	[SPARK-14014][SQL] Replace existing catalog with SessionCatalog	Andrew Or	2016-03-23	1	-1/+2
	## What changes were proposed in this pull request?
	`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.

	As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
	- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
	- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
	- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
	- SPARK-?????: Merge SQL/HiveContext

	## How was this patch tested?
	This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

	Author: Andrew Or <andrew@databricks.com>
	Author: Yin Huai <yhuai@databricks.com>
	Closes #11836 from andrewor14/use-session-catalog.
*	[SPARK-13449] Naive Bayes wrapper in SparkR	Xusen Yin	2016-03-22	5	-7/+153
	## What changes were proposed in this pull request?
	This PR continues the work in #11486 from yinxusen with some code refactoring. In R package e1071, `naiveBayes` supports both categorical (Bernoulli) and continuous features (Gaussian), while in MLlib we support Bernoulli and multinomial. This PR implements the common subset: Bernoulli.

	I moved the implementation out from SparkRWrappers to NaiveBayesWrapper to make it easier to read. Argument names, default values, and summary now match e1071's naiveBayes.

	I removed the preprocessing step that omits NA values because we don't know which columns to process.

	## How was this patch tested?
	Test against output from R package e1071's naiveBayes.

	cc: yanboliang yinxusen

	Closes #11486

	Author: Xusen Yin <yinxusen@gmail.com>
	Author: Xiangrui Meng <meng@databricks.com>
	Closes #11890 from mengxr/SPARK-13449.
*	[MINOR][DOCS] Use `spark-submit` instead of `sparkR` to submit R script.	Dongjoon Hyun	2016-03-19	1	-5/+5
	## What changes were proposed in this pull request?
	Since `sparkR` is not used for submitting R scripts as of Spark 2.0, a user faces the following error message when following the instructions in `R/README.md`. This PR updates `R/README.md`.
	```bash
	$ ./bin/sparkR examples/src/main/r/dataframe.R
	Running R applications through 'sparkR' is not supported as of Spark 2.0.
	Use ./bin/spark-submit <R file>
	```

	## How was this patch tested?
	Manual.

	Author: Dongjoon Hyun <dongjoon@apache.org>
	Closes #11842 from dongjoon-hyun/update_r_readme.
*	[SPARK-13812][SPARKR] Fix SparkR lint-r test errors.	Sun Rui	2016-03-13	21	-174/+178
	## What changes were proposed in this pull request?
	This PR fixes all newly captured SparkR lint-r errors after the lintr package was updated from GitHub.

	## How was this patch tested?
	dev/lint-r
	SparkR unit tests

	Author: Sun Rui <rui.sun@intel.com>
	Closes #11652 from sun-rui/SPARK-13812.
*	[SPARK-13389][SPARKR] SparkR support first/last with ignore NAs	Yanbo Liang	2016-03-10	3	-10/+45
	## What changes were proposed in this pull request?
	SparkR supports first/last with the option to ignore NAs.

	cc sun-rui felixcheung shivaram

	## How was this patch tested?
	unit tests

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #11267 from yanboliang/spark-13389.
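The effect of ignoring missing values in first/last aggregates can be sketched in plain Python, with `None` standing in for NA. The function names and the `ignore_nulls` flag are hypothetical and chosen for this example only; they are not the SparkR signatures.

```python
def first(values, ignore_nulls=False):
    """Return the first element, optionally skipping None entries."""
    for v in values:
        if v is not None or not ignore_nulls:
            return v
    return None  # empty input, or nothing but None with ignore_nulls=True


def last(values, ignore_nulls=False):
    """Return the last element, optionally skipping None entries."""
    return first(reversed(values), ignore_nulls)


column = [None, 1, 2, None]
print(first(column), first(column, ignore_nulls=True))
print(last(column), last(column, ignore_nulls=True))
```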
*	[SPARK-13327][SPARKR] Added parameter validations for colnames<-	Oscar D. Lara Yejas	2016-03-10	2	-1/+32
	Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
	Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
	Closes #11220 from olarayej/SPARK-13312-3.
*	[SPARK-13504] [SPARKR] Add approxQuantile for SparkR	Yanbo Liang	2016-02-25	4	-0/+55
	## What changes were proposed in this pull request?
	Add `approxQuantile` for SparkR.

	## How was this patch tested?
	unit tests

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #11383 from yanboliang/spark-13504 and squashes the following commits:
	4f17adb [Yanbo Liang] Add approxQuantile for SparkR
*	[SPARK-13472] [SPARKR] Fix unstable Kmeans test in R	Liang-Chi Hsieh	2016-02-24	1	-1/+1
	JIRA: https://issues.apache.org/jira/browse/SPARK-13472

	## What changes were proposed in this pull request?
	One Kmeans test in R is unstable and sometimes fails. We should fix it.

	## How was this patch tested?
	Unit test is modified in this PR.

	Author: Liang-Chi Hsieh <viirya@gmail.com>
	Closes #11345 from viirya/fix-kmeans-r-test and squashes the following commits:
	f959f61 [Liang-Chi Hsieh] Sort resulted clusters.
*	[SPARK-13011] K-means wrapper in SparkR	Xusen Yin	2016-02-23	4	-5/+109
	https://issues.apache.org/jira/browse/SPARK-13011

	Author: Xusen Yin <yinxusen@gmail.com>
	Closes #11124 from yinxusen/SPARK-13011.
*	[MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns ↵	Dongjoon Hyun	2016-02-22	2	-4/+4
	in other comments

	## What changes were proposed in this pull request?
	This PR tries to fix all typos in all markdown files under the `docs` module, and fixes similar typos in other comments, too.

	## How was this patch tested?
	manual tests.

	Author: Dongjoon Hyun <dongjoon@apache.org>
	Closes #11300 from dongjoon-hyun/minor_fix_typos.
*	[SPARK-12799] Simplify various string output for expressions	Cheng Lian	2016-02-21	1	-2/+2
	This PR introduces several major changes:

	1. Replacing `Expression.prettyString` with `Expression.sql`

	   The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.

	2. Using SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)

	   Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names can be weird. Here are several examples:

	   Expression         | `prettyString` | `sql`      | Note
	   ------------------ | -------------- | ---------- | ---------------
	   `a && b`           | `a && b`       | `a AND b`  |
	   `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct

	3. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

	   `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.

	Author: Cheng Lian <lian@databricks.com>
	Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
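The two string forms in the examples table of this commit can be modeled with a toy expression tree. This is a sketch for illustration only; the Python classes and the method names `pretty_string` and `sql` are stand-ins and not Catalyst's actual classes.

```python
class Attribute:
    """A plain column reference such as `a`."""

    def __init__(self, name):
        self.name = name

    def pretty_string(self):
        return self.name

    def sql(self):
        return self.name


class And:
    """Logical conjunction of two expressions."""

    def __init__(self, left, right):
        self.left, self.right = left, right

    def pretty_string(self):
        return f"{self.left.pretty_string()} && {self.right.pretty_string()}"

    def sql(self):
        return f"{self.left.sql()} AND {self.right.sql()}"


class GetField:
    """Field access on a struct-typed expression."""

    def __init__(self, child, field):
        self.child, self.field = child, field

    def pretty_string(self):
        return f"{self.child.pretty_string()}[{self.field}]"

    def sql(self):
        return f"{self.child.sql()}.{self.field}"


for expr in (And(Attribute("a"), Attribute("b")), GetField(Attribute("a"), "f")):
    print(expr.pretty_string(), "->", expr.sql())
```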
*	[SPARK-13339][DOCS] Clarify commutative / associative operator requirements ↵	Sean Owen	2016-02-19	1	-5/+5
	for reduce, fold

	Clarify that reduce functions need to be commutative, and fold functions do not.

	See https://github.com/apache/spark/pull/11091

	Author: Sean Owen <sowen@cloudera.com>
	Closes #11217 from srowen/SPARK-13339.
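Why a reduce function needs to be commutative can be sketched with a toy model of partitioned reduction, where per-partition results may be combined in whatever order they arrive. This is a simplified illustration and not Spark's scheduler; `reduce_partitions` and the `completion_order` parameter are hypothetical names.

```python
from functools import reduce
import operator


def reduce_partitions(partitions, op, completion_order):
    """Reduce each partition, then merge the partial results in the
    order given by completion_order (a permutation of partition indexes)."""
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, [partials[i] for i in completion_order])


numbers = [[1, 2], [3, 4]]
words = [["a", "b"], ["c", "d"]]

# Addition is commutative: the merge order does not matter.
print(reduce_partitions(numbers, operator.add, [0, 1]),
      reduce_partitions(numbers, operator.add, [1, 0]))

# String concatenation is associative but not commutative:
# the result depends on which partition finishes first.
print(reduce_partitions(words, operator.add, [0, 1]),
      reduce_partitions(words, operator.add, [1, 0]))
```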
*	[SPARK-13264][DOC] Removed multi-byte characters in spark-env.sh.template	Sasaki Toru	2016-02-11	1	-1/+1
	spark-env.sh.template contains multi-byte characters; this PR removes them.

	Author: Sasaki Toru <sasakitoa@nttdata.co.jp>
	Closes #11149 from sasakitoa/remove_multibyte_in_sparkenv.
*	[SPARK-12903][SPARKR] Add covar_samp and covar_pop for SparkR	Yanbo Liang	2016-01-26	5	-2/+73
	Add `covar_samp` and `covar_pop` for SparkR.
	Should we also provide a `cov` alias for `covar_samp`? There is a `cov` implementation in stats.R which already masks `stats::cov`, but changing it may introduce a breaking API change.

	cc sun-rui felixcheung shivaram

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #10829 from yanboliang/spark-12903.
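The difference between the two aggregates is only the divisor: population covariance divides by n, sample covariance by n - 1. A plain-Python sketch of the definitions follows; it reuses the names `covar_pop` and `covar_samp` for readability but is not the SparkR or Spark SQL implementation.

```python
def _centered_products(xs, ys):
    """Sum of (x - mean_x) * (y - mean_y) over paired observations."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))


def covar_pop(xs, ys):
    """Population covariance: divide by n."""
    return _centered_products(xs, ys) / len(xs)


def covar_samp(xs, ys):
    """Sample covariance: divide by n - 1 (needs at least two pairs)."""
    if len(xs) < 2:
        raise ValueError("sample covariance needs at least two observations")
    return _centered_products(xs, ys) / (len(xs) - 1)


xs, ys = [1, 2, 3], [2, 4, 6]
print(covar_pop(xs, ys), covar_samp(xs, ys))
```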
*	[SPARK-12629][SPARKR] Fixes for DataFrame saveAsTable method	Narine Kokhlikyan	2016-01-22	3	-9/+41
	I've tried to solve some of the issues mentioned in: https://issues.apache.org/jira/browse/SPARK-12629
	Please let me know what you think.
	Thanks!

	Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
	Closes #10580 from NarineK/sparkrSavaAsRable.
*	[SPARK-12204][SPARKR] Implement drop method for DataFrame in SparkR.	Sun Rui	2016-01-20	5	-27/+88
	Author: Sun Rui <rui.sun@intel.com>
	Closes #10201 from sun-rui/SPARK-12204.
*	[SPARK-12910] Fixes : R version for installing sparkR	Shubhanshu Mishra	2016-01-20	2	-2/+19
	Testing code:
	```
	$ ./install-dev.sh
	USING R_HOME = /usr/bin
	ERROR: this R is version 2.15.1, package 'SparkR' requires R >= 3.0
	```

	Using the new argument:
	```
	$ ./install-dev.sh /content/username/SOFTWARE/R-3.2.3
	USING R_HOME = /content/username/SOFTWARE/R-3.2.3/bin
	* installing source package ‘SparkR’ ...
	R
	inst
	preparing package for lazy loading
	Creating a new generic function for ‘colnames’ in package ‘SparkR’
	Creating a new generic function for ‘colnames<-’ in package ‘SparkR’
	Creating a new generic function for ‘cov’ in package ‘SparkR’
	Creating a new generic function for ‘na.omit’ in package ‘SparkR’
	Creating a new generic function for ‘filter’ in package ‘SparkR’
	Creating a new generic function for ‘intersect’ in package ‘SparkR’
	Creating a new generic function for ‘sample’ in package ‘SparkR’
	Creating a new generic function for ‘transform’ in package ‘SparkR’
	Creating a new generic function for ‘subset’ in package ‘SparkR’
	Creating a new generic function for ‘summary’ in package ‘SparkR’
	Creating a new generic function for ‘lag’ in package ‘SparkR’
	Creating a new generic function for ‘rank’ in package ‘SparkR’
	Creating a new generic function for ‘sd’ in package ‘SparkR’
	Creating a new generic function for ‘var’ in package ‘SparkR’
	Creating a new generic function for ‘predict’ in package ‘SparkR’
	Creating a new generic function for ‘rbind’ in package ‘SparkR’
	Creating a generic function for ‘lapply’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘Filter’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘alias’ from package ‘stats’ in package ‘SparkR’
	Creating a generic function for ‘substr’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘%in%’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘mean’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘unique’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘nrow’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘ncol’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘head’ from package ‘utils’ in package ‘SparkR’
	Creating a generic function for ‘factorial’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘atan2’ from package ‘base’ in package ‘SparkR’
	Creating a generic function for ‘ifelse’ from package ‘base’ in package ‘SparkR’
	help
	No man pages found in package ‘SparkR’
	* installing help indices
	building package indices
	** testing if installed package can be loaded
	* DONE (SparkR)
	```

	Author: Shubhanshu Mishra <smishra8@illinois.edu>
	Closes #10836 from napsternxg/master.
*	[SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal	Herman van Hovell	2016-01-20	1	-1/+1
	The current parser turns a decimal literal, for example `12.1`, into a Double. The problem with this approach is that we convert an exact literal into a non-exact `Double`. The PR changes this behavior: a decimal literal is now converted into an exact `BigDecimal`.

	The behavior for scientific decimals, for example `12.1e01`, is unchanged. This will be converted into a Double.

	This PR replaces the `BigDecimal` literal by a `Double` literal, because the `BigDecimal` is the default now. You can use the double literal by appending a 'D' to the value, for instance: `3.141527D`

	cc davies rxin

	Author: Herman van Hovell <hvanhovell@questtec.nl>
	Closes #10796 from hvanhovell/SPARK-12848.
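The literal rules described in this commit (plain decimals become exact, scientific notation and a trailing 'D' give a double) can be sketched in Python, with `decimal.Decimal` standing in for `BigDecimal` and `float` for `Double`. The function `parse_numeric_literal` is hypothetical and only models the rules as stated; it is not the Spark SQL parser.

```python
from decimal import Decimal


def parse_numeric_literal(text):
    """Map a numeric literal to an exact Decimal or an inexact float."""
    if text[-1:] in ("D", "d"):
        return float(text[:-1])      # explicit double literal, e.g. 3.141527D
    if "e" in text.lower():
        return float(text)           # scientific notation stays a double
    return Decimal(text)             # plain decimal literal is kept exact


print(repr(parse_numeric_literal("12.1")))
print(repr(parse_numeric_literal("12.1e01")))
print(repr(parse_numeric_literal("3.141527D")))

# Why exactness matters: binary doubles cannot represent 0.1 or 0.2 exactly.
print(0.1 + 0.2 == 0.3)
print(Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
```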
*	[SPARK-12232][SPARKR] New R API for read.table to avoid name conflict	felixcheung	2016-01-19	4	-20/+17
	shivaram sorry it took longer to fix some conflicts, this is the change to add an alias for `table`

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10406 from felixcheung/readtable.
*	[SPARK-12337][SPARKR] Implement dropDuplicates() method of DataFrame in SparkR.	Sun Rui	2016-01-19	4	-1/+75
	Author: Sun Rui <rui.sun@intel.com>
	Closes #10309 from sun-rui/SPARK-12337.
*	[SPARK-12168][SPARKR] Add automated tests for conflicted function in R	felixcheung	2016-01-19	2	-1/+24
	Currently this is reported when loading the SparkR package in R (probably would add is.nan)
	```
	Loading required package: methods

	Attaching package: ‘SparkR’

	The following objects are masked from ‘package:stats’:

	    cov, filter, lag, na.omit, predict, sd, var

	The following objects are masked from ‘package:base’:

	    colnames, colnames<-, intersect, rank, rbind, sample, subset,
	    summary, table, transform
	```

	Adding this test adds an automated way to track changes to masked methods.
	Also, the second part of this test checks for those functions that would not be accessible without a namespace/package prefix.

	Incidentally, this might point to how we would fix those inaccessible functions in base or stats.

	Looking for feedback on adding this test.

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10171 from felixcheung/rmaskedtest.
*	[SPARK-12862][SPARKR] Jenkins does not run R tests	felixcheung	2016-01-17	2	-2/+2
	Slight correction: I'm leaving sparkR as-is (i.e. R file not supported) and fixed only run-tests.sh as shivaram described.

	I also assume we are going to cover all doc changes in https://issues.apache.org/jira/browse/SPARK-12846 instead of here.

	rxin shivaram zjffdu

	Author: felixcheung <felixcheung_m@hotmail.com>
	Closes #10792 from felixcheung/sparkRcmd.
*	[SPARK-11031][SPARKR] Method str() on a DataFrame	Oscar D. Lara Yejas	2016-01-15	5	-22/+140
	Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
	Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
	Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
	Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
	Closes #9613 from olarayej/SPARK-11031.
*	[SPARK-12756][SQL] use hash expression in Exchange	Wenchen Fan	2016-01-13	1	-1/+1
	This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is the same between shuffle and bucketed data sources. This enables us to shuffle only one side when joining a bucketed table and a normal one.

	This PR also fixes the tests that are broken by the new hash behaviour in shuffle.

	Author: Wenchen Fan <wenchen@databricks.com>
	Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
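Why a shared hash function lets a join line up partition by partition can be sketched with a toy example. Here `zlib.crc32` is only a stand-in for whatever hash the engine uses, and `partition_id`, `bucket_rows`, and `bucketed_join` are hypothetical helpers, not Spark APIs.

```python
import zlib


def partition_id(key, num_partitions):
    """Deterministic partition assignment shared by bucketing and shuffle."""
    return zlib.crc32(str(key).encode("utf-8")) % num_partitions


def bucket_rows(rows, num_partitions):
    """Distribute (key, value) rows into partitions by key hash."""
    buckets = [[] for _ in range(num_partitions)]
    for key, value in rows:
        buckets[partition_id(key, num_partitions)].append((key, value))
    return buckets


def bucketed_join(left, right, num_partitions):
    """Join two row sets partition by partition.

    Because both sides use the same hash, matching keys always land in the
    same partition index, so a side that is already bucketed this way would
    not need to be redistributed again.
    """
    out = []
    for lpart, rpart in zip(bucket_rows(left, num_partitions),
                            bucket_rows(right, num_partitions)):
        index = {}
        for key, value in rpart:
            index.setdefault(key, []).append(value)
        for key, value in lpart:
            for other in index.get(key, []):
                out.append((key, value, other))
    return sorted(out)


left = [(1, "a"), (2, "b"), (3, "c")]
right = [(1, "x"), (3, "z"), (4, "w")]
print(bucketed_join(left, right, num_partitions=4))
```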
*	[SPARK-12645][SPARKR] SparkR support hash function	Yanbo Liang	2016-01-09	4	-1/+26
	Add `hash` function for SparkR `DataFrame`.

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #10597 from yanboliang/spark-12645.
*	[SPARK-12393][SPARKR] Add read.text and write.text for SparkR	Yanbo Liang	2016-01-06	5	-1/+82
	Add `read.text` and `write.text` for SparkR.

	cc sun-rui felixcheung shivaram

	Author: Yanbo Liang <ybliang8@gmail.com>
	Closes #10348 from yanboliang/spark-12393.