aboutsummaryrefslogtreecommitdiff
path: root/R/pkg/inst
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp ↵titicaca2017-02-121-2/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | column ## What changes were proposed in this pull request? Fix a bug in collect method for collecting timestamp column, the bug can be reproduced as shown in the following codes and outputs: ``` library(SparkR) sparkR.session(master = "local") df <- data.frame(col1 = c(0, 1, 2), col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01"))) sdf1 <- createDataFrame(df) print(dtypes(sdf1)) df1 <- collect(sdf1) print(lapply(df1, class)) sdf2 <- filter(sdf1, "col1 > 0") print(dtypes(sdf2)) df2 <- collect(sdf2) print(lapply(df2, class)) ``` As we can see from the printed output, the column type of col2 in df2 is converted to numeric unexpectedly, when NA exists at the top of the column. This is caused by method `do.call(c, list)`, if we convert a list, i.e. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01"))`, the class of the result is numeric instead of POSIXct. Therefore, we need to cast the data type of the vector explicitly. ## How was this patch tested? The patch can be tested manually with the same code above. Author: titicaca <fangzhou.yang@hotmail.com> Closes #16689 from titicaca/sparkr-dev.
* [SPARK-19464][BUILD][HOTFIX] run-tests should use hadoop2.6Dongjoon Hyun2017-02-081-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? After SPARK-19464, **SparkPullRequestBuilder** fails because it still tries to use hadoop2.3. **BEFORE** https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [error] Could not find hadoop2.3 in the list. Valid options are ['hadoop2.6', 'hadoop2.7'] Attempting to post to Github... > Post successful. ``` **AFTER** https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/72595/console ``` ======================================================================== Building Spark ======================================================================== [info] Building Spark (w/Hive 1.2.1) using SBT with these arguments: -Phadoop-2.6 -Pmesos -Pkinesis-asl -Pyarn -Phive-thriftserver -Phive test:package streaming-kafka-0-8-assembly/assembly streaming-flume-assembly/assembly streaming-kinesis-asl-assembly/assembly Using /usr/java/jdk1.8.0_60 as default JAVA_HOME. Note, this will be overridden by -java-home if it is set. ``` ## How was this patch tested? Pass the existing test. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16858 from dongjoon-hyun/hotfix_run-tests.
* [SPARK-16609] Add to_date/to_timestamp with format functionsanabranch2017-02-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This pull request adds two new user facing functions: - `to_date` which accepts an expression and a format and returns a date. - `to_timestamp` which accepts an expression and a format and returns a timestamp. For example, Given a date in format: `2016-21-05`. (YYYY-dd-MM) ### Date Function *Previously* ``` to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")) ``` *Current* ``` to_date(lit("2016-21-05"), "yyyy-dd-MM") ``` ### Timestamp Function *Previously* ``` unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp") ``` *Current* ``` to_timestamp(lit("2016-21-05"), "yyyy-dd-MM") ``` ### Tasks - [X] Add `to_date` to Scala Functions - [x] Add `to_date` to Python Functions - [x] Add `to_date` to SQL Functions - [X] Add `to_timestamp` to Scala Functions - [x] Add `to_timestamp` to Python Functions - [x] Add `to_timestamp` to SQL Functions - [x] Add function to R ## How was this patch tested? - [x] Add Functions to `DateFunctionsSuite` - Test new `ParseToTimestamp` Expression (*not necessary*) - Test new `ParseToDate` Expression (*not necessary*) - [x] Add test for R - [x] Add test for Python in test.py Please review http://spark.apache.org/contributing.html before opening a pull request. Author: anabranch <wac.chambers@gmail.com> Author: Bill Chambers <bill@databricks.com> Author: anabranch <bill@databricks.com> Closes #16138 from anabranch/SPARK-16609.
* [SPARK-19452][SPARKR] Fix bug in the name assignment methodactuaryzhang2017-02-051-0/+8
| | | | | | | | | | | | ## What changes were proposed in this pull request? The names method fails to check for validity of the assignment values. This can be fixed by calling colnames within names. ## How was this patch tested? new tests. Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16794 from actuaryzhang/sparkRNames.
* [SPARK-19319][SPARKR] SparkR Kmeans summary returns error when the cluster ↵wm624@hotmail.com2017-01-311-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | size doesn't equal to k ## What changes were proposed in this pull request When Kmeans using initMode = "random" and some random seed, it is possible the actual cluster size doesn't equal to the configured `k`. In this case, summary(model) returns error due to the number of cols of coefficient matrix doesn't equal to k. Example: > col1 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > col2 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > col3 <- c(1, 2, 3, 4, 0, 1, 2, 3, 4, 0) > cols <- as.data.frame(cbind(col1, col2, col3)) > df <- createDataFrame(cols) > > model2 <- spark.kmeans(data = df, ~ ., k = 5, maxIter = 10, initMode = "random", seed = 22222, tol = 1E-5) > > summary(model2) Error in `colnames<-`(`*tmp*`, value = c("col1", "col2", "col3")) : length of 'dimnames' [2] not equal to array extent In addition: Warning message: In matrix(coefficients, ncol = k) : data length [9] is not a sub-multiple or multiple of the number of rows [2] Fix: Get the actual cluster size in the summary and use it to build the coefficient matrix. ## How was this patch tested? Add unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16666 from wangmiao1981/kmeans.
* [SPARK-19395][SPARKR] Convert coefficients in summary to matrixactuaryzhang2017-01-313-18/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `coefficients` component in model summary should be 'matrix' but the underlying structure is indeed list. This affects several models except for 'AFTSurvivalRegressionModel' which has the correct implementation. The fix is to first `unlist` the coefficients returned from the `callJMethod` before converting to matrix. An example illustrates the issues: ``` data(iris) df <- createDataFrame(iris) model <- spark.glm(df, Sepal_Length ~ Sepal_Width, family = "gaussian") s <- summary(model) > str(s$coefficients) List of 8 $ : num 6.53 $ : num -0.223 $ : num 0.479 $ : num 0.155 $ : num 13.6 $ : num -1.44 $ : num 0 $ : num 0.152 - attr(*, "dim")= int [1:2] 2 4 - attr(*, "dimnames")=List of 2 ..$ : chr [1:2] "(Intercept)" "Sepal_Width" ..$ : chr [1:4] "Estimate" "Std. Error" "t value" "Pr(>|t|)" > s$coefficients[, 2] $`(Intercept)` [1] 0.4788963 $Sepal_Width [1] 0.1550809 ``` This shows that the underlying structure of coefficients is still `list`. felixcheung wangmiao1981 Author: actuaryzhang <actuaryzhang10@gmail.com> Closes #16730 from actuaryzhang/sparkRCoef.
* [SPARK-19324][SPARKR] Spark VJM stdout output is getting dropped in SparkRFelix Cheung2017-01-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This affects mostly running job from the driver in client mode when results are expected to be through stdout (which should be somewhat rare, but possible) Before: ``` > a <- as.DataFrame(cars) > b <- group_by(a, "dist") > c <- count(b) > sparkR.callJMethod(c$countjc, "explain", TRUE) NULL ``` After: ``` > a <- as.DataFrame(cars) > b <- group_by(a, "dist") > c <- count(b) > sparkR.callJMethod(c$countjc, "explain", TRUE) count#11L NULL ``` Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`) but there are other more complex examples with calls of `println` in Scala/JVM side, that are getting dropped. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16670 from felixcheung/rjvmstdout.
* [SPARK-18788][SPARKR] Add API for getNumPartitionsFelix Cheung2017-01-262-12/+12
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? With doc to say this would convert DF into RDD ## How was this patch tested? unit tests, manual tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16668 from felixcheung/rgetnumpartitions.
* [SPARK-18821][SPARKR] Bisecting k-means wrapper in SparkRwm624@hotmail.com2017-01-261-0/+40
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add R wrapper for bisecting Kmeans. As JIRA is down, I will update title to link with corresponding JIRA later. ## How was this patch tested? Add new unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16566 from wangmiao1981/bk.
* [SPARK-18823][SPARKR] add support for assigning to columnFelix Cheung2017-01-241-0/+20
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Support for ``` df[[myname]] <- 1 df[[2]] <- df$eruptions ``` ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16663 from felixcheung/rcolset.
* [SPARK-19291][SPARKR][ML] spark.gaussianMixture supports output log-likelihood.Yanbo Liang2017-01-211-0/+7
| | | | | | | | | | | | ## What changes were proposed in this pull request? ```spark.gaussianMixture``` supports output total log-likelihood for the model like R ```mvnormalmixEM```. ## How was this patch tested? R unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16646 from yanboliang/spark-19291.
* [SPARK-19066][SPARKR] SparkR LDA doesn't set optimizer correctlywm624@hotmail.com2017-01-162-3/+14
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, which should be TRUE based on scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`. ## How was this patch tested? Manual tests by comparing with scala example. Modified the current unit test: fix the incorrect unit test and add necessary tests for `summary` method. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16464 from wangmiao1981/new.
* [SPARK-18335][SPARKR] createDataFrame to support numPartitions parameterFelix Cheung2017-01-132-3/+24
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? To allow specifying number of partitions when the DataFrame is created ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16512 from felixcheung/rnumpart.
* [SPARK-19142][SPARKR] spark.kmeans should take seed, initSteps, and tol as ↵wm624@hotmail.com2017-01-121-0/+20
| | | | | | | | | | | | | | | parameters ## What changes were proposed in this pull request? spark.kmeans doesn't have interface to set initSteps, seed and tol. As Spark Kmeans algorithm doesn't take the same set of parameters as R kmeans, we should maintain a different interface in spark.kmeans. Add missing parameters and corresponding document. Modified existing unit tests to take additional parameters. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16523 from wangmiao1981/kmeans.
* [SPARK-19130][SPARKR] Support setting literal value as column implicitlyFelix Cheung2017-01-111-0/+18
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? ``` df$foo <- 1 ``` instead of ``` df$foo <- lit(1) ``` ## How was this patch tested? unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16510 from felixcheung/rlitcol.
* [SPARK-19133][SPARKR][ML] fix glm for Gamma, clarify glm family supportedFelix Cheung2017-01-101-9/+17
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? R family is a longer list than what Spark supports. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16511 from felixcheung/rdocglmfamily.
* [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple filesYanbo Liang2017-01-087-1170/+1303
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, I'd like to split it into multiple files to make us easy to maintain: * mllib_classification.R * mllib_clustering.R * mllib_recommendation.R * mllib_regression.R * mllib_stat.R * mllib_tree.R * mllib_utils.R Note: Only reorg, no actual code change. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16312 from yanboliang/spark-18862.
* [SPARK-18958][SPARKR] R API toJSON on DataFrameFelix Cheung2016-12-221-6/+7
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? It would make it easier to integrate with other component expecting row-based JSON format. This replaces the non-public toJSON RDD API. ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16368 from felixcheung/rJSON.
* [SPARK-18903][SPARKR] Add API to get SparkUI URLFelix Cheung2016-12-211-1/+4
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? API for SparkUI URL from SparkContext ## How was this patch tested? manual, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16367 from felixcheung/rwebui.
* [SPARK-18897][SPARKR] Fix SparkR SQL Test to drop test tableDongjoon Hyun2016-12-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? SparkR tests, `R/run-tests.sh`, succeeds only once because `test_sparkSQL.R` does not clean up the test table, `people`. As a result, the rows in `people` table are accumulated at every run and the test cases fail. The following is the failure result for the second run. ```r Failed ------------------------------------------------------------------------- 1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) ------------------- collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16). Lengths differ: 2 vs 1 2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) ------------------- collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5). Lengths differ: 2 vs 1 ``` ## How was this patch tested? Manual. Run `run-tests.sh` twice and check if it passes without failures. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #16310 from dongjoon-hyun/SPARK-18897.
* [MINOR][SPARKR] fix kstest example error and add unit testwm624@hotmail.com2016-12-131-0/+6
| | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? While adding vignettes for kstest, I found some errors in the example: 1. There is a typo of kstest; 2. print.summary.KStest doesn't work with the example; Fix the example errors; Add a new unit test for print.summary.KStest; ## How was this patch tested? Manual test; Add new unit test; Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16259 from wangmiao1981/ks.
* [SPARK-18810][SPARKR] SparkR install.spark does not work for RCs, snapshotsFelix Cheung2016-12-121-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Support overriding the download url (include version directory) in an environment variable, `SPARKR_RELEASE_DOWNLOAD_URL` ## How was this patch tested? unit test, manually testing - snapshot build url - download when spark jar not cached - when spark jar is cached - RC build url - download when spark jar not cached - when spark jar is cached - multiple cached spark versions - starting with sparkR shell To use this, ``` SPARKR_RELEASE_DOWNLOAD_URL=http://this_is_the_url_to_spark_release_tgz R ``` then in R, ``` library(SparkR) # or specify lib.loc sparkR.session() ``` Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16248 from felixcheung/rinstallurl.
* [SPARK-18807][SPARKR] Should suppress output print for calls to JVM methods ↵Felix Cheung2016-12-091-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | with void return values ## What changes were proposed in this pull request? Several SparkR API calling into JVM methods that have void return values are getting printed out, especially when running in a REPL or IDE. example: ``` > setLogLevel("WARN") NULL ``` We should fix this to make the result more clear. Also found a small change to return value of dropTempView in 2.1 - adding doc and test for it. ## How was this patch tested? manually - I didn't find a expect_*() method in testthat for this Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16237 from felixcheung/rinvis.
* [SPARK-18349][SPARKR] Update R API documentation on ml model summarywm624@hotmail.com2016-12-081-0/+2
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In this PR, the document of `summary` method is improved in the format: returns summary information of the fitted model, which is a list. The list includes ....... Since `summary` in R is mainly about the model, which is not the same as `summary` object on scala side, if there is one, the scala API doc is not pointed here. In current document, some `return` have `.` and some don't have. `.` is added to missed ones. Since spark.logit `summary` has a big refactoring, this PR doesn't include this one. It will be changed when the `spark.logit` PR is merged. ## How was this patch tested? Manual build. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16150 from wangmiao1981/audit2.
* [SPARK-18326][SPARKR][ML] Review SparkR ML wrappers API for 2.1Yanbo Liang2016-12-071-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Reviewing SparkR ML wrappers API for 2.1 release, mainly two issues: * Remove ```probabilityCol``` from the argument list of ```spark.logit``` and ```spark.randomForest```. Since it was used when making prediction and should be an argument of ```predict```, and we will work on this at [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) in the next release cycle. * Fix ```spark.als``` params to make it consistent with MLlib. ## How was this patch tested? Existing tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16169 from yanboliang/spark-18326.
* [SPARK-18678][ML] Skewed reservoir sampling in SamplingUtilsSean Owen2016-12-071-4/+5
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fix reservoir sampling bias for small k. An off-by-one error meant that the probability of replacement was slightly too high -- k/(l-1) after l element instead of k/l, which matters for small k. ## How was this patch tested? Existing test plus new test case. Author: Sean Owen <sowen@cloudera.com> Closes #16129 from srowen/SPARK-18678.
* [SPARK-18686][SPARKR][ML] Several cleanup and improvements for spark.logit.Yanbo Liang2016-12-071-55/+128
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Several cleanup and improvements for ```spark.logit```: * ```summary``` should return coefficients matrix, and should output labels for each class if the model is multinomial logistic regression model. * ```summary``` should not return ```areaUnderROC, roc, pr, ...```, since most of them are DataFrame which are less important for R users. Meanwhile, these metrics ignore instance weights (setting all to 1.0) which will be changed in later Spark version. In case it will introduce breaking changes, we do not expose them currently. * SparkR test improvement: comparing the training result with native R glmnet. * Remove argument ```aggregationDepth``` from ```spark.logit```, since it's an expert Param(related with Spark architecture and job execution) that would be used rarely by R users. ## How was this patch tested? Unit tests. The ```summary``` output after this change: multinomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > model <- spark.logit(df, Species ~ ., regParam = 0.5) > summary(model) $coefficients versicolor virginica setosa (Intercept) 1.514031 -2.609108 1.095077 Sepal_Length 0.02511006 0.2649821 -0.2900921 Sepal_Width -0.5291215 -0.02016446 0.549286 Petal_Length 0.03647411 0.1544119 -0.190886 Petal_Width 0.000236092 0.4195804 -0.4198165 ``` binomial logistic regression: ``` > df <- suppressWarnings(createDataFrame(iris)) > training <- df[df$Species %in% c("versicolor", "virginica"), ] > model <- spark.logit(training, Species ~ ., regParam = 0.5) > summary(model) $coefficients Estimate (Intercept) -6.053815 Sepal_Length 0.2449379 Sepal_Width 0.1648321 Petal_Length 0.4730718 Petal_Width 1.031947 ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #16117 from yanboliang/spark-18686.
* [SPARK-18291][SPARKR][ML] Revert "[SPARK-18291][SPARKR][ML] SparkR glm ↵Yanbo Liang2016-12-021-15/+5
| | | | | | | | | | | | | | | | predict should output original label when family = binomial." ## What changes were proposed in this pull request? It's better we can fix this issue by providing an option ```type``` for users to change the ```predict``` output schema, then they could output probabilities, log-space predictions, or original labels. In order to not involve breaking API change for 2.1, so revert this change firstly and will add it back after [SPARK-18618](https://issues.apache.org/jira/browse/SPARK-18618) resolved. ## How was this patch tested? Existing unit tests. This reverts commit daa975f4bfa4f904697bf3365a4be9987032e490. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16118 from yanboliang/spark-18291-revert.
* [SPARK-18476][SPARKR][ML] SparkR Logistic Regression should should support ↵wm624@hotmail.com2016-11-301-8/+18
| | | | | | | | | | | | | | | | | | output original label. ## What changes were proposed in this pull request? Similar to SPARK-18401, as a classification algorithm, logistic regression should support output original label instead of supporting index label. In this PR, original label output is supported and test cases are modified and added. Document is also modified. ## How was this patch tested? Unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15910 from wangmiao1981/audit.
* [SPARK-18510] Fix data corruption from inferred partition column dataTypesBurak Yavuz2016-11-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? ### The Issue If I specify my schema when doing ```scala spark.read .schema(someSchemaWherePartitionColumnsAreStrings) ``` but if the partition inference can infer it as IntegerType or I assume LongType or DoubleType (basically fixed size types), then once UnsafeRows are generated, your data will be corrupted. ### Proposed solution The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path. The real issue is that a user that uses the `spark.read` code path can never clearly specify what the partition columns are. If you try to specify the fields in `schema`, we practically ignore what the user provides, and fall back to our inferred data types. What happens in the end is data corruption. My solution tries to fix this by always trying to infer partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user specified schema and use the dataType provided there, or fall back to the smallest common data type. We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it. We will only use the data type they provided if they specified it. While this is confusing, this has been the behavior since Spark 1.6, and I didn't want to change this behavior in the QA period of Spark 2.1. We may revisit this decision later. A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in. ## How was this patch tested? Regression tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #15951 from brkyvz/partition-corruption.
* [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear dataYanbo Liang2016-11-221-0/+9
| | | | | | | | | | | | | ## What changes were proposed in this pull request? * Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition. * Scala/Python GLM summary should throw exception if users get ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs". ## How was this patch tested? Add unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15930 from yanboliang/spark-18501.
* [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not ↵Yanbo Liang2016-11-221-0/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | download Spark package. ## What changes were proposed in this pull request? When running SparkR job in yarn-cluster mode, it will download Spark package from apache website which is not necessary. ``` ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R ``` The following is output: ``` Attaching package: ‘SparkR’ The following objects are masked from ‘package:stats’: cov, filter, lag, na.omit, predict, sd, var, window The following objects are masked from ‘package:base’: as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind, sample, startsWith, subset, summary, transform, union Spark not found in SPARK_HOME: Spark not found in the cache directory. Installation will start. MirrorUrl not provided. Looking for preferred site from apache website... ...... ``` There's no ```SPARK_HOME``` in yarn-cluster mode since the R process is in a remote host of the yarn cluster rather than in the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark as Spark is already running. ## How was this patch tested? Offline test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15888 from yanboliang/spark-18444.
* [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula.Yanbo Liang2016-11-161-20/+43
| | | | | | | | | | | | | ## What changes were proposed in this pull request? ```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers. BTW, I did some cleanup and improvement for ```spark.mlp```. ## How was this patch tested? Unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15883 from yanboliang/spark-18438.
* [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms ↵Yanbo Liang2016-11-131-3/+15
| | | | | | | | | | | | | | | | | | | | training on libsvm data ## What changes were proposed in this pull request? * Fix the following exceptions which throws when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) were fitted on libsvm data. ``` java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true. ``` See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug. * Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function. * Drop some unwanted columns when making prediction. ## How was this patch tested? Add unit test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15851 from yanboliang/spark-18412.
* [SPARK-18401][SPARKR][ML] SparkR random forest should support output ↵Yanbo Liang2016-11-101-0/+24
| | | | | | | | | | | | | | original label. ## What changes were proposed in this pull request? SparkR ```spark.randomForest``` classification prediction should output original label rather than the indexed label. This issue is very similar with [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291). ## How was this patch tested? Add unit tests. Author: Yanbo Liang <ybliang8@gmail.com> Closes #15842 from yanboliang/spark-18401.
* [SPARK-18239][SPARKR] Gradient Boosted Tree for RFelix Cheung2016-11-081-0/+68
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Gradient Boosted Tree in R. With a few minor improvements to RandomForest in R. Since this is relatively isolated I'd like to target this for branch-2.1 ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15746 from felixcheung/rgbt.
* [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label ↵Yanbo Liang2016-11-071-5/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when family = binomial. ## What changes were proposed in this pull request? SparkR ```spark.glm``` predict should output original label when family = "binomial". ## How was this patch tested? Add unit test. You can also run the following code to test: ```R training <- suppressWarnings(createDataFrame(iris)) training <- training[training$Species %in% c("versicolor", "virginica"), ] model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width,family = binomial(link = "logit")) showDF(predict(model, training)) ``` Before this change: ``` +------------+-----------+------------+-----------+----------+-----+-------------------+ |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label| prediction| +------------+-----------+------------+-----------+----------+-----+-------------------+ | 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| 0.8271421517601544| | 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| 0.6044595910413112| | 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| 0.7916340858281998| | 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|0.16080518180591158| | 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| 0.6112229217050189| | 5.7| 2.8| 4.5| 1.3|versicolor| 0.0| 0.2555087295500885| | 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| 0.5681507664364834| | 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|0.05990570219972002| | 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| 0.6644434078306246| | 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|0.11293577405862379| | 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|0.06152372321585971| | 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|0.35250697207602555| | 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|0.32267018290814303| | 6.1| 2.9| 4.7| 1.4|versicolor| 0.0| 0.433391153814592| | 5.6| 2.9| 3.6| 1.3|versicolor| 0.0| 0.2280744262436993| | 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| 0.7219848389339459| | 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|0.23527698971404695| | 5.8| 2.7| 4.1| 1.0|versicolor| 0.0| 0.285024533520016| | 6.2| 2.2| 4.5| 1.5|versicolor| 0.0| 0.4107047877447493| | 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|0.20083561961645083| +------------+-----------+------------+-----------+----------+-----+-------------------+ ``` After this change: ``` +------------+-----------+------------+-----------+----------+-----+----------+ |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width| Species|label|prediction| +------------+-----------+------------+-----------+----------+-----+----------+ | 7.0| 3.2| 4.7| 1.4|versicolor| 0.0| virginica| | 6.4| 3.2| 4.5| 1.5|versicolor| 0.0| virginica| | 6.9| 3.1| 4.9| 1.5|versicolor| 0.0| virginica| | 5.5| 2.3| 4.0| 1.3|versicolor| 0.0|versicolor| | 6.5| 2.8| 4.6| 1.5|versicolor| 0.0| virginica| | 5.7| 2.8| 4.5| 1.3|versicolor| 0.0|versicolor| | 6.3| 3.3| 4.7| 1.6|versicolor| 0.0| virginica| | 4.9| 2.4| 3.3| 1.0|versicolor| 0.0|versicolor| | 6.6| 2.9| 4.6| 1.3|versicolor| 0.0| virginica| | 5.2| 2.7| 3.9| 1.4|versicolor| 0.0|versicolor| | 5.0| 2.0| 3.5| 1.0|versicolor| 0.0|versicolor| | 5.9| 3.0| 4.2| 1.5|versicolor| 0.0|versicolor| | 6.0| 2.2| 4.0| 1.0|versicolor| 0.0|versicolor| | 6.1| 2.9| 4.7| 1.4|versicolor| 0.0|versicolor| | 5.6| 2.9| 3.6| 1.3|versicolor| 0.0|versicolor| | 6.7| 3.1| 4.4| 1.4|versicolor| 0.0| virginica| | 5.6| 3.0| 4.5| 1.5|versicolor| 0.0|versicolor| | 5.8| 2.7| 4.1| 1.0|versicolor| 0.0|versicolor| | 6.2| 2.2| 4.5| 1.5|versicolor| 0.0|versicolor| | 5.6| 2.5| 3.9| 1.1|versicolor| 0.0|versicolor| +------------+-----------+------------+-----------+----------+-----+----------+ ``` Author: Yanbo Liang <ybliang8@gmail.com> Closes #15788 from yanboliang/spark-18291.
* [SPARKR][TEST] remove unnecessary suppressWarningswm624@hotmail.com2016-11-031-2/+2
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them. ## How was this patch tested? Existing unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15697 from wangmiao1981/rtest.
* [SPARK-17470][SQL] unify path for data source table and locationUri for hive ↵Wenchen Fan2016-11-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | serde table ## What changes were proposed in this pull request? Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties. This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field. This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog. For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm. For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`. To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #15024 from cloud-fan/path.
* [SPARK-16839][SQL] Simplify Struct creation code patheyal farago2016-11-021-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`. This PR includes: 1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`). 2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees. 3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`. 4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved. 5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns. ## How was this patch tested? Running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully. Modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`. Author: eyal farago <eyal farago> Author: Herman van Hovell <hvanhovell@databricks.com> Author: eyal farago <eyal.farago@gmail.com> Author: Eyal Farago <eyal.farago@actimize.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Author: eyalfa <eyal.farago@gmail.com> Closes #15718 from hvanhovell/SPARK-16839-2.
* [SPARK-17838][SPARKR] Check named arguments for options and use formatted R ↵hyukjinkwon2016-11-012-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | friendly message from JVM exception message ## What changes were proposed in this pull request? This PR proposes to - improve the R-friendly error messages rather than raw JVM exception one. As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed in the same path with `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` shares the same path with `write.df`, it seems it is safe to call `handledCallJMethod` to handle JVM messages. - prevent `zero-length variable name` and prints the ignored options as an warning message. **Before** ``` r > read.json("path", a = 1, 2, 3, "a") Error in env[[name]] <- value : zero-length variable name ``` ``` r > read.json("arbitrary_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Path does not exist: file:/...; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398) ... > read.orc("arbitrary_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Path does not exist: file:/...; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398) ... > read.text("arbitrary_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Path does not exist: file:/...; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398) ... > read.parquet("arbitrary_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: Path does not exist: file:/...; at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398) ... ``` ``` r > write.json(df, "existing_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: path file:/... already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68) > write.orc(df, "existing_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: path file:/... already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68) > write.text(df, "existing_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: path file:/... already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68) > write.parquet(df, "existing_path") Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) : org.apache.spark.sql.AnalysisException: path file:/... already exists.; at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68) ``` **After** ``` r read.json("arbitrary_path", a = 1, 2, 3, "a") Unnamed arguments ignored: 2, 3, a. ``` ``` r > read.json("arbitrary_path") Error in json : analysis error - Path does not exist: file:/... > read.orc("arbitrary_path") Error in orc : analysis error - Path does not exist: file:/... > read.text("arbitrary_path") Error in text : analysis error - Path does not exist: file:/... > read.parquet("arbitrary_path") Error in parquet : analysis error - Path does not exist: file:/... ``` ``` r > write.json(df, "existing_path") Error in json : analysis error - path file:/... already exists.; > write.orc(df, "existing_path") Error in orc : analysis error - path file:/... already exists.; > write.text(df, "existing_path") Error in text : analysis error - path file:/... already exists.; > write.parquet(df, "existing_path") Error in parquet : analysis error - path file:/... already exists.; ``` ## How was this patch tested? Unit tests in `test_utils.R` and `test_sparkSQL.R`. Author: hyukjinkwon <gurwls223@gmail.com> Closes #15608 from HyukjinKwon/SPARK-17838.
* Revert "[SPARK-16839][SQL] redundant aliases after cleanupAliases"Herman van Hovell2016-11-011-6/+6
| | | | This reverts commit 5441a6269e00e3903ae6c1ea8deb4ddf3d2e9975.
* [SPARK-16839][SQL] redundant aliases after cleanupAliaseseyal farago2016-11-011-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`. This PR includes: 1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`). 2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees. 3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`. 4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved. 5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns. ## How was this patch tested? running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully. modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`. Credit goes to hvanhovell for assisting with this PR. Author: eyal farago <eyal farago> Author: eyal farago <eyal.farago@gmail.com> Author: Herman van Hovell <hvanhovell@databricks.com> Author: Eyal Farago <eyal.farago@actimize.com> Author: Hyukjin Kwon <gurwls223@gmail.com> Author: eyalfa <eyal.farago@gmail.com> Closes #14444 from eyalfa/SPARK-16839_redundant_aliases_after_cleanupAliases.
* [SPARK-16137][SPARKR] randomForest for RFelix Cheung2016-10-301-0/+68
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Random Forest Regression and Classification for R Clean-up/reordering generics.R ## How was this patch tested? manual tests, unit tests Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15607 from felixcheung/rrandomforest.
* [SPARK-17919] Make timeout to RBackend configurable in SparkRHossein2016-10-302-3/+8
| | | | | | | | | | | | | ## What changes were proposed in this pull request? This patch makes RBackend connection timeout configurable by user. ## How was this patch tested? N/A Author: Hossein <hossein@databricks.com> Closes #15471 from falaki/SPARK-17919.
* [SPARK-17157][SPARKR] Add multiclass logistic regression SparkR Wrapperwm624@hotmail.com2016-10-261-0/+55
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? As we discussed in #14818, I added a separate R wrapper spark.logit for logistic regression. This single interface supports both binary and multinomial logistic regression. It also has "predict" and "summary" for binary logistic regression. ## How was this patch tested? New unit tests are added. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #15365 from wangmiao1981/glm.
* [SPARK-17961][SPARKR][SQL] Add storageLevel to DataFrame for SparkRWeichenXu2016-10-261-1/+4
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add storageLevel to DataFrame for SparkR. This is similar to this RP: https://github.com/apache/spark/pull/13780 but in R I do not make a class for `StorageLevel` but add a method `storageToString` ## How was this patch tested? test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15516 from WeichenXu123/storageLevel_df_r.
* [SPARK-18007][SPARKR][ML] update SparkR MLP - add initalWeights parameterWeichenXu2016-10-251-0/+15
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? update SparkR MLP, add initalWeights parameter. ## How was this patch tested? test added. Author: WeichenXu <WeichenXu123@outlook.com> Closes #15552 from WeichenXu123/mlp_r_add_initialWeight_param.
* [SPARKR][BRANCH-2.0] R merge API doc and example fixFelix Cheung2016-10-231-1/+1
| | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Fixes for R doc ## How was this patch tested? N/A Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15589 from felixcheung/rdocmergefix. (cherry picked from commit 0e0d83a597885ab1773cb69d6dcc10346d6976a3) Signed-off-by: Felix Cheung <felixcheung@apache.org>
* [SPARK-17811] SparkR cannot parallelize data.frame with NA or NULL in Date ↵Hossein2016-10-211-0/+13
| | | | | | | | | | | | | | columns ## What changes were proposed in this pull request? NA date values are serialized as "NA" and NA time values are serialized as NaN from R. In the backend we did not have proper logic to deal with them. As a result we got an IllegalArgumentException for Date and wrong value for time. This PR adds support for deserializing NA as Date and Time. ## How was this patch tested? * [x] TODO Author: Hossein <hossein@databricks.com> Closes #15421 from falaki/SPARK-17811.