| Commit message | Author | Age | Files | Lines |
---
This PR:
1. Suppress all known warnings.
2. Clean up test cases and fix some errors in them.
3. Fix errors in HiveContext-related test cases. These test cases were actually not run previously, due to a bug in creating TestHiveContext.
4. Support 'testthat' package version 0.11.0, which prefers that test cases be under 'tests/testthat'.
5. Make sure the default Hadoop file system is local when running test cases.
6. Turn warnings into errors.
Author: Sun Rui <rui.sun@intel.com>
Closes #10030 from sun-rui/SPARK-12034.
---
1. Add ```isNaN``` to ```Column``` for SparkR. ```Column``` should have three related unary functions: ```isNaN, isNull, isNotNull```.
2. Replace ```DataFrame.isNaN``` with ```DataFrame.isnan``` on the SparkR side, because ```DataFrame.isNaN``` has been deprecated and will be removed in Spark 2.0.
<del>3. Add ```isnull``` to ```DataFrame``` for SparkR. ```DataFrame``` should have two related functions: ```isnan, isnull```.</del>
cc shivaram sun-rui felixcheung
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10037 from yanboliang/spark-12044.
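A minimal SparkR sketch of the three unary Column predicates discussed above (the DataFrame and column name are illustrative, and an active SQLContext is assumed):

```r
# Hypothetical DataFrame with a numeric column "height" containing NaN and NA.
df <- createDataFrame(sqlContext, data.frame(height = c(1.5, NaN, NA)))
# Each predicate returns a Column usable inside select/filter.
collect(select(df, isNaN(df$height), isNull(df$height), isNotNull(df$height)))
```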
---
Need to match existing method signature
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9680 from felixcheung/rcorr.
---
SparkR.
Author: Sun Rui <rui.sun@intel.com>
Closes #9804 from sun-rui/SPARK-11774.
---
Author: Sun Rui <rui.sun@intel.com>
Closes #10118 from sun-rui/SPARK-12104.
---
Author: Sun Rui <rui.sun@intel.com>
Closes #9769 from sun-rui/SPARK-11781.
---
Add support for colnames, colnames<-, coltypes<-
Also added tests for names and names<-, which had no tests previously.
I merged with PR 8984 (coltypes). Clicked the wrong thing and screwed up the PR; recreated it here. Was #9218
shivaram sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9654 from felixcheung/colnamescoltypes.
---
tests, fix doc and add examples
shivaram sun-rui
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #10019 from felixcheung/rfunctionsdoc.
---
Change ```cumeDist -> cume_dist, denseRank -> dense_rank, percentRank -> percent_rank, rowNumber -> row_number``` at SparkR side.
There are two reasons that we should make this change:
* We should follow the [naming convention rule of R](http://www.inside-r.org/node/230645)
* Spark DataFrame has deprecated the old convention (such as ```cumeDist```) and will remove it in Spark 2.0.
It's better to fix this issue before the 1.6 release; otherwise we will make a breaking API change.
cc shivaram sun-rui
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #10016 from yanboliang/SPARK-12025.
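The renaming can be summarized with a sketch, assuming a SparkR session (the deprecated camelCase names map one-to-one to the new snake_case names):

```r
# Deprecated     ->  Preferred (R naming convention, matches Spark SQL)
# cumeDist()     ->  cume_dist()
# denseRank()    ->  dense_rank()
# percentRank()  ->  percent_rank()
# rowNumber()    ->  row_number()
rnk <- dense_rank()  # returns a Column, to be evaluated over a window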
---
are masked by functions with the same name in SparkR
Added tests for functions that are reported as masked, to make sure the base:: or stats:: function can still be called.
For those we can't call, added them to the SparkR programming guide.
It would seem to me that `table, sample, subset, filter, cov` not working is not actually expected - I investigated/experimented with them but couldn't get them to work. It looks like, as they are defined in base or stats, they are missing the S3 generic, e.g.
```
> methods("transform")
[1] transform,ANY-method transform.data.frame
[3] transform,DataFrame-method transform.default
see '?methods' for accessing help and source code
> methods("subset")
[1] subset.data.frame subset,DataFrame-method subset.default
[4] subset.matrix
see '?methods' for accessing help and source code
Warning message:
In .S3methods(generic.function, class, parent.frame()) :
function 'subset' appears not to be S3 generic; found functions that look like S3 methods
```
Any idea?
More information on masking:
http://www.ats.ucla.edu/stat/r/faq/referencing_objects.htm
http://www.sfu.ca/~sweldon/howTo/guide4.pdf
This is what the output doc looks like (minus css):
![image](https://cloud.githubusercontent.com/assets/8969467/11229714/2946e5de-8d4d-11e5-94b0-dda9696b6fdd.png)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9785 from felixcheung/rmasked.
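When a base or stats function is masked by SparkR, the package-qualified form still reaches the original; a plain-R sketch (no Spark needed):

```r
# SparkR masks e.g. filter (from stats) and sample (from base);
# an explicit namespace prefix bypasses the mask.
stats::filter(1:5, rep(1, 3))  # the time-series filter from stats
base::sample(1:10, 3)          # the base sampling function
subset(mtcars, cyl == 4)       # S3 dispatch on data.frame still works
```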
---
Author: Sun Rui <rui.sun@intel.com>
Closes #9764 from sun-rui/SPARK-11773.
---
The goal of this PR is to add tests covering the issue, to ensure that it was resolved by [SPARK-11086](https://issues.apache.org/jira/browse/SPARK-11086).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9743 from zero323/SPARK-11281-tests.
---
when createDataFrame
Use `dropFactors` column-wise instead of a nested loop when `createDataFrame` is called on a `data.frame`.
At this moment SparkR's createDataFrame uses a nested loop to convert factors to character when called on a local data.frame. It works but is incredibly slow, especially with data.table (~2 orders of magnitude slower than the PySpark / Pandas version on a data.frame of size 1M rows x 2 columns).
A simple improvement is to apply `dropFactors` column-wise and then reshape the output list.
It should at least partially address [SPARK-8277](https://issues.apache.org/jira/browse/SPARK-8277).
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9099 from zero323/SPARK-11086.
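The column-wise idea can be illustrated in plain R (no Spark needed); `dropFactors` itself is internal, so this uses an inline equivalent:

```r
# Convert factor columns to character in one pass per column,
# instead of visiting every cell in a nested loop.
df <- data.frame(x = factor(c("a", "b", "c")), y = 1:3)
df[] <- lapply(df, function(col) {
  if (is.factor(col)) as.character(col) else col
})
str(df)  # x is now character, y untouched
```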
---
switched stddev support from DeclarativeAggregate to ImperativeAggregate.
Author: JihongMa <linlin200605@gmail.com>
Closes #9380 from JihongMA/SPARK-11420.
---
Checked names, none of them should conflict with anything in base
shivaram davies rxin
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9489 from felixcheung/rstddev.
---
This is a follow-up on PR #8984, as the corresponding branch for that PR was damaged.
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Closes #9579 from olarayej/SPARK-10863_NEW14.
---
Make sample test less flaky by setting the seed
Tested with
```
repeat { if (count(sample(df, FALSE, 0.1)) == 3) { break } }
```
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #9549 from felixcheung/rsample.
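A sketch of the deterministic form, assuming SparkR's `sample` accepts an optional seed argument (df is illustrative):

```r
# With a fixed seed, the sampled rows are reproducible across runs,
# so a count-based assertion no longer flakes.
sampled <- sample(df, withReplacement = FALSE, fraction = 0.1, seed = 42)
count(sampled)
```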
---
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes #8314 from squito/SPARK-10116.
---
Author: adrian555 <wzhuang@us.ibm.com>
Author: Adrian Zhuang <adrian555@users.noreply.github.com>
Closes #9443 from adrian555/with.
---
Author: Sun Rui <rui.sun@intel.com>
Closes #9196 from sun-rui/SPARK-11210.
---
Author: Sun Rui <rui.sun@intel.com>
Closes #9193 from sun-rui/SPARK-11209.
---
Add merge function to DataFrame, which supports R signature.
https://stat.ethz.ch/R-manual/R-devel/library/base/html/merge.html
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #9012 from NarineK/sparkrmerge.
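A sketch of the R-style signature on SparkR DataFrames (DataFrame and column names are illustrative):

```r
# merge follows base R's signature: by/by.x/by.y select the join keys,
# all.x/all.y control outer-join behavior.
joined <- merge(df1, df2, by = "id")                # inner join on id
left   <- merge(df1, df2, by = "id", all.x = TRUE)  # left outer join
```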
---
This PR introduces a new feature to run SQL directly on files without creating a table, for example:
```
select id from json.`path/to/json/files` as j
```
Author: Davies Liu <davies@databricks.com>
Closes #9173 from davies/source.
---
Author: Sun Rui <rui.sun@intel.com>
Closes #9023 from sun-rui/SPARK-10996.
---
I was having issues with collect() and orderBy() in Spark 1.5.0, so I used the DataFrame.R and test_sparkSQL.R files from the Spark 1.5.1 download. I only modified the join() function in DataFrame.R to include "full", "fullouter", "left", "right", and "leftsemi", and added corresponding test cases for join() and merge() in the test_sparkSQL.R file.
Pull request because I filed this JIRA bug report:
https://issues.apache.org/jira/browse/SPARK-10981
Author: Monica Liu <liu.monica.f@gmail.com>
Closes #9029 from mfliu/master.
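A sketch of the added join types, assuming the joinType string argument of SparkR's `join` (DataFrames and columns are illustrative):

```r
# Third argument is the join expression, fourth the join type string.
join(df1, df2, df1$id == df2$id, "left")       # newly accepted alias
join(df1, df2, df1$id == df2$id, "fullouter")  # full outer join
join(df1, df2, df1$id == df2$id, "leftsemi")   # keep df1 rows with a match
```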
---
Bring the change code up to date.
Author: Adrian Zhuang <adrian555@users.noreply.github.com>
Author: adrian555 <wzhuang@us.ibm.com>
Closes #9031 from adrian555/attach2.
---
as.DataFrame is a more R-style signature.
Also, I'd like to know if we could make the context, e.g. sqlContext, global, so that we do not have to specify it as an argument each time we create a DataFrame.
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #8952 from NarineK/sparkrasDataFrame.
---
Two points in this PR:
1. The original thought was that a named R list is assumed to be a struct in SerDe. But this is problematic because some R functions will implicitly generate named lists that are not intended to be a struct when transferred by SerDe. So SerDe clients have to explicitly mark a named list as a struct by changing its class from "list" to "struct".
2. SerDe is in the Spark Core module, and data of StructType is represented as GenericRow, which is defined in the Spark SQL module. SerDe can't import GenericRow, since in the Maven build the Spark SQL module depends on the Spark Core module. So this PR adds a registration hook in SerDe to allow SQLUtils in the Spark SQL module to register its functions for serialization and deserialization of StructType.
Author: Sun Rui <rui.sun@intel.com>
Closes #8794 from sun-rui/SPARK-10051.
---
1. Add a "col" function to DataFrame.
2. Move the current "col" function in Column.R to functions.R and convert it to an S4 function.
3. Add an S4 "column" function in functions.R.
4. Convert the "column" function in Column.R to an S4 function. This is for private use.
Author: Sun Rui <rui.sun@intel.com>
Closes #8864 from sun-rui/SPARK-10079.
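A sketch of the distinction, assuming the split described above (df and its columns are illustrative):

```r
# col() builds a Column from a name without needing a DataFrame handle;
# useful inside expressions passed to select/filter.
select(df, col("age"))
filter(df, col("age") > 21)
```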
---
[SPARK-10905][SparkR]: Export freqItems() for DataFrameStatFunctions
- Add function (together with roxygen2 doc) to DataFrame.R and generics.R
- Expose the function in NAMESPACE
- Add unit test for the function
Author: Rerngvit Yanggratoke <rerngvit@kth.se>
Closes #8962 from rerngvit/SPARK-10905.
---
The sort function can be used as an alternative to arrange(...).
As arguments it accepts: x (the DataFrame), decreasing (TRUE/FALSE, or a vector of orderings for the columns), and the columns to sort by, represented as string names,
for example:
sort(df, TRUE, "col1","col2","col3","col5") # if we want to sort all of the listed columns in the same order
sort(df, decreasing=TRUE, "col1")
sort(df, decreasing=c(TRUE,FALSE), "col1","col2")
Author: Narine Kokhlikyan <narine.kokhlikyan@gmail.com>
Closes #8920 from NarineK/sparkrsort.
---
Author: Sun Rui <rui.sun@intel.com>
Closes #8869 from sun-rui/SPARK-10752.
---
The fix is to coerce `c("a", "b")` into a list so that it can be serialized for the call into the JVM.
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #8961 from felixcheung/rselect.
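With the coercion fix, both spellings below should behave the same (df and its columns are illustrative):

```r
select(df, "name", "age")     # individual strings already worked
select(df, c("name", "age"))  # character vector now serializes correctly
```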
---
Created method as.data.frame as a synonym for collect().
Author: Oscar D. Lara Yejas <olarayej@mail.usf.edu>
Author: olarayej <oscar.lara.yejas@us.ibm.com>
Author: Oscar D. Lara Yejas <oscar.lara.yejas@us.ibm.com>
Closes #8908 from olarayej/SPARK-10807.
---
1. Support collecting data of MapType from DataFrame.
2. Support data of MapType in createDataFrame.
Author: Sun Rui <rui.sun@intel.com>
Closes #8711 from sun-rui/SPARK-10050.
---
Adding STDDEV support for DataFrame using a one-pass online/parallel algorithm to compute variance. Please review the code change.
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
---
This PR:
1. Enhances reflection in RBackend, automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side.
2. Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection.
3. Supports ArrayType in createDataFrame().
Author: Sun Rui <rui.sun@intel.com>
Closes #8458 from sun-rui/SPARK-10049.
---
Spark gives an error message and does not show the output when a field of the result DataFrame contains CJK characters.
I changed SerDe.scala so that Spark supports Unicode characters when writing a string to R.
Author: CHOIJAEHONG <redrock07@naver.com>
Closes #7494 from CHOIJAEHONG1/SPARK-8951.
---
Add subset and transform
Also reorganize `[` & `[[` to subset instead of select
Note: for transform, transform is very similar to mutate. Spark doesn't seem to replace an existing column of the same name in mutate (i.e. `mutate(df, age = df$age + 2)` returns a DataFrame with 2 columns both named 'age'), so transform does not do that for now either.
Though it is clearly stated that it should replace a column with a matching name (should I open a JIRA for mutate/transform?)
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #8503 from felixcheung/rsubset_transform.
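A sketch of the two additions, assuming a DataFrame `df` with illustrative columns name/age:

```r
# subset-style row/column selection via `[`
adults <- df[df$age > 21, c("name", "age")]
# transform appends a computed column; note it does NOT replace an
# existing column of the same name (see the mutate caveat above).
df2 <- transform(df, age2 = df$age + 2)
```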
---
S3 function is at https://stat.ethz.ch/R-manual/R-patched/library/stats/html/na.fail.html
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8495 from shivaram/na-omit-fix.
---
cc sun-rui davies
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Closes #8475 from shivaram/varargs-fix.
---
Getting rid of some validation problems in SparkR
https://github.com/apache/spark/pull/7883
cc shivaram
```
inst/tests/test_Serde.R:26:1: style: Trailing whitespace is superfluous.
^~
inst/tests/test_Serde.R:34:1: style: Trailing whitespace is superfluous.
^~
inst/tests/test_Serde.R:37:38: style: Trailing whitespace is superfluous.
expect_equal(class(x), "character")
^~
inst/tests/test_Serde.R:50:1: style: Trailing whitespace is superfluous.
^~
inst/tests/test_Serde.R:55:1: style: Trailing whitespace is superfluous.
^~
inst/tests/test_Serde.R:60:1: style: Trailing whitespace is superfluous.
^~
inst/tests/test_sparkSQL.R:611:1: style: Trailing whitespace is superfluous.
^~
R/DataFrame.R:664:1: style: Trailing whitespace is superfluous.
^~~~~~~~~~~~~~
R/DataFrame.R:670:55: style: Trailing whitespace is superfluous.
df <- data.frame(row.names = 1 : nrow)
^~~~~~~~~~~~~~~~
R/DataFrame.R:672:1: style: Trailing whitespace is superfluous.
^~~~~~~~~~~~~~
R/DataFrame.R:686:49: style: Trailing whitespace is superfluous.
df[[names[colIndex]]] <- vec
^~~~~~~~~~~~~~~~~~
```
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8474 from yu-iskw/minor-fix-sparkr.
---
filter / select)
Add support for
```
df[df$name == "Smith", c(1,2)]
df[df$age %in% c(19, 30), 1:2]
```
shivaram
Author: felixcheung <felixcheung_m@hotmail.com>
Closes #8394 from felixcheung/rsubset.
---
### JIRA
[[SPARK-10106] Add `ifelse` Column function to SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10106)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8303 from yu-iskw/SPARK-10106.
---
complicated
I added lots of Column functions to SparkR. I also added `rand(seed: Int)` and `randn(seed: Int)` in Scala, since we need such APIs for the R integer type.
### JIRA
[[SPARK-9856] Add expression functions into SparkR whose params are complicated - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9856)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8264 from yu-iskw/SPARK-9856-3.
---
- Add `when` and `otherwise` as `Column` methods
- Add `When` as an expression function
- Add `%otherwise%` infix as an alias of `otherwise`
Since R doesn't support a feature like method chaining, the `otherwise(when(condition, value), value)` style is a little annoying for me. If `%otherwise%` looks strange to shivaram, I can remove it. What do you think?
### JIRA
[[SPARK-10075] Add `when` expression function in SparkR - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-10075)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8266 from yu-iskw/SPARK-10075.
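A sketch of the two styles, with an illustrative DataFrame `df`:

```r
# Nested-call style
label <- otherwise(when(df$age > 18, "adult"), "minor")
# Infix alias proposed in this PR
label <- when(df$age > 18, "adult") %otherwise% "minor"
select(df, label)
```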
---
variable parameter
### Summary
- Add `lit` function
- Add `concat`, `greatest`, `least` functions
I think we need to improve the `collect` function in order to implement the `struct` function, since `collect` doesn't work with arguments which include a nested `list` variable. It seems that a list passed to `struct` still has `jobj` classes. So it would be better to solve this problem in another issue.
### JIRA
[[SPARK-9871] Add expression functions into SparkR which have a variable parameter - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9871)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8194 from yu-iskw/SPARK-9856.
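A sketch of the variadic functions added here (column names are illustrative):

```r
full  <- concat(df$first, lit(" "), df$last)  # lit wraps a literal as a Column
best  <- greatest(df$q1, df$q2, df$q3)        # row-wise maximum
worst <- least(df$q1, df$q2, df$q3)           # row-wise minimum
select(df, full, best, worst)
```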
---
This is a WIP patch for SPARK-8844 for collecting reviews.
This bug is about reading an empty DataFrame: in readCol(), `lapply(1:numRows, function(x) {` does not take into consideration the case where numRows = 0.
Will add a unit test case.
Author: Sun Rui <rui.sun@intel.com>
Closes #7419 from sun-rui/SPARK-8844.
---
simple
I added lots of expression functions for SparkR. This PR includes only functions whose params are `(Column)` or `(Column, Column)`. I also think we need to improve how we test those functions; however, it would be better to work on that in another issue.
## Diff Summary
- Add lots of functions in `functions.R` and their generic in `generic.R`
- Add aliases for `ceiling` and `sign`
- Move expression functions from `column.R` to `functions.R`
- Modify `rdname` from `column` to `functions`
I haven't supported the `not` function, because its name collides with the `testthat` package, and I haven't figured out a way to define it.
## New Supported Functions
```
approxCountDistinct
ascii
base64
bin
bitwiseNOT
ceil (alias: ceiling)
crc32
dayofmonth
dayofyear
explode
factorial
hex
hour
initcap
isNaN
last_day
length
log2
ltrim
md5
minute
month
negate
quarter
reverse
round
rtrim
second
sha1
signum (alias: sign)
size
soundex
to_date
trim
unbase64
unhex
weekofyear
year
datediff
levenshtein
months_between
nanvl
pmod
```
## JIRA
[[SPARK-9855] Add expression functions into SparkR whose params are simple - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-9855)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8123 from yu-iskw/SPARK-9855.
---
on DataFrames
This PR adds synonyms for ```merge``` and ```summary``` in SparkR DataFrame API.
cc shivaram
Author: Hossein <hossein@databricks.com>
Closes #7806 from falaki/SPARK-9320 and squashes the following commits:
72600f7 [Hossein] Updated docs
92a6e75 [Hossein] Fixed merge generic signature issue
4c2b051 [Hossein] Fixing naming with mllib summary
0f3a64c [Hossein] Added ... to generic for merge
30fbaf8 [Hossein] Merged master
ae1a4cf [Hossein] Merge branch 'master' into SPARK-9320
e8eb86f [Hossein] Add a generic for merge
fc01f2d [Hossein] Added unit test
8d92012 [Hossein] Added merge as an alias for join
5b8bedc [Hossein] Added unit test
632693d [Hossein] Added summary as an alias for describe for DataFrame