author    Burak Yavuz <brkyvz@gmail.com>  2016-11-23 11:48:59 -0800
committer Tathagata Das <tathagata.das1565@gmail.com>  2016-11-23 11:48:59 -0800
commit    0d1bf2b6c8ac4d4141d7cef0552c22e586843c57
tree      165473dabfb5793daa1739116ad534b487483763
parent    f129ebcd302168b628f47705f4a7d6b7e7b057b0
[SPARK-18510] Fix data corruption from inferred partition column dataTypes
## What changes were proposed in this pull request?

### The Issue

If I specify my schema when doing

```scala
spark.read
  .schema(someSchemaWherePartitionColumnsAreStrings)
```

but the partition inference can infer the partition columns as IntegerType (or, I assume, LongType or DoubleType, i.e. the fixed-size types), then once UnsafeRows are generated, your data will be corrupted.

### Proposed solution

The partition handling code path is kind of a mess. In my fix I'm probably adding to the mess, but at least trying to standardize the code path. The real issue is that a user of the `spark.read` code path can never clearly specify what the partition columns are: if you try to specify the fields in `schema`, we practically ignore what the user provides and fall back to our inferred data types. What happens in the end is data corruption.

My solution tries to fix this by always inferring the partition columns the first time you specify the table. Once we find what the partition columns are, we try to find them in the user-specified schema and use the dataType provided there, or fall back to the smallest common data type. We will ALWAYS append partition columns to the user's schema, even if they didn't ask for it, and we will only use the data type they provided if they specified it. While this is confusing, it has been the behavior since Spark 1.6, and I didn't want to change it during the QA period of Spark 2.1. We may revisit this decision later.

A side effect of this PR is that we won't need https://github.com/apache/spark/pull/15942 if this PR goes in.

## How was this patch tested?

Regression tests.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #15951 from brkyvz/partition-corruption.
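For concreteness, here is a minimal Scala sketch of the merge rule the commit message describes. This is an illustration only, not the code in this patch: `mergePartitionColumns` and both schema values are hypothetical, and the "smallest common data type" fallback is omitted for brevity.

```scala
import org.apache.spark.sql.types._

// Hypothetical sketch of the rule described above: inferred partition
// columns are always appended to the final schema, but a user-declared
// dataType takes precedence over the inferred one.
def mergePartitionColumns(
    userSchema: StructType,
    inferredPartitions: StructType): StructType = {
  val userFields = userSchema.map(f => f.name -> f).toMap
  // Keep the user's non-partition fields in their original order...
  val dataFields =
    userSchema.filterNot(f => inferredPartitions.fieldNames.contains(f.name))
  // ...then append every partition column, preferring the user's dataType
  // when the user declared that column in the schema.
  val partitionFields =
    inferredPartitions.map(p => userFields.getOrElse(p.name, p))
  StructType(dataFields ++ partitionFields)
}

// Before this fix, a directory layout like /table/part=0/ made `part` come
// back as the inferred IntegerType even when the user declared StringType:
val user = new StructType().add("value", LongType).add("part", StringType)
val inferred = new StructType().add("part", IntegerType)
mergePartitionColumns(user, inferred)
// => struct<value: bigint, part: string>   (the user's StringType wins)
```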
Diffstat (limited to 'R/pkg/inst/tests/testthat/test_sparkSQL.R')

 R/pkg/inst/tests/testthat/test_sparkSQL.R | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/R/pkg/inst/tests/testthat/test_sparkSQL.R b/R/pkg/inst/tests/testthat/test_sparkSQL.R
index ee48baa59c..c669c2e2e2 100644
--- a/R/pkg/inst/tests/testthat/test_sparkSQL.R
+++ b/R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -2684,7 +2684,7 @@ test_that("Call DataFrameWriter.load() API in Java without path and check argume
# It makes sure that we can omit path argument in read.df API and then it calls
# DataFrameWriter.load() without path.
expect_error(read.df(source = "json"),
- paste("Error in loadDF : analysis error - Unable to infer schema for JSON at .",
+ paste("Error in loadDF : analysis error - Unable to infer schema for JSON.",
"It must be specified manually"))
expect_error(read.df("arbitrary_path"), "Error in loadDF : analysis error - Path does not exist")
expect_error(read.json("arbitrary_path"), "Error in json : analysis error - Path does not exist")