path: root/sql/core/src/test
author     hyukjinkwon <gurwls223@gmail.com>       2017-03-23 00:25:01 -0700
committer  Felix Cheung <felixcheung@apache.org>   2017-03-23 00:25:01 -0700
commit     07c12c09a75645f6b56b30654455b3838b7b6637 (patch)
tree       7680418bff0d7885ea8bdefd0d3e182f751f3606 /sql/core/src/test
parent     12cd00706cbfff4c8ac681fcae65b4c4c8751877 (diff)
[SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing
## What changes were proposed in this pull request?

This PR proposes to support _not_ trimming the white spaces when writing out. The `ignoreLeadingWhiteSpace` and `ignoreTrailingWhiteSpace` options default to `false` in the CSV reading path but to `true` in the univocity parser's CSV writing path. Neither option is currently applied when writing, so the white spaces are always trimmed. It seems we should provide an easy way to keep these white spaces.

With the data below:

```scala
val df = spark.read.csv(Seq("a , b  , c").toDS)
df.show()
```

```
+---+----+---+
|_c0| _c1|_c2|
+---+----+---+
| a | b  |  c|
+---+----+---+
```

**Before**

```scala
df.write.csv("/tmp/text.csv")
spark.read.text("/tmp/text.csv").show()
```

```
+-----+
|value|
+-----+
|a,b,c|
+-----+
```

It seems this can't be worked around via `quoteAll` either.

```scala
df.write.option("quoteAll", true).csv("/tmp/text.csv")
spark.read.text("/tmp/text.csv").show()
```

```
+-----------+
|      value|
+-----------+
|"a","b","c"|
+-----------+
```

**After**

```scala
df.write
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .csv("/tmp/text.csv")
spark.read.text("/tmp/text.csv").show()
```

```
+----------+
|     value|
+----------+
|a , b  , c|
+----------+
```

Note that this case is possible in R:

```r
> system("cat text.csv")
f1,f2,f3
a , b  , c
> df <- read.csv(file="text.csv")
> df
  f1   f2 f3
1 a   b   c
> write.csv(df, file="text1.csv", quote=F, row.names=F)
> system("cat text1.csv")
f1,f2,f3
a , b  , c
```

## How was this patch tested?

Unit tests in `CSVSuite` and manual tests for Python.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17310 from HyukjinKwon/SPARK-18579.
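For background on the mechanism: the write path delegates to univocity's `CsvWriter`, whose settings trim white spaces on write by default, as noted above. A minimal sketch of how the two options might be forwarded to `CsvWriterSettings` (the helper name and wiring are illustrative, not Spark's actual plumbing):

```scala
import com.univocity.parsers.csv.CsvWriterSettings

// Illustrative helper (not Spark's actual code): forwards the two
// DataFrameWriter options to univocity, whose write-side defaults are `true`.
def writerSettings(
    ignoreLeadingWhiteSpace: Boolean,
    ignoreTrailingWhiteSpace: Boolean): CsvWriterSettings = {
  val settings = new CsvWriterSettings
  settings.setIgnoreLeadingWhitespaces(ignoreLeadingWhiteSpace)
  settings.setIgnoreTrailingWhitespaces(ignoreTrailingWhiteSpace)
  settings
}
```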
Diffstat (limited to 'sql/core/src/test')
-rw-r--r--  sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala  57
1 file changed, 57 insertions(+), 0 deletions(-)
diff --git a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
index 2600894ca3..d70c47f4e2 100644
--- a/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
+++ b/sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/csv/CSVSuite.scala
@@ -1117,4 +1117,61 @@ class CSVSuite extends QueryTest with SharedSQLContext with SQLTestUtils {
     assert(df2.schema === schema)
   }
 
+  test("ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options - read") {
+    val input = " a,b , c "
+
+    // For reading, the defaults of both `ignoreLeadingWhiteSpace` and
+    // `ignoreTrailingWhiteSpace` are `false`, so that combination is excluded.
+    val combinations = Seq(
+      (true, true),
+      (false, true),
+      (true, false))
+
+    // Check that the read rows have whitespace trimmed as configured.
+    val expectedRows = Seq(
+      Row("a", "b", "c"),
+      Row(" a", "b", " c"),
+      Row("a", "b ", "c "))
+
+    combinations.zip(expectedRows)
+      .foreach { case ((ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace), expected) =>
+        val df = spark.read
+          .option("ignoreLeadingWhiteSpace", ignoreLeadingWhiteSpace)
+          .option("ignoreTrailingWhiteSpace", ignoreTrailingWhiteSpace)
+          .csv(Seq(input).toDS())
+
+        checkAnswer(df, expected)
+      }
+  }
+
+  test("SPARK-18579: ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options - write") {
+    val df = Seq((" a", "b ", " c ")).toDF()
+
+    // For writing, the defaults of both `ignoreLeadingWhiteSpace` and
+    // `ignoreTrailingWhiteSpace` are `true`, so that combination is excluded.
+    val combinations = Seq(
+      (false, false),
+      (false, true),
+      (true, false))
+
+    // Check that the written lines have whitespace trimmed as configured.
+    val expectedLines = Seq(
+      " a,b , c ",
+      " a,b, c",
+      "a,b ,c ")
+
+    combinations.zip(expectedLines)
+      .foreach { case ((ignoreLeadingWhiteSpace, ignoreTrailingWhiteSpace), expected) =>
+        withTempPath { path =>
+          df.write
+            .option("ignoreLeadingWhiteSpace", ignoreLeadingWhiteSpace)
+            .option("ignoreTrailingWhiteSpace", ignoreTrailingWhiteSpace)
+            .csv(path.getAbsolutePath)
+
+          // Read back the written lines.
+          val readBack = spark.read.text(path.getAbsolutePath)
+          checkAnswer(readBack, Row(expected))
+        }
+      }
+  }
 }
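Taken together with the read test, a round trip keeps the padding once writing stops trimming, because the read-side defaults are already `false`. A minimal spark-shell sketch under that assumption (the path and column names are illustrative):

```scala
import spark.implicits._

val df = Seq((" a", "b ", " c ")).toDF("f1", "f2", "f3")

// Write without trimming, then read back as CSV with the read-side defaults
// (both options are `false` on read), so the padding survives.
df.write
  .option("ignoreLeadingWhiteSpace", false)
  .option("ignoreTrailingWhiteSpace", false)
  .csv("/tmp/roundtrip.csv")

spark.read.csv("/tmp/roundtrip.csv").collect().foreach(println)
// [ a,b , c ]
```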