| field | value | date |
|---|---|---|
| author | Jason White <jason.white@shopify.com> | 2017-03-09 10:34:54 -0800 |
| committer | Wenchen Fan <wenchen@databricks.com> | 2017-03-09 10:34:54 -0800 |
| commit | 206030bd12405623c00c1ff334663984b9250adb | |
| tree | 231e548a364627fac2a11f86f58abedcb7cf7447 | |
| parent | 274973d2a32ff4eb28545b50a3135e4784eb3047 | |
[SPARK-19561][SQL] add int case handling for TimestampType
## What changes were proposed in this pull request?
Add handling of input of type `Int` for dataType `TimestampType` to `EvaluatePython.scala`. Py4J serializes Python ints smaller than MIN_INT or larger than MAX_INT to a Scala Long, which `EvaluatePython` already handles correctly, but values between MIN_INT and MAX_INT are serialized to a Scala Int, which had no matching case.
Because Spark stores a `TimestampType` value internally as microseconds since the epoch, the Int range corresponds to roughly half an hour on either side of the epoch. As a result, PySpark could not create `TimestampType` values in that window.
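For concreteness, the affected window can be computed directly (a quick sketch in plain Python; the bounds are the standard 32-bit signed limits Py4J uses when choosing between a Scala Int and Long):

```python
# 32-bit signed bounds Py4J uses when deciding between a Scala Int and Long
MIN_INT, MAX_INT = -(2 ** 31), 2 ** 31 - 1  # -2147483648, 2147483647

# TimestampType is stored as microseconds since the epoch, so the
# Int-serialized window is +/- MAX_INT microseconds around 1970-01-01 UTC.
window_minutes = MAX_INT / 1e6 / 60
print(round(window_minutes, 1))  # ~35.8 minutes on either side of the epoch
```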
One alternative was attempted: patching the `TimestampType.toInternal` function to cast its return value to `long`, so that Py4J would always serialize timestamps to a Scala Long. Python 3 has no separate `long` type, so this approach failed there, as sketched below.
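A minimal sketch of that rejected approach, using a hypothetical standalone `to_internal_patched` helper rather than PySpark's actual `TimestampType.toInternal` (which is more involved); it exists only to illustrate why the cast cannot survive Python 3:

```python
import time

def to_internal_patched(dt):
    """Hypothetical patch to TimestampType.toInternal (not the real code)."""
    # Convert a naive datetime to Spark's internal microseconds-since-epoch
    # value, mirroring what toInternal produces.
    micros = int(time.mktime(dt.timetuple())) * 1000000 + dt.microsecond
    # Force `long` so that Py4J would always pick the Long serializer.
    # This works on Python 2, but Python 3 removed the `long` type, so the
    # line below raises NameError there -- which is why the idea was dropped.
    return long(micros)  # noqa: F821 (undefined on Python 3, by design)
```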
## How was this patch tested?
Added a new PySpark-side test that fails without the change.
The contribution is my original work and I license the work to the project under the project’s open source license.
Resubmission of https://github.com/apache/spark/pull/16896. The original PR didn't go through Jenkins and broke the build. davies dongjoon-hyun
cloud-fan Could you kick off a Jenkins run for me? It passed everything for me locally, but it's possible something has changed in the last few weeks.
Author: Jason White <jason.white@shopify.com>
Closes #17200 from JasonMWhite/SPARK-19561.
| file | lines changed |
|---|---|
| python/pyspark/sql/tests.py | 8 |
| sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala | 2 |

2 files changed, 10 insertions(+), 0 deletions(-)
```diff
diff --git a/python/pyspark/sql/tests.py b/python/pyspark/sql/tests.py
index 81f3d1d36a..1b873e9578 100644
--- a/python/pyspark/sql/tests.py
+++ b/python/pyspark/sql/tests.py
@@ -1555,6 +1555,14 @@ class SQLTests(ReusedPySparkTestCase):
         self.assertEqual(now, now1)
         self.assertEqual(now, utcnow1)
 
+    # regression test for SPARK-19561
+    def test_datetime_at_epoch(self):
+        epoch = datetime.datetime.fromtimestamp(0)
+        df = self.spark.createDataFrame([Row(date=epoch)])
+        first = df.select('date', lit(epoch).alias('lit_date')).first()
+        self.assertEqual(first['date'], epoch)
+        self.assertEqual(first['lit_date'], epoch)
+
     def test_decimal(self):
         from decimal import Decimal
         schema = StructType([StructField("decimal", DecimalType(10, 5))])
diff --git a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala
index 46fd54e5c7..fcd84705f7 100644
--- a/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala
+++ b/sql/core/src/main/scala/org/apache/spark/sql/execution/python/EvaluatePython.scala
@@ -112,6 +112,8 @@ object EvaluatePython {
 
     case (c: Int, DateType) => c
 
     case (c: Long, TimestampType) => c
+    // Py4J serializes values between MIN_INT and MAX_INT as Ints, not Longs
+    case (c: Int, TimestampType) => c.toLong
 
     case (c, StringType) => UTF8String.fromString(c.toString)
```
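With the fix applied, timestamps inside that window round-trip cleanly. A minimal standalone check, assuming a local PySpark build that contains this patch, mirroring the new regression test:

```python
import datetime

from pyspark.sql import Row, SparkSession
from pyspark.sql.functions import lit

spark = SparkSession.builder.master("local[1]").appName("spark-19561-check").getOrCreate()

# The epoch sits well inside the +/- MAX_INT-microsecond window that Py4J
# serializes as a Scala Int; before this patch the round trip failed.
epoch = datetime.datetime.fromtimestamp(0)
df = spark.createDataFrame([Row(date=epoch)])
row = df.select('date', lit(epoch).alias('lit_date')).first()

assert row['date'] == epoch
assert row['lit_date'] == epoch

spark.stop()
```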