| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into long value range (e.g. `Long.MAX_VALUE+1`). When we run the following code by using the current Spark, the following exception is thrown.
This PR keeps the value using `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`.
Sample program:
```
case class BigIntWrapper(value:scala.math.BigInt)```
spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002"))::Nil).show
```
Exception:
```
Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:381)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
at scala.collection.immutable.List.map(List.scala:285)
at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
at org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192)
at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
...
Caused by: java.lang.ArithmeticException: BigInteger out of long range
at java.math.BigInteger.longValueExact(BigInteger.java:4531)
at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140)
at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434)
at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
... 59 more
```
## How was this patch tested?
Add new test suite into `DecimalSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #17684 from kiszk/SPARK-20341.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
instead of returning null values.
## What changes were proposed in this pull request?
If a partitionSpec is supposed to not contain optional values, a ParseException should be thrown, and not nulls returned.
The nulls can later cause NullPointerExceptions in places not expecting them.
## How was this patch tested?
A query like "SHOW PARTITIONS tbl PARTITION(col1='val1', col2)" used to throw a NullPointerException.
Now it throws a ParseException.
Author: Juliusz Sompolski <julek@databricks.com>
Closes #17707 from juliuszsompolski/SPARK-20412.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
It is often useful to be able to track changes to the `ExternalCatalog`. This PR makes the `ExternalCatalog` emit events when a catalog object is changed. Events are fired before and after the change.
The following events are fired per object:
- Database
- CreateDatabasePreEvent: event fired before the database is created.
- CreateDatabaseEvent: event fired after the database has been created.
- DropDatabasePreEvent: event fired before the database is dropped.
- DropDatabaseEvent: event fired after the database has been dropped.
- Table
- CreateTablePreEvent: event fired before the table is created.
- CreateTableEvent: event fired after the table has been created.
- RenameTablePreEvent: event fired before the table is renamed.
- RenameTableEvent: event fired after the table has been renamed.
- DropTablePreEvent: event fired before the table is dropped.
- DropTableEvent: event fired after the table has been dropped.
- Function
- CreateFunctionPreEvent: event fired before the function is created.
- CreateFunctionEvent: event fired after the function has been created.
- RenameFunctionPreEvent: event fired before the function is renamed.
- RenameFunctionEvent: event fired after the function has been renamed.
- DropFunctionPreEvent: event fired before the function is dropped.
- DropFunctionPreEvent: event fired after the function has been dropped.
The current events currently only contain the names of the object modified. We add more events, and more details at a later point.
A user can monitor changes to the external catalog by adding a listener to the Spark listener bus checking for `ExternalCatalogEvent`s using the `SparkListener.onOtherEvent` hook. A more direct approach is add listener directly to the `ExternalCatalog`.
## How was this patch tested?
Added the `ExternalCatalogEventSuite`.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #17710 from hvanhovell/SPARK-20420.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
A cast expression with a resolved time zone is not equal to a cast expression without a resolved time zone. The `ResolveAggregateFunction` assumed that these expression were the same, and would fail to resolve `HAVING` clauses which contain a `Cast` expression.
This is in essence caused by the fact that a `TimeZoneAwareExpression` can be resolved without a set time zone. This PR fixes this, and makes a `TimeZoneAwareExpression` unresolved as long as it has no TimeZone set.
## How was this patch tested?
Added a regression test to the `SQLQueryTestSuite.having` file.
Author: Herman van Hovell <hvanhovell@databricks.com>
Closes #17641 from hvanhovell/SPARK-20329.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
contain aggregate expression that has mixture of outer and local references.
## What changes were proposed in this pull request?
Address a follow up in [comment](https://github.com/apache/spark/pull/16954#discussion_r105718880)
Currently subqueries with correlated predicates containing aggregate expression having mixture of outer references and local references generate a codegen error like following :
```SQL
SELECT t1a
FROM t1
GROUP BY 1
HAVING EXISTS (SELECT 1
FROM t2
WHERE t2a < min(t1a + t2a));
```
Exception snippet.
```
Cannot evaluate expression: min((input[0, int, false] + input[4, int, false]))
at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.doGenCode(Expression.scala:226)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:87)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:106)
at org.apache.spark.sql.catalyst.expressions.Expression$$anonfun$genCode$2.apply(Expression.scala:103)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:103)
```
After this PR, a better error message is issued.
```
org.apache.spark.sql.AnalysisException
Error in query: Found an aggregate expression in a correlated
predicate that has both outer and local references, which is not supported yet.
Aggregate expression: min((t1.`t1a` + t2.`t2a`)),
Outer references: t1.`t1a`,
Local references: t2.`t2a`.;
```
## How was this patch tested?
Added tests in SQLQueryTestSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #17636 from dilipbiswal/subquery_followup1.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
It's illegal to have aggregate function in GROUP BY, and we should fail at analysis phase, if this happens.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17704 from cloud-fan/minor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
in Database and Table DDLs
### What changes were proposed in this pull request?
Database and Table names conform the Hive standard ("[a-zA-z_0-9]+"), i.e. if this name only contains characters, numbers, and _.
When calling `toLowerCase` on the names, we should add `Locale.ROOT` to the `toLowerCase`for avoiding inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes #17655 from gatorsmile/locale.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Also went through the same file to ensure other string concatenation are correct.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <shixiong@databricks.com>
Closes #17691 from zsxwing/fix-error-message.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Apply Complementation Laws during boolean expression simplification.
## How was this patch tested?
Tested using unit tests, integration tests, and manual tests.
Author: ptkool <michael.styles@shopify.com>
Author: Michael Styles <michael.styles@shopify.com>
Closes #17650 from ptkool/apply_complementation_laws.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
optimization that can lead to NPE
Avoid necessary execution that can lead to NPE in EliminateOuterJoin and add test in DataFrameSuite to confirm NPE is no longer thrown
## What changes were proposed in this pull request?
Change leftHasNonNullPredicate and rightHasNonNullPredicate to lazy so they are only executed when needed.
## How was this patch tested?
Added test in DataFrameSuite that failed before this fix and now succeeds. Note that a test in catalyst project would be better but i am unsure how to do this.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Koert Kuipers <koert@tresata.com>
Closes #17660 from koertkuipers/feat-catch-npe-in-eliminate-outer-join.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
primitive array
## What changes were proposed in this pull request?
This PR elminates unnecessary data conversion, which is introduced by SPARK-19716, for Dataset with primitve array in the generated Java code.
When we run the following example program, now we get the Java code "Without this PR". In this code, lines 56-82 are unnecessary since the primitive array in ArrayData can be converted into Java primitive array by using ``toDoubleArray()`` method. ``GenericArrayData`` is not required.
```java
val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
ds.count
ds.map(e => e).show
```
Without this PR
```
== Parsed Logical Plan ==
'SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
+- 'MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
+- 'DeserializeToObject unresolveddeserializer(unresolvedmapobjects(<function1>, getcolumnbyordinal(0, ArrayType(DoubleType,false)), None).toDoubleArray), obj#23: [D
+- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
+- ExternalRDD [obj#1]
== Analyzed Logical Plan ==
value: array<double>
SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
+- MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
+- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D
+- SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
+- ExternalRDD [obj#1]
== Optimized Logical Plan ==
SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
+- MapElements <function1>, class [D, [StructField(value,ArrayType(DoubleType,false),true)], obj#24: [D
+- DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D
+- InMemoryRelation [value#2], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
+- Scan ExternalRDDScan[obj#1]
== Physical Plan ==
*SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#25]
+- *MapElements <function1>, obj#24: [D
+- *DeserializeToObject mapobjects(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, assertnotnull(lambdavariable(MapObjects_loopValue5, MapObjects_loopIsNull5, DoubleType, true), - array element class: "scala.Double", - root class: "scala.Array"), value#2, None, MapObjects_builderValue5).toDoubleArray, obj#23: [D
+- InMemoryTableScan [value#2]
+- InMemoryRelation [value#2], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *SerializeFromObject [staticinvoke(class org.apache.spark.sql.catalyst.expressions.UnsafeArrayData, ArrayType(DoubleType,false), fromPrimitiveArray, input[0, [D, true], true) AS value#2]
+- Scan ExternalRDDScan[obj#1]
```
```java
/* 050 */ protected void processNext() throws java.io.IOException {
/* 051 */ while (inputadapter_input.hasNext() && !stopEarly()) {
/* 052 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 053 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 054 */ ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
/* 055 */
/* 056 */ ArrayData deserializetoobject_value1 = null;
/* 057 */
/* 058 */ if (!inputadapter_isNull) {
/* 059 */ int deserializetoobject_dataLength = inputadapter_value.numElements();
/* 060 */
/* 061 */ Double[] deserializetoobject_convertedArray = null;
/* 062 */ deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength];
/* 063 */
/* 064 */ int deserializetoobject_loopIndex = 0;
/* 065 */ while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
/* 066 */ MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex));
/* 067 */ MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
/* 068 */
/* 069 */ if (MapObjects_loopIsNull2) {
/* 070 */ throw new RuntimeException(((java.lang.String) references[0]));
/* 071 */ }
/* 072 */ if (false) {
/* 073 */ deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
/* 074 */ } else {
/* 075 */ deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2;
/* 076 */ }
/* 077 */
/* 078 */ deserializetoobject_loopIndex += 1;
/* 079 */ }
/* 080 */
/* 081 */ deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /*###*/
/* 082 */ }
/* 083 */ boolean deserializetoobject_isNull = true;
/* 084 */ double[] deserializetoobject_value = null;
/* 085 */ if (!inputadapter_isNull) {
/* 086 */ deserializetoobject_isNull = false;
/* 087 */ if (!deserializetoobject_isNull) {
/* 088 */ Object deserializetoobject_funcResult = null;
/* 089 */ deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray();
/* 090 */ if (deserializetoobject_funcResult == null) {
/* 091 */ deserializetoobject_isNull = true;
/* 092 */ } else {
/* 093 */ deserializetoobject_value = (double[]) deserializetoobject_funcResult;
/* 094 */ }
/* 095 */
/* 096 */ }
/* 097 */ deserializetoobject_isNull = deserializetoobject_value == null;
/* 098 */ }
/* 099 */
/* 100 */ boolean mapelements_isNull = true;
/* 101 */ double[] mapelements_value = null;
/* 102 */ if (!false) {
/* 103 */ mapelements_resultIsNull = false;
/* 104 */
/* 105 */ if (!mapelements_resultIsNull) {
/* 106 */ mapelements_resultIsNull = deserializetoobject_isNull;
/* 107 */ mapelements_argValue = deserializetoobject_value;
/* 108 */ }
/* 109 */
/* 110 */ mapelements_isNull = mapelements_resultIsNull;
/* 111 */ if (!mapelements_isNull) {
/* 112 */ Object mapelements_funcResult = null;
/* 113 */ mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
/* 114 */ if (mapelements_funcResult == null) {
/* 115 */ mapelements_isNull = true;
/* 116 */ } else {
/* 117 */ mapelements_value = (double[]) mapelements_funcResult;
/* 118 */ }
/* 119 */
/* 120 */ }
/* 121 */ mapelements_isNull = mapelements_value == null;
/* 122 */ }
/* 123 */
/* 124 */ serializefromobject_resultIsNull = false;
/* 125 */
/* 126 */ if (!serializefromobject_resultIsNull) {
/* 127 */ serializefromobject_resultIsNull = mapelements_isNull;
/* 128 */ serializefromobject_argValue = mapelements_value;
/* 129 */ }
/* 130 */
/* 131 */ boolean serializefromobject_isNull = serializefromobject_resultIsNull;
/* 132 */ final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
/* 133 */ serializefromobject_isNull = serializefromobject_value == null;
/* 134 */ serializefromobject_holder.reset();
/* 135 */
/* 136 */ serializefromobject_rowWriter.zeroOutNullBytes();
/* 137 */
/* 138 */ if (serializefromobject_isNull) {
/* 139 */ serializefromobject_rowWriter.setNullAt(0);
/* 140 */ } else {
/* 141 */ // Remember the current cursor so that we can calculate how many bytes are
/* 142 */ // written later.
/* 143 */ final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
/* 144 */
/* 145 */ if (serializefromobject_value instanceof UnsafeArrayData) {
/* 146 */ final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
/* 147 */ // grow the global buffer before writing data.
/* 148 */ serializefromobject_holder.grow(serializefromobject_sizeInBytes);
/* 149 */ ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
/* 150 */ serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
/* 151 */
/* 152 */ } else {
/* 153 */ final int serializefromobject_numElements = serializefromobject_value.numElements();
/* 154 */ serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
/* 155 */
/* 156 */ for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
/* 157 */ if (serializefromobject_value.isNullAt(serializefromobject_index)) {
/* 158 */ serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
/* 159 */ } else {
/* 160 */ final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
/* 161 */ serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
/* 162 */ }
/* 163 */ }
/* 164 */ }
/* 165 */
/* 166 */ serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
/* 167 */ }
/* 168 */ serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
/* 169 */ append(serializefromobject_result);
/* 170 */ if (shouldStop()) return;
/* 171 */ }
/* 172 */ }
```
With this PR (eliminated lines 56-62 in the above code)
```java
/* 047 */ protected void processNext() throws java.io.IOException {
/* 048 */ while (inputadapter_input.hasNext() && !stopEarly()) {
/* 049 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 050 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 051 */ ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
/* 052 */
/* 053 */ boolean deserializetoobject_isNull = true;
/* 054 */ double[] deserializetoobject_value = null;
/* 055 */ if (!inputadapter_isNull) {
/* 056 */ deserializetoobject_isNull = false;
/* 057 */ if (!deserializetoobject_isNull) {
/* 058 */ Object deserializetoobject_funcResult = null;
/* 059 */ deserializetoobject_funcResult = inputadapter_value.toDoubleArray();
/* 060 */ if (deserializetoobject_funcResult == null) {
/* 061 */ deserializetoobject_isNull = true;
/* 062 */ } else {
/* 063 */ deserializetoobject_value = (double[]) deserializetoobject_funcResult;
/* 064 */ }
/* 065 */
/* 066 */ }
/* 067 */ deserializetoobject_isNull = deserializetoobject_value == null;
/* 068 */ }
/* 069 */
/* 070 */ boolean mapelements_isNull = true;
/* 071 */ double[] mapelements_value = null;
/* 072 */ if (!false) {
/* 073 */ mapelements_resultIsNull = false;
/* 074 */
/* 075 */ if (!mapelements_resultIsNull) {
/* 076 */ mapelements_resultIsNull = deserializetoobject_isNull;
/* 077 */ mapelements_argValue = deserializetoobject_value;
/* 078 */ }
/* 079 */
/* 080 */ mapelements_isNull = mapelements_resultIsNull;
/* 081 */ if (!mapelements_isNull) {
/* 082 */ Object mapelements_funcResult = null;
/* 083 */ mapelements_funcResult = ((scala.Function1) references[0]).apply(mapelements_argValue);
/* 084 */ if (mapelements_funcResult == null) {
/* 085 */ mapelements_isNull = true;
/* 086 */ } else {
/* 087 */ mapelements_value = (double[]) mapelements_funcResult;
/* 088 */ }
/* 089 */
/* 090 */ }
/* 091 */ mapelements_isNull = mapelements_value == null;
/* 092 */ }
/* 093 */
/* 094 */ serializefromobject_resultIsNull = false;
/* 095 */
/* 096 */ if (!serializefromobject_resultIsNull) {
/* 097 */ serializefromobject_resultIsNull = mapelements_isNull;
/* 098 */ serializefromobject_argValue = mapelements_value;
/* 099 */ }
/* 100 */
/* 101 */ boolean serializefromobject_isNull = serializefromobject_resultIsNull;
/* 102 */ final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
/* 103 */ serializefromobject_isNull = serializefromobject_value == null;
/* 104 */ serializefromobject_holder.reset();
/* 105 */
/* 106 */ serializefromobject_rowWriter.zeroOutNullBytes();
/* 107 */
/* 108 */ if (serializefromobject_isNull) {
/* 109 */ serializefromobject_rowWriter.setNullAt(0);
/* 110 */ } else {
/* 111 */ // Remember the current cursor so that we can calculate how many bytes are
/* 112 */ // written later.
/* 113 */ final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
/* 114 */
/* 115 */ if (serializefromobject_value instanceof UnsafeArrayData) {
/* 116 */ final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
/* 117 */ // grow the global buffer before writing data.
/* 118 */ serializefromobject_holder.grow(serializefromobject_sizeInBytes);
/* 119 */ ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
/* 120 */ serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
/* 121 */
/* 122 */ } else {
/* 123 */ final int serializefromobject_numElements = serializefromobject_value.numElements();
/* 124 */ serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
/* 125 */
/* 126 */ for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
/* 127 */ if (serializefromobject_value.isNullAt(serializefromobject_index)) {
/* 128 */ serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
/* 129 */ } else {
/* 130 */ final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
/* 131 */ serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
/* 132 */ }
/* 133 */ }
/* 134 */ }
/* 135 */
/* 136 */ serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
/* 137 */ }
/* 138 */ serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
/* 139 */ append(serializefromobject_result);
/* 140 */ if (shouldStop()) return;
/* 141 */ }
/* 142 */ }
```
## How was this patch tested?
Add test suites into `DatasetPrimitiveSuite`
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #17568 from kiszk/SPARK-20254.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
If a plan has multi-level successive joins, e.g.:
```
Join
/ \
Union t5
/ \
Join t4
/ \
Join t3
/ \
t1 t2
```
Currently we fail to reorder the inside joins, i.e. t1, t2, t3.
In join reorder, we use `OrderedJoin` to indicate a join has been ordered, such that when transforming down the plan, these joins don't need to be rerodered again.
But there's a problem in the definition of `OrderedJoin`:
The real join node is a parameter, but not a child. This breaks the transform procedure because `mapChildren` applies transform function on parameters which should be children.
In this patch, we change `OrderedJoin` to a class having the same structure as a join node.
## How was this patch tested?
Add a corresponding test case.
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes #17668 from wzhfy/recursiveReorder.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
fix typo
## How was this patch tested?
manual
Author: Felix Cheung <felixcheung_m@hotmail.com>
Closes #17663 from felixcheung/likedoctypo.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Replace non-existent `repartitionBy` with `distribute` in `CollapseRepartitionSuite`.
## How was this patch tested?
local build and `catalyst/testOnly *CollapseRepartitionSuite`
Author: Jacek Laskowski <jacek@japila.pl>
Closes #17657 from jaceklaskowski/CollapseRepartitionSuite.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This patch fixes a bug in the way LIKE patterns are translated to Java regexes. The bug causes any character following an escaped backslash to be escaped, i.e. there is double-escaping.
A concrete example is the following pattern:`'%\\%'`. The expected Java regex that this pattern should correspond to (according to the behavior described below) is `'.*\\.*'`, however the current situation leads to `'.*\\%'` instead.
---
Update: in light of the discussion that ensued, we should explicitly define the expected behaviour of LIKE expressions, especially in certain edge cases. With the help of gatorsmile, we put together a list of different RDBMS and their variations wrt to certain standard features.
| RDBMS\Features | Wildcards | Default escape [1] | Case sensitivity |
| --- | --- | --- | --- |
| [MS SQL Server](https://msdn.microsoft.com/en-us/library/ms179859.aspx) | _, %, [], [^] | none | no |
| [Oracle](https://docs.oracle.com/cd/B12037_01/server.101/b10759/conditions016.htm) | _, % | none | yes |
| [DB2 z/OS](http://www.ibm.com/support/knowledgecenter/SSEPEK_11.0.0/sqlref/src/tpc/db2z_likepredicate.html) | _, % | none | yes |
| [MySQL](http://dev.mysql.com/doc/refman/5.7/en/string-comparison-functions.html) | _, % | none | no |
| [PostreSQL](https://www.postgresql.org/docs/9.0/static/functions-matching.html) | _, % | \ | yes |
| [Hive](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF) | _, % | none | yes |
| Current Spark | _, % | \ | yes |
[1] Default escape character: most systems do not have a default escape character, instead the user can specify one by calling a like expression with an escape argument [A] LIKE [B] ESCAPE [C]. This syntax is currently not supported by Spark, however I would volunteer to implement this feature in a separate ticket.
The specifications are often quite terse and certain scenarios are undocumented, so here is a list of scenarios that I am uncertain about and would appreciate any input. Specifically I am looking for feedback on whether or not Spark's current behavior should be changed.
1. [x] Ending a pattern with the escape sequence, e.g. `like 'a\'`.
PostreSQL gives an error: 'LIKE pattern must not end with escape character', which I personally find logical. Currently, Spark allows "non-terminated" escapes and simply ignores them as part of the pattern.
According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), ending a pattern in an escape character is invalid.
_Proposed new behaviour in Spark: throw AnalysisException_
2. [x] Empty input, e.g. `'' like ''`
Postgres and DB2 will match empty input only if the pattern is empty as well, any other combination of empty input will not match. Spark currently follows this rule.
3. [x] Escape before a non-special character, e.g. `'a' like '\a'`.
Escaping a non-wildcard character is not really documented but PostgreSQL just treats it verbatim, which I also find the least surprising behavior. Spark does the same.
According to [DB2's documentation](http://www.ibm.com/support/knowledgecenter/SSEPGG_9.7.0/com.ibm.db2.luw.messages.sql.doc/doc/msql00130n.html), it is invalid to follow an escape character with anything other than an escape character, an underscore or a percent sign.
_Proposed new behaviour in Spark: throw AnalysisException_
The current specification is also described in the operator's source code in this patch.
## How was this patch tested?
Extra case in regex unit tests.
Author: Jakob Odersky <jakob@odersky.com>
This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>
Closes #15398 from jodersky/SPARK-17647.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
persistent functions
### What changes were proposed in this pull request?
The session catalog caches some persistent functions in the `FunctionRegistry`, so there can be duplicates. Our Catalog API `listFunctions` does not handle it.
It would be better if `SessionCatalog` API can de-duplciate the records, instead of doing it by each API caller. In `FunctionRegistry`, our functions are identified by the unquoted string. Thus, this PR is try to parse it using our parser interface and then de-duplicate the names.
### How was this patch tested?
Added test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes #17646 from gatorsmile/showFunctions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
In https://github.com/apache/spark/pull/17398 we introduced `UnresolvedMapObjects` as a placeholder of `MapObjects`. Unfortunately `UnresolvedMapObjects` is not serializable as its `function` may reference Scala `Type` which is not serializable.
Ideally this is fine, as we will never serialize and send unresolved expressions to executors. However users may accidentally do this, e.g. mistakenly reference an encoder instance when implementing `Aggregator`, we should fix it so that it's just a performance issue(more network traffic) and should not fail the query.
## How was this patch tested?
N/A
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17639 from cloud-fan/minor.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
estimation
## What changes were proposed in this pull request?
Currently when estimating predicates like col > literal or col = literal, we will update min or max in column stats based on literal value. However, literal value is of Catalyst type (internal type), while min/max is of external type. Then for the next predicate, we again need to do type conversion to compare and update column stats. This is awkward and causes many unnecessary conversions in estimation.
To solve this, we use Catalyst type for min/max in `ColumnStat`. Note that the persistent format in metastore is still of external type, so there's no inconsistency for statistics in metastore.
This pr also fixes a bug for boolean type in `IN` condition.
## How was this patch tested?
The changes for ColumnStat are covered by existing tests.
For bug fix, a new test for boolean type in IN condition is added
Author: wangzhenhua <wangzhenhua@huawei.com>
Closes #17630 from wzhfy/refactorColumnStat.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
join enumeration
## What changes were proposed in this pull request?
Implements star-join filter to reduce the search space for dynamic programming join enumeration. Consider the following join graph:
```
T1 D1 - T2 - T3
\ /
F1
|
D2
star-join: {F1, D1, D2}
non-star: {T1, T2, T3}
```
The following join combinations will be generated:
```
level 0: (F1), (D1), (D2), (T1), (T2), (T3)
level 1: {F1, D1}, {F1, D2}, {T2, T3}
level 2: {F1, D1, D2}
level 3: {F1, D1, D2, T1}, {F1, D1, D2, T2}
level 4: {F1, D1, D2, T1, T2}, {F1, D1, D2, T2, T3 }
level 6: {F1, D1, D2, T1, T2, T3}
```
## How was this patch tested?
New test suite ```StarJOinCostBasedReorderSuite.scala```.
Author: Ioana Delaney <ioanamdelaney@gmail.com>
Closes #17546 from ioana-delaney/starSchemaCBOv3.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
representation
## What changes were proposed in this pull request?
AssertNotNull's toString/simpleString dumps the entire walkedTypePath. walkedTypePath is used for error message reporting and shouldn't be part of the output.
## How was this patch tested?
Manually tested.
Author: Reynold Xin <rxin@databricks.com>
Closes #17616 from rxin/SPARK-20304.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
### What changes were proposed in this pull request?
Session catalog API `createTempFunction` is being used by Hive build-in functions, persistent functions, and temporary functions. Thus, the name is confusing. This PR is to rename it by `registerFunction`. Also we can move construction of `FunctionBuilder` and `ExpressionInfo` into the new `registerFunction`, instead of duplicating the logics everywhere.
In the next PRs, the remaining Function-related APIs also need cleanups.
### How was this patch tested?
Existing test cases.
Author: Xiao Li <gatorsmile@gmail.com>
Closes #17615 from gatorsmile/cleanupCreateTempFunction.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR proposes to run Spark unidoc to test Javadoc 8 build as Javadoc 8 is easily re-breakable.
There are several problems with it:
- It introduces little extra bit of time to run the tests. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
(see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
To complete this automated build, It also suggests to fix existing Javadoc breaks / ones introduced by test codes as described above.
There fixes are similar instances that previously fixed. Please refer https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
## How was this patch tested?
Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
This was tested via manually adding `time.time()` as below:
```diff
profiles_and_goals = build_profiles + sbt_goals
print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
" ".join(profiles_and_goals))
+ import time
+ st = time.time()
exec_sbt(profiles_and_goals)
+ print("Elapsed :[%s]" % str(time.time() - st))
```
produces
```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17477 from HyukjinKwon/SPARK-18692.
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Update count distinct error message for streaming datasets/dataframes to match current behavior. These aggregations are not yet supported, regardless of whether the dataset/dataframe is aggregated.
Author: jtoka <jason.tokayer@gmail.com>
Closes #17609 from jtoka/master.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
structurally the same
## What changes were proposed in this pull request?
When we perform a cast expression and the from and to types are structurally the same (having the same structure but different field names), we should be able to skip the actual cast.
## How was this patch tested?
Added unit tests for the newly introduced functions.
Author: Reynold Xin <rxin@databricks.com>
Closes #17614 from rxin/SPARK-20302.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
does not work.
## What changes were proposed in this pull request?
The sameResult() method does not work when the logical plan contains subquery expressions.
**Before the fix**
```SQL
scala> val ds = spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)")
ds: org.apache.spark.sql.DataFrame = [c1: int]
scala> ds.cache
res13: ds.type = [c1: int]
scala> spark.sql("select * from s1 where s1.c1 in (select s2.c1 from s2 where s1.c1 = s2.c1)").explain(true)
== Analyzed Logical Plan ==
c1: int
Project [c1#86]
+- Filter c1#86 IN (list#78 [c1#86])
: +- Project [c1#87]
: +- Filter (outer(c1#86) = c1#87)
: +- SubqueryAlias s2
: +- Relation[c1#87] parquet
+- SubqueryAlias s1
+- Relation[c1#86] parquet
== Optimized Logical Plan ==
Join LeftSemi, ((c1#86 = c1#87) && (c1#86 = c1#87))
:- Relation[c1#86] parquet
+- Relation[c1#87] parquet
```
**Plan after fix**
```SQL
== Analyzed Logical Plan ==
c1: int
Project [c1#22]
+- Filter c1#22 IN (list#14 [c1#22])
: +- Project [c1#23]
: +- Filter (outer(c1#22) = c1#23)
: +- SubqueryAlias s2
: +- Relation[c1#23] parquet
+- SubqueryAlias s1
+- Relation[c1#22] parquet
== Optimized Logical Plan ==
InMemoryRelation [c1#22], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
+- *BroadcastHashJoin [c1#1, c1#1], [c1#2, c1#2], LeftSemi, BuildRight
:- *FileScan parquet default.s1[c1#1] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s1], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
+- BroadcastExchange HashedRelationBroadcastMode(List((shiftleft(cast(input[0, int, true] as bigint), 32) | (cast(input[0, int, true] as bigint) & 4294967295))))
+- *FileScan parquet default.s2[c1#2] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/dbiswal/mygit/apache/spark/bin/spark-warehouse/s2], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<c1:int>
```
## How was this patch tested?
New tests are added to CachedTableSuite.
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #17330 from dilipbiswal/subquery_cache_final.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
NaNvl(DoubleType, DoubleType)
## What changes were proposed in this pull request?
`NaNvl(float value, null)` will be converted into `NaNvl(float value, Cast(null, DoubleType))` and finally `NaNvl(Cast(float value, DoubleType), Cast(null, DoubleType))`.
This will cause mismatching in the output type when the input type is float.
By adding extra rule in TypeCoercion can resolve this issue.
## How was this patch tested?
unite tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: DB Tsai <dbt@netflix.com>
Closes #17606 from dbtsai/fixNaNvl.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Dataset typed API currently uses NewInstance to box primitive types (i.e. calling the constructor). Instead, it'd be slightly more idiomatic in Java to use PrimitiveType.valueOf, which can be invoked using StaticInvoke expression.
## How was this patch tested?
The change should be covered by existing tests for Dataset encoders.
Author: Reynold Xin <rxin@databricks.com>
Closes #17604 from rxin/SPARK-20289.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Similar to `ListQuery`, `Exists` should not be evaluated in `Join` operator too.
## How was this patch tested?
Jenkins tests.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #17491 from viirya/dont-push-exists-to-join.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This is a regression caused by SPARK-19716.
Before SPARK-19716, we will cast an array field to the expected array type. However, after SPARK-19716, the cast is removed, but we forgot to push the cast to the element level.
## How was this patch tested?
new regression tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17587 from cloud-fan/array.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
locale bug" causes Spark problems
## What changes were proposed in this pull request?
Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
## How was this patch tested?
Existing tests.
Author: Sean Owen <sowen@cloudera.com>
Closes #17527 from srowen/SPARK-20156.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Conditions
## What changes were proposed in this pull request?
```
sql("SELECT t1.b, rand(0) as r FROM cachedData, cachedData t1 GROUP BY t1.b having r > 0.5").show()
```
We will get the following error:
```
Job aborted due to stage failure: Task 1 in stage 4.0 failed 1 times, most recent failure: Lost task 1.0 in stage 4.0 (TID 8, localhost, executor driver): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificPredicate.eval(Unknown Source)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at org.apache.spark.sql.execution.joins.BroadcastNestedLoopJoinExec$$anonfun$org$apache$spark$sql$execution$joins$BroadcastNestedLoopJoinExec$$boundCondition$1.apply(BroadcastNestedLoopJoinExec.scala:87)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:463)
```
Filters could be pushed down to the join conditions by the optimizer rule `PushPredicateThroughJoin`. However, Analyzer [blocks users to add non-deterministics conditions](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L386-L395) (For details, see the PR https://github.com/apache/spark/pull/7535).
We should not push down non-deterministic conditions; otherwise, we need to explicitly initialize the non-deterministic expressions. This PR is to simply block it.
### How was this patch tested?
Added a test case
Author: Xiao Li <gatorsmile@gmail.com>
Closes #17585 from gatorsmile/joinRandCondition.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This PR proposes to add `IGNORE NULLS` keyword in `first`/`last` in Spark's parser likewise http://docs.oracle.com/cd/B19306_01/server.102/b14200/functions057.htm. This simply maps the keywords to existing `ignoreNullsExpr`.
**Before**
```scala
scala> sql("select first('a' IGNORE NULLS)").show()
```
```
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'NULLS' expecting {')', ','}(line 1, pos 24)
== SQL ==
select first('a' IGNORE NULLS)
------------------------^^^
at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:210)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:112)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:46)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:66)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:622)
... 48 elided
```
**After**
```scala
scala> sql("select first('a' IGNORE NULLS)").show()
```
```
+--------------+
|first(a, true)|
+--------------+
| a|
+--------------+
```
## How was this patch tested?
Unit tests in `ExpressionParserSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17566 from HyukjinKwon/SPARK-19518.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Like `Expression`, `QueryPlan` should also have a `semanticHash` method, then we can put plans to a hash map and look it up fast. This PR refactors `QueryPlan` to follow `Expression` and put all the normalization logic in `QueryPlan.canonicalized`, so that it's very natural to implement `semanticHash`.
follow-up: improve `CacheManager` to leverage this `semanticHash` and speed up plan lookup, instead of iterating all cached plans.
## How was this patch tested?
existing tests. Note that we don't need to test the `semanticHash` method, once the existing tests prove `sameResult` is correct, we are good.
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17541 from cloud-fan/plan-semantic.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Spark runtime routines in generated Java code
## What changes were proposed in this pull request?
This PR elminates unnecessary nullchecks of a return value from known Spark runtime routines. We know whether a given Spark runtime routine returns ``null`` or not (e.g. ``ArrayData.toDoubleArray()`` never returns ``null``). Thus, we can eliminate a null check for the return value from the Spark runtime routine.
When we run the following example program, now we get the Java code "Without this PR". In this code, since we know ``ArrayData.toDoubleArray()`` never returns ``null```, we can eliminate null checks at lines 90-92, and 97.
```java
val ds = sparkContext.parallelize(Seq(Array(1.1, 2.2)), 1).toDS.cache
ds.count
ds.map(e => e).show
```
Without this PR
```java
/* 050 */ protected void processNext() throws java.io.IOException {
/* 051 */ while (inputadapter_input.hasNext() && !stopEarly()) {
/* 052 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 053 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 054 */ ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
/* 055 */
/* 056 */ ArrayData deserializetoobject_value1 = null;
/* 057 */
/* 058 */ if (!inputadapter_isNull) {
/* 059 */ int deserializetoobject_dataLength = inputadapter_value.numElements();
/* 060 */
/* 061 */ Double[] deserializetoobject_convertedArray = null;
/* 062 */ deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength];
/* 063 */
/* 064 */ int deserializetoobject_loopIndex = 0;
/* 065 */ while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
/* 066 */ MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex));
/* 067 */ MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
/* 068 */
/* 069 */ if (MapObjects_loopIsNull2) {
/* 070 */ throw new RuntimeException(((java.lang.String) references[0]));
/* 071 */ }
/* 072 */ if (false) {
/* 073 */ deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
/* 074 */ } else {
/* 075 */ deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2;
/* 076 */ }
/* 077 */
/* 078 */ deserializetoobject_loopIndex += 1;
/* 079 */ }
/* 080 */
/* 081 */ deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /*###*/
/* 082 */ }
/* 083 */ boolean deserializetoobject_isNull = true;
/* 084 */ double[] deserializetoobject_value = null;
/* 085 */ if (!inputadapter_isNull) {
/* 086 */ deserializetoobject_isNull = false;
/* 087 */ if (!deserializetoobject_isNull) {
/* 088 */ Object deserializetoobject_funcResult = null;
/* 089 */ deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray();
/* 090 */ if (deserializetoobject_funcResult == null) {
/* 091 */ deserializetoobject_isNull = true;
/* 092 */ } else {
/* 093 */ deserializetoobject_value = (double[]) deserializetoobject_funcResult;
/* 094 */ }
/* 095 */
/* 096 */ }
/* 097 */ deserializetoobject_isNull = deserializetoobject_value == null;
/* 098 */ }
/* 099 */
/* 100 */ boolean mapelements_isNull = true;
/* 101 */ double[] mapelements_value = null;
/* 102 */ if (!false) {
/* 103 */ mapelements_resultIsNull = false;
/* 104 */
/* 105 */ if (!mapelements_resultIsNull) {
/* 106 */ mapelements_resultIsNull = deserializetoobject_isNull;
/* 107 */ mapelements_argValue = deserializetoobject_value;
/* 108 */ }
/* 109 */
/* 110 */ mapelements_isNull = mapelements_resultIsNull;
/* 111 */ if (!mapelements_isNull) {
/* 112 */ Object mapelements_funcResult = null;
/* 113 */ mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
/* 114 */ if (mapelements_funcResult == null) {
/* 115 */ mapelements_isNull = true;
/* 116 */ } else {
/* 117 */ mapelements_value = (double[]) mapelements_funcResult;
/* 118 */ }
/* 119 */
/* 120 */ }
/* 121 */ mapelements_isNull = mapelements_value == null;
/* 122 */ }
/* 123 */
/* 124 */ serializefromobject_resultIsNull = false;
/* 125 */
/* 126 */ if (!serializefromobject_resultIsNull) {
/* 127 */ serializefromobject_resultIsNull = mapelements_isNull;
/* 128 */ serializefromobject_argValue = mapelements_value;
/* 129 */ }
/* 130 */
/* 131 */ boolean serializefromobject_isNull = serializefromobject_resultIsNull;
/* 132 */ final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
/* 133 */ serializefromobject_isNull = serializefromobject_value == null;
/* 134 */ serializefromobject_holder.reset();
/* 135 */
/* 136 */ serializefromobject_rowWriter.zeroOutNullBytes();
/* 137 */
/* 138 */ if (serializefromobject_isNull) {
/* 139 */ serializefromobject_rowWriter.setNullAt(0);
/* 140 */ } else {
/* 141 */ // Remember the current cursor so that we can calculate how many bytes are
/* 142 */ // written later.
/* 143 */ final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
/* 144 */
/* 145 */ if (serializefromobject_value instanceof UnsafeArrayData) {
/* 146 */ final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
/* 147 */ // grow the global buffer before writing data.
/* 148 */ serializefromobject_holder.grow(serializefromobject_sizeInBytes);
/* 149 */ ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
/* 150 */ serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
/* 151 */
/* 152 */ } else {
/* 153 */ final int serializefromobject_numElements = serializefromobject_value.numElements();
/* 154 */ serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
/* 155 */
/* 156 */ for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
/* 157 */ if (serializefromobject_value.isNullAt(serializefromobject_index)) {
/* 158 */ serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
/* 159 */ } else {
/* 160 */ final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
/* 161 */ serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
/* 162 */ }
/* 163 */ }
/* 164 */ }
/* 165 */
/* 166 */ serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
/* 167 */ }
/* 168 */ serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
/* 169 */ append(serializefromobject_result);
/* 170 */ if (shouldStop()) return;
/* 171 */ }
/* 172 */ }
```
With this PR (removed most of lines 90-97 in the above code)
```java
/* 050 */ protected void processNext() throws java.io.IOException {
/* 051 */ while (inputadapter_input.hasNext() && !stopEarly()) {
/* 052 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
/* 053 */ boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
/* 054 */ ArrayData inputadapter_value = inputadapter_isNull ? null : (inputadapter_row.getArray(0));
/* 055 */
/* 056 */ ArrayData deserializetoobject_value1 = null;
/* 057 */
/* 058 */ if (!inputadapter_isNull) {
/* 059 */ int deserializetoobject_dataLength = inputadapter_value.numElements();
/* 060 */
/* 061 */ Double[] deserializetoobject_convertedArray = null;
/* 062 */ deserializetoobject_convertedArray = new Double[deserializetoobject_dataLength];
/* 063 */
/* 064 */ int deserializetoobject_loopIndex = 0;
/* 065 */ while (deserializetoobject_loopIndex < deserializetoobject_dataLength) {
/* 066 */ MapObjects_loopValue2 = (double) (inputadapter_value.getDouble(deserializetoobject_loopIndex));
/* 067 */ MapObjects_loopIsNull2 = inputadapter_value.isNullAt(deserializetoobject_loopIndex);
/* 068 */
/* 069 */ if (MapObjects_loopIsNull2) {
/* 070 */ throw new RuntimeException(((java.lang.String) references[0]));
/* 071 */ }
/* 072 */ if (false) {
/* 073 */ deserializetoobject_convertedArray[deserializetoobject_loopIndex] = null;
/* 074 */ } else {
/* 075 */ deserializetoobject_convertedArray[deserializetoobject_loopIndex] = MapObjects_loopValue2;
/* 076 */ }
/* 077 */
/* 078 */ deserializetoobject_loopIndex += 1;
/* 079 */ }
/* 080 */
/* 081 */ deserializetoobject_value1 = new org.apache.spark.sql.catalyst.util.GenericArrayData(deserializetoobject_convertedArray); /*###*/
/* 082 */ }
/* 083 */ boolean deserializetoobject_isNull = true;
/* 084 */ double[] deserializetoobject_value = null;
/* 085 */ if (!inputadapter_isNull) {
/* 086 */ deserializetoobject_isNull = false;
/* 087 */ if (!deserializetoobject_isNull) {
/* 088 */ Object deserializetoobject_funcResult = null;
/* 089 */ deserializetoobject_funcResult = deserializetoobject_value1.toDoubleArray();
/* 090 */ deserializetoobject_value = (double[]) deserializetoobject_funcResult;
/* 091 */
/* 092 */ }
/* 093 */
/* 094 */ }
/* 095 */
/* 096 */ boolean mapelements_isNull = true;
/* 097 */ double[] mapelements_value = null;
/* 098 */ if (!false) {
/* 099 */ mapelements_resultIsNull = false;
/* 100 */
/* 101 */ if (!mapelements_resultIsNull) {
/* 102 */ mapelements_resultIsNull = deserializetoobject_isNull;
/* 103 */ mapelements_argValue = deserializetoobject_value;
/* 104 */ }
/* 105 */
/* 106 */ mapelements_isNull = mapelements_resultIsNull;
/* 107 */ if (!mapelements_isNull) {
/* 108 */ Object mapelements_funcResult = null;
/* 109 */ mapelements_funcResult = ((scala.Function1) references[1]).apply(mapelements_argValue);
/* 110 */ if (mapelements_funcResult == null) {
/* 111 */ mapelements_isNull = true;
/* 112 */ } else {
/* 113 */ mapelements_value = (double[]) mapelements_funcResult;
/* 114 */ }
/* 115 */
/* 116 */ }
/* 117 */ mapelements_isNull = mapelements_value == null;
/* 118 */ }
/* 119 */
/* 120 */ serializefromobject_resultIsNull = false;
/* 121 */
/* 122 */ if (!serializefromobject_resultIsNull) {
/* 123 */ serializefromobject_resultIsNull = mapelements_isNull;
/* 124 */ serializefromobject_argValue = mapelements_value;
/* 125 */ }
/* 126 */
/* 127 */ boolean serializefromobject_isNull = serializefromobject_resultIsNull;
/* 128 */ final ArrayData serializefromobject_value = serializefromobject_resultIsNull ? null : org.apache.spark.sql.catalyst.expressions.UnsafeArrayData.fromPrimitiveArray(serializefromobject_argValue);
/* 129 */ serializefromobject_isNull = serializefromobject_value == null;
/* 130 */ serializefromobject_holder.reset();
/* 131 */
/* 132 */ serializefromobject_rowWriter.zeroOutNullBytes();
/* 133 */
/* 134 */ if (serializefromobject_isNull) {
/* 135 */ serializefromobject_rowWriter.setNullAt(0);
/* 136 */ } else {
/* 137 */ // Remember the current cursor so that we can calculate how many bytes are
/* 138 */ // written later.
/* 139 */ final int serializefromobject_tmpCursor = serializefromobject_holder.cursor;
/* 140 */
/* 141 */ if (serializefromobject_value instanceof UnsafeArrayData) {
/* 142 */ final int serializefromobject_sizeInBytes = ((UnsafeArrayData) serializefromobject_value).getSizeInBytes();
/* 143 */ // grow the global buffer before writing data.
/* 144 */ serializefromobject_holder.grow(serializefromobject_sizeInBytes);
/* 145 */ ((UnsafeArrayData) serializefromobject_value).writeToMemory(serializefromobject_holder.buffer, serializefromobject_holder.cursor);
/* 146 */ serializefromobject_holder.cursor += serializefromobject_sizeInBytes;
/* 147 */
/* 148 */ } else {
/* 149 */ final int serializefromobject_numElements = serializefromobject_value.numElements();
/* 150 */ serializefromobject_arrayWriter.initialize(serializefromobject_holder, serializefromobject_numElements, 8);
/* 151 */
/* 152 */ for (int serializefromobject_index = 0; serializefromobject_index < serializefromobject_numElements; serializefromobject_index++) {
/* 153 */ if (serializefromobject_value.isNullAt(serializefromobject_index)) {
/* 154 */ serializefromobject_arrayWriter.setNullDouble(serializefromobject_index);
/* 155 */ } else {
/* 156 */ final double serializefromobject_element = serializefromobject_value.getDouble(serializefromobject_index);
/* 157 */ serializefromobject_arrayWriter.write(serializefromobject_index, serializefromobject_element);
/* 158 */ }
/* 159 */ }
/* 160 */ }
/* 161 */
/* 162 */ serializefromobject_rowWriter.setOffsetAndSize(0, serializefromobject_tmpCursor, serializefromobject_holder.cursor - serializefromobject_tmpCursor);
/* 163 */ }
/* 164 */ serializefromobject_result.setTotalSize(serializefromobject_holder.totalSize());
/* 165 */ append(serializefromobject_result);
/* 166 */ if (shouldStop()) return;
/* 167 */ }
/* 168 */ }
```
## How was this patch tested?
Add test suites to ``DatasetPrimitiveSuite``
Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Closes #17569 from kiszk/SPARK-20253.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
AssertNotNull currently throws RuntimeException. It should throw NullPointerException, which is more specific.
## How was this patch tested?
N/A
Author: Reynold Xin <rxin@databricks.com>
Closes #17573 from rxin/SPARK-20262.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
non-deterministic expressions
## What changes were proposed in this pull request?
Similar to `Project`, when `Aggregate` has non-deterministic expressions, we should not push predicate down through it, as it will change the number of input rows and thus change the evaluation result of non-deterministic expressions in `Aggregate`.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17562 from cloud-fan/filter.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Currently `LogicalRelation` has a `expectedOutputAttributes` parameter, which makes it hard to reason about what the actual output is. Like other leaf nodes, `LogicalRelation` should also take `output` as a parameter, to simplify the logic
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17552 from cloud-fan/minor.
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This is a tiny addendum to SPARK-19495 to remove the private visibility for copy, which is the only package private method in the entire file.
## How was this patch tested?
N/A - no semantic change.
Author: Reynold Xin <rxin@databricks.com>
Closes #17555 from rxin/SPARK-19495-2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
detection in CBO
## What changes were proposed in this pull request?
This commit moves star schema code from ```join.scala``` to ```StarSchemaDetection.scala```. It also applies some minor fixes in ```StarJoinReorderSuite.scala```.
## How was this patch tested?
Run existing ```StarJoinReorderSuite.scala```.
Author: Ioana Delaney <ioanamdelaney@gmail.com>
Closes #17544 from ioana-delaney/starSchemaCBOv2.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
timezone settings
## What changes were proposed in this pull request?
Make sure SESSION_LOCAL_TIMEZONE reflects the change in JVM's default timezone setting. Currently several timezone related tests fail as the change to default timezone is not picked up by SQLConf.
## How was this patch tested?
Added an unit test in ConfigEntrySuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #17537 from dilipbiswal/timezone_debug.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Previously when we construct deserializer expression for array type, we will first cast the corresponding field to expected array type and then apply `MapObjects`.
However, by doing that, we lose the opportunity to do by-name resolution for struct type inside array type. In this PR, I introduce a `UnresolvedMapObjects` to hold the lambda function and the input array expression. Then during analysis, after the input array expression is resolved, we get the actual array element type and apply by-name resolution. Then we don't need to add `Cast` for array type when constructing the deserializer expression, as the element type is determined later at analyzer.
## How was this patch tested?
new regression test
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17398 from cloud-fan/dataset.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/17285 .
## How was this patch tested?
existing tests
Author: Wenchen Fan <wenchen@databricks.com>
Closes #17521 from cloud-fan/conf.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
conventions in SparkSession.Catalog APIs
### What changes were proposed in this pull request?
Observed by felixcheung , in `SparkSession`.`Catalog` APIs, we have different conventions/rules for table/function identifiers/names. Most APIs accept the qualified name (i.e., `databaseName`.`tableName` or `databaseName`.`functionName`). However, the following five APIs do not accept it.
- def listColumns(tableName: String): Dataset[Column]
- def getTable(tableName: String): Table
- def getFunction(functionName: String): Function
- def tableExists(tableName: String): Boolean
- def functionExists(functionName: String): Boolean
To make them consistent with the other Catalog APIs, this PR does the changes, updates the function/API comments and adds the `params` to clarify the inputs we allow.
### How was this patch tested?
Added the test cases .
Author: Xiao Li <gatorsmile@gmail.com>
Closes #17518 from gatorsmile/tableIdentifier.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
### What changes were proposed in this pull request?
This PR is to unify and clean up the outputs of `DESC EXTENDED/FORMATTED` and `SHOW TABLE EXTENDED` by moving the logics into the Catalog interface. The output formats are improved. We also add the missing attributes. It impacts the DDL commands like `SHOW TABLE EXTENDED`, `DESC EXTENDED` and `DESC FORMATTED`.
In addition, by following what we did in Dataset API `printSchema`, we can use `treeString` to show the schema in the more readable way.
Below is the current way:
```
Schema: STRUCT<`a`: STRING (nullable = true), `b`: INT (nullable = true), `c`: STRING (nullable = true), `d`: STRING (nullable = true)>
```
After the change, it should look like
```
Schema: root
|-- a: string (nullable = true)
|-- b: integer (nullable = true)
|-- c: string (nullable = true)
|-- d: string (nullable = true)
```
### How was this patch tested?
`describe.sql` and `show-tables.sql`
Author: Xiao Li <gatorsmile@gmail.com>
Closes #17394 from gatorsmile/descFollowUp.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
**Description** from JIRA
The TimestampType in Spark SQL is of microsecond precision. Ideally, we should convert Spark SQL timestamp values into Parquet TIMESTAMP_MICROS. But unfortunately parquet-mr hasn't supported it yet.
For the read path, we should be able to read TIMESTAMP_MILLIS Parquet values and pad a 0 microsecond part to read values.
For the write path, currently we are writing timestamps as INT96, similar to Impala and Hive. One alternative is that, we can have a separate SQL option to let users be able to write Spark SQL timestamp values as TIMESTAMP_MILLIS. Of course, in this way the microsecond part will be truncated.
## How was this patch tested?
Added new tests in ParquetQuerySuite and ParquetIOSuite
Author: Dilip Biswal <dbiswal@us.ibm.com>
Closes #15332 from dilipbiswal/parquet-time-millis.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
In SQL queries, we also see predicate expressions involving two columns such as "column-1 (op) column-2" where column-1 and column-2 belong to same table. Note that, if column-1 and column-2 belong to different tables, then it is a join operator's work, NOT a filter operator's work.
This PR estimates filter selectivity on two columns of same table. For example, multiple tpc-h queries have this predicate "WHERE l_commitdate < l_receiptdate"
## How was this patch tested?
We added 6 new test cases to test various logical predicates involving two columns of same table.
Please review http://spark.apache.org/contributing.html before opening a pull request.
Author: Ron Hu <ron.hu@huawei.com>
Author: U-CHINA\r00754707 <r00754707@R00754707-SC04.china.huawei.com>
Closes #17415 from ron8hu/filterTwoColumns.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
Range in SQL should be case insensitive
## How was this patch tested?
unit test
Author: samelamin <hussam.elamin@gmail.com>
Author: samelamin <sam_elamin@discovery.com>
Closes #17487 from samelamin/SPARK-20145.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
## What changes were proposed in this pull request?
This patch implements `listPartitionsByFilter()` for `InMemoryCatalog` and thus resolves an outstanding TODO causing the `PruneFileSourcePartitions` optimizer rule not to apply when "spark.sql.catalogImplementation" is set to "in-memory" (which is the default).
The change is straightforward: it extracts the code for further filtering of the list of partitions returned by the metastore's `getPartitionsByFilter()` out from `HiveExternalCatalog` into `ExternalCatalogUtils` and calls this new function from `InMemoryCatalog` on the whole list of partitions.
Now that this method is implemented we can always pass the `CatalogTable` to the `DataSource` in `FindDataSourceTable`, so that the latter is resolved to a relation with a `CatalogFileIndex`, which is what the `PruneFileSourcePartitions` rule matches for.
## How was this patch tested?
Ran existing tests and added new test for `listPartitionsByFilter` in `ExternalCatalogSuite`, which is subclassed by both `InMemoryCatalogSuite` and `HiveExternalCatalogSuite`.
Author: Adrian Ionescu <adrian@databricks.com>
Closes #17510 from adrian-ionescu/InMemoryCatalog.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
(FastDateFormat specific) in CSV/JSON timeformat options
## What changes were proposed in this pull request?
This PR proposes to use `XXX` format instead of `ZZ`. `ZZ` seems a `FastDateFormat` specific.
`ZZ` supports "ISO 8601 extended format time zones" but it seems `FastDateFormat` specific option.
I misunderstood this is compatible format with `SimpleDateFormat` when this change is introduced.
Please see [SimpleDateFormat documentation]( https://docs.oracle.com/javase/7/docs/api/java/text/SimpleDateFormat.html#iso8601timezone) and [FastDateFormat documentation](https://commons.apache.org/proper/commons-lang/apidocs/org/apache/commons/lang3/time/FastDateFormat.html).
It seems we better replace `ZZ` to `XXX` because they look using the same strategy - [FastDateParser.java#L930](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L930), [FastDateParser.java#L932-L951 ](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L932-L951) and [FastDateParser.java#L596-L601](https://github.com/apache/commons-lang/blob/8767cd4f1a6af07093c1e6c422dae8e574be7e5e/src/main/java/org/apache/commons/lang3/time/FastDateParser.java#L596-L601).
I also checked the codes and manually debugged it for sure. It seems both cases use the same pattern `( Z|(?:[+-]\\d{2}(?::)\\d{2}))`.
_Note that this should be rather a fix about documentation and not the behaviour change because `ZZ` seems invalid date format in `SimpleDateFormat` as documented in `DataFrameReader` and etc, and both `ZZ` and `XXX` look identically working with `FastDateFormat`_
Current documentation is as below:
```
* <li>`timestampFormat` (default `yyyy-MM-dd'T'HH:mm:ss.SSSZZ`): sets the string that
* indicates a timestamp format. Custom date formats follow the formats at
* `java.text.SimpleDateFormat`. This applies to timestamp type.</li>
```
## How was this patch tested?
Existing tests should cover this. Also, manually tested as below (BTW, I don't think these are worth being added as tests within Spark):
**Parse**
```scala
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res4: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res10: java.util.Date = Tue Mar 21 09:00:00 KST 2017
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000-11:00"
at java.text.DateFormat.parse(DateFormat.java:366)
... 48 elided
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
java.text.ParseException: Unparseable date: "2017-03-21T00:00:00.000Z"
at java.text.DateFormat.parse(DateFormat.java:366)
... 48 elided
```
```scala
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00")
res7: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000Z")
res1: java.util.Date = Tue Mar 21 09:00:00 KST 2017
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000-11:00")
res8: java.util.Date = Tue Mar 21 20:00:00 KST 2017
scala> org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ").parse("2017-03-21T00:00:00.000Z")
res2: java.util.Date = Tue Mar 21 09:00:00 KST 2017
```
**Format**
```scala
scala> new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").format(new java.text.SimpleDateFormat("yyyy-MM-dd'T'HH:mm:ss.SSSXXX").parse("2017-03-21T00:00:00.000-11:00"))
res6: String = 2017-03-21T20:00:00.000+09:00
```
```scala
scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSZZ")
fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSZZ,ko_KR,Asia/Seoul]
scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
res1: String = 2017-03-21T20:00:00.000+09:00
scala> val fd = org.apache.commons.lang3.time.FastDateFormat.getInstance("yyyy-MM-dd'T'HH:mm:ss.SSSXXX")
fd: org.apache.commons.lang3.time.FastDateFormat = FastDateFormat[yyyy-MM-dd'T'HH:mm:ss.SSSXXX,ko_KR,Asia/Seoul]
scala> fd.format(fd.parse("2017-03-21T00:00:00.000-11:00"))
res2: String = 2017-03-21T20:00:00.000+09:00
```
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17489 from HyukjinKwon/SPARK-20166.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
message
## What changes were proposed in this pull request?
Currently, `DataType.fromJson` throws `scala.MatchError` or `java.util.NoSuchElementException` in some cases when the JSON input is invalid as below:
```scala
DataType.fromJson(""""abcd"""")
```
```
java.util.NoSuchElementException: key not found: abcd
at ...
```
```scala
DataType.fromJson("""{"abcd":"a"}""")
```
```
scala.MatchError: JObject(List((abcd,JString(a)))) (of class org.json4s.JsonAST$JObject)
at ...
```
```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
```
```
scala.MatchError: JObject(List((a,JInt(123)))) (of class org.json4s.JsonAST$JObject)
at ...
```
After this PR,
```scala
DataType.fromJson(""""abcd"""")
```
```
java.lang.IllegalArgumentException: Failed to convert the JSON string 'abcd' to a data type.
at ...
```
```scala
DataType.fromJson("""{"abcd":"a"}""")
```
```
java.lang.IllegalArgumentException: Failed to convert the JSON string '{"abcd":"a"}' to a data type.
at ...
```
```scala
DataType.fromJson("""{"fields": [{"a":123}], "type": "struct"}""")
at ...
```
```
java.lang.IllegalArgumentException: Failed to convert the JSON string '{"a":123}' to a field.
```
## How was this patch tested?
Unit test added in `DataTypeSuite`.
Author: hyukjinkwon <gurwls223@gmail.com>
Closes #17468 from HyukjinKwon/fromjson_exception.
|