| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change does two things:
- tags a few tests and adds a mechanism in the build to disable those tags,
both in maven and sbt, for both junit and scalatest suites.
- adds some logic to run-tests.py to disable some tags depending on what files have
changed; that's used to disable expensive tests when a module hasn't explicitly
been changed, to speed up testing for changes that don't directly affect those
modules.
Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #8437 from vanzin/test-tags.
|
|
|
|
|
|
| |
Author: Reynold Xin <rxin@databricks.com>
Closes #8350 from rxin/1.6.
|
|
|
|
|
|
|
|
| |
Move .java files in `src/main/scala` to `src/main/java` root, except for `package-info.java` (to stay next to package.scala)
Author: Sean Owen <sowen@cloudera.com>
Closes #8736 from srowen/SPARK-10576.
|
|
|
|
|
|
|
|
| |
This PR is in conflict with #8535 and #8573. Will update this one when they are merged.
Author: zsxwing <zsxwing@gmail.com>
Closes #8642 from zsxwing/expand-nest-join.
|
|
|
|
|
|
|
|
|
|
| |
Alternative to PR #6122; in this case the refactored out classes are replaced by inner classes with the same name for backwards binary compatibility
* process in a lighter-weight, backwards-compatible way
Author: Edoardo Vacchi <uncommonnonsense@gmail.com>
Closes #6356 from evacchi/sqlctx-refactoring-lite.
|
|
|
|
|
|
|
|
|
|
| |
Or Hive can't read it back correctly.
Thanks to vanzin for reporting this.
Author: Davies Liu <davies@databricks.com>
Closes #8674 from davies/positive_nano.
|
|
|
|
|
|
|
|
|
|
|
| |
spark.sql.hive.metastore.version is wrong.
The default value of hive metastore version is 1.2.1 but the documentation says the value of `spark.sql.hive.metastore.version` is 0.13.1.
Also, we cannot get the default value by `sqlContext.getConf("spark.sql.hive.metastore.version")`.
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #8739 from sarutak/SPARK-10584.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
enabled
This is a follow-up of https://github.com/apache/spark/pull/8317.
When speculation is enabled, there may be multiple tasks writing to the same path. Generally this is OK, as we write to a temporary directory first and only one task can commit the temporary directory to the target path.
However, when we use a direct output committer, tasks write data to the target path directly, without a temporary directory. This causes problems like corrupted data. Please see [PR comment](https://github.com/apache/spark/pull/8191#issuecomment-131598385) for more details.
Unfortunately, we don't have a simple flag to tell whether an output committer writes to a temporary directory or not, so for safety, we have to disable any customized output committer when `speculation` is true.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8687 from cloud-fan/direct-committer.
|
|
|
|
|
|
|
|
|
|
| |
JobContext methods
This is a followup to #8499 which adds a Scalastyle rule to mandate the use of SparkHadoopUtil's JobContext accessor methods and fixes the existing violations.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8521 from JoshRosen/SPARK-10330-part2.
|
|
|
|
|
|
|
|
|
|
|
| |
Adding STDDEV support for DataFrame using 1-pass online /parallel algorithm to compute variance. Please review the code change.
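A one-pass, mergeable variance computation can be sketched as follows. This is a minimal illustration of the general technique (Welford-style updates plus a pairwise merge for parallel partitions), not the code added by this PR; the `Moments` type and its method names are hypothetical.

```scala
// Hypothetical helper for illustration only; not Spark's implementation.
final case class Moments(n: Long, mean: Double, m2: Double) {
  // Fold one value into the running count, mean, and sum of squared deviations.
  def add(x: Double): Moments = {
    val n1 = n + 1
    val delta = x - mean
    val mean1 = mean + delta / n1
    Moments(n1, mean1, m2 + delta * (x - mean1))
  }

  // Combine two partial results computed on disjoint partitions.
  def merge(o: Moments): Moments =
    if (n == 0) o
    else if (o.n == 0) this
    else {
      val total = n + o.n
      val delta = o.mean - mean
      Moments(total, mean + delta * o.n / total, m2 + o.m2 + delta * delta * n * o.n / total)
    }

  // Sample variance (divides by n - 1); undefined for fewer than two values.
  def sampleVariance: Double = if (n > 1) m2 / (n - 1) else Double.NaN
  def sampleStddev: Double = math.sqrt(sampleVariance)
}

object Moments {
  val empty: Moments = Moments(0L, 0.0, 0.0)
  def of(xs: Seq[Double]): Moments = xs.foldLeft(empty)(_ add _)
}
```

Each partition can build its own `Moments` with `add`, and the partial results can then be combined with `merge` in any order, which is what makes a single pass over the data sufficient.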
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
|
|
|
|
|
|
|
|
| |
Fix a few Java API test style issues: unused generic types, exceptions, wrong assert argument order
Author: Sean Owen <sowen@cloudera.com>
Closes #8706 from srowen/SPARK-10547.
|
|
|
|
|
|
|
|
|
| |
1. Hide `LocalNodeIterator` behind the `LocalNode#asIterator` method
2. Add tests for this
Author: Andrew Or <andrew@databricks.com>
Closes #8708 from andrewor14/local-hash-join-follow-up.
|
|
|
|
|
|
|
|
|
|
| |
sample and intersect operators
This PR is in conflict with #8535. I will update this one when #8535 gets merged.
Author: zsxwing <zsxwing@gmail.com>
Closes #8573 from zsxwing/more-local-operators.
|
|
|
|
|
|
|
|
|
|
|
|
| |
rule. Incorporate review comments
Adding changes suggested by cloud-fan in #5700
cc marmbrus
Author: Yash Datta <Yash.Datta@guavus.com>
Closes #8716 from saucam/bool_simp.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When casting a string to boolean, Hive returns `true` if the length of the string is > 0, and Spark SQL follows this behavior.
However, this behavior is very different from other SQL systems:
1. [presto](https://github.com/facebook/presto/blob/master/presto-main/src/main/java/com/facebook/presto/type/VarcharOperators.java#L89-L118) returns `true` for 't' 'true' '1', `false` for 'f' 'false' '0', and throws an exception for others.
2. [redshift](http://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html) returns `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', and null for others.
3. [postgresql](http://www.postgresql.org/docs/devel/static/datatype-boolean.html) returns `true` for 't' 'true' 'y' 'yes' 'on' '1', `false` for 'f' 'false' 'n' 'no' 'off' '0', and throws an exception for others.
4. [vertica](https://my.vertica.com/docs/5.0/HTML/Master/2983.htm) returns `true` for 't' 'true' 'y' 'yes' '1', `false` for 'f' 'false' 'n' 'no' '0', and null for others.
5. [impala](http://www.cloudera.com/content/cloudera/en/documentation/cloudera-impala/latest/topics/impala_boolean.html) throws an exception when trying to cast a string to boolean.
6. mysql, oracle, sqlserver don't have a boolean type.
Whether we should change the cast behavior to match other SQL systems is not decided yet. This PR is a test to see how many compatibility tests would fail if we changed it.
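To make the difference concrete, here is a small sketch contrasting the Hive-style rule with the "null for others" rule described for redshift and vertica. The object and method names are hypothetical, and the case-insensitive matching is an assumption for illustration, not something stated above.

```scala
// Illustrative only; names are hypothetical and this is not Spark's Cast code.
object StringToBoolean {
  // Hive-style rule as described above: any non-empty string is true.
  def hiveStyle(s: String): Boolean = s.length > 0

  private val trueLiterals = Set("t", "true", "y", "yes", "1")
  private val falseLiterals = Set("f", "false", "n", "no", "0")

  // "Null for others" rule: known literals map to true/false, anything else to None.
  // Lower-casing the input is an assumption made for this sketch.
  def nullOnUnknown(s: String): Option[Boolean] = {
    val v = s.toLowerCase
    if (trueLiterals(v)) Some(true)
    else if (falseLiterals(v)) Some(false)
    else None
  }
}
```

Under the Hive-style rule the string `"false"` casts to `true`, which is the kind of surprise the comparison above is pointing at.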
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8698 from cloud-fan/string2boolean.
|
|
|
|
|
|
|
|
|
|
|
|
| |
it is too flaky
If hadoopFsRelationSuites's "test all data types" is too flaky we can disable it for now.
https://issues.apache.org/jira/browse/SPARK-10540
Author: Yin Huai <yhuai@databricks.com>
Closes #8705 from yhuai/SPARK-10540-ignore.
|
|
|
|
|
|
|
|
| |
Before this fix, `MyDenseVectorUDT.typeName` gives `mydensevecto`, which is not desirable.
Author: Cheng Lian <lian@databricks.com>
Closes #8640 from liancheng/spark-10472/udt-type-name.
|
|
|
|
|
|
|
|
| |
`LeftOutputIterator` and `RightOutputIterator` are symmetrically identical and can share a lot of code. If someone makes a change in one but forgets to do the same thing in the other we'll end up with inconsistent behavior. This patch also adds inline comments to clarify the intention of the code.
Author: Andrew Or <andrew@databricks.com>
Closes #8596 from andrewor14/smoj-cleanup.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR:
1. Enhances reflection in RBackend by automatically matching a Java array to a Scala Seq when finding methods. Util functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side.
2. Enhances the SerDe to support transferring a Scala Seq to the R side. Data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection.
3. Support ArrayType in createDataFrame().
Author: Sun Rui <rui.sun@intel.com>
Closes #8458 from sun-rui/SPARK-10049.
|
|
|
|
|
|
|
|
|
|
|
| |
This PR includes the following changes:
- Add SQLConf to LocalNode
- Add HashJoinNode
- Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join.
Author: zsxwing <zsxwing@gmail.com>
Closes #8535 from zsxwing/SPARK-9990.
|
| |
Data Spill with UnsafeRow causes assert failure.
```
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:165)
at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75)
at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687)
at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
```
To reproduce that with code (thanks andrewor14):
```scala
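// Launch the shell first (run from the Spark root directory):
//   bin/spark-shell --master local \
//     --conf spark.shuffle.memoryFraction=0.005 \
//     --conf spark.shuffle.sort.bypassMergeThreshold=0

// Then, inside spark-shell, force a spill during the shuffle:
sc.parallelize(1 to 2 * 1000 * 1000, 10)
  .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count()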
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #8635 from chenghao-intel/unsafe_spill.
|
|
|
|
|
|
|
|
| |
for master
Author: Cheng Lian <lian@databricks.com>
Closes #8670 from liancheng/spark-10301/address-pr-comments.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use these in the optimizer as well:
A and (not(A) or B) => A and B
not(A and B) => not(A) or not(B)
not(A or B) => not(A) and not(B)
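The three rewrites above can be checked exhaustively over two-valued booleans. This is only a sanity-check sketch with hypothetical names; it does not model SQL's three-valued logic (NULL), which the optimizer rule itself has to respect.

```scala
// Truth-table check of the rewrites listed above (two-valued logic only).
object BoolRules {
  private val values = Seq(true, false)

  // True when both expressions agree on every assignment of A and B.
  def equivalent(lhs: (Boolean, Boolean) => Boolean,
                 rhs: (Boolean, Boolean) => Boolean): Boolean =
    values.forall(a => values.forall(b => lhs(a, b) == rhs(a, b)))

  // A and (not(A) or B) => A and B
  val absorption: Boolean = equivalent((a, b) => a && (!a || b), (a, b) => a && b)
  // not(A and B) => not(A) or not(B)
  val deMorganAnd: Boolean = equivalent((a, b) => !(a && b), (a, b) => !a || !b)
  // not(A or B) => not(A) and not(B)
  val deMorganOr: Boolean = equivalent((a, b) => !(a || b), (a, b) => !a && !b)
}
```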
Author: Yash Datta <Yash.Datta@guavus.com>
Closes #5700 from saucam/bool_simp.
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reason for this extra copy is that we iterate the array twice: calculate elements data size and copy elements to array buffer.
A simple solution is to follow `createCodeForStruct`: we can dynamically grow the buffer when needed, and thus don't need to know the data size ahead of time.
This PR also includes some typo and style fixes, and does some minor refactoring to make sure `input.primitive` is always a variable name, not code, when generating unsafe code.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8496 from cloud-fan/avoid-copy.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This PR is based on #8383 , thanks to viirya
JIRA: https://issues.apache.org/jira/browse/SPARK-9730
This patch adds the Full Outer Join support for SortMergeJoin. A new class SortMergeFullJoinScanner is added to scan rows from left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator to consume joined rows from SortMergeFullJoinScanner.
Closes #8383
Author: Liang-Chi Hsieh <viirya@appier.com>
Author: Davies Liu <davies@databricks.com>
Closes #8579 from davies/smj_fullouter.
|
|
|
|
|
|
|
|
|
|
|
|
| |
code at `GenerateUnsafeProjection`
When we generate unsafe code inside `createCodeForXXX`, we always assign the `input.primitive` to a temp variable in case `input.primitive` is expression code.
This PR does some refactoring to make sure `input.primitive` is always a variable name, and includes some other typo and style fixes.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8613 from cloud-fan/minor.
|
|
|
|
|
|
|
|
|
|
|
| |
The bulk of the changes are on the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary.
But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations.
The remainder are fixes for some potential bugs and deprecated syntax.
Author: Luc Bourlier <luc.bourlier@typesafe.com>
Closes #8433 from skyluc/issue/sbt-2.11.
|
|
|
|
|
|
| |
Author: Michael Armbrust <michael@databricks.com>
Closes #8659 from marmbrus/testBuildBreak.
|
| |
its project list
```scala
import org.apache.spark.sql.hive.execution.HiveTableScan
sql("select key, value, key + 1 from src").registerTempTable("abc")
cacheTable("abc")
val sparkPlan = sql(
"""select a.key, b.key, c.key from
|abc a join abc b on a.key=b.key
|join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan
assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3) // failed
assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // failed
```
The actual plan is:
```
== Parsed Logical Plan ==
'Project [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)]
'Join Inner, Some(('a.key = 'c.key))
'Join Inner, Some(('a.key = 'b.key))
'UnresolvedRelation [abc], Some(a)
'UnresolvedRelation [abc], Some(b)
'UnresolvedRelation [abc], Some(c)
== Analyzed Logical Plan ==
key: int, key: int, key: int
Project [key#14,key#61,key#66]
Join Inner, Some((key#14 = key#66))
Join Inner, Some((key#14 = key#61))
Subquery a
Subquery abc
Project [key#14,value#15,(key#14 + 1) AS _c2#16]
MetastoreRelation default, src, None
Subquery b
Subquery abc
Project [key#61,value#62,(key#61 + 1) AS _c2#58]
MetastoreRelation default, src, None
Subquery c
Subquery abc
Project [key#66,value#67,(key#66 + 1) AS _c2#63]
MetastoreRelation default, src, None
== Optimized Logical Plan ==
Project [key#14,key#61,key#66]
Join Inner, Some((key#14 = key#66))
Project [key#14,key#61]
Join Inner, Some((key#14 = key#61))
Project [key#14]
InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc)
Project [key#61]
MetastoreRelation default, src, None
Project [key#66]
MetastoreRelation default, src, None
== Physical Plan ==
TungstenProject [key#14,key#61,key#66]
BroadcastHashJoin [key#14], [key#66], BuildRight
TungstenProject [key#14,key#61]
BroadcastHashJoin [key#14], [key#61], BuildRight
ConvertToUnsafe
InMemoryColumnarTableScan [key#14], (InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc))
ConvertToUnsafe
HiveTableScan [key#61], (MetastoreRelation default, src, None)
ConvertToUnsafe
HiveTableScan [key#66], (MetastoreRelation default, src, None)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #8494 from chenghao-intel/weird_cache.
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-10441
Author: Yin Huai <yhuai@databricks.com>
Closes #8597 from yhuai/timestampJson.
|
|
|
|
|
|
|
|
| |
We did a lot of special handling for non-deterministic expressions in `Optimizer`. However, `PhysicalOperation` just collects all Projects and Filters and loses their relative order. We should respect the operator order imposed by non-deterministic expressions in `PhysicalOperation`.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8486 from cloud-fan/fix.
|
|
|
|
|
|
|
|
|
|
|
|
| |
ORC files
JIRA: https://issues.apache.org/jira/browse/SPARK-9170
`StandardStructObjectInspector` implicitly lowercases column names, but the ORC format doesn't appear to have such a requirement. In fact, there is an `OrcStructInspector` specifically for the ORC format. We should use it when serializing rows to ORC files, so that column names are case preserving when writing ORC files.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #7520 from viirya/use_orcstruct.
|
|
|
|
|
|
|
|
|
|
|
|
| |
To keep full compatibility of Parquet write path with Spark 1.4, we should rename the innermost field name of arrays that may contain null from "array_element" to "array".
Please refer to [SPARK-10434] [1] for more details.
[1]: https://issues.apache.org/jira/browse/SPARK-10434
Author: Cheng Lian <lian@databricks.com>
Closes #8586 from liancheng/spark-10434/fix-parquet-array-type.
|
|
|
|
|
|
|
|
| |
Jenkins master builders are currently broken by a merge conflict between PR #8584 and PR #8155.
Author: Cheng Lian <lian@databricks.com>
Closes #8614 from liancheng/hotfix/fix-pr-8155-8584-conflict.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
tests
This PR fixes the failed test and the conflict for #8155
https://issues.apache.org/jira/browse/SPARK-9925
Closes #8155
Author: Yin Huai <yhuai@databricks.com>
Author: Davies Liu <davies@databricks.com>
Closes #8602 from davies/shuffle_partitions.
|
|
|
|
|
|
| |
Author: Andrew Or <andrew@databricks.com>
Closes #8603 from andrewor14/minor-sql-changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
analyze
This PR takes over https://github.com/apache/spark/pull/8389.
This PR improves `checkAnswer` to print the partially analyzed plan in addition to the user friendly error message, in order to aid debugging failing tests.
In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests. Depending on the trait we refer to the current context as `sqlContext`, `_sqlContext`, `ctx` or `hiveContext` with access modifiers `public`, `protected` and `private` depending on the defining class.
I propose we refactor as follows:
1. All tests should only refer to a `protected sqlContext` when testing general features, and `protected hiveContext` when it is a method that only exists on a `HiveContext`.
2. All tests should only import `testImplicits._` (i.e., don't import `TestHive.implicits._`)
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8584 from cloud-fan/cleanupTests.
|
|
|
|
|
|
|
|
| |
https://issues.apache.org/jira/browse/SPARK-9596
Author: WangTaoTheTonic <wangtao111@huawei.com>
Closes #7931 from WangTaoTheTonic/SPARK-9596.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
by default
New screenshots after this fix:
<img width="627" alt="s1" src="https://cloud.githubusercontent.com/assets/1000778/9625782/4b2dba36-518b-11e5-9104-c713ff026e3d.png">
Default:
<img width="462" alt="s2" src="https://cloud.githubusercontent.com/assets/1000778/9625817/92366e50-518b-11e5-9981-cdfb774d66b8.png">
After clicking `+details`:
<img width="377" alt="s3" src="https://cloud.githubusercontent.com/assets/1000778/9625784/4ba24342-518b-11e5-8522-846a16a95d44.png">
Author: zsxwing <zsxwing@gmail.com>
Closes #8570 from zsxwing/SPARK-10411.
|
|
|
|
|
|
|
|
|
|
| |
clone method
https://issues.apache.org/jira/browse/SPARK-10422
Author: Yin Huai <yhuai@databricks.com>
Closes #8578 from yhuai/SPARK-10422.
|
|
|
|
|
|
|
|
|
|
| |
Aggregate
For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8548 from cloud-fan/support-order-by-non-attribute.
|
|
|
|
|
|
|
|
|
|
| |
Before #8371, there was a bug for `Sort` on `Aggregate`: we couldn't use aggregate expressions named `_aggOrdering`, and couldn't use more than one ordering expression containing aggregate functions. The reason for this bug is that the aggregate expression in `SortOrder` never got resolved; we aliased it with `_aggOrdering` and called `toAttribute`, which gave us an `UnresolvedAttribute`. So we were actually referencing the aggregate expression by name, not by exprId as we thought. If there was already an aggregate expression named `_aggOrdering`, or more than one ordering expression had aggregate functions, the names conflicted and we couldn't search by name.
However, after #8371 got merged, the `SortOrder`s are guaranteed to be resolved and we always reference aggregate expressions by exprId. The bug no longer exists, and this PR adds regression tests for it.
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #8231 from cloud-fan/sort-agg.
|
| |
This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here.
When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated because of several reasons:
1. Requested schema must conform to the real schema of the physical file to be read.
This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately we are already doing this in Spark 1.5 by pushing requested schema conversion to the executor side in PR #7231.
1. Support for schema merging.
A single Parquet dataset may consist of multiple physical Parquet files that come with different but compatible schemas. This means we may request a column path that doesn't exist in a physical Parquet file. All requested column paths can be nested. For example, for a Parquet file schema
```
message root {
required group f0 {
required group f00 {
required int32 f000;
required binary f001 (UTF8);
}
}
}
```
we may request the column paths defined in the following schema:
```
message root {
required group f0 {
required group f00 {
required binary f001 (UTF8);
required float f002;
}
}
optional double f1;
}
```
Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`.
The good news is that Parquet handles non-existing column paths properly and always returns null for them.
1. The map from `StructType` to `MessageType` is a one-to-many map.
This is the most unfortunate part.
Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema:
```
message m0 {
repeated int32 f;
}
```
while parquet-avro generates another version:
```
message m1 {
required group f (LIST) {
repeated int32 array;
}
}
```
and parquet-thrift spills this:
```
message m2 {
required group f (LIST) {
repeated int32 f_tuple;
}
}
```
All of them can be mapped to the following _unique_ Catalyst schema:
```
StructType(
StructField(
"f",
ArrayType(IntegerType, containsNull = false),
nullable = false))
```
This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`.
In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way that is more or less compatible with parquet-hive and parquet-avro, but broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005]. In PR #8228, I tried to avoid the hard part of the problem and made a minimal change in `CatalystRowConverter` to fix SPARK-10005. However, when taking SPARK-10301 into consideration, continuing to hack `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way.
For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`:
For a leaf column path `c` in `cs`:
- if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`;
- otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`;
- no other column paths should exist in `ps'`.
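The selection rule above can be sketched on a toy schema tree. This is a simplified illustration, not `CatalystReadSupport` itself: the `Field`, `Leaf`, `Group`, and `clip` names are hypothetical, fields are matched by name only, the requested field stands in for the converted path `c"`, and the `LIST`/`MAP` backwards-compatibility rules are ignored entirely.

```scala
// Toy model of schema clipping; hypothetical types, name-based matching only.
object ClipSketch {
  sealed trait Field { def name: String }
  final case class Leaf(name: String, tpe: String) extends Field
  final case class Group(name: String, fields: List[Field]) extends Field

  // Tailor the file schema `ps` according to the requested schema `cs`.
  def clip(ps: Group, cs: Group): Group = {
    val fileFields = ps.fields.map(f => f.name -> f).toMap
    val clipped = cs.fields.map { requested =>
      (fileFields.get(requested.name), requested) match {
        // Both sides are groups: recurse so nested paths are clipped as well.
        case (Some(fileGroup: Group), wanted: Group) => clip(fileGroup, wanted)
        // The path exists in the file: keep the file's own definition.
        case (Some(fileField), _) => fileField
        // The path is missing from the file: fall back to the requested definition.
        case (None, _) => requested
      }
    }
    // Only requested fields survive; everything else in the file is pruned.
    Group(ps.name, clipped)
  }
}
```

Applied to the schema-merging example earlier in this description, `f000` is pruned, `f001` keeps the file's definition, and `f002` and `f1` are taken from the requested schema.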
Then comes the most tedious part:
> Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`?
Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in parquet-format spec. They are:
1. the standard structure of nested types, and
1. cases defined in all backwards-compatibility rules for `LIST` and `MAP`.
The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`.
With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`. Another benefit is that we can now also read Parquet datasets consisting of files with different physical Parquet schemas that share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by [this test case] [test-case].
[spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301
[spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005
[test-case]: https://github.com/liancheng/spark/commit/38644d8a45175cbdf20d2ace021c2c2544a50ab3#diff-a9b98e28ce3ae30641829dffd1173be2R26
Author: Cheng Lian <lian@databricks.com>
Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
|
|
|
|
|
|
|
|
| |
They don't bring much value since we now have better unit test coverage for hash joins. This will also help reduce the test time.
Author: Reynold Xin <rxin@databricks.com>
Closes #8542 from rxin/SPARK-10378.
|
|
|
|
|
|
|
|
|
|
| |
Writing a DataFrame to a DB2 database fails because, by default, the JDBC data source implementation generates a table schema with data types that DB2 does not support: TEXT for String and BIT(1) for Boolean.
This patch registers a DB2 JDBC dialect that maps String and Boolean to valid DB2 data types.
Author: sureshthalamati <suresh.thalamati@gmail.com>
Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
|
|
|
|
|
|
|
|
| |
CC rxin marmbrus
Author: Feynman Liang <fliang@databricks.com>
Closes #8523 from feynmanliang/SPARK-10351.
|
|
|
|
|
|
|
|
|
|
|
|
| |
for local operators
This PR includes the following changes:
- Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
- Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)
Author: zsxwing <zsxwing@gmail.com>
Closes #8464 from zsxwing/local-execution.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
OOM driver and throw a better error message when users need to enable parquet schema merging
This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and takes down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables.
https://issues.apache.org/jira/browse/SPARK-10339
https://issues.apache.org/jira/browse/SPARK-10334
Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let the user know what to do.
Author: Yin Huai <yhuai@databricks.com>
Closes #8515 from yhuai/partitionedTableScan.
|
|
|
|
|
|
|
|
|
|
| |
more places
SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.
|
|
|
|
|
|
|
|
| |
When I tested the latest version of Spark with an exclamation mark, I got some errors. I then reset the Spark version and found that commit id "a2409d1c8e8ddec04b529ac6f6a12b5993f0eeda" introduced the bug. With the jline version changing from 0.9.94 to 2.12 in that commit, the exclamation mark is treated as a special character in ConsoleReader.
Author: wangwei <wangwei82@huawei.com>
Closes #8420 from small-wang/jline-SPARK-10226.
|