path: root/sql
Commit message | Author | Age | Files | Lines
* [SPARK-10443] [SQL] Refactor SortMergeOuterJoin to reduce duplicationAndrew Or2015-09-101-61/+77
`LeftOutputIterator` and `RightOutputIterator` are symmetrically identical and can share a lot of code. If someone makes a change in one but forgets to do the same in the other, we end up with inconsistent behavior. This patch also adds inline comments to clarify the intention of the code. Author: Andrew Or <andrew@databricks.com> Closes #8596 from andrewor14/smoj-cleanup.
* [SPARK-10049] [SPARKR] Support collecting data of ArrayType in DataFrame.Sun Rui2015-09-101-4/+10
This PR: 1. Enhances reflection in RBackend to automatically match a Java array to a Scala Seq when finding methods; utility functions like seq() and listToSeq() on the R side can be removed, as they would conflict with the SerDe logic that transfers a Scala Seq to the R side. 2. Enhances the SerDe to support transferring a Scala Seq to the R side; data of ArrayType in a DataFrame is observed to be of Scala Seq type after collection. 3. Supports ArrayType in createDataFrame(). Author: Sun Rui <rui.sun@intel.com> Closes #8458 from sun-rui/SPARK-10049.
* [SPARK-9990] [SQL] Create local hash join operatorzsxwing2015-09-1016-24/+455
This PR includes the following changes:
- Add SQLConf to LocalNode
- Add HashJoinNode
- Add ConvertToUnsafeNode and ConvertToSafeNode.scala to test unsafe hash join.
Author: zsxwing <zsxwing@gmail.com> Closes #8535 from zsxwing/SPARK-9990.
* [SPARK-10466] [SQL] UnsafeRow SerDe exception with data spillCheng Hao2015-09-102-5/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Data Spill with UnsafeRow causes assert failure. ``` java.lang.AssertionError: assertion failed at scala.Predef$.assert(Predef.scala:165) at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2.writeKey(UnsafeRowSerializer.scala:75) at org.apache.spark.storage.DiskBlockObjectWriter.write(DiskBlockObjectWriter.scala:180) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:688) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2$$anonfun$apply$1.apply(ExternalSorter.scala:687) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:687) at org.apache.spark.util.collection.ExternalSorter$$anonfun$writePartitionedFile$2.apply(ExternalSorter.scala:683) at scala.collection.Iterator$class.foreach(Iterator.scala:727) at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) at org.apache.spark.util.collection.ExternalSorter.writePartitionedFile(ExternalSorter.scala:683) at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:80) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73) at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) at org.apache.spark.scheduler.Task.run(Task.scala:88) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214) ``` To reproduce that with code (thanks andrewor14): ```scala bin/spark-shell --master local --conf spark.shuffle.memoryFraction=0.005 --conf spark.shuffle.sort.bypassMergeThreshold=0 sc.parallelize(1 to 2 * 1000 * 1000, 10) .map { i => (i, i) }.toDF("a", "b").groupBy("b").avg().count() ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #8635 from chenghao-intel/unsafe_spill.
* [SPARK-10301] [SPARK-10428] [SQL] Addresses comments of PR #8583 and #8509 for masterCheng Lian2015-09-104-45/+522
Author: Cheng Lian <lian@databricks.com> Closes #8670 from liancheng/spark-10301/address-pr-comments.
* [SPARK-7142] [SQL] Minor enhancement to BooleanSimplification Optimizer ruleYash Datta2015-09-102-0/+25
Use these in the optimizer as well:
A and (not(A) or B) => A and B
not(A and B) => not(A) or not(B)
not(A or B) => not(A) and not(B)
Author: Yash Datta <Yash.Datta@guavus.com> Closes #5700 from saucam/bool_simp.
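The three rules read more easily with a concrete rewrite in front of them. A minimal sketch using a toy expression ADT (illustrative only, not Catalyst's actual expression classes):

```scala
// Toy expression tree and the three rewrite rules added by this commit.
sealed trait Expr
case class Var(name: String) extends Expr
case class Not(e: Expr) extends Expr
case class And(l: Expr, r: Expr) extends Expr
case class Or(l: Expr, r: Expr) extends Expr

def simplify(e: Expr): Expr = e match {
  case And(a, Or(Not(b), c)) if a == b => And(a, simplify(c))                    // A && (!A || B) => A && B
  case Not(And(a, b))                  => Or(simplify(Not(a)), simplify(Not(b))) // De Morgan
  case Not(Or(a, b))                   => And(simplify(Not(a)), simplify(Not(b)))// De Morgan
  case And(a, b)                       => And(simplify(a), simplify(b))
  case Or(a, b)                        => Or(simplify(a), simplify(b))
  case other                           => other
}

simplify(And(Var("A"), Or(Not(Var("A")), Var("B"))))  // And(Var(A), Var(B))
```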
* [SPARK-10065] [SQL] avoid the extra copy when generate unsafe arrayWenchen Fan2015-09-101-60/+24
The reason for this extra copy is that we iterate the array twice: once to calculate the element data size and once to copy elements into the array buffer. A simple solution is to follow `createCodeForStruct`: we can dynamically grow the buffer when needed and thus don't need to know the data size ahead of time. This PR also includes some typo and style fixes, and some minor refactoring to make sure `input.primitive` is always a variable name, not code, when generating unsafe code. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8496 from cloud-fan/avoid-copy.
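A minimal sketch of the grow-on-demand idea described above; the class and method names are illustrative, not Spark's generated code:

```scala
import java.util.Arrays

// Grow the output buffer lazily while writing, instead of making a first pass
// whose only purpose is to measure the total element size.
final class GrowableByteBuffer(initialSize: Int = 64) {
  private var buf = new Array[Byte](initialSize)
  private var cursor = 0

  private def ensureCapacity(extra: Int): Unit =
    if (cursor + extra > buf.length) {
      buf = Arrays.copyOf(buf, math.max(buf.length * 2, cursor + extra))  // double when needed
    }

  def write(bytes: Array[Byte]): Unit = {
    ensureCapacity(bytes.length)
    System.arraycopy(bytes, 0, buf, cursor, bytes.length)
    cursor += bytes.length
  }

  def result(): Array[Byte] = Arrays.copyOf(buf, cursor)  // single trim to the used length
}
```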
* [SPARK-9730] [SQL] Add Full Outer Join support for SortMergeJoinLiang-Chi Hsieh2015-09-094-34/+248
This PR is based on #8383, thanks to viirya. JIRA: https://issues.apache.org/jira/browse/SPARK-9730 This patch adds Full Outer Join support for SortMergeJoin. A new class, SortMergeFullJoinScanner, is added to scan rows from the left and right iterators. FullOuterIterator is simply a wrapper of type RowIterator that consumes joined rows from SortMergeFullJoinScanner. Closes #8383 Author: Liang-Chi Hsieh <viirya@appier.com> Author: Davies Liu <davies@databricks.com> Closes #8579 from davies/smj_fullouter.
* [SPARK-10461] [SQL] make sure `input.primitive` is always a variable name, not code, at `GenerateUnsafeProjection`Wenchen Fan2015-09-095-67/+75
When we generate unsafe code inside `createCodeForXXX`, we always assign `input.primitive` to a temp variable in case `input.primitive` is expression code. This PR does some refactoring to make sure `input.primitive` is always a variable name, plus some other typo and style fixes. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8613 from cloud-fan/minor.
* [SPARK-10227] fatal warnings with sbt on Scala 2.11Luc Bourlier2015-09-097-18/+18
The bulk of the changes concern the `transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary; but if the class parameters are used in methods, fields are created, so it is safer to keep the annotations. The remainder are some potential bugs and deprecated syntax. Author: Luc Bourlier <luc.bourlier@typesafe.com> Closes #8433 from skyluc/issue/sbt-2.11.
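A small illustration of the reasoning above, using hypothetical classes; whether the annotation ends up on a real field depends on how the parameter is used:

```scala
// `conf` is only used while constructing, so the compiler may not emit a field
// at all and the annotation is discarded (the warning that 2.11 now reports).
class OnlyInConstructor(@transient conf: Map[String, String]) extends Serializable {
  val numEntries: Int = conf.size
}

// `conf` is referenced from a method, so the compiler promotes it to a field;
// keeping @transient ensures that field is skipped during Java serialization.
class UsedInMethod(@transient conf: Map[String, String]) extends Serializable {
  def lookup(key: String): Option[String] = conf.get(key)
}
```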
* [HOTFIX] Fix build break caused by #8494Michael Armbrust2015-09-081-2/+2
Author: Michael Armbrust <michael@databricks.com> Closes #8659 from marmbrus/testBuildBreak.
* [SPARK-10327] [SQL] Cache Table is not working while subquery has alias in its project listCheng Hao2015-09-082-3/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | its project list ```scala import org.apache.spark.sql.hive.execution.HiveTableScan sql("select key, value, key + 1 from src").registerTempTable("abc") cacheTable("abc") val sparkPlan = sql( """select a.key, b.key, c.key from |abc a join abc b on a.key=b.key |join abc c on a.key=c.key""".stripMargin).queryExecution.sparkPlan assert(sparkPlan.collect { case e: InMemoryColumnarTableScan => e }.size === 3) // failed assert(sparkPlan.collect { case e: HiveTableScan => e }.size === 0) // failed ``` The actual plan is: ``` == Parsed Logical Plan == 'Project [unresolvedalias('a.key),unresolvedalias('b.key),unresolvedalias('c.key)] 'Join Inner, Some(('a.key = 'c.key)) 'Join Inner, Some(('a.key = 'b.key)) 'UnresolvedRelation [abc], Some(a) 'UnresolvedRelation [abc], Some(b) 'UnresolvedRelation [abc], Some(c) == Analyzed Logical Plan == key: int, key: int, key: int Project [key#14,key#61,key#66] Join Inner, Some((key#14 = key#66)) Join Inner, Some((key#14 = key#61)) Subquery a Subquery abc Project [key#14,value#15,(key#14 + 1) AS _c2#16] MetastoreRelation default, src, None Subquery b Subquery abc Project [key#61,value#62,(key#61 + 1) AS _c2#58] MetastoreRelation default, src, None Subquery c Subquery abc Project [key#66,value#67,(key#66 + 1) AS _c2#63] MetastoreRelation default, src, None == Optimized Logical Plan == Project [key#14,key#61,key#66] Join Inner, Some((key#14 = key#66)) Project [key#14,key#61] Join Inner, Some((key#14 = key#61)) Project [key#14] InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc) Project [key#61] MetastoreRelation default, src, None Project [key#66] MetastoreRelation default, src, None == Physical Plan == TungstenProject [key#14,key#61,key#66] BroadcastHashJoin [key#14], [key#66], BuildRight TungstenProject [key#14,key#61] BroadcastHashJoin [key#14], [key#61], BuildRight ConvertToUnsafe InMemoryColumnarTableScan [key#14], (InMemoryRelation [key#14,value#15,_c2#16], true, 10000, StorageLevel(true, true, false, true, 1), (Project [key#14,value#15,(key#14 + 1) AS _c2#16]), Some(abc)) ConvertToUnsafe HiveTableScan [key#61], (MetastoreRelation default, src, None) ConvertToUnsafe HiveTableScan [key#66], (MetastoreRelation default, src, None) ``` Author: Cheng Hao <hao.cheng@intel.com> Closes #8494 from chenghao-intel/weird_cache.
* [SPARK-10441] [SQL] Save data correctly to json.Yin Huai2015-09-089-8/+205
https://issues.apache.org/jira/browse/SPARK-10441 Author: Yin Huai <yhuai@databricks.com> Closes #8597 from yhuai/timestampJson.
* [SPARK-10316] [SQL] respect nondeterministic expressions in PhysicalOperationWenchen Fan2015-09-082-30/+20
We did a lot of special handling for non-deterministic expressions in `Optimizer`. However, `PhysicalOperation` just collects all Projects and Filters and messes that up. We should respect the operator order imposed by non-deterministic expressions in `PhysicalOperation`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8486 from cloud-fan/fix.
* [SPARK-9170] [SQL] Use OrcStructInspector to be case preserving when writing ORC filesLiang-Chi Hsieh2015-09-082-21/+40
JIRA: https://issues.apache.org/jira/browse/SPARK-9170 `StandardStructObjectInspector` implicitly lowercases column names, but I think the ORC format doesn't have such a requirement. In fact, there is an `OrcStructInspector` specific to the ORC format; we should use it when serializing rows to ORC files, so that writing stays case preserving. Author: Liang-Chi Hsieh <viirya@appier.com> Closes #7520 from viirya/use_orcstruct.
* [SPARK-10434] [SQL] Fixes Parquet schema of arrays that may contain nullCheng Lian2015-09-052-9/+10
To keep the Parquet write path fully compatible with Spark 1.4, we should rename the innermost field of arrays that may contain null from "array_element" to "array". Please refer to [SPARK-10434] [1] for more details. [1]: https://issues.apache.org/jira/browse/SPARK-10434 Author: Cheng Lian <lian@databricks.com> Closes #8586 from liancheng/spark-10434/fix-parquet-array-type.
* [HOTFIX] [SQL] Fixes compilation errorCheng Lian2015-09-041-1/+1
Jenkins master builders are currently broken by a merge conflict between PR #8584 and PR #8155. Author: Cheng Lian <lian@databricks.com> Closes #8614 from liancheng/hotfix/fix-pr-8155-8584-conflict.
* [SPARK-9925] [SQL] [TESTS] Set SQLConf.SHUFFLE_PARTITIONS.key correctly for testsYin Huai2015-09-047-21/+90
This PR fixes the failed test and the conflict for #8155. https://issues.apache.org/jira/browse/SPARK-9925 Closes #8155 Author: Yin Huai <yhuai@databricks.com> Author: Davies Liu <davies@databricks.com> Closes #8602 from davies/shuffle_partitions.
* [SPARK-10450] [SQL] Minor improvements to readability / style / typos etc.Andrew Or2015-09-045-15/+15
Author: Andrew Or <andrew@databricks.com> Closes #8603 from andrewor14/minor-sql-changes.
* [SPARK-10176] [SQL] Show partially analyzed plans when checkAnswer fails to analyzeWenchen Fan2015-09-0490-999/+908
This PR takes over https://github.com/apache/spark/pull/8389. It improves `checkAnswer` to print the partially analyzed plan in addition to the user-friendly error message, in order to aid debugging failing tests. In doing so, I ran into a conflict with the various ways that we bring a SQLContext into the tests: depending on the trait, we refer to the current context as `sqlContext`, `_sqlContext`, `ctx` or `hiveContext`, with access modifiers `public`, `protected` and `private` depending on the defining class. I propose we refactor as follows:
1. All tests should only refer to a `protected sqlContext` when testing general features, and a `protected hiveContext` when a method only exists on `HiveContext`.
2. All tests should only import `testImplicits._` (i.e., don't import `TestHive.implicits._`).
Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8584 from cloud-fan/cleanupTests.
* [SPARK-9596] [SQL] treat hadoop classes as shared one in IsolatedClientLoaderWangTaoTheTonic2015-09-031-0/+1
https://issues.apache.org/jira/browse/SPARK-9596 Author: WangTaoTheTonic <wangtao111@huawei.com> Closes #7931 from WangTaoTheTonic/SPARK-9596.
* [SPARK-10411] [SQL] Move visualization above explain output and hide explain by defaultzsxwing2015-09-021-5/+22
New screenshots after this fix: https://cloud.githubusercontent.com/assets/1000778/9625782/4b2dba36-518b-11e5-9104-c713ff026e3d.png Default: https://cloud.githubusercontent.com/assets/1000778/9625817/92366e50-518b-11e5-9981-cdfb774d66b8.png After clicking `+details`: https://cloud.githubusercontent.com/assets/1000778/9625784/4ba24342-518b-11e5-8522-846a16a95d44.png Author: zsxwing <zsxwing@gmail.com> Closes #8570 from zsxwing/SPARK-10411.
* [SPARK-10422] [SQL] String column in InMemoryColumnarCache needs to override clone methodYin Huai2015-09-022-0/+22
https://issues.apache.org/jira/browse/SPARK-10422 Author: Yin Huai <yhuai@databricks.com> Closes #8578 from yhuai/SPARK-10422.
* [SPARK-10389] [SQL] support order by non-attribute grouping expression on AggregateWenchen Fan2015-09-022-39/+52
For example, we can write `SELECT MAX(value) FROM src GROUP BY key + 1 ORDER BY key + 1` in PostgreSQL, and we should support this in Spark SQL. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8548 from cloud-fan/support-order-by-non-attribute.
* [SPARK-10034] [SQL] add regression test for Sort on AggregateWenchen Fan2015-09-022-0/+18
Before #8371, there was a bug for `Sort` on `Aggregate`: we couldn't use aggregate expressions named `_aggOrdering`, and couldn't use more than one ordering expression containing aggregate functions. The reason is that the aggregate expression in `SortOrder` never gets resolved; we alias it with `_aggOrdering` and call `toAttribute`, which gives us an `UnresolvedAttribute`. So we are actually referencing the aggregate expression by name, not by exprId as we thought. If there is already an aggregate expression named `_aggOrdering`, or there is more than one ordering expression with aggregate functions, we get conflicting names and can't search by name. However, after #8371 was merged, the `SortOrder`s are guaranteed to be resolved and we always reference aggregate expressions by exprId. The bug doesn't exist anymore, and this PR adds regression tests for it. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #8231 from cloud-fan/sort-agg.
* [SPARK-10301] [SQL] Fixes schema merging for nested structsCheng Lian2015-09-017-125/+653
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR can be quite challenging to review. I'm trying to give a detailed description of the problem as well as its solution here. When reading Parquet files, we need to specify a potentially nested Parquet schema (of type `MessageType`) as requested schema for column pruning. This Parquet schema is translated from a Catalyst schema (of type `StructType`), which is generated by the query planner and represents all requested columns. However, this translation can be fairly complicated because of several reasons: 1. Requested schema must conform to the real schema of the physical file to be read. This means we have to tailor the actual file schema of every individual physical Parquet file to be read according to the given Catalyst schema. Fortunately we are already doing this in Spark 1.5 by pushing request schema conversion to executor side in PR #7231. 1. Support for schema merging. A single Parquet dataset may consist of multiple physical Parquet files come with different but compatible schemas. This means we may request for a column path that doesn't exist in a physical Parquet file. All requested column paths can be nested. For example, for a Parquet file schema ``` message root { required group f0 { required group f00 { required int32 f000; required binary f001 (UTF8); } } } ``` we may request for column paths defined in the following schema: ``` message root { required group f0 { required group f00 { required binary f001 (UTF8); required float f002; } } optional double f1; } ``` Notice that we pruned column path `f0.f00.f000`, but added `f0.f00.f002` and `f1`. The good news is that Parquet handles non-existing column paths properly and always returns null for them. 1. The map from `StructType` to `MessageType` is a one-to-many map. This is the most unfortunate part. Due to historical reasons (dark histories!), schemas of Parquet files generated by different libraries have different "flavors". For example, to handle a schema with a single non-nullable column, whose type is an array of non-nullable integers, parquet-protobuf generates the following Parquet schema: ``` message m0 { repeated int32 f; } ``` while parquet-avro generates another version: ``` message m1 { required group f (LIST) { repeated int32 array; } } ``` and parquet-thrift spills this: ``` message m1 { required group f (LIST) { repeated int32 f_tuple; } } ``` All of them can be mapped to the following _unique_ Catalyst schema: ``` StructType( StructField( "f", ArrayType(IntegerType, containsNull = false), nullable = false)) ``` This greatly complicates Parquet requested schema construction, since the path of a given column varies in different cases. To read the array elements from files with the above schemas, we must use `f` for `m0`, `f.array` for `m1`, and `f.f_tuple` for `m2`. In earlier Spark versions, we didn't try to fix this issue properly. Spark 1.4 and prior versions simply translate the Catalyst schema in a way more or less compatible with parquet-hive and parquet-avro, but is broken in many other cases. Earlier revisions of Spark 1.5 only try to tailor the Parquet file schema at the first level, and ignore nested ones. This caused [SPARK-10301] [spark-10301] as well as [SPARK-10005] [spark-10005]. 
In PR #8228, I tried to avoid the hard part of the problem and made a minimum change in `CatalystRowConverter` to fix SPARK-10005. However, when taking SPARK-10301 into consideration, keeping hacking `CatalystRowConverter` doesn't seem to be a good idea. So this PR is an attempt to fix the problem in a proper way. For a given physical Parquet file with schema `ps` and a compatible Catalyst requested schema `cs`, we use the following algorithm to tailor `ps` to get the result Parquet requested schema `ps'`: For a leaf column path `c` in `cs`: - if `c` exists in `cs` and a corresponding Parquet column path `c'` can be found in `ps`, `c'` should be included in `ps'`; - otherwise, we convert `c` to a Parquet column path `c"` using `CatalystSchemaConverter`, and include `c"` in `ps'`; - no other column paths should exist in `ps'`. Then comes the most tedious part: > Given `cs`, `ps`, and `c`, how to locate `c'` in `ps`? Unfortunately, there's no quick answer, and we have to enumerate all possible structures defined in parquet-format spec. They are: 1. the standard structure of nested types, and 1. cases defined in all backwards-compatibility rules for `LIST` and `MAP`. The core part of this PR is `CatalystReadSupport.clipParquetType()`, which tailors a given Parquet file schema according to a requested schema in its Catalyst form. Backwards-compatibility rules of `LIST` and `MAP` are covered in `clipParquetListType()` and `clipParquetMapType()` respectively. The column path selection algorithm is implemented in `clipParquetGroupFields()`. With this PR, we no longer need to do schema tailoring in `CatalystReadSupport` and `CatalystRowConverter`. Another benefit is that, now we can also read Parquet datasets consist of files with different physical Parquet schema but share the same logical schema, for example, files generated by different Parquet libraries. This situation is illustrated by [this test case] [test-case]. [spark-10301]: https://issues.apache.org/jira/browse/SPARK-10301 [spark-10005]: https://issues.apache.org/jira/browse/SPARK-10005 [test-case]: https://github.com/liancheng/spark/commit/38644d8a45175cbdf20d2ace021c2c2544a50ab3#diff-a9b98e28ce3ae30641829dffd1173be2R26 Author: Cheng Lian <lian@databricks.com> Closes #8509 from liancheng/spark-10301/fix-parquet-requested-schema.
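A deliberately simplified sketch of the column-path selection described above, using stand-in field types rather than parquet-mr's MessageType/GroupType (and ignoring the LIST/MAP compatibility layouts handled by the real clipParquetType):

```scala
case class CatalystField(name: String, children: Seq[CatalystField] = Nil)
case class ParquetField(name: String, children: Seq[ParquetField] = Nil)

// Stand-in for CatalystSchemaConverter: build a Parquet column from a Catalyst one.
def convert(c: CatalystField): ParquetField =
  ParquetField(c.name, c.children.map(convert))

// Tailor the file schema to the requested Catalyst schema: keep matching paths
// from the file, synthesize missing ones, drop everything else.
def clip(fileFields: Seq[ParquetField], requested: Seq[CatalystField]): Seq[ParquetField] =
  requested.map { c =>
    fileFields.find(_.name == c.name) match {
      case Some(p) if c.children.nonEmpty => p.copy(children = clip(p.children, c.children))
      case Some(p)                        => p          // leaf column present in the file
      case None                           => convert(c) // absent in the file: Parquet returns null for it
    }
  }
```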
* [SPARK-10378][SQL][Test] Remove HashJoinCompatibilitySuite.Reynold Xin2015-08-311-169/+0
They don't bring much value since we now have better unit test coverage for hash joins. This will also help reduce the test time. Author: Reynold Xin <rxin@databricks.com> Closes #8542 from rxin/SPARK-10378.
* [SPARK-10170] [SQL] Add DB2 JDBC dialect support.sureshthalamati2015-08-312-0/+25
DataFrame writes to a DB2 database were failing because, by default, the JDBC data source generates a table schema with data types DB2 doesn't support: TEXT for String and BIT(1) for Boolean. This patch registers a DB2 JDBC dialect that maps String and Boolean to valid DB2 data types. Author: sureshthalamati <suresh.thalamati@gmail.com> Closes #8393 from sureshthalamati/db2_dialect_spark-10170.
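A hedged sketch of what registering such a dialect looks like with the JdbcDialect extension point; the dialect object and the CLOB/CHAR(1) mappings here are illustrative stand-ins rather than the exact mappings in the patch:

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects, JdbcType}
import org.apache.spark.sql.types.{BooleanType, DataType, StringType}

case object MyDB2Dialect extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:db2")

  // Map Catalyst types to DDL types the target database accepts.
  override def getJDBCType(dt: DataType): Option[JdbcType] = dt match {
    case StringType  => Some(JdbcType("CLOB", Types.CLOB))
    case BooleanType => Some(JdbcType("CHAR(1)", Types.CHAR))
    case _           => None  // fall back to the default mappings
  }
}

JdbcDialects.registerDialect(MyDB2Dialect)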
* [SPARK-10351] [SQL] Fixes UTF8String.fromAddress to handle off-heap memoryFeynman Liang2015-08-301-4/+5
CC rxin marmbrus Author: Feynman Liang <fliang@databricks.com> Closes #8523 from feynmanliang/SPARK-10351.
* [SPARK-9986] [SPARK-9991] [SPARK-9993] [SQL] Create a simple test framework for local operatorszsxwing2015-08-2914-55/+509
This PR includes the following changes:
- Add `LocalNodeTest` for local operator tests and add unit tests for FilterNode and ProjectNode.
- Add `LimitNode` and `UnionNode` and their unit tests to show how to use `LocalNodeTest`. (SPARK-9991, SPARK-9993)
Author: zsxwing <zsxwing@gmail.com> Closes #8464 from zsxwing/local-execution.
* [SPARK-10339] [SPARK-10334] [SPARK-10301] [SQL] Partitioned table scan can OOM driver and throw a better error message when users need to enable parquet schema mergingYin Huai2015-08-293-42/+65
This fixes the problem that scanning a partitioned table puts the driver under high memory pressure and can take down the cluster. Also, with this fix, we will be able to correctly show the query plan of a query consuming partitioned tables. https://issues.apache.org/jira/browse/SPARK-10339 https://issues.apache.org/jira/browse/SPARK-10334 Finally, this PR squeezes in a "quick fix" for SPARK-10301. It is not a real fix; it just throws a better error message to let users know what to do. Author: Yin Huai <yhuai@databricks.com> Closes #8515 from yhuai/partitionedTableScan.
* [SPARK-10330] Use SparkHadoopUtil TaskAttemptContext reflection methods in more placesJosh Rosen2015-08-295-12/+28
SparkHadoopUtil contains methods that use reflection to work around TaskAttemptContext binary incompatibilities between Hadoop 1.x and 2.x. We should use these methods in more places. Author: Josh Rosen <joshrosen@databricks.com> Closes #8499 from JoshRosen/use-hadoop-reflection-in-more-places.
* [SPARK-10226] [SQL] Fix exclamation mark issue in SparkSQLwangwei2015-08-291-0/+1
When I tested the latest version of Spark with an exclamation mark, I got some errors. I then rolled back the Spark version and found that commit "a2409d1c8e8ddec04b529ac6f6a12b5993f0eeda" introduced the bug: with the jline version changing from 0.9.94 to 2.12 in that commit, the exclamation mark is treated as a special character in ConsoleReader. Author: wangwei <wangwei82@huawei.com> Closes #8420 from small-wang/jline-SPARK-10226.
* [SPARK-10344] [SQL] Add tests for extraStrategiesMichael Armbrust2015-08-292-1/+68
Actually using this API requires access to a lot of classes that we might make private by accident. I've added some tests to prevent this. Author: Michael Armbrust <michael@databricks.com> Closes #8516 from marmbrus/extraStrategiesTests.
* [SPARK-10289] [SQL] A direct write API for testing ParquetCheng Lian2015-08-292-24/+160
This PR introduces a direct write API for testing Parquet. It's a DSL-flavored version of the [`writeDirect` method] [1] that comes with the parquet-avro testing code. With this API, it's much easier to construct arbitrary Parquet structures, which is especially useful when adding regression tests for various compatibility corner cases. Sample usage can be found in the new test case added in `ParquetThriftCompatibilitySuite`. [1]: https://github.com/apache/parquet-mr/blob/apache-parquet-1.8.1/parquet-avro/src/test/java/org/apache/parquet/avro/TestArrayCompatibility.java#L945-L972 Author: Cheng Lian <lian@databricks.com> Closes #8454 from liancheng/spark-10289/parquet-testing-direct-write-api.
* [SPARK-10323] [SQL] fix nullability of In/InSet/ArrayContainDavies Liu2015-08-287-97/+138
After this PR, In/InSet/ArrayContains return null if the tested value is null, instead of false. They also return null when the set/array contains a null and no match is found. Author: Davies Liu <davies@databricks.com> Closes #8492 from davies/fix_in.
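An illustration of the resulting three-valued logic, assuming a Spark 1.5-era shell where `sqlContext` is predefined:

```scala
sqlContext.range(1).registerTempTable("t")   // a one-row helper table

sqlContext.sql("SELECT 1 IN (1, null) FROM t").collect()   // [true]: a definite match wins
sqlContext.sql("SELECT 1 IN (2, null) FROM t").collect()   // [null]: no match, and the null member is "unknown"
sqlContext.sql("SELECT null IN (1, 2) FROM t").collect()   // [null]: the tested value itself is unknown
```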
* [SPARK-10325] Override hashCode() for public RowJosh Rosen2015-08-282-0/+22
This commit fixes an issue where the public SQL `Row` class did not override `hashCode`, causing it to violate the hashCode() + equals() contract. To fix this, I simply ported the `hashCode` implementation from the 1.4.x version of `Row`. Author: Josh Rosen <joshrosen@databricks.com> Closes #8500 from JoshRosen/SPARK-10325 and squashes the following commits: 51ffea1 [Josh Rosen] Override hashCode() for public Row.
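A minimal sketch of the contract being fixed (not Spark's actual Row code): equals() and hashCode() must be derived from the same values, or hash-based collections misbehave.

```scala
final class SimpleRow(val values: Seq[Any]) {
  override def equals(other: Any): Boolean = other match {
    case that: SimpleRow => this.values == that.values
    case _               => false
  }
  // Consistent with equals: equal value sequences hash identically.
  override def hashCode(): Int = values.hashCode()
}

val a = new SimpleRow(Seq(1, "x"))
val b = new SimpleRow(Seq(1, "x"))
assert(a == b && a.hashCode == b.hashCode)  // the contract holds
```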
* [SPARK-SQL] [MINOR] Fixes some typos in HiveContextCheng Lian2015-08-272-5/+5
Author: Cheng Lian <lian@databricks.com> Closes #8481 from liancheng/hive-context-typo.
* [SPARK-10321] sizeInBytes in HadoopFsRelationDavies Liu2015-08-271-0/+2
Add sizeInBytes to HadoopFsRelation to enable broadcast joins. cc marmbrus Author: Davies Liu <davies@databricks.com> Closes #8490 from davies/sizeInByte.
* [SPARK-10287] [SQL] Fixes JSONRelation refreshing on read pathYin Huai2015-08-273-25/+1
https://issues.apache.org/jira/browse/SPARK-10287 After porting json to HadoopFsRelation, it seems hard to keep the behavior of picking up new files automatically for JSON. This PR removes this behavior, so JSON is consistent with others (ORC and Parquet). Author: Yin Huai <yhuai@databricks.com> Closes #8469 from yhuai/jsonRefresh.
* [SPARK-10215] [SQL] Fix precision of division (follow the rule in Hive)Davies Liu2015-08-254-13/+39
Follow the rule in Hive for decimal division. See https://github.com/apache/hive/blob/ac755ebe26361a4647d53db2a28500f71697b276/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFOPDivide.java#L113 cc chenghao-intel Author: Davies Liu <davies@databricks.com> Closes #8415 from davies/decimal_div2.
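A hedged sketch of the division rule referenced above (as in Hive's GenericUDFOPDivide); the handling of results exceeding the 38-digit cap is simplified here.

```scala
val MAX_PRECISION = 38

def divideResultType(p1: Int, s1: Int, p2: Int, s2: Int): (Int, Int) = {
  val scale = math.max(6, s1 + p2 + 1)   // keep at least six fractional digits
  val precision = p1 - s1 + s2 + scale   // integer digits of the quotient plus the scale
  (math.min(precision, MAX_PRECISION), math.min(scale, MAX_PRECISION))
}

// e.g. DECIMAL(10,2) / DECIMAL(5,2) -> DECIMAL(18,8) under this rule
divideResultType(10, 2, 5, 2)  // (18, 8)
```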
* [SPARK-10245] [SQL] Fix decimal literals with precision < scaleDavies Liu2015-08-253-6/+19
In BigDecimal or java.math.BigDecimal, the precision can be smaller than the scale; for example, BigDecimal("0.001") has precision = 1 and scale = 3. But DecimalType requires that the precision be at least the scale, so we should use the maximum of precision and scale when inferring the schema from a decimal literal. Author: Davies Liu <davies@databricks.com> Closes #8428 from davies/smaller_decimal.
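A worked example of the precision/scale mismatch and the proposed inference:

```scala
// Significant digits (precision) can be fewer than fractional digits (scale).
val d = new java.math.BigDecimal("0.001")
d.precision   // 1: only the significant digit "1" counts
d.scale       // 3: three digits after the decimal point

// DecimalType needs precision >= scale, so infer with the maximum of the two:
val inferredPrecision = math.max(d.precision, d.scale)   // 3, i.e. DecimalType(3, 3)
```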
* [SPARK-10048] [SPARKR] Support arbitrary nested Java array in serde.Sun Rui2015-08-251-28/+4
This PR: 1. supports transferring arbitrary nested arrays from the JVM to the R side in SerDe; 2. based on 1, improves the collect() implementation, which can now collect data of complex types from a DataFrame. Author: Sun Rui <rui.sun@intel.com> Closes #8276 from sun-rui/SPARK-10048.
* [SPARK-10198] [SQL] Turn off partition verification by defaultMichael Armbrust2015-08-252-31/+35
Author: Michael Armbrust <michael@databricks.com> Closes #8404 from marmbrus/turnOffPartitionVerification.
* [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConvertersSean Owen2015-08-2546-265/+282
Replace `JavaConversions` implicits with `JavaConverters`. Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I can see, yet. Author: Sean Owen <sowen@cloudera.com> Closes #8033 from srowen/SPARK-9613.
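A small flavor of the migration: explicit JavaConverters make every conversion point visible instead of relying on silent implicits.

```scala
import java.util.{ArrayList => JArrayList}
import scala.collection.JavaConverters._   // instead of scala.collection.JavaConversions._

val javaList = new JArrayList[String]()
javaList.add("a")
javaList.add("b")

val asScalaSeq: Seq[String] = javaList.asScala            // explicit conversion, no hidden implicit
val backToJava: java.util.List[String] = asScalaSeq.asJava
```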
* [SPARK-10197] [SQL] Add null check in wrapperFor (inside HiveInspectors).Yin Huai2015-08-252-5/+53
https://issues.apache.org/jira/browse/SPARK-10197 Author: Yin Huai <yhuai@databricks.com> Closes #8407 from yhuai/ORCSPARK-10197.
* [SPARK-10195] [SQL] Data sources Filter should not expose internal typesJosh Rosen2015-08-254-41/+54
Spark SQL's data sources API exposes Catalyst's internal types through its Filter interfaces. This is a problem because types like UTF8String are not stable developer APIs and should not be exposed to third parties. This issue caused incompatibilities when upgrading our `spark-redshift` library to work against Spark 1.5.0. To avoid these issues in the future, we should only expose public types through these Filter objects. This patch accomplishes that by using CatalystTypeConverters to add the appropriate conversions. Author: Josh Rosen <joshrosen@databricks.com> Closes #8403 from JoshRosen/datasources-internal-vs-external-types.
* [SPARK-10177] [SQL] fix reading Timestamp in parquet from HiveDavies Liu2015-08-253-8/+14
We misunderstood the Julian days and nanoseconds-of-the-day in Parquet (as TimestampType) from Hive/Impala: the two parts overlap, so they can't simply be added together. In order to avoid confusing rounding when doing the conversion, we use `2440588` as the Julian Day of the Unix epoch (which is really 2440587.5). Author: Davies Liu <davies@databricks.com> Author: Cheng Lian <lian@databricks.com> Closes #8400 from davies/timestamp_parquet.
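A hedged sketch of the conversion this implies: Parquet INT96 timestamps from Hive/Impala carry (Julian day, nanoseconds within that day), and 2440588 is used as the integer Julian Day number of 1970-01-01, while the astronomical epoch falls at 2440587.5 because Julian days start at noon.

```scala
val JULIAN_DAY_OF_EPOCH = 2440588L
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

// Combine both parts into microseconds since the Unix epoch.
def julianDayToMicros(julianDay: Int, nanosOfDay: Long): Long =
  (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1000

julianDayToMicros(2440588, 0L)   // 0L: 1970-01-01 00:00:00 UTC
```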
* [SPARK-9293] [SPARK-9813] Analysis should check that set operations are only performed on tables with equal numbers of columnsJosh Rosen2015-08-256-32/+48
This patch adds an analyzer rule to ensure that set operations (union, intersect, and except) are only applied to tables with the same number of columns. Without this rule, there are scenarios where invalid queries can return incorrect results instead of failing with error messages; SPARK-9813 provides one example of this problem. In other cases, the invalid query can crash at runtime with extremely confusing exceptions. I also performed a bit of cleanup to refactor some of those logical operators' code into a common `SetOperation` base class. Author: Josh Rosen <joshrosen@databricks.com> Closes #7631 from JoshRosen/SPARK-9293.
* [SPARK-10136] [SQL] A more robust fix for SPARK-10136Cheng Lian2015-08-251-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | PR #8341 is a valid fix for SPARK-10136, but it didn't catch the real root cause. The real problem can be rather tricky to explain, and requires audiences to be pretty familiar with parquet-format spec, especially details of `LIST` backwards-compatibility rules. Let me have a try to give an explanation here. The structure of the problematic Parquet schema generated by parquet-avro is something like this: ``` message m { <repetition> group f (LIST) { // Level 1 repeated group array (LIST) { // Level 2 repeated <primitive-type> array; // Level 3 } } } ``` (The schema generated by parquet-thrift is structurally similar, just replace the `array` at level 2 with `f_tuple`, and the other one at level 3 with `f_tuple_tuple`.) This structure consists of two nested legacy 2-level `LIST`-like structures: 1. The repeated group type at level 2 is the element type of the outer array defined at level 1 This group should map to an `CatalystArrayConverter.ElementConverter` when building converters. 2. The repeated primitive type at level 3 is the element type of the inner array defined at level 2 This group should also map to an `CatalystArrayConverter.ElementConverter`. The root cause of SPARK-10136 is that, the group at level 2 isn't properly recognized as the element type of level 1. Thus, according to parquet-format spec, the repeated primitive at level 3 is left as a so called "unannotated repeated primitive type", and is recognized as a required list of required primitive type, thus a `RepeatedPrimitiveConverter` instead of a `CatalystArrayConverter.ElementConverter` is created for it. According to parquet-format spec, unannotated repeated type shouldn't appear in a `LIST`- or `MAP`-annotated group. PR #8341 fixed this issue by allowing such unannotated repeated type appear in `LIST`-annotated groups, which is a non-standard, hacky, but valid fix. (I didn't realize this when authoring #8341 though.) As for the reason why level 2 isn't recognized as a list element type, it's because of the following `LIST` backwards-compatibility rule defined in the parquet-format spec: > If the repeated field is a group with one field and is named either `array` or uses the `LIST`-annotated group's name with `_tuple` appended then the repeated type is the element type and elements are required. (The `array` part is for parquet-avro compatibility, while the `_tuple` part is for parquet-thrift.) This rule is implemented in [`CatalystSchemaConverter.isElementType`] [1], but neglected in [`CatalystRowConverter.isElementType`] [2]. This PR delivers a more robust fix by adding this rule in the latter method. Note that parquet-avro 1.7.0 also suffers from this issue. Details can be found at [PARQUET-364] [3]. [1]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala#L259-L305 [2]: https://github.com/apache/spark/blob/85f9a61357994da5023b08b0a8a2eb09388ce7f8/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala#L456-L463 [3]: https://issues.apache.org/jira/browse/PARQUET-364 Author: Cheng Lian <lian@databricks.com> Closes #8361 from liancheng/spark-10136/proper-version.
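A hedged, stand-in-typed sketch of the backwards-compatibility rule quoted above (the real check operates on parquet-mr types in CatalystSchemaConverter.isElementType):

```scala
sealed trait PType { def name: String }
case class Primitive(name: String) extends PType
case class Group(name: String, fields: Seq[PType]) extends PType

// Decide whether the repeated field under a LIST-annotated group is itself the
// element type, or a legacy 2-level wrapper that must be descended into.
def isElementType(repeatedField: PType, listName: String): Boolean = repeatedField match {
  case Primitive(_)                        => true  // unannotated repeated primitive is the element
  case Group(_, fields) if fields.size > 1 => true  // multi-field repeated group is a struct element
  case Group(name, _) if name == "array" || name == s"${listName}_tuple" => true  // legacy avro/thrift names
  case _                                   => false // otherwise: legacy wrapper, recurse into it
}
```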