path: root/sql
* [SQL] Minor Scaladoc format fix
  Cheng Lian, 2016-01-26 (1 file changed, -4/+4)

  Otherwise the `^` character is always marked as an error in IntelliJ, since it represents an unclosed superscript markup tag.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10926 from liancheng/agg-doc-fix.

* [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format
  Sameer Agarwal, 2016-01-26 (2 files changed, -0/+39)

  This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024), which in turn prevents such tables from being queried in SparkSQL.

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #10826 from sameeragarwal/skip-hive-metadata.

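  A minimal usage sketch, assuming the option is passed through the data source write path (the option name comes from the PR description; `df` and the table name are placeholders):

  ```
  // Hypothetical sketch (not from the patch): opt out of Hive-format
  // metadata when persisting a wide table through the data source API.
  df.write
    .format("parquet")
    .option("skip_hive_metadata", "true")
    .saveAsTable("wide_table")
  ```
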
* [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
  Sean Owen, 2016-01-26 (2 files changed, -16/+13)

  Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not an Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

  CC rxin pwendell for API change; tdas since it also touches streaming.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #10413 from srowen/SPARK-3369.

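  For context, a sketch of the Scala contract that the Java API now mirrors (illustrative names, not from the patch):

  ```
  // The function passed to mapPartitions consumes and produces an
  // Iterator, not an Iterable; the Java counterparts now match this.
  val rdd = sc.parallelize(1 to 100, 4)
  val partitionSums = rdd.mapPartitions { (iter: Iterator[Int]) =>
    Iterator.single(iter.sum) // one partial sum per partition
  }
  ```
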
* [SQL][MINOR] A few minor tweaks to CSV reader.
  Reynold Xin, 2016-01-26 (2 files changed, -14/+9)

  This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10919 from rxin/csv-minor.

* [SPARK-12879] [SQL] improve the unsafe row writing framework
  Wenchen Fan, 2016-01-25 (7 files changed, -78/+258)

  As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more documentation to it and make it easier to use.

  This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip updating the row size. Then we can apply this technique to more places easily.

  A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:

  **old version**
  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  unsafe projection:             Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  single long                         2616.04           102.61         1.00 X
  single nullable long                3032.54            88.52         0.86 X
  primitive types                     9121.05            29.43         0.29 X
  nullable primitive types           12410.60            21.63         0.21 X
  ```

  **new version**
  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  unsafe projection:             Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  single long                         1533.34           175.07         1.00 X
  single nullable long                2306.73           116.37         0.66 X
  primitive types                     8403.93            31.94         0.18 X
  nullable primitive types           12448.39            21.56         0.12 X
  ```

  For a single non-nullable long (the best case), we get about a 1.7x speedup. Even if it's nullable, we still get a 1.3x speedup. For the other cases the boost is smaller, as the saved operations are only a small proportion of the whole process. The benchmark code is included in this PR.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10809 from cloud-fan/unsafe-projection.

* [SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns
  gatorsmile, 2016-01-25 (3 files changed, -3/+83)

  When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of the partitioning columns. For example:

  ```
  df.write
    .format(source)
    .partitionBy("i")
    .bucketBy(8, "i", "k")
    .saveAsTable("bucketed_table")
  ```

  In the above case, adding column `i` to `bucketBy` is useless. It just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can issue an exception and let users make the change.

  Also added a test case checking that the `sortBy` and `bucketBy` column information is correctly saved in the metastore table.

  Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10891 from gatorsmile/commonKeysInPartitionByBucketBy.

* [SPARK-12901][SQL][HOT-FIX] Fix scala 2.11 compilation.
  Yin Huai, 2016-01-25 (2 files changed, -2/+2)

* [SPARK-12902] [SQL] visualization for generated operators
  Davies Liu, 2016-01-25 (7 files changed, -30/+98)

  This PR brings back visualization for generated operators; they look like:

  ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)
  ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)

  Note: SQL metrics are not supported right now because they are very slow; they will be supported once we have batch mode.

  Author: Davies Liu <davies@databricks.com>
  Closes #10828 from davies/viz_codegen.

* [SPARK-12932][JAVA API] improved error message for java type inference failure
  Andy Grove, 2016-01-25 (1 file changed, -1/+2)

  Author: Andy Grove <andygrove73@gmail.com>
  Closes #10865 from andygrove/SPARK-12932.

* [SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format).
  hyukjinkwon, 2016-01-25 (6 files changed, -52/+40)

  https://issues.apache.org/jira/browse/SPARK-12901

  This PR refactors the options in the JSON and CSV datasources. In more detail:

  1. `JSONOptions` uses the same format as `CSVOptions`.
  2. They are no longer case classes.
  3. `CSVRelation` does not have to be serializable (it was `with Serializable`, but I removed that).

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10895 from HyukjinKwon/SPARK-12901.

* [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows
  Cheng Lian, 2016-01-24 (1 file changed, -1/+8)

  When the actual row length doesn't conform to the specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10886 from liancheng/spark-12624.

* [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build
  Josh Rosen, 2016-01-24 (2 files changed, -4/+22)

  ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive). This patch attempts to improve the isolation of these tests in order to address this issue.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.

* [SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons
  Reynold Xin, 2016-01-23 (6 files changed, -139/+376)

  This pull request implements strength reduction for comparisons between integral expressions and decimal literals, which is more common now because we switched to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to its growing size.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10882 from rxin/SPARK-12904-1.

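  An illustrative sketch of the kind of rewrite this enables (the exact rules live in DecimalPrecision; the query and column names below are assumed, not taken from the patch):

  ```
  // Comparing an INT column with the decimal literal 1.5 need not widen
  // the column to decimal: since no integer lies strictly between 1 and 2,
  //   int_col > 1.5  can be strength-reduced to  int_col >= 2.
  val filtered = sqlContext.sql("SELECT * FROM t WHERE int_col > 1.5")
  ```
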
* [SPARK-12872][SQL] Support to specify the option for compression codec for JSON datasource
  hyukjinkwon, 2016-01-22 (5 files changed, -29/+96)

  https://issues.apache.org/jira/browse/SPARK-12872

  This PR lets the JSON datasource compress its output via an option instead of manually setting Hadoop configurations. For resolving codecs by name, it is similar to https://github.com/apache/spark/pull/10805. As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class, `CompressionCodecs`, to share.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10858 from HyukjinKwon/SPARK-12872.

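  A minimal usage sketch (the option key is assumed from the PR title and may be spelled differently in this era; `df` and the path are placeholders):

  ```
  // Write gzip-compressed JSON via a datasource option rather than
  // hand-setting Hadoop output-compression properties.
  df.write
    .format("json")
    .option("compression", "gzip")
    .save("/tmp/events_json")
  ```
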
* [SPARK-12959][SQL] Writing Bucketed Data with Disabled Bucketing in SQLConf
  gatorsmile, 2016-01-22 (3 files changed, -6/+26)

  When users turn off bucketing in SQLConf, we should issue messages telling users that these operations will be converted to the normal (non-bucketed) way. Also added a test case for this scenario and fixed the helper function.

  Do you think this PR is helpful when using bucketed tables? cloud-fan Thank you!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10870 from gatorsmile/bucketTableWritingTestcases.

* [SPARK-12747][SQL] Use correct type name for Postgres JDBC's real array
  Liang-Chi Hsieh, 2016-01-21 (2 files changed, -0/+4)

  https://issues.apache.org/jira/browse/SPARK-12747

  The Postgres JDBC driver uses "FLOAT4" or "FLOAT8", not "real".

  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #10695 from viirya/fix-postgres-jdbc.

* [SPARK-8968] [SQL] [HOT-FIX] Fix scala 2.11 build.
  Yin Huai, 2016-01-20 (1 file changed, -1/+1)

* [SPARK-8968][SQL] external sort by the partition columns when dynamic partitioning to optimize the memory overhead
  wangfei, 2016-01-20 (2 files changed, -99/+166)

  The hash-based writer for dynamic partitioning shows bad performance on big data and causes many small files and high GC pressure. With this patch we do an external sort first, so that each time we only need to keep one writer open.

  Before this patch:
  ![gc](https://cloud.githubusercontent.com/assets/7018048/9149788/edc48c6e-3dec-11e5-828c-9995b56e4d65.PNG)

  After this patch:
  ![gc-optimize-externalsort](https://cloud.githubusercontent.com/assets/7018048/9149794/60f80c9c-3ded-11e5-8a56-7ae18ddc7a2f.png)

  Author: wangfei <wangfei_hello@126.com>
  Author: scwf <wangfei1@huawei.com>
  Closes #7336 from scwf/dynamic-optimize-basedon-apachespark.

* [SPARK-12797] [SQL] Generated TungstenAggregate (without grouping keys)
  Davies Liu, 2016-01-20 (5 files changed, -12/+111)

  As discussed in #10786, the generated TungstenAggregate does not support imperative functions. For a query

  ```
  sqlContext.range(10).filter("id > 1").groupBy().count()
  ```

  the generated code will look like:

  ```
  /* 032 */   if (!initAgg0) {
  /* 033 */     initAgg0 = true;
  /* 034 */
  /* 035 */     // initialize aggregation buffer
  /* 037 */     long bufValue2 = 0L;
  /* 038 */
  /* 039 */
  /* 040 */     // initialize Range
  /* 041 */     if (!range_initRange5) {
  /* 042 */       range_initRange5 = true;
       ...
  /* 071 */     }
  /* 072 */
  /* 073 */     while (!range_overflow8 && range_number7 < range_partitionEnd6) {
  /* 074 */       long range_value9 = range_number7;
  /* 075 */       range_number7 += 1L;
  /* 076 */       if (range_number7 < range_value9 ^ 1L < 0) {
  /* 077 */         range_overflow8 = true;
  /* 078 */       }
  /* 079 */
  /* 085 */       boolean primitive11 = false;
  /* 086 */       primitive11 = range_value9 > 1L;
  /* 087 */       if (!false && primitive11) {
  /* 092 */         // do aggregate and update aggregation buffer
  /* 099 */         long primitive17 = -1L;
  /* 100 */         primitive17 = bufValue2 + 1L;
  /* 101 */         bufValue2 = primitive17;
  /* 105 */       }
  /* 107 */     }
  /* 109 */
  /* 110 */     // output the result
  /* 112 */     bufferHolder25.reset();
  /* 114 */     rowWriter26.initialize(bufferHolder25, 1);
  /* 118 */     rowWriter26.write(0, bufValue2);
  /* 120 */     result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize());
  /* 121 */     currentRow = result24;
  /* 122 */     return;
  /* 124 */   }
  /* 125 */
  ```

  cc nongli

  Author: Davies Liu <davies@databricks.com>
  Closes #10840 from davies/gen_agg.

* [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal
  Herman van Hovell, 2016-01-20 (27 files changed, -58/+82)

  The current parser turns a decimal literal, for example `12.1`, into a Double. The problem with this approach is that we convert an exact literal into a non-exact `Double`. This PR changes that behavior: a decimal literal is now converted into an exact `BigDecimal`. The behavior for scientific decimals, for example `12.1e01`, is unchanged; these will still be converted into a Double.

  This PR also replaces the `BigDecimal` literal by a `Double` literal, because `BigDecimal` is now the default. You can use the double literal by appending a 'D' to the value, for instance: `3.141527D`.

  cc davies rxin

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10796 from hvanhovell/SPARK-12848.

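  Illustrative examples of the new literal typing, assumed from the description above rather than taken from the patch:

  ```
  // Sketch of expected result types after this change:
  sqlContext.sql("SELECT 12.1")       // plain decimal literal -> exact DecimalType
  sqlContext.sql("SELECT 12.1e01")    // scientific notation   -> DoubleType (unchanged)
  sqlContext.sql("SELECT 3.141527D")  // explicit 'D' suffix   -> DoubleType
  ```
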
* [SPARK-12888][SQL] benchmark the new hash expression
  Wenchen Fan, 2016-01-20 (1 file changed, -0/+104)

  Benchmarked on 4 different schemas; the results:

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For simple:               Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                   31.47           266.54         1.00 X
  codegen version                       64.52           130.01         0.49 X
  ```

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For normal:               Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                 4068.11             0.26         1.00 X
  codegen version                     1175.92             0.89         3.46 X
  ```

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For array:                Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                 9276.70             0.06         1.00 X
  codegen version                    14762.23             0.04         0.63 X
  ```

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For map:                  Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                58869.79             0.01         1.00 X
  codegen version                     9285.36             0.06         6.34 X
  ```

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10816 from cloud-fan/hash-benchmark.

* [SPARK-12616][SQL] Making Logical Operator `Union` Support Arbitrary Number of Children
  gatorsmile, 2016-01-20 (20 files changed, -122/+322)

  The existing `Union` logical operator only supports two children. This PR adds a new logical operator `Unions`, which can have an arbitrary number of children, to replace the existing one.

  The `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files; in this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single one. Note that this problem doesn't exist in the physical plan, because the physical `Union` already supports an arbitrary number of children.

  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #10577 from gatorsmile/unionAllMultiChildren.

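  A sketch of the user-side pattern that motivates this change (the optimizer-side collapse is internal; `paths` is an assumed Seq[String] of input locations):

  ```
  // Unioning many inputs builds a deep binary tree of Union nodes under
  // the old design; with an n-ary Union the optimizer can flatten the
  // whole chain in a single pass.
  val parts = paths.map(p => sqlContext.read.parquet(p))
  val all = parts.reduce(_ unionAll _)
  ```
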
* [SPARK-12898] Consider having dummyCallSite for HiveTableScan
  Rajesh Balamohan, 2016-01-20 (1 file changed, -3/+10)

  Currently, HiveTableScan runs with getCallSite, which is really expensive and shows up when scanning through a large table with partitions (e.g. TPC-DS), slowing down the overall runtime of the job. It would be good to consider having dummyCallSite in HiveTableScan.

  Author: Rajesh Balamohan <rbalamohan@apache.org>
  Closes #10825 from rajeshbalamohan/SPARK-12898.

* [SPARK-12925][SQL] Improve HiveInspectors.unwrap for StringObjectIns…
  Rajesh Balamohan, 2016-01-20 (1 file changed, -1/+3)

  Text is in UTF-8, and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details are attached in the JIRA (ref: https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png).

  Author: Rajesh Balamohan <rbalamohan@apache.org>
  Closes #10848 from rajeshbalamohan/SPARK-12925.

* [SPARK-12881] [SQL] subexpression elimination in mutable projection
  Davies Liu, 2016-01-20 (11 files changed, -27/+80)

  Author: Davies Liu <davies@databricks.com>
  Closes #10814 from davies/mutable_subexpr.

* [SPARK-12912][SQL] Add a test suite for EliminateSubQueries
  Reynold Xin, 2016-01-20 (4 files changed, -26/+103)

  Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than the analyzer.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10837 from rxin/optimizer-analyzer-comment.

* [SPARK-12871][SQL] Support to specify the option for compression codec.
  hyukjinkwon, 2016-01-19 (3 files changed, -2/+70)

  https://issues.apache.org/jira/browse/SPARK-12871

  This PR adds an option to specify a compression codec: the option `codec`, with `compression` as an alias, as filed in [SPARK-12668](https://issues.apache.org/jira/browse/SPARK-12668). Note that I did not add configurations for Hadoop 1.x, as this `CsvRelation` is using the Hadoop 2.x API and I guess it is going to drop Hadoop 1.x support.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10805 from HyukjinKwon/SPARK-12420.

* [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen
  Reynold Xin, 2016-01-19 (2 files changed, -0/+55)

  The three optimization cases are:

  1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
  2. If a branch's condition is a false or null literal, remove that branch.
  3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10827 from rxin/SPARK-12770.

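  Illustrative rewrites for the three rules (assumed examples, not taken from the patch):

  ```
  // Rule 1: CASE WHEN true  THEN a WHEN c THEN b END  =>  a
  // Rule 2: CASE WHEN false THEN a WHEN c THEN b END  =>  CASE WHEN c THEN b END
  // Rule 3: CASE WHEN false THEN a ELSE e END         =>  e   (rules 2 + 3 combined)
  val simplified = sqlContext.sql("SELECT CASE WHEN true THEN 1 ELSE 2 END AS x FROM t")
  ```
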
* [SPARK-12816][SQL] De-alias type when generating schemas
  Jakob Odersky, 2016-01-19 (2 files changed, -1/+12)

  Call `dealias` on local types to fix schema generation for abstract type members, such as:

  ```
  type KeyValue = (Int, String)
  ```

  Add simple test.

  Author: Jakob Odersky <jodersky@gmail.com>
  Closes #10749 from jodersky/aliased-schema.

* [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings
  Imran Rashid, 2016-01-19 (1 file changed, -1/+1)

  See https://issues.apache.org/jira/browse/SPARK-12560

  This isn't causing any problems currently, because the tests for string predicate pushdown are currently disabled. I ran into this while trying to turn them back on with a different version of parquet; figured it was good to fix now in any case.

  Author: Imran Rashid <irashid@cloudera.com>
  Closes #10510 from squito/SPARK-12560.

* [SPARK-12867][SQL] Nullability of Intersect can be stricter
  gatorsmile, 2016-01-19 (2 files changed, -6/+33)

  JIRA: https://issues.apache.org/jira/browse/SPARK-12867

  When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make the nullability of `intersect` stricter.

  liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10812 from gatorsmile/nullabilityIntersect.

* [SPARK-12887] Do not expose var's in TaskMetrics
  Andrew Or, 2016-01-19 (2 files changed, -3/+1)

  This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators.

  TaskMetrics has a bunch of var's, some fully public, some `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics; this has happened a few times in the past and caused bugs that were difficult to debug. Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them.

  Parent PR: #10717

  Author: Andrew Or <andrew@databricks.com>
  Author: Josh Rosen <joshrosen@databricks.com>
  Author: andrewor14 <andrew@databricks.com>
  Closes #10815 from andrewor14/get-or-create-metrics.

* [SPARK-12870][SQL] better format bucket id in file name
  Wenchen Fan, 2016-01-19 (4 files changed, -7/+13)

  For a normal parquet file without buckets, the file name ends with a jobUUID, which may be all numbers and mistakenly regarded as a bucket id. This PR improves the format of the bucket id in the file name by using a different separator, `_`, so that the regex is more robust.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10799 from cloud-fan/fix-bucket.

* [SQL][MINOR] Fix one little mismatched comment according to the codes in interface.scala
  proflin, 2016-01-19 (1 file changed, -1/+1)

  Author: proflin <proflin.me@gmail.com>
  Closes #10824 from proflin/master.

* [SPARK-12668][SQL] Providing aliases for CSV options to be similar to Pandas and R
  hyukjinkwon, 2016-01-18 (3 files changed, -6/+20)

  https://issues.apache.org/jira/browse/SPARK-12668

  The Spark CSV datasource has been merged (filed in [SPARK-12420](https://issues.apache.org/jira/browse/SPARK-12420)). This is a quick PR that simply adds aliases for several CSV options, similar to Pandas and R:

  - Alias for delimiter -> sep
  - Alias for charset -> encoding

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10800 from HyukjinKwon/SPARK-12668.

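  A usage sketch with the new aliases (the path and option values are placeholders):

  ```
  // "sep" aliases "delimiter"; "encoding" aliases "charset".
  val df = sqlContext.read
    .format("csv")
    .option("sep", ";")
    .option("encoding", "ISO-8859-1")
    .load("/tmp/data.csv")
  ```
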
* [HOT][BUILD] Changed the import order
  gatorsmile, 2016-01-18 (2 files changed, -2/+2)

  This PR fixes the master's build break. The following tests failed due to import order issues in the master:

  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49651/consoleFull
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49652/consoleFull
  https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49653/consoleFull

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10823 from gatorsmile/importOrder.

* [SPARK-12700] [SQL] embed condition into SMJ and BroadcastHashJoin
  Davies Liu, 2016-01-18 (6 files changed, -72/+96)

  Currently SortMergeJoin and BroadcastHashJoin do not support a join condition; they need a following Filter for that, and the result projection to generate UnsafeRow could be very expensive if they generate lots of rows that are mostly filtered out by the condition.

  This PR brings support for a condition to SortMergeJoin and BroadcastHashJoin, just like the other outer joins. This improves the performance of Q72 by 7x (from 120s to 16.5s).

  Author: Davies Liu <davies@databricks.com>
  Closes #10653 from davies/filter_join.

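  An illustrative query shape that benefits (an equi-join key plus a non-equi predicate; the DataFrames and column names are placeholders):

  ```
  // The extra predicate was previously planned as a separate Filter above
  // the join; it can now be evaluated inside SMJ/BroadcastHashJoin itself.
  val joined = orders.join(items,
    orders("item_id") === items("id") && items("price") > 100)
  ```
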
* [SPARK-12889][SQL] Rename ParserDialect -> ParserInterface.
  Reynold Xin, 2016-01-18 (7 files changed, -10/+10)

  Based on discussions in #10801, I'm submitting a pull request to rename ParserDialect to ParserInterface.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10817 from rxin/SPARK-12889.

* [SPARK-12882][SQL] simplify bucket tests and add more comments
  Wenchen Fan, 2016-01-18 (2 files changed, -46/+78)

  Right now the bucket tests are kind of hard to understand; this PR simplifies them and adds more comments.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10813 from cloud-fan/bucket-comment.

* [SPARK-12841][SQL] fix cast in filter
  Wenchen Fan, 2016-01-18 (3 files changed, -8/+18)

  In SPARK-10743 we wrapped cast with `UnresolvedAlias` to give `Cast` a better alias if possible. However, for cases like `filter`, the `UnresolvedAlias` can't be resolved, and we actually don't need a better alias in this case. This PR moves the cast-wrapping logic to `Column.named`, so that we only do it when we need an alias name.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10781 from cloud-fan/bug.

* [SPARK-12855][SQL] Remove parser dialect developer API
  Reynold Xin, 2016-01-18 (12 files changed, -138/+13)

  This pull request removes the public developer parser API for external parsers. Given that everything a parser depends on (e.g. logical plans and expressions) is internal and not stable, external parsers would break with every release of Spark. It is a bad idea to create the illusion that Spark actually supports pluggable parsers. In addition, this also reduces incentives for 3rd party projects to contribute parser improvements back to Spark.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10801 from rxin/SPARK-12855.

* [SPARK-12873][SQL] Add more comment in HiveTypeCoercion for type widening
  Reynold Xin, 2016-01-18 (2 files changed, -40/+49)

  I was reading this part of the analyzer code again and got confused by the difference between findWiderTypeForTwo and findTightestCommonTypeOfTwo. I also simplified WidenSetOperationTypes to make it a lot simpler. The easiest way to review this one is to just read the original code and the new code; the logic is super simple.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10802 from rxin/SPARK-12873.

* [SPARK-12558][FOLLOW-UP] AnalysisException when multiple functions applied in GROUP BY clause
  Dilip Biswal, 2016-01-18 (1 file changed, -5/+9)

  Addresses the comments from Yin: https://github.com/apache/spark/pull/10520

  Author: Dilip Biswal <dbiswal@us.ibm.com>
  Closes #10758 from dilipbiswal/spark-12558-followup.

* [SPARK-12860] [SQL] speed up safe projection for primitive types
  Wenchen Fan, 2016-01-17 (1 file changed, -2/+3)

  The idea is simple: use `SpecificMutableRow` instead of `GenericMutableRow` as the result row for safe projection. A simple benchmark shows about a 1.5x speedup for primitive types; benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-safeprojectionbenchmark-scala

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10790 from cloud-fan/safe-projection.

* [SPARK-12796] [SQL] Whole stage codegen
  Davies Liu, 2016-01-16 (37 files changed, -107/+694)

  This is the initial work for whole stage codegen. It supports Projection/Filter/Range; we will continue working on this to support more physical operators. A micro benchmark shows that a query with range, filter and projection could be 3x faster than before.

  It's turned on by default. For a tree that has at least two chained plans, a WholeStageCodegen will be inserted into it. For example, the following plan

  ```
  Limit 10
  +- Project [(id#5L + 1) AS (id + 1)#6L]
     +- Filter ((id#5L & 1) = 1)
        +- Range 0, 1, 4, 10, [id#5L]
  ```

  will be translated into

  ```
  Limit 10
  +- WholeStageCodegen
     +- Project [(id#1L + 1) AS (id + 1)#2L]
        +- Filter ((id#1L & 1) = 1)
           +- Range 0, 1, 4, 10, [id#1L]
  ```

  Here is the call graph to generate Java source for A and B (A supports codegen, but B does not):

  ```
  * WholeStageCodegen       Plan A               FakeInput        Plan B
  * =========================================================================
  *
  * -> execute()
  *     |
  *  doExecute() --------->   produce()
  *                             |
  *                          doProduce()  -------> produce()
  *                                                   |
  *                                                doProduce() ---> execute()
  *                                                                    |
  *                                                                 consume()
  *                          doConsume()  <------------|
  *     |
  *  doConsume()  <------  consume()
  ```

  A SparkPlan that supports codegen needs to implement doProduce() and doConsume():

  ```
  def doProduce(ctx: CodegenContext): (RDD[InternalRow], String)
  def doConsume(ctx: CodegenContext, child: SparkPlan, input: Seq[ExprCode]): String
  ```

  cc nongli

  Author: Davies Liu <davies@databricks.com>
  Closes #10735 from davies/whole2.

* [SPARK-12856] [SQL] speed up hashCode of unsafe array
  Wenchen Fan, 2016-01-16 (1 file changed, -5/+2)

  We iterated over the bytes to calculate the hashCode before, but now we have `Murmur3_x86_32.hashUnsafeBytes`, which doesn't require the bytes to be word-aligned, so we should use that instead. A simple benchmark shows it's about 3x faster; benchmark code: https://gist.github.com/cloud-fan/fa77713ccebf0823b2ab#file-arrayhashbenchmark-scala

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10784 from cloud-fan/array-hashcode.

* [SPARK-12840] [SQL] Support passing arbitrary objects (not just expressions) into code generated classes
  Davies Liu, 2016-01-15 (11 files changed, -49/+48)

  This is a refactoring to support codegen for aggregation and broadcast join.

  Author: Davies Liu <davies@databricks.com>
  Closes #10777 from davies/rename2.

* [SPARK-12644][SQL] Update parquet reader to be vectorized.
  Nong Li, 2016-01-15 (11 files changed, -53/+622)

  This inlines a few of the Parquet decoders and adds vectorized APIs to support decoding in batch. There are a few particulars in the Parquet encodings that make this much more efficient. In particular, RLE encodings are very well suited for batch decoding, and the Parquet 2.0 encodings are also well suited for this.

  This is a work in progress and does not affect the current execution. In subsequent patches, we will support more encodings and types before enabling this.

  Simple benchmarks indicate this can decode single ints more than 3x faster.

  Author: Nong Li <nong@databricks.com>
  Author: Nong <nongli@gmail.com>
  Closes #10593 from nongli/spark-12644.

* [SPARK-12649][SQL] support reading bucketed table
  Wenchen Fan, 2016-01-15 (18 files changed, -45/+314)

  This PR adds support for reading bucketed tables and correctly populates `outputPartitioning`, so that we can avoid shuffles in some cases.

  TODO (follow-up PRs), with a usage sketch after this list:

  * bucket pruning
  * avoid shuffle for bucketed table join when using any super-set of the bucketing key (we should revisit this after https://issues.apache.org/jira/browse/SPARK-12704 is fixed)
  * recognize hive bucketed table

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10604 from cloud-fan/bucket-read.

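  A sketch of writing and then reading a bucketed table (API names from this line of work; the bucket count, column, and table name are placeholders):

  ```
  // Persist a table bucketed by "key" into 8 buckets, then read it back;
  // operations keyed on "key" may reuse the populated outputPartitioning
  // and avoid a shuffle in some cases.
  df.write
    .format("parquet")
    .bucketBy(8, "key")
    .saveAsTable("bucketed_table")

  val bucketed = sqlContext.table("bucketed_table")
  ```
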
* [SPARK-12833][HOT-FIX] Reset the locale after we set it.
  Yin Huai, 2016-01-15 (1 file changed, -4/+9)

  Author: Yin Huai <yhuai@databricks.com>
  Closes #10778 from yhuai/resetLocale.

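  A minimal sketch of the save-and-restore pattern the fix title describes (assumed shape; the actual test code is in the PR):

  ```
  import java.util.Locale

  // Remember the JVM default locale, change it for the locale-sensitive
  // test body, and restore it in a finally block so later tests are
  // unaffected.
  val original = Locale.getDefault
  try {
    Locale.setDefault(Locale.GERMAN)
    // ... locale-sensitive code under test ...
  } finally {
    Locale.setDefault(original)
  }
  ```
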