path: root/sql
Commit message · Author · Age · Files · Lines
* [SPARK-13031][SQL] cleanup codegen and improve test coverage
  Davies Liu · 2016-01-29 · 11 files changed, -205/+350

  1. enable whole stage codegen during tests, even when only one operator supports it.
  2. split doProduce() into two APIs: upstream() and doProduce()
  3. generate a prefix for the fresh names of each operator
  4. pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again)
  5. fix bugs and tests.

  This PR re-opens #10944 and fixes the bug.

  Author: Davies Liu <davies@databricks.com>
  Closes #10977 from davies/gen_refactor.
* [SPARK-13067] [SQL] workaround for a weird scala reflection problem
  Wenchen Fan · 2016-01-28 · 2 files changed, -6/+23

  A simple workaround to avoid getting parameter types when converting a logical plan to JSON.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10970 from cloud-fan/reflection.
* [SPARK-12968][SQL] Implement command to set current database
  Liang-Chi Hsieh · 2016-01-28 · 9 files changed, -3/+50

  JIRA: https://issues.apache.org/jira/browse/SPARK-12968

  Implement command to set current database.

  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Author: Liang-Chi Hsieh <viirya@appier.com>
  Closes #10916 from viirya/ddl-use-database.
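  For context, a minimal usage sketch of a "set current database" command like this one. It is illustrative only and not taken from the patch; the database/table names are made up, and it assumes a `SQLContext` named `sqlContext`.

  ```scala
  // Hypothetical usage sketch (not from the patch): switch the session's current database.
  sqlContext.sql("USE sales_db")                 // set the current database
  sqlContext.sql("SELECT * FROM orders").show()  // 'orders' now resolves against sales_db
  ```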
* Revert "[SPARK-13031] [SQL] cleanup codegen and improve test coverage"Davies Liu2016-01-289-334/+202
| | | | This reverts commit cc18a7199240bf3b03410c1ba6704fe7ce6ae38e.
* [SPARK-11955][SQL] Mark optional fields in merging schema for safely pushdowning filters in Parquet
  Liang-Chi Hsieh · 2016-01-28 · 6 files changed, -29/+117

  JIRA: https://issues.apache.org/jira/browse/SPARK-11955

  Currently we simply skip pushing down filters in Parquet if schema merging is enabled. However, we can actually mark particular fields in the merged schema so that filters on them can be safely pushed down to Parquet.

  Author: Liang-Chi Hsieh <viirya@appier.com>
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #9940 from viirya/safe-pushdown-parquet-filters.
* [SPARK-12749][SQL] add json option to parse floating-point types as DecimalType
  Brandon Bradley · 2016-01-28 · 4 files changed, -2/+38

  I tried to add this via the `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success.

  Added a test for non-complex types. Should I add a test for complex types?

  Author: Brandon Bradley <bradleytastic@gmail.com>
  Closes #10936 from blbradley/spark-12749.
* [SPARK-13031] [SQL] cleanup codegen and improve test coverage
  Davies Liu · 2016-01-28 · 9 files changed, -202/+334

  1. enable whole stage codegen during tests, even when only one operator supports it.
  2. split doProduce() into two APIs: upstream() and doProduce()
  3. generate a prefix for the fresh names of each operator
  4. pass UnsafeRow to the parent directly (avoid getters and creating UnsafeRow again)
  5. fix bugs and tests.

  Author: Davies Liu <davies@databricks.com>
  Closes #10944 from davies/gen_refactor.
* [SPARK-12926][SQL] SQLContext to display warning message when non-sql configs are being set
  Tejas Patil · 2016-01-28 · 1 file changed, -3/+11

  Users unknowingly try to set core Spark configs in SQLContext but later realise that it didn't work, e.g. sqlContext.sql("SET spark.shuffle.memoryFraction=0.4"). This PR adds a warning message when such operations are done.

  Author: Tejas Patil <tejasp@fb.com>
  Closes #10849 from tejasapatil/SPARK-12926.
* [SPARK-12818][SQL] Specialized integral and string types for Count-min Sketch
  Cheng Lian · 2016-01-28 · 1 file changed, -26/+39

  This PR is a follow-up of #10911. It adds specialized update methods for `CountMinSketch` so that we can avoid doing internal/external row format conversion in `DataFrame.countMinSketch()`.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10968 from liancheng/cms-specialized.
* [SPARK-13045] [SQL] Remove ColumnVector.Struct in favor of ColumnarBatch.Row
  Nong Li · 2016-01-27 · 3 files changed, -120/+32

  These two classes became identical as the implementation progressed.

  Author: Nong Li <nong@databricks.com>
  Closes #10952 from nongli/spark-13045.
* [SPARK-12865][SPARK-12866][SQL] Migrate SparkSQLParser/ExtendedHiveQlParser commands to new Parser
  Herman van Hovell · 2016-01-27 · 16 files changed, -226/+161

  This PR moves all the functionality provided by the SparkSQLParser/ExtendedHiveQlParser to the new Parser hierarchy (SparkQl/HiveQl). It also improves the current SET command parsing: the current implementation swallows ```set role ...``` and ```set autocommit ...``` commands; this PR respects these commands (and passes them on to Hive).

  This PR and https://github.com/apache/spark/pull/10723 end the use of Parser-Combinator parsers for SQL parsing. As a result we can also remove the ```AbstractSQLParser``` in Catalyst.

  The PR is marked WIP as long as it doesn't pass all tests.

  cc rxin viirya winningsix (this touches https://github.com/apache/spark/pull/10144)

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10905 from hvanhovell/SPARK-12866.
* [SPARK-12938][SQL] DataFrame API for Bloom filter
  Wenchen Fan · 2016-01-27 · 3 files changed, -2/+127

  This PR integrates the Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.

  This PR also adds two specialized `put` versions (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10937 from cloud-fan/bloom-filter.
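  A rough usage sketch of a DataFrame-level Bloom filter API like the one described above. The exact entry point (`df.stat.bloomFilter`) and its parameters are my assumption rather than something confirmed by this log; column names and sizes are illustrative.

  ```scala
  // Hypothetical usage sketch, assuming the API is exposed via df.stat and returns the
  // BloomFilter type from spark-sketch; names and parameters are illustrative only.
  val df = sqlContext.range(0, 10000).toDF("id")

  // Build a Bloom filter over "id", sized for ~10k items with a ~3% false-positive rate.
  val bf = df.stat.bloomFilter("id", 10000L, 0.03)

  bf.mightContain(42L)      // true: 42 was inserted
  bf.mightContain(123456L)  // usually false; may be true with probability ~ the fpp
  ```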
* [SPARK-12895][SPARK-12896] Migrate TaskMetrics to accumulators
  Andrew Or · 2016-01-27 · 15 files changed, -66/+92

  The high level idea is that instead of having the executors send both accumulator updates and TaskMetrics, we should have them send only accumulator updates. This eliminates the need to maintain both code paths since one can be implemented in terms of the other. This effort is split into two parts:

  **SPARK-12895: Implement TaskMetrics using accumulators.** TaskMetrics is basically just a bunch of accumulable fields. This patch makes TaskMetrics a syntactic wrapper around a collection of accumulators so we don't need to send TaskMetrics from the executors to the driver.

  **SPARK-12896: Send only accumulator updates to the driver.** Now that TaskMetrics are expressed in terms of accumulators, we can capture all TaskMetrics values if we just send accumulator updates from the executors to the driver. This completes the parent issue SPARK-10620.

  While an effort has been made to preserve as much of the public API as possible, there were a few known breaking DeveloperApi changes that would be very awkward to maintain. I will gather the full list shortly and post it here.

  Note: This was once part of #10717. This patch is split out into its own patch from there to make it easier for others to review. Other smaller pieces have already been merged into master.

  Author: Andrew Or <andrew@databricks.com>
  Closes #10835 from andrewor14/task-metrics-use-accums.
* [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with `None` triggers cryptic failure
  Jason Lee · 2016-01-27 · 1 file changed, -1/+6

  The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.

  Author: Jason Lee <cjlee@us.ibm.com>
  Closes #8969 from jasoncl/SPARK-10847.
* [SPARK-12728][SQL] Integrates SQL generation with native view
  Cheng Lian · 2016-01-26 · 6 files changed, -95/+200

  This PR is a follow-up of PR #10541. It integrates the newly introduced SQL generation feature with native view to make native view canonical.

  In this PR, a new SQL option `spark.sql.nativeView.canonical` is added. When this option and `spark.sql.nativeView` are both `true`, Spark SQL tries to handle `CREATE VIEW` DDL statements using SQL query strings generated from view definition logical plans. If we fail to map the plan to SQL, we fall back to the original native view approach.

  One important issue this PR fixes is that now we can use CTEs when defining a view. Originally, when native view is turned on, we wrap the view definition text with an extra `SELECT`. However, the HiveQL parser doesn't allow a CTE appearing as a subquery. Namely, something like this is disallowed:

  ```sql
  SELECT n
  FROM (
    WITH w AS (SELECT 1 AS n)
    SELECT * FROM w
  ) v
  ```

  This PR fixes this issue because the extra `SELECT` is no longer needed (also, CTE expressions are inlined as subqueries during the analysis phase, thus there won't be CTE expressions in the generated SQL query string).

  Author: Cheng Lian <lian@databricks.com>
  Author: Yin Huai <yhuai@databricks.com>
  Closes #10733 from liancheng/spark-12728.integrate-sql-gen-with-native-view.
* [SPARK-12935][SQL] DataFrame API for Count-Min Sketch
  Cheng Lian · 2016-01-26 · 4 files changed, -1/+149

  This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10911 from liancheng/cms-df-api.
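  A rough usage sketch of a DataFrame-level Count-Min Sketch API like the one described above. The entry point (`df.stat.countMinSketch`) and its parameters are my assumption, not confirmed by this log; the column name and sizing are illustrative.

  ```scala
  // Hypothetical usage sketch, assuming the API is exposed via df.stat and returns the
  // CountMinSketch type from spark-sketch; names and parameters are illustrative only.
  val df = sqlContext.range(0, 100000).toDF("id")

  // Build a sketch over "id" with relative error 0.01, confidence 0.95, and a fixed seed.
  val cms = df.stat.countMinSketch("id", 0.01, 0.95, 42)

  cms.estimateCount(7L)  // approximate number of rows where id == 7, within the error bound
  ```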
* [SPARK-12854][SQL] Implement complex types support in ColumnarBatch
  Nong Li · 2016-01-26 · 15 files changed, -90/+1660

  This patch adds support for complex types in ColumnarBatch. ColumnarBatch supports structs and arrays; there is a simple mapping from the richer catalyst types to these two. Strings are treated as an array of bytes.

  ColumnarBatch will contain a column for each node of the schema. Non-complex schemas consist of just leaf nodes. Structs represent an internal node with one child for each field. Arrays are internal nodes with one child. Structs just contain nullability. Arrays contain offsets and lengths into the child array. This structure is able to handle arbitrary nesting. It has the key property that we maintain a columnar layout throughout and that primitive types are only stored in the leaf nodes and contiguous across rows. For example, if the schema is

  ```
  array<array<int>>
  ```

  there are three columns in the schema. The internal nodes each have one child. The leaf node contains all the int data stored consecutively.

  As part of this, this patch adds append APIs in addition to the put APIs (e.g. putLong(rowid, v) vs appendLong(v)). These APIs are necessary when the batch contains variable-length elements. The vectors are not fixed length and will grow as necessary. This should make the usage a lot simpler for the writer.

  Author: Nong Li <nong@databricks.com>
  Closes #10820 from nongli/spark-12854.
* [SQL] Minor Scaladoc format fix
  Cheng Lian · 2016-01-26 · 1 file changed, -4/+4

  Otherwise the `^` character is always marked as an error in IntelliJ since it represents an unclosed superscript markup tag.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10926 from liancheng/agg-doc-fix.
* [SPARK-12682][SQL] Add support for (optionally) not storing tables in hive metadata format
  Sameer Agarwal · 2016-01-26 · 2 files changed, -0/+39

  This PR adds a new table option (`skip_hive_metadata`) that'd allow the user to skip storing the table metadata in hive metadata format. While this could be useful in general, the specific use-case for this change is that Hive doesn't handle wide schemas well (see https://issues.apache.org/jira/browse/SPARK-12682 and https://issues.apache.org/jira/browse/SPARK-6024), which in turn prevents such tables from being queried in SparkSQL.

  Author: Sameer Agarwal <sameer@databricks.com>
  Closes #10826 from sameeragarwal/skip-hive-metadata.
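  A sketch of how a table-level option like this is typically passed when creating a datasource table. Whether the patch wires the option through the DataFrameWriter path exactly like this is my assumption; only the option name `skip_hive_metadata` comes from the commit message.

  ```scala
  // Hypothetical usage sketch: pass the table option when persisting a wide DataFrame.
  // The write path and value format are assumptions, not taken from the patch.
  wideDf.write
    .format("parquet")
    .option("skip_hive_metadata", "true")  // ask Spark not to store the schema in Hive metadata format
    .saveAsTable("wide_table")
  ```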
* [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
  Sean Owen · 2016-01-26 · 2 files changed, -16/+13

  Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.

  CC rxin pwendell for API change; tdas since it also touches streaming.

  Author: Sean Owen <sowen@cloudera.com>
  Closes #10413 from srowen/SPARK-3369.
* [SQL][MINOR] A few minor tweaks to CSV reader.
  Reynold Xin · 2016-01-26 · 2 files changed, -14/+9

  This pull request simply fixes a few minor coding style issues in csv, as I was reviewing the change post-hoc.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10919 from rxin/csv-minor.
* [SPARK-12879] [SQL] improve the unsafe row writing framework
  Wenchen Fan · 2016-01-25 · 7 files changed, -78/+258

  As we begin to use the unsafe row writing framework (`BufferHolder` and `UnsafeRowWriter`) in more and more places (`UnsafeProjection`, `UnsafeRowParquetRecordReader`, `GenerateColumnAccessor`, etc.), we should add more docs to it and make it easier to use.

  This PR abstracts the technique used in `UnsafeRowParquetRecordReader`: avoid unnecessary operations as much as possible. For example, do not always point the row to the buffer at the end; we only need to update the size of the row. If all fields are of primitive type, we can even skip the row size update. Then we can apply this technique to more places easily.

  A local benchmark shows `UnsafeProjection` is up to 1.7x faster after this PR:

  **old version**
  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  unsafe projection:            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  single long                        2616.04           102.61         1.00 X
  single nullable long               3032.54            88.52         0.86 X
  primitive types                    9121.05            29.43         0.29 X
  nullable primitive types          12410.60            21.63         0.21 X
  ```

  **new version**
  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  unsafe projection:            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  single long                        1533.34           175.07         1.00 X
  single nullable long               2306.73           116.37         0.66 X
  primitive types                    8403.93            31.94         0.18 X
  nullable primitive types          12448.39            21.56         0.12 X
  ```

  For a single non-nullable long (the best case), we get about a 1.7x speedup. Even when it's nullable, we still get a 1.3x speedup. For other cases, it's not such a boost, as the saved operations only take a small proportion of the whole process.

  The benchmark code is included in this PR.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10809 from cloud-fan/unsafe-projection.
* [SPARK-12975][SQL] Throwing Exception when Bucketing Columns are part of Partitioning Columns
  gatorsmile · 2016-01-25 · 3 files changed, -3/+83

  When users are using `partitionBy` and `bucketBy` at the same time, some bucketing columns might be part of the partitioning columns. For example,

  ```
  df.write
    .format(source)
    .partitionBy("i")
    .bucketBy(8, "i", "k")
    .saveAsTable("bucketed_table")
  ```

  However, in the above case, adding column `i` into `bucketBy` is useless. It just wastes extra CPU when reading or writing bucketed tables. Thus, like Hive, we can issue an exception and let users make the change.

  Also added a test case for checking that the information of `sortBy` and `bucketBy` columns is correctly saved in the metastore table.

  Could you check if my understanding is correct? cloud-fan rxin marmbrus Thanks!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10891 from gatorsmile/commonKeysInPartitionByBucketBy.
* [SPARK-12901][SQL][HOT-FIX] Fix scala 2.11 compilation.
  Yin Huai · 2016-01-25 · 2 files changed, -2/+2
* [SPARK-12902] [SQL] visualization for generated operators
  Davies Liu · 2016-01-25 · 7 files changed, -30/+98

  This PR brings back visualization for generated operators; they look like:

  ![sql](https://cloud.githubusercontent.com/assets/40902/12460920/0dc7956a-bf6b-11e5-9c3f-8389f452526e.png)
  ![stage](https://cloud.githubusercontent.com/assets/40902/12460923/11806ac4-bf6b-11e5-9c72-e84a62c5ea93.png)

  Note: SQL metrics are not supported right now because they are very slow; they will be supported once we have batch mode.

  Author: Davies Liu <davies@databricks.com>
  Closes #10828 from davies/viz_codegen.
* [SPARK-12932][JAVA API] improved error message for java type inference failure
  Andy Grove · 2016-01-25 · 1 file changed, -1/+2

  Author: Andy Grove <andygrove73@gmail.com>
  Closes #10865 from andygrove/SPARK-12932.
* [SPARK-12901][SQL] Refactor options for JSON and CSV datasource (not case class and same format).
  hyukjinkwon · 2016-01-25 · 6 files changed, -52/+40

  https://issues.apache.org/jira/browse/SPARK-12901

  This PR refactors the options in the JSON and CSV datasources. In more detail:

  1. `JSONOptions` uses the same format as `CSVOptions`.
  2. They are no longer case classes.
  3. `CSVRelation` does not have to be serializable (it was `with Serializable`, but I removed it).

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10895 from HyukjinKwon/SPARK-12901.
* [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows
  Cheng Lian · 2016-01-24 · 1 file changed, -1/+8

  When the actual row length doesn't conform to the specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.

  Author: Cheng Lian <lian@databricks.com>
  Closes #10886 from liancheng/spark-12624.
* [SPARK-12971] Fix Hive tests which fail in Hadoop-2.3 SBT build
  Josh Rosen · 2016-01-24 · 2 files changed, -4/+22

  ErrorPositionSuite and one of the HiveComparisonTest tests have been consistently failing on the Hadoop 2.3 SBT build (but on no other builds). I believe that this is due to test isolation issues (e.g. tests sharing state via the sets of temporary tables that are registered to TestHive). This patch attempts to improve the isolation of these tests in order to address this issue.

  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #10884 from JoshRosen/fix-failing-hadoop-2.3-hive-tests.
* [SPARK-12904][SQL] Strength reduction for integral and decimal literal comparisons
  Reynold Xin · 2016-01-23 · 6 files changed, -139/+376

  This pull request implements strength reduction for comparing integral expressions and decimal literals, which is more common now because we switched to parsing fractional literals as decimal types (rather than doubles). I added the rules to the existing DecimalPrecision rule with some refactoring to simplify the control flow. I also moved the DecimalPrecision rule into its own file due to its growing size.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10882 from rxin/SPARK-12904-1.
* [SPARK-12872][SQL] Support to specify the option for compression codec for JSON datasource
  hyukjinkwon · 2016-01-22 · 5 files changed, -29/+96

  https://issues.apache.org/jira/browse/SPARK-12872

  This PR lets the JSON datasource compress its output via an option instead of manually setting Hadoop configurations. For resolving codecs by name, it is similar to https://github.com/apache/spark/pull/10805. As `CSVCompressionCodecs` can be shared with other datasources, it became a separate class, shared as `CompressionCodecs`.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10858 from HyukjinKwon/SPARK-12872.
* [SPARK-12959][SQL] Writing Bucketed Data with Disabled Bucketing in SQLConf
  gatorsmile · 2016-01-22 · 3 files changed, -6/+26

  When users turn off bucketing in SQLConf, we should issue messages telling users that these operations will be converted to the normal (non-bucketed) write path. Also added a test case for this scenario and fixed the helper function.

  Do you think this PR is helpful when using bucketed tables? cloud-fan Thank you!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10870 from gatorsmile/bucketTableWritingTestcases.
* [SPARK-12747][SQL] Use correct type name for Postgres JDBC's real array
  Liang-Chi Hsieh · 2016-01-21 · 2 files changed, -0/+4

  https://issues.apache.org/jira/browse/SPARK-12747

  The Postgres JDBC driver uses "FLOAT4" or "FLOAT8", not "real".

  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #10695 from viirya/fix-postgres-jdbc.
* [SPARK-8968] [SQL] [HOT-FIX] Fix scala 2.11 build.
  Yin Huai · 2016-01-20 · 1 file changed, -1/+1
* [SPARK-8968][SQL] external sort by the partition columns when dynamic partitioning to optimize the memory overhead
  wangfei · 2016-01-20 · 2 files changed, -99/+166

  Currently the hash-based writer for dynamic partitioning shows bad performance for big data and causes many small files and high GC. With this patch we do an external sort first, so that each time we only need to keep one writer open.

  before this patch:
  ![gc](https://cloud.githubusercontent.com/assets/7018048/9149788/edc48c6e-3dec-11e5-828c-9995b56e4d65.PNG)

  after this patch:
  ![gc-optimize-externalsort](https://cloud.githubusercontent.com/assets/7018048/9149794/60f80c9c-3ded-11e5-8a56-7ae18ddc7a2f.png)

  Author: wangfei <wangfei_hello@126.com>
  Author: scwf <wangfei1@huawei.com>
  Closes #7336 from scwf/dynamic-optimize-basedon-apachespark.
* [SPARK-12797] [SQL] Generated TungstenAggregate (without grouping keys)
  Davies Liu · 2016-01-20 · 5 files changed, -12/+111

  As discussed in #10786, the generated TungstenAggregate does not support imperative functions.

  For a query

  ```
  sqlContext.range(10).filter("id > 1").groupBy().count()
  ```

  the generated code will look like:

  ```
  /* 032 */   if (!initAgg0) {
  /* 033 */     initAgg0 = true;
  /* 034 */
  /* 035 */     // initialize aggregation buffer
  /* 037 */     long bufValue2 = 0L;
  /* 038 */
  /* 039 */
  /* 040 */     // initialize Range
  /* 041 */     if (!range_initRange5) {
  /* 042 */       range_initRange5 = true;
         ...
  /* 071 */     }
  /* 072 */
  /* 073 */     while (!range_overflow8 && range_number7 < range_partitionEnd6) {
  /* 074 */       long range_value9 = range_number7;
  /* 075 */       range_number7 += 1L;
  /* 076 */       if (range_number7 < range_value9 ^ 1L < 0) {
  /* 077 */         range_overflow8 = true;
  /* 078 */       }
  /* 079 */
  /* 085 */       boolean primitive11 = false;
  /* 086 */       primitive11 = range_value9 > 1L;
  /* 087 */       if (!false && primitive11) {
  /* 092 */         // do aggregate and update aggregation buffer
  /* 099 */         long primitive17 = -1L;
  /* 100 */         primitive17 = bufValue2 + 1L;
  /* 101 */         bufValue2 = primitive17;
  /* 105 */       }
  /* 107 */     }
  /* 109 */
  /* 110 */     // output the result
  /* 112 */     bufferHolder25.reset();
  /* 114 */     rowWriter26.initialize(bufferHolder25, 1);
  /* 118 */     rowWriter26.write(0, bufValue2);
  /* 120 */     result24.pointTo(bufferHolder25.buffer, bufferHolder25.totalSize());
  /* 121 */     currentRow = result24;
  /* 122 */     return;
  /* 124 */   }
  /* 125 */
  ```

  cc nongli

  Author: Davies Liu <davies@databricks.com>
  Closes #10840 from davies/gen_agg.
* [SPARK-12848][SQL] Change parsed decimal literal datatype from Double to Decimal
  Herman van Hovell · 2016-01-20 · 27 files changed, -58/+82

  The current parser turns a decimal literal, for example ```12.1```, into a Double. The problem with this approach is that we convert an exact literal into a non-exact ```Double```. The PR changes this behavior: a decimal literal is now converted into an exact ```BigDecimal```.

  The behavior for scientific decimals, for example ```12.1e01```, is unchanged. This will be converted into a Double.

  This PR replaces the ```BigDecimal``` literal by a ```Double``` literal, because the ```BigDecimal``` is the default now. You can use the double literal by appending a 'D' to the value, for instance: ```3.141527D```

  cc davies rxin

  Author: Herman van Hovell <hvanhovell@questtec.nl>
  Closes #10796 from hvanhovell/SPARK-12848.
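  To make the new literal typing concrete, a small illustrative sketch based on the three cases the message describes (it assumes a `SQLContext` named `sqlContext` and is not taken from the patch):

  ```scala
  // Illustrative sketch of the literal typing after this change.
  sqlContext.sql("SELECT 12.1").schema       // plain decimal literal  -> DecimalType (exact)
  sqlContext.sql("SELECT 12.1e01").schema    // scientific notation    -> DoubleType (unchanged)
  sqlContext.sql("SELECT 3.141527D").schema  // explicit 'D' suffix    -> DoubleType
  ```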
* [SPARK-12888][SQL] benchmark the new hash expression
  Wenchen Fan · 2016-01-20 · 1 file changed, -0/+104

  Benchmark it on 4 different schemas; the results:

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For simple:              Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                  31.47           266.54         1.00 X
  codegen version                      64.52           130.01         0.49 X
  ```

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For normal:              Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                4068.11             0.26         1.00 X
  codegen version                    1175.92             0.89         3.46 X
  ```

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For array:               Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version                9276.70             0.06         1.00 X
  codegen version                   14762.23             0.04         0.63 X
  ```

  ```
  Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
  Hash For map:                 Avg Time(ms)    Avg Rate(M/s)  Relative Rate
  -------------------------------------------------------------------------------
  interpreted version               58869.79             0.01         1.00 X
  codegen version                    9285.36             0.06         6.34 X
  ```

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10816 from cloud-fan/hash-benchmark.
* [SPARK-12616][SQL] Making Logical Operator `Union` Support Arbitrary Number of Children
  gatorsmile · 2016-01-20 · 20 files changed, -122/+322

  The existing `Union` logical operator only supports two children. Thus, this adds a new logical operator `Unions` which can have an arbitrary number of children to replace the existing one.

  `Union` logical plan is a binary node. However, a typical use case for union is to union a very large number of input sources (DataFrames, RDDs, or files). It is not uncommon to union hundreds of thousands of files. In this case, our optimizer can become very slow due to the large number of logical unions. We should change the Union logical plan to support an arbitrary number of children, and add a single rule in the optimizer to collapse all adjacent `Unions` into a single `Unions` (see the sketch below). Note that this problem doesn't exist in the physical plan, because the physical `Unions` already supports an arbitrary number of children.

  Author: gatorsmile <gatorsmile@gmail.com>
  Author: xiaoli <lixiao1983@gmail.com>
  Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
  Closes #10577 from gatorsmile/unionAllMultiChildren.
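  The collapse rule referenced above can be sketched roughly as follows, using a toy plan model rather than the actual Catalyst classes or the real optimizer rule:

  ```scala
  // A minimal sketch of collapsing adjacent unions into one n-ary node.
  // Plan/Union/Leaf here are toy stand-ins, not Catalyst classes.
  sealed trait Plan
  case class Leaf(name: String) extends Plan
  case class Union(children: Seq[Plan]) extends Plan

  def flattenUnions(plan: Plan): Plan = plan match {
    case Union(children) =>
      // First flatten each child, then splice nested unions into the parent's child list.
      val flat = children.map(flattenUnions).flatMap {
        case Union(grandChildren) => grandChildren
        case other                => Seq(other)
      }
      Union(flat)
    case leaf => leaf
  }

  // Union(Union(a, b), Union(c, d)) becomes a single Union(a, b, c, d).
  val nested    = Union(Seq(Union(Seq(Leaf("a"), Leaf("b"))), Union(Seq(Leaf("c"), Leaf("d")))))
  val collapsed = flattenUnions(nested) // Union(Seq(Leaf("a"), Leaf("b"), Leaf("c"), Leaf("d")))
  ```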
* [SPARK-12898] Consider having dummyCallSite for HiveTableScan
  Rajesh Balamohan · 2016-01-20 · 1 file changed, -3/+10

  Currently, HiveTableScan runs with getCallSite, which is really expensive and shows up when scanning through a large table with partitions (e.g. TPC-DS), slowing down the overall runtime of the job. It would be good to consider having dummyCallSite in HiveTableScan.

  Author: Rajesh Balamohan <rbalamohan@apache.org>
  Closes #10825 from rajeshbalamohan/SPARK-12898.
* [SPARK-12925][SQL] Improve HiveInspectors.unwrap for StringObjectIns…
  Rajesh Balamohan · 2016-01-20 · 1 file changed, -1/+3

  Text is in UTF-8, and converting it via "UTF8String.fromString" incurs decoding and encoding, which turns out to be expensive and redundant. Profiler snapshot details are attached in the JIRA (ref: https://issues.apache.org/jira/secure/attachment/12783331/SPARK-12925_profiler_cpu_samples.png)

  Author: Rajesh Balamohan <rbalamohan@apache.org>
  Closes #10848 from rajeshbalamohan/SPARK-12925.
* [SPARK-12881] [SQL] subexpress elimination in mutable projection
  Davies Liu · 2016-01-20 · 11 files changed, -27/+80

  Author: Davies Liu <davies@databricks.com>
  Closes #10814 from davies/mutable_subexpr.
* [SPARK-12912][SQL] Add a test suite for EliminateSubQueries
  Reynold Xin · 2016-01-20 · 4 files changed, -26/+103

  Also updated documentation to explain why ComputeCurrentTime and EliminateSubQueries are in the optimizer rather than the analyzer.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10837 from rxin/optimizer-analyzer-comment.
* [SPARK-12871][SQL] Support to specify the option for compression codec.
  hyukjinkwon · 2016-01-19 · 3 files changed, -2/+70

  https://issues.apache.org/jira/browse/SPARK-12871

  This PR adds an option to specify a compression codec. It adds the option `codec` as an alias for `compression`, as filed in [SPARK-12668](https://issues.apache.org/jira/browse/SPARK-12668). Note that I did not add configurations for Hadoop 1.x, as this `CsvRelation` is using the Hadoop 2.x API and I guess it is going to drop Hadoop 1.x support.

  Author: hyukjinkwon <gurwls223@gmail.com>
  Closes #10805 from HyukjinKwon/SPARK-12420.
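  A rough usage sketch of a write-side compression option like the one described here. The option names `compression`/`codec` come from the commit message; the accepted codec values, the write path, and the paths below are my assumptions for illustration.

  ```scala
  // Hypothetical usage sketch: compress datasource output via a write option instead of
  // setting Hadoop configurations by hand. The codec value "gzip" is an assumed example.
  df.write
    .format("csv")
    .option("compression", "gzip")   // "codec" is described as an alias for this option
    .save("/tmp/output/csv-gzip")
  ```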
* [SPARK-12770][SQL] Implement rules for branch elimination for CaseWhen
  Reynold Xin · 2016-01-19 · 2 files changed, -0/+55

  The three optimization cases are (sketched below):

  1. If the first branch's condition is a true literal, remove the CaseWhen and use the value from that branch.
  2. If a branch's condition is a false or null literal, remove that branch.
  3. If only the else branch is left, remove the CaseWhen and use the value from the else branch.

  Author: Reynold Xin <rxin@databricks.com>
  Closes #10827 from rxin/SPARK-12770.
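  A minimal sketch of the three eliminations, using a toy expression model rather than the actual Catalyst classes or the real optimizer rule:

  ```scala
  // Toy model of CASE WHEN simplification; Expr/CaseWhen here are stand-ins, not Catalyst types.
  sealed trait Expr
  case object TrueLit  extends Expr
  case object FalseLit extends Expr
  case object NullLit  extends Expr
  case class  Value(name: String) extends Expr
  case class  CaseWhen(branches: Seq[(Expr, Expr)], elseValue: Option[Expr]) extends Expr

  def simplifyCaseWhen(cw: CaseWhen): Expr = {
    // Rule 2: drop branches whose condition is a false or null literal.
    val kept = cw.branches.filterNot { case (cond, _) => cond == FalseLit || cond == NullLit }
    kept match {
      // Rule 1: first remaining condition is a true literal -> use that branch's value.
      case (TrueLit, value) +: _ => value
      // Rule 3: no branches left -> only the else branch remains, so use its value (or null).
      case Seq()                 => cw.elseValue.getOrElse(NullLit)
      case _                     => CaseWhen(kept, cw.elseValue)
    }
  }

  // CASE WHEN false THEN a WHEN true THEN b ELSE c END  simplifies to  b
  simplifyCaseWhen(CaseWhen(Seq(FalseLit -> Value("a"), TrueLit -> Value("b")), Some(Value("c"))))
  ```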
* [SPARK-12816][SQL] De-alias type when generating schemas
  Jakob Odersky · 2016-01-19 · 2 files changed, -1/+12

  Call `dealias` on local types to fix schema generation for abstract type members, such as

  ```scala
  type KeyValue = (Int, String)
  ```

  Add simple test

  Author: Jakob Odersky <jodersky@gmail.com>
  Closes #10749 from jodersky/aliased-schema.
* [SPARK-12560][SQL] SqlTestUtils.stripSparkFilter needs to copy utf8strings
  Imran Rashid · 2016-01-19 · 1 file changed, -1/+1

  See https://issues.apache.org/jira/browse/SPARK-12560

  This isn't causing any problems currently because the tests for string predicate pushdown are currently disabled. I ran into this while trying to turn them back on with a different version of parquet. Figured it was good to fix now in any case.

  Author: Imran Rashid <irashid@cloudera.com>
  Closes #10510 from squito/SPARK-12560.
* [SPARK-12867][SQL] Nullability of Intersect can be stricter
  gatorsmile · 2016-01-19 · 2 files changed, -6/+33

  JIRA: https://issues.apache.org/jira/browse/SPARK-12867

  When intersecting one nullable column with one non-nullable column, the result will not contain any null. Thus, we can make nullability of `intersect` stricter.

  liancheng Could you please check if the code changes are appropriate? Also added test cases to verify the results. Thanks!

  Author: gatorsmile <gatorsmile@gmail.com>
  Closes #10812 from gatorsmile/nullabilityIntersect.
* [SPARK-12887] Do not expose var's in TaskMetrics
  Andrew Or · 2016-01-19 · 2 files changed, -3/+1

  This is a step in implementing SPARK-10620, which migrates TaskMetrics to accumulators.

  TaskMetrics has a bunch of var's, some are fully public, some are `private[spark]`. This is bad coding style that makes it easy to accidentally overwrite previously set metrics. This has happened a few times in the past and caused bugs that were difficult to debug.

  Instead, we should have get-or-create semantics, which are more readily understandable. This makes sense in the case of TaskMetrics because these are just aggregated metrics that we want to collect throughout the task, so it doesn't matter who's incrementing them.

  Parent PR: #10717

  Author: Andrew Or <andrew@databricks.com>
  Author: Josh Rosen <joshrosen@databricks.com>
  Author: andrewor14 <andrew@databricks.com>
  Closes #10815 from andrewor14/get-or-create-metrics.
* [SPARK-12870][SQL] better format bucket id in file name
  Wenchen Fan · 2016-01-19 · 4 files changed, -7/+13

  For a normal parquet file without buckets, its file name ends with a jobUUID which may be all digits and mistakenly regarded as a bucket id. This PR improves the format of the bucket id in the file name by using a different separator, `_`, so that the regex is more robust.

  Author: Wenchen Fan <wenchen@databricks.com>
  Closes #10799 from cloud-fan/fix-bucket.