path: root/sql/core/src/main
Commit message | Author | Date | Files | Lines
* [SPARK-16879][SQL] unify logical plans for CREATE TABLE and CTAS | Wenchen Fan | 2016-08-05 | 9 | -250/+273

## What changes were proposed in this pull request?
We have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.

## How was this patch tested?
Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14482 from cloud-fan/table.

* [SPARK-16907][SQL] Fix performance regression for parquet table when vectorized parquet record reader is not being used | Sean Zhong | 2016-08-05 | 1 | -1/+7

## What changes were proposed in this pull request?
For a non-partitioned parquet table, if the vectorized parquet record reader is not being used, Spark 2.0 adds an extra unnecessary memory copy to append partition values for each row.

There are several typical cases where the vectorized parquet record reader is not being used:
1. When the table schema is not flat, e.g. it contains nested fields.
2. When `spark.sql.parquet.enableVectorizedReader = false`.

By fixing this bug, we get about 20% - 30% performance gain in a test case like this:

```
// Generates parquet table with nested columns
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

def time[R](block: => R): Long = {
    val t0 = System.nanoTime()
    val result = block    // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
    (t1 - t0)/1000000
}

val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum/20
```

## How was this patch tested?
After a few warm-up runs, we get a 26% performance improvement.

Before fix:
```
Average: 4584ms, raw data (10 tries): 4726ms 4509ms 4454ms 4879ms 4586ms 4733ms 4500ms 4361ms 4456ms 4640ms
```

After fix:
```
Average: 3614ms, raw data (10 tries): 3554ms 3740ms 4019ms 3439ms 3460ms 3664ms 3557ms 3584ms 3612ms 3531ms
```

Test env: Intel(R) Core(TM) i7-6700 CPU 3.40GHz, Intel SSD SC2KW24

Author: Sean Zhong <seanzhong@databricks.com>
Closes #14445 from clockfly/fix_parquet_regression_2.

* [SPARK-16875][SQL] Add args checking for DataSet randomSplit and sample | Zheng RuiFeng | 2016-08-04 | 1 | -2/+12

## What changes were proposed in this pull request?
Add the missing args-checking for randomSplit and sample.

## How was this patch tested?
Unit tests.

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Closes #14478 from zhengruifeng/fix_randomSplit.

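A minimal sketch of the kind of argument validation this change adds; the helper names and error messages below are illustrative, not the exact code in `Dataset.scala`:

```scala
// Hypothetical illustration of args checking for randomSplit/sample.
object SplitChecks {
  def checkSplitWeights(weights: Array[Double]): Unit = {
    require(weights.forall(_ >= 0),
      s"Weights must be nonnegative, but got ${weights.mkString("[", ",", "]")}")
    require(weights.sum > 0,
      s"Sum of weights must be positive, but got ${weights.mkString("[", ",", "]")}")
  }

  def checkSampleFraction(fraction: Double): Unit = {
    require(fraction >= 0,
      s"Fraction must be nonnegative, but got $fraction")
  }
}
```
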
* [SPARK-16884] Move DataSourceScanExec out of ExistingRDD.scala file | Eric Liang | 2016-08-04 | 3 | -504/+525

## What changes were proposed in this pull request?
This moves DataSourceScanExec out so it's more discoverable, now that it doesn't necessarily depend on an existing RDD. cc davies

## How was this patch tested?
Existing tests.

Author: Eric Liang <ekl@databricks.com>
Closes #14487 from ericl/split-scan.

* [SPARK-16802][SQL] fix overflow in LongToUnsafeRowMap | Davies Liu | 2016-08-04 | 1 | -6/+10

## What changes were proposed in this pull request?
This patch fixes the overflow in LongToUnsafeRowMap when the range of keys is very wide (the key is much smaller than minKey, for example, key is Long.MinValue and minKey is > 0).

## How was this patch tested?
Added regression test (also for SPARK-16740).

Author: Davies Liu <davies@databricks.com>
Closes #14464 from davies/fix_overflow.

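To illustrate the failure mode described above (a sketch only, not the actual LongToUnsafeRowMap code): subtracting a large positive minKey from a very small key wraps around in Long arithmetic, so the bounds have to be checked in a way that cannot overflow.

```scala
// Sketch: demonstrates the wrap-around and one overflow-safe guard.
object OverflowDemo {
  def main(args: Array[String]): Unit = {
    val minKey = 1L
    val key = Long.MinValue

    // Naive index computation overflows: Long.MinValue - 1 wraps to Long.MaxValue.
    val naiveIndex = key - minKey
    println(naiveIndex) // 9223372036854775807, a bogus "valid-looking" index

    // Overflow-safe: reject keys outside [minKey, maxKey] before subtracting.
    val maxKey = 1000L
    val safe = if (key >= minKey && key <= maxKey) Some(key - minKey) else None
    println(safe) // None
  }
}
```
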
* [SPARK-16853][SQL] fixes encoder error in DataSet typed select | Sean Zhong | 2016-08-04 | 1 | -9/+11

## What changes were proposed in this pull request?
For DataSet typed select:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.

### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```

### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
```

## How was this patch tested?
Unit test.

Author: Sean Zhong <seanzhong@databricks.com>
Closes #14474 from clockfly/SPARK-16853.

* [MINOR][SQL] Fix minor formatting issue of SortAggregateExec.toString | Cheng Lian | 2016-08-04 | 1 | -3/+3

## What changes were proposed in this pull request?
This PR fixes a minor formatting issue (missing space after comma) of `SortAggregateExec.toString`.

Before:
```
SortAggregate(key=[a#76,b#77], functions=[max(c#78),min(c#78)], output=[a#76,b#77,max(c)#89,min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76,b#77], functions=[partial_max(c#78),partial_min(c#78)], output=[a#76,b#77,max#99,min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```

After:
```
SortAggregate(key=[a#76, b#77], functions=[max(c#78), min(c#78)], output=[a#76, b#77, max(c)#89, min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76, b#77], functions=[partial_max(c#78), partial_min(c#78)], output=[a#76, b#77, max#99, min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```

## How was this patch tested?
Manually tested.

Author: Cheng Lian <lian@databricks.com>
Closes #14480 from liancheng/fix-sort-based-agg-string-format.

* [SPARK-14204][SQL] register driverClass rather than user-specified class | Kevin McHale | 2016-08-03 | 1 | -1/+1

This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well. srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.

This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <kevin@premise.com>
Closes #14420 from mchalek/mchalek-jdbc_driver_registration.

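A hypothetical sketch of the idea behind the fix, not Spark's actual code (which goes through its own internal driver registration utility): resolve the driver class name from the JDBC URL when the user did not specify one, then always force-load that class on the executor so it registers itself before connections are opened.

```scala
import java.sql.DriverManager

// Illustrative helper only; names and structure are assumptions.
object JdbcDriverHelper {
  // On the driver side: derive the class name from the URL when none was given.
  def resolveDriverClass(url: String, userSpecified: Option[String]): String =
    userSpecified.getOrElse(DriverManager.getDriver(url).getClass.getCanonicalName)

  // On the executor side: always load the class so it registers with DriverManager.
  def registerOnExecutor(driverClass: String): Unit = {
    Class.forName(driverClass)
  }
}
```
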
* [SPARK-16596][SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time | Eric Liang | 2016-08-03 | 3 | -275/+341

## What changes were proposed in this pull request?
Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will still scan all partitions at planning time anyway, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.

This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.

TODO: In another PR, move DataSourceScanExec to its own file.

## How was this patch tested?
Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).

Author: Eric Liang <ekl@databricks.com>
Closes #14241 from ericl/refactor.

* [SQL][MINOR] use stricter type parameter to make it clear that parquet reader returns UnsafeRow | Wenchen Fan | 2016-08-03 | 3 | -10/+10

## What changes were proposed in this pull request?
A small code style change; it's better to make the type parameter more accurate.

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14458 from cloud-fan/parquet.

* [SPARK-16731][SQL] use StructType in CatalogTable and remove CatalogColumn | Wenchen Fan | 2016-07-31 | 6 | -61/+33

## What changes were proposed in this pull request?
`StructField` has very similar semantics to `CatalogColumn`, except that `CatalogColumn` uses a string to express the data type. I think it's reasonable to use `StructType` as the `CatalogTable.schema` and remove `CatalogColumn`.

## How was this patch tested?
Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14363 from cloud-fan/column.

* [SPARK-16818] Exchange reuse incorrectly reuses scans over different sets of partitions | Eric Liang | 2016-07-30 | 1 | -0/+2

## What changes were proposed in this pull request?
This fixes a bug where the file scan operator does not take into account partition pruning in its implementation of `sameResult()`. As a result, executions may be incorrect on self-joins over the same base file relation.

The patch here is minimal, but we should reconsider relying on `metadata` for implementing sameResult() in the future, as string representations may not be uniquely identifying. cc rxin

## How was this patch tested?
Unit tests.

Author: Eric Liang <ekl@databricks.com>
Closes #14425 from ericl/spark-16818.

* [SPARK-16664][SQL] Fix persist call on Data frames with more than 200… | Wesley Tang | 2016-07-29 | 1 | -2/+2

## What changes were proposed in this pull request?
f12f11e578169b47e3f8b18b299948c0670ba585 introduced this bug, missed foreach as map.

## How was this patch tested?
Test added.

Author: Wesley Tang <tangmingjun@mininglamp.com>
Closes #14324 from breakdawn/master.

* [SPARK-16764][SQL] Recommend disabling vectorized parquet reader on OutOfMemoryError | Sameer Agarwal | 2016-07-28 | 1 | -5/+19

## What changes were proposed in this pull request?
We currently don't bound or manage the data array size used by column vectors in the vectorized reader (they're just bound by INT.MAX), which may lead to OOMs while reading data. As a short term fix, this patch intercepts the OutOfMemoryError exception and suggests that the user disable the vectorized parquet reader.

## How was this patch tested?
Existing tests.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes #14387 from sameeragarwal/oom.

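A hedged sketch of the intercept-and-advise pattern described above; the helper name and exact message are illustrative, not the code in the column vector classes:

```scala
// Illustrative only: catch an allocation failure and point the user at the
// config that disables the vectorized reader.
object ReserveWithHint {
  def reserve(requiredCapacity: Int)(doReserve: Int => Unit): Unit = {
    try {
      doReserve(requiredCapacity)
    } catch {
      case e: OutOfMemoryError =>
        throw new RuntimeException(
          s"Failed to allocate $requiredCapacity elements for the vectorized reader; " +
            "consider disabling it via spark.sql.parquet.enableVectorizedReader=false", e)
    }
  }
}
```
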
* [SPARK-16740][SQL] Fix Long overflow in LongToUnsafeRowMap | Sylvain Zimmer | 2016-07-28 | 1 | -1/+2

## What changes were proposed in this pull request?
Avoid overflow of Long type causing a NegativeArraySizeException a few lines later.

## How was this patch tested?
Unit tests for HashedRelationSuite still pass.

I can confirm the python script I included in https://issues.apache.org/jira/browse/SPARK-16740 works fine with this patch. Unfortunately I don't have the knowledge/time to write a Scala test case for HashedRelationSuite right now. As the patch is pretty obvious I hope it can be included without this. Thanks!

Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>
Closes #14373 from sylvinus/master.

* [SPARK-16552][SQL] Store the Inferred Schemas into External Catalog Tables when Creating Tables | gatorsmile | 2016-07-28 | 5 | -70/+80

#### What changes were proposed in this pull request?
Currently, in Spark SQL, the initial creation of schema can be classified into two groups. It is applicable to both Hive tables and Data Source tables:

**Group A. Users specify the schema.**

_Case 1 CREATE TABLE AS SELECT_: the schema is determined by the result schema of the SELECT clause. For example,
```SQL
CREATE TABLE tab STORED AS TEXTFILE
AS SELECT * from input
```

_Case 2 CREATE TABLE_: users explicitly specify the schema. For example,
```SQL
CREATE TABLE jsonTable (_1 string, _2 string)
USING org.apache.spark.sql.json
```

**Group B. Spark SQL infers the schema at runtime.**

_Case 3 CREATE TABLE_. Users do not specify the schema but the path to the file location. For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (path '${tempDir.getCanonicalPath}')
```

Before this PR, Spark SQL does not store the inferred schema in the external catalog for the cases in Group B. When users refresh the metadata cache, or access the table for the first time after (re-)starting Spark, Spark SQL will infer the schema and store the info in the metadata cache for improving the performance of subsequent metadata requests. However, the runtime schema inference could cause undesirable schema changes after each reboot of Spark.

This PR is to store the inferred schema in the external catalog when creating the table. When users intend to refresh the schema after possible changes on external files (table location), they issue `REFRESH TABLE`. Spark SQL will infer the schema again based on the previously specified table location and update/refresh the schema in the external catalog and metadata cache.

In this PR, we do not use the inferred schema to replace the user specified schema, to avoid external behavior changes. Based on the design, user-specified schemas (as described in Group A) can be changed by ALTER TABLE commands, although we do not support them now.

#### How was this patch tested?
TODO: add more cases to cover the changes.

Author: gatorsmile <gatorsmile@gmail.com>
Closes #14207 from gatorsmile/userSpecifiedSchema.

* [MINOR][DOC][SQL] Fix two documents regarding size in bytes | Liang-Chi Hsieh | 2016-07-27 | 1 | -5/+7

## What changes were proposed in this pull request?
Fix two places in the SQLConf documents regarding size in bytes and statistics.

## How was this patch tested?
No. Just a document change.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Closes #14341 from viirya/fix-doc-size-in-bytes.

* [SPARK-16621][SQL] Generate stable SQLs in SQLBuilder | Dongjoon Hyun | 2016-07-27 | 1 | -5/+18

## What changes were proposed in this pull request?
Currently, the generated SQLs have non-stable IDs for generated attributes. Stable generated SQL will give more benefit for understanding or testing the queries. This PR provides stable SQL generation by the following:

- Provide unique ids for generated subqueries, `gen_subquery_xxx`.
- Provide unique and stable ids for generated attributes, `gen_attr_xxx`.

**Before**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res0: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_4` AS `1` FROM (SELECT 1 AS `gen_attr_4`) AS gen_subquery_0
```

**After**
```scala
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res1: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
scala> new org.apache.spark.sql.catalyst.SQLBuilder(sql("select 1")).toSQL
res2: String = SELECT `gen_attr_0` AS `1` FROM (SELECT 1 AS `gen_attr_0`) AS gen_subquery_0
```

## How was this patch tested?
Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14257 from dongjoon-hyun/SPARK-16621.

* [SPARK-16524][SQL] Add RowBatch and RowBasedHashMapGenerator | Qifan Pu | 2016-07-26 | 3 | -126/+390

## What changes were proposed in this pull request?
This PR is the first step for the following feature:

For hash aggregation in Spark SQL, we use a fast aggregation hashmap to act as a "cache" in order to boost aggregation performance. Previously, the hashmap was backed by a `ColumnarBatch`. This has performance issues when we have a wide schema for the aggregation table (large number of key fields or value fields). In this JIRA, we support another implementation of the fast hashmap, which is backed by a `RowBasedKeyValueBatch`. We then automatically pick between the two implementations based on certain knobs.

In this first-step PR, implementations for `RowBasedKeyValueBatch` and `RowBasedHashMapGenerator` are added.

## How was this patch tested?
Unit tests: `RowBasedKeyValueBatchSuite`

Author: Qifan Pu <qifan.pu@gmail.com>
Closes #14349 from ooq/SPARK-16524.

* [SPARK-16663][SQL] desc table should be consistent between data source and hive serde tables | Wenchen Fan | 2016-07-26 | 1 | -6/+6

## What changes were proposed in this pull request?
Currently there are two inconsistencies:

1. For data source tables, we only print partition names; for hive tables, we also print the partition schema. After this PR, we will always print the schema.
2. If a column doesn't have a comment, data source tables will print an empty string while hive tables will print null. After this PR, we will always print null.

## How was this patch tested?
New test in `HiveDDLSuite`.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14302 from cloud-fan/minor3.

* [SPARK-16675][SQL] Avoid per-record type dispatch in JDBC when writing | hyukjinkwon | 2016-07-26 | 2 | -36/+88

## What changes were proposed in this pull request?
Currently, `JdbcUtils.savePartition` is doing type-based dispatch for each row to write appropriate values. Instead, appropriate setters for `PreparedStatement` can be created first according to the schema, and then applied to each row. This approach is similar to `CatalystWriteSupport`.

This PR simply creates the setters up front to avoid the per-record dispatch.

## How was this patch tested?
Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14323 from HyukjinKwon/SPARK-16675.

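A minimal sketch of the setter-per-column approach described here, assuming a hypothetical helper rather than the actual code in `JdbcUtils`:

```scala
import java.sql.PreparedStatement

import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

// Illustrative only: build one setter per column from the schema, then reuse
// them for every row instead of matching on the data type per record.
object JdbcSetters {
  type JDBCValueSetter = (PreparedStatement, Row, Int) => Unit

  def makeSetter(dataType: DataType): JDBCValueSetter = dataType match {
    case IntegerType => (stmt, row, i) => stmt.setInt(i + 1, row.getInt(i))
    case LongType    => (stmt, row, i) => stmt.setLong(i + 1, row.getLong(i))
    case DoubleType  => (stmt, row, i) => stmt.setDouble(i + 1, row.getDouble(i))
    case StringType  => (stmt, row, i) => stmt.setString(i + 1, row.getString(i))
    case _           => (stmt, row, i) => stmt.setObject(i + 1, row.get(i))
  }

  // Computed once per partition, applied to every row.
  def makeSetters(schema: StructType): Array[JDBCValueSetter] =
    schema.fields.map(f => makeSetter(f.dataType))
}
```
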
* [SPARK-16633][SPARK-16642][SPARK-16721][SQL] Fixes three issues related to lead and lag functions | Yin Huai | 2016-07-25 | 1 | -7/+27

## What changes were proposed in this pull request?
This PR contains three changes.

First, this PR changes the behavior of lead/lag back to Spark 1.6's behavior, which is described as below:
1. lead/lag respect null input values, which means that if the offset row exists and the input value is null, the result will be null instead of the default value.
2. If the offset row does not exist, the default value will be used.
3. OffsetWindowFunction's nullable setting also considers the nullability of its input (because of the first change).

Second, this PR fixes the evaluation of lead/lag when the input expression is a literal. This fix is a result of the first change. In current master, if a literal is used as the input expression of a lead or lag function, the result will be this literal even if the offset row does not exist.

Third, this PR makes ResolveWindowFrame not fire if a window function is not resolved.

## How was this patch tested?
New tests in SQLWindowFunctionSuite.

Author: Yin Huai <yhuai@databricks.com>
Closes #14284 from yhuai/lead-lag.

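A small sketch of the restored 1.6 semantics described above; the data, names, and inline result comments are assumptions used only for illustration:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lead

object LeadLagDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("lead-lag").getOrCreate()
    import spark.implicits._

    val df = Seq((1, Some("a")), (2, None), (3, Some("c"))).toDF("id", "v")
    val w = Window.orderBy($"id")

    // Row id=1: the offset row (id=2) exists but v is null -> result is null, not "default".
    // Row id=3: the offset row does not exist -> the default value "default" is used.
    df.select($"id", lead($"v", 1, "default").over(w).as("lead_v")).show()
  }
}
```
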
* [SPARK-16672][SQL] SQLBuilder should not raise exceptions on EXISTS queries | Dongjoon Hyun | 2016-07-25 | 1 | -2/+7

## What changes were proposed in this pull request?
Currently, `SQLBuilder` raises `empty.reduceLeft` exceptions on *unoptimized* `EXISTS` queries. We had better prevent this.

```scala
scala> sql("CREATE TABLE t1(a int)")
scala> val df = sql("select * from t1 b where exists (select * from t1 a)")
scala> new org.apache.spark.sql.catalyst.SQLBuilder(df).toSQL
java.lang.UnsupportedOperationException: empty.reduceLeft
```

## How was this patch tested?
Pass the Jenkins tests with a new test suite.

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14307 from dongjoon-hyun/SPARK-16672.

* [SPARK-16678][SPARK-16677][SQL] Fix two View-related bugs | gatorsmile | 2016-07-26 | 2 | -15/+23

## What changes were proposed in this pull request?
**Issue 1: Disallow Creating/Altering a View when the same-name Table Exists (without IF NOT EXISTS)**

When we create or alter a view, we check whether the view already exists. In the current implementation, if a table with the same name exists, we treat it as a view. However, this is not the right behavior. We should follow what Hive does. For example,
```
hive> CREATE TABLE tab1 (id int);
OK
Time taken: 0.196 seconds
hive> CREATE OR REPLACE VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> ALTER VIEW tab1 AS SELECT * FROM t1;
FAILED: SemanticException [Error 10218]: Existing table is not a view
 The following is an existing table, not a view: default.tab1
hive> CREATE VIEW IF NOT EXISTS tab1 AS SELECT * FROM t1;
OK
Time taken: 0.678 seconds
```

**Issue 2: Strange Error when Issuing Load Table Against A View**

Users should not be allowed to issue LOAD DATA against a view. Currently, when users do this, we get a very strange runtime error. For example,
```SQL
LOAD DATA LOCAL INPATH "$testData" INTO TABLE $viewName
```
```
java.lang.reflect.InvocationTargetException was thrown.
java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.sql.hive.client.Shim_v0_14.loadTable(HiveShim.scala:680)
```

## How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Closes #14314 from gatorsmile/tableDDLAgainstView.

* [SPARK-14131][STREAMING][SQL] Improved fix for avoiding potential deadlocks in HDFSMetadataLog | Tathagata Das | 2016-07-25 | 2 | -29/+30

## What changes were proposed in this pull request?
The current fix for the deadlock disables interrupts in the StreamExecution while getting offsets for all sources and when writing to any metadata log, to avoid potential deadlocks in HDFSMetadataLog (see JIRA for more details). However, disabling interrupts can have unintended consequences in other sources. So I am making the fix narrower, by disabling interrupts only in HDFSMetadataLog. This is a narrower fix for something risky like disabling interrupts.

## How was this patch tested?
Existing tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #14292 from tdas/SPARK-14131.

* [SPARK-16691][SQL] move BucketSpec to catalyst module and use it in CatalogTable | Wenchen Fan | 2016-07-25 | 11 | -44/+24

## What changes were proposed in this pull request?
It's weird that we have `BucketSpec` to abstract bucket info, but don't use it in `CatalogTable`. This PR moves `BucketSpec` into the catalyst module.

## How was this patch tested?
Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14331 from cloud-fan/check.

* [SPARK-16660][SQL] CreateViewCommand should not take CatalogTable | Wenchen Fan | 2016-07-25 | 3 | -90/+99

## What changes were proposed in this pull request?
`CreateViewCommand` only needs some information of a `CatalogTable`, but not all of it. We have some tricks (e.g. we need to check the table type is `VIEW`, we need to make `CatalogColumn.dataType` nullable) to allow it to take a `CatalogTable`. This PR cleans it up and only passes in the necessary information to `CreateViewCommand`.

## How was this patch tested?
Existing tests.

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14297 from cloud-fan/minor2.

* [SPARK-16674][SQL] Avoid per-record type dispatch in JDBC when reading | hyukjinkwon | 2016-07-25 | 1 | -116/+129

## What changes were proposed in this pull request?
Currently, `JDBCRDD.compute` is doing type dispatch for each row to read appropriate values. It might not have to be done like this because the schema is already kept in `JDBCRDD`. So, appropriate converters can be created first according to the schema, and then applied to each row.

## How was this patch tested?
Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14313 from HyukjinKwon/SPARK-16674.

* [SPARK-16699][SQL] Fix performance bug in hash aggregate on long string keys | Qifan Pu | 2016-07-24 | 1 | -2/+4

In the following code in `VectorizedHashMapGenerator.scala`:
```
def hashBytes(b: String): String = {
  val hash = ctx.freshName("hash")
  s"""
     |int $result = 0;
     |for (int i = 0; i < $b.length; i++) {
     |  ${genComputeHash(ctx, s"$b[i]", ByteType, hash)}
     |  $result = ($result ^ (0x9e3779b9)) + $hash + ($result << 6) + ($result >>> 2);
     |}
   """.stripMargin
}
```
when b = input.getBytes(), the current 2.0 code results in getBytes() being called n times, n being the length of input. getBytes() involves a memory copy and is thus expensive, causing a performance degradation. The fix is to evaluate getBytes() before the for loop.

Performance bug, no additional test added.

Author: Qifan Pu <qifan.pu@gmail.com>
Closes #14337 from ooq/SPARK-16699.
(cherry picked from commit d226dce12babcd9f30db033417b2b9ce79f44312)
Signed-off-by: Reynold Xin <rxin@databricks.com>

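A plain-Scala sketch of the hoisting principle behind the fix (the real change is in the generated Java produced by `VectorizedHashMapGenerator.scala`; the method below is only an analog):

```scala
// Illustrative only: materialize the byte array once, outside the loop, so the
// memory copy in getBytes() happens once per value rather than once per byte.
object HashBytesDemo {
  def hashBytes(input: String): Int = {
    val bytes = input.getBytes          // evaluated once, before the loop
    var result = 0
    var i = 0
    while (i < bytes.length) {
      val hash = bytes(i).toInt         // stand-in for the generated per-byte hash
      result = (result ^ 0x9e3779b9) + hash + (result << 6) + (result >>> 2)
      i += 1
    }
    result
  }
}
```
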
* [SPARK-16645][SQL] rename CatalogStorageFormat.serdeProperties to properties | Wenchen Fan | 2016-07-25 | 5 | -16/+16

## What changes were proposed in this pull request?
We also store data source table options in this field, so it's unreasonable to call it `serdeProperties`.

## How was this patch tested?
N/A

Author: Wenchen Fan <wenchen@databricks.com>
Closes #14283 from cloud-fan/minor1.

* [SPARK-16463][SQL] Support `truncate` option in Overwrite mode for JDBC DataFrameWriter | Dongjoon Hyun | 2016-07-24 | 8 | -2/+49

## What changes were proposed in this pull request?
This PR adds a boolean option, `truncate`, for `SaveMode.Overwrite` of the JDBC DataFrameWriter. If this option is `true`, it tries to take advantage of `TRUNCATE TABLE` instead of `DROP TABLE`. This is a trivial option, but will provide great **convenience** for BI tool users based on RDBMS tables generated by Spark.

**Goal**
- Without `CREATE/DROP` privilege, we can save a dataframe to the database. Sometimes these are not allowed for security reasons.
- It will preserve the existing table information, so users can add and keep some additional `INDEX` and `CONSTRAINT`s for the table.
- Sometimes, `TRUNCATE` is faster than the combination of `DROP/CREATE`.

**Supported DBMS**
The following is the `truncate`-option support table. Due to the different behavior of `TRUNCATE TABLE` among DBMSs, it's not always safe to use `TRUNCATE TABLE`. Spark will ignore the `truncate` option for **unknown** and **some** DBMS with **default CASCADING** behavior. A newly added JDBCDialect should implement the corresponding function to additionally support the `truncate` option.

Spark Dialects | `truncate` OPTION SUPPORT
---------------|---------------------------
MySQLDialect | O
PostgresDialect | X
DB2Dialect | O
MsSqlServerDialect | O
DerbyDialect | O
OracleDialect | O

**Before (TABLE with INDEX case)**: SparkShell & MySQL CLI are interleaved intentionally.
```scala
scala> val (url, prop)=("jdbc:mysql://localhost:3306/temp?useSSL=false", new java.util.Properties)
scala> prop.setProperty("user","root")
scala> df.write.mode("overwrite").jdbc(url, "table_with_index", prop)
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
mysql> CREATE UNIQUE INDEX idx_id ON table_with_index(id);
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
scala> spark.range(10).write.mode("overwrite").jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   |     | NULL    |       |
+-------+------------+------+-----+---------+-------+
```

**After (TABLE with INDEX case)**
```scala
scala> spark.range(10).write.mode("overwrite").option("truncate", true).jdbc(url, "table_with_index", prop)
mysql> DESC table_with_index;
+-------+------------+------+-----+---------+-------+
| Field | Type       | Null | Key | Default | Extra |
+-------+------------+------+-----+---------+-------+
| id    | bigint(20) | NO   | PRI | NULL    |       |
+-------+------------+------+-----+---------+-------+
```

**Error Handling**
- In case of exceptions, Spark will not retry. Users should turn off the `truncate` option.
- In case of schema change:
  - If one of the column names changes, this will raise exceptions intuitively.
  - If there exists only a type difference, this will work like Append mode.

## How was this patch tested?
Pass the Jenkins tests with an updated testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14086 from dongjoon-hyun/SPARK-16410.

* [SPARK-16556][SPARK-16559][SQL] Fix Two Bugs in Bucket Specification | gatorsmile | 2016-07-22 | 2 | -1/+12

### What changes were proposed in this pull request?
**Issue 1: Silent Ignorance of Bucket Specification When Creating Table Using Schema Inference**

When creating a data source table without explicit specification of schema or SELECT clause, we silently ignore the bucket specification (CLUSTERED BY... SORTED BY...) in [the code](https://github.com/apache/spark/blob/ce3b98bae28af72299722f56e4e4ef831f471ec0/sql/core/src/main/scala/org/apache/spark/sql/execution/command/createDataSourceTables.scala#L339-L354). For example,
```SQL
CREATE TABLE jsonTable
USING org.apache.spark.sql.json
OPTIONS (
  path '${tempDir.getCanonicalPath}'
)
CLUSTERED BY (inexistentColumnA) SORTED BY (inexistentColumnB) INTO 2 BUCKETS
```
This PR captures it and issues an error message.

**Issue 2: Got a run-time `java.lang.ArithmeticException` when num of buckets is set to zero.**

For example,
```SQL
CREATE TABLE t USING PARQUET
OPTIONS (PATH '${path.toString}')
CLUSTERED BY (a) SORTED BY (b) INTO 0 BUCKETS
AS SELECT 1 AS a, 2 AS b
```
The exception we got is
```
ERROR org.apache.spark.executor.Executor: Exception in task 0.0 in stage 1.0 (TID 2)
java.lang.ArithmeticException: / by zero
```
This PR captures the misuse and issues an appropriate error message.

### How was this patch tested?
Added a test case in DDLSuite.

Author: gatorsmile <gatorsmile@gmail.com>
Closes #14210 from gatorsmile/createTableWithoutSchema.

* [SPARK-16334] Maintain single dictionary per row-batch in vectorized parquet reader | Sameer Agarwal | 2016-07-21 | 1 | -8/+13

## What changes were proposed in this pull request?
As part of the bugfix in https://github.com/apache/spark/pull/12279, if a row batch consists of both dictionary encoded and non-dictionary encoded pages, we explicitly decode the dictionary for the values that are already dictionary encoded. Currently we reset the dictionary while reading every page, which can potentially cause a `java.lang.ArrayIndexOutOfBoundsException` while decoding older pages. This patch fixes the problem by maintaining a single dictionary per row-batch in the vectorized parquet reader.

## How was this patch tested?
Manual tests against a number of hand-generated parquet files.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
Closes #14225 from sameeragarwal/vectorized.

* [SPARK-16632][SQL] Revert PR #14272: Respect Hive schema when merging parquet schema | Cheng Lian | 2016-07-21 | 1 | -18/+0

## What changes were proposed in this pull request?
PR #14278 is a more general and simpler fix for SPARK-16632 than PR #14272. After merging #14278, we no longer need the changes made in #14272. So here I revert them.

This PR targets both master and branch-2.0.

## How was this patch tested?
Existing tests.

Author: Cheng Lian <lian@databricks.com>
Closes #14300 from liancheng/revert-pr-14272.

* [SPARK-16632][SQL] Use Spark requested schema to guide vectorized Parquet reader initialization | Cheng Lian | 2016-07-21 | 1 | -1/+4

## What changes were proposed in this pull request?
In `SpecificParquetRecordReaderBase`, which is used by the vectorized Parquet reader, we convert the Parquet requested schema into a Spark schema to guide column reader initialization. However, the Parquet requested schema is tailored from the schema of the physical file being scanned, and may have inaccurate type information due to bugs of other systems (e.g. HIVE-14294).

On the other hand, we already set the real Spark requested schema into Hadoop configuration in [`ParquetFileFormat`][1]. This PR simply reads out this schema to replace the converted one.

## How was this patch tested?
New test case added in `ParquetQuerySuite`.

[1]: https://github.com/apache/spark/blob/v2.0.0-rc5/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala#L292-L294

Author: Cheng Lian <lian@databricks.com>
Closes #14278 from liancheng/spark-16632-simpler-fix.

* [SPARK-16226][SQL] Weaken JDBC isolation level to avoid locking when writing partitions | Sean Owen | 2016-07-21 | 2 | -14/+46

## What changes were proposed in this pull request?
Saving partitions to JDBC in a transaction can use a weaker transaction isolation level to reduce locking. Use a better method to check whether transactions are supported.

## How was this patch tested?
Existing Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>
Closes #14054 from srowen/SPARK-16226.

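A hedged sketch of the probing idea described above, using only standard `java.sql` metadata calls; the helper and fallback policy are assumptions, not the exact logic in Spark's JDBC writer:

```scala
import java.sql.Connection

object IsolationProbe {
  // Check what the database actually supports before weakening the isolation level.
  def chooseIsolationLevel(conn: Connection, desired: Int): Int = {
    val metadata = conn.getMetaData
    if (!metadata.supportsTransactions()) {
      Connection.TRANSACTION_NONE                 // no transactions at all
    } else if (metadata.supportsTransactionIsolationLevel(desired)) {
      desired                                     // e.g. Connection.TRANSACTION_READ_UNCOMMITTED
    } else {
      metadata.getDefaultTransactionIsolation     // fall back to the database default
    }
  }
}
```
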
* [SPARK-16272][CORE] Allow config values to reference conf, env, system props. | Marcelo Vanzin | 2016-07-20 | 1 | -6/+3

This allows configuration to be more flexible, for example, when the cluster does not have a homogeneous configuration (e.g. packages are installed on different paths in different nodes). By allowing one to reference the environment from the conf, it becomes possible to work around those in certain cases.

As part of the implementation, ConfigEntry now keeps track of all "known" configs (i.e. those created through the use of ConfigBuilder), since that list is used by the resolution code. This duplicates some code in SQLConf, which could potentially be merged with this now. It will also make it simpler to implement some missing features such as filtering which configs show up in the UI or in event logs - which are not part of this change.

Another change is in the way ConfigEntry reads config data; it now takes a string map and a function that reads env variables, so that it can be called both from SparkConf and SQLConf. This makes it so both places follow the same read path, instead of having to replicate certain logic in SQLConf. There are still a couple of methods in SQLConf that peek into fields of ConfigEntry directly, though.

Tested via unit tests, and by using the new variable expansion functionality in a shell session with a custom spark.sql.hive.metastore.jars value.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #14022 from vanzin/SPARK-16272.

* [SPARK-16344][SQL] Decoding Parquet array of struct with a single field named "element" | Cheng Lian | 2016-07-20 | 4 | -44/+52

## What changes were proposed in this pull request?
Due to backward-compatibility reasons, the following Parquet schema is ambiguous:
```
optional group f (LIST) {
  repeated group list {
    optional group element {
      optional int32 element;
    }
  }
}
```
According to the parquet-format spec, when interpreted as a standard 3-level layout, this type is equivalent to the following SQL type:
```
ARRAY<STRUCT<element: INT>>
```
However, when interpreted as a legacy 2-level layout, it's equivalent to
```
ARRAY<STRUCT<element: STRUCT<element: INT>>>
```
Historically, to disambiguate these cases, we employed two methods:

- `ParquetSchemaConverter.isElementType()`: used to disambiguate the above cases while converting Parquet types to Spark types.
- `ParquetRowConverter.isElementType()`: used to disambiguate the above cases while instantiating row converters that convert Parquet records to Spark rows.

Unfortunately, these two methods make different decisions about the above problematic Parquet type, and caused SPARK-16344.

`ParquetRowConverter.isElementType()` is necessary for Spark 1.4 and earlier versions because Parquet requested schemata are directly converted from Spark schemata in these versions. The converted Parquet schemata may be incompatible with actual schemata of the underlying physical files when the files are written by a system/library that uses a schema conversion scheme that is different from Spark when writing Parquet LIST and MAP fields.

In Spark 1.5, Parquet requested schemata are always properly tailored from schemata of physical files to be read. Thus `ParquetRowConverter.isElementType()` is no longer necessary. This PR replaces this method with a simple yet accurate scheme: whenever an ambiguous Parquet type is hit, convert the type in question back to a Spark type using `ParquetSchemaConverter` and check whether it matches the corresponding Spark type.

## How was this patch tested?
New test cases added in `ParquetHiveCompatibilitySuite` and `ParquetQuerySuite`.

Author: Cheng Lian <lian@databricks.com>
Closes #14014 from liancheng/spark-16344-for-master-and-2.0.

* [SPARK-16632][SQL] Respect Hive schema when merging parquet schema. | Marcelo Vanzin | 2016-07-20 | 1 | -0/+18

When Hive (or at least certain versions of Hive) creates parquet files containing tinyint or smallint columns, it stores them as int32, but doesn't annotate the parquet field as containing the corresponding int8 / int16 data. When Spark reads those files using the vectorized reader, it follows the parquet schema for these fields, but when actually reading the data it tries to use the type fetched from the metastore, and then fails because data has been loaded into the wrong fields in OnHeapColumnVector.

So instead of blindly trusting the parquet schema, check whether the Catalyst-provided schema disagrees with it, and adjust the types so that the necessary metadata is present when loading the data into the ColumnVector instance.

Tested with unit tests and with tests that create byte / short columns in Hive and try to read them from Spark.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #14272 from vanzin/SPARK-16632.

* [SPARK-16615][SQL] Expose sqlContext in SparkSession | Reynold Xin | 2016-07-18 | 1 | -1/+3

## What changes were proposed in this pull request?
This patch removes the private[spark] qualifier for SparkSession.sqlContext, as discussed in http://apache-spark-developers-list.1001551.n3.nabble.com/Re-transtition-SQLContext-to-SparkSession-td18342.html

## How was this patch tested?
N/A - this is a visibility change.

Author: Reynold Xin <rxin@databricks.com>
Closes #14252 from rxin/SPARK-16615.

* [SPARK-16515][SQL] set default record reader and writer for script transformation | Daoyuan Wang | 2016-07-18 | 1 | -5/+11

## What changes were proposed in this pull request?
In ScriptInputOutputSchema, we read the default RecordReader and RecordWriter from conf. Since Spark 2.0 has deleted those config keys from hive conf, we have to set the default reader/writer class names ourselves. Otherwise we will get None for LazySimpleSerde, and the data written would not be readable by the script. The test case added worked fine with previous versions of Spark, but would fail now.

## How was this patch tested?
Added a test case in SQLQuerySuite.

Closes #14169

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Yin Huai <yhuai@databricks.com>
Closes #14249 from yhuai/scriptTransformation.

* [SPARK-16351][SQL] Avoid per-record type dispatch in JSON when writing | hyukjinkwon | 2016-07-18 | 3 | -64/+163

## What changes were proposed in this pull request?
Currently, `JacksonGenerator.apply` is doing type-based dispatch for each row to write appropriate values. It might not have to be done like this because the schema is already kept. So, appropriate writers can be created first according to the schema once, and then applied to each row. This approach is similar to `CatalystWriteSupport`.

This PR corrects `JacksonGenerator` so that it creates all writers for the schema once and then applies them to each row rather than type dispatching for every row.

The benchmark was performed with the code below:

```scala
test("Benchmark for JSON writer") {
  val N = 500 << 8
  val row =
    """{"struct":{"field1": true, "field2": 92233720368547758070},
      "structWithArrayFields":{"field1":[4, 5, 6], "field2":["str1", "str2"]},
      "arrayOfString":["str1", "str2"],
      "arrayOfInteger":[1, 2147483647, -2147483648],
      "arrayOfLong":[21474836470, 9223372036854775807, -9223372036854775808],
      "arrayOfBigInteger":[922337203685477580700, -922337203685477580800],
      "arrayOfDouble":[1.2, 1.7976931348623157E308, 4.9E-324, 2.2250738585072014E-308],
      "arrayOfBoolean":[true, false, true],
      "arrayOfNull":[null, null, null, null],
      "arrayOfStruct":[{"field1": true, "field2": "str1"}, {"field1": false}, {"field3": null}],
      "arrayOfArray1":[[1, 2, 3], ["str1", "str2"]],
      "arrayOfArray2":[[1, 2, 3], [1.1, 2.1, 3.1]]
      }"""
  val df = spark.sqlContext.read.json(spark.sparkContext.parallelize(List.fill(N)(row)))
  val benchmark = new Benchmark("JSON writer", N)
  benchmark.addCase("writing JSON file", 10) { _ =>
    withTempPath { path =>
      df.write.format("json").save(path.getCanonicalPath)
    }
  }
  benchmark.run()
}
```

This produced the results below:

- **Before**
```
JSON writer:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
writing JSON file                             1675 / 1767          0.1       13087.5       1.0X
```

- **After**
```
JSON writer:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
writing JSON file                             1597 / 1686          0.1       12477.1       1.0X
```

In addition, I ran this benchmark 10 times for each and calculated the average elapsed time as below:

| **Before** | **After** |
|------------|-----------|
| 17478ms    | 16669ms   |

It seems roughly ~5% is improved.

## How was this patch tested?
Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>
Closes #14028 from HyukjinKwon/SPARK-16351.

* [SPARK-16588][SQL] Deprecate monotonicallyIncreasingId in Scala/Java | Reynold Xin | 2016-07-17 | 1 | -0/+1

This patch deprecates monotonicallyIncreasingId in Scala/Java, as done in Python.

This patch was originally written by HyukjinKwon.

Closes #14236.

* [SPARK-16543][SQL] Rename the columns of `SHOW PARTITION/COLUMNS` commands | Dongjoon Hyun | 2016-07-14 | 1 | -4/+2

## What changes were proposed in this pull request?
This PR changes the name of columns returned by `SHOW PARTITION` and `SHOW COLUMNS` commands. Currently, both commands use `result` as a column name.

**Comparison: Column Name**

Command | Spark(Before) | Spark(After) | Hive
--------|---------------|--------------|-----
SHOW PARTITIONS | result | partition | partition
SHOW COLUMNS | result | col_name | field

Note that Spark/Hive uses `col_name` in `DESC TABLES`. So, this PR chooses `col_name` for consistency among Spark commands.

**Before**
```scala
scala> sql("show partitions p").show()
+------+
|result|
+------+
|   b=2|
+------+

scala> sql("show columns in p").show()
+------+
|result|
+------+
|     a|
|     b|
+------+
```

**After**
```scala
scala> sql("show partitions p").show
+---------+
|partition|
+---------+
|      b=2|
+---------+

scala> sql("show columns in p").show
+--------+
|col_name|
+--------+
|       a|
|       b|
+--------+
```

## How was this patch tested?
Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>
Closes #14199 from dongjoon-hyun/SPARK-16543.

* [SPARK-16503] SparkSession should provide Spark version | Liwei Lin | 2016-07-13 | 1 | -1/+8

## What changes were proposed in this pull request?
This patch enables SparkSession to provide the Spark version.

## How was this patch tested?
Manual test:
```
scala> sc.version
res0: String = 2.1.0-SNAPSHOT

scala> spark.version
res1: String = 2.1.0-SNAPSHOT
```
```
>>> sc.version
u'2.1.0-SNAPSHOT'
>>> spark.version
u'2.1.0-SNAPSHOT'
```

Author: Liwei Lin <lwlin7@gmail.com>
Closes #14165 from lw-lin/add-version.

* [SPARK-16482][SQL] Describe Table Command for Tables Requiring Runtime Inferred Schema | gatorsmile | 2016-07-13 | 1 | -15/+13

#### What changes were proposed in this pull request?
If we create a table pointing to a parquet/json dataset without specifying the schema, the describe table command does not show the schema at all. It only shows `# Schema of this table is inferred at runtime`. In 1.6, describe table does show the schema of such a table.

~~For data source tables, to infer the schema, we need to load the data source tables at runtime. Thus, this PR calls the function `lookupRelation`.~~

For data source tables, we infer the schema before table creation. Thus, this PR sets the inferred schema as the table schema at table creation.

#### How was this patch tested?
Added test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Closes #14148 from gatorsmile/describeSchema.

* [SPARK-16439] Fix number formatting in SQL UI | Maciej Brynski | 2016-07-13 | 1 | -1/+3

## What changes were proposed in this pull request?
The Spark SQL UI displays numbers greater than 1000 with `\u00A0` (no-break space) as the grouping separator. The problem exists when the server locale uses a non-breaking space as the separator (for example pl_PL). This patch turns off grouping and removes this separator.

The problem starts with this PR: https://github.com/apache/spark/pull/12425/files#diff-803f475b01acfae1c5c96807c2ea9ddcR125

## How was this patch tested?
Manual UI tests. Screenshot attached.

![image](https://cloud.githubusercontent.com/assets/4006010/16749556/5cb5a372-47cb-11e6-9a95-67fd3f9d1c71.png)

Author: Maciej Brynski <maciej.brynski@adpilot.pl>
Closes #14142 from maver1ck/master.

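A minimal sketch of the locale issue and the grouping-off fix; the helper name is an assumption, and only standard `java.text.NumberFormat` calls are used:

```scala
import java.text.NumberFormat
import java.util.Locale

// With a locale like pl_PL the grouping separator is a no-break space, which
// renders badly in the UI; turning grouping off avoids the separator entirely.
object MetricFormatting {
  def format(value: Long): String = {
    val nf = NumberFormat.getIntegerInstance(Locale.getDefault)
    nf.setGroupingUsed(false)    // "1234567" instead of "1 234 567"
    nf.format(value)
  }
}
```
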
* [MINOR] Fix Java style errors and remove unused imports | Xin Ren | 2016-07-13 | 1 | -2/+1

## What changes were proposed in this pull request?
Fix Java style errors and remove unused imports, which were randomly found.

## How was this patch tested?
Tested on my local machine.

Author: Xin Ren <iamshrek@126.com>
Closes #14161 from keypointt/SPARK-16437.

* [SPARK-16119][SQL] Support PURGE option to drop table / partition. | Marcelo Vanzin | 2016-07-12 | 3 | -13/+11

This option is used by Hive to directly delete the files instead of moving them to the trash. This is needed in certain configurations where moving the files does not work. For non-Hive tables and partitions, Spark already behaves as if the PURGE option was set, so there's no need to do anything.

Hive support for PURGE was added in 0.14 (for tables) and 1.2 (for partitions), so the code reflects that: trying to use the option with older versions of Hive will cause an exception to be thrown.

The change is a little noisier than I would like, because of the code to propagate the new flag through all the interfaces and implementations; the main changes are in the parser and in HiveShim, aside from the tests (DDLCommandSuite, VersionsSuite).

Tested by running sql and catalyst unit tests, plus VersionsSuite which has been updated to test the version-specific behavior. I also ran an internal test suite that uses PURGE and would not pass previously.

Author: Marcelo Vanzin <vanzin@cloudera.com>
Closes #13831 from vanzin/SPARK-16119.

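An illustration of the DDL this change enables; the table and partition names are made up, and the version constraints (Hive 0.14 for tables, 1.2 for partitions) are taken from the commit message above:

```scala
// spark-shell style sketch: PURGE deletes files directly instead of moving them to the trash.
scala> spark.sql("DROP TABLE IF EXISTS logs PURGE")

// Partition-level PURGE requires Hive 1.2+ per the commit message.
scala> spark.sql("ALTER TABLE events DROP IF EXISTS PARTITION (day = '2016-07-12') PURGE")
```
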
* [SPARK-15752][SQL] Optimize metadata only query that has an aggregate whose children are deterministic project or filter operators. | Lianhui Wang | 2016-07-12 | 3 | -0/+163

## What changes were proposed in this pull request?
When a query only uses metadata (for example, the partition key), it can return results based on metadata without scanning files. Hive did this in HIVE-1003.

## How was this patch tested?
Added unit tests.

Author: Lianhui Wang <lianhuiwang09@gmail.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Lianhui Wang <lianhuiwang@users.noreply.github.com>
Closes #13494 from lianhuiwang/metadata-only.

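A sketch of the kind of query this optimization targets, assuming a hypothetical table `partitioned_events` partitioned by a `dt` column: the answers depend only on the partition values known to the catalog, so no data files need to be scanned.

```scala
// spark-shell style illustration; table and column names are assumptions.
scala> spark.sql("SELECT DISTINCT dt FROM partitioned_events").show()
scala> spark.sql("SELECT MAX(dt) FROM partitioned_events").show()
```
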