Author: Michael Armbrust <michael@databricks.com>
Closes #2147 from marmbrus/inMemDefaultSize and squashes the following commits:
5390360 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into inMemDefaultSize
14204d3 [Michael Armbrust] Set the context before creating SparkLogicalPlans.
8da4414 [Michael Armbrust] Make sure we throw errors when leaf nodes fail to provide statistcs
18ce029 [Michael Armbrust] Ensure in-memory tables don't always broadcast.
udf_unix_timestamp format "yyyy MMM dd h:mm:ss a" run with not "America/Los_Angeles" TimeZone in HiveCompatibilitySuite
Running the udf_unix_timestamp test case of org.apache.spark.sql.hive.execution.HiveCompatibilitySuite
with a TimeZone other than "America/Los_Angeles" throws an error. [https://issues.apache.org/jira/browse/SPARK-3065]
Add a locale setting in the beforeAll and afterAll methods to fix this bug in the HiveCompatibilitySuite test case.
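Why the JVM defaults matter for this format can be sketched as follows. This is only an illustration, not the suite's code: `UnixTimestampSketch`, `unixTimestamp`, and the sample input string are hypothetical, and the sketch assumes the test ultimately relies on `SimpleDateFormat`-style parsing.

```scala
import java.text.SimpleDateFormat
import java.util.{Locale, TimeZone}

object UnixTimestampSketch {
  // Parses `text` with the pattern from the test name, using an explicit
  // locale and time zone instead of whatever the JVM defaults happen to be.
  def unixTimestamp(text: String, zoneId: String, locale: Locale): Long = {
    val format = new SimpleDateFormat("yyyy MMM dd h:mm:ss a", locale)
    format.setTimeZone(TimeZone.getTimeZone(zoneId))
    // getTime is in milliseconds; a unix timestamp is in seconds.
    format.parse(text).getTime / 1000L
  }
}
```

The same text maps to different timestamps in different time zones, and the `MMM` and `a` fields are parsed according to the locale, which is why a test with a fixed golden answer has to pin both.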
Author: luogankun <luogankun@gmail.com>
Closes #1968 from luogankun/SPARK-3065 and squashes the following commits:
c167832 [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
0a25e3a [luogankun] [SPARK-3065][SQL] Add Locale setting to HiveCompatibilitySuite
Currently we do `relation.hiveQlTable.getDataLocation.getPath`, which returns the path-part of the URI (e.g., "s3n://my-bucket/my-path" => "/my-path"). We should do `relation.hiveQlTable.getDataLocation.toString` instead, as a URI's toString returns a faithful representation of the full URI, which can later be passed into a Hadoop Path.
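The difference can be reproduced with a plain `java.net.URI`. This is a minimal sketch that assumes the data location behaves like a `java.net.URI`, as the description implies; `LocationDemo` and its fields are hypothetical names.

```scala
import java.net.URI

object LocationDemo {
  // Example location taken from the description above.
  val location = new URI("s3n://my-bucket/my-path")

  // getPath keeps only the path component and drops the scheme and bucket.
  val pathOnly: String = location.getPath

  // toString keeps the full URI, so it can be handed to a Hadoop Path later.
  val full: String = location.toString
}
```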
Author: Aaron Davidson <aaron@databricks.com>
Closes #2150 from aarondav/parquet-location and squashes the following commits:
459f72c [Aaron Davidson] [SQL] [SPARK-3236] Reading Parquet tables from Metastore mangles location
According to the message text, both relations should be tested, so this adds the missing condition.
Author: viirya <viirya@gmail.com>
Closes #2159 from viirya/fix_test and squashes the following commits:
b1c0f52 [viirya] add missing condition.
file as parameter
```if (!fs.getFileStatus(path).isDir) throw Exception``` makes no sense after commit #1370.
If you are working on SPARK-2551, make sure the new change passes the test case ```test("Read a parquet file instead of a directory")```.
Author: chutium <teng.qiu@gmail.com>
Closes #2044 from chutium/parquet-singlefile and squashes the following commits:
4ae477f [chutium] [SPARK-3138][SQL] sqlContext.parquetFile should be able to take a single file as parameter
aggregation function (min/max)
In Catalyst, the min/max aggregation functions create an expression tree for every single row, and expression tree creation is currently quite expensive in a multithreaded environment. As a result, min/max performed very poorly.
Here is the benchmark I ran locally:
Master | Previous Result (ms) | Current Result (ms)
------------ | ------------- | -------------
local | 3645 | 3416
local[6] | 3602 | 1002
The benchmark source code:
```
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

case class Record(key: Int, value: Int)

object TestHive2 extends HiveContext(new SparkContext("local[6]", "TestSQLContext", new SparkConf()))

// Creates the RCFile tables and loads 10 million records into table `a`.
object DataPrepare extends App {
  import TestHive2._

  val rdd = sparkContext.parallelize((1 to 10000000).map(i => Record(i % 3000, i)), 12)

  runSqlHive("SHOW TABLES")
  runSqlHive("DROP TABLE if exists a")
  runSqlHive("DROP TABLE if exists result")
  rdd.registerAsTable("records")

  runSqlHive("""CREATE TABLE a (key INT, value INT)
               | ROW FORMAT SERDE
               | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
               | STORED AS RCFILE
             """.stripMargin)
  runSqlHive("""CREATE TABLE result (key INT, value INT)
               | ROW FORMAT SERDE
               | 'org.apache.hadoop.hive.serde2.columnar.LazyBinaryColumnarSerDe'
               | STORED AS RCFILE
             """.stripMargin)

  hql(s"""from records
         | insert into table a
         | select key, value
       """.stripMargin)
}

// Runs the min/max aggregation three times and prints the elapsed time of each run.
object PerformanceTest extends App {
  import TestHive2._

  hql("SHOW TABLES")
  hql("set spark.sql.shuffle.partitions=12")

  val cmd = "select min(value), max(value) from a group by key"
  val results = ("Result1", benchmark(cmd)) ::
                ("Result2", benchmark(cmd)) ::
                ("Result3", benchmark(cmd)) :: Nil
  results.foreach { case (prompt, result) =>
    println(s"$prompt: took ${result._1} ms (${result._2} records)")
  }

  // Returns (elapsed milliseconds, number of result rows).
  def benchmark(cmd: String) = {
    val begin = System.currentTimeMillis()
    val count = hql(cmd).count
    val end = System.currentTimeMillis()
    ((end - begin), count)
  }
}
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #2113 from chenghao-intel/aggregation_expression_optimization and squashes the following commits:
db40395 [Cheng Hao] remove the transient and add val for the expression property
d56167d [Cheng Hao] Reduce the Expressions creation
(FROM|IN) table_name [(FROM|IN) db_name]" support
JIRA issue: [SPARK-3118] https://issues.apache.org/jira/browse/SPARK-3118
eg:
> SHOW TBLPROPERTIES test;
SHOW TBLPROPERTIES test;
numPartitions 0
numFiles 1
transient_lastDdlTime 1407923642
numRows 0
totalSize 82
rawDataSize 0
eg:
> SHOW COLUMNS in test;
SHOW COLUMNS in test;
OK
Time taken: 0.304 seconds
id
stid
bo
Author: u0jing <u9jing@gmail.com>
Closes #2034 from u0jing/spark-3118 and squashes the following commits:
b231d87 [u0jing] add golden answer files
35f4885 [u0jing] add 'show columns' and 'show tblproperties' support
Author: Michael Armbrust <michael@databricks.com>
Closes #2153 from marmbrus/parquetFilters and squashes the following commits:
712731a [Michael Armbrust] Use closure serializer for sending filters.
1e83f80 [Michael Armbrust] Clean udf functions.
support to Parquet.
JIRA:
- https://issues.apache.org/jira/browse/SPARK-3036
- https://issues.apache.org/jira/browse/SPARK-3037
Currently this uses the following Parquet schema for `MapType` when `valueContainsNull` is `true`:
```
message root {
  optional group a (MAP) {
    repeated group map (MAP_KEY_VALUE) {
      required int32 key;
      optional int32 value;
    }
  }
}
```
and the following for `ArrayType` when `containsNull` is `true`:
```
message root {
  optional group a (LIST) {
    repeated group bag {
      optional int32 array;
    }
  }
}
```
We still have to think about compatibility with older versions of Spark, Hive, and the other systems I mentioned in the JIRA issues.
Notice:
This PR is based on #1963 and #1889.
Please check them first.
/cc marmbrus, yhuai
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #2032 from ueshin/issues/SPARK-3036_3037 and squashes the following commits:
4e8e9e7 [Takuya UESHIN] Add ArrayType containing null value support to Parquet.
013c2ca [Takuya UESHIN] Add MapType containing null value support to Parquet.
62989de [Takuya UESHIN] Merge branch 'issues/SPARK-2969' into issues/SPARK-3036_3037
8e38b53 [Takuya UESHIN] Merge branch 'issues/SPARK-3063' into issues/SPARK-3036_3037
AttributeReferences
It is common to want to describe sets of attributes that are in various parts of a query plan. However, the semantics of putting `AttributeReference` objects into a standard Scala `Set` result in subtle bugs when references differ cosmetically. For example, with case insensitive resolution it is possible to have two references to the same attribute whose names are not equal.
In this PR I introduce a new abstraction, an `AttributeSet`, which performs all comparisons using the globally unique `ExpressionId` instead of case class equality. (There is already a related class, [`AttributeMap`](https://github.com/marmbrus/spark/blob/inMemStats/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/AttributeMap.scala#L32).) This new type of set is used to fix a bug in the optimizer where needed attributes were getting projected away underneath join operators.
I also took this opportunity to refactor the expression and query plan base classes. In all but one instance the logic for computing the `references` of an `Expression` was the same. Thus, I moved this logic into the base class.
For query plans the semantics of the `references` method were ill defined (is it the references output? or is it those used by expression evaluation? or what?). As a result, this method wasn't really used very much. So, I removed it.
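The idea behind an id-based set can be sketched in a few lines. This is not the Catalyst implementation: `Attr`, `AttrSet`, and the `Long` id are hypothetical stand-ins for `AttributeReference`, `AttributeSet`, and `ExpressionId`.

```scala
object AttributeSetSketch {
  // Hypothetical stand-in for AttributeReference: `exprId` plays the role of
  // the globally unique ExpressionId, while `name` is only cosmetic.
  case class Attr(name: String, exprId: Long)

  // Minimal id-based set: membership is decided by exprId alone.
  class AttrSet private (private val byId: Map[Long, Attr]) {
    def contains(a: Attr): Boolean = byId.contains(a.exprId)
    def +(a: Attr): AttrSet = new AttrSet(byId + (a.exprId -> a))
    def size: Int = byId.size
  }

  object AttrSet {
    def apply(attrs: Attr*): AttrSet =
      new AttrSet(attrs.map(a => a.exprId -> a).toMap)
  }
}
```

With this sketch, `Attr("key", 1L)` and `Attr("KEY", 1L)` are different elements of a standard `Set` because case class equality compares the names, but `AttrSet` treats them as the same attribute.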
TODO:
- [x] Finish scala doc for `AttributeSet`
- [x] Scan the code for other instances of `Set[Attribute]` and refactor them.
- [x] Finish removing `references` from `QueryPlan`
Author: Michael Armbrust <michael@databricks.com>
Closes #2109 from marmbrus/attributeSets and squashes the following commits:
1c0dae5 [Michael Armbrust] work on serialization bug.
9ba868d [Michael Armbrust] Merge remote-tracking branch 'origin/master' into attributeSets
3ae5288 [Michael Armbrust] review comments
40ce7f6 [Michael Armbrust] style
d577cc7 [Michael Armbrust] Scaladoc
cae5d22 [Michael Armbrust] remove more references implementations
d6e16be [Michael Armbrust] Remove more instances of "def references" and normal sets of attributes.
fc26b49 [Michael Armbrust] Add AttributeSet class, remove references from Expression.
Currently `ExistingRdd.convertToCatalyst` doesn't convert `Map` values.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1963 from ueshin/issues/SPARK-3063 and squashes the following commits:
3ba41f2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
4d7bae2 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
9321379 [Takuya UESHIN] Merge branch 'master' into issues/SPARK-3063
d8a900a [Takuya UESHIN] Make ExistingRdd.convertToCatalyst be able to convert Map value.
ArrayType.containsNull and MapType.valueContainsNull.
Make `ScalaReflection` able to handle types like:
- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`
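The rule behind the mappings above is that a JVM primitive can never be null while a boxed type can. A simplified sketch of that rule, using a runtime `ClassTag` check rather than the type-level reflection that `ScalaReflection` works with (`NullabilitySketch` and its `containsNull` helper are hypothetical):

```scala
import scala.reflect.{ClassTag, classTag}

object NullabilitySketch {
  // Simplified rule: a JVM primitive element type can never hold null,
  // while any reference type (including boxed java.lang.Integer) can.
  def containsNull[T: ClassTag]: Boolean =
    !classTag[T].runtimeClass.isPrimitive
}
```

Under this rule `containsNull[Int]` is false and `containsNull[java.lang.Integer]` is true, matching the `Seq[Int]` and `Seq[java.lang.Integer]` cases listed above.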
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:
24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
ParquetFile in SQLContext
There are 4 different compression codecs available for ```ParquetOutputFormat```.
In Spark SQL, the codec was set as a hard-coded value in ```ParquetRelation.defaultCompression```.
Original discussion:
https://github.com/apache/spark/pull/195#discussion-diff-11002083
I added a new config property in SQLConf to allow users to change this compression codec, and I used a short-name syntax similar to the one described in SPARK-2953 #1873 (https://github.com/apache/spark/pull/1873/files#diff-0).
By the way, which codec should we use as the default? It was set to GZIP (https://github.com/apache/spark/pull/195/files#diff-4), but I think we should change it to SNAPPY, since SNAPPY is already the default codec for shuffling in spark-core (SPARK-2469, #1415), and parquet-mr supports the Snappy codec natively (https://github.com/Parquet/parquet-mr/commit/e440108de57199c12d66801ca93804086e7f7632).
Author: chutium <teng.qiu@gmail.com>
Closes #2039 from chutium/parquet-compression and squashes the following commits:
2f44964 [chutium] [SPARK-3131][SQL] parquet compression default codec set to snappy, also in test suite
e578e21 [chutium] [SPARK-3131][SQL] compression codec config property name and default codec set to snappy
21235dc [chutium] [SPARK-3131][SQL] Allow user to set parquet compression codec for writing ParquetFile in SQLContext
We can simply treat a cross join as an inner join without join conditions.
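The equivalence is easy to see on in-memory collections. A minimal sketch, with hypothetical helper names and no relation to the actual planner code:

```scala
object CrossJoinSketch {
  // Inner join over in-memory rows: keep every pair that satisfies `condition`.
  def innerJoin[A, B](left: Seq[A], right: Seq[B])(condition: (A, B) => Boolean): Seq[(A, B)] =
    for (l <- left; r <- right if condition(l, r)) yield (l, r)

  // A cross join is the same operation with a condition that is always true.
  def crossJoin[A, B](left: Seq[A], right: Seq[B]): Seq[(A, B)] =
    innerJoin(left, right)((_, _) => true)
}
```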
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: adrian-wang <daoyuanwong@gmail.com>
Closes #2124 from adrian-wang/crossjoin and squashes the following commits:
8c9b7c5 [Daoyuan Wang] add a test
7d47bbb [adrian-wang] add cross join support for hql
sqlContext.parquetFile
Fix the compile error on Hadoop 0.23 for pull request #1924.
Author: Chia-Yung Su <chiayung@appier.com>
Closes #1959 from joesu/bugfix-spark3011 and squashes the following commits:
be30793 [Chia-Yung Su] remove .* and _* except _metadata
8fe2398 [Chia-Yung Su] add note to explain
40ea9bd [Chia-Yung Su] fix hadoop-0.23 compile error
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
Author: wangfei <wangfei_hello@126.com>
Closes #1939 from scwf/patch-5 and squashes the following commits:
f952d10 [wangfei] [SQL] logWarning should be logInfo in getResultSetSchema
Provide `extended` keyword support for `explain` command in SQL. e.g.
```
explain extended select key as a1, value as a2 from src where key=1;
== Parsed Logical Plan ==
Project ['key AS a1#3,'value AS a2#4]
Filter ('key = 1)
UnresolvedRelation None, src, None
== Analyzed Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = CAST(1, DoubleType))
MetastoreRelation default, src, None
== Optimized Logical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = 1.0)
MetastoreRelation default, src, None
== Physical Plan ==
Project [key#8 AS a1#3,value#9 AS a2#4]
Filter (CAST(key#8, DoubleType) = 1.0)
HiveTableScan [key#8,value#9], (MetastoreRelation default, src, None), None
Code Generation: false
== RDD ==
(2) MappedRDD[14] at map at HiveContext.scala:350
MapPartitionsRDD[13] at mapPartitions at basicOperators.scala:42
MapPartitionsRDD[12] at mapPartitions at basicOperators.scala:57
MapPartitionsRDD[11] at mapPartitions at TableReader.scala:112
MappedRDD[10] at map at TableReader.scala:240
HadoopRDD[9] at HadoopRDD at TableReader.scala:230
```
This is a subtask of #1847, but it can be merged without any dependency on it.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1962 from chenghao-intel/explain_extended and squashes the following commits:
295db74 [Cheng Hao] Fix bug in printing the simple execution plan
48bc989 [Cheng Hao] Support EXTENDED for EXPLAIN
Removed most hard-coded timeouts and timing assumptions, and all `Thread.sleep` calls. Simplified IPC and synchronization with `scala.sys.process` and future/promise so that the test suites can run faster and more robustly.
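The general pattern of replacing a fixed sleep with a promise can be sketched like this. It is an illustration only: `ReadySignalSketch`, the "server" thread, and the message are hypothetical and are not taken from the test suites.

```scala
import scala.concurrent.{Await, Promise}
import scala.concurrent.duration.Duration

object ReadySignalSketch {
  // Starts a hypothetical "server" thread that reports readiness by completing
  // a promise, then blocks until that happens or the timeout expires.
  def startAndAwaitReady(timeout: Duration): String = {
    val ready = Promise[String]()
    val server = new Thread(new Runnable {
      def run(): Unit = {
        // Real startup work would happen here before signalling readiness.
        ready.trySuccess("server started")
      }
    })
    server.setDaemon(true)
    server.start()
    // Returns as soon as the promise completes; no fixed sleep is needed.
    Await.result(ready.future, timeout)
  }
}
```

The timeout becomes an upper bound rather than a delay that every run has to pay, which is what makes the tests both faster and less timing-sensitive.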
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1856 from liancheng/thriftserver-tests and squashes the following commits:
2d914ca [Cheng Lian] Minor refactoring
0e12e71 [Cheng Lian] Cleaned up test output
0ee921d [Cheng Lian] Refactored Thrift server and CLI suites
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #2116 from ueshin/issues/SPARK-3204 and squashes the following commits:
7d9b107 [Takuya UESHIN] Make MaxOf foldable if both left and right are foldable.
shuffle fix.
Follow-up to #2066
Author: Michael Armbrust <michael@databricks.com>
Closes #2072 from marmbrus/sortShuffle and squashes the following commits:
2ff8114 [Michael Armbrust] Fix bug
improvements
Author: Michael Armbrust <michael@databricks.com>
Author: Gregory Owen <greowen@gmail.com>
Closes #1935 from marmbrus/countDistinctPartial and squashes the following commits:
5c7848d [Michael Armbrust] turn off caching in the constructor
8074a80 [Michael Armbrust] fix tests
32d216f [Michael Armbrust] reynolds comments
c122cca [Michael Armbrust] Address comments, add tests
b2e8ef3 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
fae38f4 [Michael Armbrust] Fix style
fdca896 [Michael Armbrust] cleanup
93d0f64 [Michael Armbrust] metastore concurrency fix.
db44a30 [Michael Armbrust] JIT hax.
3868f6c [Michael Armbrust] Merge pull request #9 from GregOwen/countDistinctPartial
c9e67de [Gregory Owen] Made SpecificRow and types serializable by Kryo
2b46c4b [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
8ff6402 [Michael Armbrust] Add specific row.
58d15f1 [Michael Armbrust] disable codegen logging
87d101d [Michael Armbrust] Fix isNullAt bug
abee26d [Michael Armbrust] WIP
27984d0 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into countDistinctPartial
57ae3b1 [Michael Armbrust] Fix order dependent test
b3d0f64 [Michael Armbrust] Add golden files.
c1f7114 [Michael Armbrust] Improve tests / fix serialization.
f31b8ad [Michael Armbrust] more fixes
38c7449 [Michael Armbrust] comments and style
9153652 [Michael Armbrust] better toString
d494598 [Michael Armbrust] Fix tests now that the planner is better
41fbd1d [Michael Armbrust] Never try and create an empty hash set.
050bb97 [Michael Armbrust] Skip no-arg constructors for kryo,
bd08239 [Michael Armbrust] WIP
213ada8 [Michael Armbrust] First draft of partially aggregated and code generated count distinct / max
Seems we missed `transient` for the `functionRegistry` in `HiveContext`.
cc: marmbrus
Author: Yin Huai <huaiyin.thu@gmail.com>
Closes #2074 from yhuai/makeFunctionRegistryTransient and squashes the following commits:
6534e7d [Yin Huai] Make functionRegistry transient.
initialization of job conf
...al job conf
Author: Alex Liu <alex_liu68@yahoo.com>
Closes #1927 from alexliu68/SPARK-SQL-2846 and squashes the following commits:
e4bdc4c [Alex Liu] SPARK-SQL-2846 add configureInputJobPropertiesForStorageHandler to initial job conf
Add explicit row copies when sort based shuffle is on.
Author: Michael Armbrust <michael@databricks.com>
Closes #2066 from marmbrus/sortShuffle and squashes the following commits:
fcd7bb2 [Michael Armbrust] Fix sort based shuffle for spark sql.
This PR fixes two issues:
1. Fixes a wrongly quoted command line option in `HiveThriftServer2Suite` that makes test cases hang until timeout.
2. Asks `dev/run-test` to run Spark SQL tests when `bin/spark-sql` and/or `sbin/start-thriftserver.sh` are modified.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #2036 from liancheng/fix-thriftserver-test and squashes the following commits:
f38c4eb [Cheng Lian] Fixed the same quotation issue in CliSuite
26b82a0 [Cheng Lian] Run SQL tests when dff contains bin/spark-sql and/or sbin/start-thriftserver.sh
a87f83d [Cheng Lian] Extended timeout
e5aa31a [Cheng Lian] Fixed metastore JDBC URI quotation
Refer to:
http://stackoverflow.com/questions/510632/whats-the-difference-between-concurrenthashmap-and-collections-synchronizedmap
Collections.synchronizedMap(map) creates a blocking Map, which ensures consistency but degrades performance. So use ConcurrentHashMap (a more efficient thread-safe hash map) instead.
Also update HiveQuerySuite to fix a test error that appeared after changing to ConcurrentHashMap.
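A minimal sketch of concurrent writers sharing a `ConcurrentHashMap`. The `ConfSketch` object and the key names are hypothetical; this is not the `SQLConf` code.

```scala
import java.util.concurrent.ConcurrentHashMap

object ConfSketch {
  // Writes `perThread` distinct keys from each of `threads` threads into one
  // shared map. ConcurrentHashMap is thread-safe without wrapping every call
  // in a single global lock, which is what synchronizedMap does.
  def fill(threads: Int, perThread: Int): ConcurrentHashMap[String, String] = {
    val settings = new ConcurrentHashMap[String, String]()
    val workers = (0 until threads).map { t =>
      new Thread(new Runnable {
        def run(): Unit =
          for (i <- 0 until perThread) settings.put(s"key-$t-$i", i.toString)
      })
    }
    workers.foreach(_.start())
    workers.foreach(_.join())
    settings
  }
}
```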
Author: wangfei <wangfei_hello@126.com>
Author: scwf <wangfei1@huawei.com>
Closes #1996 from scwf/sqlconf and squashes the following commits:
93bc0c5 [wangfei] revert change of HiveQuerySuite
0cc05dd [wangfei] add note for use synchronizedMap
3c224d31 [scwf] fix formate
a7bcb98 [scwf] use ConcurrentHashMap in sql conf, intead synchronizedMap
HiveMetaStore tables.
This PR adds an experimental flag `spark.sql.hive.convertMetastoreParquet` that, when true, causes the planner to detect tables that use Hive's Parquet SerDe and instead plan them using Spark SQL's native `ParquetTableScan`.
Author: Michael Armbrust <michael@databricks.com>
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1819 from marmbrus/parquetMetastore and squashes the following commits:
1620079 [Michael Armbrust] Revert "remove hive parquet bundle"
cc30430 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4f3d54f [Michael Armbrust] fix style
41ebc5f [Michael Armbrust] remove hive parquet bundle
a43e0da [Michael Armbrust] Merge remote-tracking branch 'origin/master' into parquetMetastore
4c4dc19 [Michael Armbrust] Fix bug with tree splicing.
ebb267e [Michael Armbrust] include parquet hive to tests pass (Remove this later).
c0d9b72 [Michael Armbrust] Avoid creating a HadoopRDD per partition. Add dirty hacks to retrieve partition values from the InputSplit.
8cdc93c [Michael Armbrust] Merge pull request #8 from yhuai/parquetMetastore
a0baec7 [Yin Huai] Partitioning columns can be resolved.
1161338 [Michael Armbrust] Add a test to make sure conversion is actually happening
212d5cd [Michael Armbrust] Initial support for using ParquetTableScan to read HiveMetaStore tables.
For larger Parquet files, reading the file footers (which is done in parallel on up to 5 threads) and HDFS block locations (which is serial) can take multiple seconds. We can add an option to cache this data within FilteringParquetInputFormat. Unfortunately ParquetInputFormat only caches footers within each instance of ParquetInputFormat, not across them.
Note: this PR leaves this turned off by default for 1.1, but I believe it's safe to turn it on after. The keys in the hash maps are FileStatus objects that include a modification time, so this will work fine if files are modified. The location cache could become invalid if files have moved within HDFS, but that's rare so I just made it invalidate entries every 15 minutes.
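The two invalidation ideas described above (keys that include the modification time, and entries that expire after a fixed period) can be sketched with a small hand-written cache. The PR itself uses Guava caches; this plain-Scala stand-in with hypothetical names (`FileKey`, `ExpiringCache`) only illustrates the behavior.

```scala
import scala.collection.mutable

object FooterCacheSketch {
  // Hypothetical cache key: a path plus its modification time, so a rewritten
  // file naturally misses the cache.
  case class FileKey(path: String, modificationTime: Long)

  // Tiny time-based cache; `now` is injected so that expiry can be tested.
  class ExpiringCache[V](ttlMillis: Long, now: () => Long) {
    private val entries = mutable.Map.empty[FileKey, (Long, V)]

    // Returns the cached value if it is younger than the TTL, else reloads it.
    def getOrLoad(key: FileKey)(load: => V): V = synchronized {
      entries.get(key) match {
        case Some((storedAt, value)) if now() - storedAt < ttlMillis => value
        case _ =>
          val value = load
          entries.update(key, (now(), value))
          value
      }
    }
  }
}
```

A cache built as `new ExpiringCache[String](15 * 60 * 1000L, () => System.currentTimeMillis())` would mirror the 15-minute invalidation mentioned in the note.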
Author: Matei Zaharia <matei@databricks.com>
Closes #2005 from mateiz/parquet-cache and squashes the following commits:
dae8efe [Matei Zaharia] Bug fix
c71e9ed [Matei Zaharia] Handle empty statuses directly
22072b0 [Matei Zaharia] Use Guava caches and add a config option for caching metadata
8fb56ce [Matei Zaharia] Cache file block locations too
453bd21 [Matei Zaharia] Bug fix
4094df6 [Matei Zaharia] First attempt at caching Parquet footers
This definitely needs review as I am not familiar with this part of Spark.
I tested this locally and it did seem to work.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #1937 from pwendell/scheduler and squashes the following commits:
b858e33 [Patrick Wendell] SPARK-3025: Allow JDBC clients to set a fair scheduler pool
This reuses the CompactBuffer from Spark Core to save memory and pointer
dereferences. I also tried AppendOnlyMap instead of java.util.HashMap
but unfortunately that slows things down because it seems to do more
equals() calls and the equals on GenericRow, and especially JoinedRow,
is pretty expensive.
Author: Matei Zaharia <matei@databricks.com>
Closes #1993 from mateiz/spark-3085 and squashes the following commits:
188221e [Matei Zaharia] Remove unneeded import
5f903ee [Matei Zaharia] [SPARK-3085] [SQL] Use compact data structures in SQL joins
BroadcastHashJoin has a broadcastFuture variable that tries to collect
the broadcasted table in a separate thread, but this doesn't help
because it's a lazy val that only gets initialized when you attempt to
build the RDD. Thus queries that broadcast multiple tables would collect
and broadcast them sequentially. I changed this to a val to let it start
collecting right when the operator is created.
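The difference between the two declarations can be shown in isolation. This is a sketch with hypothetical classes (`LazyJoin`, `EagerJoin`) and a counter standing in for the collect step; it is not the `BroadcastHashJoin` code.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.concurrent.Future
import scala.concurrent.ExecutionContext.Implicits.global

object BroadcastSketch {
  // Hypothetical stand-in for collecting a table; counts how often it starts.
  private def collect(started: AtomicInteger): Future[Int] =
    Future { started.incrementAndGet() }

  // With `lazy val`, nothing is submitted until the field is first read.
  class LazyJoin(started: AtomicInteger) {
    lazy val broadcastFuture: Future[Int] = collect(started)
  }

  // With a plain `val`, the future is submitted when the object is constructed.
  class EagerJoin(started: AtomicInteger) {
    val broadcastFuture: Future[Int] = collect(started)
  }
}
```

Constructing several `EagerJoin` instances therefore starts all of their futures right away, whereas `LazyJoin` instances start one at a time as each field is first accessed.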
Author: Matei Zaharia <matei@databricks.com>
Closes #1990 from mateiz/spark-3084 and squashes the following commits:
f468766 [Matei Zaharia] [SPARK-3084] Collect broadcasted tables in parallel in joins
A small change - we should just add this dependency. It doesn't have any recursive deps and it's needed for reading Hive Parquet tables.
Author: Patrick Wendell <pwendell@gmail.com>
Closes #2009 from pwendell/parquet and squashes the following commits:
e411f9f [Patrick Wendell] SPARk-309: Include parquet hive serde by default in build
Author: Michael Armbrust <michael@databricks.com>
Closes #2004 from marmbrus/codgenDebugging and squashes the following commits:
b7a7e41 [Michael Armbrust] Improve debug logging and toStrings.
EventLogging is enabled"
Revert #1891 due to issues with hadoop 1 compatibility.
Author: Michael Armbrust <michael@databricks.com>
Closes #2007 from marmbrus/revert1891 and squashes the following commits:
68706c0 [Michael Armbrust] Revert "[SPARK-2970] [SQL] spark-sql script ends with IOException when EventLogging is enabled"
"install"
(This is the corrected follow-up to https://issues.apache.org/jira/browse/SPARK-2903)
Right now, `mvn compile test-compile` fails to compile Spark. (Don't worry; `mvn package` works, so this is not major.) The issue stems from test code in some modules depending on test code in other modules. That is perfectly fine and supported by Maven.
It takes extra work to get this to work with scalatest, and this has been attempted: https://github.com/apache/spark/blob/master/sql/catalyst/pom.xml#L86
This formulation is not quite enough, since the SQL Core module's tests fail to compile for lack of finding test classes in SQL Catalyst, and likewise for most Streaming integration modules depending on core Streaming test code. Example:
```
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/QueryTest.scala:23: not found: type PlanTest
[error] class QueryTest extends PlanTest {
[error] ^
[error] /Users/srowen/Documents/spark/sql/core/src/test/scala/org/apache/spark/sql/CachedTableSuite.scala:28: package org.apache.spark.sql.test is not a value
[error] test("SPARK-1669: cacheTable should be idempotent") {
[error] ^
...
```
The issue I believe is that generation of a `test-jar` is bound here to the `compile` phase, but the test classes are not being compiled in this phase. It should bind to the `test-compile` phase.
It works when executing `mvn package` or `mvn install` since test-jar artifacts are actually generated and made available through normal Maven mechanisms as each module is built. They are then found normally, regardless of scalatest configuration.
It would be nice for a simple `mvn compile test-compile` to work since the test code is perfectly compilable given the Maven declarations.
On the plus side, this change is low-risk as it only affects tests.
yhuai made the original scalatest change and has glanced at this and thinks it makes sense.
Author: Sean Owen <srowen@gmail.com>
Closes #1879 from srowen/SPARK-2955 and squashes the following commits:
ad8242f [Sean Owen] Generate test-jar on test-compile for modules whose tests are needed by others' tests
sqlContext.parquetFile
Reverts #1924 due to build failures with hadoop 0.23.
Author: Michael Armbrust <michael@databricks.com>
Closes #1949 from marmbrus/revert1924 and squashes the following commits:
6bff940 [Michael Armbrust] Revert "[SPARK-3011][SQL] _temporary directory should be filtered out by sqlContext.parquetFile"
stored in Parquet as String columns
This PR adds a new conf flag `spark.sql.parquet.binaryAsString`. When it is `true`, if there is no parquet metadata file available to provide the schema of the data, we will always treat binary fields stored in parquet as string fields. This conf is used to provide a way to read string fields generated without UTF8 decoration.
JIRA: https://issues.apache.org/jira/browse/SPARK-2927
Author: Yin Huai <huai@cse.ohio-state.edu>
Closes #1855 from yhuai/parquetBinaryAsString and squashes the following commits:
689ffa9 [Yin Huai] Add missing "=".
80827de [Yin Huai] Unit test.
1765ca4 [Yin Huai] Use .toBoolean.
9d3f199 [Yin Huai] Merge remote-tracking branch 'upstream/master' into parquetBinaryAsString
5d436a1 [Yin Huai] The initial support of adding a conf to treat binary columns stored in Parquet as string columns.
sqlContext.parquetFile
Author: Chia-Yung Su <chiayung@appier.com>
Closes #1924 from joesu/bugfix-spark3011 and squashes the following commits:
c7e44f2 [Chia-Yung Su] match syntax
f8fc32a [Chia-Yung Su] filter out tmp dir
It seems that the set command is not run by SparkSQLDriver; it runs through the Hive API instead.
As a result, users cannot change the number of reducers by setting spark.sql.shuffle.partitions.
But I think setting Hive properties should simply be handled by Spark SQL as well.
Author: guowei <guowei@upyoo.com>
Closes #1904 from guowei2/temp-branch and squashes the following commits:
7d47dde [guowei] fixed: setting properties like spark.sql.shuffle.partitions does not effective
is enabled
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Closes #1891 from sarutak/SPARK-2970 and squashes the following commits:
4a2d2fe [Kousuke Saruta] Modified comment style
8bd833c [Kousuke Saruta] Modified style
6c0997c [Kousuke Saruta] Modified the timing of shutdown hook execution. It should be executed before shutdown hook of o.a.h.f.FileSystem
Author: Michael Armbrust <michael@databricks.com>
Closes #1863 from marmbrus/parquetPredicates and squashes the following commits:
10ad202 [Michael Armbrust] left <=> right
f249158 [Michael Armbrust] quiet parquet tests.
802da5b [Michael Armbrust] Add test case.
eab2eda [Michael Armbrust] Fix parquet predicate push down bug
column buffer
This is a follow-up to #1880.
Since the number of rows within a single batch is known, we can estimate a much more precise initial buffer size when building an in-memory column buffer.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1901 from liancheng/precise-init-buffer-size and squashes the following commits:
d5501fa [Cheng Lian] More precise initial buffer size estimation for in-memory column buffer
Author: Michael Armbrust <michael@databricks.com>
Closes #1915 from marmbrus/arrayUDF and squashes the following commits:
a1c503d [Michael Armbrust] Support for udfs that take complex types
In the Spark SQL component, the "show create table" syntax had been disabled.
We think it is a useful function for describing a Hive table.
Author: tianyi <tianyi@asiainfo-linkage.com>
Author: tianyi <tianyi@asiainfo.com>
Author: tianyi <tianyi.asiainfo@gmail.com>
Closes #1760 from tianyi/spark-2817 and squashes the following commits:
7d28b15 [tianyi] [SPARK-2817] fix too short prefix problem
cbffe8b [tianyi] [SPARK-2817] fix the case problem
565ec14 [tianyi] [SPARK-2817] fix the case problem
60d48a9 [tianyi] [SPARK-2817] use system temporary folder instead of temporary files in the source tree, and also clean some empty line
dbe1031 [tianyi] [SPARK-2817] move some code out of function rewritePaths, as it may be called multiple times
9b2ba11 [tianyi] [SPARK-2817] fix the line length problem
9f97586 [tianyi] [SPARK-2817] remove test.tmp.dir from pom.xml
bfc2999 [tianyi] [SPARK-2817] add "File.separator" support, create a "testTmpDir" outside the rewritePaths
bde800a [tianyi] [SPARK-2817] add "${system:test.tmp.dir}" support add "last_modified_by" to nonDeterministicLineIndicators in HiveComparisonTest
bb82726 [tianyi] [SPARK-2817] remove test which requires a system from the whitelist.
bbf6b42 [tianyi] [SPARK-2817] add a systemProperties named "test.tmp.dir" to pass the test which contains "${system:test.tmp.dir}"
a337bd6 [tianyi] [SPARK-2817] add "show create table" support
a03db77 [tianyi] [SPARK-2817] add "show create table" support
JIRA issue: [SPARK-3004](https://issues.apache.org/jira/browse/SPARK-3004)
HiveThriftServer2 throws exception when the result set contains `NULL`. Should check `isNullAt` in `SparkSQLOperationManager.getNextRowSet`.
Note that simply using `row.addColumnValue(null)` doesn't work, since Hive sets the column type of a null `ColumnValue` to String by default.
Author: Cheng Lian <lian.cs.zju@gmail.com>
Closes #1920 from liancheng/spark-3004 and squashes the following commits:
1b1db1c [Cheng Lian] Adding NULL column values in the Hive way
2217722 [Cheng Lian] Fixed SPARK-3004: added null checking when retrieving row set
HashOuterJoin
This is a follow-up to #1147. This PR improves performance by about 10% - 15% in my local tests.
```
Before:
LeftOuterJoin: took 16750 ms ([3000000] records)
LeftOuterJoin: took 15179 ms ([3000000] records)
RightOuterJoin: took 15515 ms ([3000000] records)
RightOuterJoin: took 15276 ms ([3000000] records)
FullOuterJoin: took 19150 ms ([6000000] records)
FullOuterJoin: took 18935 ms ([6000000] records)
After:
LeftOuterJoin: took 15218 ms ([3000000] records)
LeftOuterJoin: took 13503 ms ([3000000] records)
RightOuterJoin: took 13663 ms ([3000000] records)
RightOuterJoin: took 14025 ms ([3000000] records)
FullOuterJoin: took 16624 ms ([6000000] records)
FullOuterJoin: took 16578 ms ([6000000] records)
```
Besides the performance improvement, I also did some cleanup as suggested in #1147.
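The stated improvement can be checked against the timings listed above. A small sketch that averages the two runs per join type and computes the relative reduction; the object and method names are hypothetical, the numbers are copied from the Before/After listing.

```scala
object OuterJoinBenchmark {
  // Timings in ms copied from the Before/After listing above.
  val before = Map(
    "LeftOuterJoin"  -> Seq(16750, 15179),
    "RightOuterJoin" -> Seq(15515, 15276),
    "FullOuterJoin"  -> Seq(19150, 18935))
  val after = Map(
    "LeftOuterJoin"  -> Seq(15218, 13503),
    "RightOuterJoin" -> Seq(13663, 14025),
    "FullOuterJoin"  -> Seq(16624, 16578))

  private def mean(xs: Seq[Int]): Double = xs.sum.toDouble / xs.size

  // Relative reduction of the mean run time, in percent.
  def improvementPercent(join: String): Double =
    (1 - mean(after(join)) / mean(before(join))) * 100
}
```

On these numbers the mean run time drops by roughly 10% for the left and right outer joins and roughly 13% for the full outer join.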
Author: Cheng Hao <hao.cheng@intel.com>
Closes #1765 from chenghao-intel/hash_outer_join_fixing and squashes the following commits:
ab1f9e0 [Cheng Hao] Reduce the memory copy while building the hashmap
Author: Michael Armbrust <michael@databricks.com>
Closes #1880 from marmbrus/columnBatches and squashes the following commits:
0649987 [Michael Armbrust] add test
4756fad [Michael Armbrust] fix compilation
2314532 [Michael Armbrust] Build column buffers in smaller batches
The output nullability of `Explode` can be determined by `ArrayType.containsNull` or `MapType.valueContainsNull`.
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1888 from ueshin/issues/SPARK-2968 and squashes the following commits:
d128c95 [Takuya UESHIN] Fix nullability of Explode.
Output attributes from the opposite side of an `OuterJoin` should be nullable.
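Why the opposite side has to be nullable can be seen in a small in-memory left outer join, where `None` stands in for SQL NULL. This is a sketch with hypothetical names, not the `HashOuterJoin` operator.

```scala
object OuterJoinNullability {
  // Left outer join over in-memory rows: every left row is kept, and the right
  // side is None (the stand-in for NULL) when no right row matches the key.
  def leftOuterJoin[K, A, B](left: Seq[(K, A)], right: Seq[(K, B)]): Seq[(K, A, Option[B])] =
    left.flatMap { case (k, a) =>
      val matched: Seq[Option[B]] = right.collect { case (rk, b) if rk == k => Some(b) }
      val padded: Seq[Option[B]] = if (matched.isEmpty) Seq(None) else matched
      padded.map(b => (k, a, b))
    }
}
```

Even if the right input itself never contains nulls, the join output can, so the right-side columns must be typed as nullable (here, `Option[B]`).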
Author: Takuya UESHIN <ueshin@happy-camper.st>
Closes #1887 from ueshin/issues/SPARK-2965 and squashes the following commits:
bcb2d37 [Takuya UESHIN] Fix HashOuterJoin output nullabilities.
I should use `EliminateAnalysisOperators` in `analyze` instead of manually pattern matching.
Author: Yin Huai <huaiyin.thu@gmail.com>
Closes #1881 from yhuai/useEliminateAnalysisOperators and squashes the following commits:
f3e1e7f [Yin Huai] Use EliminateAnalysisOperators.