aboutsummaryrefslogtreecommitdiff
path: root/sql
Commit message (Collapse)AuthorAgeFilesLines
* [SPARK-12610][SQL] Left Anti JoinHerman van Hovell2016-04-0616-108/+231
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ### What changes were proposed in this pull request? This PR adds support for `LEFT ANTI JOIN` to Spark SQL. A `LEFT ANTI JOIN` is the exact opposite of a `LEFT SEMI JOIN` and can be used to identify rows in one dataset that are not in another dataset. Note that `nulls` on the left side of the join cannot match a row on the right hand side of the join; the result is that left anti join will always select a row with a `null` in one or more of its keys. We currently add support for the following SQL join syntax: SELECT * FROM tbl1 A LEFT ANTI JOIN tbl2 B ON A.Id = B.Id Or using a dataframe: tbl1.as("a").join(tbl2.as("b"), $"a.id" === $"b.id", "left_anti) This PR provides serves as the basis for implementing `NOT EXISTS` and `NOT IN (...)` correlated sub-queries. It would also serve as good basis for implementing an more efficient `EXCEPT` operator. The PR has been (losely) based on PR's by both davies (https://github.com/apache/spark/pull/10706) and chenghao-intel (https://github.com/apache/spark/pull/10563); credit should be given where credit is due. This PR adds supports for `LEFT ANTI JOIN` to `BroadcastHashJoin` (including codegeneration), `ShuffledHashJoin` and `BroadcastNestedLoopJoin`. ### How was this patch tested? Added tests to `JoinSuite` and ported `ExistenceJoinSuite` from https://github.com/apache/spark/pull/10563. cc davies chenghao-intel rxin Author: Herman van Hovell <hvanhovell@questtec.nl> Closes #12214 from hvanhovell/SPARK-12610.
* [SPARK-12555][SQL] Result should not be corrupted after input columns are ↵Luciano Resende2016-04-071-0/+19
| | | | | | | | | | reordered This PR add test case described in SPARK-12555 to validate that correct data is returned when input data is reordered and to avoid future regressions. Author: Luciano Resende <lresende@apache.org> Closes #11623 from lresende/SPARK-12555.
* [SPARK-14436][SQL] Make JavaDatasetAggregatorSuiteBase public.Marcelo Vanzin2016-04-062-53/+83
| | | | | | | | | | Without this, unit tests that extend that class fail for me locally on maven, because JUnit tries to run methods in that class and gets an IllegalAccessError. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #12212 from vanzin/SPARK-14436.
* [SPARK-14444][BUILD] Add a new scalastyle `NoScalaDoc` to prevent ↵Dongjoon Hyun2016-04-067-26/+28
| | | | | | | | | | | | | | | | | | | | | ScalaDoc-style multiline comments ## What changes were proposed in this pull request? According to the [Spark Code Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide#SparkCodeStyleGuide-Indentation), this PR adds a new scalastyle rule to prevent the followings. ``` /** In Spark, we don't use the ScalaDoc style so this * is not correct. */ ``` ## How was this patch tested? Pass the Jenkins tests (including `lint-scala`). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12221 from dongjoon-hyun/SPARK-14444.
* [SPARK-14224] [SPARK-14223] [SPARK-14310] [SQL] fix RowEncoder and parquet ↵Davies Liu2016-04-0613-234/+267
| | | | | | | | | | | | | | | | | | | | | reader for wide table ## What changes were proposed in this pull request? 1) fix the RowEncoder for wide table (many columns) by splitting the generate code into multiple functions. 2) Separate DataSourceScan as RowDataSourceScan and BatchedDataSourceScan 3) Disable the returning columnar batch in parquet reader if there are many columns. 4) Added a internal config for maximum number of fields (nested) columns supported by whole stage codegen. Closes #12098 ## How was this patch tested? Add a tests for table with 1000 columns. Author: Davies Liu <davies@databricks.com> Closes #12047 from davies/many_columns.
* [SPARK-14382][SQL] QueryProgress should be post after committedOffsets is ↵Shixiong Zhu2016-04-062-12/+6
| | | | | | | | | | | | | | | | | | updated ## What changes were proposed in this pull request? Make sure QueryProgress is post after committedOffsets is updated. If QueryProgress is post before committedOffsets is updated, the listener may see a wrong sinkStatus (created from committedOffsets). See https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.2/644/testReport/junit/org.apache.spark.sql.util/ContinuousQueryListenerSuite/single_listener/ for an example of the failure. ## How was this patch tested? Existing unit tests. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12155 from zsxwing/SPARK-14382.
* [SPARK-14320][SQL] Make ColumnarBatch.Row mutableSameer Agarwal2016-04-065-8/+135
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? In order to leverage a data structure like `AggregateHashMap` (https://github.com/apache/spark/pull/12055) to speed up aggregates with keys, we need to make `ColumnarBatch.Row` mutable. ## How was this patch tested? Unit test in `ColumnarBatchSuite`. Also, tested via `BenchmarkWholeStageCodegen`. Author: Sameer Agarwal <sameer@databricks.com> Closes #12103 from sameeragarwal/mutable-row.
* [SPARK-14383][SQL] missing "|" in the g4 filebomeng2016-04-062-1/+8
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? A very trivial one. It missed "|" between DISTRIBUTE and UNSET. ## How was this patch tested? I do not think it is really needed. Author: bomeng <bmeng@us.ibm.com> Closes #12156 from bomeng/SPARK-14383.
* [SPARK-14429][SQL] Improve LIKE pattern in "SHOW TABLES / FUNCTIONS LIKE ↵bomeng2016-04-065-24/+46
| | | | | | | | | | | | | | | | | | | | | | <pattern>" DDL LIKE <pattern> is commonly used in SHOW TABLES / FUNCTIONS etc DDL. In the pattern, user can use `|` or `*` as wildcards. 1. Currently, we used `replaceAll()` to replace `*` with `.*`, but the replacement was scattered in several places; I have created an utility method and use it in all the places; 2. Consistency with Hive: the pattern is case insensitive in Hive and white spaces will be trimmed, but current pattern matching does not do that. For example, suppose we have tables (t1, t2, t3), `SHOW TABLES LIKE ' T* ' ` will list all the t-tables. Please use Hive to verify it. 3. Combined with `|`, the result will be sorted. For pattern like `' B*|a* '`, it will list the result in a-b order. I've made some changes to the utility method to make sure we will get the same result as Hive does. A new method was created in StringUtil and test cases were added. andrewor14 Author: bomeng <bmeng@us.ibm.com> Closes #12206 from bomeng/SPARK-14429.
* [SPARK-14426][SQL] Merge PerserUtils and ParseUtilsKousuke Saruta2016-04-064-137/+143
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We have ParserUtils and ParseUtils which are both utility collections for use during the parsing process. Those names and what they are used for is very similar so I think we can merge them. Also, the original unescapeSQLString method may have a fault. When "\u0061" style character literals are passed to the method, it's not unescaped successfully. This patch fix the bug. ## How was this patch tested? Added a new test case. Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp> Closes #12199 from sarutak/merge-ParseUtils-and-ParserUtils.
* [SPARK-14288][SQL] Memory Sink for streamingMichael Armbrust2016-04-065-18/+159
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR exposes the internal testing `MemorySink` though the data source API. This will allow users to easily test streaming applications in the Spark shell or other local tests. Usage: ```scala inputStream.write .format("memory") .queryName("memStream") .startStream() // Now you can query the result of the stream here. sqlContext.table("memStream") ``` The most complicated part of the logic is choosing the checkpoint directory. There are a few requirements we are attempting to satisfy here: - when working in the shell locally, it should just work with no extra configuration. - when working on a cluster you should be able to make it easily create the checkpoint on a distributed file system so you can test aggregation (state checkpoints are also stored in this directory and must be accessible from workers). - it should be clear that you can't resume since the data is just in memory. The chosen algorithm proceeds as follows: - the user gives a checkpoint directory, use it - if the conf has a checkpoint location, use `$location/$queryName` - if neither, create a local directory - always check to make sure there are no offsets written to the directory Author: Michael Armbrust <michael@databricks.com> Closes #12119 from marmbrus/memorySink.
* [SPARK-14396][BUILD][HOT] Fix compilation against Scala 2.10gatorsmile2016-04-061-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? This PR is to fix the compilation errors in Scala 2.10 build, as shown in the link: https://amplab.cs.berkeley.edu/jenkins/job/spark-master-compile-maven-scala-2.10/735/console ``` [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:266: value contains is not a member of Option[String] [error] assert(desc.viewText.contains("SELECT * FROM tab1")) [error] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:267: value contains is not a member of Option[String] [error] assert(desc.viewOriginalText.contains("SELECT * FROM tab1")) [error] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:293: value contains is not a member of Option[String] [error] assert(desc.viewText.contains("SELECT * FROM tab1")) [error] ^ [error] /home/jenkins/workspace/spark-master-compile-maven-scala-2.10/sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveDDLCommandSuite.scala:294: value contains is not a member of Option[String] [error] assert(desc.viewOriginalText.contains("SELECT * FROM tab1")) [error] ^ [error] four errors found [error] Compile failed at Apr 5, 2016 10:59:09 PM [10.502s] ``` #### How was this patch tested? Not sure how to trigger Scala 2.10 compilation in the test environment. Author: gatorsmile <gatorsmile@gmail.com> Closes #12201 from gatorsmile/buildBreak2.10.
* [SPARK-14396][SQL] Throw Exceptions for DDLs of Partitioned Viewsgatorsmile2016-04-055-44/+94
| | | | | | | | | | | | | | | | | | | | | | | | | | | #### What changes were proposed in this pull request? Because the concept of partitioning is associated with physical tables, we disable all the supports of partitioned views, which are defined in the following three commands in [Hive DDL Manual](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Create/Drop/AlterView): ``` ALTER VIEW view DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]; ALTER VIEW view ADD [IF NOT EXISTS] PARTITION spec; CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ] [COMMENT view_comment] [TBLPROPERTIES (property_name = property_value, ...)] AS SELECT ...; ``` An exception is thrown when users issue any of these three DDL commands. #### How was this patch tested? Added test cases for parsing create view and changed the existing test cases to verify if the exceptions are thrown. Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12169 from gatorsmile/viewPartition.
* [SPARK-14128][SQL] Alter table DDL followupAndrew Or2016-04-052-5/+21
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This is just a followup to #12121, which implemented the alter table DDLs using the `SessionCatalog`. Specially, this corrects the behavior of setting the location of a datasource table. For datasource tables, we need to set the `locationUri` in addition to the `path` entry in the serde properties. Additionally, changing the location of a datasource table partition is not allowed. ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12186 from andrewor14/alter-table-ddl-followup.
* [SPARK-14296][SQL] whole stage codegen support for Dataset.mapWenchen Fan2016-04-0611-41/+247
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds a new operator `MapElements` for `Dataset.map`, it's a 1-1 mapping and is easier to adapt to whole stage codegen framework. ## How was this patch tested? new test in `WholeStageCodegenSuite` Author: Wenchen Fan <wenchen@databricks.com> Closes #12087 from cloud-fan/map.
* [SPARK-14359] Unit tests for java 8 lambda syntax with typed aggregatesEric Liang2016-04-051-41/+45
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Adds unit tests for java 8 lambda syntax with typed aggregates as a follow-up to #12168 ## How was this patch tested? Unit tests. Author: Eric Liang <ekl@databricks.com> Closes #12181 from ericl/sc-2794-2.
* [SPARK-529][SQL] Modify SQLConf to use new config API from core.Marcelo Vanzin2016-04-055-495/+420
| | | | | | | | | | | | Because SQL keeps track of all known configs, some customization was needed in SQLConf to allow that, since the core API does not have that feature. Tested via existing (and slightly updated) unit tests. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11570 from vanzin/SPARK-529-sql.
* [SPARK-14411][SQL] Add a note to warn that onQueryProgress is asynchronousShixiong Zhu2016-04-051-2/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? onQueryProgress is asynchronous so the user may see some future status of `ContinuousQuery`. This PR just updated comments to warn it. ## How was this patch tested? Only updated comments. Author: Shixiong Zhu <shixiong@databricks.com> Closes #12180 from zsxwing/ContinuousQueryListener-doc.
* [SPARK-14129][SPARK-14128][SQL] Alter table DDL commandsAndrew Or2016-04-059-300/+562
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? In Spark 2.0, we want to handle the most common `ALTER TABLE` commands ourselves instead of passing the entire query text to Hive. This is done using the new `SessionCatalog` API introduced recently. The commands supported in this patch include: ``` ALTER TABLE ... RENAME TO ... ALTER TABLE ... SET TBLPROPERTIES ... ALTER TABLE ... UNSET TBLPROPERTIES ... ALTER TABLE ... SET LOCATION ... ALTER TABLE ... SET SERDE ... ``` The commands we explicitly do not support are: ``` ALTER TABLE ... CLUSTERED BY ... ALTER TABLE ... SKEWED BY ... ALTER TABLE ... NOT CLUSTERED ALTER TABLE ... NOT SORTED ALTER TABLE ... NOT SKEWED ALTER TABLE ... NOT STORED AS DIRECTORIES ``` For these we throw exceptions complaining that they are not supported. ## How was this patch tested? `DDLSuite` Author: Andrew Or <andrew@databricks.com> Closes #12121 from andrewor14/alter-table-ddl.
* [SPARK-14402][SQL] initcap UDF doesn't match Hive/Oracle behavior in ↵Dongjoon Hyun2016-04-053-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | lowercasing rest of string ## What changes were proposed in this pull request? Current, SparkSQL `initCap` is using `toTitleCase` function. However, `UTF8String.toTitleCase` implementation changes only the first letter and just copy the other letters: e.g. sParK --> SParK. This is the correct implementation `toTitleCase`. ``` hive> select initcap('sParK'); Spark ``` ``` scala> sql("select initcap('sParK')").head res0: org.apache.spark.sql.Row = [SParK] ``` This PR updates the implementation of `initcap` using `toLowerCase` and `toTitleCase`. ## How was this patch tested? Pass the Jenkins tests (including new testcase). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12175 from dongjoon-hyun/SPARK-14402.
* [SPARK-14353] Dataset Time Window `window` API for Python, and SQLBurak Yavuz2016-04-056-15/+155
| | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008). This PR adds the Python, and SQL, API for this function. With this PR, SQL, Java, and Scala will share the same APIs as in users can use: - `window(timeColumn, windowDuration)` - `window(timeColumn, windowDuration, slideDuration)` - `window(timeColumn, windowDuration, slideDuration, startTime)` In Python, users can access all APIs above, but in addition they can do - In Python: `window(timeColumn, windowDuration, startTime=...)` that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows. ## How was this patch tested? Unit tests + manual tests Author: Burak Yavuz <brkyvz@gmail.com> Closes #12136 from brkyvz/python-windows.
* [SPARK-14123][SPARK-14384][SQL] Handle CreateFunction/DropFunctionYin Huai2016-04-0539-513/+1100
| | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR implements CreateFunction and DropFunction commands. Besides implementing these two commands, we also change how to manage functions. Here are the main changes. * `FunctionRegistry` will be a container to store all functions builders and it will not actively load any functions. Because of this change, we do not need to maintain a separate registry for HiveContext. So, `HiveFunctionRegistry` is deleted. * SessionCatalog takes care the job of loading a function if this function is not in the `FunctionRegistry` but its metadata is stored in the external catalog. For this case, SessionCatalog will (1) load the metadata from the external catalog, (2) load all needed resources (i.e. jars and files), (3) create a function builder based on the function definition, (4) register the function builder in the `FunctionRegistry`. * A `UnresolvedGenerator` is created. So, the parser will not need to call `FunctionRegistry` directly during parsing, which is not a good time to create a Hive UDTF. In the analysis phase, we will resolve `UnresolvedGenerator`. This PR is based on viirya's https://github.com/apache/spark/pull/12036/ ## How was this patch tested? Existing tests and new tests. ## TODOs [x] Self-review [x] Cleanup [x] More tests for create/drop functions (we need to more tests for permanent functions). [ ] File JIRAs for all TODOs [x] Standardize the error message when a function does not exist. Author: Yin Huai <yhuai@databricks.com> Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #12117 from yhuai/function.
* [SPARK-14257][SQL] Allow multiple continuous queries to be started from the ↵Shixiong Zhu2016-04-0510-26/+118
| | | | | | | | | | | | | | | | same DataFrame ## What changes were proposed in this pull request? Make StreamingRelation store the closure to create the source in StreamExecution so that we can start multiple continuous queries from the same DataFrame. ## How was this patch tested? `test("DataFrame reuse")` Author: Shixiong Zhu <shixiong@databricks.com> Closes #12049 from zsxwing/df-reuse.
* [SPARK-14345][SQL] Decouple deserializer expression resolution from ↵Wenchen Fan2016-04-055-126/+153
| | | | | | | | | | | | | | | | ObjectOperator ## What changes were proposed in this pull request? This PR decouples deserializer expression resolution from `ObjectOperator`, so that we can use deserializer expression in normal operators. This is needed by #12061 and #12067 , I abstracted the logic out and put them in this PR to reduce code change in the future. ## How was this patch tested? existing tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #12131 from cloud-fan/separate.
* [SPARK-14349][SQL] Issue Error Messages for Unsupported Operators/DML/DDL in ↵gatorsmile2016-04-056-79/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SQL Context. #### What changes were proposed in this pull request? Currently, the weird error messages are issued if we use Hive Context-only operations in SQL Context. For example, - When calling `Drop Table` in SQL Context, we got the following message: ``` Expected exception org.apache.spark.sql.catalyst.parser.ParseException to be thrown, but java.lang.ClassCastException was thrown. ``` - When calling `Script Transform` in SQL Context, we got the message: ``` assertion failed: No plan for ScriptTransformation [key#9,value#10], cat, [tKey#155,tValue#156], null +- LogicalRDD [key#9,value#10], MapPartitionsRDD[3] at beforeAll at BeforeAndAfterAll.scala:187 ``` Updates: Based on the investigation from hvanhovell , the root cause is `visitChildren`, which is the default implementation. It always returns the result of the last defined context child. After merging the code changes from hvanhovell , it works! Thank you hvanhovell ! #### How was this patch tested? A few test cases are added. Not sure if the same issue exist for the other operators/DDL/DML. hvanhovell Author: gatorsmile <gatorsmile@gmail.com> Author: xiaoli <lixiao1983@gmail.com> Author: Herman van Hovell <hvanhovell@questtec.nl> Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local> Closes #12134 from gatorsmile/hiveParserCommand.
* [SPARK-14348][SQL] Support native execution of SHOW TBLPROPERTIES commandDilip Biswal2016-04-057-30/+219
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR adds Native execution of SHOW TBLPROPERTIES command. Command Syntax: ``` SQL SHOW TBLPROPERTIES table_name[(property_key_literal)] ``` ## How was this patch tested? Tests added in HiveComandSuiie and DDLCommandSuite Author: Dilip Biswal <dbiswal@us.ibm.com> Closes #12133 from dilipbiswal/dkb_show_tblproperties.
* [SPARK-14359] Create built-in functions for typed aggregates in JavaEric Liang2016-04-053-0/+124
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This adds the corresponding Java static functions for built-in typed aggregates already exposed in Scala. ## How was this patch tested? Unit tests. rxin Author: Eric Liang <ekl@databricks.com> Closes #12168 from ericl/sc-2794.
* [SPARK-14287] isStreaming method for DatasetBurak Yavuz2016-04-042-0/+33
| | | | | | | | | | | | | | | | | | With the addition of StreamExecution (ContinuousQuery) to Datasets, data will become unbounded. With unbounded data, the execution of some methods and operations will not make sense, e.g. `Dataset.count()`. A simple API is required to check whether the data in a Dataset is bounded or unbounded. This will allow users to check whether their Dataset is in streaming mode or not. ML algorithms may check if the data is unbounded and throw an exception for example. The implementation of this method is simple, however naming it is the challenge. Some possible names for this method are: - isStreaming - isContinuous - isBounded - isUnbounded I've gone with `isStreaming` for now. We can change it before Spark 2.0 if we decide to come up with a different name. For that reason I've marked it as `Experimental` Author: Burak Yavuz <brkyvz@gmail.com> Closes #12080 from brkyvz/is-streaming.
* [SPARK-13579][BUILD] Stop building the main Spark assembly.Marcelo Vanzin2016-04-042-24/+4
| | | | | | | | | | | | | | | | | | | | This change modifies the "assembly/" module to just copy needed dependencies to its build directory, and modifies the packaging script to pick those up (and remove duplicate jars packages in the examples module). I also made some minor adjustments to dependencies to remove some test jars from the final packaging, and remove jars that conflict with each other when packaged separately (e.g. servlet api). Also note that this change restores guava in applications' classpaths, even though it's still shaded inside Spark. This is now needed for the Hadoop libraries that are packaged with Spark, which now are not processed by the shade plugin. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #11796 from vanzin/SPARK-13579.
* [SPARK-14259] [SQL] Merging small files together based on the cost of openingDavies Liu2016-04-043-19/+21
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR basically re-do the things in #12068 but with a different model, which should work better in case of small files with different sizes. ## How was this patch tested? Updated existing tests. Ran a query on thousands of partitioned small files locally, with all default settings (the cost to open a file should be over estimated), the durations of tasks become smaller and smaller, which is good (the last few tasks will be shortest). Author: Davies Liu <davies@databricks.com> Closes #12095 from davies/file_cost.
* [SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrameDavies Liu2016-04-045-12/+61
| | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer). This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds. The JDBC server has been updated to use DataFrame.toIterator. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12114 from davies/local_iterator.
* [SPARK-12981] [SQL] extract Pyhton UDF in physical planDavies Liu2016-04-047-70/+55
| | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Currently we extract Python UDFs into a special logical plan EvaluatePython in analyzer, But EvaluatePython is not part of catalyst, many rules have no knowledge of it , which will break many things (for example, filter push down or column pruning). We should treat Python UDFs as normal expressions, until we want to evaluate in physical plan, we could extract them in end of optimizer, or physical plan. This PR extract Python UDFs in physical plan. Closes #10935 ## How was this patch tested? Added regression tests. Author: Davies Liu <davies@databricks.com> Closes #12127 from davies/py_udf.
* [SPARK-14176][SQL] Add DataFrameWriter.trigger to set the stream batch periodShixiong Zhu2016-04-049-13/+413
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Add a processing time trigger to control the batch processing speed ## How was this patch tested? Unit tests Author: Shixiong Zhu <shixiong@databricks.com> Closes #11976 from zsxwing/trigger.
* [SPARK-14137] [SQL] Cleanup hash joinDavies Liu2016-04-046-401/+268
| | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR did a few cleanup on HashedRelation and HashJoin: 1) Merge HashedRelation and UniqueHashedRelation together 2) Return an iterator from HashedRelation, so we donot need a create many UnsafeRow objects. 3) Return a copy of HashedRelation for thread-safety in BroadcastJoin, so we can re-use the UnafeRow objects. 4) Cleanup HashJoin, share most of the code between BroadcastHashJoin and ShuffleHashJoin 5) Removed UniqueLongHashedRelation, which will be replaced by LongUnsafeMap (another PR). 6) Update benchmark, before this patch, the selectivity of joins are too high. ## How was this patch tested? Existing tests. Author: Davies Liu <davies@databricks.com> Closes #12102 from davies/cleanup_hash.
* [SPARK-14360][SQL] QueryExecution.debug.codegen() to dump codegenReynold Xin2016-04-041-0/+16
| | | | | | | | | | | | ## What changes were proposed in this pull request? We recently added the ability to dump the generated code for a given query. However, the method is only available through an implicit after an import. It'd slightly simplify things if it can be called directly in queryExecution. ## How was this patch tested? Manually tested in spark-shell. Author: Reynold Xin <rxin@databricks.com> Closes #12144 from rxin/SPARK-14360.
* [SPARK-14356] Update spark.sql.execution.debug to work on DatasetsMatei Zaharia2016-04-032-2/+8
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Update DebugQuery to work on Datasets of any type, not just DataFrames. ## How was this patch tested? Added unit tests, checked in spark-shell. Author: Matei Zaharia <matei@databricks.com> Closes #12140 from mateiz/debug-dataset.
* [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static ↵Dongjoon Hyun2016-04-0329-39/+39
| | | | | | | | | | | | | | | | | | | | | analysis results ## What changes were proposed in this pull request? This PR contains the following 5 types of maintenance fix over 59 files (+94 lines, -93 lines). - Fix typos(exception/log strings, testcase name, comments) in 44 lines. - Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011) - Use diamond operators in 40 lines. (New codes after SPARK-13702) - Fix redundant semicolon in 5 lines. - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala. ## How was this patch tested? Manual and pass the Jenkins tests. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12139 from dongjoon-hyun/SPARK-14355.
* [SPARK-14341][SQL] Throw exception on unsupported create / drop macro ddlbomeng2016-04-033-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? We throw an AnalysisException that looks like this: ``` scala> sqlContext.sql("CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x))") org.apache.spark.sql.catalyst.parser.ParseException: Unsupported SQL statement == SQL == CREATE TEMPORARY MACRO SIGMOID (x DOUBLE) 1.0 / (1.0 + EXP(-x)) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.nativeCommand(ParseDriver.scala:66) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:56) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser$$anonfun$parsePlan$1.apply(ParseDriver.scala:53) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:86) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:53) at org.apache.spark.sql.SQLContext.parseSql(SQLContext.scala:198) at org.apache.spark.sql.SQLContext.sql(SQLContext.scala:749) ... 48 elided ``` ## How was this patch tested? Add test cases in HiveQuerySuite.scala Author: bomeng <bmeng@us.ibm.com> Closes #12125 from bomeng/SPARK-14341.
* [SPARK-14350][SQL] EXPLAIN output should be in a single cellDongjoon Hyun2016-04-032-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? EXPLAIN output should be in a single cell. **Before** ``` scala> sql("explain select 1").collect() res0: Array[org.apache.spark.sql.Row] = Array([== Physical Plan ==], [WholeStageCodegen], [: +- Project [1 AS 1#1]], [: +- INPUT], [+- Scan OneRowRelation[]]) ``` **After** ``` scala> sql("explain select 1").collect() res1: Array[org.apache.spark.sql.Row] = Array([== Physical Plan == WholeStageCodegen : +- Project [1 AS 1#4] : +- INPUT +- Scan OneRowRelation[]]) ``` Or, ``` scala> sql("explain select 1").head res1: org.apache.spark.sql.Row = [== Physical Plan == WholeStageCodegen : +- Project [1 AS 1#5] : +- INPUT +- Scan OneRowRelation[]] ``` Please note that `Spark-shell(Scala-shell)` trims long string output. So, you may need to use `println` to get full strings. ``` scala> println(sql("explain codegen select 'a' as a group by 1").head) [Found 2 WholeStageCodegen subtrees. == Subtree 1 / 2 == WholeStageCodegen ... /* 059 */ } /* 060 */ } ] ``` ## How was this patch tested? Pass the Jenkins tests. (Testcases are updated.) Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12137 from dongjoon-hyun/SPARK-14350.
* [SPARK-14231] [SQL] JSON data source infers floating-point values as a ↵hyukjinkwon2016-04-025-12/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | double when they do not fit in a decimal ## What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-14231 Currently, JSON data source supports to infer `DecimalType` for big numbers and `floatAsBigDecimal` option which reads floating-point values as `DecimalType`. But there are few restrictions in Spark `DecimalType` below: 1. The precision cannot be bigger than 38. 2. scale cannot be bigger than precision. Currently, both restrictions are not being handled. This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579). So, the codes below: ```scala def doubleRecords: RDD[String] = sqlContext.sparkContext.parallelize( s"""{"a": 1${"0" * 38}, "b": 0.01}""" :: s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil) val jsonDF = sqlContext.read .option("prefersDecimal", "true") .json(doubleRecords) jsonDF.printSchema() ``` produces below: - **Before** ```scala org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).; at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44) at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144) at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108) at ... ``` - **After** ```scala root |-- a: double (nullable = true) |-- b: double (nullable = true) ``` ## How was this patch tested? Unit tests were used and `./dev/run_tests` for coding style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12030 from HyukjinKwon/SPARK-14231.
* [HOTFIX] Fix Scala 2.10 compilationReynold Xin2016-04-021-2/+2
|
* [SPARK-13996] [SQL] Add more not null attributes for Filter codegenLiang-Chi Hsieh2016-04-021-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-13996 Filter codegen finds the attributes not null by checking IsNotNull(a) expression with a condition if child.output.contains(a). However, the current approach to checking it is not comprehensive. We can improve it. E.g., for this plan: val rdd = sqlContext.sparkContext.makeRDD(Seq(Row(1, "1"), Row(null, "1"), Row(2, "2"))) val schema = new StructType().add("k", IntegerType).add("v", StringType) val smallDF = sqlContext.createDataFrame(rdd, schema) val df = smallDF.filter("isnotnull(k + 1)") The code snippet generated without this patch: /* 031 */ protected void processNext() throws java.io.IOException { /* 032 */ /*** PRODUCE: Filter isnotnull((k#0 + 1)) */ /* 033 */ /* 034 */ /*** PRODUCE: INPUT */ /* 035 */ /* 036 */ while (!shouldStop() && inputadapter_input.hasNext()) { /* 037 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 038 */ /*** CONSUME: Filter isnotnull((k#0 + 1)) */ /* 039 */ /* input[0, int] */ /* 040 */ boolean filter_isNull = inputadapter_row.isNullAt(0); /* 041 */ int filter_value = filter_isNull ? -1 : (inputadapter_row.getInt(0)); /* 042 */ /* 043 */ /* isnotnull((input[0, int] + 1)) */ /* 044 */ /* (input[0, int] + 1) */ /* 045 */ boolean filter_isNull3 = true; /* 046 */ int filter_value3 = -1; /* 047 */ /* 048 */ if (!filter_isNull) { /* 049 */ filter_isNull3 = false; // resultCode could change nullability. /* 050 */ filter_value3 = filter_value + 1; /* 051 */ /* 052 */ } /* 053 */ if (!(!(filter_isNull3))) continue; /* 054 */ /* 055 */ filter_metricValue.add(1); With this patch: /* 031 */ protected void processNext() throws java.io.IOException { /* 032 */ /*** PRODUCE: Filter isnotnull((k#0 + 1)) */ /* 033 */ /* 034 */ /*** PRODUCE: INPUT */ /* 035 */ /* 036 */ while (!shouldStop() && inputadapter_input.hasNext()) { /* 037 */ InternalRow inputadapter_row = (InternalRow) inputadapter_input.next(); /* 038 */ /*** CONSUME: Filter isnotnull((k#0 + 1)) */ /* 039 */ /* input[0, int] */ /* 040 */ boolean filter_isNull = inputadapter_row.isNullAt(0); /* 041 */ int filter_value = filter_isNull ? -1 : (inputadapter_row.getInt(0)); /* 042 */ /* 043 */ if (filter_isNull) continue; /* 044 */ /* 045 */ filter_metricValue.add(1); ## How was this patch tested? Existing tests. Author: Liang-Chi Hsieh <simonh@tw.ibm.com> Closes #11810 from viirya/add-more-not-null-attrs.
* [SPARK-14056] Appends s3 specific configurations and spark.hadoop con…Sital Kedia2016-04-021-2/+2
| | | | | | | | | | | | | | | | ## What changes were proposed in this pull request? Appends s3 specific configurations and spark.hadoop configurations to hive configuration. ## How was this patch tested? Tested by running a job on cluster. …figurations to hive configuration. Author: Sital Kedia <skedia@fb.com> Closes #11876 from sitalkedia/hiveConf.
* [MINOR][DOCS] Use multi-line JavaDoc comments in Scala code.Dongjoon Hyun2016-04-0236-450/+458
| | | | | | | | | | | | | | | ## What changes were proposed in this pull request? This PR aims to fix all Scala-Style multiline comments into Java-Style multiline comments in Scala codes. (All comment-only changes over 77 files: +786 lines, −747 lines) ## How was this patch tested? Manual. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12130 from dongjoon-hyun/use_multiine_javadoc_comments.
* [SPARK-14338][SQL] Improve `SimplifyConditionals` rule to handle `null` in ↵Dongjoon Hyun2016-04-022-8/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | IF/CASEWHEN ## What changes were proposed in this pull request? Currently, `SimplifyConditionals` handles `true` and `false` to optimize branches. This PR improves `SimplifyConditionals` to take advantage of `null` conditions for `if` and `CaseWhen` expressions, too. **Before** ``` scala> sql("SELECT IF(null, 1, 0)").explain() == Physical Plan == WholeStageCodegen : +- Project [if (null) 1 else 0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4] : +- INPUT +- Scan OneRowRelation[] scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain() == Physical Plan == WholeStageCodegen : +- Project [CASE WHEN null THEN 1 ELSE 2 END AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#14] : +- INPUT +- Scan OneRowRelation[] ``` **After** ``` scala> sql("SELECT IF(null, 1, 0)").explain() == Physical Plan == WholeStageCodegen : +- Project [0 AS (IF(CAST(NULL AS BOOLEAN), 1, 0))#4] : +- INPUT +- Scan OneRowRelation[] scala> sql("select case when cast(null as boolean) then 1 else 2 end").explain() == Physical Plan == WholeStageCodegen : +- Project [2 AS CASE WHEN CAST(NULL AS BOOLEAN) THEN 1 ELSE 2 END#4] : +- INPUT +- Scan OneRowRelation[] ``` **Hive** ``` hive> select if(null,1,2); OK 2 hive> select case when cast(null as boolean) then 1 else 2 end; OK 2 ``` ## How was this patch tested? Pass the Jenkins tests (including new extended test cases). Author: Dongjoon Hyun <dongjoon@apache.org> Closes #12122 from dongjoon-hyun/SPARK-14338.
* [HOTFIX] Disable StateStoreSuite.maintenanceReynold Xin2016-04-021-1/+1
|
* [MINOR] Typo fixesJacek Laskowski2016-04-024-10/+10
| | | | | | | | | | | | | | ## What changes were proposed in this pull request? Typo fixes. No functional changes. ## How was this patch tested? Built the sources and ran with samples. Author: Jacek Laskowski <jacek@japila.pl> Closes #11802 from jaceklaskowski/typo-fixes.
* [HOTFIX] Fix compilation break.Reynold Xin2016-04-022-5/+4
|
* [MINOR][SQL] Fix comments styl and correct several styles and nits in CSV ↵hyukjinkwon2016-04-014-49/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | data source ## What changes were proposed in this pull request? While trying to create a PR (which was not an issue at the end), I just corrected some style nits. So, I removed the changes except for some coding style corrections. - According to the [scala-style-guide#documentation-style](https://github.com/databricks/scala-style-guide#documentation-style), Scala style comments are discouraged. >```scala >/** This is a correct one-liner, short description. */ > >/** > * This is correct multi-line JavaDoc comment. And > * this is my second line, and if I keep typing, this would be > * my third line. > */ > >/** In Spark, we don't use the ScalaDoc style so this > * is not correct. > */ >``` - Double newlines between consecutive methods was removed. According to [scala-style-guide#blank-lines-vertical-whitespace](https://github.com/databricks/scala-style-guide#blank-lines-vertical-whitespace), single newline appears when >Between consecutive members (or initializers) of a class: fields, constructors, methods, nested classes, static initializers, instance initializers. - Remove uesless parentheses in tests - Use `mapPartitions` instead of `mapPartitionsWithIndex()`. ## How was this patch tested? Unit tests were used and `dev/run_tests` for style tests. Author: hyukjinkwon <gurwls223@gmail.com> Closes #12109 from HyukjinKwon/SPARK-14271.
* [SPARK-14285][SQL] Implement common type-safe aggregate functionsReynold Xin2016-04-019-111/+342
| | | | | | | | | | | | ## What changes were proposed in this pull request? In the Dataset API, it is fairly difficult for users to perform simple aggregations in a type-safe way at the moment because there are no aggregators that have been implemented. This pull request adds a few common aggregate functions in expressions.scala.typed package, and also creates the expressions.java.typed package without implementation. The java implementation should probably come as a separate pull request. One challenge there is to resolve the type difference between Scala primitive types and Java boxed types. ## How was this patch tested? Added unit tests for them. Author: Reynold Xin <rxin@databricks.com> Closes #12077 from rxin/SPARK-14285.