path: root/sql
* [SPARK-5213] [SQL] Remove the duplicated SparkSQLParser (Cheng Hao, 2015-05-07; 4 files, -23/+11)
  This is a follow-up of #5827 to remove the additional `SparkSQLParser`.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #5965 from chenghao-intel/remove_sparksqlparser and squashes the following commits:
    509a233 [Cheng Hao] Remove the HiveQlQueryExecution
    a5f9e3b [Cheng Hao] Remove the duplicated SparkSQLParser
* [SPARK-7116] [SQL] [PYSPARK] Remove cache() causing memory leak (ksonj, 2015-05-07; 1 file, -4/+3)
  This patch simply removes a `cache()` on an intermediate RDD when evaluating Python UDFs.
  Author: ksonj <kson@siberie.de>
  Closes #5973 from ksonj/udf and squashes the following commits:
    db5b564 [ksonj] removed TODO about cleaning up
    fe70c54 [ksonj] Remove cache() causing memory leak
* [SPARK-1442] [SQL] [FOLLOW-UP] Address minor comments in Window Function PR (#5604) (Yin Huai, 2015-05-07; 3 files, -8/+68)
  Address marmbrus and scwf's comments in #5604.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5945 from yhuai/windowFollowup and squashes the following commits:
    0ef879d [Yin Huai] Add collectFirst to TreeNode.
    2373968 [Yin Huai] wip
    4a16df9 [Yin Huai] Address minor comments for [SPARK-1442].
* [SPARK-7330] [SQL] avoid NPE at jdbc rdd (Daoyuan Wang, 2015-05-07; 2 files, -1/+32)
  Thanks to nadavoosh for pointing this out in #5590.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #5877 from adrian-wang/jdbcrdd and squashes the following commits:
    cc11900 [Daoyuan Wang] avoid NPE in jdbcrdd
* [SPARK-7295][SQL] bitwise operations for DataFrame DSL (Shiti, 2015-05-07; 4 files, -2/+77)
  Author: Shiti <ssaxena.ece@gmail.com>
  Closes #5867 from Shiti/spark-7295 and squashes the following commits:
    71a9913 [Shiti] implementation for bitwise and, or, not and xor on Column with tests and docs
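  A quick sketch of how the new Column operators read, assuming a DataFrame `df` with an integer `flags` column (the column names here are illustrative only):
  ```scala
  import org.apache.spark.sql.functions.bitwiseNOT

  val masked = df.select(
    df("flags").bitwiseAND(4).as("read_bit"),   // keep only bit 2
    df("flags").bitwiseOR(1).as("with_exec"),   // force bit 0 on
    df("flags").bitwiseXOR(255).as("flipped"),  // flip the low byte
    bitwiseNOT(df("flags")).as("negated"))      // bitwise complement
  ```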
* [SPARK-5938] [SPARK-5443] [SQL] Improve JsonRDD performance (Nathan Howell, 2015-05-06; 13 files, -128/+715)
  This patch comprises a few related pieces of work:
  * Schema inference is performed directly on the JSON token stream
  * `String => Row` conversion populates Spark SQL structures without intermediate types
  * Projection pushdown is implemented via CatalystScan for DataFrame queries
  * Support for the legacy parser is retained by setting `spark.sql.json.useJacksonStreamingAPI` to `false`
  Performance improvements depend on the schema and queries being executed, but it should be faster across the board. Below are benchmarks using the last.fm Million Song dataset:
  ```
  Command                                            | Baseline | Patched
  ---------------------------------------------------|----------|--------
  import sqlContext.implicits._                      |          |
  val df = sqlContext.jsonFile("/tmp/lastfm.json")   | 70.0s    | 14.6s
  df.count()                                         | 28.8s    | 6.2s
  df.rdd.count()                                     | 35.3s    | 21.5s
  df.where($"artist" === "Robert Hood").collect()    | 28.3s    | 16.9s
  ```
  To prepare this dataset for benchmarking, follow these steps:
  ```
  # Fetch the datasets from http://labrosa.ee.columbia.edu/millionsong/lastfm
  wget http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_test.zip \
       http://labrosa.ee.columbia.edu/millionsong/sites/default/files/lastfm/lastfm_train.zip
  # Decompress and combine, pipe through `jq -c` to ensure there is one record per line
  unzip -p lastfm_test.zip lastfm_train.zip | jq -c . > lastfm.json
  ```
  Author: Nathan Howell <nhowell@godaddy.com>
  Closes #5801 from NathanHowell/json-performance and squashes the following commits:
    26fea31 [Nathan Howell] Recreate the baseRDD each for each scan operation
    a7ebeb2 [Nathan Howell] Increase coverage of inserts into a JSONRelation
    e06a1dd [Nathan Howell] Add comments to the `useJacksonStreamingAPI` config flag
    6822712 [Nathan Howell] Split up JsonRDD2 into multiple objects
    fa8234f [Nathan Howell] Wrap long lines
    b31917b [Nathan Howell] Rename `useJsonRDD2` to `useJacksonStreamingAPI`
    15c5d1b [Nathan Howell] JSONRelation's baseRDD need not be lazy
    f8add6e [Nathan Howell] Add comments on lack of support for precision and scale DecimalTypes
    fa0be47 [Nathan Howell] Remove unused default case in the field parser
    80dba17 [Nathan Howell] Add comments regarding null handling and empty strings
    842846d [Nathan Howell] Point the empty schema inference test at JsonRDD2
    ab6ee87 [Nathan Howell] Add projection pushdown support to JsonRDD/JsonRDD2
    f636c14 [Nathan Howell] Enable JsonRDD2 by default, add a flag to switch back to JsonRDD
    0bbc445 [Nathan Howell] Improve JSON parsing and type inference performance
    7ca70c1 [Nathan Howell] Eliminate arrow pattern, replace with pattern matches
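  For reference, switching back to the legacy parser is a one-line config change (a sketch; `sqlContext` is an existing SQLContext, and the flag name is quoted from the description above):
  ```scala
  // opt out of the new Jackson streaming path and use the old JsonRDD parser
  sqlContext.setConf("spark.sql.json.useJacksonStreamingAPI", "false")
  val df = sqlContext.jsonFile("/tmp/lastfm.json")
  ```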
* [HOT-FIX] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir. (Yin Huai, 2015-05-06; 1 file, -0/+0)
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5951 from yhuai/fixBuildMaven and squashes the following commits:
    fdde183 [Yin Huai] Move HiveWindowFunctionQuerySuite.scala to hive compatibility dir.
* [SPARK-7311] Introduce internal Serializer API for determining if serializers support object relocation (Josh Rosen, 2015-05-06; 1 file, -0/+5)
  This patch extends the `Serializer` interface with a new `Private` API which allows serializers to indicate whether they support relocation of serialized objects in serializer stream output. This relocatability property is described in more detail in `Serializer.scala`, but in a nutshell a serializer supports relocation if reordering the bytes of serialized objects in serialization stream output is equivalent to having re-ordered those elements prior to serializing them. The optimized shuffle paths introduced in #4450 and #5868 both rely on serializers having this property; this patch just centralizes the logic for determining whether a serializer has it. I also added tests and comments clarifying when this works for KryoSerializer. This change allows the optimizations in #4450 to be applied for shuffles that use `SqlSerializer2`.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #5924 from JoshRosen/SPARK-7311 and squashes the following commits:
    50a68ca [Josh Rosen] Address minor nits
    0a7ebd7 [Josh Rosen] Clarify reason why SqlSerializer2 supports this serializer
    123b992 [Josh Rosen] Cleanup for submitting as standalone patch.
    4aa61b2 [Josh Rosen] Add missing newline
    2c1233a [Josh Rosen] Small refactoring of SerializerPropertiesSuite to enable test re-use:
    0ba75e6 [Josh Rosen] Add tests for serializer relocation property.
    450fa21 [Josh Rosen] Back out accidental log4j.properties change
    86d4dcd [Josh Rosen] Flag that SparkSqlSerializer2 supports relocation
    b9624ee [Josh Rosen] Expand serializer API and use new function to help control when new UnsafeShuffle path is used.
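  A rough illustration of the relocation property itself (a sketch, not the internal API being added): serializing values through one stream should produce the same bytes as concatenating their individually serialized forms.
  ```scala
  import java.io.ByteArrayOutputStream
  import org.apache.spark.SparkConf
  import org.apache.spark.serializer.KryoSerializer

  val ser = new KryoSerializer(new SparkConf()).newInstance()

  // serialize two values independently and concatenate the raw bytes
  val concatenated = ser.serialize("a").array() ++ ser.serialize("b").array()

  // serialize the same values through a single stream
  val bos = new ByteArrayOutputStream()
  val stream = ser.serializeStream(bos)
  stream.writeObject("a").writeObject("b")
  stream.close()

  // if these agree, whole serialized objects can safely be reordered within
  // the stream output, which is what "supports relocation" means here
  assert(java.util.Arrays.equals(concatenated, bos.toByteArray))
  ```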
* [SPARK-1442] [SQL] Window Function Support for Spark SQL (Yin Huai, 2015-05-06; 101 files, -34/+34768)
  Adding more information about the implementation... This PR adds support for window functions to Spark SQL (specifically the OVER and WINDOW clauses). For every expression with an OVER clause, we use a WindowExpression as the container of a WindowFunction and the corresponding WindowSpecDefinition (the definition of a window frame, i.e. the partition specification, order specification, and frame specification appearing in an OVER clause).

  # Implementation #
  The high-level workflow of the implementation is as follows.
  * Query parsing: During parsing, all WindowExpressions are originally placed in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. This keeps our changes simple and keeps all parsing rules for window functions in a single place (nodesToWindowSpecification). For the WINDOW clause in a query, we use a WithWindowDefinition as the container of the mapping from the name of a window specification to a WindowSpecDefinition. This change is similar to our common table expression support.
  * Analysis: The query analysis process has three steps for window functions.
    * Resolve all WindowSpecReferences by replacing them with the corresponding WindowSpecDefinitions according to the mapping table stored in the WithWindowDefinition node.
    * Resolve WindowFunctions in the projectList of a Project operator or the aggregateExpressions of an Aggregate operator. For this PR, we use Hive's functions for window functions because we will have a major refactoring of our internal UDAFs, and it is better to switch our UDAFs after that refactoring work.
    * Once we have resolved all WindowFunctions, we use ResolveWindowFunction to extract WindowExpressions from projectList and aggregateExpressions and then create a Window operator for every distinct WindowSpecDefinition. With this choice, at execution time we can rely on the Exchange operator to do all of the work of reorganizing the table, and we do not need to worry about it in the physical Window operator. An example analyzed plan is shown below.
  ```
  sql("""
  SELECT
    year, country, product, sales,
    avg(sales) over(partition by product) avg_product,
    sum(sales) over(partition by country) sum_country
  FROM sales
  ORDER BY year, country, product
  """).explain(true)

  == Analyzed Logical Plan ==
  Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
   Project [year#34,country#35,product#36,sales#37,avg_product#27,sum_country#28]
    Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
      Project [year#34,country#35,product#36,sales#37]
       MetastoreRelation default, sales, None
  ```
  * Query planning: During query planning, we simply generate the physical Window operator based on the logical Window operator. Then, to prepare the executedPlan, the EnsureRequirements rule adds Exchange and Sort operators if necessary. The EnsureRequirements rule analyzes the data properties and tries not to add unnecessary shuffles and sorts. The physical plan for the above example query is shown below.
  ```
  == Physical Plan ==
  Sort [year#34 ASC,country#35 ASC,product#36 ASC], true
   Exchange (RangePartitioning [year#34 ASC,country#35 ASC,product#36 ASC], 200), []
    Window [year#34,country#35,product#36,sales#37,avg_product#27], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFSum(sales#37) WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS sum_country#28], WindowSpecDefinition [country#35], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
     Exchange (HashPartitioning [country#35], 200), [country#35 ASC]
      Window [year#34,country#35,product#36,sales#37], [HiveWindowFunction#org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage(sales#37) WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS avg_product#27], WindowSpecDefinition [product#36], [], ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
       Exchange (HashPartitioning [product#36], 200), [product#36 ASC]
        HiveTableScan [year#34,country#35,product#36,sales#37], (MetastoreRelation default, sales, None), None
  ```
  * Execution time: At execution time, a physical Window operator buffers all rows in a partition specified in the partition spec of an OVER clause. If necessary, it also maintains a sliding window frame. The current implementation tries to buffer the input parameters of a window function according to the window frame to avoid evaluating a row multiple times.

  # Future work #
  Here are three improvements that are not hard to add:
  * Taking advantage of the window frame specification to reduce the number of rows buffered in the physical Window operator. In some cases, we only need to buffer the rows appearing in the sliding window; in other cases, we will not be able to reduce the number of rows buffered (e.g. ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING).
  * When a RANGE frame is used, for <value> PRECEDING and <value> FOLLOWING, it would be great if the <value> part could be an expression (we can start with Literal). Then, when the data type of the ORDER BY expression is a FractionalType, we can support FractionalType as the type of <value> (<value> still needs to evaluate to a positive value).
  * When a RANGE frame is used, we need to support DateType and TimestampType as the data type of the expression appearing in the order specification. Then, the <value> part of <value> PRECEDING and <value> FOLLOWING can support interval types (once we support them).

  This is joint work with guowei2 and yhuai. Thanks hbutani and hvanhovell for their comments. Thanks scwf for his comments and unit tests.
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5604 from guowei2/windowImplement and squashes the following commits:
    76fe1c8 [Yin Huai] Implementation.
    aa2b0ae [Yin Huai] Tests.
* [SPARK-6201] [SQL] promote string and do widen types for IN (Daoyuan Wang, 2015-05-06; 3 files, -2/+22)
  huangjs: Actually, Spark SQL first goes through the analysis phase, in which we widen types and promote strings, and then optimization, where a constant IN is converted into an INSET. So it turns out that we only need to fix this for IN.
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #4945 from adrian-wang/inset and squashes the following commits:
    71e05cc [Daoyuan Wang] minor fix
    581fa1c [Daoyuan Wang] mysql way
    f3f7baf [Daoyuan Wang] address comments
    5eed4bc [Daoyuan Wang] promote string and do widen types for IN
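  For illustration, the kind of query this fixes (assuming a table `t` with an integer `key` column; the mixed numeric and string values in the IN list are widened to a common type during analysis):
  ```scala
  // '3' is promoted so the whole IN list shares one type before the
  // optimizer rewrites the constant IN into an INSET
  sqlContext.sql("SELECT * FROM t WHERE key IN (1, 2, '3')")
  ```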
* [SPARK-5456] [SQL] fix decimal compare for jdbc rdd (Daoyuan Wang, 2015-05-06; 2 files, -2/+11)
  Author: Daoyuan Wang <daoyuan.wang@intel.com>
  Closes #5803 from adrian-wang/decimalcompare and squashes the following commits:
    aef0e96 [Daoyuan Wang] add null handle
    ec455b9 [Daoyuan Wang] fix decimal compare for jdbc rdd
* [SQL] JavaDoc update for various DataFrame functions. (Reynold Xin, 2015-05-06; 4 files, -21/+32)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5935 from rxin/df-doc1 and squashes the following commits:
    aaeaadb [Reynold Xin] [SQL] JavaDoc update for various DataFrame functions.
* [SPARK-7358][SQL] Move DataFrame math functions into functions (Burak Yavuz, 2015-05-05; 5 files, -390/+490)
  After a discussion on the user mailing list, it was decided to put all UDFs under `o.a.s.sql.functions`. cc rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5923 from brkyvz/move-math-funcs and squashes the following commits:
    a8dc3f7 [Burak Yavuz] address comments
    cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
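  After the move, the math functions import like everything else in the DSL; a minimal sketch (column name assumed):
  ```scala
  import org.apache.spark.sql.functions._

  // sqrt, pow, etc. now live in o.a.s.sql.functions with the other UDFs
  val result = df.select(sqrt(df("x")), pow(df("x"), 2))
  ```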
* [SPARK-6231][SQL/DF] Automatically resolve join condition ambiguity for self-joins. (Reynold Xin, 2015-05-05; 6 files, -43/+170)
  See the comment in the join function for more information.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5919 from rxin/self-join-resolve and squashes the following commits:
    e2fb0da [Reynold Xin] Updated SQLConf comment.
    7233a86 [Reynold Xin] Updated comment.
    6be2b4d [Reynold Xin] Removed println
    9f6b72f [Reynold Xin] [SPARK-6231][SQL/DF] Automatically resolve ambiguity in join condition for self-joins.
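  A minimal sketch of the pattern this addresses (assuming a DataFrame `df` with a `key` column):
  ```scala
  // both sides of the condition reference the same original attribute;
  // the analyzer now rewrites one side to bind to its own copy of the plan
  // instead of producing a trivially-true comparison
  val joined = df.join(df, df("key") === df("key"))
  ```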
* [SQL][Minor] make StringComparison extend ExpectsInputTypes (wangfei, 2015-05-05; 1 file, -7/+6)
  Make StringComparison extend ExpectsInputTypes and add expectedChildTypes, so that each subclass no longer needs to override expectedChildTypes.
  Author: wangfei <wangfei1@huawei.com>
  Closes #5905 from scwf/ExpectsInputTypes and squashes the following commits:
    b374ddf [wangfei] make stringcomparison extends ExpectsInputTypes
* [SPARK-7294][SQL] ADD BETWEEN (云峤, 2015-05-05; 2 files, -0/+23)
  Author: 云峤 <chensong.cs@alibaba-inc.com>
  Author: kaka1992 <kaka_1992@163.com>
  Closes #5839 from kaka1992/master and squashes the following commits:
    b15360d [kaka1992] Fix python unit test in sql/test. =_= I forget to commit this file last time.
    f928816 [kaka1992] Fix python style in sql/test.
    d2e7f72 [kaka1992] Fix python style in sql/test.
    c54d904 [kaka1992] Fix empty map bug.
    7e64d1e [云峤] Update
    7b9b858 [云峤] undo
    f080f8d [云峤] update pep8
    76f0c51 [云峤] Merge remote-tracking branch 'remotes/upstream/master'
    7d62368 [云峤] [SPARK-7294] ADD BETWEEN
    baf839b [云峤] [SPARK-7294] ADD BETWEEN
    d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
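  A small sketch of the new operator (assuming a DataFrame `people` with an integer `age` column):
  ```scala
  // keeps rows where 18 <= age <= 65, inclusive on both bounds
  val adults = people.filter(people("age").between(18, 65))
  ```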
* [SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames (Burak Yavuz, 2015-05-05; 2 files, -7/+8)
  Reduced take size from 1e8 to 1e6. cc rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5900 from brkyvz/df-cont-followup and squashes the following commits:
    c11e762 [Burak Yavuz] fix grammar
    b30ace2 [Burak Yavuz] address comments
    a417ba5 [Burak Yavuz] [SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames
* [MINOR] Minor update for document (Liang-Chi Hsieh, 2015-05-05; 1 file, -1/+1)
  Two minor doc errors in `BytesToBytesMap` and `UnsafeRow`.
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #5906 from viirya/minor_doc and squashes the following commits:
    27f9089 [Liang-Chi Hsieh] Minor update for doc.
* [SPARK-7266] Add ExpectsInputTypes to expressions when possible. (Reynold Xin, 2015-05-04; 5 files, -56/+71)
  This should give us better analysis-time error messages (rather than runtime ones) and automatic type casting.
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5796 from rxin/expected-input-types and squashes the following commits:
    c900760 [Reynold Xin] [SPARK-7266] Add ExpectsInputTypes to expressions when possible.
* [SPARK-7243][SQL] Contingency Tables for DataFrames (Burak Yavuz, 2015-05-04; 4 files, -31/+126)
  Computes a pair-wise frequency table of the given columns, also known as cross-tabulation. cc mengxr rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5842 from brkyvz/df-cont and squashes the following commits:
    a07c01e [Burak Yavuz] addressed comments v4.1
    ae9e01d [Burak Yavuz] fix test
    9106585 [Burak Yavuz] addressed comments v4.0
    bced829 [Burak Yavuz] fix merge conflicts
    a63ad00 [Burak Yavuz] addressed comments v3.0
    a0cad97 [Burak Yavuz] addressed comments v3.0
    6805df8 [Burak Yavuz] addressed comments and fixed test
    939b7c4 [Burak Yavuz] lint python
    7f098bc [Burak Yavuz] add crosstab pyTest
    fd53b00 [Burak Yavuz] added python support for crosstab
    27a5a81 [Burak Yavuz] implemented crosstab
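  Usage is a one-liner on the `stat` handle; a sketch with assumed column names:
  ```scala
  // pairwise frequency table (cross-tabulation) of two categorical columns
  val table = df.stat.crosstab("browser", "country")
  table.show()
  ```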
* [SPARK-7319][SQL] Improve the output from DataFrame.show() (云峤, 2015-05-04; 2 files, -6/+41)
  Author: 云峤 <chensong.cs@alibaba-inc.com>
  Closes #5865 from kaka1992/df.show and squashes the following commits:
    c79204b [云峤] Update
    a1338f6 [云峤] Update python dataFrame show test and add empty df unit test.
    734369c [云峤] Update python dataFrame show test and add empty df unit test.
    84aec3e [云峤] Update python dataFrame show test and add empty df unit test.
    159b3d5 [云峤] update
    03ef434 [云峤] update
    7394fd5 [云峤] update test show
    ced487a [云峤] update pep8
    b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show
    30ac311 [云峤] [SPARK-7294] ADD BETWEEN
    7d62368 [云峤] [SPARK-7294] ADD BETWEEN
    baf839b [云峤] [SPARK-7294] ADD BETWEEN
    d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
* [SPARK-5100] [SQL] add webui for thriftserver (tianyi, 2015-05-04; 10 files, -22/+751)
  This PR is a rebased version of #3946, mainly focused on creating an independent tab for the thrift server in the Spark web UI. Features:
  1. Session-related statistics (username and IP are only supported in hive-0.13.1)
  2. List all the SQL executing or executed on this server
  3. Provide links to the jobs generated by SQL
  4. Provide a link to show all SQL executing or executed in a specified session
  Prototype snapshot of the main page for the thrift server:
  ![image](https://cloud.githubusercontent.com/assets/1411869/7361379/df7dcc64-ed89-11e4-9964-4df0b32f475e.png)
  Author: tianyi <tianyi.asiainfo@gmail.com>
  Closes #5730 from tianyi/SPARK-5100 and squashes the following commits:
    cfd14c7 [tianyi] style fix
    0efe3d5 [tianyi] revert part of pom change
    c0f2fa0 [tianyi] extends HiveThriftJdbcTest to start/stop thriftserver for UI test
    aa20408 [tianyi] fix style problem
    c9df6f9 [tianyi] add testsuite for thriftserver ui and fix some style issue
    9830199 [tianyi] add webui for thriftserver
* [SPARK-7241] Pearson correlation for DataFrames (Burak Yavuz, 2015-05-03; 4 files, -26/+98)
  Submitting this PR from a phone, excuse the brevity. Adds Pearson correlation to DataFrames, reusing the covariance calculation code. cc mengxr rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5858 from brkyvz/df-corr and squashes the following commits:
    285b838 [Burak Yavuz] addressed comments v2.0
    d10babb [Burak Yavuz] addressed comments v0.2
    4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
    4fe693b [Burak Yavuz] addressed comments v0.1
    a682d06 [Burak Yavuz] ready for PR
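  A sketch of the resulting API (column names assumed):
  ```scala
  // Pearson correlation coefficient between two numeric columns
  val r: Double = df.stat.corr("height", "weight")
  ```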
* [SPARK-6907] [SQL] Isolated client for HiveMetastore (Michael Armbrust, 2015-05-03; 8 files, -6/+1087)
  This PR adds initial support for loading multiple versions of Hive in a single JVM and provides a common interface for extracting metadata from the `HiveMetastoreClient` for a given version. This is accomplished by creating an isolated `ClassLoader` that operates according to the following rules:
  - __Shared Classes__: Java, Scala, logging, and Spark classes are delegated to `baseClassLoader`, allowing the results of calls to the `ClientInterface` to be visible externally.
  - __Hive Classes__: new instances are loaded from `execJars`. These classes are not accessible externally due to their custom loading.
  - __Barrier Classes__: Classes such as `ClientWrapper` are defined in Spark but must link to a specific version of Hive. As a result, the bytecode is acquired from the Spark `ClassLoader` but a new copy is created for each instance of `IsolatedClientLoader`. This new instance is able to see a specific version of Hive without using reflection wherever Hive is consistent across versions. Since this is a unique instance, it is not visible externally other than as a generic `ClientInterface`, unless `isolationOn` is set to `false`.
  In addition to the unit tests, I have also tested this locally against MySQL instances of the Hive Metastore. I've also successfully ported Spark SQL to run with this client, but due to the size of the changes, that will come in a follow-up PR. By default, Hive jars are currently downloaded from Maven automatically for a given version to ease packaging and testing. However, there is also support for specifying their location manually for deployments without internet.
  Author: Michael Armbrust <michael@databricks.com>
  Closes #5851 from marmbrus/isolatedClient and squashes the following commits:
    c72f6ac [Michael Armbrust] rxins comments
    1e271fa [Michael Armbrust] [SPARK-6907][SQL] Isolated client for HiveMetastore
* [SPARK-5213] [SQL] Pluggable SQL Parser Support (Cheng Hao, 2015-05-02; 9 files, -39/+196)
  Based on #4015: we should not delete `sqlParser` from SQLContext, as that leads to MiMa failures. Users implement a dialect to provide a fallback for `sqlParser`, and we should construct `sqlParser` in SQLContext according to the dialect: `protected[sql] val sqlParser = new SparkSQLParser(getSQLDialect().parse(_))`
  Author: Cheng Hao <hao.cheng@intel.com>
  Author: scwf <wangfei1@huawei.com>
  Closes #5827 from scwf/sqlparser1 and squashes the following commits:
    81b9737 [scwf] comment fix
    0878bd1 [scwf] remove comments
    c19780b [scwf] fix mima tests
    c2895cf [scwf] Merge branch 'master' of https://github.com/apache/spark into sqlparser1
    493775c [Cheng Hao] update the code as feedback
    81a731f [Cheng Hao] remove the unecessary comment
    aab0b0b [Cheng Hao] polish the code a little bit
    49b9d81 [Cheng Hao] shrink the comment for rebasing
* [MINOR] [HIVE] Fix QueryPartitionSuite. (Marcelo Vanzin, 2015-05-02; 1 file, -3/+4)
  At least in the version of Hive I tested on, the test was deleting a temp directory generated by Hive instead of one containing partition data. So fix the filter to only consider partition directories when deciding what to delete.
  Author: Marcelo Vanzin <vanzin@cloudera.com>
  Closes #5854 from vanzin/hive-test-fix and squashes the following commits:
    7594ae9 [Marcelo Vanzin] Fix typo.
    729fa80 [Marcelo Vanzin] [minor] [hive] Fix QueryPartitionSuite.
* [SPARK-7242] added python api for freqItems in DataFrames (Burak Yavuz, 2015-05-01; 1 file, -3/+6)
  The Python API for DataFrames' freqItems, plus addressing your comments from the previous PR. rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5859 from brkyvz/df-freq-py2 and squashes the following commits:
    f9aa9ce [Burak Yavuz] addressed comments v0.1
    4b25056 [Burak Yavuz] added python api for freqItems
* [SPARK-6999] [SQL] Remove the infinite recursive method (useless) (Cheng Hao, 2015-05-01; 1 file, -14/+0)
  Remove the method, since it causes infinite recursive calls; it also seems to be a dummy method, since we already have the API `def createDataFrame(rowRDD: JavaRDD[Row], schema: StructType): DataFrame`.
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #5804 from chenghao-intel/spark_6999 and squashes the following commits:
    63220a8 [Cheng Hao] remove the infinite recursive method (useless)
* [SPARK-7312][SQL] SPARK-6913 broke jdk6 build (Yin Huai, 2015-05-01; 1 file, -2/+4)
  JIRA: https://issues.apache.org/jira/browse/SPARK-7312
  Author: Yin Huai <yhuai@databricks.com>
  Closes #5847 from yhuai/jdbcJava6 and squashes the following commits:
    68433a2 [Yin Huai] compile with Java 6
* [SPARK-7240][SQL] Single pass covariance calculation for dataframes (Burak Yavuz, 2015-05-01; 4 files, -3/+114)
  Added the calculation of covariance between two columns to DataFrames. cc mengxr rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5825 from brkyvz/df-cov and squashes the following commits:
    cb18046 [Burak Yavuz] changed to sample covariance
    f2e862b [Burak Yavuz] fixed failed test
    51e39b8 [Burak Yavuz] moved implementation
    0c6a759 [Burak Yavuz] addressed math comments
    8456eca [Burak Yavuz] fix pyStyle3
    aa2ad29 [Burak Yavuz] fix pyStyle2
    4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov
    e3b0b85 [Burak Yavuz] addressed comments v0.1
    a7115f1 [Burak Yavuz] fix python style
    7dc6dbc [Burak Yavuz] reorder imports
    408cb77 [Burak Yavuz] initial commit
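  A sketch of the resulting API (column names assumed; per the squashed commits, this computes the sample covariance):
  ```scala
  // sample covariance between two numeric columns, computed in one pass
  val c: Double = df.stat.cov("price", "quantity")
  ```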
* [SPARK-7274] [SQL] Create Column expression for array/struct creation. (Reynold Xin, 2015-05-01; 3 files, -2/+133)
  Author: Reynold Xin <rxin@databricks.com>
  Closes #5802 from rxin/SPARK-7274 and squashes the following commits:
    19aecaa [Reynold Xin] Fixed unicode tests.
    bfc1538 [Reynold Xin] Export all Python functions.
    2517b8c [Reynold Xin] Code review.
    23da335 [Reynold Xin] Fixed Python bug.
    132002e [Reynold Xin] Fixed tests.
    56fce26 [Reynold Xin] Added Python support.
    b0d591a [Reynold Xin] Fixed debug error.
    86926a6 [Reynold Xin] Added test suite.
    7dbb9ab [Reynold Xin] Ok one more.
    470e2f5 [Reynold Xin] One more MLlib ...
    e2d14f0 [Reynold Xin] [SPARK-7274][SQL] Create Column expression for array/struct creation.
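  A short sketch of the new expressions (column names assumed):
  ```scala
  import org.apache.spark.sql.functions.{array, struct}

  // pack columns into an array column and a nested struct column
  val packed = df.select(
    array(df("a"), df("b")).as("arr"),
    struct(df("a"), df("b")).as("record"))
  ```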
* [SPARK-4550] In sort-based shuffle, store map outputs in serialized form (Sandy Ryza, 2015-04-30; 1 file, -9/+29)
  Refer to the JIRA for the design doc and some perf results. I wanted to call out some of the possibly more controversial changes up front:
  * Map outputs are only stored in serialized form when Kryo is in use. I'm still unsure whether Java-serialized objects can be relocated. At the very least, Java serialization writes out a stream header which causes problems with the current approach, so I decided to leave investigating this to future work.
  * The shuffle now explicitly operates on key-value pairs instead of any object. Data is written to shuffle files in alternating keys and values instead of key-value tuples. `BlockObjectWriter.write` now accepts a key argument and a value argument instead of any object.
  * The map output buffer can hold a max of Integer.MAX_VALUE bytes, though this wouldn't be terribly difficult to change.
  * When spilling occurs, the objects that are still in memory at merge time end up serialized and deserialized an extra time.
  Author: Sandy Ryza <sandy@cloudera.com>
  Closes #4450 from sryza/sandy-spark-4550 and squashes the following commits:
    8c70dd9 [Sandy Ryza] Fix serialization
    9c16fe6 [Sandy Ryza] Fix a couple tests and move getAutoReset to KryoSerializerInstance
    6c54e06 [Sandy Ryza] Fix scalastyle
    d8462d8 [Sandy Ryza] SPARK-4550
* [SPARK-7248] implemented random number generators for DataFrames (Burak Yavuz, 2015-04-30; 8 files, -45/+115)
  Adds the functions `rand` (uniform distribution) and `randn` (normal distribution) as expressions to DataFrames. cc mengxr rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5819 from brkyvz/df-rng and squashes the following commits:
    50d69d4 [Burak Yavuz] add seed for test that failed
    4234c3a [Burak Yavuz] fix Rand expression
    13cad5c [Burak Yavuz] couple fixes
    7d53953 [Burak Yavuz] waiting for hive tests
    b453716 [Burak Yavuz] move radn with seed down
    03637f0 [Burak Yavuz] fix broken hive func
    c5909eb [Burak Yavuz] deleted old implementation of Rand
    6d43895 [Burak Yavuz] implemented random generators
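  A sketch of the new generators (the seed argument keeps runs repeatable; column names assumed):
  ```scala
  import org.apache.spark.sql.functions.{rand, randn}

  // append a uniform [0, 1) column and a standard-normal column to df
  val withNoise = df.withColumn("u", rand(42L)).withColumn("n", randn(42L))
  ```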
* Revert "[SPARK-5213] [SQL] Pluggable SQL Parser Support" (Patrick Wendell, 2015-04-30; 9 files, -199/+42)
  This reverts commit 3ba5aaab8266822545ac82b9e733fd25cc215a77.
* [SPARK-7123] [SQL] support table.star in sqlcontext (scwf, 2015-04-30; 2 files, -0/+11)
  Running the following SQL used to raise an error: `SELECT r.* FROM testData l join testData2 r on (l.key = r.a)`
  Author: scwf <wangfei1@huawei.com>
  Closes #5690 from scwf/tablestar and squashes the following commits:
    3b2e2b6 [scwf] support table.star
* [SPARK-5213] [SQL] Pluggable SQL Parser Support (Cheng Hao, 2015-04-30; 9 files, -42/+199)
  This PR aims to make the SQL parser pluggable, so users can register their own parsers via the Spark SQL CLI.
  ```
  # add the jar into the classpath
  $hchengmydesktop:spark>bin/spark-sql --jars sql99.jar

  -- switch to "hiveql" dialect
  spark-sql>SET spark.sql.dialect=hiveql;
  spark-sql>SELECT * FROM src LIMIT 1;

  -- switch to "sql" dialect
  spark-sql>SET spark.sql.dialect=sql;
  spark-sql>SELECT * FROM src LIMIT 1;

  -- switch to a custom dialect
  spark-sql>SET spark.sql.dialect=com.xxx.xxx.SQL99Dialect;
  spark-sql>SELECT * FROM src LIMIT 1;

  -- register a non-existent SQL dialect
  spark-sql> SET spark.sql.dialect=NotExistedClass;
  spark-sql> SELECT * FROM src LIMIT 1;
  -- an exception will be thrown, and the dialect switches back to the default ("sql" for SQLContext and "hiveql" for HiveContext)
  ```
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #4015 from chenghao-intel/sqlparser and squashes the following commits:
    493775c [Cheng Hao] update the code as feedback
    81a731f [Cheng Hao] remove the unecessary comment
    aab0b0b [Cheng Hao] polish the code a little bit
    49b9d81 [Cheng Hao] shrink the comment for rebasing
* [SPARK-6913][SQL] Fixed "java.sql.SQLException: No suitable driver found" (Vyacheslav Baranov, 2015-04-30; 3 files, -4/+62)
  Fixed `java.sql.SQLException: No suitable driver found` when loading a DataFrame into Spark SQL if the driver is supplied with the `--jars` argument. The problem is in the `java.sql.DriverManager` class, which can't access drivers loaded by the Spark ClassLoader, so wrappers that forward requests are created for these drivers. Also, it is no longer necessary to include JDBC drivers in `--driver-class-path` in local mode; specifying them in the `--jars` argument is sufficient.
  Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
  Closes #5782 from SlavikBaranov/SPARK-6913 and squashes the following commits:
    510c43f [Vyacheslav Baranov] [SPARK-6913] Fixed review comments
    b2a727c [Vyacheslav Baranov] [SPARK-6913] Fixed thread race on driver registration
    c8294ae [Vyacheslav Baranov] [SPARK-6913] Fixed "No suitable driver found" when using using JDBC driver added with SparkContext.addJar
* [SPARK-7109] [SQL] Push down left side filter for left semi join (wangfei, 2015-04-30; 2 files, -5/+24)
  Currently the Spark SQL optimizer only pushes down the right-side filter for a left semi join. We can push down the left-side filter as well, because a left semi join is essentially a filter on the left table.
  Author: wangfei <wangfei1@huawei.com>
  Author: scwf <wangfei1@huawei.com>
  Closes #5677 from scwf/leftsemi and squashes the following commits:
    483d205 [wangfei] update with master to fix compile issue
    82df0e1 [wangfei] Merge branch 'master' of https://github.com/apache/spark into leftsemi
    d68a053 [wangfei] added apply
    8f48a3d [scwf] added test
    ebadaa9 [wangfei] left filter push down for left semi join
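  For illustration, a query shape that benefits (table and column names assumed); the WHERE predicate touches only the left table, so it can now be evaluated below the join:
  ```scala
  sqlContext.sql(
    "SELECT * FROM t1 LEFT SEMI JOIN t2 ON t1.key = t2.key WHERE t1.value > 10")
  ```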
* [SPARK-7093] [SQL] Using newPredicate in NestedLoopJoin to enable code generation (scwf, 2015-04-30; 2 files, -8/+2)
  Use newPredicate in NestedLoopJoin instead of InterpretedPredicate so that it can take advantage of code generation.
  Author: scwf <wangfei1@huawei.com>
  Closes #5665 from scwf/NLP and squashes the following commits:
    d19dd31 [scwf] improvement
    a887c02 [scwf] improve for NLP boundCondition
* [SPARK-7280][SQL] Add "drop" column/s on a data frame (rakeshchalasani, 2015-04-30; 2 files, -4/+45)
  Takes a column name and returns a new DataFrame with that column dropped.
  Author: rakeshchalasani <vnit.rakesh@gmail.com>
  Closes #5818 from rakeshchalasani/SPARK-7280 and squashes the following commits:
    ce2ec09 [rakeshchalasani] Minor edit
    45c06f1 [rakeshchalasani] Change withColumnRename and format changes
    f68945a [rakeshchalasani] Minor fix
    0b9104d [rakeshchalasani] Drop one column at a time
    289afd2 [rakeshchalasani] [SPARK-7280][SQL] Add "drop" column/s on a data frame
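  A sketch of the new method (column name assumed):
  ```scala
  // returns a new DataFrame without the named column
  val trimmed = df.drop("debug_info")
  ```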
* [SPARK-7242][SQL][MLLIB] Frequent items for DataFrames (Burak Yavuz, 2015-04-30; 5 files, -5/+256)
  Finding frequent items, with possibly false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. Public API:
  ```
  df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
  ```
  The output is a local DataFrame having the input column names with `-freqItems` appended. This is a single-pass algorithm that may return false positives, but no false negatives. cc mengxr rxin Let's get the implementations in; I can add a Python API in a follow-up PR.
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5799 from brkyvz/freq-items and squashes the following commits:
    a6ec82c [Burak Yavuz] addressed comments v?
    39b1bba [Burak Yavuz] removed toSeq
    0915e23 [Burak Yavuz] addressed comments v2.1
    3a5c177 [Burak Yavuz] addressed comments v2.0
    482e741 [Burak Yavuz] removed old import
    38e784d [Burak Yavuz] addressed comments v1.0
    8279d4d [Burak Yavuz] added default value for support
    3d82168 [Burak Yavuz] made base implementation
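  A sketch using the signature quoted above (column names assumed):
  ```scala
  // items appearing in at least 1% of rows, per column; may include false
  // positives but never misses a truly frequent item
  val freq = df.stat.freqItems(Array("site", "device"), 0.01)
  ```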
* [Build] Enable MiMa checks for SQL (Josh Rosen, 2015-04-30; 3 files, -4/+5)
  Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
  Author: Josh Rosen <joshrosen@databricks.com>
  Closes #5727 from JoshRosen/enable-more-mima-checks and squashes the following commits:
    3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
    0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks
    e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql]
    44d0d01 [Josh Rosen] Add back 'launcher' exclude
    1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.
* [SPARK-7267][SQL] Push down Project when its child is Limit (Zhongshuai Pei, 2015-04-30; 2 files, -0/+19)
  SQL
  ```
  select key from (select key,value from t1 limit 100) t2 limit 10
  ```
  Optimized logical plan before modifying:
  ```
  == Optimized Logical Plan ==
  Limit 10
   Project key#228
    Limit 100
     MetastoreRelation default, t1, None
  ```
  Optimized logical plan after modifying:
  ```
  == Optimized Logical Plan ==
  Limit 10
   Limit 100
    Project key#228
     MetastoreRelation default, t1, None
  ```
  After this, we can combine the limits.
  Author: Zhongshuai Pei <799203320@qq.com>
  Author: DoingDone9 <799203320@qq.com>
  Closes #5797 from DoingDone9/ProjectLimit and squashes the following commits:
    70d0fca [Zhongshuai Pei] Update FilterPushdownSuite.scala
    dc83ae9 [Zhongshuai Pei] Update FilterPushdownSuite.scala
    485c61c [Zhongshuai Pei] Update Optimizer.scala
    f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
    f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
    f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
    34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
    802261c [DoingDone9] Merge pull request #7 from apache/master
    d00303b [DoingDone9] Merge pull request #6 from apache/master
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master
* [SPARK-7196][SQL] Support precision and scale of decimal type for JDBC (Liang-Chi Hsieh, 2015-04-30; 2 files, -2/+10)
  JIRA: https://issues.apache.org/jira/browse/SPARK-7196
  Author: Liang-Chi Hsieh <viirya@gmail.com>
  Closes #5777 from viirya/jdbc_precision and squashes the following commits:
    f40f5e6 [Liang-Chi Hsieh] Support precision and scale for NUMERIC type.
    49acbf9 [Liang-Chi Hsieh] Add unit test.
    a509e19 [Liang-Chi Hsieh] Support precision and scale of decimal type for JDBC.
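  A sketch of the effect (connection URL and table name are assumptions): a column declared NUMERIC(10, 2) in the source database should now surface with its declared precision and scale rather than as an unlimited-precision decimal.
  ```scala
  val accounts = sqlContext.jdbc("jdbc:postgresql://db/test", "accounts")
  accounts.printSchema()  // e.g. balance: decimal(10,2) instead of decimal
  ```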
* [SPARK-7225][SQL] CombineLimits optimizer does not work (Zhongshuai Pei, 2015-04-29; 2 files, -8/+26)
  SQL
  ```
  select key from (select key from src limit 100) t2 limit 10
  ```
  Optimized logical plan before modifying:
  ```
  == Optimized Logical Plan ==
  Limit 10
   Limit 100
    Project key#3
     MetastoreRelation default, src, None
  ```
  Optimized logical plan after modifying:
  ```
  == Optimized Logical Plan ==
  Limit 10
   Project [key#1]
    MetastoreRelation default, src, None
  ```
  Author: Zhongshuai Pei <799203320@qq.com>
  Author: DoingDone9 <799203320@qq.com>
  Closes #5770 from DoingDone9/limitOptimizer and squashes the following commits:
    c68eaa7 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
    97e18cf [Zhongshuai Pei] Update Optimizer.scala
    19ab875 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
    7db4566 [Zhongshuai Pei] Update CombiningLimitsSuite.scala
    e2a491d [Zhongshuai Pei] Update Optimizer.scala
    f03fe7f [Zhongshuai Pei] Merge pull request #12 from apache/master
    f12fa50 [Zhongshuai Pei] Merge pull request #10 from apache/master
    f61210c [Zhongshuai Pei] Merge pull request #9 from apache/master
    34b1a9a [Zhongshuai Pei] Merge pull request #8 from apache/master
    802261c [DoingDone9] Merge pull request #7 from apache/master
    d00303b [DoingDone9] Merge pull request #6 from apache/master
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master
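  For context, the rule at issue combines adjacent Limits into one that keeps the smaller count; roughly (a sketch of the rule's shape, not the exact patch):
  ```scala
  import org.apache.spark.sql.catalyst.expressions.{If, LessThan}
  import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan}
  import org.apache.spark.sql.catalyst.rules.Rule

  object CombineLimits extends Rule[LogicalPlan] {
    def apply(plan: LogicalPlan): LogicalPlan = plan transform {
      // an outer Limit directly over an inner Limit collapses into a
      // single Limit that keeps whichever count is smaller
      case Limit(le, Limit(ne, grandChild)) =>
        Limit(If(LessThan(ne, le), ne, le), grandChild)
    }
  }
  ```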
* [SPARK-7156][SQL] Addressed follow up comments for randomSplit (Burak Yavuz, 2015-04-29; 1 file, -1/+1)
  Small fixes regarding comments in PR #5761. cc rxin
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5795 from brkyvz/split-followup and squashes the following commits:
    369c522 [Burak Yavuz] changed wording a little
    1ea456f [Burak Yavuz] Addressed follow up comments
* [SPARK-7234][SQL] Fix DateType mismatch when codegen on. (云峤, 2015-04-29; 1 file, -0/+1)
  Author: 云峤 <chensong.cs@alibaba-inc.com>
  Closes #5778 from kaka1992/fix_codegenon_datetype_mismatch and squashes the following commits:
    1ad4cff [云峤] SPARK-7234 fix dateType mismatch
* [SQL] [Minor] Print detailed query execution info when Spark answer is not right (wangfei, 2015-04-29; 1 file, -5/+1)
  Print detailed query execution info, including the parsed/analyzed/optimized/physical plans, when the Spark answer is not right.
  ```
  Results do not match for query:
  == Parsed Logical Plan ==
  'Aggregate ['x.str], ['x.str,SUM('x.strCount) AS c1#46]
   'Join Inner, Some(('x.str = 'y.str))
    'UnresolvedRelation [df], Some(x)
    'UnresolvedRelation [df], Some(y)

  == Analyzed Logical Plan ==
  Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
   Join Inner, Some((str#44 = str#51))
    Subquery x
     Subquery df
      Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
       Project [_1#41 AS int#43,_2#42 AS str#44]
        LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]
    Subquery y
     Subquery df
      Aggregate [str#51], [str#51,COUNT(str#51) AS strCount#47L]
       Project [_1#41 AS int#50,_2#42 AS str#51]
        LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]

  == Optimized Logical Plan ==
  Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
   Project [str#44,strCount#45L]
    Join Inner, Some((str#44 = str#51))
     Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
      LocalRelation [str#44], [[1],[2],[3]]
     Aggregate [str#51], [str#51]
      LocalRelation [str#51], [[1],[2],[3]]

  == Physical Plan ==
  Aggregate false, [str#44], [str#44,CombineSum(PartialSum#53L) AS c1#46L]
   Aggregate true, [str#44], [str#44,SUM(strCount#45L) AS PartialSum#53L]
    Project [str#44,strCount#45L]
     BroadcastHashJoin [str#44], [str#51], BuildRight
      Aggregate false, [str#44], [str#44,Coalesce(SUM(PartialCount#55L),0) AS strCount#45L]
       Exchange (HashPartitioning [str#44], 5), []
        Aggregate true, [str#44], [str#44,COUNT(str#44) AS PartialCount#55L]
         LocalTableScan [str#44], [[1],[2],[3]]
      Aggregate false, [str#51], [str#51]
       Exchange (HashPartitioning [str#51], 5), []
        Aggregate true, [str#51], [str#51]
         LocalTableScan [str#51], [[1],[2],[3]]

  Code Generation: false
  == RDD ==
  == Results ==
  !== Correct Answer - 3 ==   == Spark Answer - 3 ==
   [1,1]                      [1,1]
  ![2,3]                      [2,1]
   [3,1]                      [3,1]
  ```
  Author: wangfei <wangfei1@huawei.com>
  Closes #5774 from scwf/checkanswer and squashes the following commits:
    5be6f78 [wangfei] print detail query execution info when Spark Answer is not right
* [SPARK-7229] [SQL] SpecificMutableRow should take integer type as internal representation for Date (Cheng Hao, 2015-04-29; 2 files, -0/+10)
  Author: Cheng Hao <hao.cheng@intel.com>
  Closes #5772 from chenghao-intel/specific_row and squashes the following commits:
    2cd064d [Cheng Hao] scala style issue
    60347a2 [Cheng Hao] SpecificMutableRow should take integer type as internal representation for DateType
* [SPARK-7156][SQL] support RandomSplit in DataFrames (Burak Yavuz, 2015-04-29; 7 files, -15/+92)
  This is built on top of kaka1992's PR #5711, using logical plans.
  Author: Burak Yavuz <brkyvz@gmail.com>
  Closes #5761 from brkyvz/random-sample and squashes the following commits:
    a1fb0aa [Burak Yavuz] remove unrelated file
    69669c3 [Burak Yavuz] fix broken test
    1ddb3da [Burak Yavuz] copy base
    6000328 [Burak Yavuz] added python api and fixed test
    3c11d1b [Burak Yavuz] fixed broken test
    f400ade [Burak Yavuz] fix build errors
    2384266 [Burak Yavuz] addressed comments v0.1
    e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
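  A sketch of the API (the weights and seed are illustrative):
  ```scala
  // split df into ~70%/30% pieces; the seed makes the split reproducible
  val Array(train, test) = df.randomSplit(Array(0.7, 0.3), seed = 11L)
  ```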