path: root/sql
Commit message | Author | Age | Files | Lines
* [SPARK-6672][SQL] convert row to catalyst in createDataFrame(RDD[Row], ...) | Xiangrui Meng | 2015-04-02 | 7 | -8/+37

    We assume that `RDD[Row]` contains Scala types, so we need to convert them into catalyst types in createDataFrame.

    liancheng

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #5329 from mengxr/SPARK-6672 and squashes the following commits:

    2d52644 [Xiangrui Meng] set needsConversion = false in jsonRDD
    06896e4 [Xiangrui Meng] add createDataFrame without conversion
    4a3767b [Xiangrui Meng] convert Row to catalyst
* [SPARK-6663] [SQL] use Literal.create instead of constructor | Davies Liu | 2015-04-01 | 16 | -213/+220

    In order to do inbound checking and type conversion, we should use Literal.create() instead of the constructor.

    Author: Davies Liu <davies@databricks.com>

    Closes #5320 from davies/literal and squashes the following commits:

    1667604 [Davies Liu] fix style and add comment
    5f8c0fd [Davies Liu] use Literal.create instread of constructor
* Revert "[SPARK-6618][SQL] HiveMetastoreCatalog.lookupRelation should use fine-grained lock" | Cheng Lian | 2015-04-02 | 2 | -20/+3

    This reverts commit 314afd0e2f08dd8d3333d3143712c2c79fa40d1e.
* [SPARK-6658][SQL] Update DataFrame documentation to fix type references. | Chet Mancini | 2015-04-01 | 1 | -6/+6

    First contribution here; would love to be getting some code contributions in soon. Let me know if there's anything about the contribution process I should improve.

    Author: Chet Mancini <chetmancini@gmail.com>

    Closes #5316 from chetmancini/SPARK_6658_dataframe_doc and squashes the following commits:

    53b627a [Chet Mancini] [SQL] SPARK-6658: Update DataFrame documentation to refer to correct types
* SPARK-6433 hive tests to import spark-sql test JAR for QueryTest access | Steve Loughran | 2015-04-01 | 4 | -212/+14

    1. Test JARs are built & published.
    2. log4j.resources is explicitly excluded. Without this, downstream test run logging depends on the order the JARs are listed/loaded.
    3. sql/hive pulls in spark-sql &...spark-catalyst for its test runs.
    4. The copied-in test classes were rm'd, and a test was edited to remove its now duplicate assert method.
    5. Spark streaming is now built with the same plugin/phase as the rest, but its shade plugin declaration is kept in (so it differs from the rest of the test plugins). Due to (#2), this means the test JAR no longer includes its log4j file.

    Outstanding issues:

    * Should the JARs be shaded? `spark-streaming-test.jar` is, but given these are test JARs for developers only, especially in the same Spark source tree, it's hard to justify.
    * `maven-jar-plugin` v 2.6 was explicitly selected; without this, the apache-1.4 parent template JAR version (2.4) is chosen.
    * Are there any other resources to exclude?

    Author: Steve Loughran <stevel@hortonworks.com>

    Closes #5119 from steveloughran/stevel/patches/SPARK-6433-test-jars and squashes the following commits:

    81ceb01 [Steve Loughran] SPARK-6433 add a clearer comment explaining what the plugin is doing & why
    a6dca33 [Steve Loughran] SPARK-6433 : pull configuration section form archive plugin
    c2b5f89 [Steve Loughran] SPARK-6433 omit "jar" goal from jar plugin
    fdac51b [Steve Loughran] SPARK-6433 -002; indentation & delegate plugin version to parent
    650f442 [Steve Loughran] SPARK-6433 patch 001: test JARs are built; sql/hive pulls in spark-sql & spark-catalyst for its test runs
* [SPARK-6608] [SQL] Makes DataFrame.rdd a lazy val | Cheng Lian | 2015-04-01 | 1 | -2/+4

    Before 1.3.0, `SchemaRDD.id` worked as a unique identifier of each `SchemaRDD`. In 1.3.0, unlike `SchemaRDD`, `DataFrame` is no longer an RDD, and `DataFrame.rdd` is actually a function which always returns a new RDD instance. Making `DataFrame.rdd` a lazy val should bring the unique identifier back.

    Author: Cheng Lian <lian@databricks.com>

    Closes #5265 from liancheng/spark-6608 and squashes the following commits:

    7500968 [Cheng Lian] Updates javadoc
    7f37d21 [Cheng Lian] Makes DataFrame.rdd a lazy val
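
The difference between a function and a lazy val that this commit relies on can be sketched in plain Scala. This is only an illustration: `FakeRdd` and `FakeFrame` are hypothetical stand-ins, not Spark classes, and the sketch shows only the language-level identity behaviour, not the actual change to `DataFrame`.

```scala
// Plain-Scala sketch; FakeRdd and FakeFrame are hypothetical stand-ins.
object LazyRddSketch {
  final class FakeRdd

  class FakeFrame {
    // A `def` builds a fresh object on every call, so identity is unstable.
    def rddAsDef: FakeRdd = new FakeRdd

    // A `lazy val` is computed once on first access and cached afterwards,
    // so every access returns the same instance.
    lazy val rddAsLazyVal: FakeRdd = new FakeRdd
  }
}
```

With a lazy val, anything derived from the instance (such as an id) stays stable across accesses on the same frame.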
* [Doc] Improve Python DataFrame documentation | Reynold Xin | 2015-03-31 | 1 | -0/+3

    Author: Reynold Xin <rxin@databricks.com>

    Closes #5287 from rxin/pyspark-df-doc-cleanup-context and squashes the following commits:

    1841b60 [Reynold Xin] Lint.
    f2007f1 [Reynold Xin] functions and types.
    bc3b72b [Reynold Xin] More improvements to DataFrame Python doc.
    ac1d4c0 [Reynold Xin] Bug fix.
    b163365 [Reynold Xin] Python fix. Added Experimental flag to DataFrameNaFunctions.
    608422d [Reynold Xin] [Doc] Cleanup context.py Python docs.
* [SPARK-6633][SQL] Should be "Contains" instead of "EndsWith" when constructing sources.StringContains | Liang-Chi Hsieh | 2015-03-31 | 1 | -1/+1

    Author: Liang-Chi Hsieh <viirya@gmail.com>

    Closes #5299 from viirya/stringcontains and squashes the following commits:

    c1ece4c [Liang-Chi Hsieh] Should be Contains instead of EndsWith.
* [SPARK-5371][SQL] Propagate types after function conversion, before further resolution | Michael Armbrust | 2015-03-31 | 3 | -2/+27

    Previously, it was possible for a query to flip back and forth from a resolved state, allowing resolution to propagate up before coercion had stabilized. The issue was that `ResolvedReferences` would run after `FunctionArgumentConversion`, but before `PropagateTypes` had run. This PR ensures we correctly `PropagateTypes` after any coercion has been applied.

    Author: Michael Armbrust <michael@databricks.com>

    Closes #5278 from marmbrus/unionNull and squashes the following commits:

    dc3581a [Michael Armbrust] [SPARK-5371][SQL] Propogate types after function conversion / before futher resolution
* [SPARK-6145][SQL] fix ORDER BY on nested fields | Michael Armbrust | 2015-03-31 | 8 | -57/+185

    This PR is based on work by cloud-fan in #4904, but with two differences:

    - We isolate the logic for Sort's special handling into `ResolveSortReferences`.
    - We avoid creating UnresolvedGetField expressions during resolution. Instead we either resolve GetField or we return None. This avoids us going down the wrong path early on.

    Author: Michael Armbrust <michael@databricks.com>

    Closes #5189 from marmbrus/nestedOrderBy and squashes the following commits:

    b8cae45 [Michael Armbrust] fix another test
    0f36a11 [Michael Armbrust] WIP
    91820cd [Michael Armbrust] Fix bug.
* [SPARK-6575] [SQL] Adds configuration to disable schema merging while converting metastore Parquet tables | Cheng Lian | 2015-03-31 | 2 | -10/+15

    Consider a metastore Parquet table that

    1. doesn't have schema evolution issues
    2. has lots of data files and/or partitions

    In this case, driver schema merging can be both slow and unnecessary. It would be good to have a configuration that lets the user disable schema merging when converting such a metastore Parquet table.

    Author: Cheng Lian <lian@databricks.com>

    Closes #5231 from liancheng/spark-6575 and squashes the following commits:

    cd96159 [Cheng Lian] Adds configuration to disable schema merging while converting metastore Parquet tables
* [SPARK-6555] [SQL] Overrides equals() and hashCode() for MetastoreRelation | Cheng Lian | 2015-03-31 | 2 | -20/+28

    Also removes temporary workarounds made in #5183 and #5251.

    Author: Cheng Lian <lian@databricks.com>

    Closes #5289 from liancheng/spark-6555 and squashes the following commits:

    d0095ac [Cheng Lian] Removes unused imports
    cfafeeb [Cheng Lian] Removes outdated comment
    75a2746 [Cheng Lian] Overrides equals() and hashCode() for MetastoreRelation
* [SPARK-6542][SQL] add CreateStruct | Xiangrui Meng | 2015-03-31 | 3 | -23/+73

    Similar to `CreateArray`, we can add `CreateStruct` to create nested columns.

    marmbrus

    Author: Xiangrui Meng <meng@databricks.com>

    Closes #5195 from mengxr/SPARK-6542 and squashes the following commits:

    3795c57 [Xiangrui Meng] update error message
    ae7ac3e [Xiangrui Meng] move unit test to a separate suite
    85dd559 [Xiangrui Meng] use NamedExpr
    c78e31a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-6542
    85f3106 [Xiangrui Meng] add CreateStruct
* [SPARK-6618][SQL] HiveMetastoreCatalog.lookupRelation should use fine-grained lock | Yin Huai | 2015-03-31 | 2 | -3/+20

    JIRA: https://issues.apache.org/jira/browse/SPARK-6618

    Author: Yin Huai <yhuai@databricks.com>

    Closes #5281 from yhuai/lookupRelationLock and squashes the following commits:

    591b4be [Yin Huai] A test?
    b3a9625 [Yin Huai] Just protect client.
* [SPARK-6625][SQL] Add common string filters to data sources. | Reynold Xin | 2015-03-31 | 4 | -22/+133

    Filters such as startsWith, endsWith, contains will be very useful for data sources that provide search functionality, e.g. Succinct, Elastic Search, Solr.

    I also took this chance to improve documentation for the data source filters.

    Author: Reynold Xin <rxin@databricks.com>

    Closes #5285 from rxin/ds-string-filters and squashes the following commits:

    f021727 [Reynold Xin] Fixed grammar.
    7695a52 [Reynold Xin] [SPARK-6625][SQL] Add common string filters to data sources.
* [SPARK-6119][SQL] DataFrame support for missing data handling | Reynold Xin | 2015-03-30 | 6 | -8/+424

    This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.

    Author: Reynold Xin <rxin@databricks.com>

    Closes #5274 from rxin/df-missing-value and squashes the following commits:

    4ee1b98 [Reynold Xin] Improve error reporting in Python.
    33a330c [Reynold Xin] Remove replace for now.
    bc4fdbb [Reynold Xin] Added documentation for replace.
    d56f5a5 [Reynold Xin] Added replace for Scala/Java.
    2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
    914a374 [Reynold Xin] fill with map.
    185c67e [Reynold Xin] Allow specifying column subsets in fill.
    749eb47 [Reynold Xin] fillna
    249b94e [Reynold Xin] Removing undefined functions.
    6a73c68 [Reynold Xin] Missing file.
    67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
* [SPARK-6369] [SQL] Uses commit coordinator to help committing Hive and Parquet tables | Cheng Lian | 2015-03-31 | 4 | -22/+11

    This PR leverages the output commit coordinator introduced in #4066 to help commit Hive and Parquet tables.

    It extracts the output commit code in `SparkHadoopWriter.commit` to `SparkHadoopMapRedUtil.commitTask`, and reuses it for committing Parquet and Hive tables on the executor side.

    TODO

    - [ ] Add tests

    Author: Cheng Lian <lian@databricks.com>

    Closes #5139 from liancheng/spark-6369 and squashes the following commits:

    72eb628 [Cheng Lian] Fixes typo in javadoc
    9a4b82b [Cheng Lian] Adds javadoc and addresses @aarondav's comments
    dfdf3ef [Cheng Lian] Uses commit coordinator to help committing Hive and Parquet tables
* [SPARK-6592][SQL] fix filter for scaladoc to generate API doc for Row class under catalyst dir | CodingCat | 2015-03-30 | 2 | -2/+2

    https://issues.apache.org/jira/browse/SPARK-6592

    The current implementation in SparkBuild.scala filters out all classes under the catalyst directory. However, there is a corner case: the Row class is a public API under that directory. We need to include Row in the scaladoc while still excluding the other classes of the catalyst project.

    Thanks for the help on this patch from rxin and liancheng.

    Author: CodingCat <zhunansjtu@gmail.com>

    Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits:

    02098a4 [CodingCat] ignore collection, enable types (except those protected classes)
    f7af2cb [CodingCat] commit
    3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
* [SPARK-6595][SQL] MetastoreRelation should be a MultiInstanceRelation | Michael Armbrust | 2015-03-30 | 4 | -4/+27

    Now that we have `DataFrame`s, it is possible to have multiple copies of a `MetastoreRelation` in a single query plan. As such, it needs to inherit from `MultiInstanceRelation` or self joins will break. I also add better debugging errors for when our self join handling fails, in case there are future bugs.

    Author: Michael Armbrust <michael@databricks.com>

    Closes #5251 from marmbrus/multiMetaStore and squashes the following commits:

    4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MuliInstanceRelation
* [spark-sql] a better exception message than "scala.MatchError" for unsupported types in Schema creation | Eran Medan | 2015-03-30 | 1 | -0/+2

    Currently, trying to register an RDD (or DataFrame in 1.3) as a table when it contains types that have no supported Schema representation (e.g. type "Any") throws a match error, e.g.

    scala.MatchError: Any (of class scala.reflect.internal.Types$ClassNoArgsTypeRef)

    This fix is just to have a nicer error message than a MatchError.

    Author: Eran Medan <ehrann.mehdan@gmail.com>

    Closes #5235 from eranation/patch-2 and squashes the following commits:

    af4b1a2 [Eran Medan] Line should be under 100 chars
    0c69e9d [Eran Medan] Change from sys.error UnsupportedOperationException
    524be86 [Eran Medan] better exception than scala.MatchError: Any
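
The general pattern behind this fix can be sketched as follows. This is a reduced illustration, not the code from the patch: `schemaFor` here takes a type name as a string, the supported cases are hypothetical, and the exception message wording is an assumption.

```scala
// Hypothetical, heavily reduced type-to-schema mapping.
object SchemaErrorSketch {
  def schemaFor(typeName: String): String = typeName match {
    case "Int"    => "IntegerType"
    case "String" => "StringType"
    case other =>
      // Without this default case, an unsupported type would surface as
      // scala.MatchError, which says little about what went wrong.
      // The message text below is illustrative only.
      throw new UnsupportedOperationException(
        s"Schema for type $other is not supported")
  }
}
```

The catch-all case turns an opaque pattern-match failure into an error that names the offending type.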
* [SPARK-6538][SQL] Add missing nullable Metastore fields when merging a Parquet schema | Adam Budde | 2015-03-28 | 2 | -6/+66

    Opening to replace #5188.

    When Spark SQL infers a schema for a DataFrame, it will take the union of all field types present in the structured source data (e.g. an RDD of JSON data). When the source data for a row doesn't define a particular field on the DataFrame's schema, a null value will simply be assumed for this field. This workflow makes it very easy to construct tables and query over a set of structured data with a nonuniform schema. However, this behavior is not consistent in some cases when dealing with Parquet files and an external table managed by an external Hive metastore.

    In our particular use case, we use Spark Streaming to parse and transform our input data and then apply a window function to save an arbitrary-sized batch of data as a Parquet file, which itself will be added as a partition to an external Hive table via an *"ALTER TABLE... ADD PARTITION..."* statement. Since our input data is nonuniform, it is expected that not every partition batch will contain every field present in the table's schema obtained from the Hive metastore. As such, we expect that the schema of some of our Parquet files may not contain the same set of fields present in the full metastore schema.

    In such cases, it seems natural that Spark SQL would simply assume null values for any missing fields in the partition's Parquet file, assuming these fields are specified as nullable by the metastore schema. This is not the case in the current implementation of ParquetRelation2. The **mergeMetastoreParquetSchema()** method used to reconcile differences between a Parquet file's schema and a schema retrieved from the Hive metastore will raise an exception if the Parquet file doesn't match the same set of fields specified by the metastore.

    This pull request alters the behavior of **mergeMetastoreParquetSchema()** by having it first add any nullable fields from the metastore schema to the Parquet file schema if they aren't already present there.

    Author: Adam Budde <budde@amazon.com>

    Closes #5214 from budde/nullable-fields and squashes the following commits:

    a52d378 [Adam Budde] Refactor ParquetSchemaSuite.scala for cases now permitted by SPARK-6471 and SPARK-6538
    9041bfa [Adam Budde] Add missing nullable Metastore fields when merging a Parquet schema
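
The "add missing nullable fields first" step described above could look roughly like the sketch below. It is not the code from the patch: `Field` and `addMissingNullableFields` are hypothetical names, data types are plain strings, and the case-insensitive name comparison is an assumption made for the illustration.

```scala
// Hypothetical, simplified model of the merge step; not the Spark code.
object NullableMergeSketch {
  final case class Field(name: String, dataType: String, nullable: Boolean)

  // Append every nullable metastore field that the Parquet file schema lacks.
  // Non-nullable metastore fields that are missing are deliberately NOT added,
  // so a later consistency check can still reject the file.
  def addMissingNullableFields(metastore: Seq[Field], parquet: Seq[Field]): Seq[Field] = {
    // Assumption: names are compared case-insensitively.
    val present = parquet.map(_.name.toLowerCase).toSet
    val missing = metastore.filter { f =>
      f.nullable && !present.contains(f.name.toLowerCase)
    }
    parquet ++ missing
  }
}
```

For example, a Parquet file holding only `id` and `name` would gain a nullable `age` column if the metastore schema declares one, and reads of that column from the file would yield nulls.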
* [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 row, not 1 row | Reynold Xin | 2015-03-27 | 8 | -10/+18

    Author: Reynold Xin <rxin@databricks.com>

    Closes #5226 from rxin/empty-df and squashes the following commits:

    1306d88 [Reynold Xin] Proper fix.
    e135bb9 [Reynold Xin] [SPARK-6564][SQL] SQLContext.emptyDataFrame should contain 0 rows, not 1 row.
* [SPARK-6550][SQL] Use analyzed plan in DataFrame | Michael Armbrust | 2015-03-27 | 2 | -1/+5

    This is based on the bug and test case proposed by viirya. See #5203 for an excellent description of the problem.

    TL;DR: The problem occurs because the function `groupBy(String)` calls `resolve`, which returns an `AttributeReference`. However, this `AttributeReference` is based on an analyzed plan which is thrown away. At execution time, we once again analyze the plan. However, in the case of self-joins, each call to analyze will produce a new tree for the left side of the join, rendering the previously returned `AttributeReference` invalid.

    As a fix, I propose we keep the analyzed plan instead of the unresolved plan inside of a `DataFrame`.

    Author: Michael Armbrust <michael@databricks.com>

    Closes #5217 from marmbrus/preanalyzer and squashes the following commits:

    1f98e2d [Michael Armbrust] revert change
    dd4dec1 [Michael Armbrust] Use the analyzed plan in DataFrame
    089c52e [Michael Armbrust] WIP
* [SPARK-6554] [SQL] Don't push down predicates which reference partition column(s) | Cheng Lian | 2015-03-26 | 2 | -5/+29

    There are two cases for the new Parquet data source:

    1. Partition columns exist in the Parquet data files

       We don't need to push down these predicates since partition pruning already handles them.

    2. Partition columns don't exist in the Parquet data files

       We can't push down these predicates since they are considered as invalid columns by Parquet.

    Author: Cheng Lian <lian@databricks.com>

    Closes #5210 from liancheng/spark-6554 and squashes the following commits:

    4f7ec03 [Cheng Lian] Adds comments
    e134ced [Cheng Lian] Don't push down predicates which reference partition column(s)
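
In both cases the outcome is the same: a predicate that touches a partition column is kept out of the pushed-down set. A minimal sketch of that filtering step, using a hypothetical `Predicate` type that only records the column names it references (the real code works on Catalyst expressions):

```scala
// Hypothetical model: a predicate is its SQL text plus referenced column names.
object PushdownSketch {
  final case class Predicate(sql: String, references: Set[String])

  // Keep only predicates that reference no partition column at all.
  def pushable(predicates: Seq[Predicate], partitionColumns: Set[String]): Seq[Predicate] =
    predicates.filter(p => p.references.intersect(partitionColumns).isEmpty)
}
```

A predicate that mixes a data column and a partition column is also excluded in this sketch, since the partition column it mentions may not exist in the Parquet files.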
* [SPARK-6117] [SQL] Improvements to DataFrame.describe() | Reynold Xin | 2015-03-26 | 2 | -20/+29

    1. Slight modifications to the code to make it more readable.
    2. Added Python implementation.
    3. Updated the documentation to state that we don't guarantee the output schema for this function and it should only be used for exploratory data analysis.

    Author: Reynold Xin <rxin@databricks.com>

    Closes #5201 from rxin/df-describe and squashes the following commits:

    25a7834 [Reynold Xin] Reset run-tests.
    6abdfee [Reynold Xin] [SPARK-6117] [SQL] Improvements to DataFrame.describe()
* [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns | Yash Datta | 2015-03-26 | 2 | -4/+19

    Currently, in the parquet relation 2 implementation, an error is thrown if the merged schema is not exactly the same as the metastore schema. But to support cases like deleting a column with the replace column command, we can relax the restriction so that the query still works when the metastore schema is a subset of the merged parquet schema.

    Author: Yash Datta <Yash.Datta@guavus.com>

    Closes #5141 from saucam/replace_col and squashes the following commits:

    e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for metastore schema to be subset of parquet schema
    5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset of parquet schema to support dropping of columns using replace columns
* [SPARK-6465][SQL] Fix serialization of GenericRowWithSchema using kryo | Michael Armbrust | 2015-03-26 | 5 | -5/+39

    Author: Michael Armbrust <michael@databricks.com>

    Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following commits:

    bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using kryo
    f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
* [SPARK-6546][Build] Using the wrong code that will make spark compile failed!! | DoingDone9 | 2015-03-26 | 1 | -1/+1

    Wrong code: `val tmpDir = Files.createTempDir()`. It should use `Utils`, not `Files`.

    Author: DoingDone9 <799203320@qq.com>

    Closes #5198 from DoingDone9/FilesBug and squashes the following commits:

    6e0140d [DoingDone9] Update InsertIntoHiveTableSuite.scala
    e57d23f [DoingDone9] Update InsertIntoHiveTableSuite.scala
    802261c [DoingDone9] Merge pull request #7 from apache/master
    d00303b [DoingDone9] Merge pull request #6 from apache/master
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master
* [SPARK-6117] [SQL] add describe function to DataFrame for summary statis... | azagrebin | 2015-03-26 | 2 | -1/+97

    Please review my solution for SPARK-6117

    Author: azagrebin <azagrebin@gmail.com>

    Closes #5073 from azagrebin/SPARK-6117 and squashes the following commits:

    f9056ac [azagrebin] [SPARK-6117] [SQL] create one aggregation and split it locally into resulting DF, colocate test data with test case
    ddb3950 [azagrebin] [SPARK-6117] [SQL] simplify implementation, add test for DF without numeric columns
    9daf31e [azagrebin] [SPARK-6117] [SQL] add describe function to DataFrame for summary statistics
* [SPARK-6463][SQL] AttributeSet.equal should compare size | Michael Armbrust | 2015-03-25 | 2 | -1/+84

    Previously this could result in two sets comparing as equal when in fact the right was a subset of the left.

    Based on #5133 by sisihj

    Author: sisihj <jun.hejun@huawei.com>
    Author: Michael Armbrust <michael@databricks.com>

    Closes #5194 from marmbrus/pr/5133 and squashes the following commits:

    5ed4615 [Michael Armbrust] fix imports
    d4cbbc0 [Michael Armbrust] Add test cases
    0a0834f [sisihj] AttributeSet.equal should compare size
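
The failure mode is easy to reproduce outside Spark. The sketch below uses a hypothetical `IdSet` wrapper over plain `Long` ids rather than the real `AttributeSet`, and shows a containment-only check next to one that also compares sizes.

```scala
// Hypothetical stand-in for AttributeSet, holding plain ids.
object AttributeSetSketch {
  final class IdSet(val ids: Set[Long]) {
    // Containment-only check: every element of `other` is in `this`.
    // A strict subset on the right therefore compares as "equal".
    def subsetOnlyEquals(other: IdSet): Boolean =
      other.ids.forall(ids.contains)

    // Size-aware check: containment plus matching sizes.
    def sizeAwareEquals(other: IdSet): Boolean =
      ids.size == other.ids.size && other.ids.forall(ids.contains)
  }
}
```

With `left = {1, 2, 3}` and `right = {1, 2}`, the containment-only check accepts the pair and the size-aware check rejects it.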
* The UT test of spark is failed. Because there is a test in SQLQuerySuite about creating table “test” | KaiXinXiaoLei | 2015-03-25 | 1 | -5/+5

    If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" run before CachedTableSuite.scala, test("Drop cached table") fails, because the table `test` is created in SQLQuerySuite.scala and is never dropped. So when "drop cached table" runs, the table `test` already exists.

    The error is:

    01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists)
        at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616)
        at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189)
        at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281)
        at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153)
        at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85)
        at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503)
        at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270)
        at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911)
        at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)

    The test that creates table `test` in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" is:

        test("SPARK-4825 save join to table") {
          val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF()
          sql("CREATE TABLE test1 (key INT, value STRING)")
          testData.insertInto("test1")
          sql("CREATE TABLE test2 (key INT, value STRING)")
          testData.insertInto("test2")
          testData.insertInto("test2")
          sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key")
          checkAnswer(
            table("test"),
            sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq)
        }

    Author: KaiXinXiaoLei <huleilei1@huawei.com>

    Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits:

    7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
* [SPARK-6202] [SQL] enable variable substitution on test framework | Daoyuan Wang | 2015-03-25 | 1 | -1/+7

    Author: Daoyuan Wang <daoyuan.wang@intel.com>

    Closes #4930 from adrian-wang/testvs and squashes the following commits:

    2ce590f [Daoyuan Wang] add explicit function types
    b1d68bf [Daoyuan Wang] only substitute for parseSql
    9c4a950 [Daoyuan Wang] add a comment explaining
    18fb481 [Daoyuan Wang] enable variable substitute on test framework
* [SPARK-6271][SQL] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl | DoingDone9 | 2015-03-25 | 1 | -42/+46

    Author: DoingDone9 <799203320@qq.com>

    Closes #4973 from DoingDone9/sort_token and squashes the following commits:

    855fa10 [DoingDone9] Update HiveQl.scala
    c7080b3 [DoingDone9] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master
* [SPARK-6326][SQL] Improve castStruct to be faster | Liang-Chi Hsieh | 2015-03-25 | 1 | -4/+11

    The current `castStruct` is likely to be very slow. This PR slightly improves it.

    Author: Liang-Chi Hsieh <viirya@gmail.com>

    Closes #5017 from viirya/faster_caststruct and squashes the following commits:

    385d5b0 [Liang-Chi Hsieh] Further improved.
    746fcfb [Liang-Chi Hsieh] Make castStruct faster.
* [SPARK-5498][SQL] fix query exception when partition schema does not match table schema | jeanlyn | 2015-03-25 | 4 | -14/+84

    In Hive, the schema of a partition may differ from the table schema. When we use spark-sql to query data from a partition whose schema differs from the table schema, we get the exceptions described in the [jira](https://issues.apache.org/jira/browse/SPARK-5498). For example:

    * Take a look at the schema of the partition and of the table

    ```sql
    DESCRIBE partition_test PARTITION (dt='1');
    id                  int                 None
    name                string              None
    dt                  string              None

    # Partition Information
    # col_name          data_type           comment

    dt                  string              None
    ```

    ```
    DESCRIBE partition_test;
    OK
    id                  bigint              None
    name                string              None
    dt                  string              None

    # Partition Information
    # col_name          data_type           comment

    dt                  string              None
    ```

    * Run the SQL

    ```sql
    SELECT * FROM partition_test where dt='1';
    ```

    We get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt`

    Author: jeanlyn <jeanlyn92@gmail.com>

    Closes #4289 from jeanlyn/schema and squashes the following commits:

    9c8da74 [jeanlyn] fix style
    b41d6b9 [jeanlyn] fix compile errors
    07d84b6 [jeanlyn] Merge branch 'master' into schema
    535b0b6 [jeanlyn] reduce conflicts
    d6c93c5 [jeanlyn] fix bug
    1e8b30c [jeanlyn] fix code style
    0549759 [jeanlyn] fix code style
    c879aa1 [jeanlyn] clean the code
    2a91a87 [jeanlyn] add more test case and clean the code
    12d800d [jeanlyn] fix code style
    63d170a [jeanlyn] fix compile problem
    7470901 [jeanlyn] reduce conflicts
    afc7da5 [jeanlyn] make getConvertedOI compatible between 0.12.0 and 0.13.1
    b1527d5 [jeanlyn] fix type mismatch
    10744ca [jeanlyn] Insert a space after the start of the comment
    3b27af3 [jeanlyn] SPARK-5498:fix bug when query the data when partition schema does not match table schema
* [SPARK-6450] [SQL] Fixes metastore Parquet table conversion | Cheng Lian | 2015-03-25 | 2 | -16/+43

    The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed.

    The proper fix for this issue would be to override `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests were ill-formed from the very beginning. As the 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to use both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.

    Author: Cheng Lian <lian@databricks.com>

    Closes #5183 from liancheng/spark-6450 and squashes the following commits:

    3536780 [Cheng Lian] Fixes metastore Parquet table conversion
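
The key collision described above, and the effect of widening the key, can be shown with a small plain-Scala model. `Relation` below is a hypothetical class whose `equals` ignores its output, mirroring the behaviour the message attributes to `MetastoreRelation.equals`; output attributes are reduced to integer ids.

```scala
// Hypothetical model of a relation whose equality ignores `output`.
object ConversionMapSketch {
  final class Relation(val table: String, val output: Seq[Int]) {
    override def equals(other: Any): Boolean = other match {
      case r: Relation => r.table == table
      case _           => false
    }
    override def hashCode: Int = table.hashCode
  }

  // Keyed by relation only: two occurrences of one table collapse into one entry.
  def keyedByRelation(rs: Seq[Relation]): Map[Relation, Seq[Int]] =
    rs.map(r => r -> r.output).toMap

  // Keyed by (relation, output): each occurrence keeps its own entry.
  def keyedByRelationAndOutput(rs: Seq[Relation]): Map[(Relation, Seq[Int]), Seq[Int]] =
    rs.map(r => (r, r.output) -> r.output).toMap
}
```

Two occurrences of the same table with different output ids produce one entry under the first keying and two under the second, which is the property the conversion needs for self-joins.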
* [SPARK-6409][SQL] It is not necessary that avoid old interface of hive, because this will make some UDAF can not work. | DoingDone9 | 2015-03-25 | 2 | -3/+11

    Spark avoids the old interface of Hive, so some UDAFs, such as "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage", cannot work.

    Author: DoingDone9 <799203320@qq.com>

    Closes #5131 from DoingDone9/udaf and squashes the following commits:

    9de08d0 [DoingDone9] Update HiveUdfSuite.scala
    49c62dc [DoingDone9] Update hiveUdfs.scala
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master
* [SPARK-6483][SQL] Improve ScalaUdf called performance. | zzcclp | 2015-03-25 | 1 | -355/+661

    As described in issue [SPARK-6483](https://issues.apache.org/jira/browse/SPARK-6483), ScalaUdf performs poorly because it calls *asInstanceOf* to convert on every record. With this change, the performance of ScalaUdf is the same as in the other cases.

    Thanks to lianhuiwang for telling me how to resolve this problem.

    Author: zzcclp <xm_zzc@sina.com>

    Closes #5154 from zzcclp/SPARK-6483 and squashes the following commits:

    5ac6e09 [zzcclp] Add a newline at the end of source file
    cc6868e [zzcclp] Fix for fail on unit test.
    0a8cdc3 [zzcclp] indention issue
    b73836a [zzcclp] Access Seq[Expression] element by :: operator, and update the code gen script.
    7763848 [zzcclp] rebase from master
* [SPARK-6428][SQL] Added explicit types for all public methods in catalyst | Reynold Xin | 2015-03-24 | 40 | -586/+626

    I think after this PR, we can finally turn the rule on. There are still some smaller ones that need to be fixed, but those are easier.

    Author: Reynold Xin <rxin@databricks.com>

    Closes #5162 from rxin/catalyst-explicit-types and squashes the following commits:

    e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public methods in catalyst.
* [SPARK-6458][SQL] Better error messages for invalid data sources | Michael Armbrust | 2015-03-24 | 1 | -3/+9

    Avoid unclear match errors and use `AnalysisException`.

    Author: Michael Armbrust <michael@databricks.com>

    Closes #5158 from marmbrus/dataSourceError and squashes the following commits:

    af9f82a [Michael Armbrust] Yins comment
    90c6ba4 [Michael Armbrust] Better error messages for invalid data sources
* [SPARK-6376][SQL] Avoid eliminating subqueries until optimizationMichael Armbrust2015-03-249-17/+34
| | | | | | | | | | | | | | | | | | | | Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again. However, with eager analysis in `DataFrame`s this can cause errors for queries such as:
```scala
val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
```
As a result, in this PR we defer the elimination of subqueries until the optimization phase. Author: Michael Armbrust <michael@databricks.com> Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits: a9bb262 [Michael Armbrust] Update Optimizer.scala 27d25bf [Michael Armbrust] fix hive tests 9137e03 [Michael Armbrust] add type 81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization
* [SPARK-6375][SQL] Fix formatting of error messages.Michael Armbrust2015-03-247-5/+53
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #5155 from marmbrus/errorMessages and squashes the following commits: b898188 [Michael Armbrust] Fix formatting of error messages.
* [SPARK-6054][SQL] Fix transformations of TreeNodes that hold StructTypesMichael Armbrust2015-03-243-3/+25
| | | | | | | | | | Due to a recent change that made `StructType` a `Seq` we started inadvertently turning `StructType`s into generic `Traversable` when attempting nested tree transformations. In this PR we explicitly avoid descending into `DataType`s to avoid this bug. Author: Michael Armbrust <michael@databricks.com> Closes #5157 from marmbrus/udfFix and squashes the following commits: 26f7087 [Michael Armbrust] Fix transformations of TreeNodes that hold StructTypes
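The failure mode described above can be reproduced with a toy model. This is a minimal sketch, assuming a simplified argument-mapping function: `DataType`, `StructType`, and `mapArg` here are hypothetical stand-ins written for illustration, not Catalyst's classes. Matching on the data type before the generic collection case keeps the struct type intact.

```scala
// Toy stand-ins (not Catalyst's classes) for a data type that is also a Seq.
sealed trait DataType
final class StructType(val fields: Seq[String]) extends DataType with Seq[String] {
  def apply(i: Int): String = fields(i)
  def length: Int = fields.length
  def iterator: Iterator[String] = fields.iterator
}

object TransformSketch {
  // Applies f to every Int found in an argument, descending into collections.
  def mapArg(arg: Any)(f: Int => Int): Any = arg match {
    case dt: DataType    => dt // do not descend: mapping would rebuild it as a generic collection
    case n: Int          => f(n)
    case xs: Iterable[_] => xs.map(x => mapArg(x)(f))
    case other           => other
  }

  def main(args: Array[String]): Unit = {
    val st = new StructType(Seq("a", "b"))
    // The struct type survives the transformation unchanged.
    assert(mapArg(st)(_ + 1).isInstanceOf[StructType])
    // Ordinary collections are still transformed element by element.
    assert(mapArg(Seq(1, 2))(_ + 1) == Seq(2, 3))
  }
}
```

Without the first `case`, the struct type would match the `Iterable` branch and come back as a plain collection.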
* [SPARK-6437][SQL] Use completion iterator to close external sorterMichael Armbrust2015-03-241-2/+4
| | | | | | | | | | | Otherwise we will leak files when spilling occurs. Author: Michael Armbrust <michael@databricks.com> Closes #5161 from marmbrus/cleanupAfterSort and squashes the following commits: cb13d3c [Michael Armbrust] hint to inferencer cdebdf5 [Michael Armbrust] Use completion iterator to close external sorter
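A completion iterator wraps another iterator and runs a cleanup callback once the underlying data is exhausted. The following is a minimal sketch of that pattern in plain Scala; the class below is a hypothetical stand-in written for illustration, not Spark's own implementation, and the `closed` flag stands in for releasing a sorter's spill files.

```scala
// Hypothetical stand-in for a completion iterator; not Spark's own class.
final class CompletionIterator[A](underlying: Iterator[A], onComplete: () => Unit)
    extends Iterator[A] {
  private var completed = false

  override def hasNext: Boolean = {
    val more = underlying.hasNext
    if (!more && !completed) {
      completed = true
      onComplete() // run cleanup exactly once, when the data is exhausted
    }
    more
  }

  override def next(): A = underlying.next()
}

object CompletionIteratorDemo {
  def main(args: Array[String]): Unit = {
    var closed = false
    val sortedRows = Iterator(1, 2, 3)
    val rows = new CompletionIterator[Int](sortedRows, () => closed = true)
    assert(rows.toList == List(1, 2, 3))
    assert(closed) // cleanup ran after the last element was consumed
  }
}
```

Note that in this sketch the callback only fires if the consumer drains the iterator; a caller that stops early would need a separate cleanup path.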
* [SPARK-6459][SQL] Warn when constructing trivially true equals predicateMichael Armbrust2015-03-241-2/+11
| | | | | | | | | | | | | | | | | | For example, one might expect the following code to work, but it does not. Now you will at least get a warning with a suggestion to use aliases.
```scala
val df = sqlContext.load(path, "parquet")
val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns"))
val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend"))
val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
```
Author: Michael Armbrust <michael@databricks.com> Closes #5163 from marmbrus/selfJoinError and squashes the following commits: 16c1f0b [Michael Armbrust] fix visibility 1b57e8d [Michael Armbrust] Warn when constructing trivially true equals predicate
* [SPARK-6361][SQL] support adding a column with metadata in DFXiangrui Meng2015-03-243-10/+38
| | | | | | | | | | This is used by ML pipelines to embed ML attributes in columns created by ML transformers/estimators. marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #5151 from mengxr/SPARK-6361 and squashes the following commits: bb30de3 [Xiangrui Meng] support adding a column with metadata in DF
* [SPARK-6475][SQL] recognize array types when infer data types from JavaBeansXiangrui Meng2015-03-242-32/+89
| | | | | | | | | | | Right now, if there is an array field in a JavaBean, the user would see an exception in `createDataFrame`. liancheng Author: Xiangrui Meng <meng@databricks.com> Closes #5146 from mengxr/SPARK-6475 and squashes the following commits: 51e87e5 [Xiangrui Meng] validate schemas 4f2df5e [Xiangrui Meng] recognize array types when infer data types from JavaBeans
* [SPARK-6452] [SQL] Checks for missing attributes and unresolved operator for ↵Cheng Lian2015-03-243-7/+33
| | | | | | | | | | | | | | | | | | | | | | | all types of operator In `CheckAnalysis`, `Filter` and `Aggregate` are checked in separate case clauses, so they never reach the clauses that check for unresolved operators and missing input attributes. This PR also removes the `prettyString` call when generating the error message for missing input attributes, because the result of `prettyString` doesn't contain expression IDs and may give confusing messages like > resolved attributes a missing from a cc rxin Author: Cheng Lian <lian@databricks.com> Closes #5129 from liancheng/spark-6452 and squashes the following commits: 52cdc69 [Cheng Lian] Addresses comments 029f9bd [Cheng Lian] Checks for missing attributes and unresolved operator for all types of operator
* [SPARK-6124] Support jdbc connection properties in OPTIONS part of the queryVolodymyr Lyubinets2015-03-233-29/+59
| | | | | | | | | | One more thing, if this PR is considered to be OK: it might make sense to add extra .jdbc() APIs that take Properties to SQLContext. Author: Volodymyr Lyubinets <vlyubin@gmail.com> Closes #4859 from vlyubin/jdbcProperties and squashes the following commits: 7a8cfda [Volodymyr Lyubinets] Support jdbc connection properties in OPTIONS part of the query
* [SPARK-6397][SQL] Check the missingInput simplyYadong Qi2015-03-232-5/+5
| | | | | | | | | | | | https://github.com/apache/spark/pull/5082 /cc liancheng Author: Yadong Qi <qiyadong2010@gmail.com> Closes #5132 from watermen/sql-missingInput-new and squashes the following commits: 1e5bdc5 [Yadong Qi] Check the missingInput simply