spark - Mirror of Apache Spark

	Commit message (Collapse)	Author	Age	Files	Lines
...
*	The UT test of spark is failed. Because there is a test in SQLQuerySuite ↵	KaiXinXiaoLei	2015-03-25	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	about creating table “test” If the tests in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala" are running before CachedTableSuite.scala, the test("Drop cached table") will failed. Because the table test is created in SQLQuerySuite.scala ,and this table not droped. So when running "drop cached table", table test already exists. There is error info: 01:18:35.738 ERROR hive.ql.exec.DDLTask: org.apache.hadoop.hive.ql.metadata.HiveException: AlreadyExistsException(message:Table test already exists) at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:616) at org.apache.hadoop.hive.ql.exec.DDLTask.createTable(DDLTask.java:4189) at org.apache.hadoop.hive.ql.exec.DDLTask.execute(DDLTask.java:281) at org.apache.hadoop.hive.ql.exec.Task.executeTask(Task.java:153) at org.apache.hadoop.hive.ql.exec.TaskRunner.runSequential(TaskRunner.java:85) at org.apache.hadoop.hive.ql.Driver.launchTask(Driver.java:1503) at org.apache.hadoop.hive.ql.Driver.execute(Driver.java:1270) at org.apache.hadoop.hive.ql.Driver.runInternal(Driver.java:1088) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:911) at org.apache.hadoop.hive.ql.Driver.run(Driver.java:901)test” And the test about "create table test" in "sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala,is: test("SPARK-4825 save join to table") { val testData = sparkContext.parallelize(1 to 10).map(i => TestData(i, i.toString)).toDF() sql("CREATE TABLE test1 (key INT, value STRING)") testData.insertInto("test1") sql("CREATE TABLE test2 (key INT, value STRING)") testData.insertInto("test2") testData.insertInto("test2") sql("CREATE TABLE test AS SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key") checkAnswer( table("test"), sql("SELECT COUNT(a.value) FROM test1 a JOIN test2 b ON a.key = b.key").collect().toSeq) } Author: KaiXinXiaoLei <huleilei1@huawei.com> Closes #5150 from KaiXinXiaoLei/testFailed and squashes the following commits: 7534b02 [KaiXinXiaoLei] The UT test of spark is failed.
*	[SPARK-6202] [SQL] enable variable substitution on test framework	Daoyuan Wang	2015-03-25	1	-1/+7
\| \| \| \| \| \| \| \| \| \| \|	Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4930 from adrian-wang/testvs and squashes the following commits: 2ce590f [Daoyuan Wang] add explicit function types b1d68bf [Daoyuan Wang] only substitute for parseSql 9c4a950 [Daoyuan Wang] add a comment explaining 18fb481 [Daoyuan Wang] enable variable substitute on test framework
*	[SPARK-6271][SQL] Sort these tokens in alphabetic order to avoid further ↵	DoingDone9	2015-03-25	1	-42/+46
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	duplicate in HiveQl Author: DoingDone9 <799203320@qq.com> Closes #4973 from DoingDone9/sort_token and squashes the following commits: 855fa10 [DoingDone9] Update HiveQl.scala c7080b3 [DoingDone9] Sort these tokens in alphabetic order to avoid further duplicate in HiveQl c87e8b6 [DoingDone9] Merge pull request #3 from apache/master cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master
*	[SPARK-6326][SQL] Improve castStruct to be faster	Liang-Chi Hsieh	2015-03-25	1	-4/+11
\| \| \| \| \| \| \| \| \| \| \|	Current `castStruct` should be very slow. This pr slightly improves it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5017 from viirya/faster_caststruct and squashes the following commits: 385d5b0 [Liang-Chi Hsieh] Further improved. 746fcfb [Liang-Chi Hsieh] Make castStruct faster.
*	[SPARK-5498][SQL]fix query exception when partition schema does not match ↵	jeanlyn	2015-03-25	4	-14/+84
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	table schema In hive,the schema of partition may be difference from the table schema.When we use spark-sql to query the data of partition which schema is difference from the table schema,we will get the exceptions as the description of the [jira](https://issues.apache.org/jira/browse/SPARK-5498) .For example: * We take a look of the schema for the partition and the table ```sql DESCRIBE partition_test PARTITION (dt='1'); id int None name string None dt string None # Partition Information # col_name data_type comment dt string None ``` ``` DESCRIBE partition_test; OK id bigint None name string None dt string None # Partition Information # col_name data_type comment dt string None ``` * run the sql ```sql SELECT * FROM partition_test where dt='1'; ``` we will get the cast exception `java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.MutableLong cannot be cast to org.apache.spark.sql.catalyst.expressions.MutableInt` Author: jeanlyn <jeanlyn92@gmail.com> Closes #4289 from jeanlyn/schema and squashes the following commits: 9c8da74 [jeanlyn] fix style b41d6b9 [jeanlyn] fix compile errors 07d84b6 [jeanlyn] Merge branch 'master' into schema 535b0b6 [jeanlyn] reduce conflicts d6c93c5 [jeanlyn] fix bug 1e8b30c [jeanlyn] fix code style 0549759 [jeanlyn] fix code style c879aa1 [jeanlyn] clean the code 2a91a87 [jeanlyn] add more test case and clean the code 12d800d [jeanlyn] fix code style 63d170a [jeanlyn] fix compile problem 7470901 [jeanlyn] reduce conflicts afc7da5 [jeanlyn] make getConvertedOI compatible between 0.12.0 and 0.13.1 b1527d5 [jeanlyn] fix type mismatch 10744ca [jeanlyn] Insert a space after the start of the comment 3b27af3 [jeanlyn] SPARK-5498:fix bug when query the data when partition schema does not match table schema
*	[SPARK-6450] [SQL] Fixes metastore Parquet table conversion	Cheng Lian	2015-03-25	2	-16/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The `ParquetConversions` analysis rule generates a hash map, which maps from the original `MetastoreRelation` instances to the newly created `ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't compare output attributes. Thus, if a single metastore Parquet table appears multiple times in a query, only a single entry ends up in the hash map, and the conversion is not correctly performed. Proper fix for this issue should be overriding `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to make both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5183) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #5183 from liancheng/spark-6450 and squashes the following commits: 3536780 [Cheng Lian] Fixes metastore Parquet table conversion
*	[SPARK-6409][SQL] It is not necessary that avoid old inteface of hive, ↵	DoingDone9	2015-03-25	2	-3/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	because this will make some UDAF can not work. spark avoid old inteface of hive, then some udaf can not work like "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage" Author: DoingDone9 <799203320@qq.com> Closes #5131 from DoingDone9/udaf and squashes the following commits: 9de08d0 [DoingDone9] Update HiveUdfSuite.scala 49c62dc [DoingDone9] Update hiveUdfs.scala 98b134f [DoingDone9] Merge pull request #5 from apache/master 161cae3 [DoingDone9] Merge pull request #4 from apache/master c87e8b6 [DoingDone9] Merge pull request #3 from apache/master cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master
*	[SPARK-6483][SQL]Improve ScalaUdf called performance.	zzcclp	2015-03-25	1	-355/+661
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As issue [SPARK-6483](https://issues.apache.org/jira/browse/SPARK-6483) description, ScalaUdf is low performance because of calling asInstanceOf to convert per record. With this, the performance of ScalaUdf is the same as other case. thank lianhuiwang for telling me how to resolve this problem. Author: zzcclp <xm_zzc@sina.com> Closes #5154 from zzcclp/SPARK-6483 and squashes the following commits: 5ac6e09 [zzcclp] Add a newline at the end of source file cc6868e [zzcclp] Fix for fail on unit test. 0a8cdc3 [zzcclp] indention issue b73836a [zzcclp] Access Seq[Expression] element by :: operator, and update the code gen script. 7763848 [zzcclp] rebase from master
*	[SPARK-6428][SQL] Added explicit types for all public methods in catalyst	Reynold Xin	2015-03-24	40	-586/+626
\| \| \| \| \| \| \| \| \| \|	I think after this PR, we can finally turn the rule on. There are still some smaller ones that need to be fixed, but those are easier. Author: Reynold Xin <rxin@databricks.com> Closes #5162 from rxin/catalyst-explicit-types and squashes the following commits: e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public methods in catalyst.
*	[SPARK-6458][SQL] Better error messages for invalid data sources	Michael Armbrust	2015-03-24	1	-3/+9
\| \| \| \| \| \| \| \| \| \| \|	Avoid unclear match errors and use `AnalysisException`. Author: Michael Armbrust <michael@databricks.com> Closes #5158 from marmbrus/dataSourceError and squashes the following commits: af9f82a [Michael Armbrust] Yins comment 90c6ba4 [Michael Armbrust] Better error messages for invalid data sources
*	[SPARK-6376][SQL] Avoid eliminating subqueries until optimization	Michael Armbrust	2015-03-24	9	-17/+34
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Previously it was okay to throw away subqueries after analysis, as we would never try to use that tree for resolution again. However, with eager analysis in `DataFrame`s this can cause errors for queries such as: ```scala val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str") df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count() ``` As a result, in this PR we defer the elimination of subqueries until the optimization phase. Author: Michael Armbrust <michael@databricks.com> Closes #5160 from marmbrus/subqueriesInDfs and squashes the following commits: a9bb262 [Michael Armbrust] Update Optimizer.scala 27d25bf [Michael Armbrust] fix hive tests 9137e03 [Michael Armbrust] add type 81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization
*	[SPARK-6375][SQL] Fix formatting of error messages.	Michael Armbrust	2015-03-24	7	-5/+53
\| \| \| \| \| \| \| \|	Author: Michael Armbrust <michael@databricks.com> Closes #5155 from marmbrus/errorMessages and squashes the following commits: b898188 [Michael Armbrust] Fix formatting of error messages.
*	[SPARK-6054][SQL] Fix transformations of TreeNodes that hold StructTypes	Michael Armbrust	2015-03-24	3	-3/+25
\| \| \| \| \| \| \| \| \| \|	Due to a recent change that made `StructType` a `Seq` we started inadvertently turning `StructType`s into generic `Traversable` when attempting nested tree transformations. In this PR we explicitly avoid descending into `DataType`s to avoid this bug. Author: Michael Armbrust <michael@databricks.com> Closes #5157 from marmbrus/udfFix and squashes the following commits: 26f7087 [Michael Armbrust] Fix transformations of TreeNodes that hold StructTypes
*	[SPARK-6437][SQL] Use completion iterator to close external sorter	Michael Armbrust	2015-03-24	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \|	Otherwise we will leak files when spilling occurs. Author: Michael Armbrust <michael@databricks.com> Closes #5161 from marmbrus/cleanupAfterSort and squashes the following commits: cb13d3c [Michael Armbrust] hint to inferencer cdebdf5 [Michael Armbrust] Use completion iterator to close external sorter
*	[SPARK-6459][SQL] Warn when constructing trivially true equals predicate	Michael Armbrust	2015-03-24	1	-2/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For example, one might expect the following code to work, but it does not. Now you will at least get a warning with a suggestion to use aliases. ```scala val df = sqlContext.load(path, "parquet") val txns = df.groupBy("cust_id").agg($"cust_id", countDistinct($"day_num").as("txns")) val spend = df.groupBy("cust_id").agg($"cust_id", sum($"extended_price").as("spend")) val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner") ``` Author: Michael Armbrust <michael@databricks.com> Closes #5163 from marmbrus/selfJoinError and squashes the following commits: 16c1f0b [Michael Armbrust] fix visibility 1b57e8d [Michael Armbrust] Warn when constructing trivially true equals predicate
*	[SPARK-6361][SQL] support adding a column with metadata in DF	Xiangrui Meng	2015-03-24	3	-10/+38
\| \| \| \| \| \| \| \| \| \|	This is used by ML pipelines to embed ML attributes in columns created by ML transformers/estimators. marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #5151 from mengxr/SPARK-6361 and squashes the following commits: bb30de3 [Xiangrui Meng] support adding a column with metadata in DF
*	[SPARK-6475][SQL] recognize array types when infer data types from JavaBeans	Xiangrui Meng	2015-03-24	2	-32/+89
\| \| \| \| \| \| \| \| \| \| \|	Right now if there is a array field in a JavaBean, the user wold see an exception in `createDataFrame`. liancheng Author: Xiangrui Meng <meng@databricks.com> Closes #5146 from mengxr/SPARK-6475 and squashes the following commits: 51e87e5 [Xiangrui Meng] validate schemas 4f2df5e [Xiangrui Meng] recognize array types when infer data types from JavaBeans
*	[SPARK-6452] [SQL] Checks for missing attributes and unresolved operator for ↵	Cheng Lian	2015-03-24	3	-7/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	all types of operator In `CheckAnalysis`, `Filter` and `Aggregate` are checked in separate case clauses, thus never hit those clauses for unresolved operators and missing input attributes. This PR also removes the `prettyString` call when generating error message for missing input attributes. Because result of `prettyString` doesn't contain expression ID, and may give confusing messages like > resolved attributes a missing from a cc rxin <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5129) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #5129 from liancheng/spark-6452 and squashes the following commits: 52cdc69 [Cheng Lian] Addresses comments 029f9bd [Cheng Lian] Checks for missing attributes and unresolved operator for all types of operator
*	[SPARK-6124] Support jdbc connection properties in OPTIONS part of the query	Volodymyr Lyubinets	2015-03-23	3	-29/+59
\| \| \| \| \| \| \| \| \| \|	One more thing if this PR is considered to be OK - it might make sense to add extra .jdbc() API's that take Properties to SQLContext. Author: Volodymyr Lyubinets <vlyubin@gmail.com> Closes #4859 from vlyubin/jdbcProperties and squashes the following commits: 7a8cfda [Volodymyr Lyubinets] Support jdbc connection properties in OPTIONS part of the query
*	[SPARK-6397][SQL] Check the missingInput simply	Yadong Qi	2015-03-23	2	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	https://github.com/apache/spark/pull/5082 /cc liancheng Author: Yadong Qi <qiyadong2010@gmail.com> Closes #5132 from watermen/sql-missingInput-new and squashes the following commits: 1e5bdc5 [Yadong Qi] Check the missingInput simply
*	Revert "[SPARK-6397][SQL] Check the missingInput simply"	Cheng Lian	2015-03-23	2	-4/+3
\| \| \| \|	This reverts commit e566fe5982bac5d24e6be76e5d7d6270544a85e6.
*	[SPARK-6397][SQL] Check the missingInput simply	q00251598	2015-03-23	2	-3/+4
\| \| \| \| \| \| \| \|	Author: q00251598 <qiyadong@huawei.com> Closes #5082 from watermen/sql-missingInput and squashes the following commits: 25766b9 [q00251598] Check the missingInput simply
*	[SPARK-4985] [SQL] parquet support for date type	Daoyuan Wang	2015-03-23	5	-1/+35
\| \| \| \| \| \| \| \| \| \| \| \| \|	This PR might have some issues with #3732 , and this would have merge conflicts with #3820 so the review can be delayed till that 2 were merged. Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #3822 from adrian-wang/parquetdate and squashes the following commits: 2c5d54d [Daoyuan Wang] add a test case faef887 [Daoyuan Wang] parquet support for primitive date 97e9080 [Daoyuan Wang] parquet support for date type
*	[SPARK-6337][Documentation, SQL]Spark 1.3 doc fixes	vinodkc	2015-03-22	1	-1/+1
\| \| \| \| \| \| \| \|	Author: vinodkc <vinod.kc.in@gmail.com> Closes #5112 from vinodkc/spark_1.3_doc_fixes and squashes the following commits: 2c6aee6 [vinodkc] Spark 1.3 doc fixes
*	[SPARK-6408] [SQL] Fix JDBCRDD filtering string literals	ypcat	2015-03-22	2	-7/+24
\| \| \| \| \| \| \| \| \| \| \| \| \|	Author: ypcat <ypcat6@gmail.com> Author: Pei-Lun Lee <pllee@appier.com> Closes #5087 from ypcat/spark-6408 and squashes the following commits: 1becc16 [ypcat] [SPARK-6408] [SQL] styling 1bc4455 [ypcat] [SPARK-6408] [SQL] move nested function outside e57fa4a [ypcat] [SPARK-6408] [SQL] fix test case 245ab6f [ypcat] [SPARK-6408] [SQL] add test cases for filtering quoted strings 8962534 [Pei-Lun Lee] [SPARK-6408] [SQL] Fix filtering string literals
*	[SPARK-6428][SQL] Added explicit type for all public methods for Hive module	Reynold Xin	2015-03-21	16	-62/+79
\| \| \| \| \| \| \| \|	Author: Reynold Xin <rxin@databricks.com> Closes #5108 from rxin/hive-public-type and squashes the following commits: a320328 [Reynold Xin] [SPARK-6428][SQL] Added explicit type for all public methods for Hive module.
*	[SPARK-6250][SPARK-6146][SPARK-5911][SQL] Types are now reserved words in ↵	Yin Huai	2015-03-21	6	-120/+241
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	DDL parser. This PR creates a trait `DataTypeParser` used to parse data types. This trait aims to be single place to provide the functionality of parsing data types' string representation. It is currently mixed in with `DDLParser` and `SqlParser`. It is also used to parse the data type for `DataFrame.cast` and to convert Hive metastore's data type string back to a `DataType`. JIRA: https://issues.apache.org/jira/browse/SPARK-6250 Author: Yin Huai <yhuai@databricks.com> Closes #5078 from yhuai/ddlKeywords and squashes the following commits: 0e66097 [Yin Huai] Special handle struct<>. fea6012 [Yin Huai] Style. c9733fb [Yin Huai] Create a trait to parse data types.
*	[SPARK-5680][SQL] Sum function on all null values, should return zero	Venkata Ramana Gollamudi	2015-03-21	4	-3/+67
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	SELECT sum('a'), avg('a'), variance('a'), std('a') FROM src; Should give output as 0.0 NULL NULL NULL This fixes hive udaf_number_format.q Author: Venkata Ramana G <ramana.gollamudihuawei.com> Author: Venkata Ramana Gollamudi <ramana.gollamudi@huawei.com> Closes #4466 from gvramana/sum_fix and squashes the following commits: 42e14d1 [Venkata Ramana Gollamudi] Added comments 39415c0 [Venkata Ramana Gollamudi] Handled the partitioned Sum expression scenario df66515 [Venkata Ramana Gollamudi] code style fix 4be2606 [Venkata Ramana Gollamudi] Add udaf_number_format to whitelist and golden answer 330fd64 [Venkata Ramana Gollamudi] fix sum function for all null data
*	[SPARK-5320][SQL]Add statistics method at NoRelation (override super).	x1-	2015-03-21	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	Because of no statistics override, in spute of super class say 'LeafNode must override'. fix issue [SPARK-5320: Joins on simple table created using select gives error](https://issues.apache.org/jira/browse/SPARK-5320) Author: x1- <viva008@gmail.com> Closes #5105 from x1-/SPARK-5320 and squashes the following commits: e561aac [x1-] Add statistics method at NoRelation (override super).
*	[SPARK-5821] [SQL] JSON CTAS command should throw error message when delete ↵	Yanbo Liang	2015-03-21	2	-8/+53
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	path failure When using "CREATE TEMPORARY TABLE AS SELECT" to create JSON table, we first delete the path file or directory and then generate a new directory with the same name. But if only read permission was granted, the delete failed. Here we just throwing an error message to let users know what happened. ParquetRelation2 may also hit this problem. I think to restrict JSONRelation and ParquetRelation2 must base on directory is more reasonable for access control. Maybe I can do it in follow up works. Author: Yanbo Liang <ybliang8@gmail.com> Author: Yanbo Liang <yanbohappy@gmail.com> Closes #4610 from yanboliang/jsonInsertImprovements and squashes the following commits: c387fce [Yanbo Liang] fix typos 42d7fb6 [Yanbo Liang] add unittest & fix output format 46f0d9d [Yanbo Liang] Update JSONRelation.scala e2df8d5 [Yanbo Liang] check path exisit when write 79f7040 [Yanbo Liang] Update JSONRelation.scala e4bc229 [Yanbo Liang] Update JSONRelation.scala 5a42d83 [Yanbo Liang] JSONRelation CTAS should check if delete is successful
*	[SPARK-6315] [SQL] Also tries the case class string parser while reading ↵	Cheng Lian	2015-03-21	2	-5/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Parquet schema When writing Parquet files, Spark 1.1.x persists the schema string into Parquet metadata with the result of `StructType.toString`, which was then deprecated in Spark 1.2 by a schema string in JSON format. But we still need to take the old schema format into account while reading Parquet files. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5034) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #5034 from liancheng/spark-6315 and squashes the following commits: a182f58 [Cheng Lian] Adds a regression test b9c6dbe [Cheng Lian] Also tries the case class string parser while reading Parquet schema
*	[SPARK-5821] [SQL] ParquetRelation2 CTAS should check if delete is successful	Yanbo Liang	2015-03-21	1	-5/+14
\| \| \| \| \| \| \| \| \| \|	Do the same check as #4610 for ParquetRelation2. Author: Yanbo Liang <ybliang8@gmail.com> Closes #5107 from yanboliang/spark-5821-parquet and squashes the following commits: 7092c8d [Yanbo Liang] ParquetRelation2 CTAS should check if delete is successful
*	[SPARK-6428][SQL] Added explicit type for all public methods in sql/core	Reynold Xin	2015-03-20	53	-330/+438
\| \| \| \| \| \| \| \| \| \| \| \| \|	Also implemented equals/hashCode when they are missing. This is done in order to enable automatic public method type checking. Author: Reynold Xin <rxin@databricks.com> Closes #5104 from rxin/sql-hashcode-explicittype and squashes the following commits: ffce6f3 [Reynold Xin] Code review feedback. 8b36733 [Reynold Xin] [SPARK-6428][SQL] Added explicit type for all public methods.
*	[SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.	Marcelo Vanzin	2015-03-20	4	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #5056 from vanzin/SPARK-6371 and squashes the following commits: 63220df [Marcelo Vanzin] Merge branch 'master' into SPARK-6371 6506f75 [Marcelo Vanzin] Use more fine-grained exclusion. 178ba71 [Marcelo Vanzin] Oops. 75b2375 [Marcelo Vanzin] Exclude VertexRDD in MiMA. a45a62c [Marcelo Vanzin] Work around MIMA warning. 1d8a670 [Marcelo Vanzin] Re-group jetty exclusion. 0e8e909 [Marcelo Vanzin] Ignore ml, don't ignore graphx. cef4603 [Marcelo Vanzin] Indentation. 296cf82 [Marcelo Vanzin] [SPARK-6371] [build] Update version to 1.4.0-SNAPSHOT.
*	SPARK-6338 [CORE] Use standard temp dir mechanisms in tests to avoid ↵	Sean Owen	2015-03-20	14	-80/+61
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	orphaned temp files Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify Author: Sean Owen <sowen@cloudera.com> Closes #5029 from srowen/SPARK-6338 and squashes the following commits: 27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir 4a212fa [Sean Owen] Standardize a bit more temp dir management 9004081 [Sean Owen] Revert some added recursive-delete calls 57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
*	[SPARK-6247][SQL] Fix resolution of ambiguous joins caused by new aliases	Michael Armbrust	2015-03-17	9	-12/+96
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We need to handle ambiguous `exprId`s that are produced by new aliases as well as those caused by leaf nodes (`MultiInstanceRelation`). Attempting to fix this revealed a bug in `equals` for `Alias` as these objects were comparing equal even when the expression ids did not match. Additionally, `LocalRelation` did not correctly provide statistics, and some tests in `catalyst` and `hive` were not using the helper functions for comparing plans. Based on #4991 by chenghao-intel Author: Michael Armbrust <michael@databricks.com> Closes #5062 from marmbrus/selfJoins and squashes the following commits: 8e9b84b [Michael Armbrust] check qualifier too 8038a36 [Michael Armbrust] handle aggs too 0b9c687 [Michael Armbrust] fix more tests c3c574b [Michael Armbrust] revert change. 725f1ab [Michael Armbrust] add statistics a925d08 [Michael Armbrust] check for conflicting attributes in join resolution b022ef7 [Michael Armbrust] Handle project aliases. d8caa40 [Michael Armbrust] test case: SPARK-6247 f9c67c2 [Michael Armbrust] Check for duplicate attributes in join resolution. 898af73 [Michael Armbrust] Fix Alias equality.
*	[SPARK-5651][SQL] Add input64 in blacklist and add test suit for create ↵	watermen	2015-03-17	6	-1/+513
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	table within backticks Now spark version is only support ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch is used to support ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext. Author: watermen <qiyadong2010@gmail.com> Author: q00251598 <qiyadong@huawei.com> Closes #4427 from watermen/SPARK-5651 and squashes the following commits: c5c8ed1 [watermen] add the generated golden files 1f0e42e [q00251598] add input64 in blacklist and add test suit
*	[SPARK-5404] [SQL] Update the default statistic number	Cheng Hao	2015-03-17	1	-1/+11
\| \| \| \| \| \| \| \| \| \| \| \| \|	By default, the statistic for logical plan with multiple children is quite aggressive, and those statistic are quite critical for the join optimization, hence we need to estimate the statistics as accurate as possible. For `Union`, which has 2 children, and overwrite the default implementation by `adding` its children `byteInSize` instead of `multiplying`. For `Expand`, which only has a single child, but it will grows the size, and we need to multiply its inflating factor. Author: Cheng Hao <hao.cheng@intel.com> Closes #4914 from chenghao-intel/statistic and squashes the following commits: d466bbc [Cheng Hao] Update the default statistic
*	[SPARK-5908][SQL] Resolve UdtfsAlias when only single Alias is used	Liang-Chi Hsieh	2015-03-17	2	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \|	`ResolveUdtfsAlias` in `hiveUdfs` only considers the `HiveGenericUdtf` with multiple alias. When only single alias is used with `HiveGenericUdtf`, the alias is not working. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4692 from viirya/udft_alias and squashes the following commits: 8a3bae4 [Liang-Chi Hsieh] No need to test selected column from DataFrame since DataFrame API is updated. 160a379 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into udft_alias e6531cc [Liang-Chi Hsieh] Selected column from DataFrame should not re-analyze logical plan. a45cc2a [Liang-Chi Hsieh] Resolve UdtfsAlias when only single Alias is used.
*	[SPARK-6330] [SQL] Add a test case for SPARK-6330	Pei-Lun Lee	2015-03-18	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When getting file statuses, create file system from each path instead of a single one from hadoop configuration. Author: Pei-Lun Lee <pllee@appier.com> Closes #5039 from ypcat/spark-6351 and squashes the following commits: a19a3fe [Pei-Lun Lee] [SPARK-6330] [SQL] fix test 506f5a0 [Pei-Lun Lee] [SPARK-6351] [SQL] fix test fa2290e [Pei-Lun Lee] [SPARK-6330] [SQL] Rename test case and add comment 606c967 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6351 896e80a [Pei-Lun Lee] [SPARK-6351] [SQL] Add test case 2ae0916 [Pei-Lun Lee] [SPARK-6351] [SQL] ParquetRelation2 supporting multiple file systems
*	[SQL][docs][minor] Fixed sample code in SQLContext scaladoc	Lomig Mégard	2015-03-16	1	-2/+2
\| \| \| \| \| \| \| \| \| \|	Error in the code sample of the `implicits` object in `SQLContext`. Author: Lomig Mégard <lomig.megard@gmail.com> Closes #5051 from tarfaa/simple and squashes the following commits: 5a88acc [Lomig Mégard] [docs][minor] Fixed sample code in SQLContext scaladoc
*	[SPARK-5712] [SQL] fix comment with semicolon at end	Daoyuan Wang	2015-03-17	4	-13/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	---- comment; Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4500 from adrian-wang/semicolon and squashes the following commits: 70b8abb [Daoyuan Wang] use mkstring instead of reduce 2d49738 [Daoyuan Wang] remove outdated golden file 317346e [Daoyuan Wang] only skip comment with semicolon at end of line, to avoid golden file outdated d3ae01e [Daoyuan Wang] fix error a11602d [Daoyuan Wang] fix comment with semicolon at end
*	[SPARK-6330] Fix filesystem bug in newParquet relation	Volodymyr Lyubinets	2015-03-16	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \|	If I'm running this locally and my path points to S3, this would currently error out because of incorrect FS. I tested this in a scenario that previously didn't work, this change seemed to fix the issue. Author: Volodymyr Lyubinets <vlyubin@gmail.com> Closes #5020 from vlyubin/parquertbug and squashes the following commits: a645ad5 [Volodymyr Lyubinets] Fix filesystem bug in newParquet relation
*	[SPARK-2087] [SQL] Multiple thriftserver sessions with single HiveContext ↵	Cheng Hao	2015-03-17	8	-105/+353
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	instance Still, we keep only a single HiveContext within ThriftServer, and we also create a object called `SQLSession` for isolating the different user states. Developers can obtain/release a new user session via `openSession` and `closeSession`, and `SQLContext` and `HiveContext` will also provide a default session if no `openSession` called, for backward-compatibility. Author: Cheng Hao <hao.cheng@intel.com> Closes #4885 from chenghao-intel/multisessions_singlecontext and squashes the following commits: 1c47b2a [Cheng Hao] rename the tss => tlSession 815b27a [Cheng Hao] code style issue 57e3fa0 [Cheng Hao] openSession is not compatible between Hive0.12 & 0.13.1 4665b0d [Cheng Hao] thriftservice with single context
*	[SPARK-6285][SQL]Remove ParquetTestData in SparkBuild.scala and in README.md	OopsOutOfMemory	2015-03-15	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is a following clean up PR for #5010 This will resolve issues when launching `hive/console` like below: ``` <console>:20: error: object ParquetTestData is not a member of package org.apache.spark.sql.parquet import org.apache.spark.sql.parquet.ParquetTestData ``` Author: OopsOutOfMemory <victorshengli@126.com> Closes #5032 from OopsOutOfMemory/SPARK-6285 and squashes the following commits: 2996aeb [OopsOutOfMemory] remove ParquetTestData
*	[SPARK-6195] [SQL] Adds in-memory column type for fixed-precision decimals	Cheng Lian	2015-03-14	11	-76/+179
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR adds a specialized in-memory column type for fixed-precision decimals. For all other column types, a single integer column type ID is enough to determine which column type to use. However, this doesn't apply to fixed-precision decimal types with different precision and scale parameters. Moreover, according to the previous design, there seems no trivial way to encode precision and scale information into the columnar byte buffer. On the other hand, considering we always know the data type of the column to be built / scanned ahead of time. This PR no longer use column type ID to construct `ColumnBuilder`s and `ColumnAccessor`s, but resorts to the actual column data type. In this way, we can pass precision / scale information along the way. The column type ID is now not used anymore and can be removed in a future PR. ### Micro benchmark result The following micro benchmark builds a simple table with 2 million decimals (precision = 10, scale = 0), cache it in memory, then count all the rows. Code (simply paste it into Spark shell): ```scala import sc._ import sqlContext._ import sqlContext.implicits._ import org.apache.spark.sql.types._ import com.google.common.base.Stopwatch def benchmark(n: Int)(f: => Long) { val stopwatch = new Stopwatch() def run() = { stopwatch.reset() stopwatch.start() f stopwatch.stop() stopwatch.elapsedMillis() } val records = (0 until n).map(_ => run()) (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms")) println(s"Average: ${records.sum / n.toDouble} ms") } // Explicit casting is required because ScalaReflection can't inspect decimal precision parallelize(1 to 2000000) .map(i => Tuple1(Decimal(i, 10, 0))) .toDF("dec") .select($"dec" cast DecimalType(10, 0)) .registerTempTable("dec") sql("CACHE TABLE dec") val df = table("dec") // Warm up df.count() df.count() benchmark(5) { df.count() } ``` With `FIXED_DECIMAL` column type: - Round 0: 75 ms - Round 1: 97 ms - Round 2: 75 ms - Round 3: 70 ms - Round 4: 72 ms - Average: 77.8 ms Without `FIXED_DECIMAL` column type: - Round 0: 1233 ms - Round 1: 1170 ms - Round 2: 1171 ms - Round 3: 1141 ms - Round 4: 1141 ms - Average: 1171.2 ms <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4938) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4938 from liancheng/decimal-column-type and squashes the following commits: fef5338 [Cheng Lian] Updates fixed decimal column type related test cases e08ab5b [Cheng Lian] Only resorts to FIXED_DECIMAL when the value can be held in a long 4db713d [Cheng Lian] Adds in-memory column type for fixed-precision decimals
*	[SQL]Delete some dupliate code in HiveThriftServer2	ArcherShao	2015-03-14	1	-7/+5
\| \| \| \| \| \| \| \| \| \|	Author: ArcherShao <ArcherShao@users.noreply.github.com> Author: ArcherShao <shaochuan@huawei.com> Closes #5007 from ArcherShao/20150313 and squashes the following commits: ae422ae [ArcherShao] Updated 459efbd [ArcherShao] [SQL]Delete some dupliate code in HiveThriftServer2
*	[SPARK-6210] [SQL] use prettyString as column name in agg()	Davies Liu	2015-03-14	2	-5/+5
\| \| \| \| \| \| \| \| \| \|	use prettyString instead of toString() (which include id of expression) as column name in agg() Author: Davies Liu <davies@databricks.com> Closes #5006 from davies/prettystring and squashes the following commits: cb1fdcf [Davies Liu] use prettyString as column name in agg()
*	[SPARK-6317][SQL]Fixed HIVE console startup issue	vinodkc	2015-03-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Author: vinodkc <vinod.kc.in@gmail.com> Author: Vinod K C <vinod.kc@huawei.com> Closes #5011 from vinodkc/HIVE_console_startupError and squashes the following commits: b43925f [vinodkc] Changed order of import b4f5453 [Vinod K C] Fixed HIVE console startup issue
*	[SPARK-6285] [SQL] Removes unused ParquetTestData and duplicated ↵	Cheng Lian	2015-03-14	1	-466/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	TestGroupWriteSupport All the contents in this file are not referenced anywhere and should have been removed in #4116 when I tried to get rid of the old Parquet test suites. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5010) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #5010 from liancheng/spark-6285 and squashes the following commits: 06ed057 [Cheng Lian] Removes unused ParquetTestData and duplicated TestGroupWriteSupport