path: root/sql
Commit message · Author · Date · Files · Lines (-/+)
* SPARK-6338 [CORE] Use standard temp dir mechanisms in tests to avoid … · Sean Owen · 2015-03-20 · 14 files · -80/+61
| | | | | | | | | | | | | | | orphaned temp files Use `Utils.createTempDir()` to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify Author: Sean Owen <sowen@cloudera.com> Closes #5029 from srowen/SPARK-6338 and squashes the following commits: 27b740a [Sean Owen] Fix hive-thriftserver tests that don't expect an existing dir 4a212fa [Sean Owen] Standardize a bit more temp dir management 9004081 [Sean Owen] Revert some added recursive-delete calls 57609e4 [Sean Owen] Use Utils.createTempDir() to replace other temp file mechanisms used in some tests, to further ensure they are cleaned up, and simplify
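For illustration of the commit above, a minimal sketch of the test pattern it standardizes on, assuming access to the private[spark] `Utils` object as Spark's own test suites have:
```scala
import java.io.File

import org.apache.spark.util.Utils

// Create a temp dir that Spark also registers for deletion at JVM shutdown,
// instead of ad-hoc java.io.tmpdir handling in each test.
val tempDir: File = Utils.createTempDir()
try {
  val fixture = new File(tempDir, "data.txt")
  // ... write test fixtures under tempDir ...
} finally {
  Utils.deleteRecursively(tempDir) // explicit cleanup; the shutdown hook is only a safety net
}
```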
* [SPARK-6247][SQL] Fix resolution of ambiguous joins caused by new aliases · Michael Armbrust · 2015-03-17 · 9 files · -12/+96
| | | | | | | | | | | | | | | | | | | | | | | We need to handle ambiguous `exprId`s that are produced by new aliases as well as those caused by leaf nodes (`MultiInstanceRelation`). Attempting to fix this revealed a bug in `equals` for `Alias` as these objects were comparing equal even when the expression ids did not match. Additionally, `LocalRelation` did not correctly provide statistics, and some tests in `catalyst` and `hive` were not using the helper functions for comparing plans. Based on #4991 by chenghao-intel Author: Michael Armbrust <michael@databricks.com> Closes #5062 from marmbrus/selfJoins and squashes the following commits: 8e9b84b [Michael Armbrust] check qualifier too 8038a36 [Michael Armbrust] handle aggs too 0b9c687 [Michael Armbrust] fix more tests c3c574b [Michael Armbrust] revert change. 725f1ab [Michael Armbrust] add statistics a925d08 [Michael Armbrust] check for conflicting attributes in join resolution b022ef7 [Michael Armbrust] Handle project aliases. d8caa40 [Michael Armbrust] test case: SPARK-6247 f9c67c2 [Michael Armbrust] Check for duplicate attributes in join resolution. 898af73 [Michael Armbrust] Fix Alias equality.
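As an illustration of the kind of query involved (hypothetical data, spark-shell style with `sc` in scope): a self-join in which a new alias reintroduces a column name that also exists on the other side.
```scala
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
import sqlContext.implicits._

val df = sc.parallelize(Seq((1, "a"), (2, "b"))).toDF("key", "value")
// The alias creates a new expression for "key"; before this fix its exprId
// could collide with df's own "key" during join resolution.
val aliased = df.select(($"key" + 1).as("key"), $"value")
df.join(aliased, df("key") === aliased("key")).explain(true)
```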
* [SPARK-5651][SQL] Add input64 in blacklist and add test suite for create … · watermen · 2015-03-17 · 6 files · -1/+513
| | | | | | | | | | | | | | | | | | table within backticks Now spark version is only support ```create table table_in_database_creation.test1 as select * from src limit 1;``` in HiveContext. This patch is used to support ```create table `table_in_database_creation.test2` as select * from src limit 1;``` in HiveContext. Author: watermen <qiyadong2010@gmail.com> Author: q00251598 <qiyadong@huawei.com> Closes #4427 from watermen/SPARK-5651 and squashes the following commits: c5c8ed1 [watermen] add the generated golden files 1f0e42e [q00251598] add input64 in blacklist and add test suit
* [SPARK-5404] [SQL] Update the default statistic number · Cheng Hao · 2015-03-17 · 1 file · -1/+11
| | | | | | | | | | | | | By default, the statistic for logical plan with multiple children is quite aggressive, and those statistic are quite critical for the join optimization, hence we need to estimate the statistics as accurate as possible. For `Union`, which has 2 children, and overwrite the default implementation by `adding` its children `byteInSize` instead of `multiplying`. For `Expand`, which only has a single child, but it will grows the size, and we need to multiply its inflating factor. Author: Cheng Hao <hao.cheng@intel.com> Closes #4914 from chenghao-intel/statistic and squashes the following commits: d466bbc [Cheng Hao] Update the default statistic
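A conceptual sketch of the estimates being changed (plain Scala, not the Catalyst source): the generic default multiplies child sizes, which is far too aggressive for `Union`, while `Expand` should scale its single child's size by its inflation factor.
```scala
// Size estimates in bytes; names are illustrative only.
def defaultSizeInBytes(childSizes: Seq[BigInt]): BigInt = childSizes.product  // generic default
def unionSizeInBytes(childSizes: Seq[BigInt]): BigInt   = childSizes.sum      // what the patch uses for Union
def expandSizeInBytes(childSize: BigInt, projections: Int): BigInt = childSize * projections

val gb = BigInt(1L << 30)
defaultSizeInBytes(Seq(gb, gb))  // 2^60 bytes: grossly overestimates a 2 GB Union
unionSizeInBytes(Seq(gb, gb))    // 2 GB
```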
* [SPARK-5908][SQL] Resolve UdtfsAlias when only single Alias is used · Liang-Chi Hsieh · 2015-03-17 · 2 files · -0/+9
| | | | | | | | | | | | | `ResolveUdtfsAlias` in `hiveUdfs` only considers the `HiveGenericUdtf` with multiple alias. When only single alias is used with `HiveGenericUdtf`, the alias is not working. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4692 from viirya/udft_alias and squashes the following commits: 8a3bae4 [Liang-Chi Hsieh] No need to test selected column from DataFrame since DataFrame API is updated. 160a379 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into udft_alias e6531cc [Liang-Chi Hsieh] Selected column from DataFrame should not re-analyze logical plan. a45cc2a [Liang-Chi Hsieh] Resolve UdtfsAlias when only single Alias is used.
* [SPARK-6330] [SQL] Add a test case for SPARK-6330 · Pei-Lun Lee · 2015-03-18 · 1 file · -0/+13
| | | | | | | | | | | | | | | When getting file statuses, create file system from each path instead of a single one from hadoop configuration. Author: Pei-Lun Lee <pllee@appier.com> Closes #5039 from ypcat/spark-6351 and squashes the following commits: a19a3fe [Pei-Lun Lee] [SPARK-6330] [SQL] fix test 506f5a0 [Pei-Lun Lee] [SPARK-6351] [SQL] fix test fa2290e [Pei-Lun Lee] [SPARK-6330] [SQL] Rename test case and add comment 606c967 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6351 896e80a [Pei-Lun Lee] [SPARK-6351] [SQL] Add test case 2ae0916 [Pei-Lun Lee] [SPARK-6351] [SQL] ParquetRelation2 supporting multiple file systems
* [SQL][docs][minor] Fixed sample code in SQLContext scaladoc · Lomig Mégard · 2015-03-16 · 1 file · -2/+2
| | | | | | | | | | Error in the code sample of the `implicits` object in `SQLContext`. Author: Lomig Mégard <lomig.megard@gmail.com> Closes #5051 from tarfaa/simple and squashes the following commits: 5a88acc [Lomig Mégard] [docs][minor] Fixed sample code in SQLContext scaladoc
* [SPARK-5712] [SQL] fix comment with semicolon at end · Daoyuan Wang · 2015-03-17 · 4 files · -13/+19
| | | | | | | | | | | | | | ---- comment; Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4500 from adrian-wang/semicolon and squashes the following commits: 70b8abb [Daoyuan Wang] use mkstring instead of reduce 2d49738 [Daoyuan Wang] remove outdated golden file 317346e [Daoyuan Wang] only skip comment with semicolon at end of line, to avoid golden file outdated d3ae01e [Daoyuan Wang] fix error a11602d [Daoyuan Wang] fix comment with semicolon at end
* [SPARK-6330] Fix filesystem bug in newParquet relation · Volodymyr Lyubinets · 2015-03-16 · 1 file · -2/+3
| | | | | | | | | | | If I'm running this locally and my path points to S3, this would currently error out because of incorrect FS. I tested this in a scenario that previously didn't work, this change seemed to fix the issue. Author: Volodymyr Lyubinets <vlyubin@gmail.com> Closes #5020 from vlyubin/parquertbug and squashes the following commits: a645ad5 [Volodymyr Lyubinets] Fix filesystem bug in newParquet relation
* [SPARK-2087] [SQL] Multiple thriftserver sessions with single HiveContext … · Cheng Hao · 2015-03-17 · 8 files · -105/+353
| | | | | | | | | | | | | | | | | instance Still, we keep only a single HiveContext within ThriftServer, and we also create a object called `SQLSession` for isolating the different user states. Developers can obtain/release a new user session via `openSession` and `closeSession`, and `SQLContext` and `HiveContext` will also provide a default session if no `openSession` called, for backward-compatibility. Author: Cheng Hao <hao.cheng@intel.com> Closes #4885 from chenghao-intel/multisessions_singlecontext and squashes the following commits: 1c47b2a [Cheng Hao] rename the tss => tlSession 815b27a [Cheng Hao] code style issue 57e3fa0 [Cheng Hao] openSession is not compatible between Hive0.12 & 0.13.1 4665b0d [Cheng Hao] thriftservice with single context
* [SPARK-6285][SQL] Remove ParquetTestData in SparkBuild.scala and in README.md · OopsOutOfMemory · 2015-03-15 · 1 file · -1/+0
| | | | | | | | | | | | | | | This is a following clean up PR for #5010 This will resolve issues when launching `hive/console` like below: ``` <console>:20: error: object ParquetTestData is not a member of package org.apache.spark.sql.parquet import org.apache.spark.sql.parquet.ParquetTestData ``` Author: OopsOutOfMemory <victorshengli@126.com> Closes #5032 from OopsOutOfMemory/SPARK-6285 and squashes the following commits: 2996aeb [OopsOutOfMemory] remove ParquetTestData
* [SPARK-6195] [SQL] Adds in-memory column type for fixed-precision decimals · Cheng Lian · 2015-03-14 · 11 files · -76/+179
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This PR adds a specialized in-memory column type for fixed-precision decimals. For all other column types, a single integer column type ID is enough to determine which column type to use. However, this doesn't apply to fixed-precision decimal types with different precision and scale parameters. Moreover, according to the previous design, there seems no trivial way to encode precision and scale information into the columnar byte buffer. On the other hand, considering we always know the data type of the column to be built / scanned ahead of time. This PR no longer use column type ID to construct `ColumnBuilder`s and `ColumnAccessor`s, but resorts to the actual column data type. In this way, we can pass precision / scale information along the way. The column type ID is now not used anymore and can be removed in a future PR. ### Micro benchmark result The following micro benchmark builds a simple table with 2 million decimals (precision = 10, scale = 0), cache it in memory, then count all the rows. Code (simply paste it into Spark shell): ```scala import sc._ import sqlContext._ import sqlContext.implicits._ import org.apache.spark.sql.types._ import com.google.common.base.Stopwatch def benchmark(n: Int)(f: => Long) { val stopwatch = new Stopwatch() def run() = { stopwatch.reset() stopwatch.start() f stopwatch.stop() stopwatch.elapsedMillis() } val records = (0 until n).map(_ => run()) (0 until n).foreach(i => println(s"Round $i: ${records(i)} ms")) println(s"Average: ${records.sum / n.toDouble} ms") } // Explicit casting is required because ScalaReflection can't inspect decimal precision parallelize(1 to 2000000) .map(i => Tuple1(Decimal(i, 10, 0))) .toDF("dec") .select($"dec" cast DecimalType(10, 0)) .registerTempTable("dec") sql("CACHE TABLE dec") val df = table("dec") // Warm up df.count() df.count() benchmark(5) { df.count() } ``` With `FIXED_DECIMAL` column type: - Round 0: 75 ms - Round 1: 97 ms - Round 2: 75 ms - Round 3: 70 ms - Round 4: 72 ms - Average: 77.8 ms Without `FIXED_DECIMAL` column type: - Round 0: 1233 ms - Round 1: 1170 ms - Round 2: 1171 ms - Round 3: 1141 ms - Round 4: 1141 ms - Average: 1171.2 ms <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4938) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4938 from liancheng/decimal-column-type and squashes the following commits: fef5338 [Cheng Lian] Updates fixed decimal column type related test cases e08ab5b [Cheng Lian] Only resorts to FIXED_DECIMAL when the value can be held in a long 4db713d [Cheng Lian] Adds in-memory column type for fixed-precision decimals
* [SQL] Delete some duplicate code in HiveThriftServer2 · ArcherShao · 2015-03-14 · 1 file · -7/+5
| | | | | | | | | | Author: ArcherShao <ArcherShao@users.noreply.github.com> Author: ArcherShao <shaochuan@huawei.com> Closes #5007 from ArcherShao/20150313 and squashes the following commits: ae422ae [ArcherShao] Updated 459efbd [ArcherShao] [SQL]Delete some dupliate code in HiveThriftServer2
* [SPARK-6210] [SQL] use prettyString as column name in agg() · Davies Liu · 2015-03-14 · 2 files · -5/+5
| | | | | | | | | | use prettyString instead of toString() (which include id of expression) as column name in agg() Author: Davies Liu <davies@databricks.com> Closes #5006 from davies/prettystring and squashes the following commits: cb1fdcf [Davies Liu] use prettyString as column name in agg()
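A spark-shell illustration of the effect (hypothetical data; assumes `sc` and `sqlContext` are in scope):
```scala
import sqlContext.implicits._
import org.apache.spark.sql.functions._

val people = sc.parallelize(Seq(("alice", 30), ("bob", 25))).toDF("name", "age")
// With toString the generated column name embedded the expression id (e.g. "MAX(age#123)");
// prettyString yields a stable, readable name such as "MAX(age)".
people.agg(max($"age")).columns
```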
* [SPARK-6317][SQL] Fixed HIVE console startup issue · vinodkc · 2015-03-14 · 1 file · -1/+1
| | | | | | | | | | Author: vinodkc <vinod.kc.in@gmail.com> Author: Vinod K C <vinod.kc@huawei.com> Closes #5011 from vinodkc/HIVE_console_startupError and squashes the following commits: b43925f [vinodkc] Changed order of import b4f5453 [Vinod K C] Fixed HIVE console startup issue
* [SPARK-6285] [SQL] Removes unused ParquetTestData and duplicated … · Cheng Lian · 2015-03-14 · 1 file · -466/+0
| | | | | | | | | | | | | | | | TestGroupWriteSupport All the contents in this file are not referenced anywhere and should have been removed in #4116 when I tried to get rid of the old Parquet test suites. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5010) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #5010 from liancheng/spark-6285 and squashes the following commits: 06ed057 [Cheng Lian] Removes unused ParquetTestData and duplicated TestGroupWriteSupport
* [SPARK-6296] [SQL] Added equals to Column · Volodymyr Lyubinets · 2015-03-12 · 2 files · -0/+12
| | | | | | | | Author: Volodymyr Lyubinets <vlyubin@gmail.com> Closes #4988 from vlyubin/columncomp and squashes the following commits: 92d7c8f [Volodymyr Lyubinets] Added equals to Column
* SPARK-6245 [SQL] jsonRDD() of empty RDD results in exception · Sean Owen · 2015-03-11 · 3 files · -1/+15
| | | | | | | | | | | | | | | | | | | | | | Avoid `UnsupportedOperationException` from JsonRDD.inferSchema on empty RDD. Not sure if this is supposed to be an error (but a better one), but it seems like this case can come up if the input is down-sampled so much that nothing is sampled. Now stuff like this: ``` sqlContext.jsonRDD(sc.parallelize(List[String]())) ``` just results in ``` org.apache.spark.sql.DataFrame = [] ``` Author: Sean Owen <sowen@cloudera.com> Closes #4971 from srowen/SPARK-6245 and squashes the following commits: 3699964 [Sean Owen] Set() -> Set.empty 3c619e1 [Sean Owen] Avoid UnsupportedOperationException from JsonRDD.inferSchema on empty RDD
* SPARK-6225 [CORE] [SQL] [STREAMING] Resolve most build warnings, 1.3.0 edition · Sean Owen · 2015-03-11 · 3 files · -2/+3
| | | | | | | | | | | Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc. Author: Sean Owen <sowen@cloudera.com> Closes #4950 from srowen/SPARK-6225 and squashes the following commits: 3080972 [Sean Owen] Ordered imports: Java, Scala, 3rd party, Spark c67985b [Sean Owen] Resolve javac, scalac warnings of various types -- deprecations, Scala lang, unchecked cast, etc.
* [SQL][Minor] fix typo in comments · Hongbo Liu · 2015-03-11 · 1 file · -1/+1
| | | | | | | | | | Removed an repeated "from" in the comments. Author: Hongbo Liu <liuhb86@gmail.com> Closes #4976 from liuhb86/mine and squashes the following commits: e280e7c [Hongbo Liu] [SQL][Minor] fix typo in comments
* Minor doc: Remove the extra blank line in data types javadoc. · Reynold Xin · 2015-03-10 · 1 file · -18/+6
| | | | | | | | | | The extra blank line is preventing the first lines from showing up in the package summary page. Author: Reynold Xin <rxin@databricks.com> Closes #4955 from rxin/datatype-docs and squashes the following commits: 1621114 [Reynold Xin] Minor doc: Remove the extra blank line in data types javadoc.
* [SQL] Make Strategies a public developer API · Michael Armbrust · 2015-03-05 · 1 file · -2/+5
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4920 from marmbrus/openStrategies and squashes the following commits: cbc35c0 [Michael Armbrust] [SQL] Make Strategies a public developer API
* [SPARK-6163][SQL] jsonFile should be backed by the data source API · Yin Huai · 2015-03-05 · 2 files · -8/+32
| | | | | | | | | | | | jira: https://issues.apache.org/jira/browse/SPARK-6163 Author: Yin Huai <yhuai@databricks.com> Closes #4896 from yhuai/SPARK-6163 and squashes the following commits: 45e023e [Yin Huai] Address @chenghao-intel's comment. 2e8734e [Yin Huai] Use JSON data source for jsonFile. 92a4a33 [Yin Huai] Test.
* [SPARK-6145][SQL] fix ORDER BY on nested fields · Wenchen Fan · 2015-03-05 · 3 files · -3/+14
| | | | | | | | | | | | | | | Based on #4904 with style errors fixed. `LogicalPlan#resolve` will not only produce `Attribute`, but also "`GetField` chain". So in `ResolveSortReferences`, after resolve the ordering expressions, we should not just collect the `Attribute` results, but also `Attribute` at the bottom of "`GetField` chain". Author: Wenchen Fan <cloud0fan@outlook.com> Author: Michael Armbrust <michael@databricks.com> Closes #4918 from marmbrus/pr/4904 and squashes the following commits: 997f84e [Michael Armbrust] fix style 3eedbfc [Wenchen Fan] fix 6145
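A spark-shell example of the shape of query fixed here (hypothetical schema; assumes `sc` and `sqlContext` are in scope): the ordering expression resolves to a `GetField` chain over a struct column rather than a plain `Attribute`.
```scala
import sqlContext.implicits._

case class Address(city: String, zip: String)
case class Person(name: String, address: Address)

val people = sc.parallelize(Seq(
  Person("alice", Address("berkeley", "94709")),
  Person("bob", Address("amsterdam", "1011"))
)).toDF()
people.registerTempTable("people")

// Sorting on a nested field previously failed to resolve in ResolveSortReferences.
sqlContext.sql("SELECT name FROM people ORDER BY address.city").show()
```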
* SPARK-6182 [BUILD] spark-parent pom needs to be published for both 2.10 and 2.11 · Sean Owen · 2015-03-05 · 4 files · -4/+4
| | | | | | | | | | Option 1 of 2: Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11 Author: Sean Owen <sowen@cloudera.com> Closes #4912 from srowen/SPARK-6182.1 and squashes the following commits: eff60de [Sean Owen] Convert spark-parent module name to spark-parent_2.10 / spark-parent_2.11
* [SPARK-6153] [SQL] promote guava dep for hive-thriftserver · Daoyuan Wang · 2015-03-05 · 1 file · -0/+5
| | | | | | | | | | | | | For package thriftserver, guava is used at runtime. /cc pwendell Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4884 from adrian-wang/test and squashes the following commits: 4600ae7 [Daoyuan Wang] only promote for thriftserver 44dda18 [Daoyuan Wang] promote guava dep for hive
* [SPARK-6134][SQL] Fix wrong datatype for casting FloatType and default … · Liang-Chi Hsieh · 2015-03-04 · 1 file · -2/+2
| | | | | | | | | | | | | | LongType value in defaultPrimitive In `CodeGenerator`, the casting on `FloatType` should use `FloatType` instead of `IntegerType`. Besides, `defaultPrimitive` for `LongType` should be `-1L` instead of `1L`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4870 from viirya/codegen_type and squashes the following commits: 76311dd [Liang-Chi Hsieh] Fix wrong datatype for casting on FloatType. Fix the wrong value for LongType in defaultPrimitive.
* [SPARK-6136] [SQL] Removed JDBC integration tests which depend on docker-client · Cheng Lian · 2015-03-04 · 4 files · -432/+0
| | | | | | | | | | | | | | | | | | | Integration test suites in the JDBC data source (`MySQLIntegration` and `PostgresIntegration`) depend on docker-client 2.7.5, which transitively depends on Guava 17.0. Unfortunately, Guava 17.0 is causing test runtime binary compatibility issues when Spark is compiled against Hive 0.12.0, or Hadoop 2.4. Considering `MySQLIntegration` and `PostgresIntegration` are ignored right now, I'd suggest moving them from the Spark project to the [Spark integration tests] [1] project. This PR removes both the JDBC data source integration tests and the docker-client test dependency. [1]: |https://github.com/databricks/spark-integration-tests <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4872) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4872 from liancheng/remove-docker-client and squashes the following commits: 1f4169e [Cheng Lian] Removes DockerHacks 159b24a [Cheng Lian] Removed JDBC integration tests which depends on docker-client
* [SPARK-5310][SQL] Fixes to Docs and Datasources API · Reynold Xin · 2015-03-02 · 21 files · -124/+98
| | | | | | | | | | | | | | | | - Various Fixes to docs - Make data source traits actually interfaces Based on #4862 but with fixed conflicts. Author: Reynold Xin <rxin@databricks.com> Author: Michael Armbrust <michael@databricks.com> Closes #4868 from marmbrus/pr/4862 and squashes the following commits: fe091ea [Michael Armbrust] Merge remote-tracking branch 'origin/master' into pr/4862 0208497 [Reynold Xin] Test fixes. 34e0a28 [Reynold Xin] [SPARK-5310][SQL] Various fixes to Spark SQL docs.
* [SPARK-5950][SQL] Insert array into a metastore table saved as parquet should … · Yin Huai · 2015-03-02 · 15 files · -35/+327
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | work when using datasource api This PR contains the following changes: 1. Add a new method, `DataType.equalsIgnoreCompatibleNullability`, which is the middle ground between DataType's equality check and `DataType.equalsIgnoreNullability`. For two data types `from` and `to`, it does `equalsIgnoreNullability` as well as if the nullability of `from` is compatible with that of `to`. For example, the nullability of `ArrayType(IntegerType, containsNull = false)` is compatible with that of `ArrayType(IntegerType, containsNull = true)` (for an array without null values, we can always say it may contain null values). However, the nullability of `ArrayType(IntegerType, containsNull = true)` is incompatible with that of `ArrayType(IntegerType, containsNull = false)` (for an array that may have null values, we cannot say it does not have null values). 2. For the `resolved` field of `InsertIntoTable`, use `equalsIgnoreCompatibleNullability` to replace the equality check of the data types. 3. For our data source write path, when appending data, we always use the schema of existing table to write the data. This is important for parquet, since nullability direct impacts the way to encode/decode values. If we do not do this, we may see corrupted values when reading values from a set of parquet files generated with different nullability settings. 4. When generating a new parquet table, we always set nullable/containsNull/valueContainsNull to true. So, we will not face situations that we cannot append data because containsNull/valueContainsNull in an Array/Map column of the existing table has already been set to `false`. This change makes the whole data pipeline more robust. 5. Update the equality check of JSON relation. Since JSON does not really cares nullability, `equalsIgnoreNullability` seems a better choice to compare schemata from to JSON tables. JIRA: https://issues.apache.org/jira/browse/SPARK-5950 Thanks viirya for the initial work in #4729. cc marmbrus liancheng Author: Yin Huai <yhuai@databricks.com> Closes #4826 from yhuai/insertNullabilityCheck and squashes the following commits: 3b61a04 [Yin Huai] Revert change on equals. 80e487e [Yin Huai] asNullable in UDT. 587d88b [Yin Huai] Make methods private. 0cb7ea2 [Yin Huai] marmbrus's comments. 3cec464 [Yin Huai] Cheng's comments. 486ed08 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck d3747d1 [Yin Huai] Remove unnecessary change. 8360817 [Yin Huai] Merge remote-tracking branch 'upstream/master' into insertNullabilityCheck 8a3f237 [Yin Huai] Use equalsIgnoreNullability instead of equality check. 0eb5578 [Yin Huai] Fix tests. f6ed813 [Yin Huai] Update old parquet path. e4f397c [Yin Huai] Unit tests. b2c06f8 [Yin Huai] Ignore nullability in JSON relation's equality check. 8bd008b [Yin Huai] nullable, containsNull, and valueContainsNull will be always true for parquet data. bf50d73 [Yin Huai] When appending data, we use the schema of the existing table instead of the schema of the new data. 0a703e7 [Yin Huai] Test failed again since we cannot read correct content. 9a26611 [Yin Huai] Make InsertIntoTable happy. 8f19fe5 [Yin Huai] equalsIgnoreCompatibleNullability 4ec17fd [Yin Huai] Failed test.
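A small sketch of the compatibility rule described in point 1, restricted to arrays for brevity (plain Scala against the public types API, not the Spark implementation):
```scala
import org.apache.spark.sql.types._

// Writing an array that never contains null into a slot whose element type is
// nullable is safe; the reverse direction is not.
def arrayWriteCompatible(from: ArrayType, to: ArrayType): Boolean =
  from.elementType == to.elementType && (to.containsNull || !from.containsNull)

arrayWriteCompatible(ArrayType(IntegerType, containsNull = false),
                     ArrayType(IntegerType, containsNull = true))   // true
arrayWriteCompatible(ArrayType(IntegerType, containsNull = true),
                     ArrayType(IntegerType, containsNull = false))  // false
```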
* [SPARK-6082] [SQL] Provides better error message for malformed rows when … · Cheng Lian · 2015-03-02 · 1 file · -0/+11
| | | | | | | | | | | | | | | | caching tables Constructs like Hive `TRANSFORM` may generate malformed rows (via badly authored external scripts for example). I'm a bit hesitant to have this feature, since it introduces per-tuple cost when caching tables. However, considering caching tables is usually a one-time cost, this is probably worth having. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4842) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4842 from liancheng/spark-6082 and squashes the following commits: b05dbff [Cheng Lian] Provides better error message for malformed rows when caching tables
* [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved · Michael Armbrust · 2015-03-02 · 2 files · -0/+14
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4855 from marmbrus/explodeBug and squashes the following commits: a712249 [Michael Armbrust] [SPARK-6114][SQL] Avoid metastore conversions before plan is resolved
* [SPARK-6040][SQL] Fix the percent bug in tablesample · q00251598 · 2015-03-02 · 2 files · -1/+11
| | | | | | | | | | HiveQL expression like `select count(1) from src tablesample(1 percent);` means take 1% sample to select. But it means 100% in the current version of the Spark. Author: q00251598 <qiyadong@huawei.com> Closes #4789 from watermen/SPARK-6040 and squashes the following commits: 2453ebe [q00251598] check and adjust the fraction.
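A one-line sketch of the conversion the fix performs (illustrative only, not the Spark source): the `n PERCENT` clause has to become a 0–1 sampling fraction.
```scala
// "TABLESAMPLE(1 PERCENT)" should sample roughly 1% of rows, not 100%.
def percentToFraction(percent: Double): Double = {
  require(percent >= 0.0 && percent <= 100.0, s"Sampling percentage out of range: $percent")
  percent / 100.0
}

percentToFraction(1.0)   // 0.01
```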
* [Minor] Fix doc typo for describing primitiveTerm effectiveness condition · Liang-Chi Hsieh · 2015-03-02 · 1 file · -1/+1
| | | | | | | | | | It should be `true` instead of `false`? Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4762 from viirya/doc_fix and squashes the following commits: 2e37482 [Liang-Chi Hsieh] Fix doc.
* [DOCS] Refactored Dataframe join comment to use correct parameter ordering · Paul Power · 2015-03-02 · 1 file · -2/+2
| | | | | | | | | | | The API signatire for join requires the JoinType to be the third parameter. The code examples provided for join show JoinType being provided as the 2nd parater resuling in errors (i.e. "df1.join(df2, "outer", $"df1Key" === $"df2Key") ). The correct sample code is df1.join(df2, $"df1Key" === $"df2Key", "outer") Author: Paul Power <paul.power@peerside.com> Closes #4847 from peerside/master and squashes the following commits: ebc1efa [Paul Power] Merge pull request #1 from peerside/peerside-patch-1 e353340 [Paul Power] Updated comments use correct sample code for Dataframe joins
* [SPARK-5741][SQL] Support paths that contain commas in HiveContext · q00251598 · 2015-03-02 · 18 files · -1/+2511
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When run ```select * from nzhang_part where hr = 'file,';```, it throws exception ```java.lang.IllegalArgumentException: Can not create a Path from an empty string``` . Because the path of hdfs contains comma, and FileInputFormat.setInputPaths will split path by comma. ### SQL ``` set hive.merge.mapfiles=true; set hive.merge.mapredfiles=true; set hive.input.format=org.apache.hadoop.hive.ql.io.HiveInputFormat; set hive.exec.dynamic.partition=true; set hive.exec.dynamic.partition.mode=nonstrict; create table nzhang_part like srcpart; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select key, value, hr from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr=11) select key, value from srcpart where ds='2008-04-08'; insert overwrite table nzhang_part partition (ds='2010-08-15', hr) select * from ( select key, value, hr from srcpart where ds='2008-04-08' union all select '1' as key, '1' as value, 'file,' as hr from src limit 1) s; select * from nzhang_part where hr = 'file,'; ``` ### Error Log ``` 15/02/10 14:33:16 ERROR SparkSQLDriver: Failed in [select * from nzhang_part where hr = 'file,'] java.lang.IllegalArgumentException: Can not create a Path from an empty string at org.apache.hadoop.fs.Path.checkPathArg(Path.java:127) at org.apache.hadoop.fs.Path.<init>(Path.java:135) at org.apache.hadoop.util.StringUtils.stringToPath(StringUtils.java:241) at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:400) at org.apache.spark.sql.hive.HadoopTableReader$.initializeLocalJobConfFunc(TableReader.scala:251) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.sql.hive.HadoopTableReader$$anonfun$11.apply(TableReader.scala:229) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:172) at scala.Option.map(Option.scala:145) at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:172) at org.apache.spark.rdd.HadoopRDD.getPartitions(HadoopRDD.scala:196) Author: q00251598 <qiyadong@huawei.com> Closes #4532 from watermen/SPARK-5741 and squashes the following commits: 9758ab1 [q00251598] fix bug 1db1a1c [q00251598] use setInputPaths(Job job, Path... inputPaths) b788a72 [q00251598] change FileInputFormat.setInputPaths to jobConf.set and add test suite
* [SPARK-6052][SQL] In JSON schema inference, we should always set containsNull … · Yin Huai · 2015-03-02 · 2 files · -24/+23
| | | | | | | | | | | | | | of an ArrayType to true Always set `containsNull = true` when infer the schema of JSON datasets. If we set `containsNull` based on records we scanned, we may miss arrays with null values when we do sampling. Also, because future data can have arrays with null values, if we convert JSON data to parquet, always setting `containsNull = true` is a more robust way to go. JIRA: https://issues.apache.org/jira/browse/SPARK-6052 Author: Yin Huai <yhuai@databricks.com> Closes #4806 from yhuai/jsonArrayContainsNull and squashes the following commits: 05eab9d [Yin Huai] Change containsNull to true.
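A spark-shell illustration (hypothetical records; assumes `sc` and `sqlContext` are in scope): only one record carries a null inside the array, so sampling-based inference could miss it, while this change always reports `containsNull = true`.
```scala
val records = sc.parallelize(Seq(
  """{"scores": [1, 2, 3]}""",
  """{"scores": [4, null, 6]}"""
))
// Inferred schema: scores: array<long> (containsNull = true),
// regardless of which records the sampler happens to look at.
sqlContext.jsonRDD(records).printSchema()
```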
* [SPARK-6073][SQL] Need to refresh metastore cache after append data in … · Yin Huai · 2015-03-02 · 2 files · -0/+54
| | | | | | | | | | | | | | CreateMetastoreDataSourceAsSelect JIRA: https://issues.apache.org/jira/browse/SPARK-6073 liancheng Author: Yin Huai <yhuai@databricks.com> Closes #4824 from yhuai/refreshCache and squashes the following commits: b9542ef [Yin Huai] Refresh metadata cache in the Catalog in CreateMetastoreDataSourceAsSelect.
* [SPARK-6074] [sql] Package pyspark sql bindings. · Marcelo Vanzin · 2015-03-01 · 1 file · -0/+8
| | | | | | | | | | This is needed for the SQL bindings to work on Yarn. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #4822 from vanzin/SPARK-6074 and squashes the following commits: fb52001 [Marcelo Vanzin] [SPARK-6074] [sql] Package pyspark sql bindings.
* [SPARK-5775] [SQL] BugFix: GenericRow cannot be cast to SpecificMutableRow … · Cheng Lian · 2015-02-28 · 3 files · -24/+217
| | | | | | | | | | | | | | | | | | | | | when nested data and partitioned table This PR adapts anselmevignon's #4697 to master and branch-1.3. Please refer to PR description of #4697 for details. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4792) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Author: Cheng Lian <liancheng@users.noreply.github.com> Author: Yin Huai <yhuai@databricks.com> Closes #4792 from liancheng/spark-5775 and squashes the following commits: 538f506 [Cheng Lian] Addresses comments cee55cf [Cheng Lian] Merge pull request #4 from yhuai/spark-5775-yin b0b74fb [Yin Huai] Remove runtime pattern matching. ca6e038 [Cheng Lian] Fixes SPARK-5775
* [SPARK-5751] [SQL] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift … · Cheng Lian · 2015-02-28 · 1 file · -2/+11
| | | | | | | | | | | | | | | | | | | server test suites This is a follow-up of #4720. By default, `spark-daemon.sh` writes PID files under `/tmp`, which makes it impossible to start multiple server instances simultaneously. This PR sets `SPARK_PID_DIR` to Spark home directory to workaround this problem. Many thanks to chenghao-intel for pointing out this issue! <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4758) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4758 from liancheng/thriftserver-pid-dir and squashes the following commits: 252fa0f [Cheng Lian] Uses temporary directory as Thrift server PID directory 1b3d1e3 [Cheng Lian] Sets SPARK_HOME as SPARK_PID_DIR when running Thrift server test suites
* [SPARK-6024][SQL] When a data source table has too many columns, its schema … · Yin Huai · 2015-02-26 · 3 files · -6/+54
| | | | | | | | | | | | | | | | | cannot be stored in metastore. JIRA: https://issues.apache.org/jira/browse/SPARK-6024 Author: Yin Huai <yhuai@databricks.com> Closes #4795 from yhuai/wideSchema and squashes the following commits: 4882e6f [Yin Huai] Address comments. 73e71b4 [Yin Huai] Address comments. 143927a [Yin Huai] Simplify code. cc1d472 [Yin Huai] Make the schema wider. 12bacae [Yin Huai] If the JSON string of a schema is too large, split it before storing it in metastore. e9b4f70 [Yin Huai] Failed test.
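A sketch of the workaround's core idea (the property keys below are hypothetical, for illustration only): a schema JSON string that exceeds the metastore's value-length limit is split into fixed-size pieces and stored under numbered keys.
```scala
import org.apache.spark.sql.types.{IntegerType, StructField, StructType}

val schema = StructType((1 to 5000).map(i => StructField(s"c$i", IntegerType)))
val schemaJson = schema.json                 // far too long for a single metastore property
val threshold = 4000                         // illustrative chunk size
val parts = schemaJson.grouped(threshold).toSeq

// Hypothetical property keys, for illustration only.
val tableProps =
  Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
    parts.zipWithIndex.map { case (part, i) => s"spark.sql.sources.schema.part.$i" -> part }
```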
* [SPARK-6037][SQL] Avoiding duplicate Parquet schema merging · Liang-Chi Hsieh · 2015-02-27 · 1 file · -16/+7
| | | | | | | | | | `FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, it is duplicate because the schemas are already merged in `ParquetRelation2`. We don't need to re-merge them at `InputFormat`. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits: ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
* [SPARK-6007][SQL] Add numRows param in DataFrame.show() · Jacky Li · 2015-02-26 · 3 files · -3/+24
| | | | | | | | | | | | | | It is useful to let the user decide the number of rows to show in DataFrame.show Author: Jacky Li <jacky.likun@huawei.com> Closes #4767 from jackylk/show and squashes the following commits: a0e0f4b [Jacky Li] fix testcase 7cdbe91 [Jacky Li] modify according to comment bb54537 [Jacky Li] for Java compatibility d7acc18 [Jacky Li] modify according to comments 981be52 [Jacky Li] add numRows param in DataFrame.show()
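Usage sketch (hypothetical data, spark-shell style with `sc` and `sqlContext` in scope):
```scala
import sqlContext.implicits._

val df = sc.parallelize(1 to 100).map(i => (i, s"name_$i")).toDF("id", "name")
df.show()    // default: first 20 rows
df.show(5)   // caller-chosen: first 5 rows
```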
* [SPARK-6016][SQL] Cannot read the parquet table after overwriting the … · Yin Huai · 2015-02-27 · 3 files · -42/+42
| | | | | | | | | | | | | existing table when spark.sql.parquet.cacheMetadata=true Please see JIRA (https://issues.apache.org/jira/browse/SPARK-6016) for details of the bug. Author: Yin Huai <yhuai@databricks.com> Closes #4775 from yhuai/parquetFooterCache and squashes the following commits: 78787b1 [Yin Huai] Remove footerCache in FilteringParquetRowInputFormat. dff6fba [Yin Huai] Failed unit test.
* [SPARK-6023][SQL] ParquetConversions fails to replace the destination … · Yin Huai · 2015-02-26 · 2 files · -7/+152
| | | | | | | | | | | | | | | MetastoreRelation of an InsertIntoTable node to ParquetRelation2 JIRA: https://issues.apache.org/jira/browse/SPARK-6023 Author: Yin Huai <yhuai@databricks.com> Closes #4782 from yhuai/parquetInsertInto and squashes the following commits: ae7e806 [Yin Huai] Convert MetastoreRelation in InsertIntoTable and InsertIntoHiveTable. ba543cd [Yin Huai] More tests. 50b6d0f [Yin Huai] Update error messages. 346780c [Yin Huai] Failed test.
* [SPARK-5926] [SQL] make DataFrame.explain leverage queryExecution.logical · Yanbo Liang · 2015-02-25 · 1 file · -1/+1
| | | | | | | | | | | | | | | | | DataFrame.explain return wrong result when the query is DDL command. For example, the following two queries should print out the same execution plan, but it not. sql("create table tb as select * from src where key > 490").explain(true) sql("explain extended create table tb as select * from src where key > 490") This is because DataFrame.explain leverage logicalPlan which had been forced executed, we should use the unexecuted plan queryExecution.logical. Author: Yanbo Liang <ybliang8@gmail.com> Closes #4707 from yanboliang/spark-5926 and squashes the following commits: fa6db63 [Yanbo Liang] logicalPlan is not lazy 0e40a1b [Yanbo Liang] make DataFrame.explain leverage queryExecution.logical
* [SPARK-5999][SQL] Remove duplicate Literal matching block · Liang-Chi Hsieh · 2015-02-25 · 2 files · -19/+3
| | | | | | | | Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #4760 from viirya/dup_literal and squashes the following commits: 06e7516 [Liang-Chi Hsieh] Remove duplicate Literal matching block.
* [SPARK-6010] [SQL] Merging compatible Parquet schemas before computing splits · Cheng Lian · 2015-02-25 · 3 files · -1/+45
| | | | | | | | | | | | | | | | `ReadContext.init` calls `InitContext.getMergedKeyValueMetadata`, which doesn't know how to merge conflicting user defined key-value metadata and throws exception. In our case, when dealing with different but compatible schemas, we have different Spark SQL schema JSON strings in different Parquet part-files, thus causes this problem. Reading similar Parquet files generated by Hive doesn't suffer from this issue. In this PR, we manually merge the schemas before passing it to `ReadContext` to avoid the exception. <!-- Reviewable:start --> [<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/4768) <!-- Reviewable:end --> Author: Cheng Lian <lian@databricks.com> Closes #4768 from liancheng/spark-6010 and squashes the following commits: 9002f0a [Cheng Lian] Fixes SPARK-6010
* [SPARK-5996][SQL] Fix specialized outbound conversions · Michael Armbrust · 2015-02-25 · 3 files · -5/+20
| | | | | | | | Author: Michael Armbrust <michael@databricks.com> Closes #4757 from marmbrus/udtConversions and squashes the following commits: 3714aad [Michael Armbrust] [SPARK-5996][SQL] Fix specialized outbound conversions