spark - Mirror of Apache Spark

	Commit message	Author	Age	Files	Lines
*	[SPARK-7156][SQL] support RandomSplit in DataFrames	Burak Yavuz	2015-04-29	2	-8/+16
	This is built on top of kaka1992's PR #5711 using logical plans. Author: Burak Yavuz <brkyvz@gmail.com> Closes #5761 from brkyvz/random-sample and squashes the following commits: a1fb0aa [Burak Yavuz] remove unrelated file 69669c3 [Burak Yavuz] fix broken test 1ddb3da [Burak Yavuz] copy base 6000328 [Burak Yavuz] added python api and fixed test 3c11d1b [Burak Yavuz] fixed broken test f400ade [Burak Yavuz] fix build errors 2384266 [Burak Yavuz] addressed comments v0.1 e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
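A rough illustration of what a weighted random split does, written as plain Python rather than against Spark. This is a sketch of the general idea only, not the code in this commit: `random_split` and its parameters are hypothetical names. Weights are normalized into cumulative bounds, and each row goes to the split whose interval contains a seeded uniform draw.

```python
import random
from itertools import accumulate


def random_split(rows, weights, seed=0):
    """Split rows into len(weights) lists whose sizes are roughly proportional to weights.

    Hypothetical helper for illustration; not a Spark API.
    """
    if not weights or any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative with a positive sum")
    total = float(sum(weights))
    # Upper bound of each split's interval inside [0, 1).
    bounds = list(accumulate(w / total for w in weights))
    bounds[-1] = 1.0  # guard against floating-point drift
    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        x = rng.random()  # uniform draw in [0.0, 1.0)
        for i, upper in enumerate(bounds):
            if x < upper:
                splits[i].append(row)
                break
    return splits


train, test = random_split(range(1000), [0.7, 0.3], seed=42)
```

Every row lands in exactly one split, and repeating the call with the same seed gives the same split.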
*	[SPARK-7076][SPARK-7077][SPARK-7080][SQL] Use managed memory for aggregations	Josh Rosen	2015-04-29	6	-0/+1194
	This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory. This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses. As a result, this allows for higher memory utilization. It can also result in better cache locality since objects will be stored closer together in memory.

	This feature can be enabled by setting `spark.sql.unsafe.enabled=true`. For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.

	### Managing memory with sun.misc.Unsafe

	This patch supports both on- and off-heap managed memory.

	- In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
	- In off-heap mode, memory is addressed directly with 64-bit long addresses.

	To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields. In off-heap mode, we simply pass `null` as `baseObject`. We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks.

	By default, we use on-heap mode. To enable off-heap mode, set `spark.unsafe.offHeap=true`.

	To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`. These classes work together to track allocations and detect memory leaks.

	### Compact tuple format

	This patch introduces `UnsafeRow`, a compact row layout. In this format, each tuple has three parts: a null bit set, fixed-length values, and variable-length values:

	![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)

	- Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
	- The bit set is used for null tracking:
	  - Position _i_ is set if and only if field _i_ is null
	  - The bit set is aligned to an 8-byte word boundary.
	- Every field appears as an 8-byte word in the fixed-length values part:
	  - If a field is null, we zero out the values.
	  - If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
	- Each variable-length data type can have its own encoding:
	  - For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes. If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.

	For example, a tuple that consists of 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this:

	![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)

	This format allows us to compare tuples for equality by directly comparing their raw bytes. This also enables fast hashing of tuples.

	### Hash map for performing aggregations

	This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-width. This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.

	`BytesToBytesMap` stores pointers to key and value tuples. For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.

	This map is implemented using open hashing with triangular sequence probing. Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode. By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions (since the chance of hashcode collision is much lower than position collision).

	`UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map. Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`. This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.

	Author: Josh Rosen <joshrosen@databricks.com> Closes #5725 from JoshRosen/unsafe and squashes the following commits: eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns. 81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate 1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing. 017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize() 50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication 70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager: 6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods. 
de5e001 [Josh Rosen] Fix debug vs. trace in logging message. a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention. 78a5b84 [Josh Rosen] Add logging to MemoryManager ce3c565 [Josh Rosen] More comments, formatting, and code cleanup. 529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds. 3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace 162caf7 [Josh Rosen] Fix test compilation b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size. a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv. 0925847 [Josh Rosen] Disable MiMa checks for new unsafe module cde4132 [Josh Rosen] Add missing pom.xml 9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap 6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow. 31eaabc [Josh Rosen] Lots of TODO and doc cleanup. a95291e [Josh Rosen] Cleanups to string handling code afe8dca [Josh Rosen] Some Javadoc cleanup f3dcbfe [Josh Rosen] More mod replacement 854201a [Josh Rosen] Import and comment cleanup 06e929d [Josh Rosen] More warning cleanup ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections 29a7575 [Josh Rosen] Remove debug logging 49aed30 [Josh Rosen] More long -> int conversion. b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation. 765243d [Josh Rosen] Enable optional performance metrics for hash map. 23a440a [Josh Rosen] Bump up default hash map size 628f936 [Josh Rosen] Use ints intead of longs for indexing. 92d5a06 [Josh Rosen] Address a number of minor code review comments. 1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests. d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...) 
b3eaccd [Josh Rosen] Extract aggregation map into its own class. d2bb986 [Josh Rosen] Update to implement new Row methods added upstream 58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable) 7df6008 [Josh Rosen] Optimizations related to zeroing out memory: c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free(): 738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed. 62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns. c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort" c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup f764d13 [Josh Rosen] Simplify address + length calculation in Location. 079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract. 1a483c5 [Josh Rosen] First version that passes some aggregation tests: fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate 53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters 1ff814d [Josh Rosen] Add reminder to free memory on iterator completion 8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration. 5d55cef [Josh Rosen] Add skeleton for Row implementation. f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods. ab68e08 [Josh Rosen] Begin merging the UTF8String implementations. 480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
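The compact tuple format described in the SPARK-7076 message above can be sketched in plain Python. This is a minimal illustration of the layout as the message describes it (null bit set, one 8-byte word per field, length-prefixed and padded strings), not the actual `UnsafeRow` code. The function name `encode_row`, the little-endian byte order, and the restriction to int and string fields are assumptions made for the example.

```python
import struct

WORD = 8  # every region is aligned to 8-byte words


def encode_row(fields):
    """Encode a list of None | int | str values into one bytes object.

    Hypothetical helper; byte order (little-endian) is an assumption.
    """
    n = len(fields)
    bitset_words = (n + 63) // 64          # one bit per field, rounded up to whole words
    bitset = [0] * bitset_words
    fixed = bytearray(n * WORD)            # one zero-initialized 8-byte word per field
    var = bytearray()                      # variable-length region
    var_base = (bitset_words + n) * WORD   # offset of the variable region from the row base
    for i, value in enumerate(fields):
        if value is None:
            bitset[i // 64] |= 1 << (i % 64)   # mark null; the fixed word stays zero
        elif isinstance(value, int):
            struct.pack_into("<q", fixed, i * WORD, value)
        else:
            data = value.encode("utf-8")
            # The fixed word stores the offset (relative to the row base) of the data.
            struct.pack_into("<q", fixed, i * WORD, var_base + len(var))
            var += struct.pack("<q", len(data))    # first word: length in bytes
            var += data
            var += b"\x00" * (-len(data) % WORD)   # pad to the next word boundary
    header = b"".join(struct.pack("<Q", w) for w in bitset)
    return header + bytes(fixed) + bytes(var)


row = encode_row([None, "data", "bricks"])
print(len(row))  # 64
```

For the (null, “data”, “bricks”) example this gives one bit-set word, three fixed words, and two 16-byte string entries, so two rows with equal contents are byte-for-byte equal.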
*	[SPARK-7188] added python support for math DataFrame functions	Burak Yavuz	2015-04-29	3	-117/+29
	Adds support for the math functions for DataFrames in PySpark. rxin I love Davies. Author: Burak Yavuz <brkyvz@gmail.com> Closes #5750 from brkyvz/python-math-udfs and squashes the following commits: 7c4f563 [Burak Yavuz] removed is_math 3c4adde [Burak Yavuz] cleanup imports d5dca3f [Burak Yavuz] moved math functions to mathfunctions 25e6534 [Burak Yavuz] addressed comments v2.0 d3f7e0f [Burak Yavuz] addressed comments and added tests 7b7d7c4 [Burak Yavuz] remove tests for removed methods 33c2c15 [Burak Yavuz] fixed python style 3ee0c05 [Burak Yavuz] added python functions
*	[SPARK-7215] made coalesce and repartition a part of the query plan	Burak Yavuz	2015-04-28	2	-1/+18
	Coalesce and repartition now show up as part of the query plan, rather than resulting in a new `DataFrame`. cc rxin Author: Burak Yavuz <brkyvz@gmail.com> Closes #5762 from brkyvz/df-repartition and squashes the following commits: b1e76dd [Burak Yavuz] added documentation on repartitions 5807e35 [Burak Yavuz] renamed coalescepartitions fa4509f [Burak Yavuz] rename coalesce 2c349b5 [Burak Yavuz] address comments f2e6af1 [Burak Yavuz] add ticks 686c90b [Burak Yavuz] made coalesce and repartition a part of the query plan
*	[SPARK-6829] Added math functions for DataFrames	Burak Yavuz	2015-04-27	5	-0/+455
	Implemented almost all math functions found in scala.math (max, min and abs were already present). cc mengxr marmbrus Author: Burak Yavuz <brkyvz@gmail.com> Closes #5616 from brkyvz/math-udfs and squashes the following commits: fb27153 [Burak Yavuz] reverted exception message 836a098 [Burak Yavuz] fixed test and addressed small comment e5f0d13 [Burak Yavuz] addressed code review v2.2 b26c5fb [Burak Yavuz] addressed review v2.1 2761f08 [Burak Yavuz] addressed review v2 6588a5b [Burak Yavuz] fixed merge conflicts b084e10 [Burak Yavuz] Addressed code review 029e739 [Burak Yavuz] fixed atan2 test 534cc11 [Burak Yavuz] added more tests, addressed comments fa68dbe [Burak Yavuz] added double specific test data 937d5a5 [Burak Yavuz] use doubles instead of ints 8e28fff [Burak Yavuz] Added apache header 7ec8f7f [Burak Yavuz] Added math functions for DataFrames
*	[SQL][Minor] rename DataTypeParser.apply to DataTypeParser.parse	wangfei	2015-04-26	3	-4/+4
	Rename `DataTypeParser.apply` to `DataTypeParser.parse` to make it clearer and more readable. /cc rxin Author: wangfei <wangfei1@huawei.com> Closes #5710 from scwf/apply and squashes the following commits: c319977 [wangfei] rename apply to parse
*	Fixed a typo from the previous commit.	Reynold Xin	2015-04-23	1	-1/+1
*	[SQL] Fixed expression data type matching.	Reynold Xin	2015-04-23	2	-10/+42
	Also took the chance to improve documentation for various types. Author: Reynold Xin <rxin@databricks.com> Closes #5675 from rxin/data-type-matching-expr and squashes the following commits: 0f31856 [Reynold Xin] One more function documentation. 27c1973 [Reynold Xin] Added more documentation. 336a36d [Reynold Xin] [SQL] Fixed expression data type matching.
*	[SQL] Break dataTypes.scala into multiple files.	Reynold Xin	2015-04-23	20	-1224/+1649
	It was over 1000 lines of code, making it harder to find all the types. This only moves code around and does not change any of it. Author: Reynold Xin <rxin@databricks.com> Closes #5670 from rxin/break-types and squashes the following commits: 8c59023 [Reynold Xin] Check in missing files. dcd5193 [Reynold Xin] [SQL] Break dataTypes.scala into multiple files.
*	[SPARK-7069][SQL] Rename NativeType -> AtomicType.	Reynold Xin	2015-04-23	7	-87/+93
	Also renamed JvmType to InternalType. Author: Reynold Xin <rxin@databricks.com> Closes #5651 from rxin/native-to-atomic-type and squashes the following commits: cbd4028 [Reynold Xin] [SPARK-7069][SQL] Rename NativeType -> AtomicType.
*	[SPARK-7068][SQL] Remove PrimitiveType	Reynold Xin	2015-04-22	1	-40/+30
	Author: Reynold Xin <rxin@databricks.com> Closes #5646 from rxin/remove-primitive-type and squashes the following commits: 01b673d [Reynold Xin] [SPARK-7068][SQL] Remove PrimitiveType
*	[SPARK-7066][MLlib] VectorAssembler should use NumericType not NativeType.	Reynold Xin	2015-04-22	1	-2/+2
	Author: Reynold Xin <rxin@databricks.com> Closes #5642 from rxin/mllib-native-type and squashes the following commits: e23af5b [Reynold Xin] Remove StringType 7cbb205 [Reynold Xin] [SPARK-7066][MLlib] VectorAssembler should use NumericType and StringType, not NativeType.
*	[SQL] Rename some apply functions.	Reynold Xin	2015-04-22	25	-95/+95
	I was looking at the code gen code and got confused by a few of the use cases of apply, in particular apply on objects. So I went ahead and changed a few of them. Hopefully they are slightly clearer with a proper verb. Author: Reynold Xin <rxin@databricks.com> Closes #5624 from rxin/apply-rename and squashes the following commits: ee45034 [Reynold Xin] [SQL] Rename some apply functions.
*	[SPARK-5817] [SQL] Fix bug of udtf with column names	Cheng Hao	2015-04-21	9	-66/+112
	This fixes a bug that occurs with queries like:

	```sql
	select d from (select explode(array(1,1)) d from src limit 1) t
	```

	which throw an exception like:

	```
	org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7
		at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
		at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
		at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
		at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
		at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
		at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
		at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
		at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
		at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
		at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
		at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
		at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
		at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
		at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
		at scala.collection.AbstractTraversable.map(Traversable.scala:105)
		at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
		at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
	```

	Solving the bug requires refactoring the UDTF code. The major changes are:

	* Simplifying UDTF development: a UDTF no longer manages the output attribute names; instead, `logical.Generate` handles that properly.
	* The UDTF is asked for its output schema (data types) during logical plan analysis.

	Author: Cheng Hao <hao.cheng@intel.com> Closes #4602 from chenghao-intel/explode_bug and squashes the following commits: c2a5132 [Cheng Hao] add back resolved for Alias 556e982 [Cheng Hao] revert the unncessary change 002c361 [Cheng Hao] change the rule of resolved for Generate 04ae500 [Cheng Hao] add qualifier only for generator output 5ee5d2c [Cheng Hao] prepend the new qualifier d2e8b43 [Cheng Hao] Update the code as feedback ca5e7f4 [Cheng Hao] shrink the commits
*	[SPARK-6996][SQL] Support map types in java beans	Punya Biswal	2015-04-21	1	-0/+20
	liancheng mengxr this is similar to #5146. Author: Punya Biswal <pbiswal@palantir.com> Closes #5578 from punya/feature/SPARK-6996 and squashes the following commits: d56c3e0 [Punya Biswal] Fix imports c7e308b [Punya Biswal] Support java iterable types in POJOs 5e00685 [Punya Biswal] Support map types in java beans
*	[SQL][minor] make it more clear that we only need to re-throw GetField exception for UnresolvedAttribute	Wenchen Fan	2015-04-21	1	-5/+4
	For `GetField` outside `UnresolvedAttribute`, we will throw an exception in `Analyzer`. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #5588 from cloud-fan/tmp and squashes the following commits: 7ac74d2 [Wenchen Fan] small refactor
*	[SPARK-6994] Allow to fetch field values by name in sql.Row	vidmantas zemleris	2015-04-21	5	-0/+127
	It looked weird that up to now there was no way in Spark's Scala API to access fields of `DataFrame/sql.Row` by name, only by their index. This tries to solve this issue. Author: vidmantas zemleris <vidmantas@vinted.com> Closes #5573 from vidma/features/row-with-named-fields and squashes the following commits: 6145ae3 [vidmantas zemleris] [SPARK-6994][SQL] Allow to fetch field values by name on Row 9564ebb [vidmantas zemleris] [SPARK-6994][SQL] Add fieldIndex to schema (StructType)
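The idea behind SPARK-6994, a schema that maps field names to positions so a row can be read by name, can be sketched in a few lines of Python. The class and method names below (`StructType`, `field_index`, `Row.get_as`) are simplified stand-ins chosen for this illustration; they are not the signatures added by the commit.

```python
class StructType:
    """A schema reduced to an ordered list of field names (illustrative only)."""

    def __init__(self, field_names):
        self.field_names = list(field_names)
        # Precompute name -> position once so lookups by name are O(1).
        self._index = {name: i for i, name in enumerate(self.field_names)}

    def field_index(self, name):
        try:
            return self._index[name]
        except KeyError:
            raise ValueError(f'field "{name}" does not exist') from None


class Row:
    """A row of values with an optional schema attached."""

    def __init__(self, values, schema=None):
        self.values = list(values)
        self.schema = schema

    def get(self, i):
        # Positional access, which works with or without a schema.
        return self.values[i]

    def get_as(self, name):
        # Access by name needs the schema to translate the name into an index.
        if self.schema is None:
            raise TypeError("field lookup by name requires a schema")
        return self.values[self.schema.field_index(name)]


schema = StructType(["name", "age"])
row = Row(["Ann", 30], schema)
print(row.get_as("age"))  # 30
```

Keeping the name lookup on the schema rather than on each row means rows stay plain value arrays and share a single index.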
*	[SPARK-7011] Build(compilation) fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.	Prashant Sharma	2015-04-21	1	-2/+2
	[This](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/VectorAssembler.scala#L58) is where the type is used and where compilation fails. Author: Prashant Sharma <prashant.s@imaginea.com> Closes #5593 from ScrapCodes/SPARK-7011/build-fix and squashes the following commits: e6d57a3 [Prashant Sharma] [SPARK-7011] Build fails with scala 2.11 option, because a protected[sql] type is accessed in ml package.
*	[SPARK-6899][SQL] Fix type mismatch when using codegen with Average on DecimalType	Liang-Chi Hsieh	2015-04-16	1	-1/+1
	JIRA https://issues.apache.org/jira/browse/SPARK-6899 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5517 from viirya/fix_codegen_average and squashes the following commits: 8ae5f65 [Liang-Chi Hsieh] Add the case of DecimalType.Unlimited to Average.
*	[SQL][Minor] Fix foreachUp of treenode	scwf	2015-04-16	2	-1/+13
	`foreachUp` should run the given function recursively on [[children]] and then on this node (just like `transformUp`). The current implementation does not follow this, which means `CheckAnalysis` does not check from the bottom of the logical tree. Author: scwf <wangfei1@huawei.com> Author: Fei Wang <wangfei1@huawei.com> Closes #5518 from scwf/patch-1 and squashes the following commits: 18e28b2 [scwf] added a test case 1ccbfa8 [Fei Wang] fix foreachUp
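The ordering the `foreachUp` fix asks for is a post-order traversal: children first, then the node itself. A minimal Python sketch, with a hypothetical `TreeNode` class standing in for Catalyst's tree nodes, shows the difference from a pre-order `foreach`:

```python
class TreeNode:
    """Minimal tree node for illustration; not Catalyst's TreeNode."""

    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

    def foreach(self, f):
        # Pre-order: apply f to this node, then recurse into the children.
        f(self)
        for child in self.children:
            child.foreach(f)

    def foreach_up(self, f):
        # Post-order: recurse into the children first, then apply f to this node.
        for child in self.children:
            child.foreach_up(f)
        f(self)


plan = TreeNode("project", [TreeNode("filter", [TreeNode("scan")])])

top_down, bottom_up = [], []
plan.foreach(lambda node: top_down.append(node.name))
plan.foreach_up(lambda node: bottom_up.append(node.name))
print(top_down)   # ['project', 'filter', 'scan']
print(bottom_up)  # ['scan', 'filter', 'project']
```

With the bottom-up order, a check that fails on a leaf is reported before any check on the nodes that depend on it.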
*	SPARK-6927 [SQL] Sorting Error when codegen on	云峤	2015-04-16	1	-1/+13
	Fix this error by adding a BinaryType comparator in GenerateOrdering. JIRA https://issues.apache.org/jira/browse/SPARK-6927 Author: 云峤 <chensong.cs@alibaba-inc.com> Closes #5524 from kaka1992/fix-codegen-sort and squashes the following commits: d7e2afe [云峤] fix codegen sorting error
*	[SPARK-2213] [SQL] sort merge join for spark sql	Daoyuan Wang	2015-04-15	2	-2/+21
	Thanks for the initial work from Ishiihara in #3173. This PR introduces a new join method, sort merge join, which first ensures that keys of the same value are in the same partition and that, inside each partition, the Rows are sorted by key. Then we can run down both sides together and find matched rows using [sort merge join](http://en.wikipedia.org/wiki/Sort-merge_join). In this way, we don't have to store the whole hash table of one side as hash join does, so we use less memory. Also, this PR would benefit from #3438, making the sorting phase much more efficient. We introduced a new configuration, "spark.sql.planner.sortMergeJoin", to switch between this (`true`) and ShuffledHashJoin (`false`); we probably want its default value to be `false` at first. Author: Daoyuan Wang <daoyuan.wang@intel.com> Author: Michael Armbrust <michael@databricks.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #5208 from adrian-wang/smj and squashes the following commits: 2493b9f [Daoyuan Wang] fix style 5049d88 [Daoyuan Wang] propagate rowOrdering for RangePartitioning f91a2ae [Daoyuan Wang] yin's comment: use external sort if option is enabled, add comments f515cd2 [Daoyuan Wang] yin's comment: outputOrdering, join suite refine ec8061b [Daoyuan Wang] minor change 413fd24 [Daoyuan Wang] Merge pull request #3 from marmbrus/pr/5208 952168a [Michael Armbrust] add type 5492884 [Michael Armbrust] copy when ordering 7ddd656 [Michael Armbrust] Cleanup addition of ordering requirements b198278 [Daoyuan Wang] inherit ordering in project c8e82a3 [Daoyuan Wang] fix style 6e897dd [Daoyuan Wang] hide boundReference from manually construct RowOrdering for key compare in smj 8681d73 [Daoyuan Wang] refactor Exchange and fix copy for sorting 2875ef2 [Daoyuan Wang] fix changed configuration 
61d7f49 [Daoyuan Wang] add omitted comment 00a4430 [Daoyuan Wang] fix bug 078d69b [Daoyuan Wang] address comments: add comments, do sort in shuffle, and others 3af6ba5 [Daoyuan Wang] use buffer for only one side 171001f [Daoyuan Wang] change default outputordering 47455c9 [Daoyuan Wang] add apache license ... a28277f [Daoyuan Wang] fix style 645c70b [Daoyuan Wang] address comments using sort 068c35d [Daoyuan Wang] fix new style and add some tests 925203b [Daoyuan Wang] address comments 07ce92f [Daoyuan Wang] fix ArrayIndexOutOfBound 42fca0e [Daoyuan Wang] code clean e3ec096 [Daoyuan Wang] fix comment style.. 2edd235 [Daoyuan Wang] fix outputpartitioning 57baa40 [Daoyuan Wang] fix sort eval bug 303b6da [Daoyuan Wang] fix several errors 95db7ad [Daoyuan Wang] fix brackets for if-statement 4464f16 [Daoyuan Wang] fix error 880d8e9 [Daoyuan Wang] sort merge join for spark sql
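The merge step that the SPARK-2213 message describes (both sides sorted by key, then walked together) can be sketched for a single partition in plain Python. This is an illustrative inner join over in-memory lists, not the operator added by the commit; `sort_merge_join` and its `key` parameter are hypothetical names, and the sort is done inline here where Spark would rely on shuffle and sort.

```python
def sort_merge_join(left, right, key=lambda row: row[0]):
    """Inner-join two lists of rows on key by sorting both and merging."""
    left = sorted(left, key=key)
    right = sorted(right, key=key)
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = key(left[i]), key(right[j])
        if lk < rk:
            i += 1          # left key too small: advance the left side
        elif lk > rk:
            j += 1          # right key too small: advance the right side
        else:
            # Find the run of right rows that share this key; only this run is buffered.
            j_end = j
            while j_end < len(right) and key(right[j_end]) == lk:
                j_end += 1
            # Pair every left row carrying the key with each row in the run.
            while i < len(left) and key(left[i]) == lk:
                for r in right[j:j_end]:
                    out.append((left[i], r))
                i += 1
            j = j_end
    return out


left = [(2, "b"), (1, "a"), (2, "c")]
right = [(2, "x"), (3, "y"), (2, "z")]
print(sort_merge_join(left, right))
```

Only the run of equal keys on one side is held at a time, which is where the memory saving over building a full hash table comes from.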
*	[SPARK-6898][SQL] completely support special chars in column names	Wenchen Fan	2015-04-15	6	-27/+39
	Even if we wrap column names in backticks like `` `a#$b.c` ``, we still handle the "." inside column name specially. I think it's fragile to use a special char to split name parts, why not put name parts in `UnresolvedAttribute` directly? Author: Wenchen Fan <cloud0fan@outlook.com> This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com> Closes #5511 from cloud-fan/6898 and squashes the following commits: 48e3e57 [Wenchen Fan] more style fix 820dc45 [Wenchen Fan] do not ignore newName in UnresolvedAttribute d81ad43 [Wenchen Fan] fix style 11699d6 [Wenchen Fan] completely support special chars in column names
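To illustrate why keeping name parts as a list is more robust than re-splitting a string on ".", here is a small Python sketch that splits a column reference once, honoring backticks. The function `parse_name_parts` and its error handling are assumptions made for this example and are not taken from the patch.

```python
def parse_name_parts(name):
    """Split a column reference into parts; dots inside backticks are literal."""
    parts, current, in_backticks = [], [], False
    for i, ch in enumerate(name):
        if in_backticks:
            if ch == "`":
                in_backticks = False
                # A closing backtick must be followed by a dot or the end of input.
                if i + 1 < len(name) and name[i + 1] != ".":
                    raise ValueError(f"unexpected character after backtick in {name!r}")
            else:
                current.append(ch)  # any character, including '.', is kept verbatim
        elif ch == "`":
            if current:
                raise ValueError(f"backtick in the middle of a name part in {name!r}")
            in_backticks = True
        elif ch == ".":
            parts.append("".join(current))  # an unquoted dot ends the current part
            current = []
        else:
            current.append(ch)
    if in_backticks:
        raise ValueError(f"unclosed backtick in {name!r}")
    parts.append("".join(current))
    return parts


print(parse_name_parts("`a#$b.c`.d"))  # ['a#$b.c', 'd']
```

Once the reference is held as `['a#$b.c', 'd']`, later stages never need to guess whether a dot was a separator or part of the column name.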
*	[SPARK-6638] [SQL] Improve performance of StringType in SQL	Davies Liu	2015-04-15	20	-157/+543
	This PR changes the internal representation of StringType from java.lang.String to UTF8String, which is implemented using Array[Byte]. This PR should not break any public API; Row.getString() will still return java.lang.String. This is the first step in improving the performance of String in SQL. cc rxin Author: Davies Liu <davies@databricks.com> Closes #5350 from davies/string and squashes the following commits: 3b7bfa8 [Davies Liu] fix schema of AddJar 2772f0d [Davies Liu] fix new test failure 6d776a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string 59025c8 [Davies Liu] address comments from @marmbrus 341ec2c [Davies Liu] turn off scala style check in UTF8StringSuite 744788f [Davies Liu] Merge branch 'master' of github.com:apache/spark into string b04a19c [Davies Liu] add comment for getString/setString 08d897b [Davies Liu] Merge branch 'master' of github.com:apache/spark into string 5116b43 [Davies Liu] rollback unrelated changes 1314a37 [Davies Liu] address comments from Yin 867bf50 [Davies Liu] fix String filter push down 13d9d42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string 2089d24 [Davies Liu] add hashcode check back ac18ae6 [Davies Liu] address comment fd11364 [Davies Liu] optimize UTF8String 8d17f21 [Davies Liu] fix hive compatibility tests e5fa5b8 [Davies Liu] remove clone in UTF8String 28f3d81 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string 28d6f32 [Davies Liu] refactor 537631c [Davies Liu] some comment about Date 9f4c194 [Davies Liu] convert data type for data source 956b0a4 [Davies Liu] fix hive tests 73e4363 [Davies Liu] Merge branch 'master' of github.com:apache/spark into string 9dc32d1 [Davies Liu] fix some hive tests 23a766c [Davies Liu] refactor 8b45864 [Davies Liu] fix codegen with UTF8String bb52e44 [Davies Liu] fix scala 
style c7dd4d2 [Davies Liu] fix some catalyst tests 38c303e [Davies Liu] fix python sql tests 5f9e120 [Davies Liu] fix sql tests 6b499ac [Davies Liu] fix style a85fb27 [Davies Liu] refactor d32abd1 [Davies Liu] fix utf8 for python api 4699c3a [Davies Liu] use Array[Byte] in UTF8String 21f67c6 [Davies Liu] cleanup 685fd07 [Davies Liu] use UTF8String instead of String for StringType
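A byte-array-backed string, as in the SPARK-6638 message, can answer many questions without decoding to a language-level string. The Python class below is a simplified sketch of that idea; the class shape and the method names `from_string`, `num_bytes` and `starts_with` are assumptions for illustration and do not mirror the real `UTF8String` API.

```python
class UTF8String:
    """A string kept as UTF-8 encoded bytes (illustrative sketch)."""

    def __init__(self, data: bytes):
        self._bytes = data

    @classmethod
    def from_string(cls, s: str) -> "UTF8String":
        return cls(s.encode("utf-8"))

    def num_bytes(self) -> int:
        return len(self._bytes)

    def __len__(self) -> int:
        # Count code points: every byte that is not a continuation byte (10xxxxxx)
        # starts a new character.
        return sum(1 for b in self._bytes if (b & 0xC0) != 0x80)

    def starts_with(self, prefix: "UTF8String") -> bool:
        # Prefix checks work directly on the encoded bytes.
        return self._bytes.startswith(prefix._bytes)

    def __eq__(self, other) -> bool:
        # Equality is a plain byte comparison; no decoding needed.
        return isinstance(other, UTF8String) and self._bytes == other._bytes

    def __hash__(self) -> int:
        return hash(self._bytes)

    def __str__(self) -> str:
        # Decode only at the public boundary, e.g. when handing a value back to the user.
        return self._bytes.decode("utf-8")


s = UTF8String.from_string("数据bricks")
print(len(s), s.num_bytes())  # 8 12
```

Decoding only in `__str__` parallels the message's point that `Row.getString()` can still return a regular string while the internal representation stays as bytes.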
*	[SPARK-6871][SQL] WITH clause in CTE can not following another WITH clause	Liang-Chi Hsieh	2015-04-14	1	-9/+9
	JIRA https://issues.apache.org/jira/browse/SPARK-6871 Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5480 from viirya/no_cte_after_cte and squashes the following commits: 4da3712 [Liang-Chi Hsieh] Create new test. 40b38ed [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into no_cte_after_cte 0edf568 [Liang-Chi Hsieh] for comments. 6591b79 [Liang-Chi Hsieh] WITH clause in CTE can not following another WITH clause.
*	[SPARK-6877][SQL] Add code generation support for Min	Liang-Chi Hsieh	2015-04-13	3	-0/+79
	Currently `min` is not supported in code generation. This PR adds support for it. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5487 from viirya/add_min_codegen and squashes the following commits: 0ddec23 [Liang-Chi Hsieh] Add code generation support for Min.
*	[SPARK-6130] [SQL] support if not exists for insert overwrite into partition in hiveQl	Daoyuan Wang	2015-04-13	4	-4/+6
	Standard syntax:
	INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1 FROM from_statement;
	INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1 FROM from_statement;

	Hive extension (multiple inserts):
	FROM from_statement
	INSERT OVERWRITE TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...) [IF NOT EXISTS]] select_statement1
	[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2]
	[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2] ...;
	FROM from_statement
	INSERT INTO TABLE tablename1 [PARTITION (partcol1=val1, partcol2=val2 ...)] select_statement1
	[INSERT INTO TABLE tablename2 [PARTITION ...] select_statement2]
	[INSERT OVERWRITE TABLE tablename2 [PARTITION ... [IF NOT EXISTS]] select_statement2] ...;

	Hive extension (dynamic partition inserts):
	INSERT OVERWRITE TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;
	INSERT INTO TABLE tablename PARTITION (partcol1[=val1], partcol2[=val2] ...) select_statement FROM from_statement;

	Author: Daoyuan Wang <daoyuan.wang@intel.com> Closes #4865 from adrian-wang/insertoverwrite and squashes the following commits: 2fce94f [Daoyuan Wang] add assert 10ea6f3 [Daoyuan Wang] add name for boolean parameter 0bbe9b9 [Daoyuan Wang] fix failure 4391154 [Daoyuan Wang] support if not exists for insert overwrite into partition in hiveQl
*	[HOTFIX] Add explicit return types to fix lint errors	Josh Rosen	2015-04-11	2	-2/+2
*	[SQL][minor] move `resolveGetField` into a object	Wenchen Fan	2015-04-11	3	-35/+39
	The method `resolveGetField` doesn't logically belong to `LogicalPlan` and doesn't access any of its members. Author: Wenchen Fan <cloud0fan@outlook.com> Closes #5435 from cloud-fan/tmp and squashes the following commits: 9a66c83 [Wenchen Fan] code clean up
*	[SPARK-6367][SQL] Use the proper data type for those expressions that are hijacking existing data types.	Yin Huai	2015-04-11	3	-10/+53
	This PR adds internal UDTs for expressions that are hijacking existing data types. The following UDTs are added:

	* `HyperLogLogUDT` (`BinaryType` as the SQL type) for `ApproxCountDistinctPartition`
	* `OpenHashSetUDT` (`ArrayType` as the SQL type) for `CollectHashSet`, `NewSet`, `AddItemToSet`, and `CombineSets`.

	I am also adding more unit tests for aggregation with code gen enabled. JIRA: https://issues.apache.org/jira/browse/SPARK-6367 Author: Yin Huai <yhuai@databricks.com> Closes #5094 from yhuai/expressionType and squashes the following commits: 8bcd11a [Yin Huai] Return types. 61a1d66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into expressionType e8b4599 [Yin Huai] Merge remote-tracking branch 'upstream/master' into expressionType 2753156 [Yin Huai] Ignore aggregations having sum functions for now. b5eb259 [Yin Huai] Case object for HyperLogLog type. 00ebdbd [Yin Huai] deserialize/serialize. 54b87ae [Yin Huai] Add UDTs for expressions that return HyperLogLog and OpenHashSet.
*	[SPARK-6199] [SQL] Support CTE in HiveContext and SQLContext	haiyang	2015-04-11	3	-8/+42
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Author: haiyang <huhaiyang@huawei.com> Closes #4929 from haiyangsea/cte and squashes the following commits: 220b67d [haiyang] add golden files for cte test d3c7681 [haiyang] Merge branch 'master' into cte-repair 0ba2070 [haiyang] modify code style 9ce6b58 [haiyang] fix conflict ff74741 [haiyang] add comment for With plan 0d56af4 [haiyang] code indention 776a440 [haiyang] add comments for resolve relation strategy 2fccd7e [haiyang] add comments for resolve relation strategy 241bbe2 [haiyang] fix cte problem of view e9e1237 [haiyang] fix test case problem 614182f [haiyang] add test cases for CTE feature 32e415b [haiyang] add comment 1cc8c15 [haiyang] support with 03f1097 [haiyang] support with e960099 [haiyang] support with 9aaa874 [haiyang] support with 0566978 [haiyang] support with a99ecd2 [haiyang] support with c3fa4c2 [haiyang] support with 3b6077f [haiyang] support with 5f8abe3 [haiyang] support with 4572b05 [haiyang] support with f801f54 [haiyang] support with
*	[Minor][SQL] Fix typo in sql	Guancheng (G.C.) Chen	2015-04-11	3	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \|	In this PR, "analyser" is changed to "analyzer" to keep a consistent naming. Some other typos are also fixed. Author: Guancheng (G.C.) Chen <chenguancheng@gmail.com> Closes #5474 from gchen/sql-typo and squashes the following commits: 70e6e76 [Guancheng (G.C.) Chen] Merge branch 'sql-typo' of github.com:gchen/spark into sql-typo fb7a6e2 [Guancheng (G.C.) Chen] fix typo in sql 37e3da1 [Guancheng (G.C.) Chen] fix type in sql
*	[SPARK-6611][SQL] Add support for INTEGER as synonym of INT.	Santiago M. Mola	2015-04-11	2	-1/+3
\| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-6611 Author: Santiago M. Mola <santiago.mola@sap.com> Closes #5271 from smola/features/integer-parse and squashes the following commits: f5c1c64 [Santiago M. Mola] [SPARK-6611] Add support for INTEGER as synonym of INT.
*	[SQL] [SPARK-6620] Speed up toDF() and rdd() functions by constructing ↵	Volodymyr Lyubinets	2015-04-10	7	-419/+867
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	converters in ScalaReflection cc marmbrus Author: Volodymyr Lyubinets <vlyubin@gmail.com> Closes #5279 from vlyubin/speedup and squashes the following commits: e75a387 [Volodymyr Lyubinets] Changes to ScalaUDF 11a20ec [Volodymyr Lyubinets] Avoid creating a tuple c327bc9 [Volodymyr Lyubinets] Moved the only remaining function from DataTypeConversions to DateUtils dec6802 [Volodymyr Lyubinets] Addresed review feedback 74301fa [Volodymyr Lyubinets] Addressed review comments afa3aa5 [Volodymyr Lyubinets] Minor refactoring, added license, removed debug output 881dc60 [Volodymyr Lyubinets] Moved to a separate module; addressed review comments; one extra place of usage; changed behaviour for Java 8cad6e2 [Volodymyr Lyubinets] Addressed review commments 41b2aa9 [Volodymyr Lyubinets] Creating converters for ScalaReflection stuff, and more
*	[SPARK-6758]block the right jetty package in log	WangTaoTheTonic	2015-04-09	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \|	https://issues.apache.org/jira/browse/SPARK-6758 I am not sure if it is ok to block them in test resources too (as we shade jetty in assembly?). Author: WangTaoTheTonic <wangtao111@huawei.com> Closes #5406 from WangTaoTheTonic/SPARK-6758 and squashes the following commits: e09605b [WangTaoTheTonic] block the right jetty package
*	[SPARK-6765] Fix test code style for SQL	Reynold Xin	2015-04-08	10	-83/+143
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	So we can turn style checker on for test code. Author: Reynold Xin <rxin@databricks.com> Closes #5412 from rxin/test-style-sql and squashes the following commits: 9098a31 [Reynold Xin] One more compilation error ... 8c7250a [Reynold Xin] Fix compilation. 82d0944 [Reynold Xin] Indentation. 0b03fbb [Reynold Xin] code review. f2f4348 [Reynold Xin] oops. ef4ec48 [Reynold Xin] Hive module. 7e0db5e [Reynold Xin] sql module 04ec7ac [Reynold Xin] catalyst module
*	[SQL][minor] remove duplicated resolveGetField and update comment	Wenchen Fan	2015-04-08	2	-39/+6
\| \| \| \| \| \| \| \| \| \|	It's after https://github.com/apache/spark/pull/5189 Author: Wenchen Fan <cloud0fan@outlook.com> Closes #5304 from cloud-fan/tmp and squashes the following commits: c58c9b3 [Wenchen Fan] remove duplicated code and update comment
*	[SPARK-6748] [SQL] Makes QueryPlan.schema a lazy val	Cheng Lian	2015-04-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	`DataFrame.collect()` calls `SparkPlan.executeCollect()`, which consists of a single line:

```scala
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
```

The problem is that `QueryPlan.schema` is a function, and since 1.3.0 `convertRowToScala` returns a `GenericRowWithSchema`. Thus, every `GenericRowWithSchema` instance holds a separate copy of the schema object. YJP profiling of the following simple micro benchmark (executed in the Spark shell) also shows that constructing the schema object takes up to ~35% of CPU time.

```scala
sc.parallelize(1 to 10000000).
  map(i => (i, s"val_$i")).
  toDF("key", "value").
  saveAsParquetFile("file:///tmp/src.parquet")

// Profiling started from this line
sqlContext.parquetFile("file:///tmp/src.parquet").collect()
```

Author: Cheng Lian <lian@databricks.com> Closes #5398 from liancheng/spark-6748 and squashes the following commits: 3159469 [Cheng Lian] Makes QueryPlan.schema a lazy val
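The difference between a schema that is recomputed on every access and one that is computed once can be sketched outside Spark. The following is a minimal Python illustration, not Spark code: `build_schema`, `PlanWithDef`, `PlanWithLazyVal`, and `count_builds` are hypothetical names, and `functools.cached_property` stands in for a Scala `lazy val`.

```python
from functools import cached_property


def build_schema(fields):
    """Stand-in for an expensive schema construction (hypothetical helper)."""
    build_schema.calls += 1
    return tuple(fields)


build_schema.calls = 0


class PlanWithDef:
    """Schema is rebuilt on every access, like a Scala `def`."""

    def __init__(self, fields):
        self.fields = fields

    @property
    def schema(self):
        return build_schema(self.fields)


class PlanWithLazyVal:
    """Schema is built on first access and then reused, like a Scala `lazy val`."""

    def __init__(self, fields):
        self.fields = fields

    @cached_property
    def schema(self):
        return build_schema(self.fields)


def count_builds(plan, rows):
    """Access plan.schema once per row and report how many schemas were built."""
    build_schema.calls = 0
    for _ in range(rows):
        _ = plan.schema
    return build_schema.calls
```

With the uncached property, `count_builds` grows linearly with the number of rows; with the cached one it stays at a single build, and every access returns the same object.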
*	[SPARK-6647][SQL] Make trait StringComparison as BinaryPredicate and fix ↵	Liang-Chi Hsieh	2015-04-03	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	unit tests of string data source Filter Currently, trait `StringComparison` is a `BinaryExpression`, but it should be a `BinaryPredicate`. Making `StringComparison` a `BinaryPredicate` lets us throw an error in `selectFilters` when an `expressions.Predicate` can't be translated to a data source `Filter`. Without this change, because `pruneFilterProjectRaw` wraps a `Filter` around the scanned results, we can't detect that something went wrong when translating predicates to filters in `selectFilters`. The unit test of #5285 demonstrates the problem: in that PR, even though `expressions.Contains` is not properly translated to `sources.StringContains`, the filtering is still performed by the `Filter`, so the test passes. Of course, with this change every `expressions.Predicate` class needs a corresponding data source `Filter`. There is also a small bug in `FilteredScanSuite` in the `StringEndsWith` filter, which this PR fixes. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #5309 from viirya/translate_predicate and squashes the following commits: b176385 [Liang-Chi Hsieh] Address comment. 275a493 [Liang-Chi Hsieh] More properly test for StringStartsWith, StringEndsWith and StringContains. caf2347 [Liang-Chi Hsieh] Make trait StringComparison as BinaryPredicate and throw error when Predicate can't translate to data source Filter.
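The idea of failing loudly on an untranslatable predicate, rather than letting a wrapping filter hide the gap, can be sketched as follows. This is a simplified Python model, not the Spark implementation: `Predicate`, `SourceFilter`, `TRANSLATIONS`, and this `select_filters` are hypothetical stand-ins, and only the filter names `StringStartsWith`, `StringEndsWith`, and `StringContains` come from the commit message.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predicate:
    """Engine-side predicate (hypothetical stand-in for an expression node)."""
    op: str          # e.g. "starts_with", "ends_with", "contains"
    attribute: str
    value: str


@dataclass(frozen=True)
class SourceFilter:
    """Data-source-side filter (hypothetical stand-in)."""
    kind: str
    attribute: str
    value: str


# Assumed mapping from predicate operators to source filter kinds.
TRANSLATIONS = {
    "starts_with": "StringStartsWith",
    "ends_with": "StringEndsWith",
    "contains": "StringContains",
}


def select_filters(predicates):
    """Translate every predicate, raising instead of silently skipping one."""
    filters = []
    for p in predicates:
        kind = TRANSLATIONS.get(p.op)
        if kind is None:
            raise ValueError(f"cannot translate predicate {p.op!r} to a source filter")
        filters.append(SourceFilter(kind, p.attribute, p.value))
    return filters
```

If a translation were silently skipped instead, a filter applied after the scan would still produce correct results, which is how the missing translation in #5285 went unnoticed by its test.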
*	[SPARK-5203][SQL] fix union with different decimal type	guowei2	2015-04-04	3	-69/+156
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When unioning non-decimal types with decimals, we use the following rules: - First convert integral types with `intTypeToFixed`; then a union of fixed decimals with precision/scale p1/s1 and p2/s2 is promoted to DecimalType(max(p1, p2), max(s1, s2)) - FLOAT and DOUBLE cause fixed-length decimals to turn into DOUBLE (this is the same as Hive, but note that unlimited decimals are considered bigger than doubles in WidenTypes) Author: guowei2 <guowei2@asiainfo.com> Closes #4004 from guowei2/SPARK-5203 and squashes the following commits: ff50f5f [guowei2] fix code style 11df1bf [guowei2] fix decimal union with double, double->Decimal(15,15) 0f345f9 [guowei2] fix structType merge with decimal 101ed4d [guowei2] fix build error after rebase 0b196e4 [guowei2] code style fe2c2ca [guowei2] handle union decimal precision in 'DecimalPrecision' 421d840 [guowei2] fix union types for decimal precision ef2c661 [guowei2] fix union with different decimal type
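The two rules above can be written down as a small Python sketch. It follows the commit message literally and is not the Spark implementation: the `DecimalType` tuple, the string type names, the `INT_TYPE_TO_FIXED` table (including its precision values), and `widen_for_union` are illustrative assumptions.

```python
from typing import NamedTuple


class DecimalType(NamedTuple):
    """Fixed decimal described by its precision and scale."""
    precision: int
    scale: int


# Assumed fixed-precision equivalents for integral types (illustrative values).
INT_TYPE_TO_FIXED = {
    "byte": DecimalType(3, 0),
    "short": DecimalType(5, 0),
    "int": DecimalType(10, 0),
    "long": DecimalType(20, 0),
}

FRACTIONAL = {"float", "double"}


def to_fixed(data_type):
    """Return a DecimalType unchanged, or map an integral type name to one."""
    if isinstance(data_type, DecimalType):
        return data_type
    if data_type in INT_TYPE_TO_FIXED:
        return INT_TYPE_TO_FIXED[data_type]
    raise ValueError(f"no fixed decimal equivalent for {data_type!r}")


def widen_for_union(left, right):
    """Pick the result type for a union where at least one side is a decimal."""
    if not isinstance(left, DecimalType) and not isinstance(right, DecimalType):
        raise ValueError("this sketch only covers unions that involve a decimal")
    # Rule 2: FLOAT and DOUBLE turn fixed decimals into DOUBLE.
    if left in FRACTIONAL or right in FRACTIONAL:
        return "double"
    # Rule 1: integral types become fixed decimals, then take the maxima.
    l, r = to_fixed(left), to_fixed(right)
    return DecimalType(max(l.precision, r.precision), max(l.scale, r.scale))
```

For example, `widen_for_union(DecimalType(10, 2), DecimalType(5, 4))` yields `DecimalType(10, 4)`, and a union of `"int"` with `DecimalType(5, 2)` yields `DecimalType(10, 2)` under the assumed table.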
*	[SPARK-6428] Turn on explicit type checking for public methods.	Reynold Xin	2015-04-03	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This builds on my earlier pull requests and turns on the explicit type checking in scalastyle. Author: Reynold Xin <rxin@databricks.com> Closes #5342 from rxin/SPARK-6428 and squashes the following commits: 7b531ab [Reynold Xin] import ordering 2d9a8a5 [Reynold Xin] jl e668b1c [Reynold Xin] override 9b9e119 [Reynold Xin] Parenthesis. 82e0cf5 [Reynold Xin] [SPARK-6428] Turn on explicit type checking for public methods.
*	[SPARK-6243][SQL] The Operation of match did not conside the scenarios that ↵	DoingDone9	2015-04-02	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	order.dataType does not match NativeType The match did not consider the case where `order.dataType` is not a `NativeType`, so this adds a "case other => ..." branch for those scenarios. Author: DoingDone9 <799203320@qq.com> Closes #4959 from DoingDone9/case_ and squashes the following commits: 6278846 [DoingDone9] Update rows.scala cb1852d [DoingDone9] Merge pull request #2 from apache/master c3f046f [DoingDone9] Merge pull request #1 from apache/master
*	[SPARK-6672][SQL] convert row to catalyst in createDataFrame(RDD[Row], ...)	Xiangrui Meng	2015-04-02	1	-0/+5
\| \| \| \| \| \| \| \| \| \| \| \|	We assume that `RDD[Row]` contains Scala types. So we need to convert them into catalyst types in createDataFrame. liancheng Author: Xiangrui Meng <meng@databricks.com> Closes #5329 from mengxr/SPARK-6672 and squashes the following commits: 2d52644 [Xiangrui Meng] set needsConversion = false in jsonRDD 06896e4 [Xiangrui Meng] add createDataFrame without conversion 4a3767b [Xiangrui Meng] convert Row to catalyst
*	[SPARK-6663] [SQL] use Literal.create instread of constructor	Davies Liu	2015-04-01	10	-170/+176
\| \| \| \| \| \| \| \| \| \| \|	In order to do inbound checking and type conversion, we should use Literal.create() instead of constructor. Author: Davies Liu <davies@databricks.com> Closes #5320 from davies/literal and squashes the following commits: 1667604 [Davies Liu] fix style and add comment 5f8c0fd [Davies Liu] use Literal.create instread of constructor
*	[SPARK-5371][SQL] Propagate types after function conversion, before futher ↵	Michael Armbrust	2015-03-31	2	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \|	resolution Before it was possible for a query to flip back and forth from a resolved state, allowing resolution to propagate up before coercion had stabilized. The issue was that `ResolvedReferences` would run after `FunctionArgumentConversion`, but before `PropagateTypes` had run. This PR ensures we correctly `PropagateTypes` after any coercion has applied. Author: Michael Armbrust <michael@databricks.com> Closes #5278 from marmbrus/unionNull and squashes the following commits: dc3581a [Michael Armbrust] [SPARK-5371][SQL] Propogate types after function conversion / before futher resolution
*	[SPARK-6145][SQL] fix ORDER BY on nested fields	Michael Armbrust	2015-03-31	5	-43/+162
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This PR is based on work by cloud-fan in #4904, but with two differences: - We isolate the logic for Sort's special handling into `ResolveSortReferences` - We avoid creating UnresolvedGetField expressions during resolution. Instead we either resolve GetField or we return None. This avoids us going down the wrong path early on. Author: Michael Armbrust <michael@databricks.com> Closes #5189 from marmbrus/nestedOrderBy and squashes the following commits: b8cae45 [Michael Armbrust] fix another test 0f36a11 [Michael Armbrust] WIP 91820cd [Michael Armbrust] Fix bug.
*	[SPARK-6542][SQL] add CreateStruct	Xiangrui Meng	2015-03-31	3	-23/+73
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Similar to `CreateArray`, we can add `CreateStruct` to create nested columns. marmbrus Author: Xiangrui Meng <meng@databricks.com> Closes #5195 from mengxr/SPARK-6542 and squashes the following commits: 3795c57 [Xiangrui Meng] update error message ae7ac3e [Xiangrui Meng] move unit test to a separate suite 85dd559 [Xiangrui Meng] use NamedExpr c78e31a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-6542 85f3106 [Xiangrui Meng] add CreateStruct
*	[SPARK-6119][SQL] DataFrame support for missing data handling	Reynold Xin	2015-03-30	1	-1/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API. Author: Reynold Xin <rxin@databricks.com> Closes #5274 from rxin/df-missing-value and squashes the following commits: 4ee1b98 [Reynold Xin] Improve error reporting in Python. 33a330c [Reynold Xin] Remove replace for now. bc4fdbb [Reynold Xin] Added documentation for replace. d56f5a5 [Reynold Xin] Added replace for Scala/Java. 2385d00 [Reynold Xin] Feedback from Xiangrui on "how". 914a374 [Reynold Xin] fill with map. 185c67e [Reynold Xin] Allow specifying column subsets in fill. 749eb47 [Reynold Xin] fillna 249b94e [Reynold Xin] Removing undefined functions. 6a73c68 [Reynold Xin] Missing file. 67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
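The drop and fill semantics can be modeled on plain Python rows. This is a simplified sketch, not the DataFrame API: `drop_na` and `fill_na` are hypothetical helpers, rows are dictionaries, and only `None` is treated as missing, whereas the real API may also apply type-specific rules.

```python
def drop_na(rows, how="any", subset=None):
    """Drop rows with missing values.

    how="any" drops a row if any considered column is None;
    how="all" drops it only if every considered column is None.
    """
    if how not in ("any", "all"):
        raise ValueError("how must be 'any' or 'all'")
    check = any if how == "any" else all
    kept = []
    for row in rows:
        cols = subset if subset is not None else list(row.keys())
        if not check(row[c] is None for c in cols):
            kept.append(row)
    return kept


def fill_na(rows, value, subset=None):
    """Replace None values, returning new rows and leaving the input unchanged.

    value is either a single replacement (applied to `subset`, or to all
    columns when subset is None) or a dict mapping column name to replacement.
    """
    filled = []
    for row in rows:
        new_row = dict(row)
        if isinstance(value, dict):
            replacements = value
        else:
            cols = subset if subset is not None else list(row.keys())
            replacements = {c: value for c in cols}
        for col, replacement in replacements.items():
            if col in new_row and new_row[col] is None:
                new_row[col] = replacement
        filled.append(new_row)
    return filled
```

The `how` and `subset` parameters mirror the options mentioned in the squashed commits ("how", column subsets, fill with map); their exact behavior in Spark should be checked against the API documentation.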
*	[SPARK-6592][SQL] fix filter for scaladoc to generate API doc for Row class ↵	CodingCat	2015-03-30	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	under catalyst dir https://issues.apache.org/jira/browse/SPARK-6592 The current implementation in SparkBuild.scala filters out all classes under the catalyst directory. However, the Row class is a public API under that directory, so we need to include Row in the scaladoc while still excluding the other classes of the catalyst project. Thanks for the help on this patch from rxin and liancheng Author: CodingCat <zhunansjtu@gmail.com> Closes #5252 from CodingCat/SPARK-6592 and squashes the following commits: 02098a4 [CodingCat] ignore collection, enable types (except those protected classes) f7af2cb [CodingCat] commit 3ab4403 [CodingCat] fix filter for scaladoc to generate API doc for Row.scala under catalyst directory
*	[SPARK-6595][SQL] MetastoreRelation should be a MultiInstanceRelation	Michael Armbrust	2015-03-30	2	-2/+10
\| \| \| \| \| \| \| \| \| \|	Now that we have `DataFrame`s, it is possible to have multiple copies of a `MetastoreRelation` in a single query plan. As such, `MetastoreRelation` needs to inherit from `MultiInstanceRelation` or self joins will break. I also add better debugging errors for when our self join handling fails, in case there are future bugs. Author: Michael Armbrust <michael@databricks.com> Closes #5251 from marmbrus/multiMetaStore and squashes the following commits: 4272f6d [Michael Armbrust] [SPARK-6595][SQL] MetastoreRelation should be MuliInstanceRelation