path: root/sql/core
Commit message | Author | Age | Files | Lines
...
* [SPARK-7240][SQL] Single pass covariance calculation for dataframes (Burak Yavuz, 2015-05-01, 4 files, -3/+114)
Added the calculation of covariance between two columns to DataFrames. cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5825 from brkyvz/df-cov and squashes the following commits: cb18046 [Burak Yavuz] changed to sample covariance f2e862b [Burak Yavuz] fixed failed test 51e39b8 [Burak Yavuz] moved implementation 0c6a759 [Burak Yavuz] addressed math comments 8456eca [Burak Yavuz] fix pyStyle3 aa2ad29 [Burak Yavuz] fix pyStyle2 4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov e3b0b85 [Burak Yavuz] addressed comments v0.1 a7115f1 [Burak Yavuz] fix python style 7dc6dbc [Burak Yavuz] reorder imports 408cb77 [Burak Yavuz] initial commit
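A minimal usage sketch of the new API (Spark 1.4-era Scala; the data and column names are made up for illustration):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???  // an existing SQLContext
import sqlContext.implicits._

val df = Seq((1.0, 2.0), (2.0, 3.9), (3.0, 6.1)).toDF("a", "b")
// Sample covariance of the two columns, computed in a single pass:
val cov: Double = df.stat.cov("a", "b")
```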
* [SPARK-7274] [SQL] Create Column expression for array/struct creation. (Reynold Xin, 2015-05-01, 2 files, -1/+124)
Author: Reynold Xin <rxin@databricks.com>
Closes #5802 from rxin/SPARK-7274 and squashes the following commits: 19aecaa [Reynold Xin] Fixed unicode tests. bfc1538 [Reynold Xin] Export all Python functions. 2517b8c [Reynold Xin] Code review. 23da335 [Reynold Xin] Fixed Python bug. 132002e [Reynold Xin] Fixed tests. 56fce26 [Reynold Xin] Added Python support. b0d591a [Reynold Xin] Fixed debug error. 86926a6 [Reynold Xin] Added test suite. 7dbb9ab [Reynold Xin] Ok one more. 470e2f5 [Reynold Xin] One more MLlib ... e2d14f0 [Reynold Xin] [SPARK-7274][SQL] Create Column expression for array/struct creation.
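A short sketch of the new expressions (column names are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{array, col, struct}

val df: DataFrame = ???  // an existing DataFrame with columns "a" and "b"
val combined = df.select(
  array(col("a"), col("b")).as("arr"),     // ArrayType column built from two columns
  struct(col("a"), col("b")).as("record")  // StructType column bundling both fields
)
```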
* [SPARK-4550] In sort-based shuffle, store map outputs in serialized form (Sandy Ryza, 2015-04-30, 1 file, -9/+29)
Refer to the JIRA for the design doc and some perf results. I wanted to call out some of the possibly more controversial changes up front:
* Map outputs are only stored in serialized form when Kryo is in use. I'm still unsure whether Java-serialized objects can be relocated. At the very least, Java serialization writes out a stream header which causes problems with the current approach, so I decided to leave investigating this to future work.
* The shuffle now explicitly operates on key-value pairs instead of any object. Data is written to shuffle files in alternating keys and values instead of key-value tuples. `BlockObjectWriter.write` now accepts a key argument and a value argument instead of any object.
* The map output buffer can hold a max of Integer.MAX_VALUE bytes, though this wouldn't be terribly difficult to change.
* When spilling occurs, the objects that are still in memory at merge time end up serialized and deserialized an extra time.
Author: Sandy Ryza <sandy@cloudera.com>
Closes #4450 from sryza/sandy-spark-4550 and squashes the following commits: 8c70dd9 [Sandy Ryza] Fix serialization 9c16fe6 [Sandy Ryza] Fix a couple tests and move getAutoReset to KryoSerializerInstance 6c54e06 [Sandy Ryza] Fix scalastyle d8462d8 [Sandy Ryza] SPARK-4550
* [SPARK-7248] implemented random number generators for DataFrames (Burak Yavuz, 2015-04-30, 4 files, -3/+54)
Adds the functions `rand` (uniform distribution) and `randn` (normal distribution) as expressions to DataFrames. cc mengxr rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5819 from brkyvz/df-rng and squashes the following commits: 50d69d4 [Burak Yavuz] add seed for test that failed 4234c3a [Burak Yavuz] fix Rand expression 13cad5c [Burak Yavuz] couple fixes 7d53953 [Burak Yavuz] waiting for hive tests b453716 [Burak Yavuz] move radn with seed down 03637f0 [Burak Yavuz] fix broken hive func c5909eb [Burak Yavuz] deleted old implementation of Rand 6d43895 [Burak Yavuz] implemented random generators
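A sketch of the resulting API (the seed value and column names are illustrative):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, rand, randn}

val df: DataFrame = ???  // an existing DataFrame with a column "a"
// Append a uniformly distributed column and a seeded standard-normal column:
val withNoise = df.select(col("a"), rand().as("u"), randn(42).as("n"))
```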
* Revert "[SPARK-5213] [SQL] Pluggable SQL Parser Support"Patrick Wendell2015-04-303-93/+17
| | | | This reverts commit 3ba5aaab8266822545ac82b9e733fd25cc215a77.
* [SPARK-7123] [SQL] support table.star in sqlcontext (scwf, 2015-04-30, 1 file, -0/+10)
Running the following SQL produces an error: `SELECT r.* FROM testData l join testData2 r on (l.key = r.a)`
Author: scwf <wangfei1@huawei.com>
Closes #5690 from scwf/tablestar and squashes the following commits: 3b2e2b6 [scwf] support table.star
* [SPARK-5213] [SQL] Pluggable SQL Parser Support (Cheng Hao, 2015-04-30, 3 files, -17/+93)
This PR aims to make the SQL Parser pluggable, so that users can register their own parser via the Spark SQL CLI.
```
# add the jar into the classpath
$hchengmydesktop:spark>bin/spark-sql --jars sql99.jar

-- switch to "hiveql" dialect
spark-sql>SET spark.sql.dialect=hiveql;
spark-sql>SELECT * FROM src LIMIT 1;

-- switch to "sql" dialect
spark-sql>SET spark.sql.dialect=sql;
spark-sql>SELECT * FROM src LIMIT 1;

-- switch to a custom dialect
spark-sql>SET spark.sql.dialect=com.xxx.xxx.SQL99Dialect;
spark-sql>SELECT * FROM src LIMIT 1;

-- register a non-existent SQL dialect
spark-sql> SET spark.sql.dialect=NotExistedClass;
spark-sql> SELECT * FROM src LIMIT 1;
-- An exception will be thrown and the dialect switches back to the default ("sql" for SQLContext and "hiveql" for HiveContext)
```
Author: Cheng Hao <hao.cheng@intel.com>
Closes #4015 from chenghao-intel/sqlparser and squashes the following commits: 493775c [Cheng Hao] update the code as feedback 81a731f [Cheng Hao] remove the unecessary comment aab0b0b [Cheng Hao] polish the code a little bit 49b9d81 [Cheng Hao] shrink the comment for rebasing
* [SPARK-6913][SQL] Fixed "java.sql.SQLException: No suitable driver found" (Vyacheslav Baranov, 2015-04-30, 3 files, -4/+62)
Fixed `java.sql.SQLException: No suitable driver found` when loading a DataFrame into Spark SQL if the driver is supplied with the `--jars` argument. The problem is in the `java.sql.DriverManager` class, which can't access drivers loaded by the Spark ClassLoader; wrappers that forward requests are created for these drivers. Also, it is no longer necessary to include JDBC drivers in `--driver-class-path` in local mode; specifying them in the `--jars` argument is sufficient.
Author: Vyacheslav Baranov <slavik.baranov@gmail.com>
Closes #5782 from SlavikBaranov/SPARK-6913 and squashes the following commits: 510c43f [Vyacheslav Baranov] [SPARK-6913] Fixed review comments b2a727c [Vyacheslav Baranov] [SPARK-6913] Fixed thread race on driver registration c8294ae [Vyacheslav Baranov] [SPARK-6913] Fixed "No suitable driver found" when using JDBC driver added with SparkContext.addJar
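A hedged sketch of loading over JDBC after starting the shell with only `--jars`; the URL, table, and driver class below are placeholders, and `load` reflects the 1.3/1.4-era SQLContext API:

```scala
// Launched with: bin/spark-shell --jars /path/to/postgresql.jar
val jdbcDf = sqlContext.load("jdbc", Map(
  "url"     -> "jdbc:postgresql://host:5432/db",  // placeholder connection URL
  "dbtable" -> "public.some_table",               // placeholder table name
  "driver"  -> "org.postgresql.Driver"))          // driver now resolved via the wrappers
```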
* [SPARK-7093] [SQL] Using newPredicate in NestedLoopJoin to enable code generation (scwf, 2015-04-30, 2 files, -8/+2)
Use newPredicate in NestedLoopJoin instead of InterpretedPredicate so that it can make use of code generation.
Author: scwf <wangfei1@huawei.com>
Closes #5665 from scwf/NLP and squashes the following commits: d19dd31 [scwf] improvement a887c02 [scwf] improve for NLP boundCondition
* [SPARK-7280][SQL] Add "drop" column/s on a data frame (rakeshchalasani, 2015-04-30, 2 files, -4/+45)
Takes one or more column names and returns a new DataFrame with those columns dropped.
Author: rakeshchalasani <vnit.rakesh@gmail.com>
Closes #5818 from rakeshchalasani/SPARK-7280 and squashes the following commits: ce2ec09 [rakeshchalasani] Minor edit 45c06f1 [rakeshchalasani] Change withColumnRename and format changes f68945a [rakeshchalasani] Minor fix 0b9104d [rakeshchalasani] Drop one column at a time 289afd2 [rakeshchalasani] [SPARK-7280][SQL] Add "drop" column/s on a data frame
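A one-line sketch of the new method (the column name is illustrative):

```scala
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???  // an existing DataFrame with a column "b"
// Returns a new DataFrame without column "b"; the source DataFrame is unchanged.
val trimmed = df.drop("b")
```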
* [SPARK-7242][SQL][MLLIB] Frequent items for DataFrames (Burak Yavuz, 2015-04-30, 5 files, -5/+256)
Finding frequent items with possibly false positives, using the algorithm described in `http://www.cs.umd.edu/~samir/498/karp.pdf`. Public API under:
```
df.stat.freqItems(cols: Array[String], support: Double = 0.001): DataFrame
```
The output is a local DataFrame having the input column names with `-freqItems` appended to it. This is a single pass algorithm that may return false positives, but no false negatives. cc mengxr rxin Let's get the implementations in, I can add python API in a follow up PR.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5799 from brkyvz/freq-items and squashes the following commits: a6ec82c [Burak Yavuz] addressed comments v? 39b1bba [Burak Yavuz] removed toSeq 0915e23 [Burak Yavuz] addressed comments v2.1 3a5c177 [Burak Yavuz] addressed comments v2.0 482e741 [Burak Yavuz] removed old import 38e784d [Burak Yavuz] addressed comments v1.0 8279d4d [Burak Yavuz] added default value for support 3d82168 [Burak Yavuz] made base implementation
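A usage sketch (column names and the support threshold are illustrative):

```scala
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???  // an existing DataFrame with columns "a" and "b"
// Items occurring in more than 40% of rows, per column; the result may
// contain false positives but no false negatives.
val freq = df.stat.freqItems(Array("a", "b"), support = 0.4)
freq.show()
```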
* [Build] Enable MiMa checks for SQL (Josh Rosen, 2015-04-30, 3 files, -4/+5)
Now that 1.3 has been released, we should enable MiMa checks for the `sql` subproject.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #5727 from JoshRosen/enable-more-mima-checks and squashes the following commits: 3ad302b [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks 0c48e4d [Josh Rosen] Merge remote-tracking branch 'origin/master' into enable-more-mima-checks e276cee [Josh Rosen] Fix SQL MiMa checks via excludes and private[sql] 44d0d01 [Josh Rosen] Add back 'launcher' exclude 1aae027 [Josh Rosen] Enable MiMa checks for launcher and sql projects.
* [SPARK-7196][SQL] Support precision and scale of decimal type for JDBC (Liang-Chi Hsieh, 2015-04-30, 2 files, -2/+10)
JIRA: https://issues.apache.org/jira/browse/SPARK-7196
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5777 from viirya/jdbc_precision and squashes the following commits: f40f5e6 [Liang-Chi Hsieh] Support precision and scale for NUMERIC type. 49acbf9 [Liang-Chi Hsieh] Add unit test. a509e19 [Liang-Chi Hsieh] Support precision and scale of decimal type for JDBC.
* [SPARK-7156][SQL] Addressed follow up comments for randomSplit (Burak Yavuz, 2015-04-29, 1 file, -1/+1)
Small fixes regarding comments in PR #5761. cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5795 from brkyvz/split-followup and squashes the following commits: 369c522 [Burak Yavuz] changed wording a little 1ea456f [Burak Yavuz] Addressed follow up comments
* [SQL] [Minor] Print detail query execution info when spark answer is not right (wangfei, 2015-04-29, 1 file, -5/+1)
Print detailed query execution info, including the parsed/analyzed/optimized/physical plans, when the Spark answer is not right.
```
Results do not match for query:
== Parsed Logical Plan ==
'Aggregate ['x.str], ['x.str,SUM('x.strCount) AS c1#46]
 'Join Inner, Some(('x.str = 'y.str))
  'UnresolvedRelation [df], Some(x)
  'UnresolvedRelation [df], Some(y)

== Analyzed Logical Plan ==
Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
 Join Inner, Some((str#44 = str#51))
  Subquery x
   Subquery df
    Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
     Project [_1#41 AS int#43,_2#42 AS str#44]
      LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]
  Subquery y
   Subquery df
    Aggregate [str#51], [str#51,COUNT(str#51) AS strCount#47L]
     Project [_1#41 AS int#50,_2#42 AS str#51]
      LocalRelation [_1#41,_2#42], [[1,1],[2,2],[3,3]]

== Optimized Logical Plan ==
Aggregate [str#44], [str#44,SUM(strCount#45L) AS c1#46L]
 Project [str#44,strCount#45L]
  Join Inner, Some((str#44 = str#51))
   Aggregate [str#44], [str#44,COUNT(str#44) AS strCount#45L]
    LocalRelation [str#44], [[1],[2],[3]]
   Aggregate [str#51], [str#51]
    LocalRelation [str#51], [[1],[2],[3]]

== Physical Plan ==
Aggregate false, [str#44], [str#44,CombineSum(PartialSum#53L) AS c1#46L]
 Aggregate true, [str#44], [str#44,SUM(strCount#45L) AS PartialSum#53L]
  Project [str#44,strCount#45L]
   BroadcastHashJoin [str#44], [str#51], BuildRight
    Aggregate false, [str#44], [str#44,Coalesce(SUM(PartialCount#55L),0) AS strCount#45L]
     Exchange (HashPartitioning [str#44], 5), []
      Aggregate true, [str#44], [str#44,COUNT(str#44) AS PartialCount#55L]
       LocalTableScan [str#44], [[1],[2],[3]]
    Aggregate false, [str#51], [str#51]
     Exchange (HashPartitioning [str#51], 5), []
      Aggregate true, [str#51], [str#51]
       LocalTableScan [str#51], [[1],[2],[3]]
Code Generation: false
== RDD ==
== Results ==
!== Correct Answer - 3 ==   == Spark Answer - 3 ==
 [1,1]                      [1,1]
![2,3]                      [2,1]
 [3,1]                      [3,1]
```
Author: wangfei <wangfei1@huawei.com>
Closes #5774 from scwf/checkanswer and squashes the following commits: 5be6f78 [wangfei] print detail query execution info when Spark Answer is not right
* [SPARK-7229] [SQL] SpecificMutableRow should take integer type as internal representation for Date (Cheng Hao, 2015-04-29, 1 file, -0/+9)
Author: Cheng Hao <hao.cheng@intel.com>
Closes #5772 from chenghao-intel/specific_row and squashes the following commits: 2cd064d [Cheng Hao] scala style issue 60347a2 [Cheng Hao] SpecificMutableRow should take integer type as internal representation for DateType
* [SPARK-7156][SQL] support RandomSplit in DataFrames (Burak Yavuz, 2015-04-29, 4 files, -5/+74)
This is built on top of kaka1992's PR #5711, using logical plans.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5761 from brkyvz/random-sample and squashes the following commits: a1fb0aa [Burak Yavuz] remove unrelated file 69669c3 [Burak Yavuz] fix broken test 1ddb3da [Burak Yavuz] copy base 6000328 [Burak Yavuz] added python api and fixed test 3c11d1b [Burak Yavuz] fixed broken test f400ade [Burak Yavuz] fix build errors 2384266 [Burak Yavuz] addressed comments v0.1 e98ebac [Burak Yavuz] [SPARK-7156][SQL] support RandomSplit in DataFrames
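A sketch of the API this adds (weights and seed are illustrative):

```scala
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???  // an existing DataFrame
// Split into roughly 70/30 train/test sets; weights are normalized if they
// don't sum to 1, and the seed makes the split reproducible.
val Array(train, test) = df.randomSplit(Array(0.7, 0.3), seed = 11L)
```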
* [SQL][Minor] fix java doc for DataFrame.agg (Wenchen Fan, 2015-04-29, 1 file, -11/+6)
Author: Wenchen Fan <cloud0fan@outlook.com>
Closes #5712 from cloud-fan/minor and squashes the following commits: be23064 [Wenchen Fan] fix java doc for DataFrame.agg
* [SPARK-7076][SPARK-7077][SPARK-7080][SQL] Use managed memory for aggregations (Josh Rosen, 2015-04-29, 4 files, -0/+73)
This patch adds managed-memory-based aggregation to Spark SQL / DataFrames. Instead of working with Java objects, this new aggregation path uses `sun.misc.Unsafe` to manipulate raw memory. This reduces the memory footprint for aggregations, resulting in fewer spills, OutOfMemoryErrors, and garbage collection pauses. As a result, this allows for higher memory utilization. It can also result in better cache locality since objects will be stored closer together in memory.
This feature can be enabled by setting `spark.sql.unsafe.enabled=true`. For now, this feature is only supported when codegen is enabled and only supports aggregations for which the grouping columns are primitive numeric types or strings and aggregated values are numeric.

### Managing memory with sun.misc.Unsafe
This patch supports both on- and off-heap managed memory.
- In on-heap mode, memory addresses are identified by the combination of a base Object and an offset within that object.
- In off-heap mode, memory is addressed directly with 64-bit long addresses.
To support both modes, functions that manipulate memory accept both `baseObject` and `baseOffset` fields. In off-heap mode, we simply pass `null` as `baseObject`. We allocate memory in large chunks, so memory fragmentation and allocation speed are not significant bottlenecks. By default, we use on-heap mode. To enable off-heap mode, set `spark.unsafe.offHeap=true`.
To track allocated memory, this patch extends `SparkEnv` with an `ExecutorMemoryManager` and supplies each `TaskContext` with a `TaskMemoryManager`. These classes work together to track allocations and detect memory leaks.

### Compact tuple format
This patch introduces `UnsafeRow`, a compact row layout. In this format, each tuple has three parts: a null bit set, fixed length values, and variable-length values:
![image](https://cloud.githubusercontent.com/assets/50748/7328538/2fdb65ce-ea8b-11e4-9743-6c0f02bb7d1f.png)
- Rows are always 8-byte word aligned (so their sizes will always be a multiple of 8 bytes)
- The bit set is used for null tracking:
  - Position _i_ is set if and only if field _i_ is null
  - The bit set is aligned to an 8-byte word boundary.
- Every field appears as an 8-byte word in the fixed-length values part:
  - If a field is null, we zero out the values.
  - If a field is variable-length, the word stores a relative offset (w.r.t. the base of the tuple) that points to the beginning of the field's data in the variable-length part.
- Each variable-length data type can have its own encoding:
  - For strings, the first word stores the length of the string and is followed by UTF-8 encoded bytes. If necessary, the end of the string is padded with empty bytes in order to ensure word-alignment.
For example, a tuple that consists of 3 fields of type (int, string, string), with value (null, “data”, “bricks”) would look like this:
![image](https://cloud.githubusercontent.com/assets/50748/7328526/1e21959c-ea8b-11e4-9a28-a4350fe4a7b5.png)
This format allows us to compare tuples for equality by directly comparing their raw bytes. This also enables fast hashing of tuples.

### Hash map for performing aggregations
This patch introduces `UnsafeFixedWidthAggregationMap`, a hash map for performing aggregations where the aggregation result columns are fixed-width.
This map's keys and values are `Row` objects. `UnsafeFixedWidthAggregationMap` is implemented on top of `BytesToBytesMap`, an append-only map which supports byte-array keys and values.
`BytesToBytesMap` stores pointers to key and value tuples. For each record with a new key, we copy the key and create the aggregation value buffer for that key and put them in a buffer. The hash table then simply stores pointers to the key and value. For each record with an existing key, we simply run the aggregation function to update the values in place.
This map is implemented using open hashing with triangular sequence probing. Each entry stores two words in a long array: the first word stores the address of the key and the second word stores the relative offset from the key tuple to the value tuple, as well as the key's 32-bit hashcode. By storing the full hashcode, we reduce the number of equality checks that need to be performed to handle position collisions (since the chance of a hashcode collision is much lower than a position collision).
`UnsafeFixedWidthAggregationMap` allows regular Spark SQL `Row` objects to be used when probing the map. Internally, it encodes these rows into `UnsafeRow` format using `UnsafeRowConverter`. This conversion has a small overhead that can be eliminated in the future once we use UnsafeRows in other operators.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #5725 from JoshRosen/unsafe and squashes the following commits: eeee512 [Josh Rosen] Add converters for Null, Boolean, Byte, and Short columns. 81f34f8 [Josh Rosen] Follow 'place children last' convention for GeneratedAggregate 1bc36cc [Josh Rosen] Refactor UnsafeRowConverter to avoid unnecessary boxing. 017b2dc [Josh Rosen] Remove BytesToBytesMap.finalize() 50e9671 [Josh Rosen] Throw memory leak warning even in case of error; add warning about code duplication 70a39e4 [Josh Rosen] Split MemoryManager into ExecutorMemoryManager and TaskMemoryManager: 6e4b192 [Josh Rosen] Remove an unused method from ByteArrayMethods. de5e001 [Josh Rosen] Fix debug vs. trace in logging message. a19e066 [Josh Rosen] Rename unsafe Java test suites to match Scala test naming convention. 78a5b84 [Josh Rosen] Add logging to MemoryManager ce3c565 [Josh Rosen] More comments, formatting, and code cleanup. 529e571 [Josh Rosen] Measure timeSpentResizing in nanoseconds instead of milliseconds. 3ca84b2 [Josh Rosen] Only zero the used portion of groupingKeyConversionScratchSpace 162caf7 [Josh Rosen] Fix test compilation b45f070 [Josh Rosen] Don't redundantly store the offset from key to value, since we can compute this from the key size. a8e4a3f [Josh Rosen] Introduce MemoryManager interface; add to SparkEnv. 0925847 [Josh Rosen] Disable MiMa checks for new unsafe module cde4132 [Josh Rosen] Add missing pom.xml 9c19fc0 [Josh Rosen] Add configuration options for heap vs. offheap 6ffdaa1 [Josh Rosen] Null handling improvements in UnsafeRow. 31eaabc [Josh Rosen] Lots of TODO and doc cleanup.
a95291e [Josh Rosen] Cleanups to string handling code afe8dca [Josh Rosen] Some Javadoc cleanup f3dcbfe [Josh Rosen] More mod replacement 854201a [Josh Rosen] Import and comment cleanup 06e929d [Josh Rosen] More warning cleanup ef6b3d3 [Josh Rosen] Fix a bunch of FindBugs and IntelliJ inspections 29a7575 [Josh Rosen] Remove debug logging 49aed30 [Josh Rosen] More long -> int conversion. b26f1d3 [Josh Rosen] Fix bug in murmur hash implementation. 765243d [Josh Rosen] Enable optional performance metrics for hash map. 23a440a [Josh Rosen] Bump up default hash map size 628f936 [Josh Rosen] Use ints intead of longs for indexing. 92d5a06 [Josh Rosen] Address a number of minor code review comments. 1f4b716 [Josh Rosen] Merge Unsafe code into the regular GeneratedAggregate, guarded by a configuration flag; integrate planner support and re-enable all tests. d85eeff [Josh Rosen] Add basic sanity test for UnsafeFixedWidthAggregationMap bade966 [Josh Rosen] Comment update (bumping to refresh GitHub cache...) b3eaccd [Josh Rosen] Extract aggregation map into its own class. d2bb986 [Josh Rosen] Update to implement new Row methods added upstream 58ac393 [Josh Rosen] Use UNSAFE allocator in GeneratedAggregate (TODO: make this configurable) 7df6008 [Josh Rosen] Optimizations related to zeroing out memory: c1b3813 [Josh Rosen] Fix bug in UnsafeMemoryAllocator.free(): 738fa33 [Josh Rosen] Add feature flag to guard UnsafeGeneratedAggregate c55bf66 [Josh Rosen] Free buffer once iterator has been fully consumed. 62ab054 [Josh Rosen] Optimize for fact that get() is only called on String columns. c7f0b56 [Josh Rosen] Reuse UnsafeRow pointer in UnsafeRowConverter ae39694 [Josh Rosen] Add finalizer as "cleanup method of last resort" c754ae1 [Josh Rosen] Now that the store*() contract has been stregthened, we can remove an extra lookup f764d13 [Josh Rosen] Simplify address + length calculation in Location. 079f1bf [Josh Rosen] Some clarification of the BytesToBytesMap.lookup() / set() contract. 1a483c5 [Josh Rosen] First version that passes some aggregation tests: fc4c3a8 [Josh Rosen] Sketch how the converters will be used in UnsafeGeneratedAggregate 53ba9b7 [Josh Rosen] Start prototyping Java Row -> UnsafeRow converters 1ff814d [Josh Rosen] Add reminder to free memory on iterator completion 8a8f9df [Josh Rosen] Add skeleton for GeneratedAggregate integration. 5d55cef [Josh Rosen] Add skeleton for Row implementation. f03e9c1 [Josh Rosen] Play around with Unsafe implementations of more string methods. ab68e08 [Josh Rosen] Begin merging the UTF8String implementations. 480a74a [Josh Rosen] Initial import of code from Databricks unsafe utils repo.
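The opt-in flags described above, as a sketch (config keys taken from the commit message; using `spark.sql.codegen` as the codegen switch of this era is an assumption):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???  // an existing SQLContext
sqlContext.setConf("spark.sql.codegen", "true")         // assumed codegen flag; unsafe requires codegen
sqlContext.setConf("spark.sql.unsafe.enabled", "true")  // enable the managed-memory aggregation path
// Off-heap allocation is a separate Spark setting: spark.unsafe.offHeap=true
```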
* [SPARK-7188] added python support for math DataFrame functions (Burak Yavuz, 2015-04-29, 2 files, -305/+116)
Adds support for the math functions for DataFrames in PySpark. rxin I love Davies.
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5750 from brkyvz/python-math-udfs and squashes the following commits: 7c4f563 [Burak Yavuz] removed is_math 3c4adde [Burak Yavuz] cleanup imports d5dca3f [Burak Yavuz] moved math functions to mathfunctions 25e6534 [Burak Yavuz] addressed comments v2.0 d3f7e0f [Burak Yavuz] addressed comments and added tests 7b7d7c4 [Burak Yavuz] remove tests for removed methods 33c2c15 [Burak Yavuz] fixed python style 3ee0c05 [Burak Yavuz] added python functions
* [SPARK-7215] made coalesce and repartition a part of the query plan (Burak Yavuz, 2015-04-28, 3 files, -9/+19)
Coalesce and repartition now show up as part of the query plan, rather than resulting in a new `DataFrame`. cc rxin
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5762 from brkyvz/df-repartition and squashes the following commits: b1e76dd [Burak Yavuz] added documentation on repartitions 5807e35 [Burak Yavuz] renamed coalescepartitions fa4509f [Burak Yavuz] rename coalesce 2c349b5 [Burak Yavuz] address comments f2e6af1 [Burak Yavuz] add ticks 686c90b [Burak Yavuz] made coalesce and repartition a part of the query plan
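A sketch of the two operations (partition counts are illustrative):

```scala
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???          // an existing DataFrame
val wide   = df.repartition(200) // shuffles data into 200 partitions
val narrow = wide.coalesce(10)   // reduces partitions without a full shuffle
narrow.explain()                 // both now appear as nodes in the printed plan
```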
* [SPARK-6352] [SQL] Custom parquet output committer (Pei-Lun Lee, 2015-04-28, 3 files, -0/+116)
Add a new config "spark.sql.parquet.output.committer.class" to allow a custom parquet output committer, and add an output committer class suited for use on S3. Fix a compilation error introduced by https://github.com/apache/spark/pull/5042. Respect the ParquetOutputFormat.ENABLE_JOB_SUMMARY flag.
Author: Pei-Lun Lee <pllee@appier.com>
Closes #5525 from ypcat/spark-6352 and squashes the following commits: 54c6b15 [Pei-Lun Lee] error handling 472870e [Pei-Lun Lee] add back custom parquet output committer ddd0f69 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ece5c5 [Pei-Lun Lee] compatibility with hadoop 1.x 8413fcd [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 fe65915 [Pei-Lun Lee] add support for parquet config parquet.enable.summary-metadata e17bf47 [Pei-Lun Lee] Merge branch 'master' of https://github.com/apache/spark into spark-6352 9ae7545 [Pei-Lun Lee] [SPARL-6352] [SQL] Change to allow custom parquet output committer. 0d540b9 [Pei-Lun Lee] [SPARK-6352] [SQL] add license c42468c [Pei-Lun Lee] [SPARK-6352] [SQL] add test case 0fc03ca [Pei-Lun Lee] [SPARK-6532] [SQL] hide class DirectParquetOutputCommitter 769bd67 [Pei-Lun Lee] DirectParquetOutputCommitter f75e261 [Pei-Lun Lee] DirectParquetOutputCommitter
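A hedged sketch of setting the new config; the committer class follows the `DirectParquetOutputCommitter` named in the squash list, and its package location is an assumption:

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???  // an existing SQLContext
// Route Parquet writes through a committer that avoids the temporary-directory
// rename step (useful on S3, where renames are implemented as copies).
sqlContext.setConf(
  "spark.sql.parquet.output.committer.class",
  "org.apache.spark.sql.parquet.DirectParquetOutputCommitter")  // assumed package
```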
* [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs. (Reynold Xin, 2015-04-28, 4 files, -4/+82)
Author: Reynold Xin <rxin@databricks.com>
Closes #5709 from rxin/inc-id and squashes the following commits: 7853611 [Reynold Xin] private sql. a9fda0d [Reynold Xin] Missed a few numbers. 343d896 [Reynold Xin] Self review feedback. a7136cb [Reynold Xin] [SPARK-7135][SQL] DataFrame expression for monotonically increasing IDs.
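A usage sketch, assuming the 1.4-era function name `monotonicallyIncreasingId` (later releases renamed it):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.monotonicallyIncreasingId

val df: DataFrame = ???  // an existing DataFrame
// IDs are unique and monotonically increasing, but not consecutive:
// each partition gets its own non-overlapping range.
val withId = df.withColumn("id", monotonicallyIncreasingId())
```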
* [SPARK-6829] Added math functions for DataFrames (Burak Yavuz, 2015-04-27, 4 files, -1/+804)
Implemented almost all math functions found in scala.math (max, min and abs were already present). cc mengxr marmbrus
Author: Burak Yavuz <brkyvz@gmail.com>
Closes #5616 from brkyvz/math-udfs and squashes the following commits: fb27153 [Burak Yavuz] reverted exception message 836a098 [Burak Yavuz] fixed test and addressed small comment e5f0d13 [Burak Yavuz] addressed code review v2.2 b26c5fb [Burak Yavuz] addressed review v2.1 2761f08 [Burak Yavuz] addressed review v2 6588a5b [Burak Yavuz] fixed merge conflicts b084e10 [Burak Yavuz] Addressed code review 029e739 [Burak Yavuz] fixed atan2 test 534cc11 [Burak Yavuz] added more tests, addressed comments fa68dbe [Burak Yavuz] added double specific test data 937d5a5 [Burak Yavuz] use doubles instead of ints 8e28fff [Burak Yavuz] Added apache header 7ec8f7f [Burak Yavuz] Added math functions for DataFrames
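A sketch of column-level math expressions, shown on the `functions` object where these eventually settled (per the squash list, at the time of this commit some lived in a separate `mathfunctions` object):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, hypot, pow}

val df: DataFrame = ???  // an existing DataFrame with numeric columns "x" and "y"
val derived = df.select(
  hypot(col("x"), col("y")).as("r"),  // sqrt(x^2 + y^2), mirroring scala.math.hypot
  pow(col("x"), 2.0).as("x2"))
```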
* [SPARK-7145] [CORE] commons-lang (2.x) classes used instead of commons-lang3 (3.x); commons-io used without dependency (Sean Owen, 2015-04-27, 1 file, -1/+5)
Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava.
Author: Sean Owen <sowen@cloudera.com>
Closes #5703 from srowen/SPARK-7145 and squashes the following commits: 21fbe03 [Sean Owen] Remove use of commons-lang in favor of commons-lang3 classes; remove commons-io use in favor of Guava
* [SQL][Minor] rename DataTypeParser.apply to DataTypeParser.parse (wangfei, 2015-04-26, 1 file, -1/+1)
Rename DataTypeParser.apply to DataTypeParser.parse to make it more clear and readable. /cc rxin
Author: wangfei <wangfei1@huawei.com>
Closes #5710 from scwf/apply and squashes the following commits: c319977 [wangfei] rename apply to parse
* [SPARK-7152][SQL] Add a Column expression for partition ID. (Reynold Xin, 2015-04-26, 4 files, -10/+89)
Author: Reynold Xin <rxin@databricks.com>
Closes #5705 from rxin/df-pid and squashes the following commits: 401018f [Reynold Xin] [SPARK-7152][SQL] Add a Column expression for partition ID.
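A usage sketch, assuming the 1.4-era name `sparkPartitionId` (later renamed):

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.sparkPartitionId

val df: DataFrame = ???  // an existing DataFrame
// Tag each row with the ID of the partition that produced it,
// e.g. to inspect skew after a shuffle:
val tagged = df.withColumn("pid", sparkPartitionId())
```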
* [SPARK-7055][SQL] Use correct ClassLoader for JDBC Driver in JDBCRDD.getConnector (Vinod K C, 2015-04-23, 1 file, -1/+2)
Author: Vinod K C <vinod.kc@huawei.com>
Closes #5633 from vinodkc/use_correct_classloader_driverload and squashes the following commits: 73c5380 [Vinod K C] Use correct ClassLoader for JDBC Driver
* [HOTFIX] [SQL] Fix compilation for scala 2.11. (Prashant Sharma, 2015-04-23, 1 file, -1/+1)
Author: Prashant Sharma <prashant.s@imaginea.com>
Closes #5652 from ScrapCodes/hf/compilation-fix-scala-2.11 and squashes the following commits: 819ff06 [Prashant Sharma] [HOTFIX] Fix compilation for scala 2.11.
* [SPARK-7069][SQL] Rename NativeType -> AtomicType. (Reynold Xin, 2015-04-23, 17 files, -66/+66)
Also renamed JvmType to InternalType.
Author: Reynold Xin <rxin@databricks.com>
Closes #5651 from rxin/native-to-atomic-type and squashes the following commits: cbd4028 [Reynold Xin] [SPARK-7069][SQL] Rename NativeType -> AtomicType.
* [SPARK-7068][SQL] Remove PrimitiveType (Reynold Xin, 2015-04-22, 4 files, -14/+18)
Author: Reynold Xin <rxin@databricks.com>
Closes #5646 from rxin/remove-primitive-type and squashes the following commits: 01b673d [Reynold Xin] [SPARK-7068][SQL] Remove PrimitiveType
* [HOTFIX][SQL] Fix broken cached test (Liang-Chi Hsieh, 2015-04-22, 1 file, -7/+14)
Added in #5475; pointed out as broken in #5639. /cc marmbrus
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5640 from viirya/fix_cached_test and squashes the following commits: c0cf69a [Liang-Chi Hsieh] Fix broken cached test.
* [SPARK-6967] [SQL] fix date type conversion in jdbcrdd (Daoyuan Wang, 2015-04-22, 2 files, -2/+9)
This PR converts the java.sql.Date type into Int for JDBCRDD.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Closes #5590 from adrian-wang/datebug and squashes the following commits: f897b81 [Daoyuan Wang] add a test case 3c9184c [Daoyuan Wang] fix date type convertion in jdbcrdd
* [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin. (Reynold Xin, 2015-04-22, 2 files, -11/+66)
Author: Reynold Xin <rxin@databricks.com>
Closes #5638 from rxin/joinUsing and squashes the following commits: 13e9cc9 [Reynold Xin] Code review + Python. b1bd914 [Reynold Xin] [SPARK-7059][SQL] Create a DataFrame join API to facilitate equijoin and self join.
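A sketch of the using-column join this adds (DataFrame and column names are illustrative):

```scala
import org.apache.spark.sql.DataFrame

val left: DataFrame  = ???  // both sides share a column named "key"
val right: DataFrame = ???
// Equijoin on the shared column; it appears only once in the output,
// which also sidesteps ambiguity in self joins.
val joined = left.join(right, "key")
```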
* [SPARK-7039][SQL] JDBCRDD: Add support on type NVARCHAR (szheng79, 2015-04-22, 1 file, -0/+1)
Issue: https://issues.apache.org/jira/browse/SPARK-7039
Add support for the column type NVARCHAR in SQL Server. java.sql.Types: http://docs.oracle.com/javase/7/docs/api/java/sql/Types.html
Author: szheng79 <szheng.code@gmail.com>
Closes #5618 from szheng79/patch-1 and squashes the following commits: 10da99c [szheng79] Update JDBCRDD.scala eab0bd8 [szheng79] Add support on type NVARCHAR
* [SQL] Rename some apply functions. (Reynold Xin, 2015-04-22, 6 files, -16/+16)
I was looking at the code gen code and got confused by a few of the use cases of apply, in particular apply on objects. So I went ahead and changed a few of them. Hopefully it is slightly clearer with a proper verb.
Author: Reynold Xin <rxin@databricks.com>
Closes #5624 from rxin/apply-rename and squashes the following commits: ee45034 [Reynold Xin] [SQL] Rename some apply functions.
* [SPARK-5817] [SQL] Fix bug of udtf with column names (Cheng Hao, 2015-04-21, 3 files, -24/+24)
There is a bug when running a query like:
```sql
select d from (select explode(array(1,1)) d from src limit 1) t
```
It throws an exception like:
```
org.apache.spark.sql.AnalysisException: cannot resolve 'd' given input columns _c0; line 1 pos 7
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:48)
    at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$apply$3$$anonfun$apply$1.applyOrElse(CheckAnalysis.scala:45)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:250)
    at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:50)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:249)
    at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$transformExpressionUp$1(QueryPlan.scala:103)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2$$anonfun$apply$2.apply(QueryPlan.scala:117)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
    at scala.collection.AbstractTraversable.map(Traversable.scala:105)
    at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$2.apply(QueryPlan.scala:116)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
```
Solving the bug requires code refactoring for UDTFs. The major changes are:
* Simplify UDTF development: UDTFs no longer manage the output attribute names; instead, `logical.Generate` handles that properly.
* UDTFs will be asked for the output schema (data types) during logical plan analysis.
Author: Cheng Hao <hao.cheng@intel.com>
Closes #4602 from chenghao-intel/explode_bug and squashes the following commits: c2a5132 [Cheng Hao] add back resolved for Alias 556e982 [Cheng Hao] revert the unncessary change 002c361 [Cheng Hao] change the rule of resolved for Generate 04ae500 [Cheng Hao] add qualifier only for generator output 5ee5d2c [Cheng Hao] prepend the new qualifier d2e8b43 [Cheng Hao] Update the code as feedback ca5e7f4 [Cheng Hao] shrink the commits
* [SPARK-6996][SQL] Support map types in java beans (Punya Biswal, 2015-04-21, 3 files, -59/+160)
liancheng mengxr this is similar to #5146.
Author: Punya Biswal <pbiswal@palantir.com>
Closes #5578 from punya/feature/SPARK-6996 and squashes the following commits: d56c3e0 [Punya Biswal] Fix imports c7e308b [Punya Biswal] Support java iterable types in POJOs 5e00685 [Punya Biswal] Support map types in java beans
* [SPARK-6969][SQL] Refresh the cached table when REFRESH TABLE is used (Yin Huai, 2015-04-21, 1 file, -0/+17)
https://issues.apache.org/jira/browse/SPARK-6969
Author: Yin Huai <yhuai@databricks.com>
Closes #5583 from yhuai/refreshTableRefreshDataCache and squashes the following commits: 1e5142b [Yin Huai] Add todo. 92b2498 [Yin Huai] Minor updates. 367df92 [Yin Huai] Recache data in the command of REFRESH TABLE.
* [SPARK-6994] Allow to fetch field values by name in sql.Row (vidmantas zemleris, 2015-04-21, 1 file, -0/+10)
It looked weird that up to now there was no way in Spark's Scala API to access fields of `DataFrame/sql.Row` by name, only by their index. This tries to solve the issue.
Author: vidmantas zemleris <vidmantas@vinted.com>
Closes #5573 from vidma/features/row-with-named-fields and squashes the following commits: 6145ae3 [vidmantas zemleris] [SPARK-6994][SQL] Allow to fetch field values by name on Row 9564ebb [vidmantas zemleris] [SPARK-6994][SQL] Add fieldIndex to schema (StructType)
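A sketch of the two additions named in the squash list, `getAs` by name on `Row` and `fieldIndex` on `StructType` (the field name and type are illustrative):

```scala
import org.apache.spark.sql.DataFrame

val df: DataFrame = ???  // an existing DataFrame with an Int column "age"
val row = df.first()
val age = row.getAs[Int]("age")        // fetch a field value by name
val idx = df.schema.fieldIndex("age")  // or resolve the ordinal first
```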
* [SPARK-6635][SQL] DataFrame.withColumn should replace columns with identical column names (Liang-Chi Hsieh, 2015-04-20, 2 files, -1/+21)
JIRA https://issues.apache.org/jira/browse/SPARK-6635
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5541 from viirya/replace_with_column and squashes the following commits: b539c7b [Liang-Chi Hsieh] For comment. 72f35b1 [Liang-Chi Hsieh] DataFrame.withColumn can replace original column with identical column name.
* [SPARK-6368][SQL] Build a specialized serializer for Exchange operator. (Yin Huai, 2015-04-20, 4 files, -6/+673)
JIRA: https://issues.apache.org/jira/browse/SPARK-6368
Author: Yin Huai <yhuai@databricks.com>
Closes #5497 from yhuai/serializer2 and squashes the following commits: da562c5 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2 50e0c3d [Yin Huai] When no filed is emitted to shuffle, use SparkSqlSerializer for now. 9f1ed92 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2 6d07678 [Yin Huai] Address comments. 4273b8c [Yin Huai] Enabled SparkSqlSerializer2. 09e587a [Yin Huai] Remove TODO. 791b96a [Yin Huai] Use UTF8String. 60a1487 [Yin Huai] Merge remote-tracking branch 'upstream/master' into serializer2 3e09655 [Yin Huai] Use getAs for Date column. 43b9fb4 [Yin Huai] Test. 8297732 [Yin Huai] Fix test. c9373c8 [Yin Huai] Support DecimalType. 2379eeb [Yin Huai] ASF header. 39704ab [Yin Huai] Specialized serializer for Exchange.
* [SPARK-6807] [SparkR] Merge recent SparkR-pkg changes (cafreeman, 2015-04-17, 1 file, -3/+29)
This PR pulls in recent changes in SparkR-pkg, including cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.
Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>
Closes #5436 from davies/R3 and squashes the following commits: c2b09be [Davies Liu] SQLTypes -> schema a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3 168b7fe [Davies Liu] sort generics b1fe460 [Davies Liu] fix conflict in README.md e74c04e [Davies Liu] fix schema.R 4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5 41f8184 [Davies Liu] rm man ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3 1bdcb63 [Zongheng Yang] Updates to README.md. 5a553e7 [cafreeman] Use object attribute instead of argument 71372d9 [cafreeman] Update docs and examples 8526d2e71 [cafreeman] Remove `tojson` functions 6ef5f2d [cafreeman] Fix spacing 7741d66 [cafreeman] Rename the SQL DataType function 141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream 9387402 [Davies Liu] fix style 40199eb [Shivaram Venkataraman] Move except into sorted position 07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD. 7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey ed66c81 [cafreeman] Update `subtract` to work with `generics.R` f3ba785 [cafreeman] Fixed duplicate export 275deb4 [cafreeman] Update `NAMESPACE` and tests 1a3b63d [cafreeman] new version of `CreateDF` 836c4bf [cafreeman] Update `createDataFrame` and `toDF` be5d5c1 [cafreeman] refactor schema functions 40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5 20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4 c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master 136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats cd66603 [cafreeman] new line at EOF 8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep 7dd81b7 [cafreeman] Documentation 0e2a94f [cafreeman] Define functions for schema and fields
* [SPARK-6972][SQL] Add Coalesce to DataFrame (Michael Armbrust, 2015-04-16, 3 files, -0/+24)
Author: Michael Armbrust <michael@databricks.com>
Closes #5545 from marmbrus/addCoalesce and squashes the following commits: 9fdf3f6 [Michael Armbrust] [SPARK-6972][SQL] Add Coalesce to DataFrame
* [SPARK-6966][SQL] Use correct ClassLoader for JDBC Driver (Michael Armbrust, 2015-04-16, 1 file, -1/+2)
Otherwise we cannot add jars with drivers after the fact.
Author: Michael Armbrust <michael@databricks.com>
Closes #5543 from marmbrus/jdbcClassloader and squashes the following commits: d9930f3 [Michael Armbrust] fix imports 73d0614 [Michael Armbrust] [SPARK-6966][SQL] Use correct ClassLoader for JDBC Driver
* [SPARK-6899][SQL] Fix type mismatch when using codegen with Average on DecimalType (Liang-Chi Hsieh, 2015-04-16, 1 file, -0/+9)
JIRA https://issues.apache.org/jira/browse/SPARK-6899
Author: Liang-Chi Hsieh <viirya@gmail.com>
Closes #5517 from viirya/fix_codegen_average and squashes the following commits: 8ae5f65 [Liang-Chi Hsieh] Add the case of DecimalType.Unlimited to Average.
* [SPARK-6911] [SQL] improve accessor for nested types (Davies Liu, 2015-04-16, 3 files, -8/+14)
Support accessing columns by index in Python:
```
>>> df[df[0] > 3].collect()
[Row(age=5, name=u'Bob')]
```
Access items in ArrayType or MapType:
```
>>> df.select(df.l.getItem(0), df.d.getItem("key")).show()
>>> df.select(df.l[0], df.d["key"]).show()
```
Access a field in StructType:
```
>>> df.select(df.r.getField("b")).show()
>>> df.select(df.r.a).show()
```
Author: Davies Liu <davies@databricks.com>
Closes #5513 from davies/access and squashes the following commits: e04d5a0 [Davies Liu] Update run-tests-jenkins 7ada9eb [Davies Liu] update timeout d125ac4 [Davies Liu] check column name, improve scala tests 6b62540 [Davies Liu] fix test db15b42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into access 6c32e79 [Davies Liu] add scala tests 11f1df3 [Davies Liu] improve accessor for nested types
* SPARK-6927 [SQL] Sorting Error when codegen on (云峤, 2015-04-16, 1 file, -0/+20)
Fix this error by adding a BinaryType comparator in GenerateOrdering. JIRA https://issues.apache.org/jira/browse/SPARK-6927
Author: 云峤 <chensong.cs@alibaba-inc.com>
Closes #5524 from kaka1992/fix-codegen-sort and squashes the following commits: d7e2afe [云峤] fix codegen sorting error
* [SPARK-5277][SQL] SparkSqlSerializer doesn't always register user specified KryoRegistrators (Max Seiden, 2015-04-15, 4 files, -12/+68)
There were a few places where new SparkSqlSerializer instances were created with new, empty SparkConfs, resulting in user specified registrators sometimes not getting initialized. The fix is to try to pull a conf from the SparkEnv, and construct a new conf (that loads defaults) if one cannot be found. The changes touched:
1) SparkSqlSerializer's resource pool (this appears to fix the issue in the comment)
2) execution.Exchange (for all of the partitioners)
3) execution.Limit (for the HashPartitioner)
A few tests were added to ColumnTypeSuite, ensuring that a custom registrator and serde is initialized and used when in-memory columns are written.
Author: Max Seiden <max@platfora.com>
This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com>
Closes #5237 from mhseiden/sql_udt_kryo and squashes the following commits: 3175c2f [Max Seiden] [SPARK-5277][SQL] - address code review comments e5011fb [Max Seiden] [SPARK-5277][SQL] - SparkSqlSerializer does not register user specified KryoRegistrators
* [SPARK-2213] [SQL] sort merge join for spark sql (Daoyuan Wang, 2015-04-15, 8 files, -31/+351)
Thanks for the initial work from Ishiihara in #3173.
This PR introduces a new join method, sort merge join, which first ensures that keys with the same value are in the same partition and that, inside each partition, the rows are sorted by key. Then we can run down both sides together and find matched rows using [sort merge join](http://en.wikipedia.org/wiki/Sort-merge_join). This way, we don't have to store the whole hash table of one side as in hash join, so we use less memory. Also, this PR would benefit from #3438, making the sorting phase much more efficient.
We introduced a new configuration, "spark.sql.planner.sortMergeJoin", to switch between this (`true`) and ShuffledHashJoin (`false`); we probably want its default value to be `false` at first.
Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Michael Armbrust <michael@databricks.com>
This patch had conflicts when merged, resolved by Committer: Michael Armbrust <michael@databricks.com>
Closes #5208 from adrian-wang/smj and squashes the following commits: 2493b9f [Daoyuan Wang] fix style 5049d88 [Daoyuan Wang] propagate rowOrdering for RangePartitioning f91a2ae [Daoyuan Wang] yin's comment: use external sort if option is enabled, add comments f515cd2 [Daoyuan Wang] yin's comment: outputOrdering, join suite refine ec8061b [Daoyuan Wang] minor change 413fd24 [Daoyuan Wang] Merge pull request #3 from marmbrus/pr/5208 952168a [Michael Armbrust] add type 5492884 [Michael Armbrust] copy when ordering 7ddd656 [Michael Armbrust] Cleanup addition of ordering requirements b198278 [Daoyuan Wang] inherit ordering in project c8e82a3 [Daoyuan Wang] fix style 6e897dd [Daoyuan Wang] hide boundReference from manually construct RowOrdering for key compare in smj 8681d73 [Daoyuan Wang] refactor Exchange and fix copy for sorting 2875ef2 [Daoyuan Wang] fix changed configuration 61d7f49 [Daoyuan Wang] add omitted comment 00a4430 [Daoyuan Wang] fix bug 078d69b [Daoyuan Wang] address comments: add comments, do sort in shuffle, and others 3af6ba5 [Daoyuan Wang] use buffer for only one side 171001f [Daoyuan Wang] change default outputordering 47455c9 [Daoyuan Wang] add apache license ... a28277f [Daoyuan Wang] fix style 645c70b [Daoyuan Wang] address comments using sort 068c35d [Daoyuan Wang] fix new style and add some tests 925203b [Daoyuan Wang] address comments 07ce92f [Daoyuan Wang] fix ArrayIndexOutOfBound 42fca0e [Daoyuan Wang] code clean e3ec096 [Daoyuan Wang] fix comment style.. 2edd235 [Daoyuan Wang] fix outputpartitioning 57baa40 [Daoyuan Wang] fix sort eval bug 303b6da [Daoyuan Wang] fix several errors 95db7ad [Daoyuan Wang] fix brackets for if-statement 4464f16 [Daoyuan Wang] fix error 880d8e9 [Daoyuan Wang] sort merge join for spark sql
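A sketch of opting in to the new join strategy (config key from the commit message; the default was `false` when this merged):

```scala
import org.apache.spark.sql.SQLContext

val sqlContext: SQLContext = ???  // an existing SQLContext
// Prefer sort-merge join over shuffled hash join for equijoins:
sqlContext.setConf("spark.sql.planner.sortMergeJoin", "true")
```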